Awesome

Lightship 🚢

(Please read Best practices section.)

Abstracts readiness, liveness and startup checks and graceful shutdown of Node.js services running in Kubernetes.

{"gitdown": "contents"}

Behaviour

Creates a HTTP service used to check container probes.

Refer to the following Kubernetes documentation for information about the readiness and liveness checks:

Local-mode

If Lightship detects that it is running in a non-Kubernetes environment (e.g. your local machine) then...

It starts the HTTP service on any available HTTP port. This is done to avoid port collision when multiple services using Lightship are being developed on the same machine.
shutdownDelay defaults to 0, i.e. immediately proceed to execute the shutdown routine.

Detection of the local-mode can be overridden by setting {detectKubernetes: false}.

`/health`

/health endpoint describes the current state of a Node.js service.

The endpoint responds:

200 status code, message "SERVER_IS_READY" when server is accepting new connections.
500 status code, message "SERVER_IS_NOT_READY" when server is initialising.
500 status code, message "SERVER_IS_SHUTTING_DOWN" when server is shutting down.

Used for human inspection.

`/live`

The endpoint responds:

200 status code, message "SERVER_IS_NOT_SHUTTING_DOWN".
500 status code, message "SERVER_IS_SHUTTING_DOWN".

Used to configure liveness probe.

`/ready`

The endpoint responds:

200 status code, message "SERVER_IS_READY".
500 status code, message "SERVER_IS_NOT_READY".

Used to configure readiness probe.

Timeouts

Lightship has two timeout configurations: gracefulShutdownTimeout and shutdownHandlerTimeout.

gracefulShutdownTimeout (default: 30 seconds) is a number of milliseconds Lightship waits for Node.js process to exit gracefully after it receives a shutdown signal (either via process or by calling lightship.shutdown()) before killing the process using process.exit(1). This timeout should be sufficiently big to allow Node.js process to complete tasks (if any) that are active at the time that the shutdown signal is received (e.g. complete serving responses to all HTTP requests) (Note: You must explicitly inform Lightship about active tasks using beacons).

shutdownHandlerTimeout (default: 5 seconds) is a number of milliseconds Lightship waits for shutdown handlers (see registerShutdownHandler) to complete before killing the process using process.exit(1).

If after all beacons are dead and all shutdown handlers are resolved Node.js process does not exit gracefully, then Lightship will force terminate the process with an error. Refer to How to detect what is holding the Node.js process alive?.

Usage

Use createLightship to create an instance of Lightship.

import {
  createLightship
} from 'lightship';

const configuration: ConfigurationInput = {};

const lightship: Lightship = await createLightship(configuration);

The following types describe the configuration shape and the resulting Lightship instance interface.

/**
 * A teardown function called when shutdown is initialized.
 */
type ShutdownHandler = () => Promise<void> | void;

/**
 * @property detectKubernetes Run Lightship in local mode when Kubernetes is not detected. Default: true.
 * @property gracefulShutdownTimeout A number of milliseconds before forcefull termination if process does not gracefully exit. The timer starts when `lightship.shutdown()` is called. This includes the time allowed to live beacons. Default: 30000.
 * @property port The port on which the Lightship service listens. This port must be different than your main service port, if any. The default port is 9000. The default can be overwritten using LIGHTSHIP_PORT environment variable.
 * @property shutdownDelay Delays the shutdown handler by X milliseconds. This value should match `readinessProbe.periodSeconds`. Default 5000.
 * @property shutdownHandlerTimeout A number of milliseconds before forcefull termination if shutdown handlers do not complete. The timer starts when the first shutdown handler is called. Default: 5000.
 * @property signals An a array of [signal events]{@link https://nodejs.org/api/process.html#process_signal_events}. Default: [SIGTERM].
 * @property terminate Method used to terminate Node.js process. Default: `() => { process.exit(1) };`.
 */
export type ConfigurationInput = {|
  +detectKubernetes?: boolean,
  +gracefulShutdownTimeout?: number,
  +port?: number,
  +shutdownDelay?: number,
  +shutdownHandlerTimeout?: number,
  +signals?: $ReadOnlyArray<string>,
  +terminate?: () => void,
|};

/**
 * @property queueBlockingTask Forces service state to SERVER_IS_NOT_READY until all promises are resolved.
 * @property registerShutdownHandler Registers teardown functions that are called when shutdown is initialized. All registered shutdown handlers are executed in the order they have been registered. After all shutdown handlers have been executed, Lightship asks `process.exit()` to terminate the process synchronously.
 * @property shutdown Changes server state to SERVER_IS_SHUTTING_DOWN and initialises the shutdown of the application.
 * @property signalNotReady Changes server state to SERVER_IS_NOT_READY.
 * @property signalReady Changes server state to SERVER_IS_READY.
 * @property whenFirstReady Resolves the first time the service goes from `SERVER_IS_NOT_READY` to `SERVER_IS_READY` state.
 */
type Lightship = {|
  +createBeacon: (context?: BeaconContext) => BeaconController,
  +isServerReady: () => boolean,
  +isServerShuttingDown: () => boolean,
  +queueBlockingTask: (blockingTask: Promise<any>) => void,
  +registerShutdownHandler: (shutdownHandler: ShutdownHandler) => void,
  +server: http$Server,
  +shutdown: () => Promise<void>,
  +signalNotReady: () => void,
  +signalReady: () => void,
  +whenFirstReady: () => Promise<void>,
|};

Kubernetes container probe configuration

This is an example of a reasonable container probe configuration to use with Lightship.

readinessProbe:
  httpGet:
    path: /ready
    port: 9000
  failureThreshold: 1
  initialDelaySeconds: 5
  periodSeconds: 5
  successThreshold: 1
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /live
    port: 9000
  failureThreshold: 3
  initialDelaySeconds: 10
  # Allow sufficient amount of time (90 seconds = periodSeconds * failureThreshold)
  # for the registered shutdown handlers to run to completion.
  periodSeconds: 30
  successThreshold: 1
  # Setting a very low timeout value (e.g. 1 second) can cause false-positive
  # checks and service interruption.
  timeoutSeconds: 5

# As per Kubernetes documentation (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#when-should-you-use-a-startup-probe),
# startup probe should point to the same endpoint as the liveness probe.
#
# Startup probe is only needed when container is taking longer to start than
# `initialDelaySeconds + failureThreshold × periodSeconds` of the liveness probe.
startupProbe:
  httpGet:
    path: /live
    port: 9000
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 30
  successThreshold: 1
  timeoutSeconds: 5

Logging

lightship is using Roarr to implement logging.

Set ROARR_LOG=true environment variable to enable logging.

Queueing service blocking tasks

Your service may not be ready until some asynchronous operation is complete, e.g. waiting for webpack-dev-middleware#waitUntilValid. In this case, use queueBlockingTask to queue blocking tasks. This way, Lightship status will be set to SERVER_IS_NOT_READY until all blocking tasks are resolved (and signalReady has been called).

import express from 'express';
import {
  createLightship
} from 'lightship';

const lightship = await createLightship();

lightship.queueBlockingTask(new Promise((resolve) => {
  setTimeout(() => {
    // Lightship service status will be `SERVER_IS_NOT_READY` until all promises
    // submitted to `queueBlockingTask` are resolved.
    resolve();
  }, 1000);
}));

const app = express();

app.get('/', (req, res) => {
  res.send('Hello, World!');
});

const server = app.listen(8080, () => {
  // All signals will be queued until after all blocking tasks are resolved.
  lightship.signalReady();
});

Waiting for the server to become ready

whenFirstReady can be used to wait until the first time the service becomes ready.

The promise returned by whenFirstReady is resolved only once. Use this function to delay execution of tasks that depend on the server to be ready.

import express from 'express';
import {
  createLightship
} from 'lightship';

const lightship = await createLightship();

const app = express();

app.get('/', (req, res) => {
  res.send('Hello, World!');
});

const server = app.listen(8080, () => {
  lightship.signalReady();
});

(async () => {
  // `whenFirstReady` returns a promise that is resolved the first time that
  // the service goes from `SERVER_IS_NOT_READY` to `SERVER_IS_READY` state.
  await lightship.whenFirstReady();

  await runIntegrationTests();
})();

Usage examples

Using with Express.js

Suppose that you have Express.js application that simply respond "Hello, World!".

import express from 'express';

const app = express();

app.get('/', (req, res) => {
  res.send('Hello, World!');
});

app.listen(8080);

To create a liveness and readiness check, simply create an instance of Lightship and use registerShutdownHandler to register a server shutdown handler, e.g.

import express from 'express';
import {
  createLightship
} from 'lightship';

const app = express();

app.get('/', (req, res) => {
  res.send('Hello, World!');
});

const server = app
  .listen(8080, () => {
    // Lightship default state is "SERVER_IS_NOT_READY". Therefore, you must signal
    // that the server is now ready to accept connections.
    lightship.signalReady();
  })
  .on('error', () => {
    lightship.shutdown();
  });;

const lightship = await createLightship();

lightship.registerShutdownHandler(() => {
  server.close();
});

Suppose that a requirement has been added that you need to ensure that you do not say "Hello, World!" more often than 100 times per minute.

Use signalNotReady method to change server state to "SERVER_IS_NOT_READY" and use signalReady to revert the server state to "SERVER_IS_READY".

import express from 'express';
import {
  createLightship
} from 'lightship';

const app = express();

const minute = 60 * 1000;

let runningTotal = 0;

app.get('/', (req, res) => {
  runningTotal++;

  setTimeout(() => {
    runningTotal--;

    if (runningTotal < 100) {
      lightship.signalReady();
    } else {
      lightship.signalNotReady();
    }
  }, minute);

  res.send('Hello, World!');
});

const server = app.listen(8080);

const lightship = await createLightship();

lightship.registerShutdownHandler(() => {
  server.close();
});

// Lightship default state is "SERVER_IS_NOT_READY". Therefore, you must signal
// that the server is now ready to accept connections.
lightship.signalReady();

How quick Kubernetes observes that the server state has changed depends on the probe configuration, specifically periodSeconds, successThreshold and failureThreshold, i.e. expect requests to continue coming through for a while after the server state has changed.

Suppose that a requirement has been added that the server must shutdown after saying "Hello, World!" 1000 times.

Use shutdown method to change server state to "SERVER_IS_SHUTTING_DOWN", e.g.

import express from 'express';
import delay from 'delay';
import {
  createLightship
} from 'lightship';

const app = express();

const minute = 60 * 1000;

let total = 0;
let runningTotal = 0;

app.get('/', (req, res) => {
  total++;
  runningTotal++;

  if (total === 1000) {
    lightship.shutdown();
  }

  setTimeout(() => {
    runningTotal--;

    if (runningTotal < 100) {
      lightship.signalReady();
    } else {
      lightship.signalNotReady();
    }
  }, minute);

  res.send('Hello, World!');
});

const server = app.listen(8080);

const lightship = await createLightship();

lightship.registerShutdownHandler(async () => {
  // Allow sufficient amount of time to allow all of the existing
  // HTTP requests to finish before terminating the service.
  await delay(minute);

  server.close();
});

// Lightship default state is "SERVER_IS_NOT_READY". Therefore, you must signal
// that the server is now ready to accept connections.
lightship.signalReady();

Do not call process.exit() in a shutdown handler – Lighthouse calls process.exit() after all registered shutdown handlers have run to completion.

If for whatever reason a registered shutdown handler hangs, then (subject to the Pod's restart policy) Kubernetes will forcefully restart the Container after the livenessProbe deems the service to be failed.

Beacons

Beacons are used to delay the registered shutdown handler routine.

A beacon can be created using createBeacon() method, e.g.

const lightship = await createLightship();

const beacon = lightship.createBeacon();

Beacon is live upon creation. Shutdown handlers are suspended until there are no live beacons.

To signal that a beacon is dead, use die() method:

beacon.die();
// This beacon is now dead.

After beacon has been killed, it cannot be revived again.

Use beacons to suspend the registered shutdown handler routine when you are processing a job queue, e.g.

for (const job of jobs) {
  if (lightship.isServerShuttingDown()) {
    log.info('detected that the service is shutting down; terminating the event loop');

    break;
  }

  const beacon = lightship.createBeacon();

  // Do the job.

  await beacon.die();
}

Additionally, you can provide beacons with context, e.g.

for (const job of jobs) {
  if (lightship.isServerShuttingDown()) {
    log.info('detected that the service is shutting down; terminating the event loop');

    break;
  }

  const beacon = lightship.createBeacon({
    jobId: job.id
  });

  // Do the job.

  await beacon.die();
}

The logs will include messages describing the beacons that are holding the connection, e.g.

{"context":{"package":"lightship","namespace":"factories/createLightship","logLevel":30,"beacons":[{"context":{"id":1}}]},"message":"program termination is on hold because there are live beacons","sequence":2,"time":1563892493825,"version":"1.0.0"}

Best practices

Add a delay before stop handling incoming requests

It is important that you do not cease to handle new incoming requests immediatelly after receiving the shutdown signal. This is because there is a high probability of the SIGTERM signal being sent well before the iptables rules are updated on all nodes. The result is that the pod may still receive client requests after it has received the termination signal. If the app stops accepting connections immediately, it causes clients to receive "connection refused" types of errors.

Properly shutting down an application includes these steps:

Wait for a few seconds, then stop accepting new connections,
Close all keep-alive connections that aren't in the middle of a request,
Wait for all active requests to finish, and then
Shut down completely.

See Handling Client Requests Properly with Kubernetes for more information.

FAQ

What is the reason that my liveness/ readiness endpoints are intermittently failing?

You may discover that your service health checks are failing intermittently, e.g.

Warning  Unhealthy  4m17s (x3 over 4m27s)   kubelet, f95a4d94-jwfr  Liveness probe failed: Get http://10.24.7.155:9000/live: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  3m28s (x15 over 4m38s)  kubelet, f95a4d94-jwfr  Readiness probe failed: Get http://10.24.7.155:9000/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

This may happen if you are perfoming event-loop blocking tasks for extended durations, e.g.

const startTime = Date.now();

let index0 = 1000;

while (index0--) {
  let index1 = 1000;

  while (index1--) {
    console.log(index0 + ':' + index1);
  }
}

console.log(Date.now() - startTime);

If executed, the above operation would block the event-loop for couple of seconds (e.g. 8 seconds on my machine). During this time Lightship is going to be unresponsive.

Your options are:

Use worker_threads to execute the event-loop blocking task in the background.
Refactor the code into synchronous chunks.

What is the reason for having separate `/live` and `/ready` endpoints?

Distinct endpoints are needed if you want your Container to be able to take itself down for maintenance (as done in the Using with Express.js usage example). Otherwise, you can use /health.

How to detect what is holding the Node.js process alive?

You may get a log message saying that your process did not exit on its own, e.g.

[2019-11-10T21:11:45.452Z] DEBUG (20) (@lightship) (#factories/createLightship): all shutdown handlers have run to completion; proceeding to terminate the Node.js process
[2019-11-10T21:11:46.455Z] WARN (40) (@lightship) (#factories/createLightship): process did not exit on its own; investigate what is keeping the event loop active

This means that there is some work that is scheduled to happen (e.g. a referenced setTimeout).

In order to understand what is keeping your Node.js process from exiting on its own, you need to identify all active handles and requests. This can be done with a help of utilities such as wtfnode and why-is-node-running, e.g.

import whyIsNodeRunning from 'why-is-node-running';
import express from 'express';
import {
  createLightship
} from 'lightship';

const app = express();

app.get('/', (req, res) => {
  res.send('Hello, World!');
});

const server = app.listen(8080);

const lightship = await createLightship();

lightship.registerShutdownHandler(() => {
  server.close();

  whyIsNodeRunning();
});

lightship.signalReady();

In the above example, calling whyIsNodeRunning will print a list of all active handles that are keeping the process alive.

Related projects

Iapetus – Prometheus metrics server.
Preoom – Retrieves & observes Kubernetes Pod resource (CPU, memory) utilisation.

Awesome

Lightship 🚢

Behaviour

Local-mode

/health

/live

/ready

Timeouts

Usage

Kubernetes container probe configuration

Logging

Queueing service blocking tasks

Waiting for the server to become ready

Usage examples

Using with Express.js

Beacons

Best practices

Add a delay before stop handling incoming requests

FAQ

What is the reason that my liveness/ readiness endpoints are intermittently failing?

What is the reason for having separate /live and /ready endpoints?

How to detect what is holding the Node.js process alive?

Related projects

`/health`

`/live`

`/ready`

What is the reason for having separate `/live` and `/ready` endpoints?