Health checks and graceful shutdown

Health checks and graceful shutdown protect availability during deploys, scaling, dependency failures, and node maintenance. They work only when they describe the real state of the process.

Health checks have different jobs

A single health endpoint cannot answer every operational question. Readiness, liveness, and startup checks exist for different reasons.

A readiness check says whether the process should receive traffic. It should fail when the instance cannot serve useful work, even if the process is still running.

A liveness check says whether the process is stuck and should be restarted. It should fail only when restart is the right recovery action.

A startup check gives slow-starting applications time to initialise before liveness checks begin. It prevents a valid slow start from being treated as a dead process.

Mixing these meanings causes outages. A liveness check that depends on a slow database can restart every instance during a database incident. A readiness check that always returns success can send traffic to a process that cannot serve it.
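The separation can be sketched in a few lines. This is a minimal illustration, not a prescribed design: the class ProbeState, its three flags, and the status_code helper are hypothetical names, and a real service would expose the three results through whatever probe mechanism its platform provides.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Hypothetical in-process state backing three separate probes."""

    started: bool = False  # initialisation has finished
    serving: bool = False  # the instance can do useful work right now
    wedged: bool = False   # the process cannot recover without a restart

    def startup(self) -> bool:
        return self.started

    def readiness(self) -> bool:
        # Not ready while starting, while unable to serve, or when wedged.
        return self.started and self.serving and not self.wedged

    def liveness(self) -> bool:
        # Deliberately ignores dependencies: only a wedged process fails.
        return not self.wedged


def status_code(ok: bool) -> int:
    """Map a probe result to an HTTP status (an assumed convention)."""
    return 200 if ok else 503
```

With this shape, a database incident would clear serving, so readiness fails and traffic moves away, while liveness keeps passing and no instance is restarted.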

Design readiness around serving traffic

Readiness should reflect whether the instance can accept the traffic it is about to receive.

A useful readiness check may include:

  • required configuration loaded
  • critical local resources initialised
  • worker pool ready
  • queue consumer ready when the instance is a consumer
  • dependency state when the dependency is required for every request
  • overload state when the instance is intentionally shedding traffic

Keep readiness fast and bounded. It should not perform expensive deep checks on every probe. If a dependency is optional or has a working fallback, do not fail readiness only because that dependency is unavailable.
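One way to keep probes cheap is to cache the result of each required dependency check for a short time. The sketch below assumes hypothetical names (CachedCheck, is_ready) and assumes each check callable enforces its own timeout; optional dependencies are simply left out of the required list.

```python
import time
from typing import Callable, Optional, Sequence


class CachedCheck:
    """Caches a dependency check so that frequent probes stay cheap."""

    def __init__(
        self,
        check: Callable[[], bool],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check  # assumed to apply its own timeout
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_result: Optional[bool] = None
        self._last_time = 0.0

    def ok(self) -> bool:
        now = self._clock()
        if self._last_result is None or now - self._last_time >= self._ttl:
            try:
                self._last_result = bool(self._check())
            except Exception:
                # A check that raises counts as a failed dependency.
                self._last_result = False
            self._last_time = now
        return self._last_result


def is_ready(
    config_loaded: bool,
    required: Sequence[CachedCheck],
    shedding_load: bool,
) -> bool:
    """Ready only if configured, not shedding load, and required deps are ok."""
    if not config_loaded or shedding_load:
        return False
    return all(dep.ok() for dep in required)
```

The cache lifetime is a trade-off: a longer value reduces load on the dependency but delays the moment readiness reflects a failure.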

Keep liveness narrow

Liveness should detect a process that cannot recover without restart. Examples include a deadlocked event loop, a failed main worker, or an internal state that prevents all future work.

Do not use liveness as a general dependency test. Restarting healthy application processes will not repair a failed database, a broken network route, or an upstream outage. It can make the incident worse by adding restart storms and cold starts.

When in doubt, prefer readiness failure over liveness failure. Removing an instance from traffic is usually safer than restarting it.
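A narrow liveness signal can be built from a heartbeat that the main loop updates as it makes progress. This is a sketch under assumptions: the Heartbeat class and the max_silence_seconds threshold are hypothetical, and the right threshold depends on how long one loop iteration can legitimately take.

```python
import time
from typing import Callable


class Heartbeat:
    """Liveness based only on whether the main loop is still making progress."""

    def __init__(
        self,
        max_silence_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_silence = max_silence_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        # Called by the main loop or main worker on every iteration.
        self._last_beat = self._clock()

    def alive(self) -> bool:
        # No dependency calls here: a slow database never fails liveness.
        return self._clock() - self._last_beat <= self._max_silence
```

The liveness probe would return the value of alive(), so it fails only when the loop itself has stopped, which is a case where a restart is a plausible recovery action.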

Use startup checks for slow initialisation

Some services need time to load caches, run migrations, warm interpreters, create connections, or build local indexes. A startup check lets the platform distinguish slow initialisation from a failed process.

After startup succeeds, liveness can begin. This is safer than setting a large liveness delay that hides real failures later in the process lifetime.
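A startup check can be backed by a simple record of which initialisation steps are still outstanding. The StartupTracker class and the step names in the usage note are hypothetical; the sketch only shows the idea of reporting "started" once every step has completed.

```python
from typing import Iterable, List


class StartupTracker:
    """Tracks initialisation steps; the startup check passes when none remain."""

    def __init__(self, steps: Iterable[str]) -> None:
        self._pending = set(steps)

    def mark_done(self, step: str) -> None:
        # Marking an unknown or already finished step is a no-op.
        self._pending.discard(step)

    def started(self) -> bool:
        return not self._pending

    def pending(self) -> List[str]:
        # Useful in the probe response body when diagnosing a slow start.
        return sorted(self._pending)
```

A service might create the tracker with steps such as "load_cache" and "open_connections", call mark_done as each finishes, and answer the startup probe with started().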

Graceful shutdown starts before termination

Graceful shutdown is a sequence, not a signal handler alone. The service should stop accepting new work, let existing work finish within a bounded time, release resources, flush telemetry, and exit.

A common sequence is:

  • receive the termination signal
  • mark readiness as false
  • stop accepting new requests or messages
  • drain in-flight work up to a deadline
  • close listeners and consumers
  • flush logs, metrics, and traces
  • close database and network connections
  • exit with a clear status

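The deadline matters. A process that does not exit on its own will eventually be killed by the platform, which cuts off whatever work is still in flight. Set the drain deadline shorter than the platform's termination grace period so that cleanup can still run.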
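The sequence can be sketched as a small skeleton. Everything here is illustrative: the Service class, its fields, and the log entries are hypothetical placeholders for real listeners, consumers, and exporters, and the choice of exit status 1 for an incomplete drain is an assumption.

```python
import signal
import threading
import time
from typing import List


class Service:
    """Hypothetical service skeleton; real hooks replace the log entries."""

    def __init__(self, drain_deadline_seconds: float) -> None:
        self.ready = True
        self.accepting = True
        self.in_flight = 0  # decremented by request handlers as work finishes
        self.log: List[str] = []
        self.stop_requested = threading.Event()
        self._deadline = drain_deadline_seconds

    def on_signal(self, signum, frame) -> None:
        # Keep the handler tiny: record the request and return.
        self.stop_requested.set()

    def _drain(self) -> bool:
        end = time.monotonic() + self._deadline
        while self.in_flight > 0 and time.monotonic() < end:
            time.sleep(0.01)
        return self.in_flight == 0

    def shutdown(self) -> int:
        self.ready = False
        self.log.append("readiness false")
        self.accepting = False
        self.log.append("stopped accepting")
        drained = self._drain()
        self.log.append("drained" if drained else "drain deadline hit")
        self.log.append("closed listeners and consumers")
        self.log.append("flushed telemetry")
        self.log.append("closed connections")
        return 0 if drained else 1


def run(service: Service) -> int:
    """Wait for SIGTERM, then shut down. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, service.on_signal)
    service.stop_requested.wait()
    return service.shutdown()
```

A program entry point could pass the result of run(service) to sys.exit so that the exit status shows whether the drain completed.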

Account for load balancers and endpoints

Traffic may continue briefly after readiness changes. Endpoint updates, proxies, clients, and load balancers do not all react instantly.

Design shutdown to tolerate that delay. Stop advertising readiness before closing the listener. Keep serving existing connections during the drain period. Return a clear failure for new work only after the instance has been removed from normal routing, or when the shutdown deadline requires it.
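The timing can be worked out as a simple budget. The sketch below assumes the platform gives the process a fixed grace period before a hard kill; ShutdownPlan, plan_shutdown, and the numbers in the usage note are hypothetical and not taken from any particular platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShutdownPlan:
    keep_serving_s: float  # readiness is false, but late traffic is still served
    drain_s: float         # new work is refused, in-flight work finishes
    cleanup_s: float       # flush telemetry and close connections


def plan_shutdown(
    grace_period_s: float,
    propagation_delay_s: float,
    cleanup_s: float,
) -> ShutdownPlan:
    """Split the grace period into phases, leaving the remainder for draining."""
    drain_s = grace_period_s - propagation_delay_s - cleanup_s
    if drain_s <= 0:
        raise ValueError(
            "grace period too short for propagation delay and cleanup"
        )
    return ShutdownPlan(propagation_delay_s, drain_s, cleanup_s)
```

For example, with an assumed 30 second grace period, 5 seconds for routing to catch up, and 3 seconds reserved for cleanup, plan_shutdown(30.0, 5.0, 3.0) leaves 22 seconds for draining in-flight work.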

Handle background workers deliberately

Workers need their own shutdown path. On termination, a worker should stop taking new jobs, finish or safely abandon the current job, and make the job visible for retry when required.

The correct behaviour depends on the job system. Some jobs are idempotent and can be retried safely. Others need explicit leases, checkpoints, or compensation. The shutdown path should match the delivery and retry semantics of the queue.
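A worker loop with an explicit stop flag might look like the sketch below. It uses the standard library queue as a stand-in for a real job system and assumes jobs are idempotent, so putting an abandoned job back is safe; with a real queue the equivalent step would be a negative acknowledgement or a lease release. Worker, Abandoned, and the handler signature are hypothetical.

```python
import queue
import threading
from typing import Callable, List


class Abandoned(Exception):
    """Raised by a handler that stops early because shutdown was requested."""


class Worker:
    def __init__(self, jobs: "queue.Queue[str]") -> None:
        self.jobs = jobs
        self.stopping = threading.Event()  # set by the termination handler
        self.done: List[str] = []

    def run(self, handle: Callable[[str, threading.Event], None]) -> None:
        # Stop taking new jobs as soon as shutdown has been requested.
        while not self.stopping.is_set():
            try:
                job = self.jobs.get(timeout=0.05)
            except queue.Empty:
                continue
            try:
                # The handler may check the event at safe points.
                handle(job, self.stopping)
                self.done.append(job)
            except Abandoned:
                # Make the job visible again so another worker can retry it.
                self.jobs.put(job)
```

Jobs that are not idempotent would need more than a plain requeue, for example a checkpoint written before the handler raises Abandoned.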

Test shutdown during deploys

A graceful shutdown path that has never been tested is likely to fail the first time a deploy or node maintenance happens under load.

Test these cases:

  • termination while serving a long request
  • termination while holding a queue job
  • termination during dependency slowness
  • termination while telemetry export is slow
  • repeated rolling deploys under normal traffic
  • readiness failure without process restart

Measure dropped requests, duplicated work, shutdown duration, and time to remove the instance from traffic.
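Dropped and duplicated work can be tallied by comparing what a test client sent with what the service recorded as completed. The function below is a minimal sketch: summarise and the idea of tagging each request or job with a unique identifier are assumptions about how the test is instrumented.

```python
from collections import Counter
from typing import Dict, Iterable, List


def summarise(
    sent_ids: Iterable[str],
    completed_ids: Iterable[str],
) -> Dict[str, List[str]]:
    """Compare sent work with completed work from one test run."""
    counts = Counter(completed_ids)
    sent = list(sent_ids)
    return {
        # Sent but never completed: lost during shutdown or routing changes.
        "dropped": [i for i in sent if counts[i] == 0],
        # Completed more than once: retried without deduplication.
        "duplicated": [i for i in sent if counts[i] > 1],
    }
```

Running this after each rolling deploy in a test environment gives a concrete number to track instead of a pass or fail impression.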

Common mistakes

Do not make every probe call the database. That turns a database incident into a platform restart incident.

Do not return ready before the service can serve real traffic. That creates errors during deploys and scaling.

Do not ignore termination signals. The default signal behaviour may terminate the process immediately, without draining work.

Do not make shutdown unbounded. Platforms eventually send a hard kill.

Do not assume one health endpoint is enough. Separate readiness, liveness, and startup semantics.

Conclusion

Health checks should tell the platform whether to route traffic, wait for startup, or restart a broken process. Graceful shutdown should drain work before exit. Keep the checks narrow, fast, and truthful, then test them under the same conditions that happen during deploys and failures.