Queues and background jobs: the basics

Queues move work out of the synchronous request path and into worker processes. They are useful when work is slow, bursty, retryable, or not required before the user receives a response. They are not a shortcut to reliability. A queue is another distributed component, with its own delivery semantics, backpressure, monitoring, and failure handling.

Core terms

A producer creates a message. A broker stores or routes the message. A queue holds messages until workers can process them. A consumer or worker receives a message and performs the task. An acknowledgement tells the broker that the message has been handled.

The unit of work should be small enough to retry safely and large enough to be meaningful. A job that does too much is hard to recover when it fails partway through. A job that does too little creates coordination overhead and excessive queue traffic.

Why use a queue

A queue is useful when the caller does not need the result immediately. Common examples include sending notifications, generating reports, importing data, resizing images, synchronising with another system, and running maintenance tasks.

Queues also smooth bursts. A web tier might receive a spike of requests, enqueue work quickly, and let workers drain the backlog at a controlled rate. This protects dependencies from sudden load, but only if the queue length, worker concurrency, and downstream capacity are managed.

Delivery is not exactly once

Most practical job systems should be treated as at-least-once delivery. A job can run more than once. A worker can crash after the side effect but before acknowledgement. A timeout can make the broker deliver the same message again. A deployment can interrupt a worker.

Make jobs idempotent so that running the same message twice does not corrupt state. Lean on stable business identifiers, database uniqueness constraints, and state checks before acting. Do not rely on the queue to prevent all duplicates. The detailed contract for idempotency keys on external calls is its own subject and is covered separately.
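As a minimal sketch of the uniqueness-constraint approach, the example below records each handled business identifier in a table with a primary key, so a redelivered message is detected and skipped. The table name `sent_receipts`, the function `send_receipt`, and the `outbox` list standing in for the real side effect are all hypothetical, and SQLite is used only to keep the example self-contained.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a uniqueness constraint on the order id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sent_receipts (order_id TEXT PRIMARY KEY)")
    return conn


def send_receipt(conn: sqlite3.Connection, order_id: str, outbox: list) -> bool:
    """Perform the side effect at most once per order_id.

    Returns True if this call did the work, False if it was a duplicate.
    """
    try:
        # "with conn" commits on success and rolls back if anything raises,
        # so a failed side effect does not leave a "done" marker behind.
        with conn:
            conn.execute(
                "INSERT INTO sent_receipts (order_id) VALUES (?)", (order_id,)
            )
            outbox.append(order_id)  # stand-in for the real side effect
    except sqlite3.IntegrityError:
        # The primary key rejected the insert: this message was already handled.
        return False
    return True
```

Calling `send_receipt` twice with the same identifier performs the side effect once. When the side effect is an external call rather than a row in the same database, a crash between the call and the commit can still produce a duplicate, so this narrows the window rather than closing it.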

Acknowledgements

Acknowledgement timing changes failure behaviour. If a worker acknowledges before doing the work, a crash can lose the job. If it acknowledges after doing the work, a crash can cause redelivery and duplicate execution. Late acknowledgement is safer only when the job is idempotent.
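The two failure modes can be illustrated with a toy in-memory broker. This is a simulation under simplifying assumptions, not the behaviour of any particular product: `ToyBroker`, `run_once`, and the `crash_at` values are hypothetical names, and "crash" here means the worker's connection is lost so that unacknowledged messages are requeued.

```python
from collections import deque


class Crash(Exception):
    """Simulates the worker dying partway through a job."""


class ToyBroker:
    def __init__(self, messages):
        self.ready = deque(messages)
        self.unacked = []

    def receive(self):
        msg = self.ready.popleft()
        self.unacked.append(msg)
        return msg

    def ack(self, msg):
        self.unacked.remove(msg)

    def requeue_unacked(self):
        # What a broker typically does when a consumer disappears.
        self.ready.extend(self.unacked)
        self.unacked.clear()


def run_once(broker, handled, ack_early, crash_at=None):
    """Process one message, optionally crashing before or after the work."""
    msg = broker.receive()
    try:
        if ack_early:
            broker.ack(msg)
        if crash_at == "before_work":
            raise Crash
        handled.append(msg)  # the side effect
        if crash_at == "after_work":
            raise Crash
        if not ack_early:
            broker.ack(msg)
    except Crash:
        broker.requeue_unacked()
```

With `ack_early=True` and a crash before the work, the message is gone and nothing was handled. With `ack_early=False` and a crash after the work, the message returns to the queue and the next run handles it a second time.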

Workers should handle shutdown deliberately. On termination, a worker should stop accepting new work, finish or abandon current work according to the broker contract, and leave the system in a state where incomplete work can be retried.
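One way to sketch deliberate shutdown is a stop flag that the worker checks between jobs. The example uses the standard library `queue.Queue` as a stand-in for a broker; `worker_loop` and `install_signal_handlers` are hypothetical names, and a real broker client would have its own way to stop consuming and requeue unfinished work.

```python
import queue
import signal
import threading


def install_signal_handlers(stop: threading.Event) -> None:
    """Turn SIGTERM and SIGINT into a stop request (call from the main thread)."""

    def request_stop(signum, frame):
        stop.set()

    signal.signal(signal.SIGTERM, request_stop)
    signal.signal(signal.SIGINT, request_stop)


def worker_loop(jobs: queue.Queue, stop: threading.Event, done: list) -> None:
    """Take jobs until a stop is requested.

    The flag is only checked between jobs, so a job that has started is
    finished, and jobs not yet taken stay in the queue for a later worker.
    """
    while not stop.is_set():
        try:
            # A short timeout keeps the loop responsive to the stop flag.
            job = jobs.get(timeout=0.05)
        except queue.Empty:
            continue
        done.append(job)  # stand-in for the real work
        jobs.task_done()
```

A process would typically call `install_signal_handlers(stop)` once at startup and then run `worker_loop`. Finishing the current job is one policy; abandoning it and relying on redelivery is another, and which is appropriate depends on the broker contract and the termination grace period.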

Durability

Durability needs both broker configuration and message publishing behaviour. A durable queue alone is not enough if messages are published in a non-durable way. A persistent message alone is not enough if the queue that holds it is not durable and is discarded when the broker restarts.

Durability also has cost. Persisting messages, confirming publishes, and replicating queues increase latency and resource use. Choose durability based on business impact. A cache refresh job and a payment capture job do not need the same guarantees.

Concurrency and ordering

Increasing workers improves throughput only until another constraint becomes the bottleneck. The database, a rate-limited API, a filesystem, or the broker itself can become the limiting dependency.

Ordering is fragile under concurrency. If strict ordering matters, design for it explicitly. That may mean partitioning by key, using a single worker for a stream, or storing state transitions in a database and rejecting invalid transitions. Do not assume global ordering in a general-purpose queue.
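Two of these options can be sketched briefly. The first function maps a key to a stable partition so that all messages for the same key go to the same worker; the second rejects state changes that arrive out of order. The names `partition_for`, `ALLOWED_TRANSITIONS`, and `apply_transition`, and the order lifecycle itself, are hypothetical and chosen only for illustration.

```python
import hashlib


def partition_for(key: str, partitions: int) -> int:
    """Map a key to a partition that is stable across processes and restarts."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


# Hypothetical lifecycle: each state lists the states it may move to.
ALLOWED_TRANSITIONS = {
    "pending": {"paid", "cancelled"},
    "paid": {"shipped", "refunded"},
    "shipped": set(),
    "cancelled": set(),
    "refunded": set(),
}


class InvalidTransition(Exception):
    pass


def apply_transition(current: str, requested: str) -> str:
    """Return the new state, or raise if the message arrived out of order."""
    if requested not in ALLOWED_TRANSITIONS.get(current, set()):
        raise InvalidTransition(f"{current} -> {requested}")
    return requested
```

A cryptographic digest is used instead of Python's built-in `hash()` because string hashes are randomised per process by default, so two producers could otherwise disagree about where a key belongs. In a real system the transition check would run inside the same database transaction that updates the row.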

Backpressure

A growing queue is a signal that producers are adding work faster than workers can complete it. The right response might be to add workers, reduce the producer rate, shed low-priority work, pause a feature, or fix a slow dependency. Adding workers blindly can make an already struggling dependency fail faster.
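A bounded queue is one simple way to make backpressure explicit at the producer. The sketch below uses the standard library `queue.Queue` with a `maxsize` as a stand-in for a broker limit; `try_enqueue` and its return values are hypothetical.

```python
import queue


def try_enqueue(jobs: queue.Queue, job, low_priority: bool) -> str:
    """Enqueue without blocking, and tell the caller what happened.

    When the queue is full, low-priority work is dropped and other work is
    rejected so the caller can slow down, retry later, or report an error.
    """
    try:
        jobs.put_nowait(job)
        return "enqueued"
    except queue.Full:
        return "shed" if low_priority else "rejected"
```

With `queue.Queue(maxsize=2)`, the third call returns `"shed"` or `"rejected"` instead of silently growing the backlog. The point of the design is that the producer learns about overload at enqueue time rather than from a delay noticed much later.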

Track queue depth, message age, processing duration, failure rate, retry rate, dead letter count, and worker saturation. Message age is often more useful than queue length because it shows the delay a user actually experiences: a short queue of slow jobs can hide a long wait.
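Message age can be computed from an enqueue timestamp carried with each message. The following is a small sketch; the `Message` type, its `enqueued_at` field (seconds since the epoch), and `oldest_message_age` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Message:
    job_id: str
    enqueued_at: float  # seconds since the epoch, set by the producer


def oldest_message_age(messages: Iterable[Message], now: float) -> float:
    """Return how long the oldest waiting message has been queued, in seconds."""
    timestamps = [m.enqueued_at for m in messages]
    if not timestamps:
        return 0.0
    return now - min(timestamps)
```

A queue holding two messages where one has waited 300 seconds is in worse shape than a queue of fifty messages that are all a second old, which is why alerting on age tends to track user impact more closely than alerting on depth.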

Retries and dead letters

Retries should be bounded and delayed. Immediate retry can create a hot loop that repeatedly fails the same job. Use backoff for transient failures. Do not retry validation errors or permanent business rule failures: they will fail the same way until the input or the code changes.
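A bounded retry loop with exponential backoff might look like the sketch below. The exception classes `TransientError` and `PermanentError`, the function names, and the default limits are hypothetical. Sleeping inside the worker is used here only for brevity; many job systems reschedule the message with a delay instead.

```python
import time


class TransientError(Exception):
    """A failure that may succeed later, such as a timeout."""


class PermanentError(Exception):
    """A failure that will not succeed on retry, such as invalid input."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay before the next try: base, 2*base, 4*base, ... capped at cap."""
    return min(cap, base * 2 ** (attempt - 1))


def run_with_retries(job, max_attempts: int = 5, sleep=time.sleep):
    """Run job(), retrying transient failures up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except PermanentError:
            raise  # retrying cannot help, fail immediately
        except TransientError:
            if attempt == max_attempts:
                raise  # retry budget exhausted, let the caller dead-letter it
            sleep(backoff_delay(attempt))
```

The `sleep` parameter is injectable so the delays can be observed in tests without waiting. Adding random jitter to each delay is a common refinement that keeps many failed jobs from retrying at the same instant.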

After the retry limit, move the job to a dead letter queue or failure store with enough context for investigation. Operators need the job type, safe identifiers, error class, attempt count, and timestamps. They also need a documented way to replay or discard failed jobs.
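The context listed above can be captured in a small record that is serialised into the dead letter queue or failure store. This is a sketch with hypothetical field and function names; it stores the error class rather than the full error message, on the assumption that messages may contain data that should not be copied into another store.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DeadLetter:
    job_type: str
    job_id: str  # a safe identifier, not the full payload
    error_class: str
    attempts: int
    first_attempt_at: str  # ISO 8601 timestamps supplied by the caller
    failed_at: str


def to_dead_letter(job_type: str, job_id: str, exc: BaseException,
                   attempts: int, first_attempt_at: str, failed_at: str) -> str:
    """Build a JSON record with enough context to investigate and replay."""
    record = DeadLetter(
        job_type=job_type,
        job_id=job_id,
        error_class=type(exc).__name__,
        attempts=attempts,
        first_attempt_at=first_attempt_at,
        failed_at=failed_at,
    )
    return json.dumps(asdict(record), sort_keys=True)
```

Because the record carries the job type and identifier, a replay tool can rebuild and re-enqueue the original job, and a discard can be logged against the same identifier.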

Payload design

Keep job payloads small and stable. Store identifiers rather than large object graphs. A worker can load current state from the database and decide whether the job is still needed. This avoids stale serialised objects and reduces broker memory pressure.
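As an illustration, the payload below carries only a job type, an identifier, and a schema version, and the worker reloads current state before acting. The job type `send_invoice_reminder`, the field names, and the dictionary standing in for the database are all hypothetical.

```python
import json


def build_payload(invoice_id: str) -> str:
    """Serialise only what the worker needs to find the current state."""
    return json.dumps(
        {"type": "send_invoice_reminder", "invoice_id": invoice_id, "version": 1}
    )


def handle(payload: str, invoices: dict, sent: list) -> str:
    """Load current state and decide whether the job is still needed."""
    message = json.loads(payload)
    invoice = invoices.get(message["invoice_id"])  # stand-in for a database read
    if invoice is None or invoice["status"] != "unpaid":
        # The invoice was deleted or paid after the job was enqueued.
        return "skipped"
    sent.append(message["invoice_id"])  # stand-in for sending the reminder
    return "sent"
```

If the invoice is paid between enqueue and processing, the worker sees the new status and skips the reminder, which a payload containing a serialised copy of the invoice could not do.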

Avoid putting secrets or unnecessary personal data in messages. Queues often have different retention, access, and logging paths from application databases.

When not to use a queue

Do not use a queue when the caller needs a confirmed result before continuing. Do not use a queue to hide slow database queries that still need to be fixed. Do not use a queue when the business operation requires a synchronous transaction and there is no safe compensating action.

A queue changes the user experience. The system must expose pending, completed, and failed states where the user or another system needs to know what happened.

Conclusion

Queues are a basic building block for responsive and resilient systems, but they require disciplined design. Treat delivery as repeatable, make jobs idempotent, control concurrency, monitor message age, and provide a clear failure path. The queue should make work manageable, not invisible.