Logging that is actually useful

Useful logging is not about writing more lines. It is about recording the events that help an engineer understand what happened, who or what was affected, and what to do next. This post is about the craft of a good log event: how to structure it, which level to give it, what to record, and what to keep out.

Start with the question the log must answer

A useful log entry answers an operational question. It should help someone diagnose a fault, investigate a security event, explain a state transition, or prove that an expected action happened.

Before adding a log line, decide which question it answers. Good examples include:

  • Did the request enter the system?
  • Which dependency failed?
  • Was the operation retried?
  • Which customer-visible resource was affected?
  • Was access denied, and why?

Bad logs only restate that code executed. A message such as "starting process" is rarely useful unless it marks a lifecycle transition that operators act on.

Prefer structured events

Plain text is easy to write but hard to query consistently. Structured logs give every important field a stable name, which makes them easier to filter, aggregate, alert on, and join with other signals.

Use consistent fields across services. At minimum, most application events need:

  • timestamp
  • level
  • service name
  • environment
  • event name
  • request or trace identifier
  • operation name
  • outcome
  • duration when the event represents completed work
  • error type when the event represents failure

Treat the event name as an API. Keep it stable, specific, and low in cardinality. Put variable detail in fields, not in the event name.
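As a minimal sketch of the field list above, the following Python builds one event as a JSON line. The function name build_event, the field names, and the sample values are all assumptions made for illustration, not a standard schema.

```python
import json
from datetime import datetime, timezone


def build_event(event, level, service, environment, operation, outcome,
                request_id, duration_ms=None, error_type=None, **fields):
    """Return one log event as a JSON line with stable field names.

    All field names here are illustrative assumptions, not a standard.
    """
    core = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "environment": environment,
        "event": event,
        "request_id": request_id,
        "operation": operation,
        "outcome": outcome,
    }
    if duration_ms is not None:
        core["duration_ms"] = duration_ms
    if error_type is not None:
        core["error_type"] = error_type
    # Variable detail lives in extra fields; core fields win on a name clash.
    return json.dumps({**fields, **core}, sort_keys=True)


line = build_event(
    event="payment.authorisation.failed",  # stable, low-cardinality name
    level="error",
    service="checkout",
    environment="production",
    operation="authorise_payment",
    outcome="failure",
    request_id="req-123",
    duration_ms=182,
    error_type="UpstreamTimeout",
    customer_id="cus_42",  # variable detail goes in a field
)
print(line)
```

The customer identifier is a field rather than part of the event name, so a query for the event name matches every occurrence regardless of customer.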

Log outcomes, not every step

Log the boundary of meaningful work. For example, log that a request completed, that a payment authorisation failed, or that a background job was skipped because its lock was held elsewhere.

Avoid logging every internal branch. High-volume debug logs hide the events that matter, increase cost, and make incident review slower. Keep debug logs temporary or disabled by default in production.

A useful production log stream should let an engineer move from symptom to cause without reading a transcript of the whole program.
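One possible way to log the boundary of work rather than each step is a context manager that emits a single event when the unit of work ends. This is a sketch under assumptions: logged_operation, the emit callback, the logger name, and the field names are hypothetical.

```python
import json
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("orders")  # hypothetical logger name


@contextmanager
def logged_operation(operation, emit=None, **context):
    """Emit exactly one event when the wrapped unit of work ends."""
    emit = emit or (lambda level, line: logger.log(level, line))
    start = time.monotonic()
    event = {"operation": operation, **context}
    level = logging.INFO
    try:
        yield event  # the caller may attach fields such as outcome or reason
        event.setdefault("outcome", "success")
    except Exception as exc:
        event.update(outcome="failure", error_type=type(exc).__name__)
        level = logging.ERROR
        raise  # logging the failure does not swallow it
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 1)
        emit(level, json.dumps(event, sort_keys=True))


def show(level, line):
    print(logging.getLevelName(level), line)


# A background job that is skipped because its lock is held elsewhere.
with logged_operation("job.run", emit=show, job_id="job-7") as event:
    event["outcome"] = "skipped"
    event["reason"] = "lock_held_elsewhere"
```

The body of the with block can branch as much as it needs to. Only the final outcome, its duration, and any attached fields reach the log.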

Use levels consistently

Levels only help when everyone uses them the same way.

Use error when the operation failed and needs investigation or user-visible handling. Use warn when the operation completed with degraded behaviour or hit an unusual condition that could become a fault. Use info for normal lifecycle and business-significant events. Use debug for development detail that is not normally retained in production.

Do not log at error level when the code handles the problem completely and the outcome is not degraded. That creates false alarms. Do not hide failed user-visible work at info level. That makes real incidents harder to find.
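The level rules above can be written down as one small function so that every call site applies them the same way. This is only a sketch: level_for and its outcome and degraded parameters are hypothetical names, and a real service may need more outcomes than these.

```python
import logging


def level_for(outcome, degraded=False):
    """Map an operation outcome to a log level.

    `outcome` and `degraded` are hypothetical inputs for this sketch.
    """
    if outcome == "failure":
        return logging.ERROR    # failed work that needs investigation
    if degraded:
        return logging.WARNING  # completed, but through a fallback or retry
    return logging.INFO         # normal lifecycle or business event


# A handled problem with no degraded outcome stays at info.
print(logging.getLevelName(level_for("success")))                 # INFO
# Work that completed from a stale cache is degraded.
print(logging.getLevelName(level_for("success", degraded=True)))  # WARNING
# Failed user-visible work is never hidden at info.
print(logging.getLevelName(level_for("failure")))                 # ERROR
```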

Include enough context to act

A log entry should stand alone. During an incident, the reader may see one event in a search result or alert payload, with no surrounding lines for context.

Include identifiers that let the reader find the affected request, job, account, resource, tenant, host, deployment, or upstream dependency. Include the decision the service made. Include the observed status code or error class when a dependency failed.

Do not include secrets, credentials, access tokens, session identifiers, personal data, or full request bodies unless there is a documented and approved reason. Logs are copied, indexed, retained, exported, and read by more systems than your application data ever is.
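As a backstop against accidental leaks, event fields can be passed through a redaction step before they are written. The sketch below masks values by key name. The SENSITIVE_KEYS deny-list and the redact function are assumptions for illustration, and a deny-list will miss any field it does not name, so it complements rather than replaces deciding up front which fields an event may carry.

```python
import json

# Hypothetical deny-list; extend it to match your own field names.
SENSITIVE_KEYS = {"password", "token", "access_token", "authorization",
                  "session_id", "secret", "api_key"}


def redact(value):
    """Return a copy of `value` with sensitive keys masked at every depth."""
    if isinstance(value, dict):
        return {
            key: "[REDACTED]" if str(key).lower() in SENSITIVE_KEYS
            else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    return value


fields = {
    "account_id": "acc-9",
    "headers": {"Authorization": "Bearer abc123", "Accept": "application/json"},
    "attempts": [{"token": "t-1", "status": 401}],
}
print(json.dumps(redact(fields), sort_keys=True))
```

The function returns a new structure and leaves the input untouched, so the application can keep using the original values after logging.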

Make security events explicit

Security-relevant events should be easy to find without guessing text fragments. OWASP recommends recording authentication successes and failures, authorisation failures, input and output validation failures, session management failures, user administration actions such as privilege changes, and access to sensitive data.

Keep these events structured and consistent. Use a stable event name, an outcome, an actor identifier, a target identifier, a source address where appropriate, and a reason code. Never log authentication passwords, access tokens, session identifiers, encryption keys, or connection strings, even when authentication fails. OWASP lists all of these as data to exclude from logs.
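A sketch of such an event follows. The function accepts only the fields named above, so there is no parameter through which a password or token could be passed by mistake. The name security_event, the field names, and the reason code are hypothetical, and the address is taken from a range reserved for documentation.

```python
import json
from datetime import datetime, timezone


def security_event(event, outcome, actor_id, target_id, reason, source_ip=None):
    """Build a security event from a fixed set of fields (names are assumptions)."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "category": "security",
        "event": event,          # stable name, for example "auth.login"
        "outcome": outcome,      # "success" or "failure"
        "actor_id": actor_id,    # who attempted the action
        "target_id": target_id,  # what the action was aimed at
        "reason": reason,        # short reason code, never free text with secrets
    }
    if source_ip is not None:
        record["source_ip"] = source_ip
    return json.dumps(record, sort_keys=True)


print(security_event(
    event="auth.login",
    outcome="failure",
    actor_id="user-381",
    target_id="account-381",
    reason="invalid_credentials",
    source_ip="203.0.113.7",
))
```

The failed login records that the credentials were invalid, not what was submitted.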

Security logs are not only for attack detection. They are also evidence for audit, investigation, and control validation.

Correlate with other signals

A log becomes more useful when it can be connected to a trace, request, or metric time series. Propagate a trace or request identifier through the call path and write it to every event created while handling the request.

Logs, metrics, and traces answer different questions and none replaces the others. The relationship between them is its own topic, covered in the post on metrics, logs, and traces. For an individual event, the practical step is to carry a correlation identifier so the event can be tied back to the wider picture.
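One way to carry a correlation identifier in Python is to hold it in a context variable and stamp it on every record with a logging filter. This is a sketch using only the standard library. The logger name, the request_id field, and the handle_request function are assumptions, and a service that already uses a tracing library would normally take the identifier from there instead.

```python
import contextvars
import io
import logging
import uuid

# Holds the identifier for whichever request is currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request identifier onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def make_logger(stream):
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s request_id=%(request_id)s %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger = logging.getLogger("checkout")  # hypothetical service logger
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, incoming_id=None):
    # Reuse the caller's identifier when there is one; otherwise create one.
    token = request_id_var.set(incoming_id or uuid.uuid4().hex)
    try:
        logger.info("order.created")
        logger.info("payment.authorised")
    finally:
        request_id_var.reset(token)


buffer = io.StringIO()
handle_request(make_logger(buffer), incoming_id="req-123")
print(buffer.getvalue(), end="")
```

Neither logging call mentions the identifier. The filter adds it, so code deep in the call path cannot forget to include it.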

Keep retention and cost visible

Logging has operational cost. High-cardinality fields, verbose payloads, and noisy event streams make storage and query systems expensive. They can also slow down incident response by returning too much irrelevant data.

Set retention by use case. Short-lived debug detail, operational logs, security audit events, and compliance records often need different retention periods. Make those periods explicit and review them when the service changes.

A practical review checklist

For each production log event, ask:

  • What operational question does this answer?
  • Can it be searched by stable fields?
  • Does it include the identifiers needed to act?
  • Is the level correct?
  • Could it leak secrets or sensitive data?
  • Is the event volume acceptable during failure?
  • Can it be correlated with a trace, request, or metric?

If the answer is unclear, change the log or remove it.

Conclusion

Useful logging is designed, not sprinkled through code after something fails. Record structured events at operational boundaries, protect sensitive data, keep levels consistent, and make every event answer a question an engineer will actually ask in production.