Metrics, logs and traces: the three pillars

Metrics, logs and traces are three different ways to understand a running system. They overlap, but they answer different questions and work best when designed together rather than bolted on one at a time.

What each signal is for

Metrics are measurements captured over time. They are compact, cheap to query, and well suited to dashboards, alerting, trends, and service level objectives.

Logs are records of events. They explain decisions, state changes, failures, and security-relevant activity. They are useful when an engineer needs detail that is too specific or too rare to capture in a metric.

Traces show the path of work through a system. They are especially useful in distributed systems where one user request crosses many services, queues, and dependencies.

A mature setup uses all three. Metrics show that something is wrong. Traces help locate where the work slowed down or failed. Logs explain what happened at a specific point.

Metrics answer how much and how often

Metrics are the best signal for service health. A practical baseline for any user-facing path is the four golden signals: latency, traffic, errors, and saturation.

Use metrics for questions such as:

What proportion of requests failed?
How long did successful requests take?
Is the queue growing?
Is the database connection pool exhausted?
Are we close to a quota or capacity limit?

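Use labels carefully. Labels such as service, route, method, outcome, region, and dependency are often useful because each has a small, known set of values. Labels such as user identifier, request identifier, order identifier, or full URL create unbounded cardinality: every distinct value becomes a new time series, which inflates memory and storage and slows queries in the monitoring system.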
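One way to keep labels bounded is to normalise raw request data before it reaches the metrics library. The sketch below is illustrative only: the route templates, the function names, and the "other" fallback bucket are assumptions made for this example, and a real service would usually take the route template from its router.

```python
import re

# Hypothetical route templates for an example service.
ROUTE_PATTERNS = [
    (re.compile(r"^/orders/[^/]+$"), "/orders/{id}"),
    (re.compile(r"^/users/[^/]+/invoices$"), "/users/{id}/invoices"),
    (re.compile(r"^/health$"), "/health"),
]


def route_label(path: str) -> str:
    """Map a raw URL path to a bounded route template."""
    path = path.split("?", 1)[0]  # drop the query string
    for pattern, template in ROUTE_PATTERNS:
        if pattern.match(path):
            return template
    # Unknown paths share one bucket instead of creating new series.
    return "other"


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code to its class, for example 404 to '4xx'."""
    return f"{status_code // 100}xx"


def request_labels(method: str, path: str, status_code: int) -> dict:
    """Build the label set for one request from bounded values only."""
    return {
        "method": method.upper(),
        "route": route_label(path),
        "status_class": status_class(status_code),
    }
```

With this in place, a request for /orders/8812?expand=items and a request for /orders/9001 both land on the same series, so the number of series depends on the number of routes rather than the number of orders.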

Metrics should be stable enough to alert on. If a metric name or label changes every release, it is not a reliable operational contract.

Logs answer what happened

Logs are best for discrete events. A useful entry records a meaningful fact: a request completed, a job failed permanently, a dependency returned an unexpected response, or a security decision was made.

Use logs for questions such as:

Which resource was affected?
What did the service decide to do?
Which dependency returned the error?
Was the request rejected by validation, authentication, or authorisation?
What retry or fallback path was used?

Keep logs structured and include the trace identifier so a log line can lead back to a trace. That is enough for this post. How to make individual log entries genuinely useful is its own topic.
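As a rough illustration of a structured entry that carries the trace identifier, here is a minimal sketch using Python's standard logging module. The JsonFormatter class, the build_logger helper, and the convention of passing extra data under a "fields" key are assumptions for this example, not part of any particular logging library. The example writes to an in-memory buffer so the output is easy to inspect; a service would normally pass a real stream instead.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # trace_id arrives through `extra`; it is None when the caller omits it.
            "trace_id": getattr(record, "trace_id", None),
        }
        # Hypothetical convention: additional structured data lives under `fields`.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry, sort_keys=True)


def build_logger(name: str, stream) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]  # replace handlers left over from earlier setup
    logger.setLevel(logging.INFO)
    logger.propagate = False  # stop the root logger from emitting a second copy
    return logger


buffer = io.StringIO()
log = build_logger("checkout", buffer)
log.info(
    "dependency call failed",
    extra={
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "fields": {"dependency": "payments", "outcome": "timeout"},
    },
)
line = buffer.getvalue().strip()
print(line)
```

Because the trace identifier is a top-level field, a log search for that value returns every entry written during the same request.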

Traces answer where time and failure moved

A trace follows one unit of work across process boundaries. Each span represents an operation within that trace. A trace can show that a request spent most of its time in a downstream API, a database query, a queue wait, or application code.

Use traces for questions such as:

Which service added latency?
Which dependency failed first?
Did retries make the request slower?
Did parallel calls run as expected?
Which path did this request take through the system?
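To make the first of these questions concrete, the sketch below computes the self time of each span: its duration minus the time covered by its direct children. The Span type, its field names, and the sample spans are hypothetical and do not follow any specific tracing backend's data model.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: int
    end_ms: int


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Return span_id -> milliseconds not covered by any direct child."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    result = {}
    for span in spans:
        covered = 0
        cursor = span.start_ms  # end of the child time counted so far
        ordered = sorted(children.get(span.span_id, []), key=lambda s: s.start_ms)
        for child in ordered:
            # Clip to the parent and skip time already counted, so that
            # overlapping (parallel) children are only counted once.
            start = max(child.start_ms, cursor)
            end = min(child.end_ms, span.end_ms)
            if end > start:
                covered += end - start
                cursor = end
        result[span.span_id] = (span.end_ms - span.start_ms) - covered
    return result


# Made-up trace: two outbound calls that overlap in time.
trace = [
    Span("a", None, "GET /checkout", 0, 300),
    Span("b", "a", "inventory lookup", 10, 60),
    Span("c", "a", "payments API", 20, 270),
]
times = self_times(trace)
slowest = max(times, key=times.get)
```

In this made-up trace the payments call accounts for most of the request time, while the handler itself contributes very little. That is the kind of answer a trace view gives at a glance.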

Traces are only useful when context is propagated consistently across services. The W3C Trace Context standard defines the traceparent and tracestate HTTP headers for exactly this, so that every compliant tool can read and forward the same trace identity. Without propagation, traces break at service boundaries and become isolated fragments.
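The traceparent header has four lowercase hexadecimal fields separated by hyphens: a version, a 32 character trace id, a 16 character parent (span) id, and 2 characters of flags. The sketch below parses an incoming header and builds the header for an outbound call. It is a simplified illustration that handles only the version 00 field layout and ignores tracestate. The function names, the TraceContext type, and the choice to mark new traces as sampled are assumptions for this example.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Field layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero identifiers are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, flags)


def outgoing_traceparent(incoming: Optional[str]) -> str:
    """Build the header for an outbound call: same trace id, new span id."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        # No valid context arrived, so start a new trace.
        # Flags "01" marks it as sampled; an assumed default for this sketch.
        ctx = TraceContext(secrets.token_hex(16), "", "01")
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{ctx.trace_id}-{span_id}-{ctx.flags}"
```

The important property is that the trace id survives the hop unchanged while each service contributes its own span id. In practice a tracing library would normally do this, and hand-written propagation is mostly useful for understanding what the library is doing.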

Correlation matters more than volume

Collecting all three signals is not enough. They must share enough context to be used together.

The most important link is a trace or request identifier. A dashboard should lead to a trace. A trace should lead to relevant logs. Logs should include the identifiers needed to find related metrics, resources, and deployments.

Standard names also help. Shared conventions reduce the translation work between teams, libraries, and tools. The goal is not to make every service identical. The goal is to make the common parts predictable.

Alert from symptoms, investigate with detail

Alerts should usually come from user-visible symptoms, not from every internal cause. A high error rate, a failed availability objective, rising tail latency, or exhausted critical capacity is worth attention because it describes impact or imminent impact.
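One common way to turn an availability objective into a symptom alert is to compare the observed error ratio with the error budget the objective allows, often called a burn rate. The sketch below is one possible shape for that check, not a prescribed method. The WindowCounts type, the function names, the two-window rule, and any threshold passed in are assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class WindowCounts:
    """Request totals observed over one time window."""
    total: int
    failed: int


def error_ratio(window: WindowCounts) -> float:
    if window.total == 0:
        return 0.0  # no traffic, so no observed failures
    return window.failed / window.total


def burn_rate(window: WindowCounts, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target  # e.g. a 0.999 target leaves a 0.001 budget
    return error_ratio(window) / budget


def should_page(short: WindowCounts, long: WindowCounts,
                slo_target: float, threshold: float) -> bool:
    # Both windows must burn fast: the long window shows the problem is
    # sustained, the short window shows it is still happening now.
    return (burn_rate(short, slo_target) >= threshold
            and burn_rate(long, slo_target) >= threshold)
```

For example, with a 0.999 target and an example threshold of 14.4, a sustained 2% error ratio in both windows would page, while a brief spike that barely moves the long window would not.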

Logs and traces are usually better for investigation than paging. A single error log line rarely proves user impact. A failed internal call might be retried successfully. A slow dependency might affect only a low-priority background job.

Use the alert to bring someone to the problem. Use logs and traces to help them solve it.

Avoid common design mistakes

Do not use logs as metrics. Counting log lines is fragile because log volume changes with code paths, sampling, and level configuration.

Do not use metrics as forensic records. A counter can show that failures increased, but it cannot explain the exact request, actor, or decision.

Do not use traces as a replacement for service health monitoring. Sampling, retention, and backend cost usually make traces unsuitable as the only source for alerting.

Do not collect high-volume telemetry without an owner. Every signal needs a reason to exist, a retention policy, and a review path when cost or noise grows.

A practical starting set

For a typical HTTP service, start with:

request count by route, method, status class, and outcome
request latency by route and outcome
dependency latency and error count by dependency and operation
saturation metrics for CPU, memory, queues, workers, and connection pools
structured logs for request completion, permanent job failure, security decisions, and dependency failures
traces for inbound requests and significant outbound calls
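The first two items in this list can be sketched with nothing more than a counter and a fixed-bucket histogram. The in-memory class below only illustrates the shape of the data. A real service would use a metrics library, and the bucket boundaries, the class name, and the rule that only 5xx responses count as errors are assumptions for this example.

```python
from bisect import bisect_left
from collections import Counter, defaultdict

# Hypothetical latency bucket upper bounds, in seconds.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5)


class RequestMetrics:
    def __init__(self):
        # Request count keyed by (route, method, status class, outcome).
        self.requests = Counter()
        # Per (route, outcome): one count per bucket plus a final overflow slot.
        self.latency = defaultdict(lambda: [0] * (len(BUCKETS) + 1))

    def observe(self, route: str, method: str, status_code: int, seconds: float):
        status_class = f"{status_code // 100}xx"
        outcome = "error" if status_code >= 500 else "ok"
        self.requests[(route, method, status_class, outcome)] += 1
        # bisect_left returns the first bucket whose upper bound is >= seconds,
        # or len(BUCKETS) when the value is larger than every bound.
        self.latency[(route, outcome)][bisect_left(BUCKETS, seconds)] += 1


metrics = RequestMetrics()
metrics.observe("/orders/{id}", "GET", 200, 0.07)
metrics.observe("/orders/{id}", "GET", 200, 0.1)
metrics.observe("/orders/{id}", "GET", 503, 3.0)
```

Note that the route passed in is already a template rather than a raw path, which keeps the number of keys bounded.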

Then add service-specific signals only when they answer a real operational question.

Conclusion

Metrics, logs and traces work best as a connected system. Metrics show the health of the service, traces show the path of work, and logs explain the events and decisions. Design them together, correlate them with stable identifiers, and alert on symptoms that matter to users.
