Debugging methodically: a checklist

Debugging is faster when you treat it as a controlled investigation. The aim is to move from symptom to cause with evidence, not to cycle through guesses until the failure disappears.

State the failure precisely

Start by writing the failure in one sentence. Include what happened, what was expected, where it happened, and how often it happens.

A weak statement is "login is broken". A useful statement is "password reset returns a 500 response after the token has expired, but only when the request includes a redirect parameter".

The precise statement controls the investigation. If the statement changes, write the new version down. Debugging sessions often stall because the team silently switches from one bug to another.

Reproduce the problem

A bug that cannot be reproduced can still be investigated, but it is harder to prove that it has been fixed. Try to create the smallest repeatable case.

Check:

Input data.
Environment.
Version.
Configuration.
Time and timezone.
Network dependencies.
Feature flags.
User permissions.
Browser, runtime, or operating system.

When reproduction requires production data or timing, capture safe diagnostic facts rather than copying private data. Use synthetic data where possible.
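One way to capture safe diagnostic facts is a small script that records the environment without copying any user data. The sketch below is only an illustration: the function name capture_environment and the idea of passing a list of flag names are assumptions, not part of any particular tool.

```python
import json
import os
import platform
import time
from datetime import datetime, timezone


def capture_environment(flag_names=()):
    """Collect non-sensitive facts about the runtime for a bug report."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "utc_now": datetime.now(timezone.utc).isoformat(),
        "local_timezone_names": list(time.tzname),
        # Record only whether each variable is set, never its value,
        # so secrets do not leak into the report.
        "env_vars_set": {name: name in os.environ for name in flag_names},
    }


def environment_report(flag_names=()):
    """Return the captured facts as pretty-printed JSON text."""
    return json.dumps(capture_environment(flag_names), indent=2, sort_keys=True)
```

Attaching the output of environment_report() from both a passing and a failing machine gives two documents that can be compared line by line.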

Check the recent change set

Recent changes are not always the cause, but they are a useful starting point. Review code changes, dependency updates, configuration changes, data migrations, infrastructure changes, and scheduled jobs.

Avoid anchoring on the first suspicious change. Treat it as a hypothesis, then prove or disprove it.
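When the change set is long, a binary search over it tests the "a recent change caused this" hypothesis without anchoring on any single change. This is the same idea that git bisect automates for commits. The sketch below assumes the changes are ordered oldest to newest and that the failure, once introduced, stays present; first_bad_change and is_bad are hypothetical names.

```python
from typing import Callable, Sequence


def first_bad_change(changes: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest change at which is_bad() reports the failure.

    Assumes changes are ordered oldest to newest and that every change
    after the first bad one is also bad.
    """
    if not changes:
        raise ValueError("no changes to search")
    if not is_bad(changes[-1]):
        raise ValueError("the newest change does not show the failure")

    low, high = 0, len(changes) - 1  # invariant: changes[high] is bad
    while low < high:
        mid = (low + high) // 2
        if is_bad(changes[mid]):
            high = mid       # the first bad change is at mid or earlier
        else:
            low = mid + 1    # the first bad change is after mid
    return changes[low]
```

In practice is_bad would check out or deploy the given change and run the reproduction. If the result is a change that looks harmless, that is evidence against the hypothesis, or a sign that the failure is not monotonic.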

Read the error and stack trace fully

Read the first error, the final error, and the stack frames between them. Do not stop at the top line if it is a wrapper error. Do not ignore the caused-by chain if the platform provides one.

A complete stack trace helps answer three questions:

Where was the error raised?
Which path reached that point?
Which layer translated or swallowed the original failure?

If the stack crosses framework or vendor code, find the last frame that belongs to the project. That is often the best place to inspect state.
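In Python, the caused-by chain is exposed through the __cause__ and __context__ attributes of an exception, and the standard traceback module can list the frames. The following is a minimal sketch; the helper names and the is_project_file predicate are assumptions made for illustration.

```python
import traceback


def exception_chain(exc):
    """Yield exc, then each exception that led to it (wrapper first, origin last)."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        yield exc
        if exc.__cause__ is not None:
            exc = exc.__cause__      # set explicitly by "raise ... from err"
        elif not exc.__suppress_context__:
            exc = exc.__context__    # set implicitly when raising inside "except"
        else:
            exc = None               # "raise ... from None" hid the context


def last_project_frame(exc, is_project_file):
    """Return the deepest frame whose filename satisfies is_project_file, or None."""
    frames = traceback.extract_tb(exc.__traceback__)
    for frame in reversed(frames):   # extract_tb lists the oldest call first
        if is_project_file(frame.filename):
            return frame
    return None
```

A typical predicate might be lambda filename: filename.startswith("/srv/app/"), where the path is whatever root the project is deployed under. Calling last_project_frame on the last exception in the chain points at the project code closest to the original failure.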

Form one hypothesis at a time

A hypothesis should be testable. "The cache is wrong" is vague. "The user profile cache key omits the region, so users with the same id in different regions can collide" is testable.

Write down the observation that would disprove the hypothesis. This guards against confirmation bias, the tendency to favour evidence that supports what you already believe.
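The cache key hypothesis above can be turned into a check whose outcome is decided in advance. In this sketch, profile_cache_key stands in for the code under suspicion and is entirely hypothetical. The disproving observation is simple: if the same user id in two regions produces two different keys, the hypothesis is wrong.

```python
def profile_cache_key(user_id, region):
    """Hypothetical key builder under suspicion: region is accepted but ignored."""
    return f"profile:{user_id}"


def region_aware_cache_key(user_id, region):
    """Hypothetical corrected key builder that includes the region."""
    return f"profile:{region}:{user_id}"


def keys_collide(make_key, user_id, regions):
    """Return True if any two regions map the same user id to the same key."""
    keys = [make_key(user_id, region) for region in regions]
    return len(set(keys)) < len(keys)
```

Here keys_collide(profile_cache_key, 42, ["eu", "us"]) returns True, which supports the hypothesis, while the region-aware builder returns False. Had the suspect builder returned False, the hypothesis would be discarded and the next one written down.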

Add the smallest useful instrumentation

Use the lowest-impact tool that can answer the current question. That may be a log line, a breakpoint, a watch expression, a database query, a trace, or a metric.

Use debuggers when you need runtime state. Breakpoints pause execution at a chosen line so you can inspect variables, evaluate expressions, and walk the call stack. Step controls help you follow the exact path. Watch expressions help when a value changes over time.

Use logging when timing, concurrency, deployment, or remote execution makes a live debugger unsafe or impractical. Log facts, not guesses. Include identifiers that let related events be grouped, but avoid secrets and personal data.
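A log line that follows this advice might look like the sketch below, which uses Python's standard logging module. The logger name, the request_id field, and the list of sensitive keys are assumptions chosen for illustration; a real project would define its own.

```python
import logging

logger = logging.getLogger("password_reset")  # hypothetical logger name

# Assumed list of fields that must never be logged in clear text.
SENSITIVE_KEYS = frozenset({"password", "token", "email"})


def format_event(request_id, event, **facts):
    """Build one log message: a grouping id, an event name, and observed facts."""
    parts = [f"request_id={request_id}", f"event={event}"]
    for key in sorted(facts):  # stable order makes lines easy to compare
        value = "<redacted>" if key in SENSITIVE_KEYS else facts[key]
        parts.append(f"{key}={value}")
    return " ".join(parts)


def log_event(request_id, event, **facts):
    """Emit the formatted event at INFO level."""
    logger.info(format_event(request_id, event, **facts))
```

Because every line carries the same request_id, all events for one request can be grouped with a single search, and redaction happens in one place rather than at each call site.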

Narrow the boundary

Find a boundary where the input is correct and the output is wrong, then move it inward until the region that can contain the fault is as small as possible.

Common boundaries include:

HTTP request and response.
Function input and return value.
Queue message production and consumption.
Database read and write.
Cache lookup and store.
Serialisation and parsing.
Third-party request and response.

This method turns a large failure into a smaller one. Once the bad boundary is known, inspect only the code that can affect it.
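When the code is already organised as a sequence of steps, the boundary search can be mechanical: check the value after each step and stop at the first one that is wrong. The sketch below assumes each stage is a plain function and that an expected-value check exists for each stage; all names are hypothetical.

```python
def first_failing_stage(stages, value, checks):
    """Run stages in order and return the name of the first one whose output
    fails its check, or None if every output passes.

    stages: list of (name, function) pairs applied in sequence.
    checks: dict mapping each stage name to a predicate on that stage's output.
    """
    for name, stage in stages:
        value = stage(value)
        if not checks[name](value):
            return name  # input to this stage was good, output is bad
    return None
```

For example, with stages that parse "1,2,3", convert the items to integers, and sum them, a result of "total" means the parsing and conversion code can be ignored and only the summing code needs inspection.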

Compare a passing case with a failing case

A diff between a passing and a failing case is often more useful than a large log. Compare inputs, headers, configuration, permissions, timestamps, dependency versions, data shape, and execution path.

Keep the cases as similar as possible and change one variable at a time. If several variables differ, the comparison may prove nothing.
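If each case can be summarised as a flat set of facts, the comparison can be reduced to the keys that actually differ. This is a minimal sketch that assumes both cases are dictionaries with string keys; diff_cases and the ABSENT marker are hypothetical names.

```python
ABSENT = "<absent>"  # marker for a key that exists in only one case


def diff_cases(passing, failing):
    """Return {key: (passing_value, failing_value)} for every key that differs."""
    differences = {}
    for key in sorted(passing.keys() | failing.keys()):
        left = passing.get(key, ABSENT)
        right = failing.get(key, ABSENT)
        if left != right:
            differences[key] = (left, right)
    return differences
```

If the result contains more than one input variable, the cases are not yet similar enough: adjust one of them and compare again until a single difference remains.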

Be careful with time, state, and concurrency

Intermittent bugs often involve mutable state, time, scheduling, retries, caches, or concurrent access. Check whether the failure depends on order.

Ask:

Does it fail on the first run or only after warm-up?
Does it fail after a cache entry expires?
Does it fail around daylight saving changes or date boundaries?
Does it fail under parallel execution?
Does a retry hide the original error?

For concurrency bugs, adding logs or breakpoints can change timing. Prefer targeted instrumentation and repeatable stress tests.
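A repeatable stress test can be as simple as running the same parallel workload many times and counting the rounds that end in a wrong state. The sketch below uses ThreadPoolExecutor from the standard library; the stress function, its parameters, and the LockedCounter example are assumptions for illustration, and a passing run reduces the likelihood of a race rather than proving its absence.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedCounter:
    """Shared counter whose increment is protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # without the lock, the read-modify-write could interleave
            self.value += 1


def stress(make_subject, operation, is_correct, workers=8, calls=2000, rounds=10):
    """Run `calls` parallel operations per round and count the rounds that end wrong."""
    failures = 0
    for _ in range(rounds):
        subject = make_subject()  # fresh state so rounds do not affect each other
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(operation, subject) for _ in range(calls)]
            for future in futures:
                future.result()  # re-raise any exception from a worker thread
        if not is_correct(subject):
            failures += 1
    return failures
```

A call such as stress(LockedCounter, LockedCounter.increment, lambda c: c.value == 2000) gives a failure count that can be compared before and after a fix, which is more reliable than a single manual run.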

Prove the fix before cleaning up

A fix is proven when the failing reproduction passes and relevant existing tests still pass. Add a regression test when the behaviour should stay fixed.

The regression test should fail before the fix and pass after it. If that is not practical, document why and add the closest reliable coverage.
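As an illustration, a regression test for the password reset example from earlier in this checklist could look like the sketch below, written with Python's unittest module. The reset_password function is a hypothetical stand-in for the real handler, and the expected 400 status is an assumption about the intended behaviour, since the original example only says that a 500 was wrong.

```python
import unittest


def reset_password(token_expired, redirect=None):
    """Hypothetical handler that returns an HTTP-style status code."""
    if token_expired:
        # The fix under test: reject an expired token before the redirect
        # parameter is used, so its presence can no longer cause a 500.
        return 400
    return 302 if redirect else 200


class PasswordResetRegressionTest(unittest.TestCase):
    def test_expired_token_with_redirect_is_rejected(self):
        # This is the exact failing case from the bug statement.
        status = reset_password(token_expired=True, redirect="/home")
        self.assertEqual(status, 400)

    def test_valid_token_with_redirect_still_redirects(self):
        # Guards against the fix breaking the normal path.
        status = reset_password(token_expired=False, redirect="/home")
        self.assertEqual(status, 302)
```

Running the first test against the code from before the fix should fail. If it passes there, the test does not actually cover the bug.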

Do not delete diagnostic notes too early. They may be needed for the pull request description, incident review, or future debugging.

Write the conclusion

Close the loop with a short explanation:

Symptom.
Root cause.
Fix.
Test evidence.
Follow-up risk if any.

This prevents the same investigation from being repeated later. It also separates the real cause from the guesses that were explored along the way.

Conclusion

Methodical debugging is a discipline of precision. State the failure, reproduce it, inspect the evidence, test one hypothesis at a time, narrow the boundary, and prove the fix. The checklist is simple because the hard part is resisting guesses.