Why your CI pipeline is slow, fragile, and lying to you

A slow CI pipeline is visible. A fragile one is tolerated. A misleading one is dangerous. The worst pipelines do not just waste time. They give teams confidence that a change is safe when the checks are incomplete, noisy, or disconnected from production risk.

Speed is not the first problem

Pipeline duration matters because feedback delay changes behaviour. Engineers batch changes, defer tests, retry blindly, or merge with less context when feedback is slow. But speed alone is not the target.

A fast pipeline that skips meaningful checks is worse than a slow one that runs them. A useful pipeline answers a specific question: is this change safe enough to progress to the next stage?

That question requires a clear test strategy. Unit tests should protect local logic. Integration tests should protect contracts. Security checks should protect known classes of risk. Build steps should prove the artefact can be produced repeatably. Deployment checks should prove the artefact can move through the release path.

Fragility usually comes from hidden state

CI becomes fragile when jobs depend on mutable external state: shared databases, uncontrolled test data, floating dependencies, overloaded runners, undeclared services, or caches that are treated as correctness mechanisms.

Caching is useful for performance, but it should not be required for correctness. A cache miss should make the job slower, not different. When a clean run and a cached run produce different results, the pipeline is hiding a dependency problem. This is the same reasoning behind hermetic builds, where the same inputs are expected to produce the same output regardless of the host.
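One way to test the "slower, not different" rule is to build the same commit twice, once from a clean workspace and once with a warm cache, and compare the outputs. The sketch below assumes both builds write their artefacts to a directory; the function names are illustrative and not part of any particular CI tool.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_outputs(clean: dict[str, str], cached: dict[str, str]) -> list[str]:
    """Return the sorted paths that are missing from one run or differ between runs."""
    return sorted(
        path
        for path in clean.keys() | cached.keys()
        if clean.get(path) != cached.get(path)
    )
```

An empty list from diff_outputs is the expected result. Anything else means the cache is changing the output rather than only the duration. Builds that embed timestamps or absolute paths will show up here too, which is usually worth knowing.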

The same logic applies to test order, time zones, random ports, live third-party APIs, and environment variables that exist on one runner but not another. A pipeline that only passes in one accidental environment is not a quality gate.

Retries are a signal, not a fix

Retries can reduce noise from transient infrastructure failures. They should not be used to normalise flaky tests. A flaky test is one that both passes and fails with no change to the code, the test, or the environment. A test that passes on the third attempt has still reported useful information: something is nondeterministic.
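That definition can be applied mechanically to stored test results. The sketch below assumes each result is recorded as a (test name, commit, passed) tuple and that all results for a commit came from the same environment; both the record shape and the function name are assumptions made for illustration.

```python
from collections import defaultdict


def find_flaky(results: list[tuple[str, str, bool]]) -> set[tuple[str, str]]:
    """Return (test, commit) pairs that both passed and failed on the same commit."""
    outcomes: dict[tuple[str, str], set[bool]] = defaultdict(set)
    for test, commit, passed in results:
        outcomes[(test, commit)].add(passed)
    # A pair is flaky only if both outcomes were observed for identical code.
    return {key for key, seen in outcomes.items() if seen == {True, False}}
```

A test that fails consistently on a commit is not flagged here. It is simply failing, and that is a different problem.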

Track retry rate separately from failure rate. A green build that needed multiple retries should not be treated as equivalent to a clean build. Flakiness consumes attention, weakens trust, and eventually trains teams to ignore red builds.
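As a rough illustration of keeping the two numbers apart, the sketch below summarises a batch of job runs. The JobRun record and its fields are hypothetical; a real pipeline would fill them from whatever its CI system exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    job: str
    attempts: int  # total attempts, including the first
    passed: bool   # final outcome after all attempts


def summarise(runs: list[JobRun]) -> dict[str, float]:
    """Report failure rate, retry rate, and clean pass rate as separate figures."""
    total = len(runs)
    if total == 0:
        return {"failure_rate": 0.0, "retry_rate": 0.0, "clean_pass_rate": 0.0}
    failed = sum(1 for run in runs if not run.passed)
    retried = sum(1 for run in runs if run.attempts > 1)
    clean = sum(1 for run in runs if run.passed and run.attempts == 1)
    return {
        "failure_rate": failed / total,
        "retry_rate": retried / total,
        "clean_pass_rate": clean / total,
    }
```

The gap between one minus the failure rate and the clean pass rate is the share of green builds that needed a retry. That is the number a dashboard showing only pass and fail hides.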

The pipeline may be lying about coverage

Many pipelines report success without checking the riskiest parts of a change. A service change may pass tests but never exercise database migrations. A frontend change may pass build checks but never validate accessibility or browser behaviour. An infrastructure change may validate syntax but not policy impact.

The answer is not to add every possible check to every pull request. The answer is to classify changes and run the checks that match the risk. A documentation change should not wait behind a full production simulation. A permissions change should not skip policy review because unit tests passed.
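Classification can start as a small table from changed paths to required checks. The sketch below is one possible shape, not a recommendation for specific rules: the path patterns and check names are invented for illustration, and anything that matches no rule falls back to a default set so it is never silently unchecked.

```python
from fnmatch import fnmatch

# Hypothetical rules: path pattern -> checks that a change under it requires.
RULES: list[tuple[str, set[str]]] = [
    ("docs/*", {"docs-lint"}),
    ("migrations/*", {"unit", "integration", "migration-dry-run"}),
    ("iam/*", {"unit", "policy-review"}),
]

# Paths that match no rule still get a baseline set of checks.
DEFAULT_CHECKS: set[str] = {"unit", "integration"}


def checks_for(changed_paths: list[str]) -> set[str]:
    """Return the union of checks required by every changed path."""
    selected: set[str] = set()
    for path in changed_paths:
        matched = [checks for pattern, checks in RULES if fnmatch(path, pattern)]
        if matched:
            for checks in matched:
                selected |= checks
        else:
            selected |= DEFAULT_CHECKS
    return selected
```

Under these rules a docs-only change runs only docs-lint, while a change touching iam/ always includes policy-review regardless of what else it touches. Taking the union means a mixed change is held to the strictest requirement among its parts.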

CI should produce decisions

Good CI output is not a wall of logs. It is a decision record. What changed? What was checked? What was skipped, and why? Which artefact was produced? Which version of each tool ran? What evidence supports promotion?
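Those questions map fairly directly onto a structured record that a pipeline could emit alongside its logs. The sketch below shows one possible shape; the field names and the promotion rule are assumptions made for illustration, not a standard format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class DecisionRecord:
    commit: str
    changed_paths: list[str]
    checks_run: dict[str, str]      # check name -> "passed" or "failed"
    checks_skipped: dict[str, str]  # check name -> reason it was skipped
    artefact_digest: str
    tool_versions: dict[str, str]   # tool name -> version that actually ran

    @property
    def promote(self) -> bool:
        """Promote only if at least one check ran and every check that ran passed."""
        return bool(self.checks_run) and all(
            result == "passed" for result in self.checks_run.values()
        )

    def to_json(self) -> str:
        """Serialise the record, including the promotion decision, as stable JSON."""
        payload = asdict(self)
        payload["promote"] = self.promote
        return json.dumps(payload, indent=2, sort_keys=True)
```

Recording skipped checks with a reason is the part most often missing. It separates "not relevant to this change" from "never configured", and the two look identical in a green build.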

This matters for debugging and for audit. If a production incident traces back to a change, the pipeline should show what the system believed at the time of release.

Conclusion

A good CI pipeline is not simply fast. It is deterministic, targeted, and honest. It separates performance optimisations from correctness, treats flakiness as a defect, and gives teams evidence they can use. The goal is not a green badge. The goal is trustworthy feedback.
