Dmytro Huz

Posted on May 25

Test Flakiness: The Silent Killer of Engineering Trust

#qa #programming #cicd #testing

“In a quiet village, a shepherd boy watched his flock. Bored, he often shouted ‘Wolf!’ when there was none. The villagers rushed to help—only to be lied to. When a real wolf finally came, no one believed him. The flock was lost. Trust, once broken, is hard to restore.”

We build software like miners digging through unstable rock. We need support structures. Git tracks our steps. Linters keep us clean. But one tool stands above all:

Our tests.

They are the final judge. The gatekeeper. The Anubis of our code. They weigh our changes and decide if we go to production—or stay behind and fix our sins.

If tests pass, we merge. If they fail, we stop.

That’s how it’s supposed to work.

What Is a Flaky Test (and Why It Kills Trust)

Even a broken clock is right twice a day.

So is a flaky test.

You push a feature. CI runs. First job: green. Second: green. Last job? Red.

You didn’t touch that part of the code. You re-run. Now it’s green.

You shrug and merge.

And just like that, trust is gone.

You’ve told yourself that failure doesn’t always mean something is broken. You’ve trained yourself to ignore CI. You’ve made peace with uncertainty.

That’s what a flaky test does. It lies. Not always. Just often enough to create doubt.

And doubt is fatal.

The one tool meant to protect your release process is now a coin toss.

Signs You’ve Normalized Flakiness

Some tests fail randomly—and everyone accepts it
Re-running jobs “usually fixes it”
Scheduled pipelines are “almost green” and that’s “fine”
Teams ignore certain red tests because they “always fail”

You may still have tests—but they don’t matter anymore. Your whole QA investment? Burned to the ground. The trust in your tests is dead.

Why It’s a Disaster

CI is no longer trusted. Once your team stops believing in CI, the whole system becomes meaningless. Test failures are ignored, builds are rerun blindly, and every green pipeline is suspect.
You waste developer time. Debugging flaky tests is expensive. You can’t reproduce the failure. Logs don’t help. So you try again. And again. You just lost an hour chasing a ghost.
Releases become unpredictable. Engineers stop merging confidently. Teams delay shipping while “rerunning the pipeline just one more time.” Your velocity drops. Your cycle time explodes.
Your best engineers disengage. No one wants to fix tests they don’t trust. People start skipping tests locally, disabling CI steps, or worse—removing tests altogether.
The culture rots. The worst part: nobody talks about it anymore. Flaky tests become background noise. And eventually, nobody cares about tests AT ALL.

Why It Happens

Async code not properly awaited. Tests pass or fail depending on timing.
Race conditions. Two parts of the system fighting over shared state or resources.
Sleep-based test logic. Hard-coded delays that may or may not be enough. They pass on a fast machine, fail on CI.
Randomness without seeding. Generated data that changes on every run, so results are inconsistent and a failure can’t be replayed.
Shared global state between tests. Test A pollutes something, test B fails mysteriously.
External dependencies. APIs, file systems, DBs—if they’re flaky, your test is flaky.
Resource constraints in CI. Tests that pass locally but fail under CI load.
Actual bugs in your product. Yep—flaky tests often reveal real, timing-sensitive bugs. You can’t ignore them.
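Two of these causes have cheap, mechanical fixes. Here is a minimal Python sketch, not a definitive implementation: `wait_until` and `make_order_total` are hypothetical helpers invented for illustration, and the timer stands in for whatever background work your test actually waits on. The idea is to poll for the real condition with a deadline instead of sleeping a fixed amount, and to pass an explicitly seeded random generator so a failing run can be replayed.

```python
import random
import threading
import time


def wait_until(condition, timeout=5.0, interval=0.01):
    """Poll condition() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    # One last check, in case the condition became true during the final sleep.
    return condition()


def make_order_total(rng):
    """Hypothetical data generator: takes an explicit RNG, not the global one."""
    return sum(rng.randint(1, 100) for _ in range(5))


# Background work that finishes at an unpredictable moment (simulated by a timer).
done = threading.Event()
threading.Timer(0.05, done.set).start()

# Instead of time.sleep(0.1) and hoping, wait for the actual condition.
finished = wait_until(done.is_set, timeout=2.0)

# Same seed, same data: a failure seen in CI can be replayed locally.
SEED = 1234  # arbitrary value; log it with the test run so it can be reused
first = make_order_total(random.Random(SEED))
second = make_order_total(random.Random(SEED))
```

The timeout still exists, but it only bounds the worst case. A fast machine moves on as soon as the condition holds, and a slow CI runner gets the extra time it needs.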

What You Should Do

Flaky tests are a fire. You either put them out—or let them burn your team.

Track them. Tag flaky tests in your reports. Log frequency. Build a dashboard. Know your enemy.
Block on them. Don’t let a known flake “sometimes pass.” If a test in the main suite fails, the build fails. Period. Create pressure to fix, not ignore.
Isolate them. Can’t fix it right away? Move it out of the main suite into a separate one that still runs and still reports. Mark it unstable. Reduce its blast radius, but never hide it.
Fix your worst offenders. Most of the pain usually comes from a small fraction of your tests. Measure which ones. Hunt them down. Refactor them. Redesign if you must. Kill them first.
Set a zero-tolerance culture. One flaky test is too many. Be relentless. Because once your team starts accepting test lies, it stops believing any truth.
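Tracking doesn’t need a fancy dashboard to start. As a rough sketch, assume you can export your CI history as `(test_name, commit_sha, passed)` records. That format, the `flake_report` function, and the sample data below are all made up for illustration. A test counts as flaky on a commit when the same commit produced both a pass and a failure for it, which separates flakes from tests that are simply broken.

```python
from collections import defaultdict


def flake_report(runs):
    """Return (test, flaky_commits, failure_rate) rows, worst offenders first.

    runs: iterable of (test_name, commit_sha, passed) tuples.
    """
    outcomes = defaultdict(lambda: defaultdict(set))  # test -> commit -> {True, False}
    totals = defaultdict(lambda: [0, 0])  # test -> [failures, runs]
    for test, commit, passed in runs:
        outcomes[test][commit].add(passed)
        totals[test][1] += 1
        if not passed:
            totals[test][0] += 1

    report = []
    for test, by_commit in outcomes.items():
        # Flaky on a commit: identical code gave both a pass and a failure.
        flaky_commits = sum(1 for seen in by_commit.values() if seen == {True, False})
        if flaky_commits:
            failures, count = totals[test]
            report.append((test, flaky_commits, failures / count))
    # Most flaky commits first, then highest failure rate, then name.
    report.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return report


# Made-up history for illustration only.
history = [
    ("test_checkout", "a1", True),
    ("test_checkout", "a1", False),
    ("test_checkout", "b2", False),
    ("test_checkout", "b2", True),
    ("test_login", "a1", True),
    ("test_login", "b2", True),
    ("test_search", "b2", False),
    ("test_search", "b2", False),
]

for name, flaky_commits, failure_rate in flake_report(history):
    print(f"{name}: flaky on {flaky_commits} commit(s), failure rate {failure_rate:.0%}")
```

In this sample only `test_checkout` is reported. `test_search` fails every time on the same commit, so it is a real failure, not a flake, and it deserves a different kind of attention.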

Final Word: Raise the Bar

Flaky tests are not a small annoyance. They’re a signal that quality is slipping.

Ignore that signal long enough, and you’ll end up with a pipeline nobody respects, a test suite nobody maintains, and releases nobody trusts.

You want speed? You want stability?

You want a team that moves fast and confidently?

Then it’s simple:

No flaky tests. Ever.

DEV Community