I ran `pytest --co -q` on ForkHub one morning and counted 489 tests. The project had maybe 12 source files. Something had gone sideways, and I had started noticing this kind of test pileup in other projects too.
Every Claude Code session was generating fresh tests without checking what already existed. Fixtures got duplicated, stubs got reinvented, and the test suite was getting bloated. Claude ran a per-test coverage analysis (another thing I learned) and found that 287 of those 489 tests (59%) covered zero unique lines. Every line they touched was also covered by at least one other test. Those 287 tests were dead weight.
I've since cleaned this up and added guardrails to prevent it from happening again. ForkHub is down to 119 tests with the same coverage. Here's how I got there.
How AI coding assistants create test sprawl
The problem is structural, not a quality issue with any particular AI tool. Each coding session starts fresh. The agent doesn't know what tests exist, what stubs have been defined, or what fixtures are shared. So it does the responsible thing: it writes tests for the code it's building. The trouble is that the previous session did the same thing.
I have no proof of this, but I have a hunch it may be an artifact of asking the agents to do TDD. The agent takes the requirements and pounds out tests without thinking too hard about what already exists.
Over time, this produces a pattern:
- Stubs get reinvented per-file instead of shared. ForkHub had 6+ copies of `StubGitProvider` across different test files.
- Fixtures get duplicated. 12 identical `db()` fixtures, each defined locally in separate test files.
- "Safety" tests overlap heavily. Tests for edge cases that another test already covers.
- Test count grows linearly with sessions, but coverage plateaus early.
The result is a test suite that feels thorough but is mostly redundant. And it slows down your feedback loop for no reason. My initial reaction was "Great, I've got tons of test coverage," but that quickly changed when I realized it was getting to be too much.
Per-test coverage analysis: finding the dead weight
The key insight is that coverage.py can track which lines each individual test covers, not just aggregate coverage. This is the tool that makes the whole approach work.
```shell
cat > /tmp/.coveragerc_dyn << 'EOF'
[run]
source = src/forkhub
dynamic_context = test_function
data_file = /tmp/.coverage_contexts
EOF

coverage run --rcfile=/tmp/.coveragerc_dyn -m pytest tests/ -q -p no:cov
coverage json --rcfile=/tmp/.coveragerc_dyn --show-contexts -o /tmp/cov_contexts.json
```
The `dynamic_context = test_function` setting records which test function was running when each line was executed. The JSON output gives you a map: for every source line, you know exactly which tests cover it.
From there, the analysis is straightforward. For each test, count how many lines it covers exclusively: lines no other test touches. Tests with zero unique lines are 100% safe to delete. Removing them cannot change your coverage percentage, because every line they cover is also covered by something else.
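That analysis is a few lines of Python over the JSON that `coverage json --show-contexts` emits (a `files` map whose per-file `contexts` entry maps line numbers to the contexts that executed them). A minimal sketch, operating on an already-loaded JSON dict:

```python
from collections import defaultdict

def find_zero_unique(cov_json):
    """Return tests whose covered lines are all covered by some other test."""
    line_owners = defaultdict(set)
    for path, info in cov_json["files"].items():
        for lineno, contexts in info.get("contexts", {}).items():
            for ctx in contexts:
                if ctx:  # skip the empty "" context (import-time execution)
                    line_owners[(path, int(lineno))].add(ctx)

    unique = defaultdict(int)
    all_tests = set()
    for owners in line_owners.values():
        all_tests |= owners
        if len(owners) == 1:
            unique[next(iter(owners))] += 1
    return sorted(t for t in all_tests if unique[t] == 0)

# Toy example: test_b's only line is also covered by test_a,
# so test_b is provably safe to delete.
cov = {"files": {"src/forkhub/repo.py": {"contexts": {
    "10": ["test_a"], "11": ["test_a", "test_b"]}}}}
print(find_zero_unique(cov))  # ['test_b']
```

In practice you would `json.load` the file from `/tmp/cov_contexts.json` and pass it in; the structure shown matches coverage.py's JSON report format, but field details can vary by version, so check your output first.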
One gotcha: use `coverage run` directly, not `pytest --cov`. The pytest-cov plugin's `--cov-context=test` sets a static label, not dynamic per-test contexts. You'll get empty context data and wonder what went wrong. Also pass `-p no:cov` to prevent pytest-cov from interfering, and write to a separate `data_file` to avoid conflicts with any existing `.coverage` database.
In ForkHub, this analysis identified 287 zero-unique tests out of 489. Nearly 60% of the suite was provably redundant.
The autoloop approach: autonomous test consolidation
I could have deleted 287 tests by hand, but I'd been experimenting with autoloop, an autonomous experiment loop based on Andrej Karpathy's autoresearch pattern. Besides, who wants to go through and manually delete a bunch of tests? The idea is simple: define a metric, let an agent propose changes, measure the result, keep improvements, revert regressions. Hill-climbing, but with an LLM as the proposal engine.
For test consolidation, the metric is test count with a coverage floor as a quality gate. The agent can consolidate, parametrize, or delete tests freely, but if coverage drops below the floor, the experiment gets reverted.
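The core loop is simple enough to sketch. This is not autoloop's actual implementation, just the keep/revert logic it describes, with the measurement and the agent's edit steps injected as callables (all names here are hypothetical):

```python
COVERAGE_FLOOR = 85.0  # quality gate: revert anything that dips below this

def hill_climb(measure, propose, revert, rounds):
    """Keep a proposed change only if coverage holds and test count drops.

    measure() -> (test_count, coverage_pct); propose()/revert() edit the repo.
    """
    best_count, _ = measure()
    for _ in range(rounds):
        propose()                  # agent consolidates or deletes tests
        count, coverage = measure()
        if coverage < COVERAGE_FLOOR or count >= best_count:
            revert()               # regression or no improvement: roll back
        else:
            best_count = count     # improvement: keep it
    return best_count

# Toy run: each proposal deletes 10 tests, but the third one
# also drops coverage below the floor, so it gets reverted.
state = {"count": 489, "cov": 91.0, "step": 0}
def measure(): return state["count"], state["cov"]
def propose():
    state["step"] += 1
    state["count"] -= 10
    if state["step"] == 3:
        state["cov"] = 80.0
def revert():
    state["count"] += 10
    state["cov"] = 91.0

print(hill_climb(measure, propose, revert, rounds=4))  # 459
```

In the real loop, `measure` reruns the suite under coverage and `propose` is an LLM session; the control flow is the same.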
A two-phase strategy works best:
Phase 1: Intra-file consolidation. The agent merges narrow tests into parametrized tests (5 tests checking 5 enum values become 1 parametrized test), removes tests that only validate framework behavior (Pydantic defaults, Click parsing), and consolidates tests with near-identical setup. This phase typically yields a 10-15% reduction before plateauing.
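The parametrization move looks like this in pytest. The enum and parser here are made up for illustration (ForkHub's real types differ), but the before/after shape is the point:

```python
from enum import Enum

import pytest

class Visibility(Enum):  # hypothetical ForkHub enum
    PUBLIC = "public"
    PRIVATE = "private"
    INTERNAL = "internal"

def parse_visibility(raw: str) -> Visibility:
    return Visibility(raw)

# Before: three near-identical tests (test_parse_public, test_parse_private,
# test_parse_internal), each asserting one enum value round-trips.
# After: one parametrized test covering every value.
@pytest.mark.parametrize("raw, expected", [(v.value, v) for v in Visibility])
def test_parse_visibility(raw, expected):
    assert parse_visibility(raw) == expected
```

Same coverage, one test node per value in the report, and a new enum member is picked up automatically.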
Phase 2: Cross-file deduplication. Once individual files are clean, the agent looks across files for structural duplication. Duplicate test files testing different instances of the same abstraction. Copy-pasted helper functions. Shared setup that should live in conftest.py. This phase can delete entire files at once, and the wins are bigger per experiment.
The signal to switch phases: when per-experiment improvement drops below half of the early-phase average, it's time to reconfigure the strategy, not the parameters.
The gotcha: stale seeds
There's a trap that cost me some time (or rather cost Claude some time). The coverage seed (the map of which tests are zero-unique) goes stale as you delete tests.
Here's why: when you delete test A, the lines it shared with test B are now exclusively covered by test B. Test B was zero-unique before (all its lines were shared), but now it's the sole owner of those lines. Delete test B and you lose coverage.
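The ownership shift is just set arithmetic, which makes it easy to see with toy data:

```python
def unique_lines(test, cov):
    """Lines covered by `test` and by no other test in `cov`."""
    others = set().union(*(lines for t, lines in cov.items() if t != test))
    return cov[test] - others

# Toy seed: both tests cover the same two lines, so both look zero-unique.
cov = {"test_a": {5, 6}, "test_b": {5, 6}}
print(sorted(unique_lines("test_b", cov)))  # []: the seed says test_b is safe

del cov["test_a"]  # prune test_a, as the seed allows
print(sorted(unique_lines("test_b", cov)))  # [5, 6]: test_b now owns both lines
```

The seed computed before the deletion still says test_b is safe; acting on it now would drop lines 5 and 6 from coverage.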
The symptom is clear: 3 or more consecutive experiments get reverted because they dropped coverage, even though the seed said those tests were safe. When that happens, stop and regenerate the seed against the current test suite.
For ForkHub, I regenerated the seed twice across the full cleanup. The first seed found 287 zero-unique tests at 489 total. After pruning ~150 tests, the second seed found fewer zero-unique tests (the easy wins were gone), but still enough to continue making progress.
As a rule of thumb: regenerate every time the suite shrinks by 10-15%.
Prevention: cheaper than cleanup
The autoloop cleanup took about an hour of compute. Prevention costs nothing. Three changes keep the problem from recurring:
The first two are standard pytest hygiene that any project should have: a tests/stubs.py for shared stubs and a tests/conftest.py for shared fixtures. ForkHub had these, but Claude Code wasn't using them. Each session defined its own local StubGitProvider and db() fixture instead of importing the shared ones. The infrastructure existed; the agent didn't know to look for it.
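For concreteness, here is roughly what that shared infrastructure looks like. The interfaces are illustrative, not ForkHub's actual ones:

```python
# tests/stubs.py -- one shared stub instead of six local copies
class StubGitProvider:
    """In-memory stand-in for the git provider, importable by any test file."""
    def __init__(self, repos=None):
        self.repos = repos or {}

    def get_repo(self, name):
        return self.repos.get(name)


# tests/conftest.py -- one shared fixture instead of twelve local copies
import sqlite3

import pytest

@pytest.fixture
def db(tmp_path):
    """Hand each test a throwaway SQLite database."""
    conn = sqlite3.connect(tmp_path / "test.db")
    yield conn
    conn.close()
```

Any test file then takes `db` as an argument and does `from tests.stubs import StubGitProvider`, and there is exactly one definition of each to maintain.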
The fix that made the difference was adding guardrails to CLAUDE.md, the project instructions file that every Claude Code session reads automatically:
Dos:
- Use shared stubs from tests/stubs.py and fixtures from tests/conftest.py
- Check existing tests before adding new ones; parameterize or extend instead of duplicating
Don'ts:
- Don't define local StubGitProvider, db fixtures, or factory helpers in test files
- Don't add tests that overlap with existing ones; search test files first
This is the cheapest intervention. It costs nothing to add and it addresses the root cause: the agent generates tests in isolation because it doesn't know what test infrastructure already exists. Telling it where to look solves the problem.
Results
ForkHub went from 489 tests to 119. A 76% reduction with zero coverage loss. The remaining tests are individually more load-bearing; unique line coverage per test increased from an average of 2.3 to 8.7 lines.
The approach has worked on other projects too, with similar ratios of redundancy. The pattern is consistent: AI-assisted TDD accumulates duplicate coverage over time, per-test coverage analysis identifies the redundancy mathematically, and autoloop prunes it autonomously.
If you're using AI coding assistants and practicing TDD (which you should be), test sprawl is going to happen. The fix isn't writing fewer tests. It's giving the agent a coverage map so it knows which tests already exist, and a shared test infrastructure so it doesn't reinvent stubs every session.
I didn't notice the bloat until I had 489 tests on 12 source files. But the cleanup taught me more about how coverage works at the per-test level than years of writing tests ever did, and I got to do a fun experiment with the autoresearch idea along the way.
The autoloop plugin I used for this is part of my Claude Code plugin marketplace.