Skip to content

Test suite fails intermittently due to file system race conditions #48

@gregpriday

Description

@gregpriday

Summary

Integration and unit tests that use temporary file systems fail intermittently when run in parallel (default Jest configuration). Tests pass consistently when run with --maxWorkers=1 (serial execution), indicating race conditions in test setup/teardown or shared resource access.

Current Behavior

When running npm test (parallel execution), tests randomly fail with empty file arrays:

Observed failures (varies between runs):

  • tests/integration/pipeline.test.js - "should respect gitignore patterns"
    • Expected index.js to be in results, but received empty array []
  • tests/unit/pipeline/stages/FileDiscoveryStage.copytreeinclude.test.js - "should transform bare directory name to recursive pattern"
    • Expected .claude/config.json to be in results, but received empty array []

Reproducibility:

  • Parallel execution (npm test): Fails 1-2 tests in ~50% of runs
  • Serial execution (npm test -- --maxWorkers=1): Always passes (934/934 tests)
  • Single test execution: Always passes

Expected Behavior

All 934 tests should pass consistently in both parallel and serial execution modes without flakiness.

Steps to Reproduce

  1. Checkout develop branch at commit 9293115
  2. Run npm install
  3. Run npm test (parallel execution)
  4. Observe 1-2 random test failures in file discovery tests
  5. Run npm test -- --maxWorkers=1 (serial execution)
  6. Observe all 934 tests pass

Environment

  • OS: macOS 14.x (Darwin 25.0.0)
  • Runtime: Node.js 20.x
  • Test Framework: Jest 29.7.0
  • Branch: develop
  • Commit: 9293115

Evidence

Error Output (example from parallel run):

FAIL tests/integration/pipeline.test.js
  ● Pipeline Integration Tests › gitignore support › should respect gitignore patterns

    expect(received).toContain(expected) // indexOf

    Expected value: "index.js"
    Received array: []

      216 |       expect(filePaths).not.toContain('debug.log');
      217 |       expect(filePaths).not.toContain('node_modules/package.js');
    > 218 |       expect(filePaths).toContain('index.js');
          |                         ^
      219 |     });

      at Object.toContain (tests/integration/pipeline.test.js:218:25)

Successful Serial Run:

$ npm test -- --maxWorkers=1
Test Suites: 56 passed, 56 total
Tests:       934 passed, 934 total
Time:        36.124 s

Affected Files:

Root Cause (Hypothesis)

The issue appears to be related to insufficient test isolation in file system operations:

  1. Timestamp collisions: Both tests use Date.now() for temp directory names. When tests run in parallel, multiple tests can execute within the same millisecond, creating the same directory path.

  2. Async cleanup timing: afterEach() hooks use fs.remove(tempDir) which may not complete before another test tries to create files in a directory with the same path.

  3. Shared temp directory space: Tests write to os.tmpdir() without proper namespacing or locks, allowing concurrent tests to interfere with each other.

Evidence supporting this theory:

  • Tests pass when forced to run serially (--maxWorkers=1)
  • Failures are nondeterministic (different tests fail on different runs)
  • Failures always result in empty file arrays (suggesting files weren't discovered because directory state was unexpected)

Impact

  • Priority-1: According to CLAUDE.md, test flakiness is explicitly a Priority-1 issue
  • CI/CD Risk: Flaky tests can block legitimate PRs in CI pipelines
  • Developer Experience: Developers may waste time investigating false positives
  • Test Coverage Erosion: Teams may start ignoring test failures if they're unreliable

Tasks

  • Audit all tests using os.tmpdir() + Date.now() - Identify scope of problem

  • Replace Date.now() with proper unique IDs - Use more robust uniqueness guarantees

    • Option 1: crypto.randomUUID() (Node.js 14.17+)
    • Option 2: Existing test helper tests/helpers/tempfs.js if it provides collision-safe temp dirs
    • Option 3: Incrementing counter with test name prefix
  • Ensure proper async cleanup - Wait for cleanup to complete before next test

    • Add await to all fs.remove() calls in afterEach()
    • Consider using beforeAll/afterAll if temp dirs can be reused per suite
  • Add test isolation validation - Prevent future regressions

    • Create a test that verifies temp directory uniqueness across parallel runs
    • Add linting rule or test to detect Date.now() usage in test file setup
  • Document testing best practices - Update tests/CLAUDE.md and tests/README.md

    • Add section on temp directory creation patterns
    • Document the collision risk with Date.now()
    • Provide approved patterns for test isolation
  • Run extended test suite - Verify fix eliminates flakiness

    • Run npm test 10 times consecutively without --maxWorkers=1
    • All runs should pass 934/934 tests

Acceptance Criteria

  • All 934 tests pass consistently in parallel mode (10 consecutive runs with 0 failures)
  • No use of Date.now() for temp directory naming in test files
  • All afterEach() cleanup operations properly awaited
  • Test isolation documented in tests/README.md
  • CI pipeline runs tests in parallel mode and passes reliably

Additional Context

Why Date.now() is insufficient:

  • Millisecond precision insufficient for concurrent test execution
  • Jest's default concurrency (based on CPU cores) can easily execute multiple tests within 1ms
  • Modern CPUs can execute hundreds of thousands of operations per millisecond

Recommended alternatives:

// ✅ Recommended: UUID v4 (cryptographically random)
import { randomUUID } from 'crypto';
tempDir = path.join(os.tmpdir(), `copytree-test-${randomUUID()}`);

// ✅ Alternative: Combine multiple sources of uniqueness
tempDir = path.join(os.tmpdir(), `copytree-test-${process.pid}-${Date.now()}-${Math.random()}`);

// ✅ Best: Use existing helper if available
import { tmpPath } from '../helpers/tempfs.js';
tempDir = tmpPath(`test-${randomUUID()}`);

Related testing infrastructure:

Workaround for immediate use:

# Force serial execution to avoid flakiness
npm test -- --maxWorkers=1

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workingpriority-1 ⏱️High impact or deadline-driven. Workaround exists but risk is high; prioritize next.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions