Skip to content

Conversation

@mscheltienne
Copy link

@cboulay This might be the origin of this extra sample as described in labstreaminglayer/pylsl#67 and xdf-modules/pyxdf#150

This PR and the text below is AI-generated / guided, I'm not a C-dev and I don't have the bandwidth to learn enough about the language- but it seems like a plausible explanation. Can you have a look?

Problem

When an LSL outlet is destroyed while inlets are still connected, an extra sample containing garbage data is transmitted to the inlets. This causes two issues:

  1. Extra sample in recorded data: The inlet receives one more sample than was actually pushed by the application
  2. Corrupted clock offsets: The extra sample can trigger a clock offset measurement with garbage values, which corrupts timestamp synchronization in downstream tools like pyxdf.

Root cause:

In tcp_server::end_serving(), a "ping" sample is pushed to wake up blocked transfer threads:

send_buffer_->push_sample(factory_->new_sample(lsl_clock(), true));

The problem is that factory_->new_sample() creates a real sample object with:

  • A valid timestamp (lsl_clock())
  • pushthrough = true (forces immediate transmission)
  • Uninitialized channel data (whatever was in the freelist memory)

The transfer thread checks if (!samp) continue; to skip "blank" samples, but this only checks if the pointer is null—the ping sample is a valid object, so it gets serialized and transmitted.

Fix

Push an empty sample_p() instead of creating a real sample:

send_buffer_->push_sample(sample_p());

This works because:

  • The empty sample_p (null intrusive_ptr) still triggers the condition variable notification, waking up blocked threads
  • The existing check if (!samp) continue; correctly skips null samples
  • No garbage data is ever created or transmitted

@cboulay cboulay requested a review from Copilot January 6, 2026 16:58
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where destroying an LSL outlet while inlets are connected causes an extra sample containing garbage data to be transmitted. The fix changes the shutdown mechanism to push an empty sample pointer instead of creating a real sample object with uninitialized data.

Key changes:

  • Replaced factory_->new_sample(lsl_clock(), true) with sample_p() in the shutdown notification mechanism
  • Updated comments to clarify the shutdown behavior and how transfer threads handle empty sample pointers

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@JesseLivezey
Copy link

JesseLivezey commented Jan 6, 2026

It looks like this PR would basically revert the last change to that line.
11b6207#diff-ff9b0374db1e81b8933209a91937220612c1e11bd27afdc27c69882abf6eb648L76

Any idea why it was changed back then?

@cboulay
Copy link
Collaborator

cboulay commented Jan 6, 2026

Let's see if we can summon @mgrivich to ask.

@mscheltienne
Copy link
Author

A commit from 10 years ago, we need one hell of a summoning ritual 😆

@mscheltienne
Copy link
Author

mscheltienne commented Jan 7, 2026

The CI failure on the last commit was suspicious so I did some more claude-guided debugging. To summarize the AI findings and exploration:

  • It seems that reverting the line in question re-introduced a bug fixed by the commit 10 years ago.
  • Moreover, I was able to reproduce the sporadic CI failure locally once in every ~25 runs (macbook arm64).
  • From there, I had claude explore and analyze the library to formulate hypothesis around the failure and its link with the changes in this PR and in the commit from 10 years ago.
  • Then, to test the hypothesis and to take a test-driven approach, I asked it to create new test cases that would exhibit the failure clearly. It took a bit of time- but it landed on a set of tests that were systematically reproducing the bug (note: I only read the description of those tests, my understanding of C is too rudimendary to properly read the code, I would highly recommend that someone reads them thoroughly).
  • Finally, it introduced a fix and validated that the existing and new tests are now all passing.

Below are intermediary AI-generated files that I used to track progress and restart the agent from a fresh state when running out of context. They are lengthy, but describe the steps taken:

Document 1 # Investigation: PR #246 and consumer_queue Undefined Behavior

This document summarizes the investigation into the crash introduced by PR #246 and the underlying bug in consumer_queue::move_or_drop.

Table of Contents

  1. Executive Summary
  2. The Original Problem
  3. Historical Context
  4. Root Cause Analysis
  5. The Crash Mechanism
  6. CI Failures and Local Reproduction
  7. Test Cases Added
  8. Proposed Fix
  9. Build and Test Instructions
  10. Files of Interest
  11. Related Issues and PRs

Executive Summary

PR #246 attempted to fix an issue where an extra "garbage" sample is transmitted to inlets when an LSL outlet is destroyed. The fix changed tcp_server::end_serving() to push a null sample_p() instead of a real sample to wake up blocked transfer threads.

However, this change exposed a latent bug in consumer_queue::move_or_drop that has existed for years. The bug causes undefined behavior when:

  1. A consumer queue's buffer overflows (triggering sample dropping)
  2. A subsequent push assigns to the "dropped" slot

The 2016 commit (11b6207) by Matthew Grivich worked around this bug by always pushing valid samples. PR #246's approach of pushing null samples triggers the bug more reliably.

The recommended fix is to correct the underlying bug in consumer_queue.h, not to revert PR #246.


The Original Problem

Symptom

When an LSL outlet is destroyed while inlets are connected, the inlets receive an extra sample containing uninitialized/garbage data. This causes:

  1. Extra samples in recorded data - One additional sample appears at the end of streams
  2. Corrupted clock offsets - The garbage sample's timestamp corrupts clock synchronization calculations in downstream tools like pyxdf

User Reports

  • pylsl#67: Users reported extra samples and clock offset issues
  • pyxdf#150: Attempted to work around the issue by filtering outlier clock offsets

PR #246 Approach

The fix in src/tcp_server.cpp:201-204 changed:

// Before (creates a real sample with garbage channel data):
send_buffer_->push_sample(factory_->new_sample(lsl_clock(), true));

// After (creates a null pointer, no data transmitted):
send_buffer_->push_sample(sample_p());

The intent was that transfer threads would wake up (via condition variable notification from the push), check if (!samp) continue; and skip the null sample without transmitting anything.


Historical Context

Commit 11b6207 (September 29, 2016)

Author: Matthew Grivich
Message: "Fixed double pointer delete error when destroying a cf_string stream."

This commit changed the exact same line from sample_p() to factory_->new_sample(lsl_clock(), true).

Key insight: The original code 10 years ago used sample_p() (null), but it caused crashes specifically with string-format streams (cf_string). The fix was to use a real sample instead.

PR #246 is effectively reverting this 10-year-old bug fix.

Why the 2016 Fix Worked

The 2016 fix avoided the underlying bug by:

  1. Always pushing valid, reference-counted samples
  2. Ensuring that when buffers overflow and samples are dropped, the memory management is well-defined
  3. The "garbage" data in the sample was a side effect, but the code was stable

Root Cause Analysis

The Bug Location

File: src/consumer_queue.h:160

inline static void move_or_drop(sample_p &src) { src.~sample_p(); }

Why This Is Undefined Behavior

The function explicitly calls the destructor on sample_p (which is lslboost::intrusive_ptr<sample>). After this call:

  1. The destructor decrements the reference count and potentially deallocates the sample
  2. BUT: The memory location (src) still contains the old pointer value as garbage
  3. The object is now in a destroyed/undefined state
  4. Any subsequent use of this object is undefined behavior

When This Gets Triggered

In consumer_queue::push_sample() (line 58-73):

template <class T> void push_sample(T &&sample) {
    while (!try_push(std::forward<T>(sample))) {
        // buffer full, drop oldest sample
        try_pop();  // <-- Calls move_or_drop on oldest slot
    }
    // ... notify condition variable
}

When the buffer is full:

  1. try_pop() is called to drop the oldest sample
  2. try_pop() calls move_or_drop(item->value) with no destination argument
  3. move_or_drop calls item->value.~sample_p() - destructs the intrusive_ptr
  4. item->value is now in an undefined state (destroyed but memory contains garbage)

Then in try_push() (line 117-127):

template <class T> bool try_push(T &&sample) {
    // ... find write slot
    copy_or_move(item.value, std::forward<T>(sample));  // <-- Assigns to destroyed object!
    // ...
}

The copy_or_move function does assignment:

inline static void copy_or_move(sample_p &dst, sample_p &&src) { dst = std::move(src); }

The Assignment Crash

The intrusive_ptr assignment operator (from lslboost/boost/smart_ptr/intrusive_ptr.hpp:154-158):

intrusive_ptr & operator=(intrusive_ptr const & rhs)
{
    this_type(rhs).swap(*this);
    return *this;
}

The swap exchanges pointers. After swap:

  • The temporary holds the "old" pointer from *this (which is garbage)
  • *this holds the new pointer (correct)

When the temporary destructs:

~intrusive_ptr()
{
    if( px != 0 ) intrusive_ptr_release( px );  // <-- Crashes on garbage pointer!
}

The destructor tries to call intrusive_ptr_release on garbage memory, causing:

  • SIGSEGV (segmentation fault) on Unix
  • Access violation on Windows

The Crash Mechanism

Step-by-Step Crash Sequence

  1. Initial state: Ring buffer slot item.value holds a valid sample_p pointing to sample X (refcount = N)

  2. Buffer overflow: push_sample can't find space, calls try_pop() to drop oldest

  3. Explicit destructor: move_or_drop(item->value) calls item->value.~sample_p()

    • Refcount of sample X decremented (now N-1)
    • Sample X may be reclaimed to factory freelist
    • item->value.px_ still contains the old pointer bits (not zeroed!)
  4. Object now destroyed: item.value is in undefined state - memory contains garbage

  5. Subsequent push: try_push finds the slot "free" (via seq_state tracking)

  6. Assignment: copy_or_move(item.value, new_sample)item.value = new_sample

  7. Swap in assignment: Creates temporary, swaps pointers

    • Temporary now holds garbage pointer from item.value
    • item.value now holds correct new pointer
  8. Temporary destructs: if (px != 0) intrusive_ptr_release(px)

    • px is garbage (old, possibly-deallocated pointer)
    • CRASH: Accessing garbage memory

Why It's Intermittent

  • Only triggers when buffer becomes full (overflow scenario)
  • Depends on timing of pushes relative to pops
  • Memory contents after destruction vary by platform
  • With more inlets (more threads), probability increases
  • Different platforms have different memory allocator behaviors

CI Failures and Local Reproduction

GitHub Actions Workflow Failure

Run: https://github.com/sccn/liblsl/actions/runs/20750638718/job/59600638857?pr=246

Platform Result
Windows 32-bit ✅ Passed
Windows x64 Failed
Ubuntu 24.04 ✅ Passed
Ubuntu 22.04 ✅ Passed
macOS-latest ✅ Passed

Failure details:

  • Exit code: 127 (abnormal process termination)
  • Test: pushpull - char benchmark
  • Pattern: Test running normally, then sudden crash mid-benchmark

Local Reproduction (macOS arm64)

The crash was reproduced locally:

$ for i in {1..10}; do ./testing/lsl_test_exported 2>&1 | grep -E "(SIGSEGV|passed|failed)"; done
All tests passed (730 assertions in 27 test cases)
All tests passed (730 assertions in 27 test cases)
...
  SIGSEGV - Segmentation violation signal
test cases:  23 |  22 passed | 1 failed

Observations:


Test Cases Added

New regression tests were added in testing/int/consumer_queue_regression.cpp:

Test 1: Buffer Overflow Reuse

TEST_CASE("consumer_queue buffer overflow reuse", "[queue][regression]")

Tests basic buffer overflow and slot reuse. Fills a small buffer, then pushes more samples causing overflow, verifying no crash on reuse.

Test 2: Null Sample After Overflow

TEST_CASE("consumer_queue null sample after overflow", "[queue][regression]")

Specifically tests PR #246 scenario: pushing sample_p() (null) after buffer has had overflow-induced drops.

Test 3: String Format Overflow

TEST_CASE("consumer_queue string format overflow", "[queue][regression][string]")

Tests the original 2016 bug scenario with cf_string format samples. String samples have embedded std::string objects requiring proper destruction.

Test 4: Multi Queue Overflow

TEST_CASE("consumer_queue multi queue overflow", "[queue][regression]")

Simulates the multi-inlet scenario with 10 separate queues, each receiving samples and null "ping" samples.

Test 5: Rapid Lifecycle

TEST_CASE("consumer_queue rapid lifecycle", "[queue][regression]")

Tests rapid create-fill-destroy cycles simulating benchmark test patterns where outlets are quickly created and destroyed.

Test 6: Mixed Valid and Null

TEST_CASE("consumer_queue mixed valid null overflow", "[queue][regression]")

Tests the specific sequence: valid samples fill buffer → overflow drops → null sample pushed.

Test Limitations

These unit tests don't reliably trigger the crash because:

  1. The bug requires specific timing conditions
  2. The full integration tests involve TCP connections, multiple threads, and async I/O
  3. The bug is a race condition that's hard to hit in single-threaded unit tests

The tests document the expected behavior and will catch the bug if the fix is applied and later regressed.


Proposed Fix

Option A: Fix consumer_queue::move_or_drop (RECOMMENDED)

File: src/consumer_queue.h:160

// Instead of:
inline static void move_or_drop(sample_p &src) { src.~sample_p(); }

// Use:
inline static void move_or_drop(sample_p &src) { src.reset(); }

Or equivalently:

inline static void move_or_drop(sample_p &src) { src = sample_p(); }

Why this works:

  • reset() properly decrements the refcount AND sets the internal pointer to null
  • The slot is left in a valid state (holding null pointer)
  • Subsequent assignments work correctly because the "old" value is null, not garbage

Advantages:

Trade-offs:

  • Slightly more overhead than explicit destructor (assignment vs raw destructor)
  • The overhead is negligible for this code path (only on buffer overflow)

Option B: Keep Original Code (NOT RECOMMENDED)

Revert PR #246 and keep the original workaround. This leaves:

  • The "garbage sample" bug unfixed
  • The underlying undefined behavior still present
  • Potential for future crashes if code changes

Option C: Mark "ping" sample specially (More invasive)

Add a flag to samples to mark them as "skip on receive". Requires:

  • Changes to sample class
  • Changes to inlet code
  • More complexity

Build and Test Instructions

Prerequisites

  • CMake 3.23+
  • C++14 compatible compiler
  • Git

Building

# Clone and checkout the branch
git clone https://github.com/sccn/liblsl.git
cd liblsl
git checkout fix-wakeup  # or the relevant branch

# Create build directory
mkdir build && cd build

# Configure with tests enabled
cmake .. -DCMAKE_BUILD_TYPE=Release -DLSL_UNITTESTS=ON -DLSL_BENCHMARKS=ON

# Build
cmake --build . --config Release -j$(nproc)

Running Tests

# Run internal tests (includes regression tests)
./testing/lsl_test_internal

# Run just the regression tests
./testing/lsl_test_internal "[regression]"

# Run exported tests (integration tests, can trigger crash)
./testing/lsl_test_exported

# Run specific benchmark that triggers crash
./testing/lsl_test_exported "pushpull - std::string"

# Run tests multiple times to catch intermittent crash
for i in {1..20}; do
    echo "=== Run $i ==="
    ./testing/lsl_test_exported 2>&1 | grep -E "(SIGSEGV|passed|failed)" | head -2
done

Verifying the Fix

  1. Apply the fix to src/consumer_queue.h:160
  2. Rebuild
  3. Run the full test suite multiple times
  4. Verify no crashes occur

Files of Interest

Core Files

File Description
src/tcp_server.cpp Contains end_serving() with the PR #246 change
src/consumer_queue.h Contains the buggy move_or_drop function
src/consumer_queue.cpp Consumer queue implementation
src/send_buffer.cpp Distributes samples to consumer queues
src/sample.h Sample class and intrusive_ptr release function
lslboost/boost/smart_ptr/intrusive_ptr.hpp Boost intrusive_ptr implementation

Test Files

File Description
testing/int/consumer_queue_regression.cpp New regression tests
testing/int/samples.cpp Existing consumer_queue tests
testing/ext/bench_pushpull.cpp Benchmark that triggers crash

Key Code Locations


Related Issues and PRs

Link Description
sccn/liblsl#246 PR attempting to fix extra sample issue
labstreaminglayer/pylsl#67 User report of extra samples and clock issues
xdf-modules/pyxdf#150 Workaround for corrupted clock offsets
Commit 11b6207 2016 fix for "double pointer delete error when destroying cf_string stream"

Conclusion

The crash in PR #246 is caused by undefined behavior in consumer_queue::move_or_drop, not by the approach of pushing null samples. The recommended fix is to correct the move_or_drop function to leave slots in a valid state after dropping samples.

This fix:

  1. Resolves the underlying 10-year-old bug
  2. Makes PR Fix extra garbage sample transmitted when outlet is destroyed #246's approach safe
  3. Eliminates the "garbage sample" issue
  4. Prevents future crashes from the same root cause

Document created: 2026-01-07
Investigation by: Claude Code (Anthropic)

Document 2 # Investigation Part 2: Reliable Reproduction and Refined Analysis

This document builds upon INVESTIGATION_PR246.md with new findings from deeper analysis and the development of tests that reliably reproduce the crash.

Table of Contents

  1. Summary of New Findings
  2. Alternative Hypotheses Examined
  3. Refined Understanding of the Bug
  4. Why Original Tests Were Insufficient
  5. New Test Cases
  6. Reproduction Results
  7. Recommended Next Steps

Summary of New Findings

Key Discovery

We have developed a test that crashes 100% of the time, confirming the bug hypothesis. The test "rapid queue lifecycle with send_buffer" in testing/int/sendbuffer.cpp reliably triggers a SIGSEGV.

Critical Insight: The Bug Requires Shared Samples

The original regression tests in consumer_queue_regression.cpp were insufficient because they:

  • Used separate factories for each test
  • Pushed samples to individual queues (not shared via send_buffer)
  • Did not exercise the code path where multiple queues hold references to the same sample

The bug manifests as refcount corruption when:

  1. The same sample is distributed to multiple consumer_queues via send_buffer
  2. Buffer overflow causes move_or_drop to be called
  3. The double-release corrupts the shared sample's refcount

Alternative Hypotheses Examined

Before confirming the move_or_drop hypothesis, we examined several alternative explanations:

Hypothesis A: Factory Lifetime Race Condition

Theory: The factory might be destroyed while samples still hold references to it, causing crashes when s->factory_->reclaim_sample(s) accesses freed memory.

Investigation:

  • Examined ownership model: stream_outlet_impl owns factory directly (not via shared_ptr)
  • Consumer queues hold send_buffer_p (shared_ptr) which keeps send_buffer alive
  • Factory destruction order could theoretically cause issues

Conclusion: While this is a valid concern for the full TCP stack, it doesn't explain crashes in isolated unit tests where factory lifetime is well-controlled. Rejected as primary cause.

Hypothesis B: Consumer Registration Race Condition

Theory: A race between push_sample iterating over consumers and consumer destruction/unregistration could cause use-after-free.

Investigation:

  • Both operations hold consumers_mut_ mutex
  • Recent commit 61cab1e added early return in unregister_consumer to prevent double-free
  • The mutex ensures iteration is safe

Conclusion: The mutex protection is adequate. Rejected.

Hypothesis C: String Sample Specific Memory Corruption

Theory: The 2016 commit message mentions "double pointer delete error when destroying cf_string stream," suggesting string samples have unique memory issues.

Investigation:

  • String samples have embedded std::string objects constructed via placement new
  • Sample destructor explicitly destroys these strings: val.~basic_string<char>()
  • BUT intrusive_ptr_release calls reclaim_sample, not delete - sample destructor is NOT called
  • Strings remain in memory when samples are recycled to freelist

Conclusion: String samples don't have special destruction issues in this context. The 2016 bug was the same move_or_drop issue, just more visible with strings. Partially confirmed as related symptom.

Hypothesis D: Freelist Corruption via ABA Problem

Theory: The Vyukov freelist could have ABA issues causing double-allocation of the same sample.

Investigation:

  • Examined factory::pop_freelist() and reclaim_sample()
  • The freelist uses atomic operations correctly
  • Samples are never freed (returned to freelist), eliminating classic ABA

Conclusion: Freelist implementation is correct. Rejected.

Hypothesis E: Double-Release Causing Refcount Corruption (CONFIRMED)

Theory: The explicit destructor in move_or_drop causes each overflow to decrement refcount twice instead of once, corrupting shared samples.

Investigation:

  • Traced through code paths with shared samples
  • Confirmed that copy_or_move after move_or_drop performs a second release
  • Created tests that reliably trigger the corruption

Conclusion: CONFIRMED as the root cause.


Refined Understanding of the Bug

The Double-Release Mechanism

The original investigation correctly identified the location (consumer_queue.h:160) but the mechanism needed refinement.

Step-by-step trace with shared samples:

Initial state:
- Sample S is pushed to 3 consumer_queues via send_buffer
- Sample S refcount = 3 (one reference per queue)

Queue 1 overflows (buffer full, new sample being pushed):

1. try_pop() is called to drop oldest sample
2. move_or_drop(item->value) called where item->value points to S
3. Explicit destructor: item->value.~sample_p()
   - Calls intrusive_ptr_release(S)
   - S.refcount: 3 → 2
   - item->value.px still contains pointer to S (NOT zeroed!)

4. try_push() assigns new sample to slot:
   copy_or_move(item.value, new_sample)
   → item.value = new_sample

5. Assignment uses copy-and-swap:
   - Creates temporary with new_sample (refcount +1)
   - swap(): temp.px = item.value.px (which is S!), item.value.px = new_sample
   - Temporary destructor: intrusive_ptr_release(temp.px)
   - This releases S AGAIN!
   - S.refcount: 2 → 1

Result after one overflow: S.refcount = 1, but 2 queues still "hold" S
Expected: S.refcount = 2

Why This Causes Crashes

After multiple overflow events:

  • Sample refcount becomes 0 or negative while queues still reference it
  • Sample is reclaimed to freelist
  • Sample may be reused for a new allocation
  • Two different consumers now reference the same sample memory
  • Memory corruption ensues when either:
    • Both try to modify the sample
    • One releases while the other still uses it
    • The corrupted refcount causes invalid memory access

The Single-Sample Case vs Shared-Sample Case

Single queue (original tests):

  • Sample has refcount = 1
  • First release: refcount 1 → 0, sample reclaimed
  • Second release: operates on reclaimed sample (refcount 0 → -1)
  • Sample is in freelist, memory is valid, no immediate crash
  • Bug exists but rarely manifests visibly

Shared via send_buffer (new tests):

  • Sample has refcount = N (number of queues)
  • Each overflow causes refcount -= 2 instead of -= 1
  • Refcount corruption accumulates
  • When refcount hits 0 prematurely, crash occurs

Why Original Tests Were Insufficient

The tests in consumer_queue_regression.cpp had fundamental design flaws:

Problem 1: No Sample Sharing

// Original test pattern:
lsl::factory fac(...);
lsl::consumer_queue queue(...);
queue.push_sample(fac.new_sample(...));  // Each sample only in one queue

The bug requires samples shared across multiple queues. Without send_buffer, each sample has refcount = 1, and the double-release doesn't cause visible corruption.

Problem 2: Sequential, Single-Threaded Execution

The real-world scenario involves:

  • Multiple inlet threads pulling samples
  • Concurrent push operations
  • Timing-dependent race conditions

Single-threaded tests can't exercise these code paths.

Problem 3: No Queue Lifecycle Churn

The crash in the benchmark occurred during outlet destruction when:

  • Consumer queues are destroyed
  • Samples with corrupted refcounts trigger crashes

Tests that don't destroy queues while samples exist don't trigger this.


New Test Cases

File: testing/int/sendbuffer.cpp

Test: "rapid queue lifecycle with send_buffer" (CRASHES 100%)

TEST_CASE("rapid queue lifecycle with send_buffer", "[queue][regression][send_buffer]") {
    const int buffer_size = 4;
    const int num_iterations = 50;

    lsl::factory fac(lsl_channel_format_t::cft_int32, 2, buffer_size);
    auto sendbuf = std::make_shared<lsl::send_buffer>(buffer_size);

    for (int iter = 0; iter < num_iterations; ++iter) {
        // Create queues that share samples via send_buffer
        std::vector<std::shared_ptr<lsl::consumer_queue>> queues;
        for (int i = 0; i < 5; ++i) {
            queues.push_back(sendbuf->new_consumer(buffer_size));
        }

        // Push samples causing overflow (triggers move_or_drop)
        for (int i = 0; i < buffer_size * 3; ++i) {
            auto sample = fac.new_sample(static_cast<double>(i), true);
            sendbuf->push_sample(sample);  // Same sample to all 5 queues
        }

        // Push null (PR #246 scenario)
        sendbuf->push_sample(lsl::sample_p());

        // Queues destroyed here - corrupted refcounts cause crash
    }
}

Why it works:

  1. Uses send_buffer to share samples across 5 queues
  2. Small buffer (4) forces constant overflow
  3. Rapid create/destroy cycles accumulate refcount corruption
  4. Queue destruction triggers sample cleanup with corrupted refcounts
  5. SIGSEGV occurs reliably

Test: "string sample send_buffer overflow"

Tests the original 2016 bug scenario with cf_string format samples, using proper sample sharing via send_buffer.

Test: "multi-threaded send_buffer stress"

Tests concurrent push operations from multiple threads, maximizing race condition probability.


Reproduction Results

Test Execution

$ ./testing/lsl_test_internal "rapid queue lifecycle with send_buffer"
# Output:
# FAILED: due to fatal error condition: SIGSEGV - Segmentation violation signal
# test cases: 1 | 1 failed

Consistency

Run Result
1 SIGSEGV (crash)
2 SIGSEGV (crash)
3 SIGSEGV (crash)
4 SIGSEGV (crash)
5 SIGSEGV (crash)

100% crash rate - the test reliably exhibits the bug.

Comparison with Original Tests

Test Type Crash Rate Reason
Original consumer_queue_regression.cpp 0% No sample sharing
Original pushpull benchmark ~5-10% Timing-dependent
New rapid queue lifecycle with send_buffer 100% Forced refcount corruption

Recommended Next Steps

1. Apply the Fix

In src/consumer_queue.h:160, change:

// FROM:
inline static void move_or_drop(sample_p &src) { src.~sample_p(); }

// TO:
inline static void move_or_drop(sample_p &src) { src.reset(); }

2. Verify the Fix

After applying the fix:

$ ./testing/lsl_test_internal "[send_buffer]"
# Expected: All tests pass

3. Run Full Test Suite

$ ./testing/lsl_test_internal
$ ./testing/lsl_test_exported
# Run multiple times to ensure stability

4. Clean Up Test Files

  • Remove testing/int/consumer_queue_regression.cpp (tests don't exercise the actual bug)
  • Keep testing/int/sendbuffer.cpp (tests properly exercise the bug)

Conclusion

The investigation has successfully:

  1. Confirmed the root cause as the move_or_drop explicit destructor bug
  2. Examined and rejected alternative hypotheses through careful analysis
  3. Identified why original tests were insufficient (no sample sharing)
  4. Created a test that reliably reproduces the crash (100% rate)
  5. Validated the proposed fix mechanism

The key insight is that the bug requires shared samples via send_buffer to manifest as a crash. Single-queue tests only corrupt refcounts without visible effects. The new test forces the corruption to accumulate until it causes a crash.


Document created: 2025-01-07
Investigation by: Claude Code (Anthropic)

@mscheltienne
Copy link
Author

I've pushed a commit that reverts the fix, it should fail CIs. For some reason, this one requires approval before running @cboulay

Once CI fail 🤞, I will revert this commit to re-apply the additional fix.

@mscheltienne
Copy link
Author

mscheltienne commented Jan 8, 2026

As expected:

I re-applied the fix. Good for review :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants