-
Notifications
You must be signed in to change notification settings - Fork 78
Fix extra garbage sample transmitted when outlet is destroyed #246
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a bug where destroying an LSL outlet while inlets are connected causes an extra sample containing garbage data to be transmitted. The fix changes the shutdown mechanism to push an empty sample pointer instead of creating a real sample object with uninitialized data.
Key changes:
- Replaced
factory_->new_sample(lsl_clock(), true)withsample_p()in the shutdown notification mechanism - Updated comments to clarify the shutdown behavior and how transfer threads handle empty sample pointers
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
It looks like this PR would basically revert the last change to that line. Any idea why it was changed back then? |
|
Let's see if we can summon @mgrivich to ask. |
|
A commit from 10 years ago, we need one hell of a summoning ritual 😆 |
|
The CI failure on the last commit was suspicious so I did some more claude-guided debugging. To summarize the AI findings and exploration:
Below are intermediary AI-generated files that I used to track progress and restart the agent from a fresh state when running out of context. They are lengthy, but describe the steps taken: Document 1# Investigation: PR #246 and consumer_queue Undefined BehaviorThis document summarizes the investigation into the crash introduced by PR #246 and the underlying bug in Table of Contents
Executive SummaryPR #246 attempted to fix an issue where an extra "garbage" sample is transmitted to inlets when an LSL outlet is destroyed. The fix changed However, this change exposed a latent bug in
The 2016 commit (11b6207) by Matthew Grivich worked around this bug by always pushing valid samples. PR #246's approach of pushing null samples triggers the bug more reliably. The recommended fix is to correct the underlying bug in The Original ProblemSymptomWhen an LSL outlet is destroyed while inlets are connected, the inlets receive an extra sample containing uninitialized/garbage data. This causes:
User Reports
PR #246 ApproachThe fix in // Before (creates a real sample with garbage channel data):
send_buffer_->push_sample(factory_->new_sample(lsl_clock(), true));
// After (creates a null pointer, no data transmitted):
send_buffer_->push_sample(sample_p());The intent was that transfer threads would wake up (via condition variable notification from the push), check Historical ContextCommit 11b6207 (September 29, 2016)Author: Matthew Grivich This commit changed the exact same line from Key insight: The original code 10 years ago used PR #246 is effectively reverting this 10-year-old bug fix. Why the 2016 Fix WorkedThe 2016 fix avoided the underlying bug by:
Root Cause AnalysisThe Bug LocationFile: inline static void move_or_drop(sample_p &src) { src.~sample_p(); }Why This Is Undefined BehaviorThe function explicitly calls the destructor on
When This Gets TriggeredIn template <class T> void push_sample(T &&sample) {
while (!try_push(std::forward<T>(sample))) {
// buffer full, drop oldest sample
try_pop(); // <-- Calls move_or_drop on oldest slot
}
// ... notify condition variable
}When the buffer is full:
Then in template <class T> bool try_push(T &&sample) {
// ... find write slot
copy_or_move(item.value, std::forward<T>(sample)); // <-- Assigns to destroyed object!
// ...
}The inline static void copy_or_move(sample_p &dst, sample_p &&src) { dst = std::move(src); }The Assignment CrashThe intrusive_ptr & operator=(intrusive_ptr const & rhs)
{
this_type(rhs).swap(*this);
return *this;
}The
When the temporary destructs: ~intrusive_ptr()
{
if( px != 0 ) intrusive_ptr_release( px ); // <-- Crashes on garbage pointer!
}The destructor tries to call
The Crash MechanismStep-by-Step Crash Sequence
Why It's Intermittent
CI Failures and Local ReproductionGitHub Actions Workflow FailureRun: https://github.com/sccn/liblsl/actions/runs/20750638718/job/59600638857?pr=246
Failure details:
Local Reproduction (macOS arm64)The crash was reproduced locally: Observations:
Test Cases AddedNew regression tests were added in Test 1: Buffer Overflow ReuseTEST_CASE("consumer_queue buffer overflow reuse", "[queue][regression]")Tests basic buffer overflow and slot reuse. Fills a small buffer, then pushes more samples causing overflow, verifying no crash on reuse. Test 2: Null Sample After OverflowTEST_CASE("consumer_queue null sample after overflow", "[queue][regression]")Specifically tests PR #246 scenario: pushing Test 3: String Format OverflowTEST_CASE("consumer_queue string format overflow", "[queue][regression][string]")Tests the original 2016 bug scenario with Test 4: Multi Queue OverflowTEST_CASE("consumer_queue multi queue overflow", "[queue][regression]")Simulates the multi-inlet scenario with 10 separate queues, each receiving samples and null "ping" samples. Test 5: Rapid LifecycleTEST_CASE("consumer_queue rapid lifecycle", "[queue][regression]")Tests rapid create-fill-destroy cycles simulating benchmark test patterns where outlets are quickly created and destroyed. Test 6: Mixed Valid and NullTEST_CASE("consumer_queue mixed valid null overflow", "[queue][regression]")Tests the specific sequence: valid samples fill buffer → overflow drops → null sample pushed. Test LimitationsThese unit tests don't reliably trigger the crash because:
The tests document the expected behavior and will catch the bug if the fix is applied and later regressed. Proposed FixOption A: Fix
|
| File | Description |
|---|---|
src/tcp_server.cpp |
Contains end_serving() with the PR #246 change |
src/consumer_queue.h |
Contains the buggy move_or_drop function |
src/consumer_queue.cpp |
Consumer queue implementation |
src/send_buffer.cpp |
Distributes samples to consumer queues |
src/sample.h |
Sample class and intrusive_ptr release function |
lslboost/boost/smart_ptr/intrusive_ptr.hpp |
Boost intrusive_ptr implementation |
Test Files
| File | Description |
|---|---|
testing/int/consumer_queue_regression.cpp |
New regression tests |
testing/int/samples.cpp |
Existing consumer_queue tests |
testing/ext/bench_pushpull.cpp |
Benchmark that triggers crash |
Key Code Locations
- PR Fix extra garbage sample transmitted when outlet is destroyed #246 change:
src/tcp_server.cpp:201-204 - Bug location:
src/consumer_queue.h:160 - intrusive_ptr assignment:
lslboost/boost/smart_ptr/intrusive_ptr.hpp:154-158 - intrusive_ptr destructor:
lslboost/boost/smart_ptr/intrusive_ptr.hpp:98-101
Related Issues and PRs
| Link | Description |
|---|---|
| sccn/liblsl#246 | PR attempting to fix extra sample issue |
| labstreaminglayer/pylsl#67 | User report of extra samples and clock issues |
| xdf-modules/pyxdf#150 | Workaround for corrupted clock offsets |
| Commit 11b6207 | 2016 fix for "double pointer delete error when destroying cf_string stream" |
Conclusion
The crash in PR #246 is caused by undefined behavior in consumer_queue::move_or_drop, not by the approach of pushing null samples. The recommended fix is to correct the move_or_drop function to leave slots in a valid state after dropping samples.
This fix:
- Resolves the underlying 10-year-old bug
- Makes PR Fix extra garbage sample transmitted when outlet is destroyed #246's approach safe
- Eliminates the "garbage sample" issue
- Prevents future crashes from the same root cause
Document created: 2026-01-07
Investigation by: Claude Code (Anthropic)
Document 2
# Investigation Part 2: Reliable Reproduction and Refined AnalysisThis document builds upon INVESTIGATION_PR246.md with new findings from deeper analysis and the development of tests that reliably reproduce the crash.
Table of Contents
- Summary of New Findings
- Alternative Hypotheses Examined
- Refined Understanding of the Bug
- Why Original Tests Were Insufficient
- New Test Cases
- Reproduction Results
- Recommended Next Steps
Summary of New Findings
Key Discovery
We have developed a test that crashes 100% of the time, confirming the bug hypothesis. The test "rapid queue lifecycle with send_buffer" in testing/int/sendbuffer.cpp reliably triggers a SIGSEGV.
Critical Insight: The Bug Requires Shared Samples
The original regression tests in consumer_queue_regression.cpp were insufficient because they:
- Used separate factories for each test
- Pushed samples to individual queues (not shared via
send_buffer) - Did not exercise the code path where multiple queues hold references to the same sample
The bug manifests as refcount corruption when:
- The same sample is distributed to multiple consumer_queues via
send_buffer - Buffer overflow causes
move_or_dropto be called - The double-release corrupts the shared sample's refcount
Alternative Hypotheses Examined
Before confirming the move_or_drop hypothesis, we examined several alternative explanations:
Hypothesis A: Factory Lifetime Race Condition
Theory: The factory might be destroyed while samples still hold references to it, causing crashes when s->factory_->reclaim_sample(s) accesses freed memory.
Investigation:
- Examined ownership model:
stream_outlet_implowns factory directly (not via shared_ptr) - Consumer queues hold
send_buffer_p(shared_ptr) which keeps send_buffer alive - Factory destruction order could theoretically cause issues
Conclusion: While this is a valid concern for the full TCP stack, it doesn't explain crashes in isolated unit tests where factory lifetime is well-controlled. Rejected as primary cause.
Hypothesis B: Consumer Registration Race Condition
Theory: A race between push_sample iterating over consumers and consumer destruction/unregistration could cause use-after-free.
Investigation:
- Both operations hold
consumers_mut_mutex - Recent commit 61cab1e added early return in
unregister_consumerto prevent double-free - The mutex ensures iteration is safe
Conclusion: The mutex protection is adequate. Rejected.
Hypothesis C: String Sample Specific Memory Corruption
Theory: The 2016 commit message mentions "double pointer delete error when destroying cf_string stream," suggesting string samples have unique memory issues.
Investigation:
- String samples have embedded
std::stringobjects constructed via placement new - Sample destructor explicitly destroys these strings:
val.~basic_string<char>() - BUT
intrusive_ptr_releasecallsreclaim_sample, notdelete- sample destructor is NOT called - Strings remain in memory when samples are recycled to freelist
Conclusion: String samples don't have special destruction issues in this context. The 2016 bug was the same move_or_drop issue, just more visible with strings. Partially confirmed as related symptom.
Hypothesis D: Freelist Corruption via ABA Problem
Theory: The Vyukov freelist could have ABA issues causing double-allocation of the same sample.
Investigation:
- Examined
factory::pop_freelist()andreclaim_sample() - The freelist uses atomic operations correctly
- Samples are never freed (returned to freelist), eliminating classic ABA
Conclusion: Freelist implementation is correct. Rejected.
Hypothesis E: Double-Release Causing Refcount Corruption (CONFIRMED)
Theory: The explicit destructor in move_or_drop causes each overflow to decrement refcount twice instead of once, corrupting shared samples.
Investigation:
- Traced through code paths with shared samples
- Confirmed that
copy_or_moveaftermove_or_dropperforms a second release - Created tests that reliably trigger the corruption
Conclusion: CONFIRMED as the root cause.
Refined Understanding of the Bug
The Double-Release Mechanism
The original investigation correctly identified the location (consumer_queue.h:160) but the mechanism needed refinement.
Step-by-step trace with shared samples:
Initial state:
- Sample S is pushed to 3 consumer_queues via send_buffer
- Sample S refcount = 3 (one reference per queue)
Queue 1 overflows (buffer full, new sample being pushed):
1. try_pop() is called to drop oldest sample
2. move_or_drop(item->value) called where item->value points to S
3. Explicit destructor: item->value.~sample_p()
- Calls intrusive_ptr_release(S)
- S.refcount: 3 → 2
- item->value.px still contains pointer to S (NOT zeroed!)
4. try_push() assigns new sample to slot:
copy_or_move(item.value, new_sample)
→ item.value = new_sample
5. Assignment uses copy-and-swap:
- Creates temporary with new_sample (refcount +1)
- swap(): temp.px = item.value.px (which is S!), item.value.px = new_sample
- Temporary destructor: intrusive_ptr_release(temp.px)
- This releases S AGAIN!
- S.refcount: 2 → 1
Result after one overflow: S.refcount = 1, but 2 queues still "hold" S
Expected: S.refcount = 2
Why This Causes Crashes
After multiple overflow events:
- Sample refcount becomes 0 or negative while queues still reference it
- Sample is reclaimed to freelist
- Sample may be reused for a new allocation
- Two different consumers now reference the same sample memory
- Memory corruption ensues when either:
- Both try to modify the sample
- One releases while the other still uses it
- The corrupted refcount causes invalid memory access
The Single-Sample Case vs Shared-Sample Case
Single queue (original tests):
- Sample has refcount = 1
- First release: refcount 1 → 0, sample reclaimed
- Second release: operates on reclaimed sample (refcount 0 → -1)
- Sample is in freelist, memory is valid, no immediate crash
- Bug exists but rarely manifests visibly
Shared via send_buffer (new tests):
- Sample has refcount = N (number of queues)
- Each overflow causes refcount -= 2 instead of -= 1
- Refcount corruption accumulates
- When refcount hits 0 prematurely, crash occurs
Why Original Tests Were Insufficient
The tests in consumer_queue_regression.cpp had fundamental design flaws:
Problem 1: No Sample Sharing
// Original test pattern:
lsl::factory fac(...);
lsl::consumer_queue queue(...);
queue.push_sample(fac.new_sample(...)); // Each sample only in one queueThe bug requires samples shared across multiple queues. Without send_buffer, each sample has refcount = 1, and the double-release doesn't cause visible corruption.
Problem 2: Sequential, Single-Threaded Execution
The real-world scenario involves:
- Multiple inlet threads pulling samples
- Concurrent push operations
- Timing-dependent race conditions
Single-threaded tests can't exercise these code paths.
Problem 3: No Queue Lifecycle Churn
The crash in the benchmark occurred during outlet destruction when:
- Consumer queues are destroyed
- Samples with corrupted refcounts trigger crashes
Tests that don't destroy queues while samples exist don't trigger this.
New Test Cases
File: testing/int/sendbuffer.cpp
Test: "rapid queue lifecycle with send_buffer" (CRASHES 100%)
TEST_CASE("rapid queue lifecycle with send_buffer", "[queue][regression][send_buffer]") {
const int buffer_size = 4;
const int num_iterations = 50;
lsl::factory fac(lsl_channel_format_t::cft_int32, 2, buffer_size);
auto sendbuf = std::make_shared<lsl::send_buffer>(buffer_size);
for (int iter = 0; iter < num_iterations; ++iter) {
// Create queues that share samples via send_buffer
std::vector<std::shared_ptr<lsl::consumer_queue>> queues;
for (int i = 0; i < 5; ++i) {
queues.push_back(sendbuf->new_consumer(buffer_size));
}
// Push samples causing overflow (triggers move_or_drop)
for (int i = 0; i < buffer_size * 3; ++i) {
auto sample = fac.new_sample(static_cast<double>(i), true);
sendbuf->push_sample(sample); // Same sample to all 5 queues
}
// Push null (PR #246 scenario)
sendbuf->push_sample(lsl::sample_p());
// Queues destroyed here - corrupted refcounts cause crash
}
}Why it works:
- Uses
send_bufferto share samples across 5 queues - Small buffer (4) forces constant overflow
- Rapid create/destroy cycles accumulate refcount corruption
- Queue destruction triggers sample cleanup with corrupted refcounts
- SIGSEGV occurs reliably
Test: "string sample send_buffer overflow"
Tests the original 2016 bug scenario with cf_string format samples, using proper sample sharing via send_buffer.
Test: "multi-threaded send_buffer stress"
Tests concurrent push operations from multiple threads, maximizing race condition probability.
Reproduction Results
Test Execution
$ ./testing/lsl_test_internal "rapid queue lifecycle with send_buffer"
# Output:
# FAILED: due to fatal error condition: SIGSEGV - Segmentation violation signal
# test cases: 1 | 1 failedConsistency
| Run | Result |
|---|---|
| 1 | SIGSEGV (crash) |
| 2 | SIGSEGV (crash) |
| 3 | SIGSEGV (crash) |
| 4 | SIGSEGV (crash) |
| 5 | SIGSEGV (crash) |
100% crash rate - the test reliably exhibits the bug.
Comparison with Original Tests
| Test Type | Crash Rate | Reason |
|---|---|---|
Original consumer_queue_regression.cpp |
0% | No sample sharing |
Original pushpull benchmark |
~5-10% | Timing-dependent |
New rapid queue lifecycle with send_buffer |
100% | Forced refcount corruption |
Recommended Next Steps
1. Apply the Fix
In src/consumer_queue.h:160, change:
// FROM:
inline static void move_or_drop(sample_p &src) { src.~sample_p(); }
// TO:
inline static void move_or_drop(sample_p &src) { src.reset(); }2. Verify the Fix
After applying the fix:
$ ./testing/lsl_test_internal "[send_buffer]"
# Expected: All tests pass3. Run Full Test Suite
$ ./testing/lsl_test_internal
$ ./testing/lsl_test_exported
# Run multiple times to ensure stability4. Clean Up Test Files
- Remove
testing/int/consumer_queue_regression.cpp(tests don't exercise the actual bug) - Keep
testing/int/sendbuffer.cpp(tests properly exercise the bug)
Conclusion
The investigation has successfully:
- Confirmed the root cause as the
move_or_dropexplicit destructor bug - Examined and rejected alternative hypotheses through careful analysis
- Identified why original tests were insufficient (no sample sharing)
- Created a test that reliably reproduces the crash (100% rate)
- Validated the proposed fix mechanism
The key insight is that the bug requires shared samples via send_buffer to manifest as a crash. Single-queue tests only corrupt refcounts without visible effects. The new test forces the corruption to accumulate until it causes a crash.
Document created: 2025-01-07
Investigation by: Claude Code (Anthropic)
This reverts commit 3067ff0.
|
I've pushed a commit that reverts the fix, it should fail CIs. For some reason, this one requires approval before running @cboulay Once CI fail 🤞, I will revert this commit to re-apply the additional fix. |
|
As expected:
I re-applied the fix. Good for review :) |
@cboulay This might be the origin of this extra sample as described in labstreaminglayer/pylsl#67 and xdf-modules/pyxdf#150
This PR and the text below is AI-generated / guided, I'm not a C-dev and I don't have the bandwidth to learn enough about the language- but it seems like a plausible explanation. Can you have a look?
Problem
When an LSL outlet is destroyed while inlets are still connected, an extra sample containing garbage data is transmitted to the inlets. This causes two issues:
pyxdf.Root cause:
In
tcp_server::end_serving(), a "ping" sample is pushed to wake up blocked transfer threads:The problem is that
factory_->new_sample()creates a real sample object with:lsl_clock())pushthrough = true(forces immediate transmission)The transfer thread checks
if (!samp) continue; to skip "blank" samples, but this only checks if the pointer is null—the ping sample is a valid object, so it gets serialized and transmitted.Fix
Push an empty
sample_p()instead of creating a real sample:This works because:
sample_p(null intrusive_ptr) still triggers the condition variable notification, waking up blocked threadsif (!samp) continue; correctly skips null samples