fix: improve otel span map cleanup performance#1200
fix: improve otel span map cleanup performance#1200
Conversation
Previously the span map was holding all span information in a tree structure. When a span finished, MarkFinished acquired a global write lock and recursively traversed the tree to clean up. The fix usees the already existin spanRecorder to keep track of the spans in the transaction. Instead of linking each child, we keep track of the root transaction and override the ParentSpanID so that the spanRecorder reconstructs the span hierarchy.
Semver Impact of This PR🟢 Patch (bug fixes) 📋 Changelog PreviewThis is how your changes will appear in the changelog. Breaking Changes 🛠
New Features ✨
Bug Fixes 🐛
Internal Changes 🔧Deps
Other
🤖 This preview updates automatically when you update the PR. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
d018672 to
d4f8f67
Compare
| if entry.activeCount.Add(-1) <= 0 { | ||
| // CompareAndSwap(CAS) is used to prevent a race between Set and MarkFinished. | ||
| // The race has two windows: | ||
| // 1. MarkFinished decremented activeCount to 0 but hasn't CAS'd yet -> Set Adds(1), and CAS fails keeping the | ||
| // transaction, since we just added a new span. | ||
| // 2. MarkFinished already CAS'd -> Set will store on the transaction marked for deletion (better than | ||
| // creating a new orphaned span). | ||
| if entry.activeCount.CompareAndSwap(0, -1) { | ||
| ssm.transactions.CompareAndDelete(traceID, entry) | ||
| } |
There was a problem hiding this comment.
@sl0thentr0py @Litarnus would appreciate if you can verify my logic here. Added the fix due to this comment and I think it makes sense.
There was a problem hiding this comment.
It looks good to me, your comments describe it accurately imo
| func (ssm *SentrySpanMap) Set(spanID otelTrace.SpanID, span *sentry.Span, traceID otelTrace.TraceID) { | ||
| t := &TransactionEntry{root: span} | ||
| t.activeCount.Store(1) | ||
| t.spans.Store(spanID, span) | ||
| t.knownSpanIDs.Store(spanID, struct{}{}) | ||
|
|
||
| if existing, loaded := ssm.transactions.LoadOrStore(traceID, t); loaded { | ||
| existing.activeCount.Add(1) | ||
| existing.spans.Store(spanID, span) | ||
| existing.knownSpanIDs.Store(spanID, struct{}{}) | ||
| } |
There was a problem hiding this comment.
Bug: A race condition between MarkFinished and Set can cause a span to be added to a transaction entry just as it's being deleted, making the span undiscoverable.
Severity: HIGH
Suggested Fix
The logic for adding a span in Set needs to handle the case where a transaction entry is being deleted. One approach is to re-check the activeCount after loading an existing entry. If the count is negative (indicating deletion is in progress), the Set operation should retry by attempting to create and store a new transaction entry instead of modifying the one marked for deletion.
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.
Location: otel/span_map.go#L53-L63
Potential issue: A race condition exists between the `MarkFinished` and `Set` functions.
A thread in `MarkFinished` can prepare a transaction entry for deletion after its last
span finishes. Concurrently, a second thread in `Set` can add a new span to this same
transaction entry before it's actually removed from the map. The first thread's
`CompareAndDelete` operation will still succeed, removing the entry from the
`ssm.transactions` map. As a result, the newly added span becomes inaccessible to
subsequent `Get()` calls, leading to lost spans and broken trace continuity.
| } | ||
|
|
||
| // Len returns the number of spans on the map. | ||
| func (ssm *SentrySpanMap) Len() int { |
There was a problem hiding this comment.
Maybe we should add a comment that this is very expensive and used for testing to discourage anyone from using it by accident
| if entry.activeCount.Add(-1) <= 0 { | ||
| // CompareAndSwap(CAS) is used to prevent a race between Set and MarkFinished. | ||
| // The race has two windows: | ||
| // 1. MarkFinished decremented activeCount to 0 but hasn't CAS'd yet -> Set Adds(1), and CAS fails keeping the | ||
| // transaction, since we just added a new span. | ||
| // 2. MarkFinished already CAS'd -> Set will store on the transaction marked for deletion (better than | ||
| // creating a new orphaned span). | ||
| if entry.activeCount.CompareAndSwap(0, -1) { | ||
| ssm.transactions.CompareAndDelete(traceID, entry) | ||
| } |
There was a problem hiding this comment.
It looks good to me, your comments describe it accurately imo
Description
Previously the span map was holding all span information in a tree structure. When a span finished, MarkFinished acquired a global write lock and recursively traversed the tree to clean up. The fix usees the already existin spanRecorder to keep track of the spans in the transaction. Instead of linking each child, we keep track of the root transaction and override the ParentSpanID so that the spanRecorder reconstructs the span hierarchy.
Issues
Changelog Entry Instructions
To add a custom changelog entry, uncomment the section above. Supports:
For more details: custom changelog entries
Reminders
feat:,fix:,ref:,meta:)