Skip to content

Comments

fix: improve otel span map cleanup performance#1200

Open
giortzisg wants to merge 8 commits intomasterfrom
fix/otel-span-map
Open

fix: improve otel span map cleanup performance#1200
giortzisg wants to merge 8 commits intomasterfrom
fix/otel-span-map

Conversation

@giortzisg
Copy link
Contributor

@giortzisg giortzisg commented Feb 16, 2026

Description

Previously the span map was holding all span information in a tree structure. When a span finished, MarkFinished acquired a global write lock and recursively traversed the tree to clean up. The fix usees the already existin spanRecorder to keep track of the spans in the transaction. Instead of linking each child, we keep track of the root transaction and override the ParentSpanID so that the spanRecorder reconstructs the span hierarchy.

Issues

Changelog Entry Instructions

To add a custom changelog entry, uncomment the section above. Supports:

  • Single entry: just write text
  • Multiple entries: use bullet points
  • Nested bullets: indent 4+ spaces

For more details: custom changelog entries

Reminders

Previously the span map was holding all span information in a tree structure. When a span finished, MarkFinished acquired a global write lock and recursively traversed the tree to clean up. The fix usees the already existin spanRecorder to keep track of the spans in the transaction. Instead of linking each child, we keep track of the root transaction and override the ParentSpanID so that the spanRecorder reconstructs the span hierarchy.
@giortzisg giortzisg requested a review from Litarnus February 16, 2026 13:49
@giortzisg giortzisg self-assigned this Feb 16, 2026
@github-actions
Copy link

github-actions bot commented Feb 16, 2026

Semver Impact of This PR

🟢 Patch (bug fixes)

📋 Changelog Preview

This is how your changes will appear in the changelog.
Entries from this PR are highlighted with a left border (blockquote style).


Breaking Changes 🛠

  • Add support for go 1.26 by giortzisg in #1193
    • bump minimum supported go version to 1.24
  • Send uint64 overflowing attributes as numbers. by giortzisg in #1198
    • The SDK was converting overflowing uint64 attributes to strings for slog and logrus integrations. To eliminate double types for these attributes, the SDK now sends the overflowing attribute as is, and lets the server handle the overflow appropriately.
    • It is expected that overflowing unsigned integers would now get dropped, instead of converted to strings.

New Features ✨

  • Add zap logging integration by giortzisg in #1184
  • Log specific message for RequestEntityTooLarge by giortzisg in #1185

Bug Fixes 🐛

  • Improve otel span map cleanup performance by giortzisg in #1200
  • Ensure correct signal delivery on multi-client setups by giortzisg in #1190

Internal Changes 🔧

Deps

  • Bump golang.org/x/crypto to 0.48.0 by giortzisg in #1196
  • Use go1.24.0 by giortzisg in #1195
  • Bump github.com/gofiber/fiber/v2 from 2.52.9 to 2.52.11 in /fiber by dependabot in #1191
  • Bump getsentry/craft from 2.19.0 to 2.20.1 by dependabot in #1187

Other

  • Add omitzero and remove custom serialization by giortzisg in #1197
  • Rename Telemetry Processor components by giortzisg in #1186

🤖 This preview updates automatically when you update the PR.

Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@linear
Copy link

linear bot commented Feb 20, 2026

Comment on lines +77 to 86
if entry.activeCount.Add(-1) <= 0 {
// CompareAndSwap(CAS) is used to prevent a race between Set and MarkFinished.
// The race has two windows:
// 1. MarkFinished decremented activeCount to 0 but hasn't CAS'd yet -> Set Adds(1), and CAS fails keeping the
// transaction, since we just added a new span.
// 2. MarkFinished already CAS'd -> Set will store on the transaction marked for deletion (better than
// creating a new orphaned span).
if entry.activeCount.CompareAndSwap(0, -1) {
ssm.transactions.CompareAndDelete(traceID, entry)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sl0thentr0py @Litarnus would appreciate if you can verify my logic here. Added the fix due to this comment and I think it makes sense.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, your comments describe it accurately imo

Comment on lines +53 to 63
func (ssm *SentrySpanMap) Set(spanID otelTrace.SpanID, span *sentry.Span, traceID otelTrace.TraceID) {
t := &TransactionEntry{root: span}
t.activeCount.Store(1)
t.spans.Store(spanID, span)
t.knownSpanIDs.Store(spanID, struct{}{})

if existing, loaded := ssm.transactions.LoadOrStore(traceID, t); loaded {
existing.activeCount.Add(1)
existing.spans.Store(spanID, span)
existing.knownSpanIDs.Store(spanID, struct{}{})
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: A race condition between MarkFinished and Set can cause a span to be added to a transaction entry just as it's being deleted, making the span undiscoverable.
Severity: HIGH

Suggested Fix

The logic for adding a span in Set needs to handle the case where a transaction entry is being deleted. One approach is to re-check the activeCount after loading an existing entry. If the count is negative (indicating deletion is in progress), the Set operation should retry by attempting to create and store a new transaction entry instead of modifying the one marked for deletion.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent.
Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not
valid.

Location: otel/span_map.go#L53-L63

Potential issue: A race condition exists between the `MarkFinished` and `Set` functions.
A thread in `MarkFinished` can prepare a transaction entry for deletion after its last
span finishes. Concurrently, a second thread in `Set` can add a new span to this same
transaction entry before it's actually removed from the map. The first thread's
`CompareAndDelete` operation will still succeed, removing the entry from the
`ssm.transactions` map. As a result, the newly added span becomes inaccessible to
subsequent `Get()` calls, leading to lost spans and broken trace continuity.

}

// Len returns the number of spans on the map.
func (ssm *SentrySpanMap) Len() int {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add a comment that this is very expensive and used for testing to discourage anyone from using it by accident

Comment on lines +77 to 86
if entry.activeCount.Add(-1) <= 0 {
// CompareAndSwap(CAS) is used to prevent a race between Set and MarkFinished.
// The race has two windows:
// 1. MarkFinished decremented activeCount to 0 but hasn't CAS'd yet -> Set Adds(1), and CAS fails keeping the
// transaction, since we just added a new span.
// 2. MarkFinished already CAS'd -> Set will store on the transaction marked for deletion (better than
// creating a new orphaned span).
if entry.activeCount.CompareAndSwap(0, -1) {
ssm.transactions.CompareAndDelete(traceID, entry)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks good to me, your comments describe it accurately imo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sentry-go/otel: SentrySpanMap.tryCleanupSpan holds write lock during recursive cleanup, causing severe contention

2 participants