Skip to content

Conversation

@pitabwire
Copy link
Owner

@pitabwire pitabwire commented Feb 4, 2026

Summary

  • Resource leak fixes: Close connections on error paths in JetStream KV, Redis, and Valkey cache backends
  • Data race fixes: Use atomic.Int64 for event test counters, add mutex protection for shared state, fix concurrent map access in security manager
  • Graceful shutdown improvements: Properly drain worker pools, add configurable shutdown timeouts, ensure clean service teardown
  • Zombie trace suppression: Strip trace context from queue Receive() polls to prevent gocloud.dev internal spans, replace parent-child trace propagation with span links for message processing to eliminate long-running zombie root traces
  • OTel schema fix: Update semconv from v1.37.0 to v1.39.0 to resolve schema URL conflict with otel SDK v1.40.0
  • Lint and race detector fixes: Resolve all lint errors and data races found by -race flag
  • API change: workerpool.NewManager now returns (Manager, error) instead of panicking on failure

Test plan

  • go build ./... compiles cleanly
  • go test ./... -race -count=1 passes all tests with race detector
  • Verify reduced trace volume in staging environment
  • Confirm span links appear correctly in trace viewer (Jaeger/Grafana)

🤖 Generated with Claude Code

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @pitabwire, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and reliability of the framework by addressing several critical issues related to concurrency, resource management, and error handling. The changes aim to prevent common failure modes such as race conditions, goroutine leaks, and application crashes during initialization, promoting a more stable and predictable service operation. By transitioning from abrupt panics to graceful startup error reporting, the framework now offers better diagnostic capabilities and resilience.

Highlights

  • Concurrency Safety: Implemented sync.RWMutex to protect the eventRegistry map, preventing race conditions during concurrent read/write operations by queue subscriber goroutines.
  • Resource Management: Addressed goroutine leaks by introducing a Close() method to the security.Manager interface and its implementations, ensuring background JWKS refresh goroutines are properly shut down. TokenAuthenticator.Stop() is now idempotent using sync.Once to prevent double-close panics.
  • Graceful Error Handling: Replaced all panic(), .Fatal(), and .Panic() calls during service initialization with a AddStartupError() pattern, allowing the service to report errors gracefully before accepting requests.
  • gRPC Robustness: Resolved a potential gRPC deadlock by replacing an uninitialized errorChannel with a reportError callback in grpcDriver, routing gRPC errors through the service's existing sendStopError mechanism.
  • Connection Cleanup: Enhanced resource leak prevention by ensuring Redis clients are closed on Ping() failures and NATS connections are closed on all error paths during JetStream KV cache initialization.
  • API Refinement: Modified workerpool.NewManager to return (Manager, error) instead of panicking, providing a more robust initialization process.
Changelog
  • cache/jetstreamkv/jetstream.go
    • Ensured NATS connection is closed on various error paths during JetStream KV cache initialization.
  • cache/redis/redis.go
    • Added client closure on Redis Ping() failure to prevent resource leaks.
  • events/manager.go
    • Introduced sync.RWMutex to protect the eventRegistry map, resolving potential race conditions during concurrent access.
  • options_datastore.go
    • Replaced util.Log(...).Fatal(...) calls with s.AddStartupError(err) for graceful error handling during datastore manager and connection initialization.
    • Removed an unused import.
  • options_queue.go
    • Converted panic() calls during publisher and subscriber registration validation to s.AddStartupError() for non-fatal startup error reporting.
  • options_queue_test.go
    • Updated TestOptionEarlyFailure to verify AddStartupError behavior instead of panics.
    • Added helper functions contains and searchString for error message validation.
  • options_telemetry.go
    • Replaced util.Log(...).Fatal(...) with s.AddStartupError(err) for telemetry initialization errors.
  • options_worker.go
    • Modified the WithWorkerPoolOptions function to handle the new (Manager, error) return signature of workerpool.NewManager and report errors via s.AddStartupError.
  • queue/validation_test.go
    • Updated test cases to reflect the new (Manager, error) return signature of workerpool.NewManager.
  • security/interfaces.go
    • Added a Close() method to the security.Manager interface for proper resource cleanup.
  • security/manager/security_manager.go
    • Implemented the Close() method to delegate closure to the authenticator if it implements the closer interface.
  • security/openid/jwt_token_authenticator.go
    • Implemented a Close() method to stop the background JWKS refresh goroutine.
    • Made TokenAuthenticator.Stop() idempotent using sync.Once to prevent multiple closures.
  • server.go
    • Replaced the errorChannel field in grpcDriver with a reportError callback function, preventing deadlocks with uninitialized channels and routing gRPC errors through the service's main error handling.
  • service.go
    • Integrated securityManager.Close() into the service's cleanup methods.
    • Updated workerpool.NewManager call to handle its new error return.
    • Replaced Fatal and Panic calls with AddStartupError for JWT registration and event queue setup.
    • Passed the service's sendStopError callback to the grpcDriver for unified error reporting.
  • workerpool/manager.go
    • Changed NewManager signature to return (Manager, error) instead of Manager.
    • Replaced log.WithError(err).Panic(...) with a graceful error return.
Activity
  • The pull request author, pitabwire, has provided a detailed summary of the changes and a test plan in the PR description. No other review comments or activities are visible in the provided context.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive set of fixes that significantly enhance the robustness and reliability of the framework. The changes address critical issues such as race conditions, resource leaks, goroutine leaks, and panics during startup, replacing them with more graceful error handling mechanisms. The introduction of sync.RWMutex for map access, a Close() method for graceful shutdown of background tasks, and the consistent use of a startup error aggregation pattern are all excellent improvements. The API change in workerpool.NewManager to return an error instead of panicking is also a welcome change that aligns with Go best practices. The test updates are thorough and correctly validate the new behaviors. I have one minor suggestion to simplify a test file by using a standard library function. Overall, this is a high-quality contribution that makes the framework much more stable.

Comment on lines 91 to 102
func contains(s, substr string) bool {
return len(s) >= len(substr) && searchString(s, substr)
}

func searchString(s, substr string) bool {
for i := 0; i <= len(s)-len(substr); i++ {
if s[i:i+len(substr)] == substr {
return true
}
}
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These helper functions contains and searchString reimplement the functionality of the standard library's strings.Contains. It's more idiomatic and simpler to use the standard library function directly. You can remove these functions and use strings.Contains where contains is called (on line 79). Remember to add "strings" to your imports.

@pitabwire pitabwire changed the title fix: resolve critical robustness issues across framework fix: critical robustness improvements and zombie trace suppression Feb 4, 2026
- Close connections on error paths in JetStream KV, Redis, and Valkey cache backends
- Use atomic.Int64 for event test counters, add mutex protection for shared state
- Add Close() to security.Manager to stop JWKS background refresh goroutine
- Replace panic/Fatal calls during init with graceful AddStartupError pattern
- Replace uninitialized gRPC errorChannel with reportError callback
- Change workerpool.NewManager to return (Manager, error) instead of panicking
- Strip trace context from queue Receive() polls to suppress gocloud.dev internal spans
- Replace parent-child trace propagation with span links for message processing
- Update semconv from v1.37.0 to v1.39.0 to resolve OTel schema URL conflict
- Use strings.Contains instead of custom helpers in test

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@pitabwire pitabwire force-pushed the fix/critical-robustness-improvements branch from facd4f8 to 7c9ee55 Compare February 4, 2026 13:28
@pitabwire pitabwire merged commit 4cbf392 into main Feb 4, 2026
2 checks passed
@pitabwire pitabwire deleted the fix/critical-robustness-improvements branch February 4, 2026 13:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant