Skip to content

[Q2-P1] Add opt-in SDK telemetry for proactive issue detection #24

@karlwaldman

Description

@karlwaldman

Problem

We only learn about SDK issues when customers report them. We're reactive instead of proactive.

Solution

Add opt-in telemetry to detect issues before customers report them:

# oilpriceapi/telemetry.py

class Telemetry:
    """Opt-in telemetry for SDK health monitoring."""

    def __init__(self, enabled=False):
        self.enabled = enabled or os.getenv('OILPRICEAPI_TELEMETRY') == 'true'

    def log_timeout_event(self, endpoint, timeout, duration):
        """Log timeout events."""
        if not self.enabled:
            return

        self._send({
            'event': 'timeout',
            'endpoint': endpoint,
            'timeout': timeout,
            'duration': duration,
            'sdk_version': __version__,
            'timestamp': datetime.utcnow().isoformat()
        })

    def log_error(self, error_type, endpoint, message):
        """Log error events."""
        if not self.enabled:
            return

        self._send({
            'event': 'error',
            'error_type': error_type,
            'endpoint': endpoint,
            'message': message,
            'sdk_version': __version__
        })

Usage

# Enable telemetry (opt-in)
client = OilPriceAPI(api_key='...', enable_telemetry=True)

# Or via environment variable
export OILPRICEAPI_TELEMETRY=true

Data Collected (Privacy-Preserving)

YES (collected):

  • SDK version
  • Endpoint called
  • Error types
  • Timeout events
  • Response times (buckets)
  • Python version

NO (NOT collected):

  • API keys
  • Request parameters
  • Response data
  • User identifying information
  • IP addresses

Benefits

  1. Early detection: See issues before customers report
  2. Version adoption: Track which versions are in use
  3. Error patterns: Identify common failure modes
  4. Performance trends: Track response times across users

Implementation

# In client.py
class OilPriceAPI:
    def __init__(self, ..., enable_telemetry=False):
        self.telemetry = Telemetry(enabled=enable_telemetry)

    def request(self, method, path, ...):
        start = time.time()
        try:
            response = self._client.request(...)
            duration = time.time() - start

            if duration > timeout:
                self.telemetry.log_timeout_event(path, timeout, duration)

            return response
        except Exception as e:
            self.telemetry.log_error(type(e).__name__, path, str(e))
            raise

Acceptance Criteria

  • Telemetry module created
  • Opt-in by default (explicit enable required)
  • Privacy policy documented
  • Data retention policy defined
  • Dashboard created for telemetry data
  • Documentation updated with telemetry info

Estimated Effort

Time: 6-8 hours

Success Metrics

  • Detect issues within 1 hour of first occurrence
  • 20%+ of users opt into telemetry
  • Identify patterns before customer reports

Metadata

Metadata

Assignees

No one assigned

    Labels

    monitoringMonitoring and observabilitypriority: highShould be fixed soonquadrant: q2Important, Not Urgent (Schedule)type: featureNew feature implementation

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions