Conversation

@yodem yodem commented Feb 4, 2026

Description

Add Braintrust automations for backing up logs and tagging/pushing logs to datasets. Two Kubernetes CronJobs run in production: one weekly backup of Braintrust logs to a GCS bucket, and one daily job that tags logs with Claude (using Braintrust tags whose description contains dataset-tagging) and pushes them into Braintrust datasets that declare [[relevant_tags: ["a","b"]]] in their description. A dedicated Braintrust container image is built and pushed via a new GitHub Actions workflow (dev on branch/PR, prod on tag), and production Helm deploy is updated to inject the Braintrust image and related config.

Code Changes

  • .github/workflows/braintrust-container.yaml (new)

    • Workflow to build and push the Braintrust image to GAR.
    • Triggers on push/PR to paths under scripts/scheduled/braintrust_*.py, requirements-braintrust.txt, build/braintrust/**, and this workflow file; also on tags v* and workflow_dispatch.
    • build-dev: branch/PR → dev GCP project; image tagged with branch, sha, and date.
    • build-prod: tag push → prod GCP project; image tagged with tag, sha, and semver.
    • Uses QEMU, Docker Buildx, Google auth (workload identity), and docker/metadata-action for tags.
  • build/braintrust/Dockerfile (new)

    • Python 3.11-slim image; installs requirements-braintrust.txt, creates a non-root braintrust user, copies braintrust_backup_logs.py and braintrust_tag_and_push.py into /app/scripts, and sets ENTRYPOINT ["python"].
  • build/ci/production-helm-deploy.sh

    • Sets BRAINTRUST_IMAGE and uses yq to set cronJobs.braintrust.image.repository and cronJobs.braintrust.image.tag in the generated values before helm upgrade.
  • build/ci/production-values.yaml

    • Adds cronJobs.braintrust with backupLogs (schedule 0 1 * * 0, bucket braintrust-logs, prefix logs/) and tagAndPush (schedule 0 2 * * *), plus placeholder image.repository/tag.
    • Adds secrets.braintrust and secrets.anthropic refs for production.
  • helm-chart/sefaria/templates/cronjob/braintrust-backup-logs.yaml (new)

    • CronJob when cronJobs.braintrust.backupLogs.enabled is true.
    • Init container: runs braintrust_backup_logs.py (Braintrust API key + project ID from secret), writes CSV to shared volume.
    • Main container: google/cloud-sdk uploads CSV from shared volume to gs://${BUCKET}/${PREFIX}; concurrencyPolicy: Forbid, resource limits, single success/fail history.
  • helm-chart/sefaria/templates/cronjob/braintrust-tag-and-push.yaml (new)

    • CronJob when cronJobs.braintrust.tagAndPush.enabled is true.
    • Single container runs braintrust_tag_and_push.py all with Braintrust + Anthropic secrets and optional PVC for shared storage; concurrencyPolicy: Forbid, memory/cpu limits.
  • helm-chart/sefaria/values.yaml

    • New secrets.braintrust and secrets.anthropic (refs only; no data in repo).
    • New cronJobs.braintrust block: enabled: false, default image repo/tag, backupLogs and tagAndPush with schedules, service accounts, bucket/prefix, and tagAndPush.usePvc/pvcName.
  • requirements-braintrust.txt (new)

    • Pins: braintrust==0.15.0, langchain-anthropic==0.3.22, requests, structlog, google-cloud-logging, google-cloud-storage.
  • scripts/scheduled/braintrust_backup_logs.py (new)

    • Reads BRAINTRUST_API_KEY and BRAINTRUST_PROJECT_ID (validated as UUID); queries Braintrust BTQL for last 7 days of project logs, writes CSV to /tmp/logs_backup_<date>.csv; structured logging; exits 0 when no logs.
  • scripts/scheduled/braintrust_tag_and_push.py (new)

    • Init path: fetch all Braintrust tags and keep only those whose description contains dataset-tagging; query all logs from the last 24h via BTQL; tag each log with Claude (1–3 tags from the allowed set); save the tagged logs to shared storage as JSONL.
    • Push path: load the tagged logs; fetch all datasets and filter to those that declare [[relevant_tags: ["a","b"]]] in their description; match logs to datasets by tag intersection; insert into Braintrust datasets, deduplicating by log ID (a sketch of the matching and dedup logic follows this list).
    • Supports init / push / all modes. Uses BRAINTRUST_MAX_LOGS, BRAINTRUST_TAGGING_DELAY, BRAINTRUST_MAX_TAGS_IN_PROMPT, and BRAINTRUST_SHARED_STORAGE; validates project_id as a UUID.
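
As referenced above, a rough sketch of the push-phase matching and deduplication (function and variable names here are illustrative, not the script's actual identifiers):

def match_logs_to_datasets(tagged_logs, datasets):
    """Group tagged logs by dataset based on tag intersection."""
    assignments = {}  # dataset id -> logs to insert
    for ds in datasets:
        relevant = set(ds.get("relevant_tags", []))
        if not relevant:
            continue
        for log in tagged_logs:
            if relevant & set(log.get("relevant_tags", [])):
                assignments.setdefault(ds["id"], []).append(log)
    return assignments

def dedupe_by_log_id(logs, existing_ids):
    """Drop logs whose ID already exists in the target dataset."""
    return [log for log in logs if str(log.get("id")) not in existing_ids]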

Notes

  • Secrets: Production needs braintrust-secret-production (e.g. api-key, project-id) and anthropic-api-key-production (e.g. api-key) created and wired to the chart. GCP service accounts braintrust-backup-logs and braintrust-tag-push must exist with the right GCS/Braintrust permissions.
  • GCS: Backup CronJob expects bucket braintrust-logs and prefix logs/ (configurable in values). Ensure the backup job’s service account can write to that bucket.
  • Tagging: Tag-and-push uses Claude (claude-3-5-haiku-20241022), rate-limited by BRAINTRUST_TAGGING_DELAY and capped by BRAINTRUST_MAX_LOGS (default 500); a sketch of this behavior follows these notes. Dataset tagging only considers tags whose Braintrust description contains dataset-tagging and datasets that declare [[relevant_tags: [...]]] in their description.
  • Breaking: None for existing services; the Braintrust cronjobs are disabled in the chart's default values and enabled only in the production values.
  • Branch: Aligns with chore/SC-41159 (research automations for tagging/logged data).
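
As noted in the Tagging bullet above, a rough sketch of how the cap and delay are meant to behave (tag_log_with_claude stands in for the actual Claude call; the delay default shown is an assumption):

import os
import time

MAX_LOGS = int(os.environ.get("BRAINTRUST_MAX_LOGS", "500"))
TAGGING_DELAY = float(os.environ.get("BRAINTRUST_TAGGING_DELAY", "0.5"))  # default is assumed, not confirmed

def tag_logs(logs, available_tags, tag_log_with_claude):
    """Tag at most MAX_LOGS logs, sleeping between calls to stay under rate limits."""
    tagged = []
    for log in logs[:MAX_LOGS]:
        log["relevant_tags"] = tag_log_with_claude(log, available_tags)
        tagged.append(log)
        time.sleep(TAGGING_DELAY)
    return tagged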

mergify bot commented Feb 4, 2026

🧪 CI Insights

Here's what we observed from your CI run for 4ca483f.

🟢 All jobs passed!

But CI Insights is watching 👀

…dataset matching

Implement Braintrust integration with two-phase automation:

**Phase 1 (init_step):**
- Query available tags, filter for dataset-tagging in description
- Query all logs from last 24h
- Use Claude to assign relevant tags (constrained to filtered tags)
- Save tagged logs with deduplication support

**Phase 2 (push_step):**
- Load previously tagged logs
- Fetch all datasets and extract relevant_tags from descriptions
- Match logs to datasets based on tag intersection
- Insert logs to datasets with deduplication and error escalation

**Infrastructure:**
- Braintrust API integration with proper error handling
- Claude API for intelligent tag assignment
- Kubernetes cronjobs with shared storage
- Docker image with Python 3.11
- Helm templates for ConfigMaps and cronjobs
- Backup cronjob for weekly log archival to GCS

**Security & Reliability:**
- Critical fixes for JSON parsing, regex patterns, and type validation
- Error escalation for failed insertions
- Retry logic for transient failures
- Structured logging with structlog

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@yodem yodem force-pushed the chore/sc-41159/research-automations-for-tagging-logged branch from 577eea2 to 675de90 on February 4, 2026 at 12:43
@yodem yodem changed the title from "chore(sc-41159): Add Braintrust evaluation cronjob infrastructure wit…" to "chore(sc-41159): Add Braintrust evaluation cronjob infrastructure" on Feb 4, 2026
@yodem yodem requested a review from Copilot February 4, 2026 13:23

Copilot AI left a comment

Pull request overview

This PR adds infrastructure for automated Braintrust evaluation by introducing two scheduled cronjobs: one that tags logs with Claude and pushes them to Braintrust datasets (daily at 2 AM), and one that backs up logs to CSV in GCS (weekly on Sundays). The implementation includes Python automation scripts, Helm chart templates for Kubernetes cronjobs, a dedicated Docker image, and configuration for secrets management.

Changes:

  • Added two Python scripts for Braintrust automation: log tagging/pushing and backup functionality
  • Created Kubernetes CronJob infrastructure with Helm templates, ConfigMaps, and values configuration
  • Added Dockerfile.braintrust for running the automation scripts
  • Updated requirements.txt to include braintrust==0.15.0 dependency

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 19 comments.

Summary per file

  • scripts/scheduled/braintrust_tag_and_push.py: Main automation script that fetches tags, queries logs, tags them with Claude, and pushes to datasets
  • scripts/scheduled/braintrust_backup_logs.py: Backup script that queries logs from Braintrust and writes them to CSV for GCS storage
  • helm-chart/sefaria/values.yaml: Adds configuration for braintrust and anthropic secrets, plus cronjob settings
  • helm-chart/sefaria/templates/cronjob/braintrust-tag-and-push.yaml: CronJob template for daily tag-and-push automation
  • helm-chart/sefaria/templates/cronjob/braintrust-backup-logs.yaml: CronJob template for weekly log backups with init container pattern
  • helm-chart/sefaria/templates/configmap/braintrust-tag-and-push-script.yaml: ConfigMap that mounts the tag-and-push script
  • helm-chart/sefaria/templates/configmap/braintrust-backup-logs-script.yaml: ConfigMap that mounts the backup script
  • Dockerfile.braintrust: Python 3.11 container image with required dependencies for both cronjobs
  • requirements.txt: Adds braintrust library version 0.15.0


Comment on lines +247 to +253
for idx, log in enumerate(logs):
    tags = tag_log_with_claude(client, log, available_tags)
    log["relevant_tags"] = tags
    tagged_logs.append(log)

    if (idx + 1) % 10 == 0:
        logger.info("tagging_progress", processed=idx + 1, total=len(logs))

Copilot AI Feb 4, 2026

The script tags ALL logs from the last 24 hours with Claude without any rate limiting or batching controls. If there are hundreds or thousands of logs, this could:

  1. Hit Claude API rate limits, causing failures
  2. Incur significant API costs
  3. Take a very long time to complete

Consider adding:

  1. Rate limiting between API calls (e.g., time.sleep(0.1) between requests)
  2. A configurable maximum number of logs to process
  3. Retry logic with exponential backoff for rate limit errors
  4. Batching if Claude supports batch processing

For example:

import time

for idx, log in enumerate(logs):
    tags = tag_log_with_claude(client, log, available_tags)
    log["relevant_tags"] = tags
    tagged_logs.append(log)
    
    # Add small delay to avoid rate limiting
    if (idx + 1) % 10 == 0:
        time.sleep(1)  # Brief pause every 10 requests
        logger.info("tagging_progress", processed=idx + 1, total=len(logs))

Comment on lines 306 to 310
unique_logs = {}
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log

Copilot AI Feb 4, 2026

Logs without an ID are silently dropped during deduplication. The code only adds logs to unique_logs if they have an ID (if log_id:), which means logs without IDs are discarded without any warning or logging.

Consider handling logs without IDs more explicitly:

unique_logs = {}
logs_without_id = []
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log
    else:
        logs_without_id.append(log)

if logs_without_id:
    logger.warning("logs_without_id_found", count=len(logs_without_id))
    # Optionally include them in the final list
    # unique_logs.update({id(log): log for log in logs_without_id})

logs = list(unique_logs.values())

This ensures visibility into potential data quality issues.

Suggested change (replace):

unique_logs = {}
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log

with:

unique_logs = {}
logs_without_id = []
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log
    else:
        logs_without_id.append(log)
if logs_without_id:
    logger.warning("logs_without_id_found", count=len(logs_without_id))

Comment on lines +548 to +556
try:
    existing_ids = set()

    for row in dataset:
        log_id = row.get("id") or (row.get("input", {}).get("id") if isinstance(row.get("input"), dict) else None)
        if log_id:
            existing_ids.add(str(log_id))

    return existing_ids

Copilot AI Feb 4, 2026

The get_existing_log_ids_in_dataset function iterates over the entire dataset to build a deduplication set. For large datasets with thousands or millions of records, this could:

  1. Be very slow and consume significant memory
  2. Hit API rate limits or timeouts
  3. Make the cronjob take hours to complete

Consider:

  1. Adding pagination or limits if the Braintrust API supports it
  2. Using a more efficient deduplication approach (e.g., checking existence per-log via API if available)
  3. Caching the existing IDs between runs if the dataset doesn't change frequently
  4. Adding a warning if the dataset size exceeds a threshold:
MAX_DATASET_SIZE = 10000  # Configurable
row_count = 0
for row in dataset:
    row_count += 1
    if row_count > MAX_DATASET_SIZE:
        logger.warning("large_dataset_detected", 
                      dataset_id=dataset.id if hasattr(dataset, 'id') else 'unknown',
                      size=f">{MAX_DATASET_SIZE}")
    # ... rest of logic


Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.



Comment on lines +38 to +42
volumeMounts:
  - mountPath: /shared/braintrust
    name: shared-storage
command: ["python"]
args: ["/app/scripts/braintrust_tag_and_push.py", "all"]

Copilot AI Feb 8, 2026

The braintrust image runs as a non-root user (see build/braintrust/Dockerfile). Ensure the shared-storage volume mount is writable in all modes (especially PVCs) by setting an fsGroup or adding an initContainer to adjust ownership/permissions.

Comment on lines +245 to +249
logger.info("tagging_all_logs", total_logs=len(logs), available_tags_count=len(available_tags))
client = get_claude_client()
tagged_logs = []

for idx, log in enumerate(logs):

Copilot AI Feb 8, 2026

This script tags every log from the last 24 hours with a synchronous Claude call per log and no rate limiting / max-log cap. That can easily exceed cronjob runtime/cost and hit provider rate limits; consider batching with backoff/retry and making max logs/time budget configurable.

syncMongoProductionData:
  enabled: false
braintrust:
  enabled: false

Copilot AI Feb 8, 2026

The values key cronJobs.braintrust.enabled is set but not referenced by any templates (only backupLogs.enabled/tagAndPush.enabled are checked). Either wire this parent flag into the CronJob template conditions or remove it to avoid configuration drift/confusion.

Suggested change (remove this line):

enabled: false

Comment on lines 9 to 13
spec:
  schedule: "{{ .Values.cronJobs.braintrust.backupLogs.schedule }}"
  jobTemplate:
    spec:
      backoffLimit: 1

Copilot AI Feb 8, 2026

No concurrencyPolicy is set, so a slow export/upload could overlap with the next schedule and upload duplicate backups. Consider setting concurrencyPolicy: Forbid for predictable weekly backups.

Comment on lines +1 to +3
braintrust==0.15.0
langchain-anthropic==0.3.22
requests>=2.31.0

Copilot AI Feb 8, 2026

PR description mentions updating requirements.txt to add braintrust, but this PR adds a separate requirements-braintrust.txt and does not modify requirements.txt. Either update the description or ensure the intended dependency file is updated so installation/build expectations are clear.

Comment on lines +18 to +22
- name: braintrust-tag-and-push
  image: "{{ .Values.cronJobs.braintrust.image.repository }}:{{ .Values.cronJobs.braintrust.image.tag }}"
  env:
    - name: BRAINTRUST_API_KEY
      valueFrom:

Copilot AI Feb 8, 2026

PR description mentions deploying scripts via ConfigMaps/init containers, but the chart changes shown execute scripts baked into the image and there are no new braintrust script ConfigMaps under helm-chart/sefaria/templates/configmap. Either add the missing templates or update the PR description/docs to match the implemented approach.

Comment on lines +170 to +173
model="claude-3-5-haiku-20241022",
temperature=0,
max_tokens=256,
api_key=api_key

Copilot AI Feb 8, 2026

ChatAnthropic is initialized with parameter names (model/max_tokens/api_key) that differ from other usages in this repo (model_name/max_tokens_to_sample) and may not match langchain-anthropic==0.3.22, causing a runtime TypeError. Align the constructor args with the version used in this codebase (and rely on standard env-based API key handling if supported).

Suggested change (replace):

model="claude-3-5-haiku-20241022",
temperature=0,
max_tokens=256,
api_key=api_key

with:

model_name="claude-3-5-haiku-20241022",
temperature=0,
max_tokens_to_sample=256,

Comment on lines +562 to +566
for row in dataset:
    log_id = row.get("id") or (row.get("input", {}).get("id") if isinstance(row.get("input"), dict) else None)
    if log_id:
        existing_ids.add(str(log_id))


Copilot AI Feb 8, 2026

Deduplication scans the entire dataset via iteration, which will become slow/expensive as datasets grow and can time out the cronjob. Consider a scalable alternative (query only recent rows, maintain an external pushed-ID index, or use a server-side unique key/dedup mechanism if Braintrust supports it).

Comment on lines 9 to 13
spec:
  schedule: "{{ .Values.cronJobs.braintrust.tagAndPush.schedule }}"
  jobTemplate:
    spec:
      backoffLimit: 1

Copilot AI Feb 8, 2026

No concurrencyPolicy is set, so if a run takes longer than the schedule interval, multiple runs can overlap and double-tag/double-push logs. Consider setting concurrencyPolicy: Forbid (and optionally startingDeadlineSeconds) for safer operation.

Comment on lines +196 to +200
braintrust:
  enabled: true
  image:
    repository:
    tag:

Copilot AI Feb 8, 2026

PR description says the Braintrust cronjobs are disabled by default, but production-values.yaml enables them (and both sub-jobs) for production. Either keep them disabled here until explicitly enabled, or update the PR description to reflect that production will run them after deploy.
