chore(sc-41159): Add Braintrust evaluation cronjob infrastructure #3076
base: master
Conversation
🧪 CI Insights: Here's what we observed from your CI run for 4ca483f. 🟢 All jobs passed! But CI Insights is watching 👀
…dataset matching

Implement Braintrust integration with two-phase automation:

**Phase 1 (init_step):**
- Query available tags, filter for dataset-tagging in description
- Query all logs from last 24h
- Use Claude to assign relevant tags (constrained to filtered tags)
- Save tagged logs with deduplication support

**Phase 2 (push_step):**
- Load previously tagged logs
- Fetch all datasets and extract relevant_tags from descriptions
- Match logs to datasets based on tag intersection
- Insert logs to datasets with deduplication and error escalation

**Infrastructure:**
- Braintrust API integration with proper error handling
- Claude API for intelligent tag assignment
- Kubernetes cronjobs with shared storage
- Docker image with Python 3.11
- Helm templates for ConfigMaps and cronjobs
- Backup cronjob for weekly log archival to GCS

**Security & Reliability:**
- Critical fixes for JSON parsing, regex patterns, and type validation
- Error escalation for failed insertions
- Retry logic for transient failures
- Structured logging with structlog

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
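For orientation, the Phase 2 matching step can be pictured as the sketch below. The dict shapes and `relevant_tags` field names are assumptions for illustration, not the PR's actual code:

```python
# Hypothetical sketch of Phase 2 matching: a log is routed to every dataset
# whose declared relevant_tags intersect the tags Claude assigned to the log.
def match_logs_to_datasets(tagged_logs, datasets):
    """Return {dataset_name: [logs]} based on tag intersection."""
    matches = {}
    for dataset in datasets:
        dataset_tags = set(dataset.get("relevant_tags", []))
        if not dataset_tags:
            continue  # dataset did not declare [[relevant_tags: [...]]]
        for log in tagged_logs:
            if dataset_tags & set(log.get("relevant_tags", [])):
                matches.setdefault(dataset["name"], []).append(log)
    return matches
```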
Force-pushed from 577eea2 to 675de90 (compare)
Pull request overview
This PR adds infrastructure for automated Braintrust evaluation by introducing two scheduled cronjobs: one that tags logs with Claude and pushes them to Braintrust datasets (daily at 2 AM), and one that backs up logs to CSV in GCS (weekly on Sundays). The implementation includes Python automation scripts, Helm chart templates for Kubernetes cronjobs, a dedicated Docker image, and configuration for secrets management.
Changes:
- Added two Python scripts for Braintrust automation: log tagging/pushing and backup functionality
- Created Kubernetes CronJob infrastructure with Helm templates, ConfigMaps, and values configuration
- Added Dockerfile.braintrust for running the automation scripts
- Updated requirements.txt to include the braintrust==0.15.0 dependency
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 19 comments.
| File | Description |
|---|---|
| scripts/scheduled/braintrust_tag_and_push.py | Main automation script that fetches tags, queries logs, tags them with Claude, and pushes to datasets |
| scripts/scheduled/braintrust_backup_logs.py | Backup script that queries logs from Braintrust and writes them to CSV for GCS storage |
| helm-chart/sefaria/values.yaml | Adds configuration for braintrust and anthropic secrets, plus cronjob settings |
| helm-chart/sefaria/templates/cronjob/braintrust-tag-and-push.yaml | CronJob template for daily tag-and-push automation |
| helm-chart/sefaria/templates/cronjob/braintrust-backup-logs.yaml | CronJob template for weekly log backups with init container pattern |
| helm-chart/sefaria/templates/configmap/braintrust-tag-and-push-script.yaml | ConfigMap that mounts the tag-and-push script |
| helm-chart/sefaria/templates/configmap/braintrust-backup-logs-script.yaml | ConfigMap that mounts the backup script |
| Dockerfile.braintrust | Python 3.11 container image with required dependencies for both cronjobs |
| requirements.txt | Adds braintrust library version 0.15.0 |
helm-chart/sefaria/templates/configmap/braintrust-backup-logs-script.yaml (outdated; comment resolved)
```python
for idx, log in enumerate(logs):
    tags = tag_log_with_claude(client, log, available_tags)
    log["relevant_tags"] = tags
    tagged_logs.append(log)

    if (idx + 1) % 10 == 0:
        logger.info("tagging_progress", processed=idx + 1, total=len(logs))
```
Copilot AI (Feb 4, 2026):
The script tags ALL logs from the last 24 hours with Claude without any rate limiting or batching controls. If there are hundreds or thousands of logs, this could:
- Hit Claude API rate limits, causing failures
- Incur significant API costs
- Take a very long time to complete
Consider adding:
- Rate limiting between API calls (e.g., time.sleep(0.1) between requests)
- A configurable maximum number of logs to process
- Retry logic with exponential backoff for rate limit errors
- Batching if Claude supports batch processing

For example:

```python
import time

for idx, log in enumerate(logs):
    tags = tag_log_with_claude(client, log, available_tags)
    log["relevant_tags"] = tags
    tagged_logs.append(log)

    # Add small delay to avoid rate limiting
    if (idx + 1) % 10 == 0:
        time.sleep(1)  # Brief pause every 10 requests
        logger.info("tagging_progress", processed=idx + 1, total=len(logs))
```

```python
unique_logs = {}
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log
```
Copilot AI (Feb 4, 2026):
Logs without an ID are silently dropped during deduplication. The code only adds logs to unique_logs if they have an ID (if log_id:), which means logs without IDs are discarded without any warning or logging.
Consider handling logs without IDs more explicitly:

```python
unique_logs = {}
logs_without_id = []
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log
    else:
        logs_without_id.append(log)

if logs_without_id:
    logger.warning("logs_without_id_found", count=len(logs_without_id))
    # Optionally include them in the final list
    # unique_logs.update({id(log): log for log in logs_without_id})

logs = list(unique_logs.values())
```

This ensures visibility into potential data quality issues.
```python
try:
    existing_ids = set()

    for row in dataset:
        log_id = row.get("id") or (row.get("input", {}).get("id") if isinstance(row.get("input"), dict) else None)
        if log_id:
            existing_ids.add(str(log_id))

    return existing_ids
```
Copilot AI (Feb 4, 2026):
The get_existing_log_ids_in_dataset function iterates over the entire dataset to build a deduplication set. For large datasets with thousands or millions of records, this could:
- Be very slow and consume significant memory
- Hit API rate limits or timeouts
- Make the cronjob take hours to complete
Consider:
- Adding pagination or limits if the Braintrust API supports it
- Using a more efficient deduplication approach (e.g., checking existence per-log via API if available)
- Caching the existing IDs between runs if the dataset doesn't change frequently
- Adding a warning if the dataset size exceeds a threshold:

```python
MAX_DATASET_SIZE = 10000  # Configurable

row_count = 0
for row in dataset:
    row_count += 1
    if row_count > MAX_DATASET_SIZE:
        logger.warning("large_dataset_detected",
                       dataset_id=dataset.id if hasattr(dataset, 'id') else 'unknown',
                       size=f">{MAX_DATASET_SIZE}")
    # ... rest of logic
```
helm-chart/sefaria/templates/configmap/braintrust-tag-and-push-script.yaml (outdated; comment resolved)
… jobs for log backup and tagging
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
```yaml
volumeMounts:
- mountPath: /shared/braintrust
  name: shared-storage
command: ["python"]
args: ["/app/scripts/braintrust_tag_and_push.py", "all"]
```
Copilot AI (Feb 8, 2026):
The braintrust image runs as a non-root user (see build/braintrust/Dockerfile). Ensure the shared-storage volume mount is writable in all modes (especially PVCs) by setting an fsGroup or adding an initContainer to adjust ownership/permissions.
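A minimal sketch of the fsGroup approach, assuming the image's non-root user runs with UID/GID 1000 (illustrative values that must match build/braintrust/Dockerfile):

```yaml
# Pod-level securityContext so the kubelet chowns mounted volumes to fsGroup.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000   # assumed UID of the image's "braintrust" user
    fsGroup: 1000     # group ownership applied to shared-storage mounts
```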
| logger.info("tagging_all_logs", total_logs=len(logs), available_tags_count=len(available_tags)) | ||
| client = get_claude_client() | ||
| tagged_logs = [] | ||
|
|
||
| for idx, log in enumerate(logs): |
Copilot AI (Feb 8, 2026):
This script tags every log from the last 24 hours with a synchronous Claude call per log and no rate limiting / max-log cap. That can easily exceed cronjob runtime/cost and hit provider rate limits; consider batching with backoff/retry and making max logs/time budget configurable.
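A minimal sketch of the backoff idea, assuming a tag_log_with_claude helper like the one reviewed above; the exception handling is deliberately generic because the provider's rate-limit exception type is not shown in this PR:

```python
import time

def tag_with_backoff(client, log, available_tags, max_retries=5):
    """Retry the per-log Claude call with exponential backoff (1s, 2s, 4s, ...)."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return tag_log_with_claude(client, log, available_tags)
        except Exception:  # ideally narrow this to the provider's rate-limit error
            if attempt == max_retries - 1:
                raise  # escalate after exhausting retries
            time.sleep(delay)
            delay *= 2
```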
```yaml
syncMongoProductionData:
  enabled: false
braintrust:
  enabled: false
```
Copilot AI (Feb 8, 2026):
The values key cronJobs.braintrust.enabled is set but not referenced by any templates (only backupLogs.enabled/tagAndPush.enabled are checked). Either wire this parent flag into the CronJob template conditions or remove it to avoid configuration drift/confusion.
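If the parent flag is kept, wiring it into the template guards might look like this sketch (shown for the tag-and-push template; the backup template would gate the same way):

```yaml
{{- if and .Values.cronJobs.braintrust.enabled .Values.cronJobs.braintrust.tagAndPush.enabled }}
apiVersion: batch/v1
kind: CronJob
# ... existing CronJob body ...
{{- end }}
```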
```yaml
spec:
  schedule: "{{ .Values.cronJobs.braintrust.backupLogs.schedule }}"
  jobTemplate:
    spec:
      backoffLimit: 1
```
Copilot AI (Feb 8, 2026):
No concurrencyPolicy is set, so a slow export/upload could overlap with the next schedule and upload duplicate backups. Consider setting concurrencyPolicy: Forbid for predictable weekly backups.
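A sketch of that change against the snippet above; the startingDeadlineSeconds value is illustrative:

```yaml
spec:
  schedule: "{{ .Values.cronJobs.braintrust.backupLogs.schedule }}"
  concurrencyPolicy: Forbid      # skip a new run while the previous one is still active
  startingDeadlineSeconds: 3600  # optional: tolerate up to 1h of missed-start delay
  jobTemplate:
    spec:
      backoffLimit: 1
```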
```
braintrust==0.15.0
langchain-anthropic==0.3.22
requests>=2.31.0
```
Copilot AI (Feb 8, 2026):
PR description mentions updating requirements.txt to add braintrust, but this PR adds a separate requirements-braintrust.txt and does not modify requirements.txt. Either update the description or ensure the intended dependency file is updated so installation/build expectations are clear.
```yaml
- name: braintrust-tag-and-push
  image: "{{ .Values.cronJobs.braintrust.image.repository }}:{{ .Values.cronJobs.braintrust.image.tag }}"
  env:
  - name: BRAINTRUST_API_KEY
    valueFrom:
```
Copilot AI (Feb 8, 2026):
PR description mentions deploying scripts via ConfigMaps/init containers, but the chart changes shown execute scripts baked into the image and there are no new braintrust script ConfigMaps under helm-chart/sefaria/templates/configmap. Either add the missing templates or update the PR description/docs to match the implemented approach.
| model="claude-3-5-haiku-20241022", | ||
| temperature=0, | ||
| max_tokens=256, | ||
| api_key=api_key |
Copilot AI (Feb 8, 2026):
ChatAnthropic is initialized with parameter names (model/max_tokens/api_key) that differ from other usages in this repo (model_name/max_tokens_to_sample) and may not match langchain-anthropic==0.3.22, causing a runtime TypeError. Align the constructor args with the version used in this codebase (and rely on standard env-based API key handling if supported).
| model="claude-3-5-haiku-20241022", | |
| temperature=0, | |
| max_tokens=256, | |
| api_key=api_key | |
| model_name="claude-3-5-haiku-20241022", | |
| temperature=0, | |
| max_tokens_to_sample=256, |
```python
for row in dataset:
    log_id = row.get("id") or (row.get("input", {}).get("id") if isinstance(row.get("input"), dict) else None)
    if log_id:
        existing_ids.add(str(log_id))
```
Copilot AI (Feb 8, 2026):
Deduplication scans the entire dataset via iteration, which will become slow/expensive as datasets grow and can time out the cronjob. Consider a scalable alternative (query only recent rows, maintain an external pushed-ID index, or use a server-side unique key/dedup mechanism if Braintrust supports it).
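One shape the external pushed-ID index could take, sketched below: persist the IDs already pushed per dataset on the shared volume and consult that file instead of re-scanning the dataset. The path and file layout are assumptions, not part of this PR:

```python
import json
from pathlib import Path

# Assumed location on the shared volume; adjust to BRAINTRUST_SHARED_STORAGE.
INDEX_DIR = Path("/shared/braintrust/pushed_ids")

def load_pushed_ids(dataset_name: str) -> set:
    """Read the set of log IDs previously pushed to this dataset."""
    path = INDEX_DIR / f"{dataset_name}.json"
    return set(json.loads(path.read_text())) if path.exists() else set()

def save_pushed_ids(dataset_name: str, ids: set) -> None:
    """Persist the updated ID set after a successful push."""
    INDEX_DIR.mkdir(parents=True, exist_ok=True)
    (INDEX_DIR / f"{dataset_name}.json").write_text(json.dumps(sorted(ids)))
```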
```yaml
spec:
  schedule: "{{ .Values.cronJobs.braintrust.tagAndPush.schedule }}"
  jobTemplate:
    spec:
      backoffLimit: 1
```
Copilot AI (Feb 8, 2026):
No concurrencyPolicy is set, so if a run takes longer than the schedule interval, multiple runs can overlap and double-tag/double-push logs. Consider setting concurrencyPolicy: Forbid (and optionally startingDeadlineSeconds) for safer operation.
```yaml
braintrust:
  enabled: true
  image:
    repository:
    tag:
```
Copilot AI (Feb 8, 2026):
PR description says the Braintrust cronjobs are disabled by default, but production-values.yaml enables them (and both sub-jobs) for production. Either keep them disabled here until explicitly enabled, or update the PR description to reflect that production will run them after deploy.
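If the keep-disabled option is taken, the values would stay at their chart defaults, e.g. (sketch):

```yaml
cronJobs:
  braintrust:
    enabled: false      # opt in per environment rather than in the chart
    backupLogs:
      enabled: false
    tagAndPush:
      enabled: false
```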
…proved logging and tagging
Description
Add Braintrust automations for backing up logs and tagging/pushing logs to datasets. Two Kubernetes CronJobs run in production: one weekly backup of Braintrust logs to a GCS bucket, and one daily job that tags logs with Claude (using Braintrust tags whose description contains `dataset-tagging`) and pushes them into Braintrust datasets that declare `[[relevant_tags: ["a","b"]]]` in their description. A dedicated Braintrust container image is built and pushed via a new GitHub Actions workflow (dev on branch/PR, prod on tag), and the production Helm deploy is updated to inject the Braintrust image and related config.

Code Changes

- `.github/workflows/braintrust-container.yaml` (new)
  - Builds and pushes the Braintrust image on changes to `scripts/scheduled/braintrust_*.py`, `requirements-braintrust.txt`, `build/braintrust/**`, and this workflow file; also on tags `v*` and `workflow_dispatch`.
  - Uses `docker/metadata-action` for tags.
- `build/braintrust/Dockerfile` (new)
  - Installs `requirements-braintrust.txt`, adds non-root user `braintrust`, copies `braintrust_backup_logs.py` and `braintrust_tag_and_push.py` into `/app/scripts`, `ENTRYPOINT ["python"]`.
- `build/ci/production-helm-deploy.sh`
  - Accepts `BRAINTRUST_IMAGE` and uses `yq` to set `cronJobs.braintrust.image.repository` and `cronJobs.braintrust.image.tag` in the generated values before `helm upgrade`.
- `build/ci/production-values.yaml`
  - Enables `cronJobs.braintrust` with `backupLogs` (schedule `0 1 * * 0`, bucket `braintrust-logs`, prefix `logs/`) and `tagAndPush` (schedule `0 2 * * *`), plus placeholder `image.repository`/`tag`.
  - Adds `secrets.braintrust` and `secrets.anthropic` refs for production.
- `helm-chart/sefaria/templates/cronjob/braintrust-backup-logs.yaml` (new)
  - Rendered when `cronJobs.braintrust.backupLogs.enabled` is true.
  - Runs `braintrust_backup_logs.py` (Braintrust API key + project ID from secret) and writes CSV to a shared volume; a `google/cloud-sdk` container uploads the CSV from the shared volume to `gs://${BUCKET}/${PREFIX}`; `concurrencyPolicy: Forbid`, resource limits, single success/fail history.
- `helm-chart/sefaria/templates/cronjob/braintrust-tag-and-push.yaml` (new)
  - Rendered when `cronJobs.braintrust.tagAndPush.enabled` is true.
  - Runs `braintrust_tag_and_push.py all` with Braintrust + Anthropic secrets and optional PVC for shared storage; `concurrencyPolicy: Forbid`, memory/cpu limits.
- `helm-chart/sefaria/values.yaml`
  - Adds `secrets.braintrust` and `secrets.anthropic` (refs only; no data in repo).
  - Adds the `cronJobs.braintrust` block: `enabled: false`, default image repo/tag, `backupLogs` and `tagAndPush` with schedules, service accounts, bucket/prefix, and `tagAndPush.usePvc`/`pvcName`.
- `requirements-braintrust.txt` (new)
  - `braintrust==0.15.0`, `langchain-anthropic==0.3.22`, `requests`, `structlog`, `google-cloud-logging`, `google-cloud-storage`.
- `scripts/scheduled/braintrust_backup_logs.py` (new)
  - Reads `BRAINTRUST_API_KEY` and `BRAINTRUST_PROJECT_ID` (validated as UUID); queries Braintrust BTQL for the last 7 days of project logs, writes CSV to `/tmp/logs_backup_<date>.csv`; structured logging; exits 0 when no logs.
- `scripts/scheduled/braintrust_tag_and_push.py` (new)
  - Init path: query tags whose description contains `dataset-tagging`; query all logs from the last 24h via BTQL; tag each log with Claude (1–3 tags from the allowed set); save tagged logs to shared storage (JSONL).
  - Push path: load tagged logs; fetch all datasets; filter datasets by `[[relevant_tags: ["a","b"]]]` in the description; match logs to datasets by tag intersection; insert into Braintrust datasets with deduplication by log ID.
  - Supports `init`/`push`/`all`. Uses `BRAINTRUST_MAX_LOGS`, `BRAINTRUST_TAGGING_DELAY`, `BRAINTRUST_MAX_TAGS_IN_PROMPT`, `BRAINTRUST_SHARED_STORAGE`; validates project_id as UUID.
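For context, the `[[relevant_tags: ["a","b"]]]` marker in a dataset description could be extracted roughly as below. This is a sketch; the PR's actual regex and validation may differ:

```python
import json
import re

# Matches e.g. [[relevant_tags: ["rag", "search"]]] inside a description string.
RELEVANT_TAGS_RE = re.compile(r'\[\[relevant_tags:\s*(\[.*?\])\s*\]\]')

def parse_relevant_tags(description: str) -> list:
    match = RELEVANT_TAGS_RE.search(description or "")
    if not match:
        return []
    try:
        tags = json.loads(match.group(1))  # the bracketed list is JSON
    except json.JSONDecodeError:
        return []
    return [tag for tag in tags if isinstance(tag, str)]
```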
Notes

- Requires Kubernetes secrets `braintrust-secret-production` (e.g. `api-key`, `project-id`) and `anthropic-api-key-production` (e.g. `api-key`) created and wired to the chart. GCP service accounts `braintrust-backup-logs` and `braintrust-tag-push` must exist with the right GCS/Braintrust permissions.
- Backups go to bucket `braintrust-logs` under prefix `logs/` (configurable in values). Ensure the backup job's service account can write to that bucket.
- Tagging uses Claude (`claude-3-5-haiku-20241022`), rate-limited by `BRAINTRUST_TAGGING_DELAY` and capped by `BRAINTRUST_MAX_LOGS` (default 500). Dataset tagging only considers tags whose Braintrust description contains `dataset-tagging` and datasets that declare `[[relevant_tags: [...]]]` in their description.