Conversation

@yodem yodem commented Feb 4, 2026

Description

Add Braintrust automations for backing up logs and tagging/pushing logs to datasets. Two Kubernetes CronJobs run in production: one weekly backup of Braintrust logs to a GCS bucket, and one daily job that tags logs with Claude (using Braintrust tags whose description contains dataset-tagging) and pushes them into Braintrust datasets that declare [[relevant_tags: ["a","b"]]] in their description. A dedicated Braintrust container image is built and pushed via a new GitHub Actions workflow (dev on branch/PR, prod on tag), and production Helm deploy is updated to inject the Braintrust image and related config.

Code Changes

  • .github/workflows/braintrust-container.yaml (new)

    • Workflow to build and push the Braintrust image to GAR.
    • Triggers on push/PR to paths under scripts/scheduled/braintrust_*.py, requirements-braintrust.txt, build/braintrust/**, and this workflow file; also on tags v* and workflow_dispatch.
    • build-dev: branch/PR → dev GCP project; image tagged with branch, sha, and date.
    • build-prod: tag push → prod GCP project; image tagged with tag, sha, and semver.
    • Uses QEMU, Docker Buildx, Google auth (workload identity), and docker/metadata-action for tags.
  • build/braintrust/Dockerfile (new)

    • Python 3.11-slim image; installs requirements-braintrust.txt, creates a non-root braintrust user, copies braintrust_backup_logs.py and braintrust_tag_and_push.py into /app/scripts, and sets ENTRYPOINT ["python"].
  • build/ci/production-helm-deploy.sh

    • Sets BRAINTRUST_IMAGE and uses yq to set cronJobs.braintrust.image.repository and cronJobs.braintrust.image.tag in the generated values before helm upgrade.
  • build/ci/production-values.yaml

    • Adds cronJobs.braintrust with backupLogs (schedule 0 1 * * 0, bucket braintrust-logs, prefix logs/) and tagAndPush (schedule 0 2 * * *), plus placeholder image.repository/tag.
    • Adds secrets.braintrust and secrets.anthropic refs for production.
  • helm-chart/sefaria/templates/cronjob/braintrust-backup-logs.yaml (new)

    • CronJob when cronJobs.braintrust.backupLogs.enabled is true.
    • Init container: runs braintrust_backup_logs.py (Braintrust API key + project ID from secret), writes CSV to shared volume.
    • Main container: google/cloud-sdk uploads CSV from shared volume to gs://${BUCKET}/${PREFIX}; concurrencyPolicy: Forbid, resource limits, single success/fail history.
  • helm-chart/sefaria/templates/cronjob/braintrust-tag-and-push.yaml (new)

    • CronJob when cronJobs.braintrust.tagAndPush.enabled is true.
    • Single container runs braintrust_tag_and_push.py all with Braintrust + Anthropic secrets and optional PVC for shared storage; concurrencyPolicy: Forbid, memory/cpu limits.
  • helm-chart/sefaria/values.yaml

    • New secrets.braintrust and secrets.anthropic (refs only; no data in repo).
    • New cronJobs.braintrust block: enabled: false, default image repo/tag, backupLogs and tagAndPush with schedules, service accounts, bucket/prefix, and tagAndPush.usePvc/pvcName.
  • requirements-braintrust.txt (new)

    • Pins: braintrust==0.15.0, langchain-anthropic==0.3.22, requests, structlog, google-cloud-logging, google-cloud-storage.
  • scripts/scheduled/braintrust_backup_logs.py (new)

    • Reads BRAINTRUST_API_KEY and BRAINTRUST_PROJECT_ID (validated as UUID); queries Braintrust BTQL for last 7 days of project logs, writes CSV to /tmp/logs_backup_<date>.csv; structured logging; exits 0 when no logs.
  • scripts/scheduled/braintrust_tag_and_push.py (new)

    • Init path: fetch all Braintrust tags and keep only those whose description contains dataset-tagging; query all logs from the last 24h via BTQL; tag each log with Claude (1–3 tags from the allowed set); save the tagged logs to shared storage as JSONL.
    • Push path: load the tagged logs; fetch all datasets and filter to those that declare [[relevant_tags: ["a","b"]]] in their description; match logs to datasets by tag intersection; insert into Braintrust datasets, deduplicating by log ID (a sketch of the matching and dedup logic follows this list).
    • Supports init / push / all modes. Uses BRAINTRUST_MAX_LOGS, BRAINTRUST_TAGGING_DELAY, BRAINTRUST_MAX_TAGS_IN_PROMPT, and BRAINTRUST_SHARED_STORAGE; validates project_id as a UUID.
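
As referenced above, a rough sketch of the push-phase matching and deduplication (function and variable names here are illustrative, not the script's actual identifiers):

def match_logs_to_datasets(tagged_logs, datasets):
    """Group tagged logs by dataset based on tag intersection."""
    assignments = {}  # dataset id -> logs to insert
    for ds in datasets:
        relevant = set(ds.get("relevant_tags", []))
        if not relevant:
            continue
        for log in tagged_logs:
            if relevant & set(log.get("relevant_tags", [])):
                assignments.setdefault(ds["id"], []).append(log)
    return assignments

def dedupe_by_log_id(logs, existing_ids):
    """Drop logs whose ID already exists in the target dataset."""
    return [log for log in logs if str(log.get("id")) not in existing_ids]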

Notes

  • Secrets: Production needs braintrust-secret-production (e.g. api-key, project-id) and anthropic-api-key-production (e.g. api-key) created and wired to the chart. GCP service accounts braintrust-backup-logs and braintrust-tag-push must exist with the right GCS/Braintrust permissions.
  • GCS: Backup CronJob expects bucket braintrust-logs and prefix logs/ (configurable in values). Ensure the backup job’s service account can write to that bucket.
  • Tagging: Tag-and-push uses Claude (claude-3-5-haiku-20241022), rate-limited by BRAINTRUST_TAGGING_DELAY and capped by BRAINTRUST_MAX_LOGS (default 500); a sketch of this behavior follows these notes. Dataset tagging only considers tags whose Braintrust description contains dataset-tagging and datasets that declare [[relevant_tags: [...]]] in their description.
  • Breaking: None for existing services; the Braintrust cronjobs are disabled in the chart's default values and enabled only in the production values.
  • Branch: Aligns with chore/SC-41159 (research automations for tagging/logged data).
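
As noted in the Tagging bullet above, a rough sketch of how the cap and delay are meant to behave (tag_log_with_claude stands in for the actual Claude call; the delay default shown is an assumption):

import os
import time

MAX_LOGS = int(os.environ.get("BRAINTRUST_MAX_LOGS", "500"))
TAGGING_DELAY = float(os.environ.get("BRAINTRUST_TAGGING_DELAY", "0.5"))  # default is assumed, not confirmed

def tag_logs(logs, available_tags, tag_log_with_claude):
    """Tag at most MAX_LOGS logs, sleeping between calls to stay under rate limits."""
    tagged = []
    for log in logs[:MAX_LOGS]:
        log["relevant_tags"] = tag_log_with_claude(log, available_tags)
        tagged.append(log)
        time.sleep(TAGGING_DELAY)
    return tagged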

mergify bot commented Feb 4, 2026

🧪 CI Insights

Here's what we observed from your CI run for 4ca483f.

🟢 All jobs passed!

But CI Insights is watching 👀

…dataset matching

Implement Braintrust integration with two-phase automation:

**Phase 1 (init_step):**
- Query available tags, filter for dataset-tagging in description
- Query all logs from last 24h
- Use Claude to assign relevant tags (constrained to filtered tags)
- Save tagged logs with deduplication support

**Phase 2 (push_step):**
- Load previously tagged logs
- Fetch all datasets and extract relevant_tags from descriptions
- Match logs to datasets based on tag intersection
- Insert logs to datasets with deduplication and error escalation

**Infrastructure:**
- Braintrust API integration with proper error handling
- Claude API for intelligent tag assignment
- Kubernetes cronjobs with shared storage
- Docker image with Python 3.11
- Helm templates for ConfigMaps and cronjobs
- Backup cronjob for weekly log archival to GCS

**Security & Reliability:**
- Critical fixes for JSON parsing, regex patterns, and type validation
- Error escalation for failed insertions
- Retry logic for transient failures
- Structured logging with structlog

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@yodem yodem force-pushed the chore/sc-41159/research-automations-for-tagging-logged branch from 577eea2 to 675de90 on February 4, 2026 at 12:43
@yodem yodem changed the title from "chore(sc-41159): Add Braintrust evaluation cronjob infrastructure wit…" to "chore(sc-41159): Add Braintrust evaluation cronjob infrastructure" on Feb 4, 2026
@yodem yodem requested a review from Copilot February 4, 2026 13:23

Copilot AI left a comment

Pull request overview

This PR adds infrastructure for automated Braintrust evaluation by introducing two scheduled cronjobs: one that tags logs with Claude and pushes them to Braintrust datasets (daily at 2 AM), and one that backs up logs to CSV in GCS (weekly on Sundays). The implementation includes Python automation scripts, Helm chart templates for Kubernetes cronjobs, a dedicated Docker image, and configuration for secrets management.

Changes:

  • Added two Python scripts for Braintrust automation: log tagging/pushing and backup functionality
  • Created Kubernetes CronJob infrastructure with Helm templates, ConfigMaps, and values configuration
  • Added Dockerfile.braintrust for running the automation scripts
  • Updated requirements.txt to include braintrust==0.15.0 dependency

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 19 comments.

Summary per file

  • scripts/scheduled/braintrust_tag_and_push.py: Main automation script that fetches tags, queries logs, tags them with Claude, and pushes to datasets
  • scripts/scheduled/braintrust_backup_logs.py: Backup script that queries logs from Braintrust and writes them to CSV for GCS storage
  • helm-chart/sefaria/values.yaml: Adds configuration for braintrust and anthropic secrets, plus cronjob settings
  • helm-chart/sefaria/templates/cronjob/braintrust-tag-and-push.yaml: CronJob template for daily tag-and-push automation
  • helm-chart/sefaria/templates/cronjob/braintrust-backup-logs.yaml: CronJob template for weekly log backups with init container pattern
  • helm-chart/sefaria/templates/configmap/braintrust-tag-and-push-script.yaml: ConfigMap that mounts the tag-and-push script
  • helm-chart/sefaria/templates/configmap/braintrust-backup-logs-script.yaml: ConfigMap that mounts the backup script
  • Dockerfile.braintrust: Python 3.11 container image with required dependencies for both cronjobs
  • requirements.txt: Adds braintrust library version 0.15.0


Comment on lines +247 to +253
for idx, log in enumerate(logs):
    tags = tag_log_with_claude(client, log, available_tags)
    log["relevant_tags"] = tags
    tagged_logs.append(log)

    if (idx + 1) % 10 == 0:
        logger.info("tagging_progress", processed=idx + 1, total=len(logs))

Copilot AI Feb 4, 2026

The script tags ALL logs from the last 24 hours with Claude without any rate limiting or batching controls. If there are hundreds or thousands of logs, this could:

  1. Hit Claude API rate limits, causing failures
  2. Incur significant API costs
  3. Take a very long time to complete

Consider adding:

  1. Rate limiting between API calls (e.g., time.sleep(0.1) between requests)
  2. A configurable maximum number of logs to process
  3. Retry logic with exponential backoff for rate limit errors
  4. Batching if Claude supports batch processing

For example:

import time

for idx, log in enumerate(logs):
    tags = tag_log_with_claude(client, log, available_tags)
    log["relevant_tags"] = tags
    tagged_logs.append(log)
    
    # Add small delay to avoid rate limiting
    if (idx + 1) % 10 == 0:
        time.sleep(1)  # Brief pause every 10 requests
        logger.info("tagging_progress", processed=idx + 1, total=len(logs))

Comment on lines 306 to 310
unique_logs = {}
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log

Copilot AI Feb 4, 2026

Logs without an ID are silently dropped during deduplication. The code only adds logs to unique_logs if they have an ID (if log_id:), which means logs without IDs are discarded without any warning or logging.

Consider handling logs without IDs more explicitly:

unique_logs = {}
logs_without_id = []
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log
    else:
        logs_without_id.append(log)

if logs_without_id:
    logger.warning("logs_without_id_found", count=len(logs_without_id))
    # Optionally include them in the final list
    # unique_logs.update({id(log): log for log in logs_without_id})

logs = list(unique_logs.values())

This ensures visibility into potential data quality issues.

Suggested change (replace):

unique_logs = {}
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log

with:

unique_logs = {}
logs_without_id = []
for log in logs:
    log_id = log.get("id")
    if log_id:
        unique_logs[log_id] = log
    else:
        logs_without_id.append(log)
if logs_without_id:
    logger.warning("logs_without_id_found", count=len(logs_without_id))

Comment on lines +548 to +556
try:
    existing_ids = set()

    for row in dataset:
        log_id = row.get("id") or (row.get("input", {}).get("id") if isinstance(row.get("input"), dict) else None)
        if log_id:
            existing_ids.add(str(log_id))

    return existing_ids

Copilot AI Feb 4, 2026

The get_existing_log_ids_in_dataset function iterates over the entire dataset to build a deduplication set. For large datasets with thousands or millions of records, this could:

  1. Be very slow and consume significant memory
  2. Hit API rate limits or timeouts
  3. Make the cronjob take hours to complete

Consider:

  1. Adding pagination or limits if the Braintrust API supports it
  2. Using a more efficient deduplication approach (e.g., checking existence per-log via API if available)
  3. Caching the existing IDs between runs if the dataset doesn't change frequently
  4. Adding a warning if the dataset size exceeds a threshold:
MAX_DATASET_SIZE = 10000  # Configurable
row_count = 0
for row in dataset:
    row_count += 1
    if row_count > MAX_DATASET_SIZE:
        logger.warning("large_dataset_detected", 
                      dataset_id=dataset.id if hasattr(dataset, 'id') else 'unknown',
                      size=f">{MAX_DATASET_SIZE}")
    # ... rest of logic


Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.



Comment on lines +38 to +42
volumeMounts:
  - mountPath: /shared/braintrust
    name: shared-storage
command: ["python"]
args: ["/app/scripts/braintrust_tag_and_push.py", "all"]

Copilot AI Feb 8, 2026

The braintrust image runs as a non-root user (see build/braintrust/Dockerfile). Ensure the shared-storage volume mount is writable in all modes (especially PVCs) by setting an fsGroup or adding an initContainer to adjust ownership/permissions.

Comment on lines +245 to +249
logger.info("tagging_all_logs", total_logs=len(logs), available_tags_count=len(available_tags))
client = get_claude_client()
tagged_logs = []

for idx, log in enumerate(logs):

Copilot AI Feb 8, 2026

This script tags every log from the last 24 hours with a synchronous Claude call per log and no rate limiting / max-log cap. That can easily exceed cronjob runtime/cost and hit provider rate limits; consider batching with backoff/retry and making max logs/time budget configurable.

syncMongoProductionData:
  enabled: false
braintrust:
  enabled: false

Copilot AI Feb 8, 2026

The values key cronJobs.braintrust.enabled is set but not referenced by any templates (only backupLogs.enabled/tagAndPush.enabled are checked). Either wire this parent flag into the CronJob template conditions or remove it to avoid configuration drift/confusion.

Suggested change (remove this line):

enabled: false

Comment on lines 9 to 13
spec:
  schedule: "{{ .Values.cronJobs.braintrust.backupLogs.schedule }}"
  jobTemplate:
    spec:
      backoffLimit: 1

Copilot AI Feb 8, 2026

No concurrencyPolicy is set, so a slow export/upload could overlap with the next schedule and upload duplicate backups. Consider setting concurrencyPolicy: Forbid for predictable weekly backups.

Comment on lines +1 to +3
braintrust==0.15.0
langchain-anthropic==0.3.22
requests>=2.31.0

Copilot AI Feb 8, 2026

PR description mentions updating requirements.txt to add braintrust, but this PR adds a separate requirements-braintrust.txt and does not modify requirements.txt. Either update the description or ensure the intended dependency file is updated so installation/build expectations are clear.

Comment on lines +18 to +22
- name: braintrust-tag-and-push
  image: "{{ .Values.cronJobs.braintrust.image.repository }}:{{ .Values.cronJobs.braintrust.image.tag }}"
  env:
    - name: BRAINTRUST_API_KEY
      valueFrom:

Copilot AI Feb 8, 2026

PR description mentions deploying scripts via ConfigMaps/init containers, but the chart changes shown execute scripts baked into the image and there are no new braintrust script ConfigMaps under helm-chart/sefaria/templates/configmap. Either add the missing templates or update the PR description/docs to match the implemented approach.

Comment on lines +170 to +173
model="claude-3-5-haiku-20241022",
temperature=0,
max_tokens=256,
api_key=api_key

Copilot AI Feb 8, 2026

ChatAnthropic is initialized with parameter names (model/max_tokens/api_key) that differ from other usages in this repo (model_name/max_tokens_to_sample) and may not match langchain-anthropic==0.3.22, causing a runtime TypeError. Align the constructor args with the version used in this codebase (and rely on standard env-based API key handling if supported).

Suggested change (replace):

model="claude-3-5-haiku-20241022",
temperature=0,
max_tokens=256,
api_key=api_key

with:

model_name="claude-3-5-haiku-20241022",
temperature=0,
max_tokens_to_sample=256,

Comment on lines +562 to +566
for row in dataset:
    log_id = row.get("id") or (row.get("input", {}).get("id") if isinstance(row.get("input"), dict) else None)
    if log_id:
        existing_ids.add(str(log_id))


Copilot AI Feb 8, 2026

Deduplication scans the entire dataset via iteration, which will become slow/expensive as datasets grow and can time out the cronjob. Consider a scalable alternative (query only recent rows, maintain an external pushed-ID index, or use a server-side unique key/dedup mechanism if Braintrust supports it).

Comment on lines 9 to 13
spec:
  schedule: "{{ .Values.cronJobs.braintrust.tagAndPush.schedule }}"
  jobTemplate:
    spec:
      backoffLimit: 1

Copilot AI Feb 8, 2026

No concurrencyPolicy is set, so if a run takes longer than the schedule interval, multiple runs can overlap and double-tag/double-push logs. Consider setting concurrencyPolicy: Forbid (and optionally startingDeadlineSeconds) for safer operation.

Comment on lines +196 to +200
braintrust:
  enabled: true
  image:
    repository:
    tag:

Copilot AI Feb 8, 2026

PR description says the Braintrust cronjobs are disabled by default, but production-values.yaml enables them (and both sub-jobs) for production. Either keep them disabled here until explicitly enabled, or update the PR description to reflect that production will run them after deploy.
