
Conversation

@realmarcin (Collaborator)

No description provided.

realmarcin and others added 2 commits January 13, 2026 10:57
## Rubric Updates (v2.1)

Updated rubric10.txt and rubric20.txt to address 23 stakeholder review comments:
- Add W3C PROV-O provenance support with graph-based validation (see the sketch after this list)
- Align with Bridge2AI AI/ML readiness criteria and the bioRxiv article
- Add data sustainability indicators (DOI, governance, repositories)
- Support graph-based metadata representations (preprocessing, collection)
- Correct license terminology (replace misleading "Open"/"Public" terms)
- Clarify multimodal definitions for Bridge2AI datasets
- Expand sensitive data examples (voice, activity, retinal images)
- Add dataset merging/integration capability assessment
- Make citation mandatory for all Bridge2AI datasets
- Add hosting platform identification field
- Document review comments and responses in REVIEW_COMMENTS_RESPONSE_REPORT.md
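
For the first bullet, a rough sketch of what graph-based PROV-O validation can look like using rdflib; the input file name and the specific predicate checks are illustrative assumptions, not the rubric's actual implementation:

```python
# Minimal sketch of graph-based PROV-O validation, assuming dataset
# provenance is serialized as RDF/Turtle. The file name and the two
# predicate checks below are illustrative assumptions.
from rdflib import Graph, Namespace

PROV = Namespace("http://www.w3.org/ns/prov#")

g = Graph()
g.parse("dataset_provenance.ttl", format="turtle")  # hypothetical file

# A dataset entity should trace back to at least one source entity
# (wasDerivedFrom) and one generating activity (wasGeneratedBy).
has_derivation = any(g.triples((None, PROV.wasDerivedFrom, None)))
has_activity = any(g.triples((None, PROV.wasGeneratedBy, None)))

print(f"wasDerivedFrom present: {has_derivation}")
print(f"wasGeneratedBy present: {has_activity}")
```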

## D4D Evaluations

Regenerated deterministic evaluations for all 4 Bridge2AI projects using claude-sonnet-4-5-20250929 (temperature=0.0):
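
For reference, a run of this kind can be reproduced with the Anthropic Python SDK roughly as follows; the prompt wording and input file names are illustrative assumptions, not the project's actual evaluation script:

```python
# Hedged sketch of a deterministic evaluation call via the Anthropic
# Python SDK; the prompt construction and input files are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env

rubric = open("data/rubric/rubric10.txt").read()
d4d = open("AI_READI_d4d.json").read()  # hypothetical input file

response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=4096,
    temperature=0.0,  # as-deterministic-as-possible sampling
    messages=[{
        "role": "user",
        "content": f"Score this datasheet against the rubric.\n\n"
                   f"RUBRIC:\n{rubric}\n\nDATASHEET:\n{d4d}",
    }],
)
print(response.content[0].text)
```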

**Corrected Rubric10 Rankings (50 points max):**
1. AI_READI: 44/50 (88.0%) - Grade A-
2. VOICE: 40/50 (80.0%) - Grade B+
3. CHORUS: 38/50 (76.0%) - Grade B
4. CM4AI: 36/50 (72.0%) - Grade C+

**Corrected Rubric20 Rankings (84 points max):**
1. CHORUS: 78/84 (92.9%) - Grade A
2. CM4AI: 68/84 (81.0%) - Grade B+
3. AI_READI: 44/84 (52.4%) - Grade D

**Score Validation:**
- Fixed LLM math errors in 3 files (CM4AI: -8 points, VOICE: -6 points, CM4AI rubric20: +1 point)
- Post-processing validation remains essential even at temperature=0.0 (a minimal check is sketched below)
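
A minimal sketch of such a post-processing check, recomputing the total from the per-question scores; the JSON field names here are assumptions about the evaluation format, not its documented schema:

```python
# Recompute the total score and compare it to the LLM-reported total.
# Field names ("questions", "score", "total_score") are hypothetical.
import json

with open("CM4AI_claudecode_agent_evaluation.json") as f:
    evaluation = json.load(f)

computed = sum(q["score"] for q in evaluation["questions"])
reported = evaluation["total_score"]

if computed != reported:
    print(f"Mismatch: reported {reported}, recomputed {computed} "
          f"({computed - reported:+d} points)")
    evaluation["total_score"] = computed  # correct the LLM math error
```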

**Files Updated:**
- Evaluation JSONs (rubric10 and rubric20 for all projects)
- HTML renderings (8 files: 4 rubric10 + 4 rubric20)
- EVALUATION_SUMMARY_2026-01-12.md - Comprehensive analysis
- COMPARISON_TABLES_2026-01-12.md - Cross-rubric comparison tables

**Key Findings:**
- Universal strengths: Scientific Motivation (5/5) and Technical Transparency (5/5) across all projects
- Universal weakness: Limitations Disclosure (avg 2.3/5) - all projects missing known_limitations, known_biases, anomalies
- Schema compliance: 25% of evaluation files pass current schema validation (5/20 files); a compliance-check sketch follows this list
- Rank reversal: CHORUS ranks 3rd in Rubric10 but 1st in Rubric20
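
A sketch of how the schema-compliance figure can be reproduced with the jsonschema package; the schema path and glob pattern are assumptions, not the repository's actual layout:

```python
# Validate each evaluation JSON against a JSON Schema and count passes.
# The schema path and glob pattern below are hypothetical.
import json
from pathlib import Path

from jsonschema import ValidationError, validate

schema = json.loads(Path("schema/evaluation_schema.json").read_text())

files = sorted(Path("data/evaluation_llm").rglob("*_evaluation.json"))
passing = 0
for path in files:
    try:
        validate(instance=json.loads(path.read_text()), schema=schema)
        passing += 1
    except ValidationError as err:
        print(f"{path.name}: {err.message}")

print(f"{passing}/{len(files)} files pass schema validation")
```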

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated render_evaluation_html_rubric10_semantic.py to generate files with a
consistent naming convention matching the rubric20 files.

**Changes:**
- Modified script to generate `{PROJECT}_evaluation_rubric10.html` instead of `{PROJECT}_evaluation.html`
- Removed old HTML files without suffix
- Regenerated all 4 rubric10 evaluation HTML files with correct naming

**File naming convention (now consistent):**
- Rubric10: `{PROJECT}_evaluation_rubric10.html`
- Rubric20: `{PROJECT}_evaluation_rubric20.html`
- D4D Human-Readable: `{PROJECT}_d4d_human_readable.html`

**Files affected:**
- scripts/render_evaluation_html_rubric10_semantic.py (line 642)
- AI_READI_evaluation_rubric10.html (renamed from AI_READI_evaluation.html)
- CHORUS_evaluation_rubric10.html (renamed from CHORUS_evaluation.html)
- CM4AI_evaluation_rubric10.html (renamed from CM4AI_evaluation.html)
- VOICE_evaluation_rubric10.html (renamed from VOICE_evaluation.html)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment:

Pull request overview

This pull request updates the D4D rubric evaluation system from version 2.0 to 2.1, incorporating comprehensive review comments and regenerating evaluations for Bridge2AI datasets. The changes include rubric enhancements (W3C PROV-O provenance, AI/ML readiness criteria, sustainability indicators), updated evaluation outputs, and extensive documentation of the review process and results.

Changes:

  • Updated rubrics (v2.0 → v2.1) with 23 review comments addressed across rubric10.txt and rubric20.txt
  • Regenerated evaluation JSONs for CM4AI and VOICE with updated scoring methodology
  • Added comprehensive documentation (REVIEW_COMMENTS_RESPONSE_REPORT.md, EVALUATION_SUMMARY, COMPARISON_TABLES)
  • Updated HTML rendering script to include "_rubric10" suffix in output filenames
  • Refreshed HTML evaluation reports with new timestamps

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
|------|-------------|
| scripts/render_evaluation_html_rubric10_semantic.py | Updated output filename pattern to include "_rubric10" suffix |
| data/rubric/rubric20.txt | Schema v2.1 update with expanded field guide, revised questions (Q3, Q4, Q6, Q8-Q20) |
| data/rubric/rubric10.txt | Schema v2.1 update with expanded field guide, revised elements (3, 8, 10) |
| data/rubric/REVIEW_COMMENTS_RESPONSE_REPORT.md | New comprehensive 307-line documentation of 23 review comments |
| data/evaluation_llm/.../CM4AI_claudecode_agent_evaluation.json (rubric20) | Version 2.1 re-evaluation with score changes (82→68 points) |
| data/evaluation_llm/.../VOICE_claudecode_agent_evaluation.json (rubric10) | Version 2.1 re-evaluation with score changes (44→40 points) |
| data/evaluation_llm/.../CM4AI_claudecode_agent_evaluation.json (rubric10) | Version 2.1 re-evaluation with score changes (48→36 points) |
| data/evaluation_llm/EVALUATION_SUMMARY_2026-01-12.md | New 414-line summary report of evaluation results |
| data/evaluation_llm/COMPARISON_TABLES_2026-01-12.md | New 335-line comparative analysis tables |
| data/d4d_html/.../VOICE_evaluation_rubric10.html | New HTML rendering (1131 lines) |
| data/d4d_html/.../_evaluation.html | Timestamp updates (2025 dates → 2026-01-12/13) |


Copilot AI (Contributor) left a comment:

Copilot reviewed 21 out of 22 changed files in this pull request and generated 1 comment.

```diff
 # Generate output filename
 project_name = eval_file.stem.replace('_claudecode_agent_evaluation', '')
-output_path = output_dir / f"{project_name}_evaluation.html"
+output_path = output_dir / f"{project_name}_evaluation_rubric10.html"
```
Copilot AI commented on Jan 14, 2026:

The output filename pattern has been changed from {project_name}_evaluation.html to {project_name}_evaluation_rubric10.html. This change should be documented and any dependent scripts or documentation should be updated to reflect this new naming convention.

Clean up old evaluation HTML files that were replaced with properly
named versions containing the _rubric10 suffix.

Files removed:
- AI_READI_evaluation.html → AI_READI_evaluation_rubric10.html
- CHORUS_evaluation.html → CHORUS_evaluation_rubric10.html
- CM4AI_evaluation.html → CM4AI_evaluation_rubric10.html
- VOICE_evaluation.html → VOICE_evaluation_rubric10.html

All projects now use consistent naming with explicit rubric type suffixes.
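
A hedged sketch of this cleanup step using pathlib; the directory path is an assumption, and the guard keeps any un-suffixed file that does not yet have a suffixed replacement:

```python
# Remove un-suffixed evaluation HTML files that have been superseded by
# their _rubric10 counterparts. The directory path is hypothetical.
from pathlib import Path

html_dir = Path("data/d4d_html")
for old in html_dir.rglob("*_evaluation.html"):
    new = old.with_name(old.stem + "_rubric10.html")
    if new.exists():  # only delete files with a suffixed replacement
        old.unlink()
        print(f"Removed {old.name} (replaced by {new.name})")
```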

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>