This site documents my journey through Google Summer of Code 2025 with Google DeepMind on the Gemma project. I built tools for evaluating large language models, focusing on systematic benchmarking and domain-specific assessment.
This repository is hosted at haileycheng.com/DeepMind/
How I Landed a Google DeepMind Project in Google Summer of Code 2025: A Step-by-Step Guide
- May 7: Selected by Google DeepMind for the Gemma project.
- May 8: Rejections from two other orgs, leading me here.
My proposals are public for anyone curious about the process.
My submission to DeepMind consisted of:
- A proposal (PDF attached)
- A blog post under the demo tag in the Gemma repo: google-deepmind/gemma#244
Good luck with your GSoC 2026 application.
Repository: github.com/heilcheng/openevals
Documentation: haileycheng.com/openevals
OpenEvals is a framework for LLM evaluation that provides standardized benchmarking across academic tasks.
Functionality:
- Runs standard benchmarks: MMLU, GSM8K, MATH, HumanEval, ARC, TruthfulQA
- Compares model families: Gemma, Llama, Mistral, Qwen, DeepSeek, and other models hosted on Hugging Face
- Measures efficiency: latency, throughput, memory (a profiling sketch follows below)
- Runs statistical analyses with confidence intervals (see the sketch right after this list)
- Produces publication-ready visualizations
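As a sketch of the statistical reporting mentioned above, the snippet below computes a benchmark accuracy with a percentile-bootstrap confidence interval. It is illustrative only: the function names and the fake per-item results are mine, not the OpenEvals API.

```python
# Minimal sketch of a benchmark accuracy report with a bootstrap confidence
# interval. Illustrative only; this is not the OpenEvals API.
import random
from typing import List, Tuple


def accuracy(outcomes: List[bool]) -> float:
    """Fraction of benchmark items the model answered correctly."""
    return sum(outcomes) / len(outcomes)


def bootstrap_ci(
    outcomes: List[bool],
    n_resamples: int = 10_000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy over per-item pass/fail outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    scores = sorted(
        accuracy([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo = scores[int((alpha / 2) * n_resamples)]
    hi = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


if __name__ == "__main__":
    # Fake per-item results for one model on one benchmark (e.g. GSM8K).
    demo_rng = random.Random(42)
    results = [demo_rng.random() < 0.62 for _ in range(500)]
    point = accuracy(results)
    low, high = bootstrap_ci(results)
    print(f"accuracy = {point:.3f} (95% CI: {low:.3f} to {high:.3f})")
```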
Significance:
LLM evaluation today is fragmented; OpenEvals unifies it with consistent benchmarks and reproducible results.
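Here is the profiling sketch referenced in the efficiency bullet: a minimal way to record per-request latency, throughput, and peak Python-heap memory around any generation callable. `profile_generation` and `generate_fn` are hypothetical stand-ins, not OpenEvals functions; real model runs would also need GPU or native memory tracking.

```python
# Minimal sketch of the kind of efficiency measurement the framework reports:
# latency per request, throughput, and peak memory. Not the OpenEvals API;
# `generate_fn` stands in for any model's text-generation callable.
import time
import tracemalloc
from statistics import mean
from typing import Callable, List


def profile_generation(generate_fn: Callable[[str], str], prompts: List[str]) -> dict:
    """Time each call, then summarise latency, throughput, and peak memory.

    tracemalloc only tracks Python-heap allocations; GPU or native memory
    would need e.g. torch.cuda.max_memory_allocated() or psutil instead.
    """
    tracemalloc.start()
    latencies = []
    for prompt in prompts:
        start = time.perf_counter()
        generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    total = sum(latencies)
    return {
        "mean_latency_s": mean(latencies),
        "throughput_req_per_s": len(prompts) / total,
        "peak_python_mem_mb": peak_bytes / 1e6,
    }


if __name__ == "__main__":
    def fake_model(prompt: str) -> str:
        # Stand-in "model" so the sketch runs without any LLM dependency.
        return prompt[::-1]

    stats = profile_generation(fake_model, ["What is 2 + 2?"] * 100)
    print(stats)
```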
Repository: github.com/heilcheng/medexplain-evals
Documentation: haileycheng.com/medexplain-evals
MedExplain-Evals is a domain-specific framework for assessing how well models explain medical information to non-experts.
Functionality:
- Evaluates medical explanation tasks
- Measures accuracy, clarity, and safety (a clarity sketch follows this list)
- Provides specialized benchmarks
- Interactive web interface
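The clarity sketch referenced above: one generic signal a patient-facing evaluation can report is a readability score such as Flesch Reading Ease. This is a stand-in heuristic, not the MedExplain-Evals scoring code; accuracy and safety checks need clinician-reviewed references rather than a formula.

```python
# Minimal sketch of one "clarity" signal: Flesch Reading Ease over a model's
# explanation. A generic readability heuristic, not MedExplain-Evals code.
import re


def count_syllables(word: str) -> int:
    """Rough syllable count: number of vowel groups, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_reading_ease(text: str) -> float:
    """Higher is easier to read; roughly 60-70 is plain English, below 30 is dense."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


if __name__ == "__main__":
    jargon = ("Hypertension is a chronic elevation of systemic arterial "
              "pressure predisposing to cerebrovascular events.")
    plain = ("High blood pressure means your blood pushes too hard on your "
             "blood vessels. Over time this can cause a stroke.")
    print(f"jargon explanation: {flesch_reading_ease(jargon):.1f}")
    print(f"plain explanation:  {flesch_reading_ease(plain):.1f}")
```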
Significance:
General-purpose benchmarks miss medical nuance, and medical misinformation causes real harm, so patient-facing applications need targeted evaluation.
Related GSoC resources:
- GSoC Guide: Comprehensive platform with tips and resources.
- GSoC 2025 Proposals Archive: Archive of 120+ accepted proposals.
- GSoC Organizations: Search and filter participating orgs.
Original proposal submitted to Google DeepMind:
License: MIT