From b21e49e9cb9b6658aaaa3daf35a0720e1379f448 Mon Sep 17 00:00:00 2001
From: Shashank Srikanth
Date: Mon, 23 Feb 2026 15:27:22 -0800
Subject: [PATCH] Update GSOC Project ideas

---
 docs/internals/gsoc-2026.md | 415 +++++++-----------------------------
 1 file changed, 72 insertions(+), 343 deletions(-)

diff --git a/docs/internals/gsoc-2026.md b/docs/internals/gsoc-2026.md
index cebc8f06..fd1c4ed9 100644
--- a/docs/internals/gsoc-2026.md
+++ b/docs/internals/gsoc-2026.md
@@ -26,155 +26,6 @@ To contact specific mentors, you can tag them in threads within #gsoc:
 
 ---
 
-## Open Source Metaflow Functions: Relocatable Compute with Ray and FastAPI Backends
-
-**Difficulty:** Medium/Advanced
-
-**Duration:** 350 hours (Large project)
-
-**Technologies:** Python, Metaflow, Ray, FastAPI
-
-**Mentors:** Shashank, Nissan
-
-### Description
-
-Metaflow Functions is a construct that enables relocatable compute; the
-ability to package a computation along with its dependencies, environment, and
-bound artifacts into a self-contained unit that can be deployed anywhere.
-The core implementation already exists and has been
-[presented publicly](https://www.infoq.com/presentations/ml-netflix/).
-
-The `@function` decorator solves a key pain point in ML workflows:
-dependency management across the training-to-serving boundary. When you train a
-model in a Metaflow flow, the function captures the exact environment
-(Python version, packages, custom code) and binds it with
-[task](/api/client#task) artifacts.
-The resulting package can be loaded and executed in a completely different
-process or machine without the caller needing to reconstruct the original
-environment.
-
-The goal of this project would be to open-source Metaflow
-Functions for the broader community by implementing two production-ready backends:
-- **Ray backend** for distributed batch/offline inference
-- **FastAPI backend** for real-time online serving
-
-See [Expected API](#expected-api) below for code examples.
-
-### Goals
-
-1. **Open source the @function primitive** - Create a new Metaflow extension
-(`metaflow-functions`) that implements the `@function` decorator and
-`JsonFunction` binding.
-
-2. **Ray backend for offline serving** - Deploy functions to Ray for scalable
-batch inference.
-
-3. **FastAPI backend for online serving** - Wrap functions as HTTP endpoints for
-real-time inference with automatic OpenAPI documentation and request validation.
-
-4. \[Stretch Goal\] **Serialization framework** - Pluggable serialization
- supporting common
-formats (JSON, Avro, custom) so functions can accept and return data appropriate
-to their deployment context.
-
-### Deliverables
-
-- Core `@function` decorator adapted for open source Metaflow
-- Function packaging and export to portable formats (local filesystem, S3)
-- Ray backend with configurable resource allocation
-- FastAPI backend with automatic OpenAPI schema generation
-- Documentation and end-to-end examples
-- Test suite
-
-### Why This Matters
-
-**For users:**
-- **Eliminate the training-serving gap** - Deploy models with the exact same
-  environment used during training, eliminating "works in training, breaks in
-  production" issues
-- **Simplify ML deployment** - No need to manually recreate environments or
-  manage dependency versions across teams
-- **Flexible deployment targets** - Same function works for batch inference
-  (Ray) and real-time serving (FastAPI) without code changes
-
-**For the contributor:**
-- Work on a production-proven system used at Netflix scale
-- Gain deep experience with ML deployment patterns and challenges
-- Learn Ray for distributed computing and FastAPI for API development
-
-### Skills Required
-
-- Python (intermediate/advanced)
-- Ray
-- FastAPI
-
-### Links
-
-- [Metaflow Functions Talk (InfoQ)](https://www.infoq.com/presentations/ml-netflix/)
-- [Existing Implementation](https://github.com/Netflix/metaflow_rc/tree/master/nflx-metaflow-function)
-- [Metaflow Documentation](https://docs.metaflow.org)
-- [Metaflow Extensions Template](https://github.com/Netflix/metaflow-extensions-template)
-
-### Expected API
-
-#### 1. Creating a Function
-
-Define a function using the `@json_function` decorator:
-
-```python
-from metaflow import json_function, FunctionParameters
-
-@json_function
-def predict(data: dict, params: FunctionParameters) -> dict:
-    """Run inference using the bound model."""
-    features = [data[f] for f in params.feature_names]
-    prediction = params.model.predict([features])[0]
-    return {"prediction": int(prediction)}
-```
-
-The function receives:
-- `data`: JSON-serializable input (dict, list, str, etc.)
-- `params`: Access to artifacts from the bound task
-
-#### 2. Binding to a Task
-
-Bind the function to a completed task to capture its environment and artifacts:
-
-```python
-from metaflow import JsonFunction, Task
-
-task = Task("MyTrainFlow/123/train/456")
-inference_fn = JsonFunction(predict, task=task)
-
-# Export portable reference
-reference = inference_fn.reference
-```
-
-#### 3. Deploying with Ray (Batch Inference)
-
-```python
-from metaflow import function_from_json
-
-fn = function_from_json(reference, backend="ray")
-results = [fn(record) for record in batch_data]
-```
-
-#### 4. Deploying with FastAPI (Real-time Serving)
-
-```python
-from fastapi import FastAPI
-from metaflow import function_from_json
-
-app = FastAPI()
-fn = function_from_json(reference)
-
-@app.post("/predict")
-def predict(payload: dict):
-    return fn(payload)
-```
-
----
-
 ## Metaflow CI/CD: Kubernetes Integration Testing with GitHub Actions
 
 **Difficulty:** Easy
@@ -534,169 +385,6 @@ filesystem restrictions, and resource limits for sandboxed execution.
 
 ---
 
-## Confidential Computing with Trusted Execution Environments
-
-**Difficulty:** Advanced
-
-**Duration:** 350 hours (Large project)
-
-**Technologies:** Python, Gramine/SGX, Phala Cloud, Metaflow
-
-**Mentors:** Nissan, Madhur
-
-### Description
-
-Machine learning workflows often process sensitive data: medical records,
-financial transactions, proprietary models. Traditional isolation (containers,
-VMs) protects against external attackers but not against the infrastructure
-operator. Trusted Execution Environments (TEEs) provide hardware-level
-isolation where even the cloud provider cannot access the computation.
-
-TEE adoption has historically been difficult due to complex tooling, but
-platforms like [Gramine](https://gramine.readthedocs.io/) (open source,
-runs locally in simulation mode) and
-[Phala Cloud](https://phala.com/) (managed TEE infrastructure with free
-credits for developers) have made confidential computing more accessible.
-
-This project adds a `@confidential` decorator that executes Metaflow steps
-inside TEEs. Development and testing use Gramine's simulation mode locally;
-production deployment targets Phala Cloud or other TEE providers.
-
-### Goals
-
-1. **`@confidential` decorator** - Mark steps for execution inside a TEE
-with attestation verification.
-
-2. **Gramine backend for local development** - Run steps in Gramine-SGX
-simulation mode, allowing development and testing without TEE hardware.
-
-3. **Phala Cloud backend for production** - Deploy confidential steps to
-Phala's managed TEE infrastructure.
-
-4. **Attestation verification** - Verify TEE attestation reports before
-trusting computation results.
-
-5. \[Stretch Goal\] **Encrypted artifact storage** - Encrypt artifacts at rest
-with keys sealed to the TEE, ensuring only attested enclaves can decrypt them.
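The goals above describe a decorator with swappable TEE backends. A minimal, illustrative sketch of how such a pluggable-backend `@confidential` decorator might be wired up is shown below; the decorator name comes from this proposal, while `register_backend`, the backend classes, and the `"gramine-sim"` label are hypothetical and not an existing Metaflow interface:

```python
# Hypothetical sketch of a pluggable-backend @confidential decorator.
# Backend names and the registration API are illustrative only.

BACKENDS = {}

def register_backend(name):
    """Register a TEE backend class under a short name."""
    def wrap(cls):
        BACKENDS[name] = cls
        return cls
    return wrap

@register_backend("gramine-sim")
class GramineSimBackend:
    def run(self, step_fn):
        # Simulation mode: no TEE hardware, so no real attestation report.
        return {"result": step_fn(), "attested": False}

@register_backend("phala")
class PhalaBackend:
    def run(self, step_fn):
        # A real backend would submit to Phala Cloud and verify the
        # attestation report before trusting the result (omitted here).
        raise NotImplementedError("requires Phala Cloud credentials")

def confidential(backend="gramine-sim"):
    """Mark a step for execution inside the named TEE backend."""
    def wrap(step_fn):
        def run():
            return BACKENDS[backend]().run(step_fn)
        return run
    return wrap

@confidential(backend="gramine-sim")
def train():
    return "model-v1"

print(train())  # {'result': 'model-v1', 'attested': False}
```

The point of the registry is that the Gramine simulation backend and the Phala backend expose the same `run` contract, so a step can move from local testing to managed TEE infrastructure by changing only the backend name.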
-
-### Deliverables
-
-- `@confidential` decorator with pluggable backend architecture
-- Gramine simulation backend for local testing
-- Phala Cloud backend with deployment automation
-- Attestation verification utilities
-- Documentation covering threat model and security properties
-- Test suite (simulation mode)
-- Example flow demonstrating confidential ML inference
-
-### Why This Matters
-
-**For users:**
-- **Process sensitive data safely** - Run ML on medical, financial, or
-  proprietary data with hardware-level protection
-- **Zero-trust infrastructure** - Even cloud providers cannot access your
-  computation or data
-- **Compliance enablement** - Meet regulatory requirements (HIPAA, GDPR) for
-  data processing
-- **Verifiable computation** - Attestation proves code ran in a secure enclave
-  without tampering
-
-**For the contributor:**
-- Learn cutting-edge confidential computing technology (TEEs, SGX, attestation)
-- Work with emerging cloud infrastructure (confidential VMs are becoming
-  mainstream)
-- Build expertise applicable to blockchain, secure enclaves, and privacy tech
-
-### Skills Required
-
-- Python (intermediate/advanced)
-- Basic understanding of TEE concepts (SGX, attestation)
-- Docker/containerization
-- Familiarity with Metaflow decorators
-
-### Links
-
-- [Gramine Documentation](https://gramine.readthedocs.io/)
-- [Phala Cloud](https://phala.com/)
-- [Phala Cloud Pricing](https://phala.com/pricing) ($20 free credits)
-- [Intel SGX Overview](https://www.intel.com/content/www/us/en/architecture-and-technology/software-guard-extensions.html)
-- [Metaflow Extensions Template](https://github.com/Netflix/metaflow-extensions-template)
-- [Confidential Computing Consortium](https://confidentialcomputing.io/)
-
----
-
-## Metaflow Nomad Integration
-
-**Difficulty:** Medium
-
-**Duration:** 350 hours (Large project)
-
-**Technologies:** Python, HashiCorp Nomad, Metaflow
-
-**Mentors:** Madhur
-
-### Description
-
-Metaflow supports various compute backends for executing steps remotely: `@kubernetes`, `@batch` (AWS Batch), and community extensions like [`@slurm`](https://github.com/outerbounds/metaflow-slurm) for HPC clusters. However, many organizations use [HashiCorp Nomad](https://www.nomadproject.io/) as their workload orchestrator — a lightweight alternative to Kubernetes that's simpler to operate and supports diverse workload types (containers, VMs, binaries).
-
-Nomad is particularly popular in organizations already using HashiCorp's stack (Vault, Consul) and in edge computing scenarios where Kubernetes' complexity is overkill. Despite this, there's currently no way to run Metaflow steps on Nomad clusters.
-
-This project aims to implement a `@nomad` decorator that executes Metaflow steps as Nomad jobs, bringing Metaflow's workflow capabilities to the Nomad ecosystem. The [`@slurm` extension](https://github.com/outerbounds/metaflow-slurm) provides a reference implementation for integrating custom compute backends.
-
-### Goals
-
-1. **`@nomad` decorator** - Execute Metaflow steps as Nomad batch jobs with basic resource configuration (CPU, memory).
-2. **Docker task driver support** - Run steps in Docker containers, similar to how `@kubernetes` and `@batch` work.
-3. **Job submission and monitoring** - Submit jobs to Nomad, poll for completion, and retrieve exit codes.
-4. **Log streaming** - Capture and display stdout/stderr from Nomad allocations in the Metaflow CLI.
-5. **Basic retry support** - Integrate with Metaflow's `@retry` decorator to resubmit failed jobs.
-6. \[Stretch Goal\] **Exec driver support** - Support Nomad's exec driver for running binaries directly without containers.
-7. \[Stretch Goal\] **GPU resource allocation** - Support GPU constraints using Nomad's device plugins.
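Goals 1-3 essentially amount to translating a step's resource requests into a Nomad batch job and driving it through Nomad's HTTP jobs API. The sketch below builds a minimal job payload of the shape that API expects; the helper name and all concrete values (job ID, image, datacenter) are illustrative, while the field names follow Nomad's JSON jobs API:

```python
import json

def nomad_batch_job(job_id, image, command, cpu_mhz=500, memory_mb=256):
    """Build a minimal Nomad batch-job payload (JSON jobs API shape)."""
    return {
        "Job": {
            "ID": job_id,
            "Name": job_id,
            "Type": "batch",  # run-to-completion, like a Metaflow task
            "Datacenters": ["dc1"],
            "TaskGroups": [
                {
                    "Name": "step",
                    "Count": 1,
                    "Tasks": [
                        {
                            "Name": "metaflow-step",
                            "Driver": "docker",
                            "Config": {
                                "image": image,
                                "command": command[0],
                                "args": command[1:],
                            },
                            # Nomad expresses CPU in MHz, memory in MB
                            "Resources": {"CPU": cpu_mhz, "MemoryMB": memory_mb},
                        }
                    ],
                }
            ],
        }
    }

job = nomad_batch_job(
    "mf-myflow-train-456",
    "python:3.11-slim",
    ["python", "-c", "print('hello from a step')"],
)
print(json.dumps(job, indent=2))
# A real backend would register this payload via Nomad's jobs API and then
# poll the job's allocations until the task completes.
```

Monitoring (goal 3) and log streaming (goal 4) would then be built on top of the same API, by polling the allocations of the registered job and fetching their stdout/stderr.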
-
-### Deliverables
-
-- `@nomad` decorator implementation following Metaflow extension patterns
-- Nomad job submission and monitoring backend
-- Docker task driver support
-- Basic resource configuration (CPU, memory)
-- Log streaming from Nomad allocations
-- Documentation with setup guide and basic examples
-- Test scenarios covering job submission, execution, and failures
-- Example flows demonstrating Docker-based execution
-
-### Why This Matters
-
-**For users:**
-- **Use existing Nomad infrastructure** - Leverage Nomad clusters without needing Kubernetes or cloud batch services
-- **Simpler operations** - Nomad's lightweight architecture reduces operational complexity compared to Kubernetes
-- **HashiCorp ecosystem integration** - Natural fit for teams already using Vault, Consul, or Terraform
-- **Edge and hybrid deployments** - Run ML workflows on edge infrastructure where Kubernetes is too heavy
-
-**For the contributor:**
-- Learn HashiCorp Nomad—increasingly popular in the infrastructure space
-- Understand how to extend Metaflow with custom compute backends (applicable to other schedulers)
-- Gain experience with job orchestration, lifecycle management, and failure handling
-- Work with a real-world reference implementation (`@slurm`) as a guide
-- Build a foundation that the community can enhance with advanced features later
-
-### Skills Required
-
-- Python (intermediate)
-- Basic familiarity with HashiCorp Nomad
-- Docker
-- Understanding of Metaflow decorators (or willingness to learn)
-
-### Links
-
-- [HashiCorp Nomad Documentation](https://www.nomadproject.io/docs)
-- [Nomad Jobs API](https://developer.hashicorp.com/nomad/api-docs/jobs)
-- [Metaflow Slurm Extension (Reference)](https://github.com/outerbounds/metaflow-slurm)
-- [Metaflow Extensions Template](https://github.com/Netflix/metaflow-extensions-template)
-- [Metaflow Step Decorators](https://docs.metaflow.org/api/step-decorators)
-- [Metaflow Documentation](https://docs.metaflow.org)
-
----
-
 ## Metadata service request improvements
 
 **Difficulty:** Easy
@@ -725,37 +413,6 @@ Resources can also be filtered by tags in the Metaflow client. This is currently
 
 ---
 
-## Metaflow-services eventing rework to a message broker architecture
-
-**Difficulty**: Hard
-
-**Duration**: 300 hours (Large project)
-
-**Technologies**: Python, Docker, PostgreSQL, Language of choice (f.ex. Rust/Go)
-
-**Mentors**: Sakari Ikonen
-
-### Description
-
-The current backend architecture relies heavily on PostgreSQL features for broadcasting and subscribing to database events (INSERT/UPDATE) in order to be able to provide real-time updates. This is a hard vendor-lock to PostgreSQL which is imposed by the architecture choice. The messaging mechanism in the database has proven to fall short in high-volume deployments more than once, so exploring alternatives to this is expected to be beneficial.
-
-As all data insertion and updates are handled by the metadata-service, and currently the only service that is interested in the events is the ui_backend service, a simple message broker between these two services should be the most straightforward solution.
-
-### Considerations
-
-Some considerations for the implementation are
-- The usual ui backend db is a replica. If the events come off a broker that receives its messages based on inserts on a main db, then there is no guarantee that the replica is up-to-date when the message gets processed. Therefore some retry logic needs to be introduced on top of the message handling
-- The volume of messages is significant on large deployments, so performance of the broker is of utmost importance
-- Messages need to have some guarantee of in-order arrival within certain scopes (flow level for runs, run level for tasks etc.)
-
-### Goals
-
-- Develop a PoC message broker service that metadata-service can publish messages to, and ui_backend can subscribe to topic in order to receive only messages of interest.
-- Completely replace currently used LISTEN/NOTIFY mechanism in favour of message broker service.
-- Being able to deploy ui service with a pure read-replica instead of a logical replica
-
----
-
 ## Jupyter-Native Metaflow
 
 **Difficulty:** Medium
@@ -853,6 +510,78 @@ entire DAG at once).
 
 ---
 
+## Metaflow Nomad Integration
+
+**Difficulty:** Medium
+
+**Duration:** 350 hours (Large project)
+
+**Technologies:** Python, HashiCorp Nomad, Metaflow
+
+**Mentors:** Madhur
+
+### Description
+
+Metaflow supports various compute backends for executing steps remotely: `@kubernetes`, `@batch` (AWS Batch), and community extensions like [`@slurm`](https://github.com/outerbounds/metaflow-slurm) for HPC clusters. However, many organizations use [HashiCorp Nomad](https://www.nomadproject.io/) as their workload orchestrator — a lightweight alternative to Kubernetes that's simpler to operate and supports diverse workload types (containers, VMs, binaries).
+
+Nomad is particularly popular in organizations already using HashiCorp's stack (Vault, Consul) and in edge computing scenarios where Kubernetes' complexity is overkill. Despite this, there's currently no way to run Metaflow steps on Nomad clusters.
+
+This project aims to implement a `@nomad` decorator that executes Metaflow steps as Nomad jobs, bringing Metaflow's workflow capabilities to the Nomad ecosystem. The [`@slurm` extension](https://github.com/outerbounds/metaflow-slurm) provides a reference implementation for integrating custom compute backends.
+
+### Goals
+
+1. **`@nomad` decorator** - Execute Metaflow steps as Nomad batch jobs with basic resource configuration (CPU, memory).
+2. **Docker task driver support** - Run steps in Docker containers, similar to how `@kubernetes` and `@batch` work.
+3. **Job submission and monitoring** - Submit jobs to Nomad, poll for completion, and retrieve exit codes.
+4. **Log streaming** - Capture and display stdout/stderr from Nomad allocations in the Metaflow CLI.
+5. **Basic retry support** - Integrate with Metaflow's `@retry` decorator to resubmit failed jobs.
+6. \[Stretch Goal\] **Exec driver support** - Support Nomad's exec driver for running binaries directly without containers.
+7. \[Stretch Goal\] **GPU resource allocation** - Support GPU constraints using Nomad's device plugins.
+
+### Deliverables
+
+- `@nomad` decorator implementation following Metaflow extension patterns
+- Nomad job submission and monitoring backend
+- Docker task driver support
+- Basic resource configuration (CPU, memory)
+- Log streaming from Nomad allocations
+- Documentation with setup guide and basic examples
+- Test scenarios covering job submission, execution, and failures
+- Example flows demonstrating Docker-based execution
+
+### Why This Matters
+
+**For users:**
+- **Use existing Nomad infrastructure** - Leverage Nomad clusters without needing Kubernetes or cloud batch services
+- **Simpler operations** - Nomad's lightweight architecture reduces operational complexity compared to Kubernetes
+- **HashiCorp ecosystem integration** - Natural fit for teams already using Vault, Consul, or Terraform
+- **Edge and hybrid deployments** - Run ML workflows on edge infrastructure where Kubernetes is too heavy
+
+**For the contributor:**
+- Learn HashiCorp Nomad—increasingly popular in the infrastructure space
+- Understand how to extend Metaflow with custom compute backends (applicable to other schedulers)
+- Gain experience with job orchestration, lifecycle management, and failure handling
+- Work with a real-world reference implementation (`@slurm`) as a guide
+- Build a foundation that the community can enhance with advanced features later
+
+### Skills Required
+
+- Python (intermediate)
+- Basic familiarity with HashiCorp Nomad
+- Docker
+- Understanding of Metaflow decorators (or willingness to learn)
+
+### Links
+
+- [HashiCorp Nomad Documentation](https://www.nomadproject.io/docs)
+- [Nomad Jobs API](https://developer.hashicorp.com/nomad/api-docs/jobs)
+- [Metaflow Slurm Extension (Reference)](https://github.com/outerbounds/metaflow-slurm)
+- [Metaflow Extensions Template](https://github.com/Netflix/metaflow-extensions-template)
+- [Metaflow Step Decorators](https://docs.metaflow.org/api/step-decorators)
+- [Metaflow Documentation](https://docs.metaflow.org)
+
+---
+
 ## Agent-Friendly Metaflow Client: Analyzing and Addressing Client API Inefficiencies
 
 **Difficulty:** Hard