feat: add production-ready MNIST example for PyTorch #3063

Open
Snehadas2005 wants to merge 29 commits into kubeflow:master from Snehadas2005:master

Conversation

@Snehadas2005

@Snehadas2005 Snehadas2005 commented Jan 3, 2026

What this PR does / Why we need it:
This PR updates and refactors the PyTorch Jupyter Notebook examples to support the Kubeflow Trainer V2 SDK. It adds a new Audio Classification workflow and improves existing examples to ensure stability, cross-platform compatibility, and alignment with current SDK features.

Key Changes:

  • Added a new audio classification workflow using the GTZAN dataset with automated data download and standardized preprocessing.
  • Updated speech recognition, image classification, and question answering examples to follow the official V2 training workflow.
  • Migrated all notebooks to use TrainerClient and CustomTrainer APIs.
  • Improved distributed training support on Windows using the gloo backend.
  • Standardized environment variables (DATA_DIR, OUTPUT_DIR) across examples.
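
A minimal sketch of how a training function might read the standardized `DATA_DIR` and `OUTPUT_DIR` variables. The helper name `resolve_dirs` and the fallback paths are assumptions for illustration; the notebooks in this PR may use different defaults.

```python
import os
from pathlib import Path


def resolve_dirs(environ=None):
    """Return (data_dir, output_dir) read from DATA_DIR and OUTPUT_DIR.

    The fallback paths are hypothetical defaults, not values taken from the PR.
    """
    env = os.environ if environ is None else environ
    data_dir = Path(env.get("DATA_DIR", "./data"))
    output_dir = Path(env.get("OUTPUT_DIR", "./output"))
    return data_dir, output_dir


# Example usage inside a training function:
#   data_dir, output_dir = resolve_dirs()
#   output_dir.mkdir(parents=True, exist_ok=True)
```

Passing a mapping instead of relying on `os.environ` keeps the helper easy to exercise in a notebook cell without mutating the process environment.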

Issues Fixed:
Fixes #3062
Fixes #2040
Related PR: #2830

Checklist:

  • Updated PyTorch README
  • Verified local execution
  • Updated .gitignore for new artifacts

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jaiakash
Member

jaiakash commented Jan 3, 2026

Thanks for raising this, @Snehadas2005.

I see it's still a draft PR, but I have a few minor suggestions that should help you.

Happy contributing.

@Snehadas2005
Author

Snehadas2005 commented Jan 4, 2026

Thank you so much, @jaiakash, for the detailed feedback and references. I really appreciate it.

That makes sense. I will convert the example into a Jupyter notebook and align it with the existing example patterns you shared, focusing on clarity and readability for data scientists.

I also appreciate the note on DCO signing. I will fix the commit signatures and ensure all future commits are properly signed.

Thanks again for the guidance, happy to iterate further and adjust based on feedback from the team.

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@coveralls

coveralls commented Jan 12, 2026

Pull Request Test Coverage Report for Build 21773851100

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 51.998%

Totals Coverage Status
Change from base Build 21761068098: 0.0%
Covered Lines: 1288
Relevant Lines: 2477

💛 - Coveralls

Member

@andreyvelich andreyvelich left a comment

Thanks for this @Snehadas2005!
I left a few comments.

@@ -0,0 +1,164 @@
# KEP-2841: Support Flux Framework for HPC in Kubeflow Trainer
Member

Please rebase the PR from the master branch to remove unnecessary changes from this PR.

"name": "#%% md\n"
}
},
"id": "9fb68cb1",
Member

Why do you need to update this Notebook?

"metadata": {},
"source": [
"# Fine-tuning DistilBERT for question answering\n",
"# Fine-tune DistilBERT for Question Answering\n",
Member

Same question.

"id": "62c2ba7a",
"metadata": {},
"source": [
"# Speech Recognition with Kubeflow Trainer (V2 SDK)\n",
Member

Suggested change
"# Speech Recognition with Kubeflow Trainer (V2 SDK)\n",
"# Speech Recognition with Kubeflow Trainer\n",

"\n",
"## Environment Setup\n",
"\n",
"For details on how to deploy a local Kubernetes cluster using **Kind** and configure the **torch-distributed** Runtime, please refer to the [Getting Started Guide](https://github.com/kubeflow/trainer/blob/master/README.md).\n",
Member

" import numpy as np\n",
"\n",
" # Configuration & Distributed Setup \n",
" is_test = os.environ.get(\"KUBEFLOW_TRAINER_TEST\") == \"1\"\n",
Member

Why do you need this?

Comment on lines 166 to 75
" if world_size > 1:\n",
" backend = \"nccl\" if torch.cuda.is_available() else \"gloo\"\n",
" dist.init_process_group(backend=backend)\n",
" device = torch.device(\"cuda\", local_rank) if torch.cuda.is_available() else torch.device(\"cpu\")\n",
" else:\n",
" device = torch.device(\"cpu\")\n",
Member

Since we always start PyTorch with torchrun, you can always get these values from dist.get_rank() and dist.get_world_size(). You can make it similar to this one: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb
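
A rough sketch of that pattern, assuming the script is launched by torchrun, which exports `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` for each worker. The helper names `pick_backend` and `setup_distributed` are illustrative and are not taken from the MNIST notebook.

```python
import os


def pick_backend(cuda_available: bool) -> str:
    """Use NCCL on GPU nodes and Gloo on CPU-only nodes."""
    return "nccl" if cuda_available else "gloo"


def setup_distributed():
    """Initialize the process group and return (rank, world_size, device).

    Assumes torchrun started the process, so init_process_group can read
    the rendezvous settings from the environment.
    """
    # Imported lazily so pick_backend stays usable without PyTorch installed.
    import torch
    import torch.distributed as dist

    cuda_available = torch.cuda.is_available()
    dist.init_process_group(backend=pick_backend(cuda_available))

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    device = torch.device("cuda", local_rank) if cuda_available else torch.device("cpu")
    return rank, world_size, device
```

With this shape there is no separate `world_size > 1` branch: a single-process torchrun launch simply reports a world size of 1.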

Comment on lines 317 to 231
"import os\n",
"os.environ[\"KUBEFLOW_TRAINER_TEST\"] = \"1\"\n",
"train_fn()"
Member

You can avoid setting the test run manually by using the local backend that @szaher created in the SDK.
Check the MNIST example: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb
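
A hedged sketch of what a local run through the SDK might look like. `LocalProcessBackendConfig`, `TrainerClient`, and `CustomTrainer(func=...)` are named elsewhere in this conversation; the import path from `kubeflow.trainer` and the `backend_config` keyword are assumptions that should be checked against the installed SDK version. The wrapper name `run_locally` is hypothetical.

```python
def run_locally(train_fn):
    """Submit train_fn through the SDK's local process backend.

    Assumption: the backend config is importable from kubeflow.trainer and
    is passed to TrainerClient via the backend_config keyword.
    """
    # Imported lazily so this module loads without the Kubeflow SDK installed.
    from kubeflow.trainer import (
        CustomTrainer,
        LocalProcessBackendConfig,
        TrainerClient,
    )

    client = TrainerClient(backend_config=LocalProcessBackendConfig())
    job_name = client.train(trainer=CustomTrainer(func=train_fn))
    return client, job_name
```

Because the backend is chosen when the client is created, the training function itself would not need a flag such as `KUBEFLOW_TRAINER_TEST` to behave differently on a local run.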

" docker_client.ping()\n",
" print(\"Docker is running\")\n",
" \n",
" backend_config = ContainerBackendConfig(container_runtime=\"docker\")\n",
Member

Let's use only the Kubernetes backend here, since we have dedicated examples for Container backend testing.

}
],
"source": [
"from kubeflow.trainer import TrainerClient, CustomTrainer\n",
Member

Please add steps to watch the job and get the job logs, as in the other examples.
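
One possible shape for those steps, written as a small helper that takes any client exposing `wait_for_job_status`, `get_job`, and `get_job_logs`, the method names used in this conversation. The helper name `watch_job` and the default timeout are assumptions, and exact signatures may vary between SDK versions.

```python
def watch_job(client, job_name, timeout=600):
    """Wait for the job to start, print its steps, then stream its logs.

    Returns the collected log lines so a notebook cell can inspect them.
    """
    # Block until the job is running or has already finished.
    client.wait_for_job_status(
        name=job_name,
        status={"Running", "Complete", "Failed"},
        timeout=timeout,
    )

    # Show the state of each step of the job.
    for step in client.get_job(name=job_name).steps:
        print(f"Step: {step.name}, Status: {step.status}")

    # Stream the logs as they are produced.
    lines = []
    for line in client.get_job_logs(job_name, follow=True):
        print(line, end="")
        lines.append(line)
    return lines
```

In a notebook this would typically follow the cell that calls `client.train(...)`, for example `watch_job(client, job_name)`.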

@Snehadas2005
Author

Thanks a lot @andreyvelich, for the detailed review and guidance. This was very helpful.

I have rebased the PR to remove unrelated changes, reverted updates to other notebooks, and aligned the Speech Recognition example with the existing MNIST pattern (distributed setup, backend usage, and job monitoring).

All requested changes have been pushed. Please let me know if anything else needs adjustment.

@andreyvelich
Member

> Thanks a lot @andreyvelich, for the detailed review and guidance. This was very helpful.
>
> I have rebased the PR to remove unrelated changes, reverted updates to other notebooks, and aligned the Speech Recognition example with the existing MNIST pattern (distributed setup, backend usage, and job monitoring).
>
> All requested changes have been pushed. Please let me know if anything else needs adjustment.

I can still see changes to the Flux proposal in this PR: https://github.com/kubeflow/trainer/pull/3063/files
Also, please sign your commits and run these Notebooks as part of the E2Es: https://github.com/kubeflow/trainer/blob/master/.github/workflows/test-e2e.yaml

@jaiakash
Member

Hi @Snehadas2005, please don't add more commits. First fix the commit sign-off, and rebase instead of creating merge commits. I see you are "merging" from the master branch 😞
Check this kubeflow/sdk#115 (comment)

…ation

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
@Snehadas2005
Author

Hi @jaiakash, thank you for checking. I am actively working on addressing the remaining comments. Currently, I am facing an issue when running distributed training on a local Kind cluster. The job is launched successfully, but it times out while waiting to reach the Running/Complete/Failed status:

Launched cluster job: m431e8e16ee4  
Waiting for Kubernetes to allocate resources and start pulling images...  
Error interacting with cluster: Timeout waiting for TrainJob to reach status

The Native Local Run with LocalProcessBackendConfig works correctly, and the Kind cluster is running, but the job does not progress when using the Kubeflow runtime.

Below is the relevant code snippet I’m using for the cluster run:

from kubeflow.trainer import TrainerClient, CustomTrainer
import time

try:
    client = TrainerClient()

    runtimes = client.list_runtimes()
    torch_runtime = next((r for r in runtimes if "torch-distributed" in r.name), None)

    if torch_runtime is None:
        raise ValueError("Could not find 'torch-distributed' runtime on cluster.")

    job_name = client.train(
        trainer=CustomTrainer(
            func=train_fn,
            num_nodes=2,
            resources_per_node={"cpu": 1, "memory": "1Gi"},
            packages_to_install=["torch", "torchaudio", "librosa", "soundfile"]
        ),
        runtime=torch_runtime
    )

    print(f"Launched cluster job: {job_name}")
    print("Waiting for Kubernetes to allocate resources and start pulling images...")

    job = client.wait_for_job_status(name=job_name, status={"Running", "Complete", "Failed"}, timeout=1200)
    print(f"Job Status: {job.status}")

    for step in client.get_job(name=job_name).steps:
        print(f"Step: {step.name}, Status: {step.status}, Devices: {step.device} x {step.device_count}")

    # Stream the logs
    for line in client.get_job_logs(job_name, follow=True):
        print(line, end="")
except Exception as e:
    print(f"Error interacting with cluster: {e}")

I am continuing to debug this and will share updates soon. If you have any suggestions on what I should check next, I’d really appreciate your guidance.

Thank you.

…ynb. Make changes in speech-recognition.ipynb

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
@google-oss-prow google-oss-prow bot added size/XXL and removed size/XL labels Feb 5, 2026
@Snehadas2005
Author

Hi all, sharing a brief status update.

The Speech Recognition example and the previously requested review changes have been completed.

I am currently working on the Audio Classification integration and finalising the remaining updates. Some additional time is required for debugging and aligning the setup with the existing E2E configuration.

I am aiming to have this ready by today or tomorrow and will share an update once it is complete. Thank you for your patience.

@andreyvelich
Member

Sure, thanks @Snehadas2005.
If you want, you can implement Audio Classification in a follow-up PR, so you can first ensure that the E2Es for Speech Recognition work fine.

@andreyvelich
Member

/ok-to-test

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
]
},
{
"ename": "RuntimeError",
Contributor

Could you clean those errors up?

Author

I am currently working on this. I will clean up the errors that are coming up.

.gitignore Outdated
Comment on lines 44 to 47
kind.exe
kind-windows.exe
kind-windows-amd64.exe
kubectl.exe
Contributor

Is this really needed, given that Windows binaries are excluded below?


- **Image Classification (mnist.ipynb):** Demonstrates distributed training on the Fashion MNIST dataset using CNNs.
- **Question Answering (fine-tune-distilbert.ipynb):** Fine-tuning DistilBERT on the SQuAD dataset with Hugging Face integration.
- **Speech Recognition (speech-recognition.ipynb):** Spoken word classification using an Audio Transformer on the Speech Commands dataset.
Contributor

Could you clean up the output in this example, please?

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
@Snehadas2005
Author

Hi @andreyvelich and @astefanutti, thank you for the detailed feedback.

I have cleaned up the reported errors, updated the .gitignore entries, and improved the notebook outputs as requested. All changes have been pushed, and the E2E tests are now passing. Please let me know if there are any remaining issues or further improvements needed. I will be happy to address them.

Thank you for your continued guidance.

@andreyvelich
Member

@Snehadas2005 Can you sign your commits and remove the long output from your Notebook?
Currently, this PR has 35k lines of changes.

@Snehadas2005
Author

@andreyvelich, I have cleared the Jupyter Notebook outputs for speech-recognition.ipynb and audio-classification.ipynb, and signed the commits as requested. Please let me know if any further changes are needed.
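
For readers following along, clearing notebook outputs can be sketched with the standard library alone. This is an illustration of the idea, not necessarily how the outputs were cleared in this PR; it assumes the nbformat v4 layout, where each code cell carries `outputs` and `execution_count` fields. The function names are hypothetical.

```python
import json
from pathlib import Path


def clear_outputs(notebook: dict) -> dict:
    """Strip outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_outputs_in_file(path: str) -> None:
    """Rewrite the notebook at `path` with all code-cell outputs removed."""
    nb_path = Path(path)
    notebook = json.loads(nb_path.read_text(encoding="utf-8"))
    cleaned = json.dumps(clear_outputs(notebook), indent=1, ensure_ascii=False)
    nb_path.write_text(cleaned + "\n", encoding="utf-8")
```

The same result is available from the command line with `jupyter nbconvert --clear-output --inplace <notebook>`, which avoids hand-editing the JSON.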

Member

@andreyvelich andreyvelich left a comment

You need to adjust your examples to follow the same flow as we run for MNIST with TrainJob creation and log checking: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb

CLAUDE.md Outdated
@@ -1 +1 @@
AGENTS.md No newline at end of file
AGENTS.md
Member

Why are you making changes to this file?

Author

Hi @andreyvelich,

Thank you for pointing this out.

The change to CLAUDE.md was unintentional and happened while editing nearby files. This is not related to this PR. I will revert this change and ensure only relevant files are modified going forward.

Thank you for flagging this.

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add production-ready standalone Python examples for PyTorch (non-notebook)
Add more AI/ML Training Example with Kubeflow Trainer

5 participants