feat: add production-ready MNIST example for PyTorch by Snehadas2005 · Pull Request #3063 · kubeflow/trainer

Snehadas2005 · 2026-01-03T02:15:52Z

What this PR does / Why we need it:
This PR updates and refactors the PyTorch Jupyter Notebook examples to support the Kubeflow Trainer V2 SDK. It adds a new Audio Classification workflow and improves existing examples to ensure stability, cross-platform compatibility, and alignment with current SDK features.

Key Changes:

Added a new audio classification workflow using the GTZAN dataset with automated data download and standardized preprocessing.
Updated speech recognition, image classification, and question answering examples to follow the official V2 training workflow.
Migrated all notebooks to use TrainerClient and CustomTrainer APIs.
Improved distributed training support on Windows using the gloo backend.
Standardized environment variables (DATA_DIR, OUTPUT_DIR) across examples.

Issues Fixed:
Fixes #3062
Fixes #2040
Related PR: #2830

Checklist:

Updated PyTorch README
Verified local execution
Updated .gitignore for new artifacts

google-oss-prow · 2026-01-03T02:15:57Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

jaiakash · 2026-01-03T18:21:52Z

Thanks for raising this, @Snehadas2005.

I see it's still a draft PR, but here are a few minor suggestions that should help.

Please use Jupyter Notebook (.ipynb) files for examples instead of standard Python files (.py). You can check out these example PRs (feat: qwen 2.5 1.5b runtime, example and fix gpu e2e test #2835, feat(runtimes): Support Distributed MLX on CUDA #2790) as a reference.
Your commits are not signed, which is why the DCO check is failing. Check this for more info on how to sign your current commits and any future commits.

Happy contributing.
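
For context on the DCO failure mentioned above: the check looks for a `Signed-off-by:` trailer in each commit message. The following is a minimal Python sketch of that idea, not the actual DCO bot logic; the function name `has_signoff` and the example author are illustrative assumptions.

```python
import re

# Matches a trailer line such as "Signed-off-by: Jane Doe <jane@example.com>".
SIGNOFF_RE = re.compile(
    r"^Signed-off-by: (?P<name>.+) <(?P<email>[^<>]+)>$", re.MULTILINE
)


def has_signoff(message: str, author_email: str) -> bool:
    """Return True if the commit message carries a sign-off for author_email."""
    return any(
        m.group("email").lower() == author_email.lower()
        for m in SIGNOFF_RE.finditer(message)
    )


# Hypothetical commit messages used only for illustration.
signed = "docs: update README\n\nSigned-off-by: Jane Doe <jane@example.com>\n"
unsigned = "before rebase\n"

print(has_signoff(signed, "jane@example.com"))    # True
print(has_signoff(unsigned, "jane@example.com"))  # False
```

In practice the trailer is usually added with `git commit -s` for new commits, and `git rebase --signoff <base>` can add it to commits that already exist.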

Snehadas2005 · 2026-01-04T03:50:41Z

Thank you so much, @jaiakash, for the detailed feedback and references. I really appreciate it.

That makes sense. I will convert the example into a Jupyter notebook and align it with the existing example patterns you shared, focusing on clarity and readability for data scientists.

I also appreciate the note on DCO signing. I will fix the commit signatures and ensure all future commits are properly signed.

Thanks again for the guidance, happy to iterate further and adjust based on feedback from the team.

review-notebook-app · 2026-01-04T05:13:18Z

Check out this pull request on ReviewNB.

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

coveralls · 2026-01-12T16:40:50Z

Pull Request Test Coverage Report for Build 21773851100

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage remained the same at 51.998%

Totals
Change from base Build 21761068098:	0.0%
Covered Lines:	1288
Relevant Lines:	2477

💛 - Coveralls

andreyvelich

Thanks for this @Snehadas2005!
I left a few comments.

andreyvelich · 2026-01-13T00:09:02Z

docs/proposals/2841-flux-hpc/README.md

@@ -0,0 +1,164 @@
+# KEP-2841: Support Flux Framework for HPC in Kubeflow Trainer


Please rebase the PR from the master branch to remove unnecessary changes from this PR.

andreyvelich · 2026-01-13T00:09:44Z

examples/pytorch/image-classification/mnist.ipynb

-     "name": "#%% md\n"
-    }
-   },
+   "id": "9fb68cb1",


Why do you need to update this Notebook?

andreyvelich · 2026-01-13T00:10:20Z

examples/pytorch/question-answering/fine-tune-distilbert.ipynb

   "metadata": {},
   "source": [
-    "# Fine-tuning DistilBERT for question answering\n",
+    "# Fine-tune DistilBERT for Question Answering\n",


Same question.

andreyvelich · 2026-01-13T00:10:42Z

examples/pytorch/audio-classification/audio-classification.ipynb

+   "id": "62c2ba7a",
+   "metadata": {},
+   "source": [
+    "# Speech Recognition with Kubeflow Trainer (V2 SDK)\n",


Suggested change

"# Speech Recognition with Kubeflow Trainer (V2 SDK)\n",

"# Speech Recognition with Kubeflow Trainer\n",

andreyvelich · 2026-01-13T00:11:12Z

examples/pytorch/audio-classification/audio-classification.ipynb

+    "\n",
+    "## Environment Setup\n",
+    "\n",
+    "For details on how to deploy a local Kubernetes cluster using **Kind** and configure the **torch-distributed** Runtime, please refer to the [Getting Started Guide](https://github.com/kubeflow/trainer/blob/master/README.md).\n",


Please refer to this guide: https://www.kubeflow.org/docs/components/trainer/getting-started/

andreyvelich · 2026-01-13T00:11:52Z

examples/pytorch/audio-classification/audio-classification.ipynb

+    "    import numpy as np\n",
+    "\n",
+    "    #  Configuration & Distributed Setup \n",
+    "    is_test = os.environ.get(\"KUBEFLOW_TRAINER_TEST\") == \"1\"\n",


Why do you need this?

andreyvelich · 2026-01-13T00:13:16Z

examples/pytorch/audio-classification/audio-classification.ipynb

+    "    if world_size > 1:\n",
+    "        backend = \"nccl\" if torch.cuda.is_available() else \"gloo\"\n",
+    "        dist.init_process_group(backend=backend)\n",
+    "        device = torch.device(\"cuda\", local_rank) if torch.cuda.is_available() else torch.device(\"cpu\")\n",
+    "    else:\n",
+    "        device = torch.device(\"cpu\")\n",


Since we always start PyTorch with torchrun, you can always get these values from dist.get_rank() and dist.get_world_size(). That way you can make it similar to this one: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb
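
A minimal sketch of the pattern suggested above, assuming the script is always launched by torchrun (which sets RANK, WORLD_SIZE, and LOCAL_RANK in the environment). The helper names `pick_backend` and `setup_distributed` are illustrative and are not taken from the referenced MNIST notebook.

```python
import os


def pick_backend(cuda_available: bool) -> str:
    """NCCL for GPU training, Gloo as the CPU fallback."""
    return "nccl" if cuda_available else "gloo"


def setup_distributed():
    """Initialise the process group and return (rank, world_size, device).

    Assumes a torchrun launch, so no `world_size > 1` branch is needed:
    the process group is always initialised, even with a single process.
    """
    # Imported lazily so the helper above can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    cuda = torch.cuda.is_available()
    dist.init_process_group(backend=pick_backend(cuda))

    rank = dist.get_rank()
    world_size = dist.get_world_size()
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by torchrun

    device = torch.device("cuda", local_rank) if cuda else torch.device("cpu")
    return rank, world_size, device
```

Inside the training function, a call such as `rank, world_size, device = setup_distributed()` would then replace the manual environment handling.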

andreyvelich · 2026-01-13T00:16:55Z

examples/pytorch/audio-classification/audio-classification.ipynb

+    "import os\n",
+    "os.environ[\"KUBEFLOW_TRAINER_TEST\"] = \"1\"\n",
+    "train_fn()"


You can avoid setting the test run manually by just using the local backend that @szaher created in the SDK.
Check mnist example: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb

andreyvelich · 2026-01-13T00:17:33Z

examples/pytorch/audio-classification/audio-classification.ipynb

+    "    docker_client.ping()\n",
+    "    print(\"Docker is running\")\n",
+    "    \n",
+    "    backend_config = ContainerBackendConfig(container_runtime=\"docker\")\n",


Let's just use the Kubernetes backend here, since we have dedicated examples for Container backend testing.

andreyvelich · 2026-01-13T00:18:27Z

examples/pytorch/audio-classification/audio-classification.ipynb

+    }
+   ],
+   "source": [
+    "from kubeflow.trainer import TrainerClient, CustomTrainer\n",


Please add steps to watch for job and get job logs like in other examples

Snehadas2005 · 2026-01-13T04:27:32Z

Thanks a lot @andreyvelich, for the detailed review and guidance. This was very helpful.

I have rebased the PR to remove unrelated changes, reverted updates to other notebooks, and aligned the Speech Recognition example with the existing MNIST pattern (distributed setup, backend usage, and job monitoring).

All requested changes have been pushed. Please let me know if anything else needs adjustment.

andreyvelich · 2026-01-13T14:21:25Z

I can still see changes to the Flux proposal in this PR: https://github.com/kubeflow/trainer/pull/3063/files
Also, please sign your commits and run these Notebooks as part of the E2Es: https://github.com/kubeflow/trainer/blob/master/.github/workflows/test-e2e.yaml

jaiakash · 2026-01-13T15:32:16Z

Hi @Snehadas2005, please don't add more commits. First fix the sign-off on your existing commits, and rebase instead of creating merge commits. I see you are "merging" from the master branch 😞
Check this kubeflow/sdk#115 (comment)

Snehadas2005 · 2026-01-28T15:02:06Z

Hi @jaiakash, thank you for checking. I am actively working on addressing the remaining comments. Currently, I am facing an issue when running distributed training on a local Kind cluster. The job is launched successfully, but it times out while waiting to reach the Running/Complete/Failed status:

Launched cluster job: m431e8e16ee4  
Waiting for Kubernetes to allocate resources and start pulling images...  
Error interacting with cluster: Timeout waiting for TrainJob to reach status

The Native Local Run with LocalProcessBackendConfig works correctly, and the Kind cluster is running, but the job does not progress when using the Kubeflow runtime.

Below is the relevant code snippet I’m using for the cluster run:

from kubeflow.trainer import TrainerClient, CustomTrainer
import time

try:
    client = TrainerClient()

    runtimes = client.list_runtimes()
    torch_runtime = next((r for r in runtimes if "torch-distributed" in r.name), None)

    if torch_runtime is None:
        raise ValueError("Could not find 'torch-distributed' runtime on cluster.")

    job_name = client.train(
        trainer=CustomTrainer(
            func=train_fn,
            num_nodes=2,
            resources_per_node={"cpu": 1, "memory": "1Gi"},
            packages_to_install=["torch", "torchaudio", "librosa", "soundfile"]
        ),
        runtime=torch_runtime
    )

    print(f"Launched cluster job: {job_name}")
    print("Waiting for Kubernetes to allocate resources and start pulling images...")

    job = client.wait_for_job_status(name=job_name, status={"Running", "Complete", "Failed"}, timeout=1200)
    print(f"Job Status: {job.status}")

    for step in client.get_job(name=job_name).steps:
        print(f"Step: {step.name}, Status: {step.status}, Devices: {step.device} x {step.device_count}")

    # Stream the logs
    for line in client.get_job_logs(job_name, follow=True):
        print(line, end="")
except Exception as e:
    print(f"Error interacting with cluster: {e}")

I am continuing to debug this and will share updates soon. If you have any suggestions on what I should check next, I’d really appreciate your guidance.

Thank you.
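
To make the timeout above easier to reason about, here is a generic sketch of the poll-until-status pattern that a wait call of this kind typically follows. It is not the SDK's implementation: `wait_for_status`, `get_status`, and the "Created" status value are hypothetical stand-ins used only for illustration.

```python
import time
from typing import Callable, Set


def wait_for_status(get_status: Callable[[], str], wanted: Set[str],
                    timeout: float, polling_interval: float = 0.01) -> str:
    """Poll get_status until it returns a value in wanted, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in wanted:
            return status
        if time.monotonic() >= deadline:
            # Reporting the last observed status shows where the job got stuck.
            raise TimeoutError(f"last observed status: {status!r}")
        time.sleep(polling_interval)


WANTED = {"Running", "Complete", "Failed"}

# A job that progresses normally reaches a wanted status.
statuses = iter(["Created", "Created", "Running"])
print(wait_for_status(lambda: next(statuses), WANTED, timeout=1.0))  # Running

# A job whose pods never start stays in a pre-running status and times out.
try:
    wait_for_status(lambda: "Created", WANTED, timeout=0.05)
except TimeoutError as exc:
    print(f"timed out: {exc}")
```

The second case mirrors the report above: if the job never leaves its initial state (for example, because pods are still pending or images are still being pulled), none of the wanted statuses is ever observed and the wait ends in a timeout.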

Snehadas2005 · 2026-02-05T14:00:35Z

Hi all, sharing a brief status update.

The Speech Recognition example and the previously requested review changes have been completed.

I am currently working on the Audio Classification integration and finalising the remaining updates. Some additional time is required for debugging and aligning the setup with the existing E2E configuration.

I am aiming to have this ready by today or tomorrow and will share an update once it is complete. Thank you for your patience.

andreyvelich · 2026-02-05T14:15:54Z

Sure, thanks @Snehadas2005.
If you want, you can implement Audio Classification in a follow-up PR, so you can first ensure that the E2Es for Speech Recognition work fine.

andreyvelich · 2026-02-05T14:16:00Z

/ok-to-test

astefanutti · 2026-02-06T10:08:52Z

examples/pytorch/audio-classification/audio-classification.ipynb

+                    ]
+                },
+                {
+                    "ename": "RuntimeError",


Could you clean those errors up?

I am currently working on this. I will clean up the errors that are coming up.

astefanutti · 2026-02-06T10:11:35Z

.gitignore

+kind.exe
+kind-windows.exe
+kind-windows-amd64.exe
+kubectl.exe


Is this really needed given that Windows binaries are excluded below?
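
A quick way to check the redundancy raised here, sketched in Python. The existing ignore rule is not shown in this thread, so the `*.exe` pattern below is an assumption, and `fnmatchcase` only approximates gitignore matching for simple file-name globs.

```python
from fnmatch import fnmatchcase

# Assumption: the rule that already excludes Windows binaries is a plain glob.
ASSUMED_PATTERN = "*.exe"

# Entries added to .gitignore in this PR (from the diff above).
added_entries = [
    "kind.exe",
    "kind-windows.exe",
    "kind-windows-amd64.exe",
    "kubectl.exe",
]

# An entry is redundant if the broader pattern already matches it.
redundant = [name for name in added_entries if fnmatchcase(name, ASSUMED_PATTERN)]
print(redundant == added_entries)  # True
```

Under that assumption, every added entry is already covered and the four explicit lines could be dropped.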

astefanutti · 2026-02-06T10:13:02Z

examples/pytorch/README.md

+
+- **Image Classification (mnist.ipynb):** Demonstrates distributed training on the Fashion MNIST dataset using CNNs.
+- **Question Answering (fine-tune-distilbert.ipynb):** Fine-tuning DistilBERT on the SQuAD dataset with Hugging Face integration.
+- **Speech Recognition (speech-recognition.ipynb):** Spoken word classification using an Audio Transformer on the Speech Commands dataset.


Could you clean up the output in this example please.

Snehadas2005 · 2026-02-06T19:11:13Z

Hi @andreyvelich and @astefanutti, thank you for the detailed feedback.

I have cleaned up the reported errors, updated the .gitignore entries, and improved the notebook outputs as requested. All changes have been pushed, and the E2E tests are now passing. Please let me know if there are any remaining issues or further improvements needed. I will be happy to address them.

Thank you for your continued guidance.

andreyvelich · 2026-02-06T19:49:06Z

@Snehadas2005 Can you sign your commit and remove long output from your Notebook?
Currently, this PR has 35k lines of changes.
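
One way to remove stored outputs is to clear them from the notebook JSON before committing. The sketch below shows the idea using only the standard library; the function name `strip_outputs` and the sample notebook are illustrative, not part of this PR.

```python
import json


def strip_outputs(notebook: dict) -> dict:
    """Clear outputs and execution counts from every code cell, in place."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


# A tiny in-memory notebook used only for illustration.
nb = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 3,
            "source": ["print('hi')\n"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["hi\n"]}
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(nb)
print(json.dumps(cleaned["cells"][1]["outputs"]))  # []
```

For files on disk, `jupyter nbconvert --clear-output --inplace <notebook>.ipynb` achieves the same result without custom code.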

Snehadas2005 · 2026-02-06T20:06:48Z

@andreyvelich, I have cleared the Jupyter Notebook outputs for speech-recognition.ipynb and audio-classification.ipynb, and signed the commits as requested. Please let me know if any further changes are needed.

andreyvelich

You need to adjust your examples to follow the same flow as we run for MNIST with TrainJob creation and log checking: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb

andreyvelich · 2026-02-06T20:57:16Z

CLAUDE.md

@@ -1 +1 @@
-AGENTS.md
+AGENTS.md


Why are you making changes to this file?

Hi @andreyvelich,

Thank you for pointing this out.

The change to CLAUDE.md was unintentional and happened while editing nearby files. This is not related to this PR. I will revert this change and ensure only relevant files are modified going forward.

Thank you for flagging this.

google-oss-prow bot requested review from jinchihe and kuizhiqing January 3, 2026 02:15

google-oss-prow bot added the size/L label Jan 3, 2026

Snehadas2005 marked this pull request as draft January 3, 2026 02:25

google-oss-prow bot added the do-not-merge/work-in-progress label Jan 3, 2026

Snehadas2005 force-pushed the master branch from 7dc39c6 to a1f1c75 Compare January 4, 2026 04:42

google-oss-prow bot added size/XL size/XXL and removed size/L size/XL labels Jan 4, 2026

Snehadas2005 marked this pull request as ready for review January 6, 2026 05:12

google-oss-prow bot removed the do-not-merge/work-in-progress label Jan 6, 2026

Snehadas2005 mentioned this pull request Jan 8, 2026

chore: Add Speech Recognition with DDP Example #2830

Open

andreyvelich mentioned this pull request Jan 12, 2026

Add more AI/ML Training Example with Kubeflow Trainer #2040

Open

8 tasks

andreyvelich reviewed Jan 13, 2026

jaiakash mentioned this pull request Jan 13, 2026

chore: add audio m5 speechcommands #3092

Closed

1 task

google-oss-prow bot added size/XL and removed size/XXL labels Jan 13, 2026

Snehadas2005 force-pushed the master branch from 5d97690 to 2348f57 Compare January 13, 2026 15:46

Snehadas2005 added 3 commits January 13, 2026 21:41

feat: add PyTorch MNIST training example with Kubeflow Trainer integr…

6b94bc3

…ation Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

docs: add README for PyTorch examples with Kubeflow Trainer SDK

5bc458c

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

feat: add Speech Recognition PyTorch

284946c

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

andreyvelich mentioned this pull request Feb 4, 2026

chore(examples): add speech recognition DDP notebook #3156

Open

1 task

restored back to the original fine-tune-distilbert.ipynb and mnist.ip…

0d23558

…ynb. Make changes in speech-recognition.ipynb Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

google-oss-prow bot added size/XXL and removed size/XL labels Feb 5, 2026

google-oss-prow bot added the ok-to-test label Feb 5, 2026

Snehadas2005 added 8 commits February 6, 2026 11:34

rectifying errors for e2e tests

2942744

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Merge branch 'master' into master

3163075

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

rectifying errors for e2e tests

8546d7b

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

revert back to the original existing one

13c08b8

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

before rebase

80bd80d

removed trainer and training operator

7861445

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

e2e test trail

f1ee6ce

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

e2e test trail

d264cfc

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

astefanutti reviewed Feb 6, 2026

Snehadas2005 added 5 commits February 6, 2026 22:11

feat: add Kubeflow-compatible audio classification

88d6f49

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Fix end-of-file newlines

7e1ed3e

Added kaggle dataset

166682f

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

feat(audio): implement robust data sourcing with mock fallback

aedd4b9

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Merge branch 'master' into master

23b528d

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Snehadas2005 added 2 commits February 7, 2026 01:30

refactor: remove long notebook outputs to reduce PR size

68a4af0

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>

Merge branch 'master' of https://github.com/Snehadas2005/trainer

2794b11

andreyvelich reviewed Feb 6, 2026

Revert unintended change to CLAUDE.md

e3bc8a1

Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
