feat: add production-ready MNIST example for PyTorch#3063
feat: add production-ready MNIST example for PyTorch#3063Snehadas2005 wants to merge 29 commits intokubeflow:masterfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Thanks for raising this, @Snehadas2005. I see its still a draft PR but few minor suggestions which will help you.
Happy contributing. |
|
Thank you so much, @jaiakash, for the detailed feedback and references. I really appreciate it. That makes sense. I will convert the example into a Jupyter notebook and align it with the existing example patterns you shared, focusing on clarity and readability for data scientists. I also appreciate the note on DCO signing. I will fix the commit signatures and ensure all future commits are properly signed. Thanks again for the guidance, happy to iterate further and adjust based on feedback from the team. |
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
Pull Request Test Coverage Report for Build 21773851100Details
💛 - Coveralls |
andreyvelich
left a comment
There was a problem hiding this comment.
Thanks for this @Snehadas2005!
I left a few comments.
| @@ -0,0 +1,164 @@ | |||
| # KEP-2841: Support Flux Framework for HPC in Kubeflow Trainer | |||
There was a problem hiding this comment.
Please rebase the PR from the master branch to remove unnecessary changes from this PR.
| "name": "#%% md\n" | ||
| } | ||
| }, | ||
| "id": "9fb68cb1", |
There was a problem hiding this comment.
Why do you need to update this Notebook?
| "metadata": {}, | ||
| "source": [ | ||
| "# Fine-tuning DistilBERT for question answering\n", | ||
| "# Fine-tune DistilBERT for Question Answering\n", |
| "id": "62c2ba7a", | ||
| "metadata": {}, | ||
| "source": [ | ||
| "# Speech Recognition with Kubeflow Trainer (V2 SDK)\n", |
There was a problem hiding this comment.
| "# Speech Recognition with Kubeflow Trainer (V2 SDK)\n", | |
| "# Speech Recognition with Kubeflow Trainer\n", |
| "\n", | ||
| "## Environment Setup\n", | ||
| "\n", | ||
| "For details on how to deploy a local Kubernetes cluster using **Kind** and configure the **torch-distributed** Runtime, please refer to the [Getting Started Guide](https://github.com/kubeflow/trainer/blob/master/README.md).\n", |
There was a problem hiding this comment.
Please refer to this guide: https://www.kubeflow.org/docs/components/trainer/getting-started/
| " import numpy as np\n", | ||
| "\n", | ||
| " # Configuration & Distributed Setup \n", | ||
| " is_test = os.environ.get(\"KUBEFLOW_TRAINER_TEST\") == \"1\"\n", |
| " if world_size > 1:\n", | ||
| " backend = \"nccl\" if torch.cuda.is_available() else \"gloo\"\n", | ||
| " dist.init_process_group(backend=backend)\n", | ||
| " device = torch.device(\"cuda\", local_rank) if torch.cuda.is_available() else torch.device(\"cpu\")\n", | ||
| " else:\n", | ||
| " device = torch.device(\"cpu\")\n", |
There was a problem hiding this comment.
Since we always start the PyTorch with torchrun you can always extract values from dist.get_rank() and dist.get_world_size(). So you can make it similar to this one: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb
| "import os\n", | ||
| "os.environ[\"KUBEFLOW_TRAINER_TEST\"] = \"1\"\n", | ||
| "train_fn()" |
There was a problem hiding this comment.
You can avoid setting the test run manually by just using local backend that @szaher created in the SDK.
Check mnist example: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb
| " docker_client.ping()\n", | ||
| " print(\"Docker is running\")\n", | ||
| " \n", | ||
| " backend_config = ContainerBackendConfig(container_runtime=\"docker\")\n", |
There was a problem hiding this comment.
Let's use just Kubernetes backend here, since we have dedicated examples for Container backend testing.
| } | ||
| ], | ||
| "source": [ | ||
| "from kubeflow.trainer import TrainerClient, CustomTrainer\n", |
There was a problem hiding this comment.
Please add steps to watch for job and get job logs like in other examples
|
Thanks a lot @andreyvelich, for the detailed review and guidance. This was very helpful. I have rebased the PR to remove unrelated changes, reverted updates to other notebooks, and aligned the Speech Recognition example with the existing MNIST pattern (distributed setup, backend usage, and job monitoring). All requested changes have been pushed. Please let me know if anything else needs adjustment. |
I still can see changes in the Flux proposal in this PR: https://github.com/kubeflow/trainer/pull/3063/files |
|
Hi @Snehadas2005 please dont add more commit, first fix the signing off commit and rebase NOT merge commits. I see you are "merging" from master branch 😞 |
…ation Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
|
Hi @jaiakash, thank you for checking. I am actively working on addressing the remaining comments. Currently, I am facing an issue when running distributed training on a local Kind cluster. The job is launched successfully, but it times out while waiting to reach the The Native Local Run with Below is the relevant code snippet I’m using for the cluster run: I am continuing to debug this and will share updates soon. If you have any suggestions on what I should check next, I’d really appreciate your guidance. Thank you. |
…ynb. Make changes in speech-recognition.ipynb Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
|
Hi all, sharing a brief status update. The I am currently working on the I am aiming to have this ready by today or tomorrow and will share an update once it is complete. Thank you for your patience. |
|
Sure, thanks @Snehadas2005. |
|
/ok-to-test |
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
| ] | ||
| }, | ||
| { | ||
| "ename": "RuntimeError", |
There was a problem hiding this comment.
Could you clean those errors up?
There was a problem hiding this comment.
I am currently working on this. I will clean up the errors that are coming up.
.gitignore
Outdated
| kind.exe | ||
| kind-windows.exe | ||
| kind-windows-amd64.exe | ||
| kubectl.exe |
There was a problem hiding this comment.
Is this really needed give windows binaries are excluded below?
examples/pytorch/README.md
Outdated
|
|
||
| - **Image Classification (mnist.ipynb):** Demonstrates distributed training on the Fashion MNIST dataset using CNNs. | ||
| - **Question Answering (fine-tune-distilbert.ipynb):** Fine-tuning DistilBERT on the SQuAD dataset with Hugging Face integration. | ||
| - **Speech Recognition (speech-recognition.ipynb):** Spoken word classification using an Audio Transformer on the Speech Commands dataset. |
There was a problem hiding this comment.
Could you clean up the output in this example please.
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
|
Hi @andreyvelich and @astefanutti, thank you for the detailed feedback. I have cleaned up the reported errors, updated the .gitignore entries, and improved the notebook outputs as requested. All changes have been pushed, and the E2E tests are now passing. Please let me know if there are any remaining issues or further improvements needed. I will be happy to address them. Thank you for your continued guidance. |
|
@Snehadas2005 Can you sign your commit and remove long output from your Notebook? |
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
|
@andreyvelich, I have cleared the Jupyter Notebook outputs for |
andreyvelich
left a comment
There was a problem hiding this comment.
You need to adjust your examples to follow the same flow as we run for MNIST with TrainJob creation and log checking: https://github.com/kubeflow/trainer/blob/master/examples/pytorch/image-classification/mnist.ipynb
CLAUDE.md
Outdated
| @@ -1 +1 @@ | |||
| AGENTS.md No newline at end of file | |||
| AGENTS.md | |||
There was a problem hiding this comment.
Why are you making changes to this file?
There was a problem hiding this comment.
Hi @andreyvelich,
Thank you for pointing this out.
The change to CLAUDE.md was unintentional and happened while editing nearby files. This is not related to this PR. I will revert this change and ensure only relevant files are modified going forward.
Thank you for flagging this.
Signed-off-by: Sneha Das <154408198+Snehadas2005@users.noreply.github.com>
What this PR does / Why we need it:
This PR updates and refactors the PyTorch Jupyter Notebook examples to support the Kubeflow Trainer V2 SDK. It adds a new Audio Classification workflow and improves existing examples to ensure stability, cross-platform compatibility, and alignment with current SDK features.
Key Changes:
TrainerClientandCustomTrainerAPIs.gloobackend.DATA_DIR,OUTPUT_DIR) across examples.Issues Fixed:
Fixes #3062
Fixes #2040
Related PR: #2830
Checklist:
.gitignorefor new artifacts