chore: add audio m5 speechcommands #3092
Conversation
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here. Details: approval is needed from an approver in each of these files; approvers can indicate their approval by writing `/approve` in a comment.
Pull request overview
This PR adds an audio classification example demonstrating how to train an M5 Network model on the Google Speech Commands dataset using PyTorch and Kubeflow Trainer. The example shows distributed training capabilities and scaling from local execution to Kubernetes clusters.
Changes:
- Added a comprehensive Jupyter notebook example for audio classification with M5 Network architecture
- Implemented distributed training setup with PyTorch DDP on Speech Commands dataset
- Integrated Kubeflow Trainer SDK for job submission, monitoring, and management
| " batch_size=batch_size,\n", | ||
| " sampler=sampler,\n", | ||
| " collate_fn=collate_fn,\n", | ||
| " num_workers=1 if device_type == \"cuda\" else 0\n", |
The num_workers is set to 1 for cuda and 0 for cpu on line 227. However, using num_workers=1 with distributed training can cause issues with data loading performance and may not fully utilize multi-core CPUs. Consider using a higher value like num_workers=4 or making it configurable as a parameter for better performance, especially when using multiple workers per node.
| " num_workers=1 if device_type == \"cuda\" else 0\n", | |
| " num_workers=4 if device_type == \"cuda\" else 2\n", |
| "\n", | ||
| " # Initialize model with DDP\n", | ||
| " model = M5(n_input=waveform.shape[0], n_output=len(labels)).to(device)\n", | ||
| " model = nn.parallel.DistributedDataParallel(model)\n", |
The model is wrapped with DistributedDataParallel without specifying the device_ids parameter. While this works, for GPU training it's recommended to explicitly pass device_ids=[local_rank] and output_device=local_rank to ensure proper device placement and avoid potential issues with multi-GPU setups.
| " model = nn.parallel.DistributedDataParallel(model)\n", | |
| " if device.type == \"cuda\":\n", | |
| " # Determine local_rank for this process and bind model to the correct GPU\n", | |
| " if torch.distributed.is_initialized() and torch.cuda.is_available() and torch.cuda.device_count() > 0:\n", | |
| " local_rank = torch.distributed.get_rank() % torch.cuda.device_count()\n", | |
| " else:\n", | |
| " local_rank = torch.cuda.current_device()\n", | |
| " model = nn.parallel.DistributedDataParallel(\n", | |
| " model,\n", | |
| " device_ids=[local_rank],\n", | |
| " output_device=local_rank,\n", | |
| " )\n", | |
| " else:\n", | |
| " model = nn.parallel.DistributedDataParallel(model)\n", |
| " output = model(data)\n", | ||
| " loss = F.nll_loss(output.squeeze(), target)\n", |
The output shape from the model forward pass appears inconsistent with the loss calculation. The forward method returns F.log_softmax(self.classifier(x), dim=2) where x has been permuted to shape (batch_size, 1, 2*n_channel). This means the output has shape (batch_size, 1, n_output). The loss function then calls output.squeeze() to get shape (batch_size, n_output), but this adds unnecessary complexity. The dim parameter in log_softmax should be dim=-1 or the architecture should be simplified to avoid the extra dimension.
| " output = model(data)\n", | |
| " loss = F.nll_loss(output.squeeze(), target)\n", | |
| " output = model(data).squeeze(1)\n", | |
| " loss = F.nll_loss(output, target)\n", |
| " device_type, backend = (\n", | ||
| " (\"cuda\", \"nccl\") if torch.cuda.is_available() else (\"cpu\", \"gloo\")\n", |
The output logs show "Using Device: cpu, Backend: gloo" (lines 440-469) even though the job configuration requests a GPU with "nvidia.com/gpu": 1 on line 361. This suggests the GPU is not being properly detected or utilized. Verify that the Docker image includes proper CUDA support and that torch.cuda.is_available() returns True in the GPU environment, or update the example to clarify CPU-only behavior.
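One way to make a silent CPU fallback visible in the job logs is a small, explicit check before the backend is chosen; this is a sketch, and the log wording is made up:

```python
import torch

def select_device_and_backend():
    """Pick (device_type, backend) and log why, so a silent CPU fallback is obvious."""
    if torch.cuda.is_available():
        print(f"CUDA detected: {torch.cuda.device_count()} device(s), "
              f"first device: {torch.cuda.get_device_name(0)}")
        return "cuda", "nccl"
    # If the job spec requested nvidia.com/gpu but CUDA is not visible here,
    # the image is likely missing a CUDA-enabled PyTorch build or the pod
    # was not scheduled onto a GPU node.
    print("CUDA not available; falling back to device=cpu, backend=gloo")
    return "cpu", "gloo"
```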
| " x = F.avg_pool1d(x, x.shape[-1]).permute(0, 2, 1)\n", | ||
| " return F.log_softmax(self.classifier(x), dim=2)\n", |
The forward method returns log probabilities with shape (batch_size, 1, n_output) after permute, but the training loop applies nll_loss after squeeze on line 250. This architecture design means x.shape[-1] is always 1 after avg_pool1d since the pooled dimension becomes 1. The permute(0, 2, 1) operation and squeeze() work, but this architecture seems unnecessarily complex. Consider simplifying to return shape (batch_size, n_output) directly by using x = F.avg_pool1d(x, x.shape[-1]).squeeze(-1) followed by self.classifier(x), and removing the permute and the need for squeeze in the loss calculation.
| " x = F.avg_pool1d(x, x.shape[-1]).permute(0, 2, 1)\n", | |
| " return F.log_softmax(self.classifier(x), dim=2)\n", | |
| " x = F.avg_pool1d(x, x.shape[-1]).squeeze(-1)\n", | |
| " return F.log_softmax(self.classifier(x), dim=1)\n", |
| " def collate_fn(batch):\n", | ||
| " tensors, targets = [], []\n", | ||
| " for waveform, _, label, *_ in batch:\n", | ||
| " tensors += [waveform.t()]\n", | ||
| " targets += [torch.tensor(labels.index(label))]\n", | ||
| " # Pad to same length\n", | ||
| " tensors = torch.nn.utils.rnn.pad_sequence(\n", | ||
| " tensors, batch_first=True, padding_value=0.\n", | ||
| " ).permute(0, 2, 1)\n", | ||
| " return tensors, torch.stack(targets)\n", |
The collate function is defined inside train_m5_speechcommands but references the 'labels' variable, which is only created later on line 213. Because Python closures resolve names at call time, this works as long as labels exists before the DataLoader iterates, but the implicit dependency is fragile and hard to follow. Consider moving the collate_fn definition after labels is created, or restructuring to pass labels explicitly (as a bound argument or via a class-based approach).
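A sketch of the explicit-binding option using `functools.partial`; this restructuring is a suggestion, not what the notebook currently does:

```python
from functools import partial

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch, labels):
    """Pad waveforms to a common length and map string labels to class indices."""
    tensors, targets = [], []
    for waveform, _, label, *_ in batch:
        tensors.append(waveform.t())
        targets.append(torch.tensor(labels.index(label)))
    tensors = pad_sequence(tensors, batch_first=True, padding_value=0.0).permute(0, 2, 1)
    return tensors, torch.stack(targets)

# `labels` must exist by the time the loader is built; binding it here makes that explicit:
# loader = DataLoader(dataset, batch_size=..., sampler=sampler,
#                     collate_fn=partial(collate_fn, labels=labels))
```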
| " dist.init_process_group(backend=backend)\n", | ||
| " print(\n", | ||
| " f\"Distributed Training - WORLD_SIZE: {dist.get_world_size()}, \"\n", | ||
| " f\"RANK: {dist.get_rank()}, LOCAL_RANK: {local_rank}\"\n", | ||
| " )\n", |
The logs show WORLD_SIZE of 30 processes being spawned (lines 470-499), which seems excessive for a single node configuration. The job is configured with num_nodes=1 on line 355, but torch distributed appears to be creating 30 worker processes. This could be intentional for data parallelism, but it's not clearly documented and may lead to resource contention. Consider adding documentation about why 30 processes are used or making this configurable.
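If the intent is to document the process topology rather than shrink it, a short logging helper like the following could be called right after init_process_group; `LOCAL_WORLD_SIZE` is the per-node process count exported by torchrun-style launchers, and the exact wording is illustrative:

```python
import os

import torch.distributed as dist

def log_topology(expected_nodes: int = 1) -> None:
    """Log world size and per-node process count so surprises show up in the job logs."""
    world_size = dist.get_world_size()
    procs_per_node = int(os.environ.get("LOCAL_WORLD_SIZE", "1"))
    implied_nodes = world_size // max(procs_per_node, 1)
    print(
        f"world_size={world_size}, procs_per_node={procs_per_node}, "
        f"implied_nodes={implied_nodes} (expected {expected_nodes})"
    )
```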
Pull Request Test Coverage Report for Build 20945115994 (details).
💛 - Coveralls
Closing this, as a PR for the audio example has already been raised (#3063).
What this PR does / why we need it:
This PR adds an audio classification example to the trainer repo. It trains an audio classification model using the M5 Network architecture on the Google Speech Commands dataset with PyTorch and Kubeflow Trainer.
On the system specs below, it took around 3 minutes to run; we could also include it in our E2E test coverage.
Which issue(s) this PR fixes
Related #2040
Checklist: