Changes to make torch.compile work #82

Open
zalbanob wants to merge 2 commits into clessig:epicure-dev from zalbanob:epicure-dev

zalbanob commented Jan 8, 2025

Key Performance Metrics Analysis:

Memory Distribution:

  • System RAM usage holds steady at ~400 GB

  • GPU memory utilization is notably imbalanced:

    • GPUs 0-2: ~60%
    • GPU 3: >90% (hosting two encoder-decoder transformers)

GPU Computational Load:

  • Utilization metrics:
    • GPUs 0-2: ~12% mean utilization
    • GPU 3: 25% mean utilization

Training Throughput:

  • Steady-state throughput: 58.48 samples/second (mean)
  • The throughput distribution is tight (IQR: 3.8 samples/second)
  • 90% of measurements fall between 56.7 and 71.7 samples/second
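
For anyone who wants to reproduce the throughput summary from their own logs, here is a minimal sketch of how the mean, IQR, and central 90% range could be computed. The function name `summarize_throughput` and the input list are assumptions for illustration; this is not the script used to produce the numbers above.

```python
import statistics


def summarize_throughput(samples_per_sec):
    """Summarize a list of throughput readings (samples/second).

    Hypothetical helper: returns the mean, the interquartile range,
    and the 5th/95th percentiles (the central 90% of readings).
    """
    data = sorted(samples_per_sec)
    # Quartiles: three cut points splitting the data into four groups.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    pct = statistics.quantiles(data, n=20, method="inclusive")
    return {
        "mean": statistics.fmean(data),
        "iqr": q3 - q1,
        "p05": pct[0],
        "p95": pct[-1],
    }
```

Warm-up iterations (including `torch.compile` compilation time) should probably be dropped from the input before calling this, so that the summary reflects steady state only.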

Current Bottleneck: Placing two encoder-decoder transformers on GPU 3 creates a memory constraint that prevents batch size scaling, which in turn limits our ability to increase the number of data loaders.
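
One possible way to reason about the imbalance is a greedy placement that assigns each module, largest first, to the currently least-loaded GPU. The sketch below is illustrative only: `balance_modules`, the module names, and the per-module memory estimates are all hypothetical, and it ignores communication cost and any constraints on which modules must be co-located.

```python
import heapq


def balance_modules(module_mem_gb, num_gpus):
    """Greedy largest-first placement of modules onto GPUs.

    module_mem_gb: dict of hypothetical module name -> estimated memory (GB).
    Returns (placement, loads): module -> GPU index, and total GB per GPU.
    """
    # Min-heap of (current load, gpu index); pop gives the least-loaded GPU.
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    placement = {}
    ordered = sorted(module_mem_gb.items(), key=lambda kv: kv[1], reverse=True)
    for name, mem in ordered:
        load, gpu = heapq.heappop(heap)
        placement[name] = gpu
        heapq.heappush(heap, (load + mem, gpu))
    # Recompute per-GPU totals from the final placement.
    loads = [0.0] * num_gpus
    for name, gpu in placement.items():
        loads[gpu] += module_mem_gb[name]
    return placement, loads


# Usage with made-up memory estimates (not measured values):
modules = {"enc_dec_a": 30, "enc_dec_b": 30, "m1": 20, "m2": 20, "m3": 20, "m4": 10}
placement, loads = balance_modules(modules, num_gpus=4)
```

With estimates like these, the two encoder-decoder transformers end up on different GPUs, which is the property that would free up headroom for a larger batch size.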
