Define initial grid search #406

@le1nux

Fixed:

  • model definition
  • sequence length (?)

Variable:

  • batch size
  • number of ranks
  • parallelization strategy (Megatron, FSDP2, HSDP, DP, TP, PP once ready; is CP needed?)
  • (selective) activation checkpointing
  • torch.compile
  • PyTorch Flash Attention vs. Dao Flash Attention (need to check whether PyTorch calls Dao Flash Attention internally anyway)
  • Special kernels?
  • CPU Offloading (optional)
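
To get a feel for how large the search space becomes, the fixed and variable dimensions listed above could be expanded into concrete run configurations roughly as follows. This is only a sketch: the key names, the helper `expand_grid`, and every value (model name, sequence length, batch sizes, rank counts, option labels) are hypothetical placeholders and not settings decided in this issue. Only a subset of the parallelization strategies is shown, and special kernels and CPU offloading are left out.

```python
from itertools import product

# Fixed across all runs. Both values are placeholders, not decided in the issue.
FIXED = {"model": "placeholder_model", "sequence_length": 4096}

# Variable dimensions. Every value list here is an illustrative assumption.
GRID = {
    "batch_size": [1, 2, 4],
    "num_ranks": [8, 16],
    "parallelization": ["FSDP2", "HSDP", "DP"],
    "activation_checkpointing": ["none", "selective", "full"],
    "torch_compile": [False, True],
    "attention": ["pytorch_sdpa", "dao_flash_attn"],
}


def expand_grid(grid, fixed):
    """Yield one config dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        # Merge the fixed settings with this particular combination.
        yield {**fixed, **dict(zip(keys, values))}


configs = list(expand_grid(GRID, FIXED))
print(len(configs), "configurations")
print(configs[0])
```

Because the number of runs is the product of the sizes of all value lists, each additional dimension multiplies the cost of the search, which is one argument for keeping the initial grid coarse and pruning combinations that are known to be invalid or uninteresting.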
