Define initial grid search

**Fixed:**
* model definition
* sequence length (?)

**Variable:** 
* batch size
* number of ranks
* parallelization strategy (Megatron, FSDP2, HSDP, DP, TP; PP once ready; is CP needed?)
* (selective) activation checkpointing
* `torch.compile`
* PyTorch Flash Attention vs. Dao Flash Attention (need to check whether PyTorch already calls Dao Flash Attention internally)
* special kernels?
* CPU offloading (optional)
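
The fixed/variable split above could be expanded into concrete runs roughly as follows. This is a minimal sketch, not the project's actual tooling: the names `FIXED`, `GRID`, and `expand_grid` are hypothetical, and all values (model name, sequence length, batch sizes, rank counts) are placeholders chosen only for illustration.

```python
from itertools import product

# Fixed settings: held constant across the whole search.
# Both values are placeholders, not decisions made in this issue.
FIXED = {"model": "placeholder-model", "sequence_length": 2048}

# Variable settings: every combination becomes one run.
# The option lists are hypothetical examples of what the grid might contain.
GRID = {
    "batch_size": [1, 2, 4],
    "num_ranks": [8, 16],
    "parallelization": ["FSDP2", "HSDP"],
    "activation_checkpointing": ["none", "selective", "full"],
    "torch_compile": [False, True],
}


def expand_grid(fixed, grid):
    """Return one config dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    value_lists = [grid[key] for key in keys]
    return [{**fixed, **dict(zip(keys, values))} for values in product(*value_lists)]


configs = expand_grid(FIXED, GRID)
print(f"{len(configs)} runs in the grid")
```

Because the number of runs is the product of the option counts, each additional variable multiplies the cost of the search, which is a reason to keep the initial grid small and mark items such as CPU offloading as optional.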

Define initial grid search #406
