Open
Description
Fixed:
- model definition
- sequence length (?)
Variable:
- batch size
- number of ranks
- parallelization strategy (Megatron, FSDP2, HSDP, DP, TP; PP once ready; CP needed?)
- (selective) Activation Checkpointing
- torch.compile
- PyTorch Flash Attention vs. Dao Flash Attention (need to check whether PyTorch calls Dao's Flash Attention internally anyway)
- Special kernels?
- CPU Offloading (optional)
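
One possible way to turn the fixed/variable split above into a concrete sweep is to enumerate the Cartesian product of the variable dimensions while holding the model and sequence length constant. The sketch below is only an illustration: `BenchmarkConfig`, `build_sweep`, the field names, and all values (model name, sequence length, batch sizes, rank counts) are hypothetical placeholders, not settings taken from this issue. "Special kernels" is left out because it is still an open question.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class BenchmarkConfig:
    """One point in the benchmark grid (hypothetical structure)."""
    # Fixed across the whole sweep
    model_name: str
    sequence_length: int
    # Varied across the sweep
    batch_size: int
    num_ranks: int
    parallelization: str
    activation_checkpointing: str
    torch_compile: bool
    attention_impl: str
    cpu_offloading: bool


def build_sweep(model_name, sequence_length, batch_sizes, rank_counts,
                strategies, ac_modes, compile_flags, attention_impls,
                offload_flags):
    """Return every combination of the variable dimensions."""
    grid = product(batch_sizes, rank_counts, strategies, ac_modes,
                   compile_flags, attention_impls, offload_flags)
    return [
        BenchmarkConfig(model_name, sequence_length, bs, ranks, strategy,
                        ac, use_compile, attn, offload)
        for bs, ranks, strategy, ac, use_compile, attn, offload in grid
    ]


# Placeholder values for illustration only.
sweep = build_sweep(
    model_name="example_model",
    sequence_length=2048,
    batch_sizes=[1, 2, 4],
    rank_counts=[8, 16],
    strategies=["FSDP2", "HSDP"],
    ac_modes=["none", "selective", "full"],
    compile_flags=[False, True],
    attention_impls=["pytorch_sdpa", "dao_flash_attn"],
    offload_flags=[False],
)
print(len(sweep))  # 3 * 2 * 2 * 3 * 2 * 2 * 1 = 144
```

Because the grid grows multiplicatively, even small value lists produce many runs, so it may be worth pruning combinations that are known not to make sense before launching anything.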