
Scaling Laws for Fine-Grained Mixture of Experts #38

Author

  • Jakub Krajewski (University of Warsaw, IDEAS NCBR)

Abstract

  • Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of Large Language Models.
  • Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts.
  • We establish scaling laws for fine-grained MoE, taking into account the number of training tokens, model size, and granularity.
  • Our findings not only show that MoE models consistently outperform dense Transformers but also highlight that the efficiency gap between dense and MoE models widens as we scale up the model size and training budget.
  • We demonstrate that the common practice of setting the size of experts in MoE to mirror the feed-forward layer is not optimal at almost any computational budget.

Introduction

  • Our results suggest that a compute-optimal MoE model trained with a budget of 10^20 FLOPs will achieve the same quality as a dense Transformer trained with a 20× greater computing budget, with the compute savings rising steadily and exceeding 40× once a budget of 10^25 FLOPs is surpassed (see Figure 1).
    [Figure 1: relative compute savings of compute-optimal MoE models over dense Transformers as the training budget grows]
  • Importantly, we show that the standard practice of fixing the size of experts in MoE to be the same as the feed-forward layer is almost never optimal.

Our main contributions are:

  1. Introducing a new hyperparameter - granularity. Adjusting this parameter allows us to determine the optimal size of experts in MoE models, which translates into increased efficiency.
  2. Deriving new scaling laws for MoE models that incorporate variable training duration, the number of parameters, and granularity. Such scaling laws allow us to calculate optimal training hyperparameters for MoE models (see the sketch after this list).
  3. Demonstrating that, with optimal settings, MoE models can always outperform traditional Transformers at any computing budget. This conclusion is contrary to the results of Clark et al. (2022).
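For intuition, the fitted law can be sketched as a Chinchilla-style parametric form extended with a granularity term. The sketch below is a reconstruction for illustration only: the symbols N (non-embedding parameters), D (training tokens), and G (granularity) follow the paper, but the exact parameterization should be treated as an assumption here, and the fitted constants are omitted.

```latex
% Illustrative sketch (assumption), not the paper's verbatim equation:
% loss as a function of model size N, training tokens D, and granularity G,
% with a, b, c, g and the exponents alpha, beta, gamma fitted to experiments.
\mathcal{L}(N, D, G) = c + \left(\frac{g}{G^{\gamma}} + a\right)\frac{1}{N^{\alpha}} + \frac{b}{D^{\beta}}
```

Minimizing such a fit under a fixed FLOPs budget is what yields the compute-optimal N, D, and G discussed above.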

Background

  • skip

Granularity

  • In this work, we suggest an alternative approach where the hidden dimension of the expert is not necessarily set to mirror that of the standard feed-forward layer. Instead, it can be adjusted to whatever value is most effective.
  • This approach allows the configuration of MoE to be articulated in terms of two key hyperparameters: granularity (G) and expansion rate (E). In the following parts of this work, we will also use the term active parameters to refer to the non-embedding parameters used to produce output for a single token, excluding routing parameters. A minimal configuration sketch follows below.
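
To make the two hyperparameters concrete, here is a minimal Python sketch (my own illustration, not code from the paper). It assumes granularity G = d_ff / d_expert, expansion rate E = total expert parameters relative to a single dense feed-forward layer, and that G experts are activated per token so active parameters stay roughly equal to the dense layer; the helper name moe_layer_config is hypothetical.

```python
# Minimal sketch (assumptions noted above): derive an MoE layer layout from
# granularity G and expansion rate E, given the dense model dimensions.

def moe_layer_config(d_model: int, d_ff: int, G: int, E: int) -> dict:
    """Return the expert layout implied by granularity G and expansion rate E."""
    d_expert = d_ff // G                    # each expert is G times narrower than the dense FFN
    n_experts = E * G                       # total experts, so total expert params = E x dense FFN params
    n_active = G                            # experts routed per token, keeping active params ~ dense FFN

    dense_ffn_params = 2 * d_model * d_ff   # up- and down-projection of one dense FFN
    total_expert_params = n_experts * 2 * d_model * d_expert
    active_params = n_active * 2 * d_model * d_expert

    return {
        "d_expert": d_expert,
        "n_experts": n_experts,
        "n_active_experts": n_active,
        "dense_ffn_params": dense_ffn_params,
        "total_expert_params": total_expert_params,
        "active_params": active_params,
    }


# Standard MoE (G=1, experts mirror the FFN) vs. fine-grained MoE (G=4) at the same E=8:
print(moe_layer_config(d_model=1024, d_ff=4096, G=1, E=8))
print(moe_layer_config(d_model=1024, d_ff=4096, G=4, E=8))
```

With G = 1 this reduces to the standard setup of E experts, each mirroring the feed-forward layer; increasing G keeps both the total and the active parameter counts fixed while splitting the same capacity across more, smaller experts.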

