Welcome to the LLäMmlein codebase — used to train our German, decoder-only LLäMmlein model family from scratch. We currently provide three model sizes — 120M, 1B, and 7B — all trained on the same curated dataset for consistent scaling comparisons. All components of the training process, including code, datasets, and intermediate checkpoints, are openly released. Find more information in our paper. This repository builds on TinyLlama as its backbone.
- ⚡️ Flash Attention 3
- 🏎️ Fast dataloader with caching for improved speed
- ✏️ Datapoint logging for better tracking
All models are available on Hugging Face, along with:
- Intermediate checkpoints
- Data logging metadata
| Model Size | Hugging Face Link |
|---|---|
| LLäMmlein 7B | LLäMmlein_7B |
| LLäMmlein 1B | LLäMmlein_1B |
| LLäMmlein 120M | LLäMmlein_120M |
These earlier versions were used in our initial experiments (see accompanying paper):
| Model | Hugging Face Link |
|---|---|
| LLäMmlein 1B (prerelease) | LLäMmlein_1B_prerelease |
| LLäMmlein 120M (prerelease) | LLäMmlein_120M_prerelease |
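Both the current and the prerelease models can be loaded directly with the transformers library. Below is a minimal sketch; the repository id and the revision name for intermediate checkpoints are assumptions here and should be taken from the respective Hugging Face model card.

```python
# Minimal sketch: loading a LLäMmlein model with Hugging Face transformers.
# The repository id and the revision name are assumptions; check the model
# card for the exact identifiers of the final model and its intermediate checkpoints.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "LSX-UniWue/LLaMmlein_1B"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)
# Intermediate checkpoints are published as revisions on the Hub, e.g.:
# model = AutoModelForCausalLM.from_pretrained(repo_id, revision="iter-00100000")  # assumed revision name

prompt = "Die Würzburger Residenz ist"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```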
All our models are trained on our filtered version of the RedPajama V2 dataset, available under LLaMmlein-Dataset. In addition to paragraph deduplication, we applied a token-to-word ratio filter to exclude low-quality or noisy text.
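To illustrate the idea behind that filter: paragraphs whose tokenization needs unusually many tokens per whitespace-separated word tend to be noisy (markup, boilerplate, non-German text). The sketch below is illustrative only and not the exact implementation used for the released dataset; the tokenizer repository id and the threshold are assumptions.

```python
# Illustrative sketch of a token-to-word ratio filter; not the exact logic
# used to build the released dataset. Tokenizer repo id and threshold are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LSX-UniWue/LLaMmlein_1B")  # assumed repo id

def token_to_word_ratio(text: str) -> float:
    """Tokenizer tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        return float("inf")
    token_ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    return len(token_ids) / len(words)

def keep_paragraph(text: str, max_ratio: float = 3.0) -> bool:
    # Paragraphs with an unusually high ratio are likely noisy and get dropped.
    return token_to_word_ratio(text) <= max_ratio
```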
We provide our requirements in TinyLlama/cluster/building/requirements.txt. They can be installed using:
```bash
pip install -r requirements.txt
```

In addition, we provide a containerized environment using Singularity (Apptainer).
The .sif image can be built using the provided definition file llammlein.def (located in TinyLlama/cluster/building/):
```bash
bash build-image.sh
```

tinyllama_fabric.py contains the general training script. Before starting, several parameters have to be set; the most important are listed below (a sketch of how these settings might look follows the list):
- model_name: Identifier specified in lit-gpt/config.py
- global_batch_size: Batch size across all GPUs
- learning_rate
- micro_batch_size
- max_steps
- sharding_strategy: The efficiency of the FSDP sharding strategy can differ drastically between cluster setups
- train_data_dir: Path to the training dataset (line 112)
- state_dict_type: Depending on the size of the model you are training and the available GPUs, it can make sense to set this parameter to sharded instead of full (line 131)
- tokenizer_path: Path to the tokenizer (line 417)
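For orientation, these settings are plain module-level variables in pretrain/tinyllama_fabric.py. The sketch below only illustrates how such a configuration might look; all concrete values, paths, and the model name are placeholders rather than our training recipe, and the line numbers in the list above point to the actual locations in the script.

```python
# Sketch of the module-level settings in pretrain/tinyllama_fabric.py.
# All values below are placeholders for illustration, not the released recipe.
model_name = "tiny_LLaMA_1b"        # placeholder; must match an identifier in lit-gpt/config.py
global_batch_size = 512             # effective batch size across all GPUs
learning_rate = 4e-4
micro_batch_size = 8                # per-GPU batch size; with global_batch_size this sets gradient accumulation
max_steps = 1_000_000
sharding_strategy = "HYBRID_SHARD"  # FSDP sharding strategy; tune per cluster
state_dict_type = "full"            # or "sharded" for large models / many GPUs (line 131)
train_data_dir = "/path/to/tokenized/train"  # line 112
tokenizer_path = "/path/to/tokenizer"        # line 417
```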
Once all training parameters are configured, training can be launched using a script similar to this one:
```bash
#!/bin/bash
#SBATCH --job-name="training"
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:h100:4
#SBATCH --partition=h100
#SBATCH --time=24:00:00

# Set all cluster specific variables i.e. HTTP_PROXY, NCCL_IB_HCA ...

export MODEL_NAME="$OUTDIR" # location to save intermediate checkpoints

srun apptainer exec llammlein.sif python pretrain/tinyllama_fabric.py
```

Our specific .sh file with all our selected environment variables can be found in TinyLlama/cluster/exec.
Training can be resumed from a specific checkpoint using:
```bash
srun apptainer exec llammlein.sif python pretrain/tinyllama_fabric.py --resume $MODEL_NAME/iter-00100000-ckpt.pth
```

The checkpoints saved during training are not Hugging Face compatible; to convert them, the exemplary script scripts/create.sh has to be executed. Please make sure to set all paths correctly in the .sh file.
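After conversion it is worth checking that the exported directory loads cleanly with transformers. A minimal sketch, assuming the converted model was written to a local directory (the path is a placeholder):

```python
# Sanity check for a converted checkpoint; the output path is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

converted_dir = "/path/to/converted/hf_checkpoint"  # placeholder output path of scripts/create.sh

tokenizer = AutoTokenizer.from_pretrained(converted_dir)
model = AutoModelForCausalLM.from_pretrained(converted_dir)

inputs = tokenizer("Ein kurzer Test:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```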
If you use LLäMmlein in your work, please cite our paper:

```bibtex
@inproceedings{pfister-etal-2025-llammlein,
title = {{LL}{\"a}{M}mlein: Transparent, Compact and Competitive {G}erman-Only Language Models from Scratch},
author = "Pfister, Jan and
Wunderle, Julia and
Hotho, Andreas",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.111/",
pages = "2227--2246",
ISBN = "979-8-89176-251-0",
abstract = {We transparently create two German-only decoder models, LL{\"a}Mmlein 120M and 1B, from scratch and publish them, along with the training data, for the (German) NLP research community to use. The model training involved several key steps, including data preprocessing/filtering, the creation of a German tokenizer, the training itself, as well as the evaluation of the final models on various benchmarks, also against existing models. Throughout the training process, multiple checkpoints were saved in equal intervals and analyzed using the German SuperGLEBer benchmark to gain insights into the models' learning process. Compared to state-of-the-art models on the SuperGLEBer benchmark, both LL{\"a}Mmlein models performed competitively, consistently matching or surpassing models with similar parameter sizes. The results show that the models' quality scales with size as expected, but performance improvements on some tasks plateaued early during training, offering valuable insights into resource allocation for future models.}
}
```