Check out the collection of models pretrained with this code: BERTmosphere.
Thank you, Nishan Chatterjee, for the creative collection name!
A quick usage example with the fill-mask pipeline:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Use the GPU if available (device index for the pipeline)
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10 * ">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```

Output:
```
The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth . — 0.6922
the increase in greenhouse gas ... affected the mass balance of the earth . — 0.0631
the increase in greenhouse gas ... affected the radiation balance of the earth . — 0.0606
the increase in greenhouse gas ... affected the radiative balance of the earth . — 0.0517
the increase in greenhouse gas ... affected the carbon balance of the earth . — 0.0365
```

- Create a Docker image from the folder containing ./Dockerfile:
  ```bash
  docker build -t bert_pretraining:1.0 .
  ```
- Run Docker Compose in the folder where ./docker-compose.yml is and open it in VS Code:
  ```bash
  docker compose up
  ```

Script execution order: `sentence_tokenizer.py` > `tokenization_remainder.py` > `sentences2batches.py` > `csv2dataset.py`
- Perform sentence tokenization on the data containing Titles and Content from the CSV file using sentence_tokenizer.py (a sketch of this step follows the list below) -> DURATION: 12h for 180,000 papers
  ```bash
  python3 sentence_tokenizer.py > process_log.txt
  ```
- Use the created `process_log.txt` to create [left2process.txt](left2process.txt), which contains the failed batches
- Use tokenization_remainder.py to create a new CSV file based on the failed batches
- Repeat the first 3 steps until all data is tokenized into sentences
- Use sentences2batches.py to create a CSV with text rows, each containing ~11 sentences (~512 tokens when tokenized, for MLM training; see the batching sketch after this list) -> DURATION: 1 minute
- Use csv2dataset.py to create the final dataset (using the datasets library) for training -> DURATION: 3h for 3,600,000 rows -> NOTE: Watch out for RobertaTokenizer/BertTokenizer
- 💊 💊 Step 6 needs to be performed for every model with its tokenizer!!! 💊 💊
- Start training BERT with model_training.py, RoBERTa with model_training_roberta.py, or from scratch with model_training_fromscratch.py or model_training_fromscratch_roberta.py (see the MLM training sketch after this list) -> DURATION: Depends on batch size, sequence length and hardware; 13 days for this setup
  Note: Adjust the training parameters and directories according to your needs!
- The final checkpoint is saved in binary format (convenient for older code, but use with caution!). To save other checkpoints in the desired format, use model_chkpt2bin.py (see the conversion sketch after this list).
- linkbert_prep is used for initial stats on the data and fetching citations using Semantic Scholar/Crossref
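
A minimal sketch of the sentence-tokenization step (what sentence_tokenizer.py does conceptually); the NLTK Punkt tokenizer, the file names, and the "Title"/"Content" column names are assumptions for illustration, not the repository's exact implementation.

```python
# Illustrative sketch of sentence tokenization from a CSV of papers.
# Assumed: "papers.csv" with "Title" and "Content" columns; output "sentences.csv".
import pandas as pd
import nltk

nltk.download("punkt", quiet=True)      # Punkt models for sent_tokenize
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases

papers = pd.read_csv("papers.csv")

sentences = []
for _, row in papers.iterrows():
    # Combine title and body text, then split into sentences
    text = f"{row['Title']}. {row['Content']}"
    sentences.extend(nltk.sent_tokenize(text))

pd.DataFrame({"sentence": sentences}).to_csv("sentences.csv", index=False)
```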
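
A minimal sketch of the batching idea behind sentences2batches.py: consecutive sentences are greedily packed into text rows of at most ~512 tokens (about 11 sentences on average). The file names, column names, and the greedy strategy are illustrative assumptions; count tokens with the tokenizer of the model you intend to pretrain.

```python
# Illustrative sketch: pack consecutive sentences into ~512-token rows for MLM.
# Assumed: input "sentences.csv" with a "sentence" column; output "mlm_rows.csv".
import pandas as pd
from transformers import AutoTokenizer

# Use the tokenizer that matches the model you will pretrain
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MAX_TOKENS = 512

def pack_sentences(sentences, max_tokens=MAX_TOKENS):
    """Greedily group consecutive sentences so each row stays within max_tokens tokens."""
    rows, current, current_len = [], [], 0
    for sent in sentences:
        n_tokens = len(tokenizer.tokenize(sent))
        if current and current_len + n_tokens > max_tokens:
            rows.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        rows.append(" ".join(current))
    return rows

sentences = pd.read_csv("sentences.csv")["sentence"].astype(str).tolist()
pd.DataFrame({"text": pack_sentences(sentences)}).to_csv("mlm_rows.csv", index=False)
```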
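
A minimal sketch of the dataset-creation and MLM-training steps (csv2dataset.py and the model_training*.py scripts are more involved). The base checkpoint, file names, and hyperparameters below are placeholders, not the values used for the released models; pick the tokenizer/model pair that matches what you are training (the RobertaTokenizer/BertTokenizer note above).

```python
# Illustrative sketch: build a Hugging Face dataset from the batched CSV and run MLM training.
# Assumed: "mlm_rows.csv" with a "text" column; placeholder checkpoint and hyperparameters.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "bert-base-uncased"  # placeholder; use the matching tokenizer/model pair
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Load the CSV produced by the batching step and tokenize it
dataset = load_dataset("csv", data_files={"train": "mlm_rows.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```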
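
A minimal sketch of converting an intermediate checkpoint to the older pytorch_model.bin format, which is roughly what model_chkpt2bin.py is for. The checkpoint path is a placeholder, and `safe_serialization=False` requires a transformers version that supports safetensors serialization.

```python
# Illustrative sketch: re-save a Trainer checkpoint as pytorch_model.bin.
# The checkpoint directory below is a placeholder.
from transformers import AutoModelForMaskedLM

checkpoint_dir = "./checkpoints/checkpoint-10000"  # placeholder path
export_dir = "./export_bin"

model = AutoModelForMaskedLM.from_pretrained(checkpoint_dir)

# safe_serialization=False writes pytorch_model.bin instead of model.safetensors,
# which older downstream code may expect (use with caution, as noted above).
model.save_pretrained(export_dir, safe_serialization=False)
# Save the matching tokenizer alongside it if downstream code needs one.
```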
If you use this code or the pretrained models, please cite:

```bibtex
@Article{Poleksić2025,
author={Poleksi{\'{c}}, Andrija
and Martin{\v{c}}i{\'{c}}-Ip{\v{s}}i{\'{c}}, Sanda},
title={Pretraining and evaluation of BERT models for climate research},
journal={Discover Applied Sciences},
year={2025},
month={Oct},
day={24},
volume={7},
number={11},
pages={1278},
issn={3004-9261},
doi={10.1007/s42452-025-07740-5},
url={https://doi.org/10.1007/s42452-025-07740-5}
}
```

- Used PDF2TXT repo
- Add a training script similar to evidence_synthesis from dspoka.
- Make a framework for training and evaluation of all available models: BERT, RoBERTa, DeBERTa, ...
- https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorForPermutationLanguageModeling
- https://www.youtube.com/watch?v=IC9FaVPKlYc&t=85s
- https://towardsdatascience.com/how-to-train-bert-aaad00533168
- https://github.com/jamescalam/transformers/blob/main/course/training/03_mlm_training.ipynb
- https://aclanthology.org/2023.bionlp-1.19.pdf
- https://github.com/stelladk/PretrainingBERT
- https://github.com/stelladk/PretrainingBERT/blob/main/pretrain.py#L6
- https://huggingface.co/docs/transformers/en/main_classes/data_collator
- https://huggingface.co/docs/transformers/en/notebooks
- https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb
- https://github.com/huggingface/transformers/tree/main/examples
- https://huggingface.co/docs/datasets/en/loading
- https://www.geeksforgeeks.org/python-random-sample-function/
- https://www.youtube.com/watch?v=IcrN_L2w0_Y
- https://thepythoncode.com/article/pretraining-bert-huggingface-transformers-in-python
- https://huggingface.co/docs/transformers/main_classes/trainer#trainingarguments
- https://keras.io/examples/nlp/pretraining_BERT/
- https://medium.com/data-and-beyond/complete-guide-to-building-bert-model-from-sratch-3e6562228891
- https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt