
BERTmosphere

Check out the collection of models pretrained with this code: BERTmosphere.

Thank you, Nishan Chatterjee, for the creative collection name!

Usage example

from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Move model to GPU if available
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10*">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")

Output:

The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth . — 0.6922
the increase in greenhouse gas ... affected the mass balance of the earth . — 0.0631
the increase in greenhouse gas ... affected the radiation balance of the earth . — 0.0606
the increase in greenhouse gas ... affected the radiative balance of the earth . — 0.0517
the increase in greenhouse gas ... affected the carbon balance of the earth . — 0.0365

Docker initialization

  1. Create a Docker image from the folder containing ./Dockerfile:
docker build -t bert_pretraining:1.0 . 
  2. Run Docker Compose in the folder where ./docker-compose.yml is located and open it in VS Code:
docker compose up
  • Command for server is available here.

  • Conda environment procedure is available here.

Workflow

sentence_tokenizer.py > tokenization_remainder.py > sentences2batches.py > csv2dataset.py

  1. Perform sentence tokenization on the data containing the Titles and Content from the CSV file using sentence_tokenizer.py (sketched after this list) -> DURATION: 12h for 180,000 papers
python3 sentence_tokenizer.py > process_log.txt
  2. Use the created process_log.txt to create left2process.txt, which contains the failed batches
  3. Use tokenization_remainder.py to create a new CSV file based on the failed batches
  4. Repeat the first 3 steps until all data is tokenized into sentences
  5. Use sentences2batches.py to create a CSV with text rows, each containing ~11 sentences (~512 tokens when tokenized, for MLM training; sketched below) -> DURATION: 1 minute
  6. Use csv2dataset.py to create the final dataset (using the datasets library) for training (sketched below) -> DURATION: 3h for 3,600,000 rows -> NOTE: Watch out for RobertaTokenizer/BertTokenizer
  7. 💊 💊 Step 6 needs to be performed for every model with its tokenizer!!! 💊 💊
  8. Start training: BERT with model_training.py, RoBERTa with model_training_roberta.py, or BERT from scratch with model_training_fromscratch.py or model_training_fromscratch_roberta.py (training sketched below) -> DURATION: Depends on batch size, sequence length, and hardware; 13 days for this setup

Note: Adjust the training parameters and directories to your needs!

  9. The final checkpoint is saved in binary format (convenient for older code, but use with caution!). To save other checkpoints in the desired format, use model_chkpt2bin.py (sketched below).
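
The preprocessing and training scripts themselves are not reproduced in this README; the sketches below only illustrate the idea behind each step. For step 1, sentence tokenization boils down to splitting the Title and Content columns into sentences. A minimal sketch, assuming nltk and pandas and illustrative file names (the real sentence_tokenizer.py batches the work and logs failures):

import pandas as pd
from nltk.tokenize import sent_tokenize  # needs: nltk.download("punkt")

# Illustrative input file with Title and Content columns
df = pd.read_csv("papers.csv")

sentences = []
for _, paper in df.iterrows():
    text = f"{paper['Title']}. {paper['Content']}"
    sentences.extend(sent_tokenize(text))

pd.DataFrame({"sentence": sentences}).to_csv("sentences.csv", index=False)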
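
Step 5 packs consecutive sentences into text rows of roughly 512 tokens. A rough sketch with the same illustrative file names and a generic BERT tokenizer (the actual sentences2batches.py may count tokens differently):

import pandas as pd
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentences = pd.read_csv("sentences.csv")["sentence"].astype(str).tolist()

rows, current, current_len = [], [], 0
for sent in sentences:
    n_tokens = len(tokenizer.tokenize(sent))
    # Flush the current row once adding another sentence would exceed ~512 tokens
    if current and current_len + n_tokens > 512:
        rows.append(" ".join(current))
        current, current_len = [], 0
    current.append(sent)
    current_len += n_tokens
if current:
    rows.append(" ".join(current))

pd.DataFrame({"text": rows}).to_csv("batched_text.csv", index=False)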
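
Step 6 turns the batched CSV into a Hugging Face datasets dataset, tokenized with the target model's tokenizer (hence the BertTokenizer/RobertaTokenizer warning above). A minimal sketch with illustrative names:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
raw = load_dataset("csv", data_files={"train": "batched_text.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = raw.map(tokenize, batched=True, remove_columns=["text"])
dataset.save_to_disk("mlm_dataset")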
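
Step 8, MLM training, roughly corresponds to the standard Hugging Face Trainer recipe with DataCollatorForLanguageModeling; the hyperparameters and paths below are placeholders, not the values used for the released models:

from datasets import load_from_disk
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
dataset = load_from_disk("mlm_dataset")

# Randomly mask 15% of tokens for the masked-language-modeling objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=dataset["train"], data_collator=collator)
trainer.train()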
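
For the checkpoint conversion in step 9, assuming "binary format" refers to the legacy pytorch_model.bin serialization, the core of model_chkpt2bin.py presumably amounts to reloading a checkpoint and re-saving it without safetensors; a sketch with an illustrative checkpoint path:

from transformers import AutoModelForMaskedLM

# Illustrative checkpoint directory produced by Trainer
model = AutoModelForMaskedLM.from_pretrained("checkpoints/checkpoint-10000")

# safe_serialization=False writes pytorch_model.bin instead of model.safetensors
model.save_pretrained("checkpoints/checkpoint-10000-bin", safe_serialization=False)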

Model training info

LinkBERT

  1. linkbert_prep is used for initial statistics on the data and for fetching citations via Semantic Scholar/Crossref (a rough sketch of the Semantic Scholar lookup follows)
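
The linkbert_prep scripts are not shown here; as a rough illustration of the citation-fetching side, the Semantic Scholar Graph API can return a paper's references by DOI. The endpoint and fields below reflect the public API, not necessarily how linkbert_prep calls it, and the DOI is just an example:

import requests

doi = "10.1007/s42452-025-07740-5"  # example DOI
url = f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}"
params = {"fields": "title,references.title,references.externalIds"}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()
paper = resp.json()

print(paper.get("title"))
for ref in paper.get("references") or []:
    print("-", ref.get("title"))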

Cite

@Article{Poleksić2025,
author={Poleksi{\'{c}}, Andrija
and Martin{\v{c}}i{\'{c}}-Ip{\v{s}}i{\'{c}}, Sanda},
title={Pretraining and evaluation of BERT models for climate research},
journal={Discover Applied Sciences},
year={2025},
month={Oct},
day={24},
volume={7},
number={11},
pages={1278},
issn={3004-9261},
doi={10.1007/s42452-025-07740-5},
url={https://doi.org/10.1007/s42452-025-07740-5}
}

Data collection

TODO

Links
