Check out the collection of models pretrained with this code: BERTmosphere.
Thank you, Nishan Chatterjee, for the creative collection name!
A quick usage example with the fill-mask pipeline:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Use the GPU if available (device index for the pipeline)
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10 * ">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```

Output:
```
The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth . — 0.6922
the increase in greenhouse gas ... affected the mass balance of the earth . — 0.0631
the increase in greenhouse gas ... affected the radiation balance of the earth . — 0.0606
the increase in greenhouse gas ... affected the radiative balance of the earth . — 0.0517
the increase in greenhouse gas ... affected the carbon balance of the earth . — 0.0365
```

- Create a Docker image from the folder containing ./Dockerfile:
  ```bash
  docker build -t bert_pretraining:1.0 .
  ```
- Run Docker Compose in the folder where ./docker-compose.yml is and open it in VS Code:
  ```bash
  docker compose up
  ```

Script execution order: `sentence_tokenizer.py` > `tokenization_remainder.py` > `sentences2batches.py` > `csv2dataset.py`
- Perform sentence tokenization on the data containing Titles and Content from the CSV file using sentence_tokenizer.py (a sketch of this step follows the list below) -> DURATION: 12h for 180,000 papers
  ```bash
  python3 sentence_tokenizer.py > process_log.txt
  ```
- Use the created `process_log.txt` to create [left2process.txt](left2process.txt), which contains the failed batches
- Use tokenization_remainder.py to create a new CSV file based on the failed batches
- Repeat the first 3 steps until all data is tokenized into sentences
- Use sentences2batches.py to create a CSV with text rows, each containing ~11 sentences (~512 tokens when tokenized, for MLM training; see the batching sketch after this list) -> DURATION: 1 minute
- Use csv2dataset.py to create the final dataset (using the datasets library) for training -> DURATION: 3h for 3,600,000 rows -> NOTE: Watch out for RobertaTokenizer/BertTokenizer
- 💊 💊 Step 6 needs to be performed for every model with its tokenizer!!! 💊 💊
- Start training BERT with model_training.py, RoBERTa with model_training_roberta.py, or from scratch with model_training_fromscratch.py or model_training_fromscratch_roberta.py (see the MLM training sketch after this list) -> DURATION: Depends on batch size, sequence length and hardware; 13 days for this setup
  Note: Adjust the training parameters and directories according to your needs!
- The final checkpoint is saved in binary format (convenient for older code, but use with caution!). To save other checkpoints in the desired format, use model_chkpt2bin.py (see the conversion sketch after this list).
- linkbert_prep is used for initial stats on the data and fetching citations using Semantic Scholar/Crossref
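
A minimal sketch of the sentence-tokenization step (what sentence_tokenizer.py does conceptually); the NLTK Punkt tokenizer, the file names, and the "Title"/"Content" column names are assumptions for illustration, not the repository's exact implementation.

```python
# Illustrative sketch of sentence tokenization from a CSV of papers.
# Assumed: "papers.csv" with "Title" and "Content" columns; output "sentences.csv".
import pandas as pd
import nltk

nltk.download("punkt", quiet=True)      # Punkt models for sent_tokenize
nltk.download("punkt_tab", quiet=True)  # needed by newer NLTK releases

papers = pd.read_csv("papers.csv")

sentences = []
for _, row in papers.iterrows():
    # Combine title and body text, then split into sentences
    text = f"{row['Title']}. {row['Content']}"
    sentences.extend(nltk.sent_tokenize(text))

pd.DataFrame({"sentence": sentences}).to_csv("sentences.csv", index=False)
```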
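
A minimal sketch of the batching idea behind sentences2batches.py: consecutive sentences are greedily packed into text rows of at most ~512 tokens (about 11 sentences on average). The file names, column names, and the greedy strategy are illustrative assumptions; count tokens with the tokenizer of the model you intend to pretrain.

```python
# Illustrative sketch: pack consecutive sentences into ~512-token rows for MLM.
# Assumed: input "sentences.csv" with a "sentence" column; output "mlm_rows.csv".
import pandas as pd
from transformers import AutoTokenizer

# Use the tokenizer that matches the model you will pretrain
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
MAX_TOKENS = 512

def pack_sentences(sentences, max_tokens=MAX_TOKENS):
    """Greedily group consecutive sentences so each row stays within max_tokens tokens."""
    rows, current, current_len = [], [], 0
    for sent in sentences:
        n_tokens = len(tokenizer.tokenize(sent))
        if current and current_len + n_tokens > max_tokens:
            rows.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n_tokens
    if current:
        rows.append(" ".join(current))
    return rows

sentences = pd.read_csv("sentences.csv")["sentence"].astype(str).tolist()
pd.DataFrame({"text": pack_sentences(sentences)}).to_csv("mlm_rows.csv", index=False)
```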
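
A minimal sketch of the dataset-creation and MLM-training steps (csv2dataset.py and the model_training*.py scripts are more involved). The base checkpoint, file names, and hyperparameters below are placeholders, not the values used for the released models; pick the tokenizer/model pair that matches what you are training (the RobertaTokenizer/BertTokenizer note above).

```python
# Illustrative sketch: build a Hugging Face dataset from the batched CSV and run MLM training.
# Assumed: "mlm_rows.csv" with a "text" column; placeholder checkpoint and hyperparameters.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "bert-base-uncased"  # placeholder; use the matching tokenizer/model pair
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# Load the CSV produced by the batching step and tokenize it
dataset = load_dataset("csv", data_files={"train": "mlm_rows.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens for the MLM objective
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```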
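
A minimal sketch of converting an intermediate checkpoint to the older pytorch_model.bin format, which is roughly what model_chkpt2bin.py is for. The checkpoint path is a placeholder, and `safe_serialization=False` requires a transformers version that supports safetensors serialization.

```python
# Illustrative sketch: re-save a Trainer checkpoint as pytorch_model.bin.
# The checkpoint directory below is a placeholder.
from transformers import AutoModelForMaskedLM

checkpoint_dir = "./checkpoints/checkpoint-10000"  # placeholder path
export_dir = "./export_bin"

model = AutoModelForMaskedLM.from_pretrained(checkpoint_dir)

# safe_serialization=False writes pytorch_model.bin instead of model.safetensors,
# which older downstream code may expect (use with caution, as noted above).
model.save_pretrained(export_dir, safe_serialization=False)
# Save the matching tokenizer alongside it if downstream code needs one.
```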
If you use this code or the pretrained models, please cite:

```bibtex
@Article{Poleksić2025,
author={Poleksi{\'{c}}, Andrija
and Martin{\v{c}}i{\'{c}}-Ip{\v{s}}i{\'{c}}, Sanda},
title={Pretraining and evaluation of BERT models for climate research},
journal={Discover Applied Sciences},
year={2025},
month={Oct},
day={24},
volume={7},
number={11},
pages={1278},
issn={3004-9261},
doi={10.1007/s42452-025-07740-5},
url={https://doi.org/10.1007/s42452-025-07740-5}
}
```

- Used PDF2TXT repo
- Add a training script similar to evidence_synthesis from dspoka.
- Make a framework for training and evaluation of all available models: BERT, RoBERTa, DeBERTa, ...
- https://huggingface.co/docs/transformers/en/main_classes/data_collator#transformers.DataCollatorForPermutationLanguageModeling
- https://www.youtube.com/watch?v=IC9FaVPKlYc&t=85s
- https://towardsdatascience.com/how-to-train-bert-aaad00533168
- https://github.com/jamescalam/transformers/blob/main/course/training/03_mlm_training.ipynb
- https://aclanthology.org/2023.bionlp-1.19.pdf
- https://github.com/stelladk/PretrainingBERT
- https://github.com/stelladk/PretrainingBERT/blob/main/pretrain.py#L6
- https://huggingface.co/docs/transformers/en/main_classes/data_collator
- https://huggingface.co/docs/transformers/en/notebooks
- https://github.com/huggingface/notebooks/blob/main/examples/language_modeling_from_scratch.ipynb
- https://github.com/huggingface/transformers/tree/main/examples
- https://huggingface.co/docs/datasets/en/loading
- https://www.geeksforgeeks.org/python-random-sample-function/
- https://www.youtube.com/watch?v=IcrN_L2w0_Y
- https://thepythoncode.com/article/pretraining-bert-huggingface-transformers-in-python
- https://huggingface.co/docs/transformers/main_classes/trainer#trainingarguments
- https://keras.io/examples/nlp/pretraining_BERT/
- https://medium.com/data-and-beyond/complete-guide-to-building-bert-model-from-sratch-3e6562228891
- https://huggingface.co/learn/nlp-course/chapter7/6?fw=pt