Sequence context rate modeling with xgboost
Input data sets for the encode command should be in TSV file format with no column headers or row indices. Context counts should be a TSV file with two columns: sequence context and context count. Context weights should be a TSV file with two columns: sequence context and context weight (opportunities).
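For illustration, the following Python sketch writes hypothetical 3-mer counts and weights files in the expected two-column, header-free TSV format (the contexts, values, and output paths are made up for this example):

# Sketch: writing hypothetical 3-mer count and weight tables in the
# expected two-column TSV format (all values are illustrative only).
import pandas as pd

counts = pd.DataFrame({"context": ["AAA", "AAC", "AAG"], "count": [120, 85, 64]})
weights = pd.DataFrame({"context": ["AAA", "AAC", "AAG"], "weight": [100000, 98000, 97500]})

# header=False and index=False match the "no column headers or row indices" requirement
counts.to_csv("train_counts.txt", sep="\t", header=False, index=False)
weights.to_csv("train_weights.txt", sep="\t", header=False, index=False)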
All commands and their descriptions can be found by running:
uv run contextcountbooster --help
Usage and all parameters of a specific command can be found by running:
uv run contextcountbooster command --help
Run the encode command to one-hot encode k-mer sequences and return a data set with the explanatory and target variables for modeling:
uv run contextcountbooster encode --output_dir ../output/ --output_prefix train_ --encoding 4 ../input/train_counts.txt ../input/train_weights.txt
The encode command outputs a combined TSV file with columns representing the k-mer, count, weight, rate (count/weight), and the sequence encoding.
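A quick way to inspect the encoded output is with pandas, as in the sketch below; whether the file carries a header row is an assumption, so adjust the header argument if needed.

# Sketch: inspecting the encoded training set produced by the encode command.
# The column layout (k-mer, count, weight, rate, then the one-hot encoding
# columns) is described above; pass header=None if the file has no header row.
import pandas as pd

df = pd.read_csv("../output/train_3mers_4bitOHE.tsv", sep="\t")
print(df.shape)   # number of contexts x number of columns
print(df.head())  # first few rows: k-mer, count, weight, rate, encoding bits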
CCB trains a boosting model using the xgboost Python package.
Hyperparameter tuning via grid search cross-validation and subsequent model training can be run as:
uv run contextcountbooster train --encoding 4 --output_dir ../output/ --train_data ../output/train_3mers_4bitOHE.tsv
The default grids and hyperparameter values can be overridden with command line arguments described in the CCB documentation.
The train command outputs the optimal hyperparameter values and mean validation loss to bst_best_param.csv and the trained model to bst_best.json (used for loading and making predictions). Additionally, figures of feature gains and weights are generated, and cross-validation results with mean validation loss and mean Nagelkerke R2 values are written to training_res.csv. Lastly, the null model, represented by the mean rate of the training set, is written to mu_freq.csv.
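Since bst_best.json is a standard xgboost model file, it can also be loaded directly in Python. The sketch below loads it and prints the top features by gain, mirroring the generated feature-gain figures; this assumes the file was saved with xgboost's native save_model.

# Sketch: loading the trained model saved by the train command and
# inspecting feature importance (gain).
import xgboost as xgb

bst = xgb.Booster()
bst.load_model("../output/bst_best.json")

gains = bst.get_score(importance_type="gain")
for feat, gain in sorted(gains.items(), key=lambda kv: -kv[1])[:10]:
    print(feat, gain)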
Cross-validation and model training are feasible to run as a single computational task for up to 9-mers. Beyond 9-mers, it is recommended to split cross-validation into independent jobs and subsequently aggregate the CV results and train the full model.
To run cross-validation in parallel, a single combination of hyperparameters can be tested as:
uv run contextcountbooster train --encoding 4 --output_dir ../output/CV/ --dist_CV --l2_lambda 100 --eta 0.1 --max_depth 6 --alpha 0.01 --train_data ../output/train_3mers_4bitOHE.tsv
Here, five-fold cross-validation is run with an L2 regularization value of 100, a learning rate of 0.1, a maximum tree depth of 6, and a pseudoweight fraction of 0.01. With the --dist_CV flag, the train command outputs the CV results to a CSV file named after the parameter values (L2_100__ETA_0.1__MD_6__ALPHA_0.01.csv).
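One way to launch such jobs is to enumerate all hyperparameter combinations and emit one train command per combination, for example with a small Python sketch like the one below; the grid values are hypothetical and only the flags shown above are used.

# Sketch: enumerating hyperparameter combinations for distributed CV jobs.
# The grid values are illustrative; substitute your own grids.
from itertools import product

l2_lambdas = [1, 10, 100]
etas = [0.05, 0.1, 0.3]
max_depths = [4, 6, 8]
alphas = [0.001, 0.01]

for l2, eta, md, alpha in product(l2_lambdas, etas, max_depths, alphas):
    print(
        "uv run contextcountbooster train --encoding 4 --output_dir ../output/CV/ "
        f"--dist_CV --l2_lambda {l2} --eta {eta} --max_depth {md} --alpha {alpha} "
        "--train_data ../output/train_3mers_4bitOHE.tsv"
    )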
After running cross-validation in a distributed manner, the full model can be trained as:
uv run contextcountbooster train --aggregate_and_train_only --encoding 4 --output_dir ../output/ --train_data ../output/train_3mers_4bitOHE.tsv
The distributed CV results are automatically collected from the CV/ subfolder of the directory specified by --output_dir. The train command outputs the same files as described above. Distributed model training can also be run conveniently with the CCB Snakemake workflow.
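The aggregation step handles the per-combination result files for you; the sketch below only shows how one might browse them manually. The validation-loss column name used for sorting is an assumption and may differ in the actual output.

# Sketch: manually browsing per-combination distributed CV results.
# The "mean_val_loss" column name is an assumption.
import glob
import pandas as pd

files = glob.glob("../output/CV/*.csv")
results = pd.concat([pd.read_csv(f).assign(source=f) for f in files], ignore_index=True)
print(results.sort_values("mean_val_loss").head())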
The trained model can be used to make predictions as:
uv run contextcountbooster predict --output_dir ../output/ ../input/test_3mers_4bitOHE.tsv ../output/bst_best.json ../output/mu_freq.csv
Here, test_3mers_4bitOHE.tsv is the test data encoded with CCB's encode command, bst_best.json is the fitted model, and mu_freq.csv is the null model output by the train command.
The predict command outputs the test loss to test_nll.csv and the test set predictions (along with the observed context, weight, and frequency data) to test_pred.csv.
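Both output files can be inspected with pandas, as sketched below; the exact column names in test_pred.csv (observed context, weight, frequency, and predicted rate) are assumptions here.

# Sketch: loading the prediction outputs for inspection.
import pandas as pd

test_nll = pd.read_csv("../output/test_nll.csv")
print(test_nll)

pred = pd.read_csv("../output/test_pred.csv")
print(pred.head())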