Author: Arianna Rigamonti
This project focuses on predicting taxonomic identity and genetic composition based on codon usage bias levels. The dataset, sourced from Khomtchouk, Bohdan B. “Codon usage bias levels predict taxonomic identity and genetic composition” (bioRxiv, 2020), includes codon usage frequencies from multiple organisms.
The goal is to classify organisms based on codon usage and recover missing feature data using machine learning techniques.
The dataset consists of 13,028 organisms, with 67 attributes per sample (including codon frequencies), and features two classification tasks:
- Kingdom classification – Classifying organisms into 11 taxonomic groups.
- DNA type classification – Identifying DNA types across 11 categories.

The data is split into two files:
- Training set: 10,422 samples (`train.csv`).
- Test set: 2,606 samples (`test.csv`, missing the AGA codon frequency).
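
As a rough illustration of how the files could be prepared, the sketch below separates the classification targets from the numeric features. The column names `Kingdom`, `DNAtype`, and `AGA`, the helper `split_features_targets`, and the label values are assumptions made for the example, and a small synthetic frame stands in for `train.csv` so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical target column names; adjust to the real headers in train.csv.
TARGET_COLUMNS = ["Kingdom", "DNAtype"]


def split_features_targets(df, targets=TARGET_COLUMNS):
    """Separate numeric feature columns from the classification targets."""
    present = [c for c in targets if c in df.columns]
    X = df.drop(columns=present).select_dtypes(include="number")
    y = df[present]
    return X, y


# Synthetic stand-in for pd.read_csv("train.csv"); all values are placeholders.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((6, 3)), columns=["AGA", "AGG", "UUU"])
demo["Kingdom"] = ["a", "b", "a", "c", "b", "c"]
demo["DNAtype"] = [0, 0, 1, 2, 0, 1]

X, y = split_features_targets(demo)
print(X.shape, y.shape)
```

With the real data, the same helper would be applied to the frame returned by `pd.read_csv("train.csv")`, assuming the targets are stored under those column names.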

Data Analysis & Clustering:
- Visualizing data distribution.
- Identifying feature correlations.
- Evaluating clustering methods.
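
The steps above could look roughly like the following sketch. K-means and the silhouette score are choices made for this example, not necessarily the methods used in the notebook, and the feature matrix is synthetic, with hypothetical column names.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the codon-frequency matrix (assumption).
X_raw, _ = make_blobs(n_samples=300, centers=3, n_features=5, random_state=0)
X = pd.DataFrame(X_raw, columns=[f"codon_{i}" for i in range(5)])

# Pairwise Pearson correlations between features.
corr = X.corr()

# Compare candidate cluster counts by silhouette score on standardized features.
X_scaled = StandardScaler().fit_transform(X)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
print(corr.shape, best_k)
```

The correlation matrix can be passed to `seaborn.heatmap` for visual inspection, and the same loop can be repeated with other clustering estimators to compare methods.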

Classification:
- Feature selection.
- Model comparison (e.g., decision trees, SVMs, neural networks).
- Testing on the provided test set.
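
A minimal sketch of such a comparison, assuming univariate feature selection (`SelectKBest`) and 5-fold cross-validation, neither of which is confirmed by the description above. The data is synthetic and the hyperparameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for codon frequencies and a class label (assumption).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=3, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "svm": SVC(),
    "mlp": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
}

results = {}
for name, model in models.items():
    # Scaling and feature selection sit inside the pipeline so that each
    # cross-validation fold fits them on its own training split only.
    pipe = make_pipeline(StandardScaler(), SelectKBest(f_classif, k=10), model)
    results[name] = cross_val_score(pipe, X, y, cv=5, scoring="accuracy").mean()

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the preprocessing inside the pipeline avoids leaking information from validation folds into the selected features. The best-scoring pipeline would then be refit on the full training set before being applied to the test set.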

Regression for Missing Data:
- Training a regressor to predict missing AGA codon frequency.
- Comparing regression algorithms.
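
One way to set this up is to treat the AGA frequency as the regression target and the remaining codon frequencies as inputs. The sketch below compares two regressors on a held-out split; the choice of models, the metrics, and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: six "other codon" columns and an AGA target that
# depends on them linearly plus a little noise (assumption).
rng = np.random.default_rng(0)
others = rng.random((400, 6))
weights = np.array([0.3, -0.2, 0.1, 0.0, 0.25, 0.05])
aga = others @ weights + rng.normal(0, 0.01, 400)

X_tr, X_val, y_tr, y_val = train_test_split(
    others, aga, test_size=0.25, random_state=0
)

regressors = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

report = {}
for name, reg in regressors.items():
    pred = reg.fit(X_tr, y_tr).predict(X_val)
    report[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_val, pred))),
        "r2": float(r2_score(y_val, pred)),
    }

for name, metrics in report.items():
    print(name, metrics)
```

On the real data, the validation split would be carved out of `train.csv`, since the AGA column is not available in `test.csv`.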

Evaluating the Impact of Missing Data Recovery:
- Retraining the classification model after imputing the missing AGA codon values.
- Assessing performance improvements.
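
The before/after comparison could be sketched as follows: one classifier is trained without the AGA column, a second is trained with it and evaluated on test features whose AGA values come from a regressor. The models, the synthetic data, and all variable names are assumptions for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in: five observed features, an AGA column that depends on
# two of them, and a binary label that depends on AGA (assumption).
rng = np.random.default_rng(0)
n_train, n_test = 400, 100
X_all = rng.random((n_train + n_test, 5))
aga_all = 0.5 * X_all[:, 0] + 0.3 * X_all[:, 1] + rng.normal(0, 0.02, n_train + n_test)
y_all = (aga_all + X_all[:, 2] > 0.9).astype(int)

X_tr, X_te = X_all[:n_train], X_all[n_train:]
aga_tr = aga_all[:n_train]
y_tr, y_te = y_all[:n_train], y_all[n_train:]

# Baseline: classifier trained and tested without the AGA column.
base = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
acc_without = accuracy_score(y_te, base.predict(X_te))

# Impute AGA for the test set with a regressor fitted on the training set.
imputer = LinearRegression().fit(X_tr, aga_tr)
aga_te_hat = imputer.predict(X_te)

# Retrain with the true AGA column, then test with the imputed one.
full = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    np.column_stack([X_tr, aga_tr]), y_tr
)
acc_with = accuracy_score(y_te, full.predict(np.column_stack([X_te, aga_te_hat])))

print(f"without AGA: {acc_without:.3f}, with imputed AGA: {acc_with:.3f}")
```

Because the imputed values are derived from the other features, any gain from adding them may be modest, which is why the two accuracies are reported side by side rather than assumed to differ.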
- Language: Python
- Libraries Used: `numpy`, `pandas`, `scikit-learn`, `matplotlib`, `seaborn`
- Notebook: The analysis and models are implemented in a Jupyter Notebook.
The project evaluates clustering, classification, and regression methods based on accuracy, precision, recall, and other relevant metrics. The final performance is assessed before and after imputing missing values.
To reproduce the results:
- Install dependencies: `pip install numpy pandas scikit-learn matplotlib seaborn`
- Run the Jupyter Notebook: `jupyter notebook MachineLearning_hw.ipynb`