Machine Learning Homework 2024

Author: Arianna Rigamonti

Description

This project focuses on predicting taxonomic identity and genetic composition based on codon usage bias levels. The dataset, sourced from Khomtchouk, Bohdan B. “Codon usage bias levels predict taxonomic identity and genetic composition” (bioRxiv, 2020), includes codon usage frequencies from multiple organisms.

The goal is to classify organisms based on codon usage and to recover a missing feature (the AGA codon frequency in the test set) using machine learning techniques.

Dataset

The dataset covers 13,028 organisms and supports two classification tasks:

Kingdom classification – Classifying organisms into 11 taxonomic groups.
DNA type classification – Identifying DNA types across 11 categories.

Features:

67 attributes per sample, including codon frequencies.
Training set: 10,422 samples (train.csv).
Test set: 2,606 samples (test.csv, missing AGA codon frequency).
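
As a quick sanity check of the train/test split described above, the following minimal sketch lists the columns that are available in the training data but not in the test data. The toy frames and the column names (UUU, AGA, Kingdom) are assumptions for illustration; the real files would be loaded with pd.read_csv("train.csv") and pd.read_csv("test.csv"). The sketch also does not assume how the AGA feature is missing, so it treats a column as unavailable if it is either absent or entirely NaN.

```python
import pandas as pd


def unavailable_columns(train: pd.DataFrame, test: pd.DataFrame) -> list:
    """Return train columns that are absent from test or entirely NaN in test."""
    missing = []
    for column in train.columns:
        if column not in test.columns or test[column].isna().all():
            missing.append(column)
    return missing


# Toy frames standing in for train.csv and test.csv (hypothetical values).
train = pd.DataFrame(
    {"UUU": [0.02, 0.01], "AGA": [0.005, 0.012], "Kingdom": ["bct", "vrl"]}
)
test = pd.DataFrame({"UUU": [0.03], "Kingdom": ["pln"]})

missing = unavailable_columns(train, test)
print(missing)
```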

Tasks

Data Analysis & Clustering:
- Visualizing data distribution.
- Identifying feature correlations.
- Evaluating clustering methods.
Classification:
- Feature selection.
- Model comparison (e.g., decision trees, SVMs, neural networks).
- Testing on the provided test set.
Regression for Missing Data:
- Training a regressor to predict missing AGA codon frequency.
- Comparing regression algorithms.
Evaluating the Impact of Missing Data Recovery:
- Retraining the classification model after imputing the missing AGA codon values.
- Assessing performance improvements.
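
The regression and re-evaluation tasks above can be sketched roughly as follows. This is not the notebook's implementation: it uses synthetic data in place of train.csv and test.csv, and the choice of candidate regressors (LinearRegression, Ridge) and classifier (DecisionTreeClassifier) is an assumption made only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the codon table: 8 "other" codon features,
# an "AGA" column that depends on them, and a binary label.
n_samples, n_codons = 600, 8
X_other = rng.random((n_samples, n_codons))
weights = rng.normal(size=n_codons)
aga = X_other @ weights + rng.normal(scale=0.05, size=n_samples)
signal = aga + X_other[:, 0]
y = (signal > np.median(signal)).astype(int)

X_tr, X_te, aga_tr, aga_te, y_tr, y_te = train_test_split(
    X_other, aga, y, test_size=0.2, random_state=0
)

# Step 1: compare regressors on the training set with 5-fold cross-validated R^2.
candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
cv_r2 = {
    name: cross_val_score(model, X_tr, aga_tr, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}
best_name = max(cv_r2, key=cv_r2.get)
regressor = candidates[best_name].fit(X_tr, aga_tr)

# Step 2: impute the AGA column for the test set, where it is treated as unknown.
aga_imputed = regressor.predict(X_te)


# Step 3: compare a classifier trained without AGA against one trained with
# the true AGA and evaluated on the imputed AGA.
def fit_and_score(train_features, test_features):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(train_features, y_tr)
    return accuracy_score(y_te, clf.predict(test_features))


acc_without = fit_and_score(X_tr, X_te)
acc_with = fit_and_score(
    np.column_stack([X_tr, aga_tr]), np.column_stack([X_te, aga_imputed])
)
print(f"best regressor: {best_name} (CV R^2 = {cv_r2[best_name]:.3f})")
print(f"accuracy without AGA: {acc_without:.3f}, with imputed AGA: {acc_with:.3f}")
```

Whether the imputed feature helps depends on how well the regressor recovers it and on how much the classifier relies on it, so the two accuracies should be compared rather than assumed to improve.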

Implementation

Language: Python
Libraries Used: numpy, pandas, scikit-learn, matplotlib, seaborn
Notebook: The analysis and models are implemented in a Jupyter Notebook.

Results

The project evaluates clustering, classification, and regression methods using metrics suited to each task, including accuracy, precision, and recall for classification. Classification performance is assessed both before and after imputing the missing AGA values.
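
For the classification metrics named above, a minimal sketch with scikit-learn might look like this. The labels are made-up examples, and macro averaging is an assumption; the notebook may average differently or report per-class values.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def summarize_classification(y_true, y_pred):
    """Return accuracy plus macro-averaged precision and recall."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision_macro": precision_score(
            y_true, y_pred, average="macro", zero_division=0
        ),
        "recall_macro": recall_score(
            y_true, y_pred, average="macro", zero_division=0
        ),
    }


# Hypothetical true and predicted kingdom labels for six organisms.
y_true = ["bct", "bct", "vrl", "pln", "pln", "vrl"]
y_pred = ["bct", "vrl", "vrl", "pln", "bct", "vrl"]

summary = summarize_classification(y_true, y_pred)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Macro averaging weights every class equally, which can be informative when the classes are imbalanced; weighted averaging is the usual alternative when overall sample-level performance matters more.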

Usage

To reproduce the results:

Install dependencies:

pip install numpy pandas scikit-learn matplotlib seaborn

Run the Jupyter Notebook:

jupyter notebook MachineLearning_hw.ipynb

