- This repository explores the use of a Histogram-based Gradient Boosting Classifier to predict the functional class of proteins based on their sequence-derived features.
- https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated?select=proteinas_test.csv
This repository addresses the task of classifying proteins into one of five functional classes (Estrutural, Receptora, Enzima, Transporte, Outras) using a tabular dataset. The task is formulated as a multi-class classification problem over various protein features. The work focused on making a single algorithm work effectively, as per the initial request. The chosen model, a Histogram-based Gradient Boosting Classifier, achieved a validation accuracy of approximately 20%, which is roughly chance level for five classes.
- Data:
- Type: Tabular data (CSV format)
- Input: Protein features such as molecular weight, isoelectric point, hydrophobicity, charge, amino acid proportions, and sequence length.
- Output: Multi-class target column ("Classe"), with five possible protein functional classes.
- Size:
- Train: 16,000 rows × 10 columns (including the "Classe" target)
- Test: 4,000 rows × 10 columns (including the "Classe" target for evaluation)
- Instances (Train, Test, Validation Split): The initial training data was split into:
- 60% training (9,600 samples)
- 20% validation (3,200 samples)
- 20% testing (3,200 samples)
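The 60/20/20 split above can be sketched with two calls to scikit-learn's `train_test_split`. This is a minimal sketch on synthetic stand-in data: the helper name `split_60_20_20`, the seed, the `feature` column, and the use of stratification are assumptions, not details confirmed by the repository.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(df, target="Classe", seed=42):
    """Split a DataFrame into 60% train, 20% validation, 20% test."""
    X = df.drop(columns=[target])
    y = df[target]
    # First hold out 40% of the rows, then halve that hold-out into
    # validation and test sets (20% each of the original data).
    # Stratifying on the target is an assumption made for this sketch.
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.4, stratify=y, random_state=seed
    )
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed
    )
    return X_train, X_val, X_test, y_train, y_val, y_test


# Synthetic stand-in with the same row count as the training CSV.
rng = np.random.default_rng(0)
classes = ["Estrutural", "Receptora", "Enzima", "Transporte", "Outras"]
demo = pd.DataFrame({
    "feature": rng.normal(size=16_000),
    "Classe": rng.choice(classes, size=16_000),
})
X_train, X_val, X_test, y_train, y_val, y_test = split_60_20_20(demo)
print(len(X_train), len(X_val), len(X_test))
```

Splitting in two stages keeps the validation and test sets the same size and makes the proportions easy to change later.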
- Missing Values: Checked for missing values in numerical features; none were found.
- Sequence Length Validation: Verified that the length of the protein sequence matched the provided sequence length feature.
- Amino Acid Composition: Engineered new numerical features based on the proportion of different types of amino acids in the protein sequence (Hydrophobic, Charged, Polar, Small, Aromatic, Proline Content).
- Unnecessary Columns: Removed 'ID_Proteína' and 'Sequência' columns after feature engineering.
- Feature Scaling: Applied `StandardScaler` to the numerical features to standardize their ranges.
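The preprocessing steps listed above might look roughly like the following. This is a sketch under stated assumptions: the residue groupings in `AA_GROUPS` are common textbook groupings rather than the repository's documented definitions, and the `Comprimento` column name and the three example sequences are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Residue groupings are assumptions; the repository's exact
# definitions are not documented in this README.
AA_GROUPS = {
    "Hydrophobic": set("AVILMFWC"),
    "Charged": set("DEKR"),
    "Polar": set("STNQYH"),
    "Small": set("AGS"),
    "Aromatic": set("FWY"),
    "Proline": set("P"),
}


def composition_features(sequence):
    """Return the fraction of residues that fall in each group."""
    n = len(sequence)
    if n == 0:
        return {name: 0.0 for name in AA_GROUPS}
    return {
        name: sum(aa in group for aa in sequence) / n
        for name, group in AA_GROUPS.items()
    }


# Tiny made-up example; 'Comprimento' is a hypothetical column name.
df = pd.DataFrame({
    "Sequência": ["ACDEFGHIKL", "PPGGAAWWYYFF", "MKTLLVAG"],
    "Comprimento": [10, 12, 8],
})

# Sequence length validation: the stored length should match the sequence.
assert (df["Sequência"].str.len() == df["Comprimento"]).all()

# Add one proportion column per residue group, then drop the raw sequence.
comp = pd.DataFrame(
    [composition_features(s) for s in df["Sequência"]], index=df.index
)
features = pd.concat([df.drop(columns=["Sequência"]), comp], axis=1)

# Standardize each numeric column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)
print(scaled.round(2))
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the validation and test splits with `scaler.transform`.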
- Bar Chart of Target Variable ('Classe'): A bar chart showing the distribution of protein functional classes in the dataset.
- Comparison of Features Across Classes: Histograms comparing the distributions of numerical features across different protein classes. The Kolmogorov-Smirnov (KS) test is used to quantify the difference between the distributions.
- Confusion Matrix (Test Set): A confusion matrix visualizing the model's performance on the test set, showing the distribution of predicted versus actual classes.
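The KS comparison mentioned for the per-class feature histograms can be sketched with `scipy.stats.ks_2samp`. The use of SciPy, the pairwise class-versus-class setup, the helper name `ks_between_classes`, and the `hydrophobicity` column name are all assumptions for illustration; the data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def ks_between_classes(df, feature, class_a, class_b, target="Classe"):
    """Two-sample KS test on one feature between two classes."""
    a = df.loc[df[target] == class_a, feature]
    b = df.loc[df[target] == class_b, feature]
    result = ks_2samp(a, b)
    return result.statistic, result.pvalue


# Synthetic stand-in: the feature is drawn from the same distribution for
# both classes, so the KS statistic is expected to be small.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "hydrophobicity": rng.normal(size=2_000),  # hypothetical column name
    "Classe": rng.choice(["Enzima", "Transporte"], size=2_000),
})
stat, p_value = ks_between_classes(demo, "hydrophobicity", "Enzima", "Transporte")
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

A KS statistic near 0 means the two class-conditional distributions are nearly indistinguishable for that feature, which would be consistent with a model that struggles to separate the classes.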
- Define:
- Input: Numerical features derived from protein sequences and physicochemical properties.
- Output: Prediction of the protein's functional class (one of five categories).
- Models:
- Histogram-based Gradient Boosting Classifier: Chosen for its efficiency and performance on tabular data.
- Describe the training:
- Trained using scikit-learn in a standard Python environment.
- Training time was minimal due to the dataset size and the efficiency of the algorithm.
- Training curves (loss vs epoch for test/train): Not explicitly tracked, as the `HistGradientBoostingClassifier` is not trained in epochs like neural networks. Performance was evaluated on the validation set after training.
- How did you decide to stop training: Training stopped when the `fit` method of the classifier completed. For more advanced usage, one might monitor performance on a validation set during training and use early stopping.
- Any difficulties? How did you resolve them? Initially encountered a `ValueError` due to non-numeric data. This was resolved by ensuring only numeric columns were used for training and prediction. Feature name mismatches between training and validation/test sets after selecting numeric columns were addressed by ensuring consistent column selection.
- Clearly define the key performance metric(s):
- Accuracy: Overall percentage of correctly classified proteins.
- Classification Report: Includes precision, recall, and F1-score for each class.
- Confusion Matrix: Shows the distribution of predicted vs. actual classes.
- Show/compare results in one table:
| Metric | Value (Validation Set) |
|---|---|
| Accuracy | ~0.20 |
| Macro Precision | ~0.20 |
| Macro Recall | ~0.20 |
| Macro F1-Score | ~0.20 |
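The metrics in the table can be computed with `sklearn.metrics`. The sketch below uses made-up labels and predictions purely to show the calls; it does not reproduce the reported values.

```python
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

labels = ["Estrutural", "Receptora", "Enzima", "Transporte", "Outras"]

# Made-up labels and predictions, only to demonstrate the API.
y_true = ["Enzima", "Outras", "Receptora", "Enzima", "Transporte", "Estrutural"]
y_pred = ["Enzima", "Enzima", "Receptora", "Outras", "Transporte", "Receptora"]

accuracy = accuracy_score(y_true, y_pred)

# Macro averaging gives every class equal weight regardless of its size.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, average="macro", zero_division=0
)

# Rows are actual classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)

print(f"accuracy={accuracy:.2f} macro_p={macro_p:.2f} "
      f"macro_r={macro_r:.2f} macro_f1={macro_f1:.2f}")
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))
print(cm)
```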
- The initial implementation of the Histogram-based Gradient Boosting Classifier achieved an accuracy of approximately 20% on the validation set. This performance is close to random guessing given the five classes, indicating that further work is needed to build a more effective model. The confusion matrix shows a distribution of predictions across all classes, with no single class being consistently well-predicted.
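To make the comparison with random guessing concrete, a uniform-guess baseline can be measured with scikit-learn's `DummyClassifier`. This is an illustrative sketch on synthetic labels; it is not part of the repository's reported analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
classes = ["Estrutural", "Receptora", "Enzima", "Transporte", "Outras"]
X = rng.normal(size=(5_000, 3))      # features are ignored by DummyClassifier
y = rng.choice(classes, size=5_000)  # synthetic, roughly balanced labels

# 'uniform' predicts each class with equal probability, i.e. pure guessing.
baseline = DummyClassifier(strategy="uniform", random_state=0)
baseline.fit(X[:4000], y[:4000])
baseline_acc = accuracy_score(y[4000:], baseline.predict(X[4000:]))

chance = 1 / len(classes)
print(f"chance level: {chance:.2f}, uniform-guess accuracy: {baseline_acc:.2f}")
```

With five balanced classes the chance level is 1/5 = 0.20, so a model scoring about 0.20 is not measurably better than this baseline.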
- Investigate more advanced feature engineering techniques, potentially exploring protein sequence embeddings or other biological features.
- Address the class imbalance, although it appeared relatively minor in the initial analysis, by using techniques like oversampling or undersampling.
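As one possible way to try the oversampling idea, minority classes can be randomly resampled up to the size of the largest class. This is a hedged sketch: the helper name `oversample_minority` and the toy data are hypothetical, and random oversampling is only one of several options.

```python
import pandas as pd
from sklearn.utils import resample


def oversample_minority(df, target="Classe", seed=42):
    """Randomly oversample every class up to the size of the largest class."""
    max_count = df[target].value_counts().max()
    parts = []
    for _, group in df.groupby(target):
        if len(group) < max_count:
            # Sample with replacement until the class reaches max_count rows.
            group = resample(
                group, replace=True, n_samples=max_count, random_state=seed
            )
        parts.append(group)
    # Shuffle so rows of the same class are not stored contiguously.
    return pd.concat(parts).sample(frac=1, random_state=seed).reset_index(drop=True)


# Toy imbalanced data for illustration only.
demo = pd.DataFrame({
    "feature": range(10),
    "Classe": ["Enzima"] * 6 + ["Outras"] * 3 + ["Receptora"] * 1,
})
balanced = oversample_minority(demo)
print(balanced["Classe"].value_counts())
```

Resampling should be applied to the training split only, so that duplicated rows do not leak into the validation or test sets.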
- To reproduce the results, follow the steps in a Python environment with the necessary libraries installed.
- This README.md: Provides an overview of the project.
- pandas
- numpy
- scikit-learn (for data splitting, scaling, model, and metrics)
- matplotlib (for visualization)
- The `proteinas_train.csv` and `proteinas_test.csv` files should be in the same directory as the Python scripts.
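Loading the two files might look like the sketch below. The helper name `load_datasets` and its `data_dir` parameter are hypothetical and not taken from the repository's scripts.

```python
from pathlib import Path

import pandas as pd


def load_datasets(data_dir="."):
    """Read the train and test CSVs from data_dir and return both frames."""
    data_dir = Path(data_dir)
    train = pd.read_csv(data_dir / "proteinas_train.csv")
    test = pd.read_csv(data_dir / "proteinas_test.csv")
    return train, test
```

Calling `load_datasets()` with no argument reads both files from the current working directory, which matches the layout described above.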
- Trained using scikit-learn.
- Training time was minimal.
- Performance evaluated on the validation set.
- Training stopped after the `fit` method completed.
- Difficulties with data type and feature name mismatches were resolved.
- Run the provided Python code to calculate and display metrics (accuracy, classification report, confusion matrix) on the validation set.
Gallo, H. (2023). Bioinformatics protein dataset simulated. Kaggle. https://www.kaggle.com/datasets/gallo33henrique/bioinformatics-protein-dataset-simulated/code



