This project uses machine learning to classify mushrooms as edible or poisonous based on 22 categorical features describing physical characteristics. The dataset originates from the Audubon Society Field Guide to North American Mushrooms and was donated to the UCI Machine Learning Repository in 1987. Given the real-life consequences of misidentification, this problem is both challenging and impactful.
Objective:
Build a classification model to predict whether a mushroom is edible (e) or poisonous (p) based on observable traits like odor, gill size, cap shape, and spore print color.
- Samples: 8,124
- Features: 22 categorical features + 1 target label (`class`)
- Target Values:
  - `e` = Edible
  - `p` = Poisonous
Note: The original "unknown edibility" category was combined with poisonous to ensure safety.
- Source: https://www.kaggle.com/datasets/uciml/mushroom-classification
- Loaded and inspected the dataset using `pandas`
- Counted total rows and columns
- Identified that all features are categorical
- Noted missing values: `stalk-root` had ~30.5% missing, marked as `"?"`
- Verified target class (`class`) is nearly balanced
  - Edible: 51.8%
  - Poisonous: 48.2%
- Plotted distribution for transparency
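
The inspection steps above could be reproduced roughly as follows. This is a minimal sketch, not the notebook's actual code: the `summarize` helper and the toy frame are hypothetical, and with the real data you would pass `pd.read_csv("mushrooms.csv")` instead.

```python
import pandas as pd


def summarize(df: pd.DataFrame, target: str = "class", missing_token: str = "?") -> dict:
    """Collect the basic facts checked during inspection (hypothetical helper)."""
    # Share of cells equal to the missing-value marker, per column
    missing_share = (df == missing_token).mean()
    return {
        "shape": df.shape,
        # True when no column has a numeric dtype
        "all_categorical": all(
            not pd.api.types.is_numeric_dtype(dtype) for dtype in df.dtypes
        ),
        "missing_share": missing_share[missing_share > 0].to_dict(),
        "class_balance": df[target].value_counts(normalize=True).to_dict(),
    }


# Toy stand-in for the real dataset (values are made up for illustration)
toy = pd.DataFrame(
    {
        "class": ["e", "p", "e", "p"],
        "odor": ["n", "f", "a", "f"],
        "stalk-root": ["b", "?", "c", "b"],
    }
)
summary = summarize(toy)
print(summary)
```

Because missing values are stored as the string `"?"` rather than as `NaN`, `df.isna()` would not find them; comparing against the marker does.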
- Used `pd.crosstab` to find features highly correlated with the poisonous class
- Top predictors included:
  - `odor`
  - `gill-size`
  - `spore-print-color`
- Created bar plots and histograms to visualize feature separability
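
One way to use `pd.crosstab` for this kind of analysis is to compute, for each value of a feature, the share of samples that are poisonous. The sketch below assumes that approach; the helper name `poison_rate_by_value` and the toy data are hypothetical.

```python
import pandas as pd


def poison_rate_by_value(
    df: pd.DataFrame, feature: str, target: str = "class", poison_label: str = "p"
) -> pd.Series:
    """Share of poisonous samples for each value of `feature` (hypothetical helper)."""
    # normalize="index" turns each row of the table into proportions that sum to 1
    table = pd.crosstab(df[feature], df[target], normalize="index")
    return table[poison_label].sort_values(ascending=False)


# Toy data, made up for illustration
toy = pd.DataFrame(
    {
        "odor": ["n", "f", "a", "f", "n", "n"],
        "class": ["e", "p", "e", "p", "e", "p"],
    }
)
rates = poison_rate_by_value(toy, "odor")
print(rates)
```

A feature whose values have rates close to 0 or 1 separates the classes well, which is what the bar plots make visible.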
- Dropped `veil-type` (constant feature)
- Replaced missing `stalk-root` values with the column's mode
- One-hot encoded all categorical features using `sklearn.preprocessing.OneHotEncoder` with `drop='first'`
- Final shape: 8124 samples × 92 encoded features
- Split into:
  - 80% training
  - 10% validation
  - 10% testing
- Used stratified sampling to maintain class balance
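
A stratified 80/10/10 split can be obtained with two calls to `train_test_split`: first hold out 20%, then halve the held-out part. This is a sketch of that common pattern, not necessarily how the notebook does it; the helper name and the seed value `42` are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_80_10_10(X, y, seed: int = 42):
    """Stratified train/validation/test split (hypothetical helper)."""
    # Step 1: 80% train, 20% held out, preserving class proportions
    X_train, X_tmp, y_train, y_tmp = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Step 2: split the held-out 20% evenly into validation and test
    X_val, X_test, y_val, y_test = train_test_split(
        X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Toy data with roughly the class balance reported above (made up for illustration)
X = np.arange(100).reshape(-1, 1)
y = np.array(["e"] * 52 + ["p"] * 48)
(train_X, train_y), (val_X, val_y), (test_X, test_y) = split_80_10_10(X, y)
print(len(train_y), len(val_y), len(test_y))
```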
- Chose Random Forest Classifier:
  - 100 trees
  - Class weight = "balanced"
  - Random seed for reproducibility
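
The model settings listed above map onto `RandomForestClassifier` arguments as sketched below. The seed value `42` and the `build_model` helper are assumptions, and the fit uses synthetic numeric data from `make_classification` purely so the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def build_model(seed: int = 42) -> RandomForestClassifier:
    """Random forest with the settings described above (seed value is an assumption)."""
    return RandomForestClassifier(
        n_estimators=100,         # 100 trees
        class_weight="balanced",  # reweight classes inversely to their frequency
        random_state=seed,        # fixed seed for reproducibility
    )


# Synthetic stand-in data, not the mushroom dataset
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model_a = build_model().fit(X, y)
model_b = build_model().fit(X, y)

# With the same seed and data, both forests make identical predictions
same = (model_a.predict(X) == model_b.predict(X)).all()
print(len(model_a.estimators_), same)
```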
- Evaluated on both validation and test sets
- Reported:
  - Accuracy
  - Precision
  - Recall
  - F1-score
  - ROC AUC
- Confusion matrices visualized using `seaborn`
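
The reported metrics can be computed with `sklearn.metrics` as in this sketch. It assumes poisonous (`p`) is treated as the positive class, which the source does not state; the `evaluate` helper and the toy predictions are hypothetical.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate(y_true, y_pred, y_score, positive: str = "p") -> dict:
    """Compute the listed metrics, treating `positive` as class 1 (an assumption)."""
    t = (np.asarray(y_true) == positive).astype(int)
    p = (np.asarray(y_pred) == positive).astype(int)
    return {
        "accuracy": accuracy_score(t, p),
        "precision": precision_score(t, p),
        "recall": recall_score(t, p),
        "f1": f1_score(t, p),
        # ROC AUC needs a score for the positive class, not hard labels
        "roc_auc": roc_auc_score(t, y_score),
        # Rows are true classes, columns are predicted classes, ordered [0, 1]
        "confusion_matrix": confusion_matrix(t, p, labels=[0, 1]),
    }


# Toy labels and scores, made up for illustration
y_true = ["p", "p", "p", "e", "e", "e"]
y_pred = ["p", "p", "e", "e", "e", "p"]
y_score = [0.9, 0.8, 0.4, 0.2, 0.1, 0.6]
report = evaluate(y_true, y_pred, y_score)
print(report)
```

For a fitted classifier, `y_score` would typically come from `model.predict_proba(X)[:, 1]`, since scikit-learn orders classes alphabetically and `p` sorts after `e`. The confusion matrix array can be passed to `seaborn.heatmap(cm, annot=True, fmt="d")` for plotting. Given the safety framing of this project, recall on the poisonous class is the figure to watch, because a false negative means a poisonous mushroom labeled edible.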
- Extracted top 20 features by importance from the trained model
- Displayed using a horizontal bar chart
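
Extracting the top features by importance could be done as below. This is a sketch under assumptions: the `top_features` helper is hypothetical, and the model is fitted on synthetic data with placeholder feature names so the example is self-contained.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def top_features(model, feature_names, k: int = 20) -> pd.Series:
    """Return the k largest impurity-based importances (hypothetical helper)."""
    importances = pd.Series(model.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False).head(k)


# Synthetic stand-in data with placeholder feature names
X, y = make_classification(n_samples=200, n_features=30, random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

top = top_features(model, names)
print(top.head())
```

Assuming `matplotlib` is installed, `top.sort_values().plot.barh()` draws the horizontal bar chart with the most important feature at the top. With one-hot encoded inputs, each importance refers to a single encoded column (one feature value), not to the original feature as a whole.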
├── FinalProj.ipynb
├── mushrooms.csv
├── README.md