This fourth Lab Assignment will give you practice in using many different models to build a classifier.
You've been hired as a data scientist for a wilderness survival mobile app.
Your mission: build the most accurate mushroom classifier to help foragers distinguish edible mushrooms from poisonous ones.
Dataset: here
Dictionary: here
(Please note that this is a modified version of a mushroom dataset you can find online. Do not use any dataset other than this one!)
Your Goal: Build a binary classifier (edible or poisonous) with the highest cross-validated Accuracy score.
Your final submission will be primarily judged along two dimensions:
Best Cross-Validated Accuracy
The highest 5-fold cross-validated Accuracy for any combination of features, expressed to two decimal places (for example, 87.54). The model should also have a comparable Recall score; track and report both.
Most Efficient Model
Build a model using as few features as possible, with as high a score as possible, so that the following function is optimized:
Efficiency Score = (CV Accuracy)² / (Number of Features Used)
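As a sketch, the efficiency score is straightforward to compute once you have a model's CV accuracy and feature count (the function name is illustrative; accuracy is taken as a fraction here, but a percentage works the same way since the ranking is unchanged for a fixed feature count):

```python
def efficiency_score(cv_accuracy: float, n_features: int) -> float:
    """Efficiency Score = (CV Accuracy)^2 / (number of features used)."""
    return cv_accuracy ** 2 / n_features

# Example: a model scoring 0.95 CV accuracy using 5 features
print(round(efficiency_score(0.95, 5), 4))  # → 0.1805
```

Squaring the accuracy rewards small gains in accuracy more than small reductions in feature count, so dropping a feature only pays off if accuracy barely moves.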
Load and Explore
Load the mushroom dataset and get familiar with it.
Check dataset shape and feature types.
Check for errors and missing values. Fix them if required. Document what you did.
Show the class distribution (edible vs poisonous).
For 3-5 features, create visualizations showing how they relate to edibility.
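A minimal sketch of these exploration steps. A tiny stand-in frame is used here because the real file isn't bundled with this handout; the file name and column names (`class`, `odor`, `cap-color`) are assumptions you should replace with the actual ones from the data dictionary:

```python
import pandas as pd

# In the real notebook: df = pd.read_csv("mushrooms.csv")
# Tiny stand-in frame so the same checks can be demonstrated here.
df = pd.DataFrame({
    "class":     ["e", "p", "e", "p", "e"],
    "odor":      ["n", "f", "a", "f", "n"],
    "cap-color": ["n", "w", "n", "g", "w"],
})

print(df.shape)            # (rows, columns)
print(df.dtypes)           # all object -> categorical features
print(df.isna().sum())     # missing values per column
print(df["class"].value_counts(normalize=True))   # class distribution

# One feature vs. edibility; row-normalized crosstabs make good
# inputs for bar charts or heatmaps in the visualization step.
print(pd.crosstab(df["odor"], df["class"], normalize="index"))
```

For the visualizations, plotting each crosstab as a stacked bar chart is one simple option.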
Encode Categorical Features
All features are categorical and need encoding. Do this, and justify your strategy.
Train-Test Split
Split the dataset 70/30 (i.e., not 80/20) using a fixed random seed, so your work is repeatable. Print the class distribution in the train and test sets and confirm the proportions are as expected.
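A sketch of the split using stand-in data; `stratify=y` is one way to keep the class proportions matched between the train and test sets, and `random_state` pins the seed:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in data; in the notebook X/y come from the encoded mushroom frame.
X = pd.DataFrame({"f1": range(100)})
y = pd.Series(["e"] * 60 + ["p"] * 40)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# Both distributions should show the same 60/40 class split.
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```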
Train at least 6 different algorithms. Feel free to search for options and use them. You may use an ensemble-based approach, but no neural-network-based algorithms, please.
For EACH algorithm:
Use GridSearchCV with 5-fold cross-validation to find the best hyperparameters.
Fit the resulting best model on the training set and evaluate it on the test set.
Record: best parameters, CV accuracy (mean ± std), training time, and prediction time.
Record full performance metrics for your best performing configurations.
Create a comprehensive comparison table that summarizes all your results (you can automate this!)
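The per-algorithm loop above can be sketched as follows. Toy data and two candidate models stand in for the real encoded features and your six-plus algorithms; extend the `candidates` dict with more estimators and grids:

```python
import time
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data stands in for the encoded mushroom features.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Two of the (at least six) candidate algorithms, each with a small grid.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "tree":   (DecisionTreeClassifier(random_state=42),
               {"max_depth": [3, 5, None]}),
}

rows = []
for name, (est, grid) in candidates.items():
    gs = GridSearchCV(est, grid, cv=5, scoring="accuracy")

    t0 = time.perf_counter()
    gs.fit(X_tr, y_tr)                      # grid search + refit best model
    fit_time = time.perf_counter() - t0

    t0 = time.perf_counter()
    y_pred = gs.predict(X_te)
    pred_time = time.perf_counter() - t0

    i = gs.best_index_
    rows.append({
        "model":       name,
        "best_params": gs.best_params_,
        "cv_acc_mean": gs.cv_results_["mean_test_score"][i],
        "cv_acc_std":  gs.cv_results_["std_test_score"][i],
        "test_acc":    accuracy_score(y_te, y_pred),
        "test_recall": recall_score(y_te, y_pred),
        "fit_s":       fit_time,
        "pred_s":      pred_time,
    })

# The comparison table, built automatically from the recorded rows.
results = pd.DataFrame(rows).sort_values("cv_acc_mean", ascending=False)
print(results.to_string(index=False))
```

Because the loop only touches the `candidates` dict, adding an algorithm is a one-line change and the comparison table rebuilds itself.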
Organize your work neatly in a Jupyter Notebook. Load the dataset in a way that will still work after you email the Jupyter notebook to me (for example, read it from a URL in code rather than from a local file path on your machine).
Use clear sections and insert comments or markdown to explain key decisions. Format your output. Modularize where possible.
At the bottom of the Notebook clearly identify your final results for the purpose of the competition.
Once done, download your Jupyter Notebook, with all your above work and results, and send it to me as an attachment to an email with the subject line: CS 4403 - Lab Assignment 4.
Submit this no later than 23:59 on Friday 20 February, 2026.
There will be a prize for the best submission, taking into account:
Primarily, the scores achieved with respect to the two dimensions mentioned above.
Secondarily (i.e., where scores are very close), the quality with which each of the steps was completed and the extent to which your code was appropriately modularized.