This third Lab Assignment will give you practice in building two models: a regression model and a Naive Bayes classification model.
In this lab, you will act as a consultant for a telecommunications company. You have been asked to analyze a dataset containing customer demographics and account information.
Your assigned goal is to build two predictive models, each using no more than 5 features.
You must identify which features (at most 5) provide the highest predictive power for each of two distinct tasks:
Predicting a customer's monthly bill (using any form of regression).
Predicting whether a customer will leave the company (classification using Naive Bayes).
You can download the dataset in CSV format here.
The column names should be sufficiently explanatory, but for greater clarity please note the following:
tenure is expressed in months,
columns G through O reflect various services sold to customers,
the churn column is a label indicating whether the customer left or not.
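The column notes above suggest what a first cleanup pass might look like. The following is a minimal sketch, assuming pandas and a tiny stand-in frame: apart from tenure and churn, the column names (monthly_charges, internet_service) and all values are hypothetical and are not taken from the actual CSV. Median filling is only one of several reasonable ways to handle missing values.

```python
import pandas as pd

# Tiny stand-in for the real CSV; column names and values are hypothetical.
raw = pd.DataFrame({
    "tenure": [1, 34, None, 45],                          # months
    "internet_service": ["DSL", "Fiber", "No", "DSL"],    # one of the service columns
    "monthly_charges": ["29.85", "56.95", " ", "42.30"],  # numbers stored as text
    "churn": ["No", "No", "Yes", "No"],                   # label
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric types, no missing values, and encoded categories."""
    out = df.copy()
    # Coerce text to numbers; blank strings become NaN instead of raising.
    out["monthly_charges"] = pd.to_numeric(out["monthly_charges"], errors="coerce")
    # Fill missing numeric values with the column median.
    for col in ["tenure", "monthly_charges"]:
        out[col] = out[col].fillna(out[col].median())
    # Encode the label as 0/1 and one-hot encode the service column.
    out["churn"] = out["churn"].map({"No": 0, "Yes": 1})
    return pd.get_dummies(out, columns=["internet_service"], dtype=int)


cleaned = clean(raw)
print(cleaned.dtypes)
```

Keeping the cleanup in a single function makes it easy to reuse the same pre-processing for both tasks.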
Your approach must include the following steps:
Data cleanup and pre-processing (missing values, data encoding).
Feature selection (use no more than 5 features, excluding your target variable). Do not simply guess; use a systematic approach. For task 1, learn what model.coef_ does. Explain why you chose the 5 features you ended up using.
Use a train/test approach for both tasks. Use the same random_state to ensure repeatability and consistency.
Use a combination of different and appropriate metrics for each task, and try to improve the scores by trying different feature combinations. Explain how the results for the various metrics informed your final feature configuration and which metric was most appropriate for your problem.
For the Naive Bayes challenge, research the GaussianNB, MultinomialNB, and BernoulliNB variants, and how to specify the variant for your model. Summarize in your comments when to use each. Choose which one to use for your feature set, and explain your choice.
Write a brief summary and answer the following two questions:
Did the same 5 features work equally well for predicting cost and for predicting behaviour? Why or why not?
Which of the two models was easier to optimize, and why do you think this was the case?
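The feature selection, train/test, metric, and Naive Bayes steps above can be sketched together on synthetic data. This is an illustration under stated assumptions, not a solution: the feature names and the generated data are made up, and standardizing the inputs before reading coef_ is one possible choice for making coefficient magnitudes comparable. LinearRegression, GaussianNB, MultinomialNB, and BernoulliNB are scikit-learn classes; the Naive Bayes variant is specified simply by which class you instantiate.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import f1_score, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB
from sklearn.preprocessing import StandardScaler

SEED = 42  # one shared random_state for repeatability

# Synthetic stand-in data; all names and relationships are made up.
rng = np.random.default_rng(SEED)
n = 500
X = pd.DataFrame({
    "tenure": rng.integers(1, 72, n),
    "has_fiber": rng.integers(0, 2, n),
    "has_streaming": rng.integers(0, 2, n),
    "noise": rng.normal(size=n),
})
bill = 20 + 30 * X["has_fiber"] + 10 * X["has_streaming"] + rng.normal(0, 2, n)
churn = ((X["tenure"] < 12) & (X["has_fiber"] == 1)).astype(int)


def rank_by_coef(X_train: pd.DataFrame, y_train: pd.Series) -> pd.Series:
    """Rank features by |coef_| of a linear model fit on standardized inputs."""
    scaled = StandardScaler().fit_transform(X_train)
    model = LinearRegression().fit(scaled, y_train)
    ranking = pd.Series(np.abs(model.coef_), index=X_train.columns)
    return ranking.sort_values(ascending=False)


# Task 1: rank features on the training split only, then score on the test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, bill, test_size=0.2, random_state=SEED)
ranking = rank_by_coef(X_tr, y_tr)
top = ranking.index[:2].tolist()  # keep the two strongest features in this toy case
reg = LinearRegression().fit(X_tr[top], y_tr)
pred = reg.predict(X_te[top])
mae, r2 = mean_absolute_error(y_te, pred), r2_score(y_te, pred)
print(ranking)
print(f"MAE={mae:.2f}  R2={r2:.3f}")

# Task 2: the variant is chosen by picking the class.
NB_VARIANTS = {
    "gaussian": GaussianNB,        # continuous features
    "multinomial": MultinomialNB,  # non-negative counts
    "bernoulli": BernoulliNB,      # binary features
}
nb_features = ["tenure", "has_fiber", "has_streaming"]  # all non-negative
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X[nb_features], churn, test_size=0.2, random_state=SEED, stratify=churn
)
scores = {}
for name, cls in NB_VARIANTS.items():
    clf = cls().fit(Xc_tr, yc_tr)
    scores[name] = f1_score(yc_te, clf.predict(Xc_te), zero_division=0)
print(scores)
```

Ranking on the training split only keeps the test split untouched until the final evaluation. On an imbalanced label such as churn, F1 is reported here rather than accuracy alone, which is one reason to compare several metrics before settling on a feature set.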
Download your Jupyter Notebook, with your cleanup, analysis and approach, model building, and results, and send it to me as an email attachment with the subject line: CS 4403 - Lab Assignment 3.
Submit this no later than 23:59 on Tuesday, February 3, 2026.
There will be a prize for the best submission, taking into account:
The quality with which each of the 6 steps was completed.
The level to which your code was appropriately modularized.
The results obtained.