CS 4403 - Lab Assignment 2

This second Lab Assignment will give you some practice exploring and cleaning a dataset.

The Task

Part A

A few years ago, parents in New York City were asked to report their kids’ scores on the gifted and talented exam, as well as their school priority rankings.

The questionnaire was not very well designed. As a result, there are inconsistencies in reporting, skipped questions, non-standard formats, and so on.

The relatively small dataset is here. Load this file into a pandas dataframe in a Jupyter Notebook. Explore the dataset and identify any potential data entry errors, outliers, etc.
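As a rough starting point, a first pass at loading and exploring might look like the sketch below. The column names and rows are invented for illustration (the real file will differ), and the data is read from an in-memory string so that the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: these column names and values are
# made up for illustration and will not match the actual dataset.
raw_csv = io.StringIO(
    'Entering Grade,Birth Month,Verbal Score,School Preferences\n'
    '1,September,72%,"School A, School B"\n'
    'K,August,28/30,\n'
    'k,march,,School C\n'
    '2,Feb,97,School A\n'
)

# With the real file, pass its path to pd.read_csv instead of raw_csv.
df = pd.read_csv(raw_csv)

# Column types: columns that should be numeric but show up as text are a
# sign of inconsistent formats.
print(df.dtypes)

# Count skipped answers per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# List the distinct spellings in a free-text column to spot inconsistencies
# such as "K" versus "k".
grade_variants = df["Entering Grade"].value_counts(dropna=False)
print(grade_variants)
```

Once the columns are numeric, `df.describe()` is a quick way to look for outliers such as scores outside the valid range.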

Your task is then to programmatically organize and clean up this dataset so that all the data is properly represented in some sort of numerical format.
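One possible way to approach the cleanup is sketched below, on a small made-up frame. The column names (`grade`, `birth_month`, `score`), the helper functions, and the rule that a bare number is a percentage are all assumptions for illustration, not properties of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical messy answers; the real columns and formats will differ.
df = pd.DataFrame({
    "grade": ["1", "K", "k", " 2 ", None],
    "birth_month": ["September", "aug", "3", "Feb", "unknown"],
    "score": ["72%", "28/30", "", "97", "n/a"],
})

# Grade: trim whitespace, upper-case, treat kindergarten as 0, then coerce.
# Anything that still is not a number becomes NaN instead of raising.
grade = df["grade"].str.strip().str.upper()
grade = grade.mask(grade == "K", "0")
df["grade_num"] = pd.to_numeric(grade, errors="coerce")

MONTHS = {"jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
          "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12}


def month_to_number(value):
    """Map a month name, abbreviation, or digit string to 1..12, else NaN."""
    if pd.isna(value):
        return np.nan
    text = str(value).strip().lower()
    if text.isdigit():
        number = int(text)
        return number if 1 <= number <= 12 else np.nan
    return MONTHS.get(text[:3], np.nan)


def score_to_percent(value):
    """Parse '72%', '28/30', or '97' into a percentage, else NaN.

    Risk: a bare number is assumed to already be a percentage, which may be
    wrong for answers that were actually raw scores.
    """
    if pd.isna(value):
        return np.nan
    text = str(value).strip().rstrip("%")
    try:
        if "/" in text:
            numerator, _, denominator = text.partition("/")
            return 100 * float(numerator) / float(denominator)
        return float(text)
    except (ValueError, ZeroDivisionError):
        return np.nan


df["birth_month_num"] = df["birth_month"].map(month_to_number)
df["score_pct"] = df["score"].map(score_to_percent)
print(df[["grade_num", "birth_month_num", "score_pct"]])
```

Coercing unparseable answers to NaN keeps the rows but loses whatever the parent meant to say, so it is worth reporting how many values were lost in each column.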

As you go, please explain what you are doing and why. If your actions raise certain issues or risks, mention these as well.

Part B

Now that you have organized the dataset, explore it a bit by comparing correlations between the various features. Learn a bit about the very useful pandas corr() function here. You can visualize correlations with the seaborn package. Example here.
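For orientation, a minimal sketch of computing and plotting correlations follows. The cleaned frame and its column names are made up, and `plot_corr_heatmap` is a hypothetical helper name. It assumes seaborn and matplotlib are installed.

```python
import pandas as pd

# Hypothetical cleaned data; values are invented for illustration only.
clean = pd.DataFrame({
    "grade_num": [0, 0, 1, 1, 2, 2],
    "birth_month_num": [1, 6, 3, 11, 8, 12],
    "verbal_pct": [70.0, 85.0, 90.0, 60.0, 95.0, 88.0],
    "overall_pct": [72.0, 88.0, 91.0, 65.0, 97.0, 85.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = clean.corr(method="pearson")
print(corr.round(2))


def plot_corr_heatmap(corr_matrix):
    """Draw an annotated heatmap of a correlation matrix (hypothetical helper)."""
    # Imported here so the correlation step above works without plotting libs.
    import matplotlib.pyplot as plt
    import seaborn as sns

    ax = sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_title("Pairwise Pearson correlation")
    plt.show()
    return ax
```

In a notebook, calling `plot_corr_heatmap(corr)` would display the figure. Fixing `vmin` and `vmax` at -1 and 1 keeps the color scale comparable across plots.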

Try to identify some potential patterns you might be able to model with this dataset (or one with more data). Identify at least two potential patterns, and explain your rationale.

I will give 1 bonus point for an effort to implement a basic model, irrespective of its performance. I am fully aware that I have only shown you a couple of models so far and that, given the nature of the dataset and the type of information in it, there likely is not a whole lot to go on here. Explore it, if you like, for the sake of learning the process of model building.
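As one illustration of the process, the sketch below fits an ordinary least squares line with NumPy. The choice of model, the variable names, and the numbers are all assumptions made for the example. They are not taken from the dataset or from the models covered in class.

```python
import numpy as np

# Hypothetical cleaned columns; values are invented for illustration only.
verbal = np.array([60.0, 70.0, 85.0, 88.0, 90.0, 95.0])
overall = np.array([65.0, 72.0, 88.0, 85.0, 91.0, 97.0])

# Design matrix with a column of ones so the fit includes an intercept.
X = np.column_stack([np.ones_like(verbal), verbal])

# Ordinary least squares: minimize the squared error of X @ coef - overall.
coef, *_ = np.linalg.lstsq(X, overall, rcond=None)
intercept, slope = coef

# In-sample fit quality (R squared). With this little data it only shows the
# mechanics; a held-out split would be needed to judge real performance.
pred = X @ coef
ss_res = np.sum((overall - pred) ** 2)
ss_tot = np.sum((overall - overall.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"overall = {intercept:.2f} + {slope:.2f} * verbal, R^2 = {r_squared:.3f}")
```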
