The test checklist for exam 1 is now available. Reminder that the exam will be held in class on Tuesday, February 10.
Please also note that there will not be a lab this week, but I will be available in the lab during the lab time to assist anyone with exam-related questions.
Lab Assignment 3 is now available. It is due next Tuesday.
Lab Assignment 1 has now been marked and you should all have received an email with your grade. I generally gave students full marks if their logic worked, irrespective of how efficient and modular the code was. Many students coded the assignment by copying and pasting the approach for one part of the process (e.g., 2-grams) and changing a few values to get it to work for the next part. This is very poor coding practice. You should aim to modularize your code as much as possible. Here is a notebook explaining how to do this, with some example code. Starting with Lab Assignment 3, I will consider the quality of your code (i.e., whether you have appropriately modularized it) when assessing your grade.
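As a rough illustration of what modularizing can look like (this is a minimal sketch and not the code from the notebook mentioned above), the shingle size can be made a parameter of one function instead of being hard-coded in several pasted copies. The function names `shingles` and `jaccard`, the sample sentences, and the choice of Jaccard similarity as the comparison are all assumptions made for this example.

```python
from typing import Hashable, Sequence


def shingles(tokens: Sequence[Hashable], k: int) -> set[tuple]:
    """Return the set of k-shingles (contiguous runs of k tokens).

    `tokens` can be a list of words or a string of characters.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # If k is larger than the input, the range is empty and so is the result.
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical sample documents, tokenized into words.
doc_a = "the quick brown fox".split()
doc_b = "the quick red fox".split()

# One loop replaces three pasted copies of the same logic.
for k in (1, 2, 3):
    similarity = jaccard(shingles(doc_a, k), shingles(doc_b, k))
    print(f"k={k}: similarity={similarity:.3f}")
```

Switching from 2-grams to 3-grams is then a change to one argument, and a bug fix in `shingles` applies to every shingle size at once.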
Please note that I have clarified Lab Assignment 1 with respect to the kind of shingling you need to do. Please carefully read the instructions before starting.
I have to go to a funeral on the morning of Thursday, January 8. The morning class is thus cancelled.
However, since there will be no lab on January 8, we will instead have our regular class meeting in the lab space (Irving 102) at 13:30.
Please read this page with details applicable to all my courses.
Data mining, or knowledge discovery, is an interdisciplinary area of computer science with the goal of extracting new knowledge and insights from big and complex data sets.
This course begins with an overview of the data mining landscape and an introduction to similarity. It then introduces the knowledge discovery workflow, including data processing, cleaning, integration, and transformation, and moves on to pattern recognition methodologies using various statistical and machine learning techniques. Additional theoretical dimensions will be introduced as we work our way through a number of different kinds of data mining problems.
(These are at a high level, and put together with Bloom's Taxonomy in mind):
Understand the core statistical and logical principles which guide knowledge discovery
Describe and apply various approaches to identifying patterns and similarity
Describe and apply the data mining process and demonstrate proficiency in preprocessing and visualizing data
Explain and apply the various data mining algorithms used for classification, clustering, and regression
Use Python to implement data mining algorithms on real-world datasets
Evaluate the performance of data mining models
Explore applications of data mining in various domains
Communicate data mining findings and insights effectively
CS or INFO 1103, CS 2704, and (STAT 2593 or STAT 2793).
Lectures will be in person by default, but we may hold an online lecture on occasion or when warranted.
Class meetings will be on Tuesday and Thursday from 11:00 to 12:20, in Ganong Hall 313.
A number of labs will be scheduled to take place on certain Thursdays from 13:30 to 14:20 in KC Irving Hall 102.
Please consult the CS 4403 Course Schedule page for details of the various meeting and deliverable deadline dates for the course. This page will be updated as we go along.
There is no textbook to buy for this course.
For the material to be covered during our meetings, I will primarily refer to various portions from the following free online resources:
"MMD": Mining Massive Datasets (Jure Leskovec, Anand Rajaraman, Jeff Ullman), Stanford U.
"FDS": Foundations of Data Science (Avrim Blum, John Hopcroft, and Ravindran Kannan), Cornell U.
"DMML": Data Mining and Machine Learning: Fundamental Concepts and Algorithms (Mohammed J. Zaki, Wagner Meira, Jr.), Cambridge U.
"ISL": Introduction to Statistical Learning by Gareth James et al. Freely available online.
For some select topics I may also provide extracts from other books or good online resources. Links to these will be placed on the Course Schedule page.
Other good resources are:
The Art of Statistics by David Spiegelhalter. A highly readable and contextualized examination of how to do statistics right (and wrong). Has several chapters dedicated to the key statistical aspects of robust machine learning.
The Hundred Page Machine Learning Book by Andriy Burkov. A great algorithm focused reference resource for machine learning. I have a printed copy which I am happy to lend out.
You may also find Chapter 1 of the International Handbook of AI Law useful as a readable layman's overview of what we generally cover in this course.
Please use Python 3.x for the assignments/labs and project.
For most of the in-class demos and labs, I will use Jupyter Notebooks, a great free platform for editing and running Python. I recommend that you use it too.
There are a number of ways to implement/access Jupyter Notebooks. I recommend you use one of the following:
Google Colab (cloud-based). Easy to use out of the box. You will need a Google account.
PyCharm (available for Win, MacOS and Linux). Has a somewhat steeper learning curve, but with rewards (the link is to the JetBrains free student licence, which is well worth it). If you just want the IDE, click here.
Anaconda Navigator (available for Win, MacOS and Linux). It includes everything you will need to run Jupyter Notebooks on your own machine.
For notebook storage, you can use GitHub or, if you are using Colab, Google Drive.
Engagement and Participation: 10 points
Lab-based assignments (5): 15 points
In-class tests (2): 45 points
Term project: 30 points
Learning machine learning effectively depends on active engagement with the material, your peers, and the learning process itself. The Engagement and Participation component of the course grade is designed to reward consistent, meaningful involvement that enhances both your own understanding and the learning environment for others.
It is a metric for your demonstrated curiosity, critical thinking, collaboration, and professional growth.
Doing the following will help ensure you get a good score for this:
Ask thoughtful questions that show engagement with the readings, lectures, or examples.
Offer ideas, insights, or alternative perspectives during class discussions.
Help clarify concepts to other students.
Come to class prepared and demonstrate familiarity with prior material.
Complete all deliverables on time.
Note that there is no final exam.
The term project report will be due the first day of the week before the last day of classes, and you will be required to prepare and give a 5-minute presentation about your project at a time to be scheduled near the end of the term.
As noted above, each lab will have an accompanying assignment, which will usually be due a day or two after the lab. Details of each lab will be provided a few days before the lab's date.
Please consult the CS 4403 Course Schedule page for the due dates of assignments. Assignments must be submitted by email to the instructor, no later than 11:59 p.m. on the day they are due. Please ensure you use a descriptive subject line for the email (for example: CS 4403 Assignment 1).
All assignments must be done individually.
There will be two in-class tests. Details of how the tests are put together will be released as we get a little further along in the course. I will provide a checklist of things to know for each test.
Details and requirements are set out on the Term Project page.