CS 4403 - Test Checklists

Below are non-exhaustive lists of key questions you should be able to answer on the referenced test.

You should be able to apply the key functional concepts inherent in these questions.

It is thus important to study these with the aim of understanding the key concepts, as opposed to simply being able to regurgitate a memorized response.

If you have a good understanding of the core concepts inherent in the questions on this page, you should do fine.

Test 1

What Bonferroni’s principle is

Why intuition about data and patterns can be misleading

The relationship between a sample and a population

What induction and deduction are

What Simpson's Paradox is

What false positives and false negatives are, and their consequences

The difference between association/correlation and causation

What the "most dangerous equation" is, and be able to give an example

What an independent and identically distributed (iid) set of random variables is

What a Normal or Gaussian distribution is

What a Bernoulli, binomial or Poisson distribution is

What tendency and variation are

The difference between mean, median and mode

The various ways to assess dataset variance (SD, Z-score, quartiles, IQR)

What statistical power is

What covariance is and how it relates to correlation

What data dredging is

What causation is, and how it relates to correlation

What the PPDAC cycle is

What Bayes’ Theorem is

What independent and dependent variables are

What confounding and lurking factors are

What a counterfactual is

Why it is important to study similarity

Have familiarity with Hamming distance, Euclidean distance, and some of the other similarity metrics

Know, specifically, what Jaccard similarity is and be able to calculate it

How to k-shingle a text document

How to choose an appropriate value of k

What minhashing is and how it works

Why minhashing matters

Explain LSH at a high level and why it is super handy

Explain the various steps in the data mining process

Explain what CRISP-DM and KDD are

Explain the four key types of data mining

How to one-hot encode categorical variables

Explain the data cleaning steps, and how to execute them

Options for encoding categorical data

Some of the pitfalls of encoding poorly

Explain multicollinearity

Explain frequency encoding

Explain the difference between ratio and interval scaled attributes

The difference between discrete and continuous attributes

Explain normalization and the different approaches

What feature engineering is

What the various metrics are for various machine learning approaches

What a pandas data frame is

Be able to explain some of the key NumPy and Pandas functions for data manipulation, cleaning, etc.

Approaches to dealing with outliers

Explain how the various machine learning metrics work and what they mean

Explain linear, polynomial and multi-variate regression and when to use each of them

Give a simple explanation of a Python lambda function

Explain the core steps in developing a model

What a loss function is

How to measure performance of a model

The difference between precision and recall

What r-squared is

Have a general sense of which algorithms work for which problems (see the scikit-learn diagram)

Explain how each of the following algorithms, strategies and functions works at an operational level, and have a good sense of which algorithms work well for which types of modelling problems:

Naive Bayes
K-Nearest Neighbour
Agglomerative Clustering
Gradient Descent
K-means clustering
Support Vector Machine
Decision tree
The Gini coefficient
Cross validation
Grid search
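
As a study aid for the k-shingling and Jaccard similarity items above, here is a minimal sketch in Python. It assumes character-level shingles and plain Python sets. The function names `k_shingles` and `jaccard` are illustrative only and do not come from the course material.

```python
def k_shingles(text, k):
    """Return the set of all length-k character substrings (shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |A intersect B| / |A union B|."""
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


s1 = k_shingles("abcd", 2)  # {"ab", "bc", "cd"}
s2 = k_shingles("bcde", 2)  # {"bc", "cd", "de"}
print(jaccard(s1, s2))      # 2 shared shingles / 4 distinct shingles = 0.5
```

Word-level shingles follow the same pattern: split the text into tokens first, then slide the window over the token list instead of over characters.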

Test 2

What an ensemble model is

How the several approaches we covered (voting, bagging, boosting, stacking) work and their use cases

What a "weak learner" is

What MAML is

What the MOE approach is

Be able to identify and give examples of various boosting algorithms and their use cases

Identify the computational dimensions of some boosting approaches

What the overfitting/underfitting (bias-variance) trade-off is and how to optimize for it

Explain how Logistic Regression works and why it is an important algo

Explain the sigmoid function and its use in machine learning

Identify and explain the various options for assessing model performance

What regularization is and the difference between L1 and L2 approaches

How DBSCAN works and for which use cases it is a good fit

How Recommender Systems work, including the various strategies (collaborative filtering, content-based filtering, social and demographic filtering, contextual filtering) and their specific strengths and weaknesses

What dimensionality reduction is and why we do it sometimes

How we can do dimensionality reduction by way of feature selection

The difference between the PCA, LDA and UMAP approaches

What t-SNE is and what to use it for

What Natural Language Processing is

Know who Noam Chomsky is, and his role in language theory

Be able to identify various use cases for NLP

Identify the key Python NLP libraries and their uses

Be able to define the key terms in NLP

Identify key challenges and limitations in NLP

Have familiarity with common text encoding standards

Know the key differences between the NLTK and spaCy libraries

What tokenization is and how to do it

Define Part of Speech tagging

Define chunking

What a dependency parse is

What Named Entity Recognition is

What Lemmatization is

What Stopwords removal is

What text normalization is

Explain the various types of vectorizers (Bag of Words, TF-IDF, etc.)

Explain how embedding vectorizers generally work

Explain the CBOW and Skip-gram approaches to Word2Vec

How the OECD defines AI

The two general streams of AI

The difference between shallow and deep learning

What a Neuron is and what its key components are

How a Neural Network functions, and which features give it its unique capacity to learn complex problems

How generative AI works at a high level

How diffusion models work, at a high level

What a large language model is, and how it works

What the "attention" mechanism is

What RLHF and "Chain of Thought" are, and how they augment LLM functionality

Be able to identify challenges and pitfalls of deep learning models

What the "human evaluation bias" is

What an activation function is, and be able to describe different approaches and their use cases

Explain forward and back propagation at a high level

Explain the importance of understanding the role of potential bias in model building

Be able to identify the key types of bias which may creep into a model

Explain what "explainability" is, and be able to generally describe the various technical approaches to addressing this issue
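
As a study aid for the sigmoid and Logistic Regression items above, here is a minimal sketch in Python. It assumes a single input feature. The weight `w`, the bias `b` and the helper name `predict_proba` are made up for illustration and are not fitted to any data.

```python
import math


def sigmoid(z):
    """Map any real number to the interval (0, 1), avoiding overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, w, b):
    """Logistic regression with one feature: probability of the positive class."""
    return sigmoid(w * x + b)


# Hypothetical parameters, chosen only to show the shape of the computation.
w, b = 2.0, -1.0
p = predict_proba(0.5, w, b)  # w * x + b = 0, so p = sigmoid(0) = 0.5
print(p)
```

A probability of 0.5 corresponds to the decision boundary. Inputs with `w * x + b > 0` map above 0.5 and those with `w * x + b < 0` map below it.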
