Below are non-exhaustive lists of key questions you should be able to answer on the referenced test. They do not cover everything you should know for the exam. Please refer to the slides for finer-grained detail on many of the concepts below.
You should be able to apply the key functional concepts inherent in these questions.
It is thus important to study these with the aim of understanding the key concepts, as opposed to simply being able to regurgitate a memorized response.
If you have a good understanding of the core concepts inherent in the questions on this page, you should do fine on any exam.
What Bonferroni’s principle is
Why intuition about data and patterns can be misleading
The relationship between a sample and a population
What induction and deduction are
What Simpson's Paradox is
What false positives and false negatives are, and their consequences
The difference between association/correlation and causation
What the most dangerous equation is and be able to give an example
What an independent identically distributed (iid) set of random variables is
What a Normal or Gaussian distribution is
What a Bernoulli, binomial or Poisson distribution is
What tendency and variation are
The difference between mean, median and mode
The various ways to assess dataset variance (SD, Z-score, quartiles, IQR)
What statistical power is
What covariance is and how it relates to correlation
What data dredging is
What causation is, and how it relates to correlation
What the PPDAC cycle is
What Bayes’ Theorem is
What independent and dependent variables are
What confounding and lurking factors are
What a counterfactual is
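As a worked illustration of Bayes' Theorem and of why false positives matter, here is a minimal Python sketch. The prevalence, sensitivity, and false positive rate are made-up numbers chosen for the example, and the function name posterior is hypothetical.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' Theorem."""
    # P(positive) = P(pos | condition) P(condition) + P(pos | no condition) P(no condition)
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence

# Hypothetical test: 1% prevalence, 99% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, sensitivity=0.99, false_positive_rate=0.05)
print(f"P(condition | positive) = {p:.3f}")
```

Even with a sensitive test, most positives here are false positives because the condition is rare, which is one reason intuition about data can mislead.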
Why it is important to study similarity
Identify uses of similarity finding functions
Have familiarity with Hamming distance, Euclidean distance, and some of the other similarity metrics
What Jaccard similarity is and be able to calculate it
How to k-shingle a text document
How to choose the optimal k size
What minhashing is and how it works
Why minhashing matters
Explain LSH at a high level and why it is super handy
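To make k-shingling and Jaccard similarity concrete, here is a minimal Python sketch. It assumes character-level shingles (word-level shingles are another common choice), and the helper names k_shingles and jaccard are hypothetical.

```python
def k_shingles(text, k):
    """Return the set of all contiguous k-character substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)

shingles_a = k_shingles("abcd", 2)  # {"ab", "bc", "cd"}
shingles_b = k_shingles("bcde", 2)  # {"bc", "cd", "de"}
similarity = jaccard(shingles_a, shingles_b)  # 2 shared / 4 total = 0.5
print(similarity)
```

Minhashing and LSH exist to estimate this same quantity without comparing every pair of shingle sets directly.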
Explain the four key types of data mining
Explain the various steps in the data mining process
Explain what CRISP-DM and KDD are and in which domains they are used
How to one-hot encode categorical variables
Explain the data cleaning steps, and how to execute them
Options for encoding categorical data
Some of the pitfalls of encoding poorly
Explain multicollinearity
Explain frequency encoding
Explain target encoding
Explain the difference between ratio and interval scaled attributes
The difference between discrete and continuous attributes
Explain normalization and the different approaches and when to use what
Explain what skewed data is and how to deal with it
Explain what time series data is and how to best encode it
Explain what geospatial data is
What feature engineering is
What the various metrics are for various machine learning approaches
What a pandas data frame is
Be able to explain some of the key NumPy and Pandas functions for data manipulation, cleaning, etc.
Approaches to dealing with outliers
Explain how the various machine learning metrics work and what they mean
Explain linear, polynomial and multi-variate regression and when to use each of them
Give a simple explanation of a Python lambda function
Explain the core steps in developing a model
What a loss function is and how to measure it for the purpose of optimization
Identify the common Linear Regression loss functions, and what they try to do
How to measure predictive performance of a model
The difference between precision and recall
What r-squared is
What gradient descent is
What a silhouette score is
What the "curse of dimensionality" is
What the bias-variance tradeoff is
How to join two dataframes
Have a general sense of which algorithms work for which problems (see the scikit-learn diagram)
Explain how each of the following algorithms/strategies/functions work at an operational level, know their key hyperparameters, and have a good sense of which algos work well for which types of modelling problems:
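For the difference between precision and recall, listed above, a small Python sketch may help. The labels are invented for illustration and the function name precision_recall is hypothetical. In practice a library such as scikit-learn would normally be used instead.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Precision: of everything predicted positive, how much really was positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything really positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3 and recall is 1/2.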
Naive Bayes
K-Nearest Neighbour
Agglomerative Clustering
K-means clustering
DBSCAN
Support Vector Machine
Decision tree
The Gini coefficient
Random Forest
Ensemble methods
How Logistic Regression is different from Linear Regression
What the overfitting/underfitting (bias-variance) trade-off is and how to pursue optimization
Explain how Logistic Regression works and why it is an important algo
Explain the sigmoid function and its use in machine learning
What ROC and AUC are and how they aid in assessing a classifier
How to interpret ROC and AUC
What regularization is and how it works
The two main types of regularization and their use cases
What an ensemble model is
What "boosting" is
What a "weak learner" is
What ensemble prediction is
Be able to name and identify use cases for several boosting algorithms (AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost)
Be able to identify and describe the effect of key boosting hyperparameters (learning_rate, iterations, tree_depth, subsampling, and complexity_penalty)
Explain how the Google PageRank algo works
What the user-item interaction matrix is
How content based filtering works, and what its strengths and weaknesses are
How collaborative filtering works, the two main approaches, and what its strengths and weaknesses are
What matrix factorization is, how it works, and what it is used for
What social and demographic filtering is
What contextual filtering is
What hybrid recommender systems are and when they are useful
What the concept of exploration is in recommendation systems
The key evaluation metrics in recommender systems
What dimensionality reduction is and why we do it sometimes
What the curse of dimensionality is
What a manifold is, in the context of dimensionality reduction, and be able to give an example
Identify the different types of dimensionality reduction, how each works, and what the use cases, strengths and limitations of each are
What t-SNE is and what to use it for
What binning is, and describe the three approaches we discussed
What Natural Language Processing is
Know who Noam Chomsky is, and his role in language theory
Be able to identify various use cases for NLP
Identify the key Python NLP libraries and their uses
Be able to define the key terms in NLP
Identify key challenges and limitations in NLP
Have familiarity with common text encoding standards
Know the key differences between the NLTK and spaCy libraries
What tokenization is and how to do it
Define Part of Speech tagging
Define chunking
What a dependency parse is
What Named Entity Recognition is
What Lemmatization is
What Stopwords removal is
What text normalization is
Explain the various types of vectorizers (Bag of Words, TF-IDF, etc.)
Explain how embedding vectorizers generally work
Explain the CBOW and Skip-gram approaches to Word2Vec
The three dimensions of sentiment analysis
What VADER is and how it works
The difference between shallow and deep learning
What a Neuron is and what its key components are
How a Neural Network functions, and which features give it its unique capacity to learn complex problems
The different types of activation functions and their general use cases
What a loss function is and its role in neural networks
What a vanishing gradient is
How backpropagation works, at a high level
Explain batching and dropout in the context of neural networks
Be able to identify the key hyperparameters of a neural network
What a feedforward, convolutional, and recurrent neural network are, at a high level
What a transformer is, and why it was a large step forward
The difference between an encoder and a decoder
Be able to identify implementations of each
What a large language model is, and how it works
What a co-occurrence matrix is
What an embedding is, and how it is derived
Why tokenization is important in LLMs
What the "attention" mechanism is
What positional encoding is, and why it is done
What emergent abilities are, and how they appear
What fine-tuning and RLHF are
How time series forecasting is different from other modeling
What data leakage is
Identify and describe the four components of a Time Series
What seasonality is
What the residual is
What decomposition is
The difference between additive and multiplicative decomposition and when to use each
What stationarity is and why it is required in time series forecasting
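As a small illustration of additive versus multiplicative decomposition, the Python sketch below builds a toy multiplicative series and shows that taking logs turns it into an additive one. The trend and seasonal values are invented for the example, and the residual component is left out for simplicity.

```python
import math

# Toy components (hypothetical values): a rising trend and a repeating seasonal factor.
trend = [10.0, 20.0, 30.0, 40.0]
seasonal = [0.5, 1.5, 0.5, 1.5]

# Multiplicative model: y = trend * seasonal, so seasonal swings grow with the trend.
multiplicative = [t * s for t, s in zip(trend, seasonal)]

# Taking logs converts the product into a sum: log(y) = log(trend) + log(seasonal).
log_series = [math.log(y) for y in multiplicative]
log_parts = [math.log(t) + math.log(s) for t, s in zip(trend, seasonal)]

print(multiplicative)
print(all(math.isclose(a, b) for a, b in zip(log_series, log_parts)))
```

This is why a log transform is a common first step when seasonal swings scale with the level of the series.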