Below are non-exhaustive lists of key questions you should be able to answer on the referenced test.
You should be able to apply the key functional concepts inherent in these questions.
It is thus important to study these with the aim of understanding the key concepts, as opposed to simply being able to regurgitate a memorized response.
If you have a good understanding of the core concepts inherent in the questions on this page, you should do fine.
What Bonferroni’s principle is
Why intuition about data and patterns can be misleading
The relationship between a sample and a population
What induction and deduction are
What Simpson's Paradox is
What false positives and false negatives are, and their consequences
The difference between association/correlation and causation
What the most dangerous equation is and be able to give an example
What an independent and identically distributed (i.i.d.) set of random variables is
What a Normal or Gaussian distribution is
What a Bernoulli, binomial or Poisson distribution is
What tendency and variation are
The difference between mean, median and mode
The various ways to assess the spread of a dataset (SD, Z-score, quartiles, IQR)
What statistical power is
What covariance is and how it relates to correlation
What data dredging is
What causation is, and how it relates to correlation
What the PPDAC cycle is
What Bayes’ Theorem is
What independent and dependent variables are
What confounding and lurking factors are
What a counterfactual is
Why it is important to study similarity
Have familiarity with Hamming distance, Euclidean distance, and some of the other similarity metrics
Know, specifically, what Jaccard similarity is and be able to calculate it
How to k-shingle a text document
How to choose the optimal k size
What minhashing is and how it works
Why minhashing matters
Explain LSH at a high level and why it is super handy
Explain the various steps in the data mining process
Explain what CRISP-DM and KDD are
Explain the four key types of data mining
How to one-hot encode categorical variables
Explain the data cleaning steps, and how to execute them
Options for encoding categorical data
Some of the pitfalls of encoding poorly
Explain multicollinearity
Explain frequency encoding
Explain the difference between ratio and interval scaled attributes
The difference between discrete and continuous attributes
Explain normalization and the different approaches
What feature engineering is
What the various metrics are for various machine learning approaches
What a pandas data frame is
Be able to explain some of the key NumPy and Pandas functions for data manipulation, cleaning, etc.
Approaches to dealing with outliers
Explain how the various machine learning metrics work and what they mean
Explain linear, polynomial and multivariate regression and when to use each of them
Give a simple explanation of a Python lambda function
Explain the core steps in developing a model
What a loss function is
How to measure performance of a model
The difference between precision and recall
What r-squared is
Have a general sense of which algorithms work for which problems (see the scikit-learn diagram)
Explain how each of the following algorithms/strategies/functions works at an operational level, and have a good sense of which algorithms work well for which types of modelling problems:
Naive Bayes
K-Nearest Neighbour
Agglomerative Clustering
Gradient Descent
K-means clustering
Support Vector Machine
Decision tree
The Gini coefficient
Cross validation
Grid search
What an ensemble model is
How the several approaches we covered (voting, bagging, boosting, stacking) work and their use cases
What a "weak learner" is
What MAML is
What the MOE approach is
Be able to identify and give examples of various boosting algorithms and their use cases
Identify the computational dimensions of some boosting approaches
What the overfitting/underfitting (bias-variance) trade-off is and how to optimize for it
Explain how Logistic Regression works and why it is an important algo
Explain the sigmoid function and its use in machine learning
Identify and explain the various options for assessing model performance
What regularization is and the difference between L1 and L2 approaches
How DBSCAN works and for which use cases it is a good fit
How Recommender Systems work, including the various strategies (collaborative filtering, content-based filtering, social and demographic filtering, contextual filtering) and their specific strengths and weaknesses
What dimensionality reduction is and why we do it sometimes
How we can do dimensionality reduction by way of feature selection
The difference between the PCA, LDA and UMAP approaches
What t-SNE is and what to use it for
What Natural Language Processing is
Know who Noam Chomsky is, and his role in language theory
Be able to identify various use cases for NLP
Identify the key Python NLP libraries and their uses
Be able to define the key terms in NLP
Identify key challenges and limitations in NLP
Have familiarity with common text encoding standards
Know the key differences between the NLTK and spaCy libraries
What tokenization is and how to do it
Define Part of Speech tagging
Define chunking
What a dependency parse is
What Named Entity Recognition is
What Lemmatization is
What Stopwords removal is
What text normalization is
Explain the various types of vectorizers (Bag of Words, TF-IDF, etc.)
Explain how embedding vectorizers generally work
Explain the CBOW and Skip-gram approaches to Word2Vec
How the OECD defines AI
The two general streams of AI
The difference between shallow and deep learning
What a Neuron is and what its key components are
How a Neural Network functions, and which features give it its unique capacity to learn complex problems
How generative AI works at a high level
How diffusion models work, at a high level
What a large language model is, and how it works
What the "attention" mechanism is
What RLHF and "Chain of Thought" are, and how they augment LLM functionality
Be able to identify challenges and pitfalls of deep learning models
What the "human evaluation bias" is
What an activation function is, and be able to describe different approaches and their use cases
Explain forward and back propagation at a high level
Explain the importance of understanding the role of potential bias in model building
Be able to identify the key types of bias which may creep into a model
Explain what "explainability" is, and be able to generally describe the various technical approaches to addressing this issue
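As a worked example for the Jaccard similarity and k-shingling items in the list above, here is a minimal Python sketch. It assumes character-level shingles and treats two empty sets as identical. The function names k_shingles and jaccard are illustrative only and do not come from any library or course material.

```python
def k_shingles(text: str, k: int) -> set:
    """Return the set of all length-k character substrings (shingles) of text."""
    if k <= 0:
        raise ValueError("k must be positive")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return 1.0
    return len(a & b) / len(union)


s_a = k_shingles("abcd", 2)  # {"ab", "bc", "cd"}
s_b = k_shingles("bcde", 2)  # {"bc", "cd", "de"}
print(jaccard(s_a, s_b))     # 2 shared shingles / 4 distinct shingles = 0.5
```

The same calculation can be done by hand: list the shingles of each document, count those that appear in both, and divide by the number of distinct shingles overall.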