This fourth Lab Assignment will give you practice in using many different models to find patterns in a dataset.
In this lab, you will work with the Credit Card Fraud Detection Dataset from Kaggle, which contains transactions made by European credit card holders in September 2013. You can download the dataset from Kaggle (see the Resources section at the end of this handout).
The dataset contains 284,807 transactions collected over two days. Please note that it contains only 492 fraud cases (0.172% of all transactions).
There are 30 features: 28 PCA-transformed components (V1-V28), plus Time and Amount. Time is the number of seconds elapsed between each transaction and the first transaction in the dataset. Amount is the transaction amount. The Class variable is 1 for fraud and 0 for legitimate.
**Why this dataset?**
- Real-world problem with high business impact
- Extreme class imbalance mirrors real fraud detection scenarios
- PCA features make distance-based algorithms particularly relevant
- Challenges students to think beyond simple accuracy metrics
- Large enough for meaningful analysis (~285K samples)
- Tests understanding of precision vs. recall trade-offs
**Key Challenge:** With only 0.172% fraudulent transactions, a model that predicts "legitimate" for every transaction achieves 99.828% accuracy but is completely useless! Students must use appropriate evaluation metrics.
---
## Lab Structure
This lab has **three main parts** plus a comparative analysis:
### Part 1: Unsupervised Learning (Anomaly Detection)
### Part 2: Supervised Learning (Fraud Classification)
### Part 3: Handling Imbalanced Data & Model Optimization
### Part 4: Comparative Analysis and Business Recommendations
---
## Part 1: Unsupervised Learning - Anomaly Detection (25 points)
Fraud detection can be approached as an **anomaly detection problem**. In this section, you will use clustering algorithms to identify unusual transaction patterns **without using fraud labels**.
### Task 1.1: Data Preprocessing and Exploration (5 points)
1. Load the credit card dataset
2. Perform exploratory data analysis:
- Check for missing values
- Analyze the class distribution (fraud vs. legitimate)
- Visualize the distribution of 'Amount' and 'Time' features
- Plot correlation matrix for V1-V28 features
3. Create a **balanced subset** for initial clustering exploration:
- Sample all 492 fraud cases
- Randomly sample 2,000 legitimate cases
- This gives you ~2,500 transactions to work with initially
4. Standardize 'Time' and 'Amount' features (V1-V28 are already scaled from PCA)
5. **Justify** your preprocessing choices
**Deliverable:** Code + visualizations + explanation of preprocessing
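A minimal starter sketch for steps 3-4, assuming the `creditcard.csv` file name used in the starter code at the end of this handout; the EDA plots in step 2 and the justification in step 5 are left to you:
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('creditcard.csv')

# Balanced subset: all fraud cases plus 2,000 sampled legitimate transactions.
fraud_df = df[df['Class'] == 1]
legit_df = df[df['Class'] == 0].sample(n=2000, random_state=42)
subset = pd.concat([fraud_df, legit_df]).sample(frac=1, random_state=42)

X = subset.drop(columns=['Class']).copy()
y = subset['Class']

# Standardize only 'Time' and 'Amount'; V1-V28 already come out of PCA.
scaler = StandardScaler()
X[['Time', 'Amount']] = scaler.fit_transform(X[['Time', 'Amount']])
print(X[['Time', 'Amount']].describe().loc[['mean', 'std']])
```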
### Task 1.2: K-Means for Anomaly Detection (7 points)
1. Apply K-Means with k=2, 3, 5, 7, 10 on your balanced subset
2. For each k, analyze which cluster contains more fraud cases
3. Use the elbow method and silhouette score to find optimal k
4. **Anomaly Score Method:**
- Train K-Means with optimal k on legitimate-only transactions
- Calculate distance of each transaction to nearest cluster center
- Transactions with large distances are potential anomalies
- Evaluate: What percentage of actual frauds are in the top 1% of distances?
5. Visualize clusters using PCA (reduce to 2D)
**Deliverable:** Elbow plot, cluster analysis, anomaly detection results, visualization
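A minimal sketch of the anomaly-score method from step 4, under the same preprocessing assumptions as above; `n_clusters=5` is only a placeholder for the k you select with the elbow and silhouette analysis:
```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('creditcard.csv')
X = df.drop(columns=['Class']).copy()
y = df['Class'].values
X[['Time', 'Amount']] = StandardScaler().fit_transform(X[['Time', 'Amount']])

# Fit K-Means on legitimate transactions only.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans.fit(X[y == 0])

# Anomaly score = distance to the nearest cluster center.
distances = kmeans.transform(X).min(axis=1)

# What fraction of the true frauds fall in the top 1% of distances?
cutoff = np.quantile(distances, 0.99)
flagged = distances >= cutoff
print(f"Frauds in top 1% of distances: {y[flagged].sum()} / {y.sum()} "
      f"({100 * y[flagged].sum() / y.sum():.1f}%)")
```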
### Task 1.3: DBSCAN for Outlier Detection (7 points)
1. Apply DBSCAN with at least 4 different (eps, min_samples) combinations
2. Identify points labeled as noise (outliers) by DBSCAN
3. Calculate what percentage of noise points are actual frauds
4. Compare density-based outlier detection with K-Means distance-based approach
5. Visualize: Plot legitimate vs fraud transactions, highlighting DBSCAN outliers
**Key Questions:**
- Does DBSCAN naturally separate frauds as outliers?
- How does DBSCAN performance change with different parameters?
- Which approach (K-Means distance vs DBSCAN outliers) better identifies fraud?
**Deliverable:** Parameter exploration, outlier analysis, comparison, visualizations
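A minimal sketch of the noise-point analysis, run on the balanced subset to keep DBSCAN tractable; the (eps, min_samples) values shown are illustrative, not recommendations:
```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('creditcard.csv')

# Work on the balanced subset so DBSCAN's pairwise-distance cost stays manageable.
subset = pd.concat([
    df[df['Class'] == 1],
    df[df['Class'] == 0].sample(n=2000, random_state=42),
])
X = subset.drop(columns=['Class']).copy()
y = subset['Class'].values
X[['Time', 'Amount']] = StandardScaler().fit_transform(X[['Time', 'Amount']])

# Illustrative (eps, min_samples) combinations; tune these yourself.
for eps, min_samples in [(2.0, 5), (3.0, 5), (3.0, 10), (5.0, 10)]:
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    noise = labels == -1
    frauds_in_noise = y[noise].sum()
    print(f"eps={eps}, min_samples={min_samples}: {noise.sum()} noise points, "
          f"{frauds_in_noise} fraud "
          f"({100 * frauds_in_noise / max(noise.sum(), 1):.1f}% of noise)")
```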
### Task 1.4: Agglomerative Clustering Analysis (6 points)
1. Apply agglomerative clustering with Ward linkage
2. Create a dendrogram (may need to subsample for visualization)
3. Cut dendrogram at different heights to create 2, 5, and 10 clusters
4. Analyze cluster purity: What fraction of each cluster is fraudulent?
5. Compare results with K-Means using the adjusted Rand index (ARI)
**Deliverable:** Dendrogram, cluster purity analysis, comparison with K-Means
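A minimal sketch covering the dendrogram, the purity check, and the ARI comparison, again on the balanced subset; adjust the subsample size and the dendrogram truncation as needed:
```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('creditcard.csv')
subset = pd.concat([
    df[df['Class'] == 1],
    df[df['Class'] == 0].sample(n=2000, random_state=42),
])
X = subset.drop(columns=['Class']).copy()
y = subset['Class'].values
X[['Time', 'Amount']] = StandardScaler().fit_transform(X[['Time', 'Amount']])

# Dendrogram from SciPy's Ward linkage (subsample further if it is too dense).
Z = linkage(X, method='ward')
plt.figure(figsize=(10, 4))
dendrogram(Z, truncate_mode='lastp', p=30)
plt.title('Ward linkage dendrogram (truncated)')
plt.show()

# Cut into 2, 5, and 10 clusters and inspect cluster purity.
for n_clusters in [2, 5, 10]:
    agg_labels = AgglomerativeClustering(n_clusters=n_clusters,
                                         linkage='ward').fit_predict(X)
    purity = pd.crosstab(agg_labels, y, normalize='index')
    print(f"\n{n_clusters} clusters - fraction fraud per cluster:\n{purity[1]}")

    km_labels = KMeans(n_clusters=n_clusters, n_init=10,
                       random_state=42).fit_predict(X)
    print(f"ARI vs. K-Means ({n_clusters} clusters): "
          f"{adjusted_rand_score(agg_labels, km_labels):.3f}")
```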
**Part 1 Reflection Questions:**
- Can unsupervised methods effectively detect fraud without labels?
- What are the limitations of treating fraud detection purely as anomaly detection?
- Which unsupervised approach showed the most promise?
---
## Part 2: Supervised Learning - Fraud Classification (40 points)
Now you will build supervised classifiers using fraud labels. **Critical:** You must use appropriate metrics for imbalanced data!
### Task 2.1: Data Preparation and Baseline (8 points)
1. Split data into training (70%) and test (30%) sets using **stratified sampling**
2. Verify class distribution is preserved in both sets
3. Standardize 'Time' and 'Amount' features using training set statistics
4. Create a **baseline "always predict legitimate" classifier**:
- Calculate accuracy, precision, recall, F1-score
- This shows why accuracy is meaningless for this problem!
5. Define your **primary evaluation metric** and justify it
- Options: Precision-Recall AUC, F1-score, F2-score, etc.
- Consider business context: False positives vs. false negatives
**Deliverable:** Train/test split code, baseline results, metric selection justification
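A minimal sketch of steps 1-4; `DummyClassifier` is one convenient way to build the "always predict legitimate" baseline (hard-coding an all-zeros prediction works just as well):
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report

df = pd.read_csv('creditcard.csv')
X = df.drop(columns=['Class'])
y = df['Class']

# Stratified 70/30 split preserves the 0.172% fraud rate in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Fit the scaler on the training set only, then apply it to both sets.
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler().fit(X_train[['Time', 'Amount']])
X_train[['Time', 'Amount']] = scaler.transform(X_train[['Time', 'Amount']])
X_test[['Time', 'Amount']] = scaler.transform(X_test[['Time', 'Amount']])

# Baseline: always predict the majority class ("legitimate").
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print(classification_report(y_test, baseline.predict(X_test),
                            digits=4, zero_division=0))
```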
### Task 2.2: K-Nearest Neighbors (6 points)
1. Train KNN with k=1, 3, 5, 7, 11, 15
2. For each k, report:
- Precision, Recall, F1-score for fraud class
- Confusion matrix
- ROC curve and AUC
3. Plot precision-recall curve for optimal k
4. Analyze: How does k affect fraud detection performance?
5. **Warning:** KNN may be slow on the full dataset; consider training on a subset if needed
**Deliverable:** Performance metrics table, confusion matrices, PR/ROC curves, analysis
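A minimal sketch of the k sweep (it repeats the Task 2.1 split and scaling so it runs on its own); the confusion matrices and PR/ROC curves are left to you:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_csv('creditcard.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Class']), df['Class'],
    test_size=0.30, stratify=df['Class'], random_state=42)

# Standardize 'Time' and 'Amount' with training-set statistics only.
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler().fit(X_train[['Time', 'Amount']])
X_train[['Time', 'Amount']] = scaler.transform(X_train[['Time', 'Amount']])
X_test[['Time', 'Amount']] = scaler.transform(X_test[['Time', 'Amount']])

for k in [1, 3, 5, 7, 11, 15]:
    # KNN prediction is slow at this scale; subsample the training set if needed.
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1).fit(X_train, y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, knn.predict(X_test), average='binary', zero_division=0)
    print(f"k={k:2d}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```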
### Task 2.3: Decision Tree (6 points)
1. Train decision tree with max_depth = 3, 5, 10, 15, None
2. Visualize a pruned tree (max_depth=5) and interpret key splits
3. Extract feature importances - which features are most predictive?
4. For each depth:
- Report precision, recall, F1 for fraud detection
- Plot precision-recall curve
5. Analyze overfitting: Compare train vs. test performance
**Deliverable:** Tree visualization, feature importance plot, depth comparison, analysis
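A minimal sketch of the depth sweep, tree plot, and feature importances; trees do not require feature scaling, so the split is used as-is:
```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import f1_score

df = pd.read_csv('creditcard.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Class']), df['Class'],
    test_size=0.30, stratify=df['Class'], random_state=42)

# Depth sweep: compare train vs. test F1 to gauge overfitting.
for depth in [3, 5, 10, 15, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={str(depth):>4s}: "
          f"train F1={f1_score(y_train, tree.predict(X_train)):.3f}  "
          f"test F1={f1_score(y_test, tree.predict(X_test)):.3f}")

# Visualize the shallow (max_depth=5) tree and its most important features.
tree5 = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_train, y_train)
plt.figure(figsize=(20, 8))
plot_tree(tree5, feature_names=list(X_train.columns),
          class_names=['legit', 'fraud'], filled=True, fontsize=6)
plt.show()

importances = pd.Series(tree5.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```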
### Task 2.4: Random Forest (7 points)
1. Train Random Forest with n_estimators = 50, 100, 200, 500
2. Extract and visualize feature importances
3. Compare feature importances with single decision tree
4. Analyze out-of-bag (OOB) score if available
5. Generate precision-recall curve
6. **Class weight experiment:** Try `class_weight='balanced'` parameter
- Compare performance with and without balanced weights
- How does this help with imbalance?
**Deliverable:** Performance vs. trees plot, feature importance, class weight comparison
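A minimal sketch comparing `class_weight=None` and `class_weight='balanced'`, with OOB score and feature importances; `n_estimators=100` is just one point of the sweep asked for in step 1:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_csv('creditcard.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Class']), df['Class'],
    test_size=0.30, stratify=df['Class'], random_state=42)

for weights in [None, 'balanced']:
    rf = RandomForestClassifier(n_estimators=100, class_weight=weights,
                                oob_score=True, n_jobs=-1, random_state=42)
    rf.fit(X_train, y_train)
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, rf.predict(X_test), average='binary', zero_division=0)
    print(f"class_weight={weights}: OOB={rf.oob_score_:.4f}  "
          f"precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")

# Feature importances from the last fitted forest; compare with the single tree.
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```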
### Task 2.5: Support Vector Machine (7 points)
1. **Note:** SVM is computationally expensive on 285K samples
- Either use a stratified sample (10K legitimate + all frauds)
- Or use SGDClassifier as a scalable alternative
2. Train SVM with different kernels on your sample:
- Linear kernel
- RBF kernel (try C=0.1, 1, 10)
3. Experiment with `class_weight='balanced'` parameter
4. Report precision, recall, F1 for fraud class
5. Compare training times across configurations
6. Generate precision-recall curves
**Deliverable:** Performance comparison, timing analysis, PR curves
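A minimal sketch using the stratified-sample route (all frauds + 10,000 legitimate transactions); here all features are standardized before the SVM, which is a choice you should justify or revisit:
```python
import time
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import precision_recall_fscore_support

df = pd.read_csv('creditcard.csv')

# Stratified sample to keep SVC training tractable.
sample = pd.concat([
    df[df['Class'] == 1],
    df[df['Class'] == 0].sample(n=10_000, random_state=42),
])
X = sample.drop(columns=['Class'])
y = sample['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# SVMs are sensitive to feature scale, so standardize everything here.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

configs = [
    ('linear', dict(kernel='linear', C=1.0)),
    ('rbf C=0.1', dict(kernel='rbf', C=0.1)),
    ('rbf C=1', dict(kernel='rbf', C=1.0)),
    ('rbf C=10', dict(kernel='rbf', C=10.0)),
]
for name, params in configs:
    start = time.time()
    svm = SVC(class_weight='balanced', random_state=42, **params)
    svm.fit(X_train_s, y_train)
    elapsed = time.time() - start
    p, r, f1, _ = precision_recall_fscore_support(
        y_test, svm.predict(X_test_s), average='binary', zero_division=0)
    print(f"{name:10s} precision={p:.3f} recall={r:.3f} f1={f1:.3f} "
          f"({elapsed:.1f}s)")
```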
### Task 2.6: Model Comparison Table (6 points)
Create a comprehensive comparison including:
- Algorithm name
- Best hyperparameters
- **Fraud Detection Metrics (what matters!):**
- Precision (fraud class)
- Recall (fraud class)
- F1-score (fraud class)
- Precision-Recall AUC
- ROC AUC
- Overall accuracy (for reference, but note it's misleading)
- Training time
- Number of false positives and false negatives
Rank models by your chosen primary metric.
**Deliverable:** Comparison table with analysis and ranking
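One way to keep the table manageable is a small helper that turns any fitted classifier into a row of metrics; the function below is a hypothetical example (names and columns are up to you):
```python
import numpy as np
import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, precision_recall_curve,
                             auc, confusion_matrix)

def metric_row(name, model, X_test, y_test, train_time=np.nan):
    """Collect one comparison-table row for an already-fitted classifier."""
    y_pred = model.predict(X_test)
    # Continuous scores: probabilities if available, else the decision function.
    if hasattr(model, 'predict_proba'):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    prec, rec, _ = precision_recall_curve(y_test, scores)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return {
        'model': name,
        'precision_fraud': precision_score(y_test, y_pred, zero_division=0),
        'recall_fraud': recall_score(y_test, y_pred),
        'f1_fraud': f1_score(y_test, y_pred),
        'pr_auc': auc(rec, prec),
        'roc_auc': roc_auc_score(y_test, scores),
        'accuracy': accuracy_score(y_test, y_pred),
        'train_time_s': train_time,
        'false_positives': fp,
        'false_negatives': fn,
    }

# Usage: rows = [metric_row('Random Forest', rf, X_test, y_test), ...]
#        pd.DataFrame(rows).sort_values('pr_auc', ascending=False)
```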
---
## Part 3: Handling Imbalanced Data & Optimization (25 points)
This is where you address the class imbalance challenge directly!
### Task 3.1: Resampling Techniques (10 points)
Implement and compare these approaches on your top 2 models from Part 2:
1. **Random Undersampling:**
- Randomly remove legitimate transactions to balance classes
- Train model on balanced dataset
- Evaluate on original imbalanced test set
2. **Random Oversampling:**
- Randomly duplicate fraud cases to balance classes
- Train model on balanced dataset
- Evaluate on original imbalanced test set
3. **SMOTE (Synthetic Minority Oversampling):**
- Use SMOTE to generate synthetic fraud examples
- Train model on SMOTE-balanced dataset
- Evaluate on original imbalanced test set
4. **Class Weights (no resampling):**
- Use `class_weight='balanced'` or custom weights
- Train on original imbalanced data
- Evaluate on test set
For each approach:
- Report fraud detection precision, recall, F1
- Analyze trade-offs
- Create precision-recall curves
**Questions to answer:**
- Which resampling technique works best?
- What are the downsides of each approach?
- How does the confusion matrix change?
**Deliverable:** Implementation of all 4 approaches, comparison table, analysis
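A minimal sketch of all four approaches; Random Forest is only a placeholder for whichever top-2 models you carry over from Part 2, and note that resampling is applied to the training split only:
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_csv('creditcard.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Class']), df['Class'],
    test_size=0.30, stratify=df['Class'], random_state=42)

samplers = {
    'random undersampling': RandomUnderSampler(random_state=42),
    'random oversampling': RandomOverSampler(random_state=42),
    'SMOTE': SMOTE(random_state=42),
}
for name, sampler in samplers.items():
    # Resample the TRAINING data only; the test set stays imbalanced.
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
    clf.fit(X_res, y_res)
    print(f"\n--- {name} ---")
    print(classification_report(y_test, clf.predict(X_test), digits=3))

# Class weights: no resampling, just reweight the classes during training.
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             n_jobs=-1, random_state=42).fit(X_train, y_train)
print("\n--- class_weight='balanced' ---")
print(classification_report(y_test, clf.predict(X_test), digits=3))
```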
### Task 3.2: Hyperparameter Tuning with GridSearchCV (8 points)
Select your **best performing model + resampling approach** from Task 3.1.
1. Define a comprehensive hyperparameter grid (3+ parameters)
2. Use GridSearchCV with:
- 5-fold stratified cross-validation
- Custom scoring: use 'f1', 'precision', or 'recall' based on business needs
3. Report best hyperparameters found
4. Compare performance before and after tuning
5. Analyze: How much did tuning improve fraud detection?
**Example grids:**
- **Random Forest:** n_estimators, max_depth, min_samples_split, max_features, class_weight
- **Decision Tree:** max_depth, min_samples_split, min_samples_leaf, class_weight, criterion
- **SVM/SGD:** C, loss, penalty, class_weight
**Deliverable:** Grid search code, best parameters, performance comparison
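A minimal sketch of the grid search, again with Random Forest as a placeholder; shrink the grid if runtime becomes an issue, and if your Task 3.1 winner uses resampling, wrap the sampler and model in an `imblearn.pipeline.Pipeline` so only the training folds are resampled:
```python
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

df = pd.read_csv('creditcard.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Class']), df['Class'],
    test_size=0.30, stratify=df['Class'], random_state=42)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 10],
    'class_weight': [None, 'balanced'],
}
grid = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid,
    scoring='f1',          # or 'recall', 'precision', 'average_precision', ...
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
    verbose=1,
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)
print(f"Best cross-validated F1: {grid.best_score_:.3f}")
print(f"Test F1 with the tuned model: "
      f"{f1_score(y_test, grid.best_estimator_.predict(X_test)):.3f}")
```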
### Task 3.3: Threshold Optimization (7 points)
Most classifiers output probabilities. The default threshold is 0.5, but this may not be optimal for imbalanced data.
1. For your best model, predict fraud probabilities on test set
2. Try classification thresholds from 0.1 to 0.9 (step 0.05)
3. For each threshold, calculate:
- Precision
- Recall
- F1-score
- Number of false alarms (false positives)
4. Plot precision, recall, and F1 vs. threshold
5. **Business scenario:** If investigating a false positive costs $5 but missing fraud costs $500:
- What threshold would you choose?
- Calculate expected cost for different thresholds
6. Find the threshold that maximizes F1-score and compare it with the threshold that minimizes expected cost
**Deliverable:** Threshold analysis plots, optimal threshold recommendation, cost analysis
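A minimal sketch of the threshold sweep and the $5 / $500 cost calculation; the Random Forest here is only a placeholder for your tuned model from Task 3.2:
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, f1_score

df = pd.read_csv('creditcard.csv')
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=['Class']), df['Class'],
    test_size=0.30, stratify=df['Class'], random_state=42)

# Placeholder model; substitute your tuned model from Task 3.2.
model = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                               n_jobs=-1, random_state=42).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
y_true = y_test.to_numpy()

FP_COST, FN_COST = 5, 500          # costs from the business scenario in step 5
rows = []
for threshold in np.arange(0.10, 0.91, 0.05):
    y_pred = (proba >= threshold).astype(int)
    fp = int(((y_pred == 1) & (y_true == 0)).sum())
    fn = int(((y_pred == 0) & (y_true == 1)).sum())
    rows.append({
        'threshold': round(float(threshold), 2),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred),
        'f1': f1_score(y_true, y_pred),
        'false_positives': fp,
        'expected_cost': FP_COST * fp + FN_COST * fn,
    })

results = pd.DataFrame(rows)
print(results.to_string(index=False))
print("Best-F1 threshold:    ", results.loc[results['f1'].idxmax(), 'threshold'])
print("Lowest-cost threshold:", results.loc[results['expected_cost'].idxmin(), 'threshold'])
```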
---
## Part 4: Final Analysis and Business Recommendations (10 points)
### Task 4.1: Comprehensive Analysis (7 points)
Write a structured analysis addressing:
1. **Unsupervised vs. Supervised:**
- Can unsupervised methods detect fraud without labels?
- What role could clustering play in a real fraud detection system?
- When might unsupervised approaches be valuable?
2. **Algorithm Performance:**
- Which algorithm family (distance-based, tree-based, SVM) worked best?
- Why do certain algorithms struggle with this dataset?
- How did class imbalance affect different algorithms?
3. **Handling Imbalance:**
- Which technique (undersampling, oversampling, SMOTE, class weights) was most effective?
- What are the practical implications of each approach?
- Would you recommend combining techniques?
4. **Business Deployment:**
- Which model would you deploy in production and why?
- What threshold would you set and why?
- How would you handle the precision-recall trade-off?
- What is the cost of false positives vs. false negatives?
- How would you monitor model performance over time?
5. **Limitations and Future Work:**
- What are the limitations of your best model?
- What additional features might improve performance?
- How would you handle concept drift (fraud patterns change over time)?
### Task 4.2: Key Visualizations (3 points)
Create at least three insightful visualizations:
1. Comparison of all models (precision-recall curves on same plot)
2. Impact of resampling techniques (comparative bar chart)
3. Threshold optimization analysis (multi-line plot)
**Deliverable:** Written analysis (3-4 pages) + visualizations
---
## Submission Requirements
### 1. Jupyter Notebook (70% of grade)
- Well-commented, organized code for all tasks
- Clear section headers matching lab structure
- All required outputs (plots, tables, metrics)
- Markdown cells with explanations
- Reproducible results (set `random_state=42`)
### 2. Written Report (30% of grade)
- Executive summary (1 paragraph)
- Methodology overview
- Key findings and results
- Part 4 analysis (detailed)
- Business recommendations
- **Format:** PDF, 5-7 pages, 12pt font
### 3. Code Quality
- Use scikit-learn and imblearn libraries
- Follow PEP 8 style guidelines
- Proper variable naming
- Efficient implementations
---
## Grading Rubric
| Component | Points |
|-----------|--------|
| **Part 1: Unsupervised/Anomaly Detection** | 25 |
| - Data preprocessing and exploration | 5 |
| - K-Means anomaly detection | 7 |
| - DBSCAN outlier detection | 7 |
| - Agglomerative clustering | 6 |
| **Part 2: Supervised Classification** | 40 |
| - Data preparation and baseline | 8 |
| - KNN | 6 |
| - Decision Tree | 6 |
| - Random Forest | 7 |
| - SVM | 7 |
| - Model comparison | 6 |
| **Part 3: Imbalance & Optimization** | 25 |
| - Resampling techniques | 10 |
| - Hyperparameter tuning | 8 |
| - Threshold optimization | 7 |
| **Part 4: Analysis & Recommendations** | 10 |
| - Written analysis | 7 |
| - Visualizations | 3 |
| **TOTAL** | **100** |
**Bonus (up to 5 points):**
- Implement cost-sensitive learning
- Try ensemble methods (Voting, Stacking)
- Explore feature engineering (transaction patterns over time)
- Implement anomaly detection algorithms (Isolation Forest, One-Class SVM)
- Deep dive into feature importance and interpretability
---
## Getting Started
### Required Libraries
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier # Scalable alternative to SVM
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve,
    precision_recall_curve, auc, silhouette_score, adjusted_rand_score
)
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')
```
### Loading the Data
```python
# Load the dataset
df = pd.read_csv('creditcard.csv')
print(f"Dataset shape: {df.shape}")
print(f"Fraud cases: {df['Class'].sum()}")
print(f"Legitimate cases: {len(df) - df['Class'].sum()}")
print(f"Fraud percentage: {100 * df['Class'].sum() / len(df):.3f}%")
# Check for missing values
print(f"\nMissing values:\n{df.isnull().sum()}")
# Display basic statistics
print(f"\nBasic statistics:\n{df.describe()}")
```
### Creating Balanced Subset for Initial Exploration
```python
# For clustering experiments - balanced subset
fraud_df = df[df['Class'] == 1] # All 492 frauds
legitimate_df = df[df['Class'] == 0].sample(n=2000, random_state=42)
balanced_subset = pd.concat([fraud_df, legitimate_df]).sample(frac=1, random_state=42)
print(f"Balanced subset size: {len(balanced_subset)}")
print(f"Fraud ratio: {balanced_subset['Class'].mean():.3f}")
```
---
## Important Notes on Class Imbalance
**Why accuracy is misleading:**
- A model that always predicts "legitimate" achieves 99.83% accuracy
- But it has 0% recall on fraud (catches no fraud at all!)
- This would be a disaster in production
**Appropriate metrics:**
- **Precision:** Of predicted frauds, how many are actually fraud?
- High precision = few false alarms
- **Recall:** Of actual frauds, how many did we catch?
- High recall = catch most fraud
- **F1-Score:** Harmonic mean of precision and recall
- **Precision-Recall AUC:** Area under precision-recall curve
- **ROC AUC:** Area under ROC curve (less affected by imbalance than accuracy)
**Business considerations:**
- False Positive: Flag legitimate transaction as fraud → customer annoyance, investigation cost
- False Negative: Miss actual fraud → financial loss, customer trust damage
- Usually, missing fraud is much more costly than false alarms
- This suggests favoring higher recall (catch more fraud) even if precision drops slightly
---
## Helpful Tips
1. **Start with balanced subset** for initial clustering to save computation time
2. **Always use stratified splits** to preserve class distribution
3. **Never evaluate using only accuracy** - it's meaningless for imbalanced data
4. **Visualize confusion matrices** - they tell the real story
5. **Plot precision-recall curves** - better than ROC for imbalanced data
6. **Be patient with training times** - some algorithms are slow on 285K samples
7. **Consider sampling** for computationally expensive algorithms (SVM, KNN)
8. **Document your choices** - especially regarding evaluation metrics and thresholds
9. **Think like a business stakeholder** - what really matters in production?
---
## Resources
- **Scikit-learn:** https://scikit-learn.org/
- **Imbalanced-learn:** https://imbalanced-learn.org/
- **Dataset:** https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud
- **Paper:** Andrea Dal Pozzolo et al., "Calibrating Probability with Undersampling for Unbalanced Classification"
---
## Academic Integrity
- All code and analysis must be your own work
- Properly cite external resources
- Collaboration on concepts is allowed; code must be individual
- AI-generated code is not permitted
---
Good luck! This lab will give you real-world experience with one of the most important challenges in machine learning: handling severely imbalanced datasets.
Download your Jupyter Notebook, with your data cleanup, analysis and approach, model building, and results, and send it to me as an attachment to an email with the subject line: CS 4403 - Lab Assignment 4.
Submit this no later than 23:59 on Tuesday ~, 2026.
There will be a prize for the best submission, taking into account:
- The quality with which each part of the lab was completed
- How well your code is modularized
- The results obtained