Preprocessing Numerical Datasets

by Renato Sanchez

Wed Sep 11 2024

notes

data mining

Handling outliers, missing data, or inconsistent scales.

  1. Review data quality: identify missing values, then either drop the affected rows or columns, or impute them.
    • Simple imputation: fill missing values with the mean, median, or mode
    • Advanced imputation: fill missing values with KNN or regression models
  2. Check for duplicate rows.
  3. Identify outliers: Values that fall outside the typical data range.
    • Standard deviation
    • Interquartile range
  4. Data scaling: min-max normalization rescales each feature to the [0, 1] range; standardization (z-score) rescales to zero mean and unit variance.
  5. Logarithmic transformations: reduce skewness in the distribution.
    • Box-Cox: a family of power transforms that generalizes the log transform
  6. Encoding categorical variables:
    • One-Hot Encoding: convert each category into its own binary column -> for unordered (nominal) variables
    • Label Encoding: map categories to integers -> for ordered (ordinal) variables
  7. Feature selection.
  8. Remove collinearity: highly correlated features introduce redundancy into the model. Detect them with correlation matrices and the variance inflation factor (VIF).
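The cleaning steps above can be sketched with pandas. This is a minimal example on a toy DataFrame (the column names and values are illustrative, not from the notes): median imputation, duplicate removal, the 1.5 × IQR outlier rule, min-max normalization, and one-hot encoding.

```python
import pandas as pd

# Toy frame with a missing value, an outlier (95), and a categorical column.
df = pd.DataFrame({
    "age":  [22, 25, None, 24, 23, 95],
    "city": ["A", "B", "A", "B", "A", "B"],
})

# 1. Simple imputation: fill missing ages with the median.
df["age"] = df["age"].fillna(df["age"].median())

# 2. Drop exact duplicate rows.
df = df.drop_duplicates()

# 3. Keep only rows inside the 1.5 * IQR fences.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 4. Min-max normalization of the numeric column to [0, 1].
df["age"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())

# 6. One-hot encode the unordered categorical column.
df = pd.get_dummies(df, columns=["city"])
```

For larger pipelines, scikit-learn offers the same operations as composable transformers (e.g. `SimpleImputer`, `MinMaxScaler`, `OneHotEncoder`).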

Dimensionality Reduction

PCA (Principal Component Analysis) -> reduces the number of dimensions while preserving as much variance as possible, by keeping only the top principal components.
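A minimal PCA sketch in plain NumPy, assuming synthetic data (the shapes and noise level are arbitrary): center the matrix, take its SVD, project onto the top-k right singular vectors, and read the explained variance off the singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 samples in 5 dimensions whose variance lies mostly along 2 directions.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 5)) \
    + 0.05 * rng.normal(size=(200, 5))

# Center the data; the right singular vectors of the centered matrix
# are the principal components.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
X_reduced = Xc @ Vt[:k].T                  # project onto the top-k components
explained = (S**2)[:k].sum() / (S**2).sum()  # fraction of variance retained
```

In practice `sklearn.decomposition.PCA` wraps exactly this computation and also exposes `explained_variance_ratio_`.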

LDA (Linear Discriminant Analysis)

A supervised method that projects data into a lower-dimensional space while maximizing the separation between classes.

  • Matrix Factorization (MF) and Non-negative Matrix Factorization (NMF)
  • t-SNE
  • Autoencoder

Division into two sets: training and testing

  • 70% training - 30% testing (common)
  • 80% training - 20% testing (common)
  • 90% training - 10% testing
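A 70/30 split like the first option above can be done by shuffling indices and cutting once (toy arrays for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(100).reshape(50, 2)   # 50 toy samples, 2 features each
y = np.arange(50)                   # toy labels

# Shuffle indices, then take the first 70% for training, the rest for testing.
idx = rng.permutation(len(X))
cut = int(0.7 * len(X))
X_train, X_test = X[idx[:cut]], X[idx[cut:]]
y_train, y_test = y[idx[:cut]], y[idx[cut:]]
```

scikit-learn's `train_test_split(X, y, test_size=0.3)` does the same thing, with extras such as stratified splitting.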

Cross-Validation (K-fold)

Split the dataset into K folds; each fold serves once as the validation set while the remaining K-1 folds are used for training, so every sample is validated exactly once.
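The fold rotation can be sketched with NumPy (`kfold_indices` is a hypothetical helper, not a library function): shuffle once, split into K chunks, and let each chunk take a turn as the validation set.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) index pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(kfold_indices(10, k=5))
```

`sklearn.model_selection.KFold` provides the same iteration, and `cross_val_score` runs a model over all folds in one call.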

Class Balancing

Techniques for balancing the classes:

  • Oversampling: Increase the number of minority class data by duplicating examples.
  • Undersampling: Reduce the number of majority class data by eliminating examples.
  • SMOTE: Create synthetic examples of the minority class to balance the classes.
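Random oversampling, the simplest of the three, can be sketched in NumPy on toy data: resample minority-class rows with replacement until both classes have the same count.

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)   # imbalanced toy labels: 90 vs 10
X = rng.normal(size=(100, 3))       # toy features

# Draw minority-class indices with replacement until the counts match.
minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)

X_bal = np.vstack([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
```

SMOTE differs from this sketch in that it interpolates between minority-class neighbors to create new synthetic points rather than duplicating rows; the `imbalanced-learn` package implements it as `imblearn.over_sampling.SMOTE`.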

Objectives

  • Reduce training time
  • Increase accuracy
  • Make data compatible with algorithms
  • Interpretability