Preprocessing Numerical Datasets
Wed Sep 11 2024
Handling outliers, missing data, or inconsistent scales.
- Review data quality: identify missing values; drop rows or columns with too many missing values.
  - Simple imputation: fill missing values with the mean, median, or mode.
  - Advanced imputation: fill missing values with KNN or regression models.
- Check for duplicate rows.
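A minimal pandas sketch of these two cleaning steps; the DataFrame, column names, and values are invented for illustration:

```python
import pandas as pd

# Toy dataset with one missing value per column and a duplicate row (hypothetical data).
df = pd.DataFrame({
    "age":    [25, 30, None, 30, 41],
    "income": [50_000, 62_000, 58_000, 62_000, None],
})

# Simple imputation: fill each column's missing values with its median.
df_filled = df.fillna(df.median())

# Remove exact duplicate rows.
df_clean = df_filled.drop_duplicates()

print(df_clean)
```

Median imputation is shown because it is robust to outliers; mean or mode would be one-line swaps (`df.mean()`, `df.mode().iloc[0]`).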
- Identify outliers: values that fall far outside the typical range of the data. Common detection rules:
  - Standard deviation: flag values more than ~3 standard deviations from the mean.
  - Interquartile range (IQR): flag values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
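Both detection rules can be sketched with NumPy; the data and the 3-sigma threshold are conventional illustrative choices:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # 95 is an obvious outlier

# Standard-deviation rule: flag points more than 3 sigma from the mean.
# On this tiny sample the outlier inflates the std itself, so the rule misses it.
z = (data - data.mean()) / data.std()
std_outliers = data[np.abs(z) > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
iqr_outliers = data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)]
print(iqr_outliers)  # [95]
```

The contrast is the point: the IQR rule uses quartiles, which the extreme value barely moves, so it is the more robust of the two on small or heavily skewed samples.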
- Data scaling (normalization): min-max normalization rescales each feature to the [0, 1] range; standardization (z-score) rescales to zero mean and unit variance.
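Min-max normalization is a one-liner; a sketch with scikit-learn's MinMaxScaler on a made-up single-feature array:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [5.0], [9.0]])

# Equivalent manual formula: (X - X.min(0)) / (X.max(0) - X.min(0))
scaled = MinMaxScaler().fit_transform(X)  # maps min -> 0, max -> 1

print(scaled.ravel())  # [0.  0.5 1. ]
```

`StandardScaler` from the same module is the drop-in choice when z-score standardization is wanted instead.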
- Logarithmic transformations: reduce skewness in the distribution.
  - Box-Cox: a family of power transforms that includes the log transform as a special case; its parameter is usually fit to make the data as close to normal as possible.
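A sketch of both transforms on synthetic right-skewed data, using NumPy's `log1p` and SciPy's `boxcox` (Box-Cox requires strictly positive values, which lognormal data satisfies):

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(0)
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # strongly right-skewed

log_data = np.log1p(data)    # log(1 + x): simple, also tolerates zeros
bc_data, lam = boxcox(data)  # fits the power parameter lambda by maximum likelihood

print(skew(data), skew(log_data), skew(bc_data))  # skewness shrinks toward 0
```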
- Encoding categorical variables:
  - One-Hot Encoding: convert a categorical variable into one binary column per category -> for unordered (nominal) variables
  - Label/Ordinal Encoding: map each category to an integer -> for ordered (ordinal) variables
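A sketch of both encodings with pandas; the categories and their ordering are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],        # nominal: no natural order
    "size":  ["small", "large", "medium", "small"],  # ordinal: small < medium < large
})

# One-hot: one binary column per category (for unordered variables).
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: map categories to integers that respect their order.
size_order = {"small": 0, "medium": 1, "large": 2}
df["size_encoded"] = df["size"].map(size_order)

print(one_hot.columns.tolist())  # ['color_blue', 'color_green', 'color_red']
```

Using an explicit mapping (rather than an automatic label encoder) guarantees the integers follow the real ordering of the categories.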
- Feature selection.
- Remove collinearity: highly correlated features introduce redundancy into the model. Detect them with correlation matrices and the Variance Inflation Factor (VIF).
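A minimal correlation-matrix filter with pandas; the 0.9 threshold, feature names, and data are illustrative (VIF itself lives in `statsmodels.stats.outliers_influence.variance_inflation_factor`, not shown here):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear with "a"
    "c": rng.normal(size=200),                      # independent feature
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)  # ['b']
```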
Dimensionality Reduction
PCA (Principal Component Analysis) -> reduces the number of features (components) while retaining as much of the variance as possible.
LDA (Linear Discriminant Analysis)
A supervised method that projects data into a lower-dimensional space while maximizing the separation between classes.
- MF and NMF (Matrix Factorization and Non-negative Matrix Factorization)
- t-SNE
- Autoencoder
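Of the methods above, PCA is the usual starting point; a scikit-learn sketch on synthetic data where half the columns are redundant (the 95% variance target is a typical but arbitrary choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X[:, 5:] = X[:, :5] @ rng.normal(size=(5, 5))  # columns 5-9 are linear combos of 0-4

# Keep the smallest number of components that explains 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
```

Because the last five columns carry no new information, at most five components are needed, matching the intuition that PCA discards redundant directions.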
Splitting the data into two sets: training and testing
- 70% training - 30% testing (common)
- 80% training - 20% testing (common)
- 90% training - 10% testing
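The common splits above map directly onto scikit-learn's `train_test_split`; here `test_size=0.3` gives the 70/30 split, and `random_state` is set only for reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # 50 toy samples, 2 features
y = np.arange(50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print(len(X_train), len(X_test))  # 35 15
```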
Cross-Validation (K-fold)
Split the dataset into K subsets (folds); train on K-1 folds and validate on the remaining one, rotating so that each fold serves exactly once as the validation set.
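A K-fold sketch with scikit-learn; K=5 is a common but arbitrary choice:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Each iteration: train on 8 samples, validate on the held-out 2.
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```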
Class Balancing
Techniques for balancing the classes:
- Oversampling: Increase the number of minority class data by duplicating examples.
- Undersampling: Reduce the number of majority class data by eliminating examples.
- SMOTE: Create synthetic examples of the minority class to balance the classes.
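Random over- and undersampling can be sketched with NumPy alone; SMOTE itself comes from the separate imbalanced-learn package (`imblearn.over_sampling.SMOTE`) and is not shown. The 90/10 class split is invented:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)   # imbalanced labels: 90 majority, 10 minority
X = rng.normal(size=(100, 3))

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

# Oversampling: duplicate random minority examples until the classes match.
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])

# Undersampling: keep a random subset of the majority class of minority size.
keep = rng.choice(majority, size=len(minority), replace=False)
X_under = np.vstack([X[keep], X[minority]])
y_under = np.concatenate([y[keep], y[minority]])

print(np.bincount(y_over), np.bincount(y_under))  # [90 90] [10 10]
```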
Objectives
- Reduce training time
- Increase accuracy
- Make data compatible with algorithms
- Interpretability