
Data Preprocessing

Data preprocessing and transformations available in PyCaret
- Data Preparation
- Scale and Transform
- Feature Engineering
- Feature Selection
- Other setup parameters
Data Preparation

Missing Values

Datasets may have missing values or empty records for various reasons, often encoded as blanks or NaN. Most machine learning algorithms cannot deal with missing values, so they must be imputed before training.
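As a hedged sketch (assuming the PyCaret 2.x setup API and the bundled diabetes dataset), imputation is configured directly in setup:
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# impute numeric columns with the mean and categorical columns with the mode
s = setup(data=data, target='Class variable',
          numeric_imputation='mean',
          categorical_imputation='mode')
```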
Data Types

Each feature in the dataset has an associated data type such as numeric, categorical, or datetime. PyCaret automatically infers the data type of each feature, and the inference can be overridden when it is wrong.
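For illustration, a sketch of overriding the inferred types; the employee dataset and its number_project column are assumptions and may differ:
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('employee')

# force an integer-coded column to be treated as categorical
# instead of the auto-inferred numeric type
s = setup(data=data, target='left',
          categorical_features=['number_project'])
```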
One-Hot Encoding

Categorical features in the dataset contain label values (ordinal or nominal) rather than continuous numbers. Most machine learning algorithms cannot handle categorical data without encoding it into numeric values, so PyCaret one-hot encodes categorical features by default.
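A small sketch of the default behavior (PyCaret 2.x API assumed): no extra flag is needed, and the transformed matrix can be inspected via get_config:
```python
from pycaret.datasets import get_data
from pycaret.classification import setup, get_config

data = get_data('juice')

# categorical columns are one-hot encoded by default during setup
s = setup(data=data, target='Purchase')

X = get_config('X')   # transformed features, including the new dummy columns
print(X.columns)
```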
Ordinal Encoding

When the categorical features in the dataset contain variables with an intrinsic natural order, such as Low, Medium, and High, these must be encoded differently from nominal variables (where there is no intrinsic order, e.g. Male or Female).
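A sketch using the bundled employee dataset, whose salary column has a natural low/medium/high order (2.x API assumed):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('employee')

# pass the ordered levels, lowest to highest, for each ordinal column
s = setup(data=data, target='left',
          ordinal_features={'salary': ['low', 'medium', 'high']})
```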
Cardinal Encoding

When categorical features in the dataset contain variables with many levels (also known as high-cardinality features), typical one-hot encoding leads to the creation of a very large number of new features.
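A hedged sketch with the PyCaret 2.x high_cardinality_features parameter; the income dataset and its native-country column are assumptions:
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# replace the many levels of 'native-country' with their frequency
# counts instead of one-hot encoding them
s = setup(data=data, target='income >50K',
          high_cardinality_features=['native-country'],
          high_cardinality_method='frequency')
```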
Fix Imbalance

When the training dataset has an unequal distribution of the target classes, the imbalance can be fixed using the fix_imbalance parameter in the setup.
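A minimal sketch (2.x API assumed; SMOTE is the documented default resampler when no method is given):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('credit')

# resample the training data; with fix_imbalance_method left unset,
# SMOTE is used by default
s = setup(data=data, target='default', fix_imbalance=True)
```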
Remove Outliers

The remove_outliers parameter in the setup allows you to identify and remove outliers from the dataset before training the model.
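A sketch on the bundled insurance dataset (regression module; the 0.05 threshold mirrors the documented default):
```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('insurance')

# drop roughly the most extreme 5% of training rows
s = setup(data=data, target='charges',
          remove_outliers=True, outliers_threshold=0.05)
```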

Scale and Transform

Normalize

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to rescale the values of numeric columns in the dataset without distorting the differences in the ranges of values.
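As a minimal sketch (2.x API assumed), normalization is switched on at setup:
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# 'zscore' is the default; 'minmax', 'maxabs', and 'robust' are alternatives
s = setup(data=data, target='Class variable',
          normalize=True, normalize_method='minmax')
```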
Feature Transform

While normalization rescales the data within new limits to reduce the impact of magnitude in the variance, feature transformation is a more radical technique: it changes the shape of the distribution so that the transformed data can be represented by a normal or approximately normal distribution.
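A hedged sketch (2.x API; 'yeo-johnson' is the documented default method, 'quantile' the alternative):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('juice')

# reshape skewed numeric features toward a normal distribution
s = setup(data=data, target='Purchase',
          transformation=True, transformation_method='yeo-johnson')
```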
Target Transform

Target transformation is similar to feature transformation, except that it changes the shape of the distribution of the target variable instead of the features. It is only available in the regression module.
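A sketch on the bundled diamond dataset (regression module; target name assumed):
```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('diamond')

# applies a Box-Cox transform to 'Price' and inverts it at prediction time
s = setup(data=data, target='Price',
          transform_target=True, transform_target_method='box-cox')
```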
Feature Engineering

Feature Interaction

It is often seen in machine learning experiments that two features combined through an arithmetic operation become more significant in explaining variances in the data than the same two features separately.
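A hedged sketch using the feature_interaction and feature_ratio parameters from the PyCaret 2.x setup (these were removed in later versions):
```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('insurance')

# create a*b (interaction) and a/b (ratio) features
# for pairs of numeric columns
s = setup(data=data, target='charges',
          feature_interaction=True, feature_ratio=True)
```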
Polynomial Features

In machine learning experiments the relationship between the dependent and independent variables is often assumed to be linear; however, this is not always the case. Sometimes the relationship is more complex, and creating polynomial features can help capture it.
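A minimal sketch (2.x API assumed):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('juice')

# add squared terms and pairwise products of the numeric features
s = setup(data=data, target='Purchase',
          polynomial_features=True, polynomial_degree=2)
```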
Group Features

When a dataset contains features that are related to each other in some way, for example, features recorded at fixed time intervals, new statistical features such as the mean, median, variance, and standard deviation for such a group of features can be created.
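For illustration, a sketch on the credit dataset, assuming it carries six related monthly bill columns (column names hedged):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('credit')

# the six monthly bill columns form one related group; PyCaret derives
# aggregate statistics across them as new features
s = setup(data=data, target='default',
          group_features=['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3',
                          'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6'])
```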
Bin Numeric Features

Feature binning is a method of turning continuous variables into categorical values using a pre-defined number of bins. It is effective when a continuous feature has too many unique values or a few extreme values outside the expected range.
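A hedged sketch (2.x API; the income dataset and its age column are assumptions):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# discretize 'age' into a number of bins determined internally
s = setup(data=data, target='income >50K',
          bin_numeric_features=['age'])
```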
Combine Rare Levels

Sometimes a dataset can have one or more categorical features with a very high number of levels (i.e. high-cardinality features). If such features are encoded into numeric values, the resulting matrix is sparse. Combining the infrequent levels into a single level makes the matrix denser.
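A sketch using the PyCaret 2.x combine_rare_levels parameter (the 0.10 threshold mirrors the documented default):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# merge every level seen in under 10% of rows into a single 'rare' level
s = setup(data=data, target='income >50K',
          combine_rare_levels=True, rare_level_threshold=0.10)
```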
Create Clusters

Creating clusters from the existing features of the data is an unsupervised ML technique used to engineer new features: each data point is assigned to a cluster, and the cluster label is used as a new feature.
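A minimal sketch using the PyCaret 2.x create_clusters parameter (removed in later versions):
```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('insurance')

# append an unsupervised cluster label for each row as a new feature
s = setup(data=data, target='charges', create_clusters=True)
```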
Feature Selection

Feature selection is a process used to select the features in the dataset that contribute the most to predicting the target variable. Working with selected features instead of all the features reduces the risk of overfitting, improves accuracy, and decreases training time.
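A hedged sketch (2.x API; the 0.8 threshold mirrors the documented default):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# keep the subset of features ranked most important; lowering the
# threshold keeps fewer features and trains faster
s = setup(data=data, target='Class variable',
          feature_selection=True, feature_selection_threshold=0.8)
```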
Remove Multicollinearity

Multicollinearity (also called collinearity) is a phenomenon in which one feature variable in the dataset is highly linearly correlated with another feature variable in the same dataset.
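For illustration, a sketch on the concrete dataset (its target name 'strength' is an assumption; the 0.9 threshold mirrors the documented default):
```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('concrete')

# when two features correlate above 0.9, drop the one that is
# less correlated with the target
s = setup(data=data, target='strength',
          remove_multicollinearity=True, multicollinearity_threshold=0.9)
```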
Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of the data. It does so by compressing the feature space.
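A minimal sketch (2.x API; 'kernel' and 'incremental' are the alternative methods):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# project the feature space down to 10 linear components
s = setup(data=data, target='income >50K',
          pca=True, pca_method='linear', pca_components=10)
```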
Ignore Low Variance

Sometimes a dataset may have a categorical feature with multiple levels, where the distribution of such levels is skewed and one level may dominate over the other levels. This means there is not much variation in the information provided by such a feature, so it can be ignored.
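A hedged sketch using the PyCaret 2.x ignore_low_variance parameter (the dataset choice is illustrative):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# drop categorical features whose levels show statistically
# insignificant variance
s = setup(data=data, target='income >50K', ignore_low_variance=True)
```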
Other setup parameters

Required Parameters

There are only two non-optional parameters in the setup function: the data and the name of the target variable.
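A minimal sketch; everything beyond data and target falls back to defaults:
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# data and target are the only required arguments
s = setup(data=data, target='Class variable')
```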
Experiment Logging

PyCaret uses MLflow for experiment tracking. The log_experiment parameter in the setup can be set to automatically track all the metrics, hyperparameters, and other model artifacts.
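As a sketch, logging is enabled at setup, after which every trained model is recorded as an MLflow run:
```python
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models

data = get_data('diabetes')

s = setup(data=data, target='Class variable',
          log_experiment=True, experiment_name='diabetes1')
best = compare_models()   # each candidate model is logged as an MLflow run
```
The logged runs can then be browsed by running `mlflow ui` from the same working directory.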
Model Selection

Some parameters in the setup control the model selection process, such as the size of the train/test split and the cross-validation strategy. These are not related to data preprocessing but can influence your model selection.
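A hedged sketch (the fold_strategy parameter assumes PyCaret 2.3 or later):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# hold out 30% for testing and use 5-fold stratified cross-validation
s = setup(data=data, target='Class variable',
          train_size=0.7, fold=5, fold_strategy='stratifiedkfold')
```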
Other Parameters

Other miscellaneous parameters in the setup control experiment settings, such as using a GPU for training or setting the verbosity of the experiment.
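A final sketch (2.x API; use_gpu only takes effect for estimators with GPU-enabled backends installed):
```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# fix the random seed, train on GPU where an estimator supports it,
# and silence the setup output
s = setup(data=data, target='Class variable',
          session_id=123, use_gpu=True, verbose=False)
```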