Data Preprocessing
Data preprocessing and transformations available in PyCaret
Datasets may have missing values or empty records for various reasons, often encoded as blanks or NaN. Most machine learning algorithms cannot handle missing values.
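As a rough illustration of what handling missing values can look like, the sketch below fills gaps in a numeric column with the mean of the observed values. The helper `impute_mean` and the sample data are hypothetical, and mean imputation is only one common strategy; this is not PyCaret's own implementation.

```python
import math
from statistics import mean


def impute_mean(values):
    """Replace None/NaN entries with the mean of the observed values.

    Hypothetical helper for illustration only.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if is_missing(v) else v for v in values]


# Two of the five entries are missing (one None, one NaN).
ages = [25.0, None, 31.0, float("nan"), 40.0]
imputed_ages = impute_mean(ages)
```

Both missing entries receive the mean of the three observed ages, so the column length is unchanged.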
Each feature in the dataset has an associated data type, such as numeric, categorical, or datetime. PyCaret automatically detects the data type of each feature.
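To give a feel for what data type detection involves, here is a minimal sketch that guesses a type for a column of raw strings by trying to parse every value. The function `infer_type`, the column names, and the parsing rules are assumptions made for illustration; PyCaret's actual inference logic is not shown in this page and is likely more involved.

```python
from datetime import datetime


def infer_type(column):
    """Guess 'numeric', 'datetime', or 'categorical' for a list of strings.

    Hypothetical helper: a column gets a type only if every value parses.
    """
    def all_parse(parser):
        try:
            for value in column:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numeric"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "categorical"


raw_columns = {
    "age": ["25", "31", "40"],
    "signup": ["2021-01-05", "2021-02-10", "2021-03-15"],
    "risk": ["Low", "High", "Medium"],
}
inferred = {name: infer_type(values) for name, values in raw_columns.items()}
```

Anything that is neither fully numeric nor fully ISO-formatted dates falls back to categorical in this sketch.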
Categorical features in the dataset contain label values (ordinal or nominal) rather than continuous numbers. Most machine learning algorithms cannot handle categorical data without encoding.
When categorical features contain values with an intrinsic natural order, such as Low, Medium, and High, they must be encoded differently from nominal variables, which have no intrinsic order (for example, Male or Female).
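The difference between the two encodings can be sketched in a few lines. Ordinal encoding maps each level to its rank in a caller-supplied order, while one-hot encoding creates one indicator column per level and implies no order. The helpers `ordinal_encode` and `one_hot_encode` are hypothetical names used for illustration, not PyCaret functions.

```python
def ordinal_encode(values, order):
    """Map each level to its position in the given order (lowest first)."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


def one_hot_encode(values):
    """Return one 0/1 indicator per level, with levels sorted alphabetically."""
    levels = sorted(set(values))
    return [[1 if v == level else 0 for level in levels] for v in values]


# Ordinal feature: the order Low < Medium < High is preserved as 0 < 1 < 2.
risk = ["Low", "High", "Medium", "Low"]
risk_codes = ordinal_encode(risk, order=["Low", "Medium", "High"])

# Nominal feature: columns are [Female, Male]; no level outranks another.
gender = ["Male", "Female", "Female"]
gender_onehot = one_hot_encode(gender)
```

Encoding a nominal feature as integers would suggest an order that does not exist, which is why the two cases are handled separately.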
When categorical features contain many levels (known as high-cardinality features), typical One-Hot Encoding creates a very large number of new features.
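The sketch below shows why cardinality matters: one-hot encoding adds one column per distinct level, whereas an alternative such as frequency encoding keeps a single numeric column. Frequency encoding is used here only as an example of a compact alternative; this page does not say which method PyCaret applies, and `frequency_encode` is a hypothetical helper.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each level with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


cities = ["Paris", "Lima", "Oslo", "Paris", "Cairo", "Lima", "Paris", "Quito"]

# One-hot encoding would need one new column per distinct city.
one_hot_width = len(set(cities))

# Frequency encoding keeps one column regardless of the number of levels.
city_freq = frequency_encode(cities)
```

With thousands of distinct levels the one-hot width grows accordingly, while the frequency-encoded column stays a single feature.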
When the training dataset has an unequal distribution of the target class, it can be fixed using the fix_imbalance parameter in the setup.
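To make the idea of rebalancing concrete, here is a minimal sketch of random oversampling, which duplicates minority-class rows until every class matches the largest one. It is a deliberately simple stand-in chosen for illustration, not the resampling method PyCaret uses; `random_oversample` and the toy data are assumptions.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Four samples of class 0 and a single sample of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]
balanced_rows, balanced_labels = random_oversample(rows, labels)
balanced_counts = Counter(balanced_labels)
```

Resampling of this kind should be applied to the training data only, so that evaluation still reflects the original class distribution.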
The remove_outliers parameter in the setup allows you to identify and remove outliers from the dataset before training the model.
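As an illustration of outlier removal in general, the sketch below drops values that fall outside 1.5 times the interquartile range. The IQR rule is a simple univariate technique swapped in for clarity; it is not claimed to be the detection method PyCaret uses, and `remove_outliers_iqr` is a hypothetical helper.

```python
from statistics import quantiles


def remove_outliers_iqr(values, k=1.5):
    """Keep only values within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# One reading is far from the rest of the sample.
readings = [10, 12, 11, 13, 12, 11, 95]
cleaned = remove_outliers_iqr(readings)
```

As with rebalancing, outliers are typically removed from the training data only, leaving held-out data untouched.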