Data Preprocessing

Data preprocessing and transformations available in PyCaret

Missing Values

Datasets may have missing values or empty records for various reasons, often encoded as blanks or NaN. Most machine learning algorithms cannot handle missing values, so they must be imputed before training.
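
A minimal sketch of the imputation controls in setup, assuming PyCaret 3.x and its bundled 'diabetes' dataset:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# Impute numeric columns with the mean and categorical columns with the mode
s = setup(data, target='Class variable',
          numeric_imputation='mean',
          categorical_imputation='mode')
```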

Data Types

Each feature in the dataset has an associated data type such as numeric, categorical, or datetime. PyCaret automatically detects the data type of each feature, and the inferred types can be overridden in the setup when the inference is wrong.
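
A sketch of overriding the inferred types; the file and column names here (my_data.csv, col_a, col_b, signup_date, row_id, label) are placeholders, not from a real dataset:

```python
import pandas as pd
from pycaret.classification import setup

df = pd.read_csv('my_data.csv')  # hypothetical file

# Force specific columns to a data type instead of relying on inference
s = setup(df, target='label',
          numeric_features=['col_a'],        # treat as numeric
          categorical_features=['col_b'],    # treat as categorical
          date_features=['signup_date'],     # parse as datetime
          ignore_features=['row_id'])        # exclude from modeling
```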

One-Hot Encoding

Categorical features in the dataset contain label values (ordinal or nominal) rather than continuous numbers. Most machine learning algorithms cannot handle categorical data without it first being encoded into numeric values.
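
No extra flag is needed for this: categorical columns are one-hot encoded automatically as part of setup. A sketch assuming the bundled 'juice' dataset:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('juice')

# Categorical columns are one-hot encoded automatically during setup
s = setup(data, target='Purchase')
```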

Ordinal Encoding

When the categorical features in the dataset contain variables with an intrinsic natural order such as Low, Medium, and High, these must be encoded differently than nominal variables (where there is no intrinsic order, e.g. Male or Female).
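
A sketch using the bundled 'employee' dataset, whose salary column has a natural low/medium/high order:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('employee')

# Encode 'salary' so that the order low < medium < high is preserved
s = setup(data, target='left',
          ordinal_features={'salary': ['low', 'medium', 'high']})
```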

Cardinal Encoding

When categorical features in the dataset contain variables with many levels (also known as high-cardinality features), typical one-hot encoding leads to the creation of a very large number of new features.
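
In PyCaret 3.x, the max_encoding_ohe parameter caps how many levels a column may have before it falls through to a separate cardinal encoder; the exact fallback encoder may vary by version. A hedged sketch with the bundled 'income' dataset:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# Columns with more than 25 levels (e.g. 'native-country') skip
# one-hot encoding and use the fallback cardinal encoder instead
s = setup(data, target='income >50K', max_encoding_ohe=25)
```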

Target Imbalance

When the training dataset has an unequal distribution of the target classes, the imbalance can be fixed using the fix_imbalance parameter in the setup.
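
A sketch assuming the bundled 'credit' dataset, whose target is imbalanced; by default the resampling method is SMOTE:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('credit')

# Resample the training data (SMOTE by default) to balance the target
s = setup(data, target='default', fix_imbalance=True)
```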

Remove Outliers

The remove_outliers parameter in PyCaret's setup allows you to identify and remove outliers from the dataset before training the model.
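
A sketch with the bundled 'insurance' dataset; outliers_threshold controls the fraction of rows treated as outliers:

```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('insurance')

# Drop roughly the most extreme 5% of training rows before model training
s = setup(data, target='charges',
          remove_outliers=True, outliers_threshold=0.05)
```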

Normalize

Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to rescale the values of numeric columns in the dataset without distorting the differences in the ranges of values.
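
A sketch of the normalization controls, assuming the bundled 'juice' dataset:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('juice')

# Rescale numeric features; 'zscore' is the default, and 'minmax',
# 'maxabs' and 'robust' are the other built-in options
s = setup(data, target='Purchase',
          normalize=True, normalize_method='zscore')
```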

Feature Transform

While normalization rescales the data within new limits to reduce the impact of magnitude in the variance, feature transformation is a more radical technique: it changes the shape of the distribution itself.
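
A sketch of applying a power transformation to make feature distributions more Gaussian-like:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# 'yeo-johnson' is the default method; 'quantile' is the alternative
s = setup(data, target='Class variable',
          transformation=True, transformation_method='yeo-johnson')
```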

Target Transform

Target transformation is similar to feature transformation, except that it changes the shape of the distribution of the target variable instead of the features.
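
Target transformation is only available in the regression module. A sketch with the bundled 'diamond' dataset, whose target is skewed:

```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('diamond')

# Transform the skewed target; predictions are inverse-transformed back
s = setup(data, target='Price', transform_target=True)
```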

Feature Interaction

It is often seen in machine learning experiments that two features combined through an arithmetic operation become more significant in explaining variance in the data than the same two features separately.
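
The feature_interaction and feature_ratio flags below are from the PyCaret 2.x setup and were dropped in 3.x, where interactions must be built manually or via a custom pipeline. A hedged 2.x-style sketch:

```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('insurance')

# PyCaret 2.x only: create a*b and a/b combinations of numeric features
s = setup(data, target='charges',
          feature_interaction=True, feature_ratio=True)
```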

Polynomial Features

In machine learning experiments the relationship between the dependent and independent variables is often assumed to be linear; however, this is not always the case. Sometimes the relationship is more complex, and adding polynomial terms of the existing features can help capture it.
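
A sketch of generating polynomial terms during setup:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('juice')

# Add squared terms and pairwise products of the numeric features
s = setup(data, target='Purchase',
          polynomial_features=True, polynomial_degree=2)
```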

Group Features

When a dataset contains features that are related to each other in some way, for example features recorded at fixed time intervals, new statistical features such as the mean, median, variance, and standard deviation can be created for such a group of features.
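
A hedged sketch with the bundled 'credit' dataset, whose six BILL_AMT columns form a natural group; note that the exact shape expected for the group_features argument differs between PyCaret 2.x and 3.x:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('credit')

# Derive group statistics (mean, median, etc.) across related columns
s = setup(data, target='default',
          group_features=[['BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3',
                           'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']])
```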

Bin Numeric Features

Feature binning is a method of turning continuous variables into categorical values using a pre-defined number of bins. It is effective when a continuous feature has too many unique values or a few extreme values outside the expected range.
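
A sketch with the bundled 'income' dataset:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# Discretize the continuous 'age' column into categorical bins
s = setup(data, target='income >50K', bin_numeric_features=['age'])
```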

Combine Rare Levels

Sometimes a dataset can have one or more categorical features with a very high number of levels (i.e. high-cardinality features). If such features are encoded into numeric values, the resulting matrix is sparse. Combining the rare levels into a single shared level mitigates this.
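
In PyCaret 3.x the combine_rare_levels flag from 2.x was replaced by the rare_to_value parameter. A hedged 3.x-style sketch:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# Merge categorical levels that appear in fewer than 5% of rows
# into a single shared level
s = setup(data, target='income >50K', rare_to_value=0.05)
```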

Create Clusters

Creating clusters from the existing features of the data is an unsupervised ML technique to engineer new features.
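
The create_clusters flag below is from the PyCaret 2.x setup and was removed in 3.x. A hedged 2.x-style sketch:

```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('insurance')

# PyCaret 2.x only: append a cluster label derived from all features
s = setup(data, target='charges', create_clusters=True)
```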

Feature Selection

Feature selection is a process used to select the features in the dataset that contribute the most to predicting the target variable. Working with selected features instead of all the features reduces the risk of overfitting, improves accuracy, and decreases training time.
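
A sketch assuming PyCaret 3.x, where n_features_to_select controls how much of the feature space is kept:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# Keep the half of the features ranked most important by the
# selection estimator; method and estimator are configurable
s = setup(data, target='Class variable',
          feature_selection=True, n_features_to_select=0.5)
```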

Remove Multicollinearity

Multicollinearity (also called collinearity) is a phenomenon in which one feature variable in the dataset is highly linearly correlated with another feature variable in the same dataset.
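
A sketch with the bundled 'boston' dataset, which contains strongly correlated features:

```python
from pycaret.datasets import get_data
from pycaret.regression import setup

data = get_data('boston')

# When two features correlate above the threshold, drop the one
# that is less correlated with the target
s = setup(data, target='medv',
          remove_multicollinearity=True,
          multicollinearity_threshold=0.9)
```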

Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of the data. It does so by compressing the feature space.
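
A sketch of enabling PCA in setup:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# Project the feature space onto 10 linear principal components
s = setup(data, target='income >50K',
          pca=True, pca_method='linear', pca_components=10)
```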

Ignore Low Variance

Sometimes a dataset may have a categorical feature with multiple levels where the distribution of the levels is skewed and one level dominates the others. Such near-constant features add little information to the model and can be ignored.
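
In PyCaret 2.x this was the ignore_low_variance flag; in 3.x it became the low_variance_threshold parameter. A hedged 3.x-style sketch:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('income')

# Drop features whose variance is at or below the threshold
# (0 removes only constant columns)
s = setup(data, target='income >50K', low_variance_threshold=0)
```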

Required Parameters

There are only two required parameters in the setup function: the data and the name of the target variable.
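
A minimal sketch; everything beyond data and target has a sensible default:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# data and target are the only required arguments
s = setup(data, target='Class variable')
```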

Experiment Logging

PyCaret uses MLflow for experiment tracking. A parameter in the setup can be set to automatically track all the metrics, hyperparameters, and other model artifacts.
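
A sketch of enabling MLflow logging; the experiment name 'diabetes1' is an arbitrary label:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models

data = get_data('diabetes')

# Log metrics, hyperparameters and artifacts to MLflow
s = setup(data, target='Class variable',
          log_experiment=True, experiment_name='diabetes1')
best = compare_models()
```

The logged runs can then be browsed by launching the MLflow tracking UI with mlflow ui from the same working directory.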

Model Selection

Parameters in the setup can be used to configure the model selection process. These are not related to data preprocessing but can influence your model selection.
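 
A sketch of the split and cross-validation controls:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# Control the train/test split and the cross-validation scheme
s = setup(data, target='Class variable',
          train_size=0.7,
          fold_strategy='stratifiedkfold',
          fold=10)
```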

Other Miscellaneous

Other miscellaneous parameters in the setup are used to control experiment settings such as using a GPU for training or setting the verbosity of the experiment.
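
A sketch of a few of these controls; note that use_gpu=True only takes effect when a GPU-enabled estimator library is installed:

```python
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

# session_id fixes the random seed for reproducibility;
# verbose=False silences the setup grid output
s = setup(data, target='Class variable',
          session_id=123, use_gpu=True, verbose=False)
```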
