Feature Selection

Feature selection is the process of choosing the features in a dataset that contribute the most to predicting the target variable. Working with a selected subset of features instead of all of them reduces the risk of overfitting, improves accuracy, and decreases training time. In PyCaret, this can be achieved using the feature_selection parameter.

PARAMETERS

  • feature_selection: bool, default = False When set to True, a subset of features is selected based on a feature importance score determined by feature_selection_estimator.

  • feature_selection_method: str, default = 'classic'

    Algorithm for feature selection. Choose from:

    • 'univariate': Uses sklearn's SelectKBest.

    • 'classic': Uses sklearn's SelectFromModel.

    • 'sequential': Uses sklearn's SequentialFeatureSelector.

  • feature_selection_estimator: str or sklearn estimator, default = 'lightgbm'

    Classifier used to determine the feature importances. The estimator should have a feature_importances_ or coef_ attribute after fitting. If None, it uses LGBMClassifier. This parameter is ignored when feature_selection_method='univariate'.

  • n_features_to_select: int or float, default = 0.2

    The maximum number of features to select with feature_selection. If less than 1, it is treated as the fraction of the starting features. Note that this parameter does not take features in ignore_features or keep_features into account when counting.

Example

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', feature_selection = True)

Before: dataframe view before feature selection

After: dataframe view after feature selection
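As a variation, the selection strategy and the number of retained features can also be controlled explicitly. A minimal sketch, assuming the same diabetes dataset loaded above and using only the parameters documented in this section:

# init setup with univariate selection, keeping 5 features
from pycaret.classification import *
clf2 = setup(data = diabetes, target = 'Class variable', feature_selection = True, feature_selection_method = 'univariate', n_features_to_select = 5)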

Remove Multicollinearity

Multicollinearity (also called collinearity) is a phenomenon in which one feature in a dataset is highly linearly correlated with another feature in the same dataset. Multicollinearity inflates the variance of the coefficients, making them unstable and noisy for linear models. One way to deal with multicollinearity is to drop one of the two highly correlated features. This can be achieved in PyCaret using the remove_multicollinearity parameter.

PARAMETERS

  • remove_multicollinearity: bool, default = False When set to True, features with inter-correlations higher than the defined threshold are removed. For each group of correlated features, all are removed except the one with the highest correlation with the target (y).

  • multicollinearity_threshold: float, default = 0.9 Minimum absolute Pearson correlation to identify correlated features. The default value removes equal columns. Ignored when remove_multicollinearity is not True.

Example

# load dataset
from pycaret.datasets import get_data
concrete = get_data('concrete')

# init setup
from pycaret.regression import *
reg1 = setup(data = concrete, target = 'strength', remove_multicollinearity = True, multicollinearity_threshold = 0.3)

Before: dataframe view before remove_multicollinearity

After: dataframe view after remove_multicollinearity
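To see which feature pairs a given threshold would flag, the absolute Pearson correlation criterion can be reproduced with plain pandas. This is only a conceptual sketch of the check, not PyCaret's internal implementation, assuming the concrete dataframe loaded above:

# conceptual check: absolute pairwise correlations between features
import numpy as np
corr = concrete.drop(columns = 'strength').corr().abs()
# keep only the upper triangle so each pair is counted once
upper = corr.where(np.triu(np.ones(corr.shape, dtype = bool), k = 1))
pairs = upper.stack()
print(pairs[pairs > 0.3])  # feature pairs above the example threshold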

Principal Component Analysis

Principal Component Analysis (PCA) is an unsupervised technique used in machine learning to reduce the dimensionality of data. It compresses the feature space by identifying a subspace that captures most of the information in the full feature matrix, projecting the original features into a lower-dimensional space.

PARAMETERS

  • pca: bool, default = False When set to True, dimensionality reduction is applied to project the data into a lower dimensional space using the method defined in the pca_method parameter.

  • pca_method: str, default = 'linear' Method with which to apply PCA. Possible values are:

    • 'linear': Uses Singular Value Decomposition.

    • 'kernel': Dimensionality reduction through the use of an RBF kernel.

    • 'incremental': Similar to 'linear', but more efficient for large datasets.

  • pca_components: int, float, str or None, default = None Number of components to keep. This parameter is ignored when pca=False.

    • If None: All components are kept.

    • If int: Absolute number of components. Must be strictly less than the original number of features in the dataset.

    • If float: Keep enough components to explain a fraction of the variance greater than the value specified. The value should lie between 0 and 1 (only for pca_method='linear').

    • If 'mle': Minka's MLE is used to guess the dimension (only for pca_method='linear').

Example

# load dataset
from pycaret.datasets import get_data
income = get_data('income')

# init setup
from pycaret.classification import *
clf1 = setup(data = income, target = 'income >50K', pca = True, pca_components = 10)

Before: dataframe view before PCA

After: dataframe view after PCA
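Since a float value for pca_components is interpreted as a variance-retention target, the same setup can request a fraction of explained variance instead of a fixed component count. A minimal sketch, assuming the same income dataset as above:

# keep enough components to explain at least 85% of the variance
clf2 = setup(data = income, target = 'income >50K', pca = True, pca_method = 'linear', pca_components = 0.85)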

Ignore Low Variance

Sometimes a dataset may have a categorical feature with multiple levels, where the distribution of those levels is skewed and one level dominates the others. Such a feature provides little variation in information and may not add much value to an ML model, so it can be ignored for modeling. This can be achieved in PyCaret using the low_variance_threshold parameter.

PARAMETERS

  • low_variance_threshold: float or None, default = None

    Remove features with a training-set variance lower than the provided threshold. If 0, keep all features with non-zero variance, i.e. remove the features that have the same value in all samples. If None, skip this transformation step.

Example

# load dataset
from pycaret.datasets import get_data
mice = get_data('mice')

# filter dataset
mice = mice[mice['Genotype'] == 'Control']

# init setup
from pycaret.classification import *
clf1 = setup(data = mice, target = 'class', low_variance_threshold = 0.1)

Before: dataframe view before low_variance_threshold

After: dataframe view after low_variance_threshold
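The threshold behaves like scikit-learn's VarianceThreshold. As a standalone illustration of the same criterion (not PyCaret's exact pipeline step), the surviving columns can be inspected directly; the mean imputation below is only there because VarianceThreshold does not accept missing values:

# standalone illustration with scikit-learn's VarianceThreshold
from sklearn.feature_selection import VarianceThreshold
X = mice.drop(columns = 'class').select_dtypes('number')
selector = VarianceThreshold(threshold = 0.1)
selector.fit(X.fillna(X.mean()))
print(X.columns[selector.get_support()])  # columns kept at this threshold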
