🚀 Quickstart
Quick start guide to PyCaret
Help us improve the documentation! If you find a broken link or a typo, or would like to contribute to this documentation, please submit a pull request on the pycaret-docs repo.

Introduction

Select your use case:

Classification

PyCaret’s Classification Module is a supervised machine learning module that is used for classifying elements into groups. The goal is to predict categorical class labels that are discrete and unordered. Common use cases include predicting customer default (yes or no), predicting customer churn (customer will leave or stay), and detecting disease (positive or negative). This module can be used for binary or multiclass problems. It provides several pre-processing features that prepare the data for modeling through the setup function. It has over 18 ready-to-use algorithms and several plots to analyze the performance of trained models.

Setup

This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function. It takes two mandatory parameters: data and target. All the other parameters are optional.
from pycaret.datasets import get_data
data = get_data('diabetes')

from pycaret.classification import *
s = setup(data, target = 'Class variable')
When the setup is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To handle this, PyCaret displays a prompt asking you to confirm the data types once you execute the setup. You can press enter if all data types are correct or type quit to exit the setup.
Ensuring that the data types are correct is really important in PyCaret, as it automatically performs multiple type-specific preprocessing tasks that are imperative for machine learning models.
Alternatively, you can also use numeric_features and categorical_features parameters in the setup to pre-define the data types.
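For example, a minimal sketch of pre-defining the types at setup time (num_col_1 and cat_col_1 are placeholder column names used only for illustration, not columns of the diabetes dataset):

s = setup(data, target = 'Class variable',
          numeric_features = ['num_col_1'],      # force numeric treatment
          categorical_features = ['cat_col_1'])  # force categorical treatment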

Compare Models

This function trains and evaluates the performance of all the estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions (see the example below).
best = compare_models()

print(best)
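As an illustration of custom metrics, a scikit-learn metric can be registered before running compare_models so that it appears in the scoring grid; a minimal sketch, assuming add_metric takes an id, a display name, and a score function (the id 'bacc' and name 'Bal. Accuracy' are arbitrary choices):

from sklearn.metrics import balanced_accuracy_score

# register balanced accuracy as an additional CV metric
# (assumes the (id, name, score_func) signature of add_metric)
add_metric('bacc', 'Bal. Accuracy', balanced_accuracy_score)

# view all metrics currently evaluated during cross-validation
get_metrics()

# remove the custom metric again by its id
remove_metric('bacc')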

Analyze Model

This function analyzes the performance of a trained model on the test set. It may require re-training the model in certain cases.
evaluate_model(best)
evaluate_model can only be used in a Notebook since it uses ipywidgets. You can also use the plot_model function to generate plots individually.
plot_model(best, plot = 'auc')

plot_model(best, plot = 'confusion_matrix')

Predictions

This function predicts the Label and the Score (probability of predicted class) columns using a trained model. When data is None, it predicts label and score on the test set (created during the setup function).
predict_model(best)
The evaluation metrics are calculated on the test set. The second output is the pd.DataFrame with predictions on the test set (see the last two columns). To generate labels on an unseen (new) dataset, simply pass the dataset to the predict_model function.
predictions = predict_model(best, data=data)
predictions.head()
Score means the probability of the predicted class (NOT the positive class). If Label is 0 and Score is 0.90, it means 90% probability of class 0. If you want to see the probability of both classes, simply pass raw_score=True in the predict_model function.
predictions = predict_model(best, data=data, raw_score=True)
predictions.head()

Save the model

save_model(best, 'my_best_pipeline')

To load the model back in the environment:

loaded_model = load_model('my_best_pipeline')
print(loaded_model)

Regression

PyCaret’s Regression Module is a supervised machine learning module that is used for estimating the relationships between a dependent variable (often called the ‘outcome variable’, or ‘target’) and one or more independent variables (often called ‘features’, ‘predictors’, or ‘covariates’). The objective of regression is to predict continuous values such as predicting sales amount, predicting quantity, predicting temperature, etc. It provides several pre-processing features that prepare the data for modeling through the setup function. It has over 25 ready-to-use algorithms and several plots to analyze the performance of trained models.

Setup

This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function. It takes two mandatory parameters: data and target. All the other parameters are optional.
from pycaret.datasets import get_data
data = get_data('insurance')

from pycaret.regression import *
s = setup(data, target = 'charges')
When the setup is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To handle this, PyCaret displays a prompt asking you to confirm the data types once you execute the setup. You can press enter if all data types are correct or type quit to exit the setup.
Ensuring that the data types are correct is really important in PyCaret, as it automatically performs multiple type-specific preprocessing tasks that are imperative for machine learning models.
Alternatively, you can also use numeric_features and categorical_features parameters in the setup to pre-define the data types.

Compare Models

This function trains and evaluates the performance of all the estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
best = compare_models()

print(best)

Analyze Model

This function analyzes the performance of a trained model on the test set. It may require re-training the model in certain cases.
evaluate_model(best)
evaluate_model can only be used in a Notebook since it uses ipywidgets. You can also use the plot_model function to generate plots individually.
plot_model(best, plot = 'residuals')

plot_model(best, plot = 'feature')

Predictions

This function predicts Label using the trained model. When data is None, it predicts Label on the test set (created during the setup function).
predict_model(best)
The evaluation metrics are calculated on the test set. The second output is the pd.DataFrame with predictions on the test set (see the last two columns). To generate labels on an unseen (new) dataset, simply pass the dataset to the predict_model function.
predictions = predict_model(best, data=data)
predictions.head()

Save the model

save_model(best, 'my_best_pipeline')

To load the model back in the environment:

loaded_model = load_model('my_best_pipeline')
print(loaded_model)

Clustering

PyCaret’s Clustering Module is an unsupervised machine learning module that performs the task of grouping a set of objects in such a way that objects in the same group (also known as a cluster) are more similar to each other than to those in other groups. It provides several pre-processing features that prepare the data for modeling through the setup function. It has over 10 ready-to-use algorithms and several plots to analyze the performance of trained models.

Setup

This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function. It takes one mandatory parameter: data. All the other parameters are optional.
from pycaret.datasets import get_data
data = get_data('jewellery')

from pycaret.clustering import *
s = setup(data, normalize = True)
When the setup is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To handle this, PyCaret displays a prompt asking you to confirm the data types once you execute the setup. You can press enter if all data types are correct or type quit to exit the setup.
Ensuring that the data types are correct is really important in PyCaret, as it automatically performs multiple type-specific preprocessing tasks that are imperative for machine learning models.
Alternatively, you can also use numeric_features and categorical_features parameters in the setup to pre-define the data types.

Create Model

This function trains and evaluates the performance of a given model. Metrics evaluated can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions. All the available models can be accessed using the models function.
kmeans = create_model('kmeans')

print(kmeans)
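The list of available estimators, and the number of clusters to form, can be controlled as well; a minimal sketch (num_clusters is assumed to be the relevant create_model parameter here):

# list all available clustering estimators
models()

# train k-means with an explicit number of clusters
# (num_clusters is an assumed parameter name; adjust if it differs)
kmeans_6 = create_model('kmeans', num_clusters = 6)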

Analyze Model

This function analyzes the performance of a trained model.
evaluate_model(kmeans)
evaluate_model can only be used in a Notebook since it uses ipywidgets. You can also use the plot_model function to generate plots individually.
plot_model(kmeans, plot = 'elbow')

plot_model(kmeans, plot = 'silhouette')

Assign Model

This function assigns cluster labels to the training data, given a trained model.
result = assign_model(kmeans)
result.head()

Predictions

This function generates cluster labels using a trained model on the new/unseen dataset.
predictions = predict_model(kmeans, data = data)
predictions.head()

Save the model

save_model(kmeans, 'kmeans_pipeline')

To load the model back in the environment:

loaded_model = load_model('kmeans_pipeline')
print(loaded_model)

Anomaly Detection

PyCaret’s Anomaly Detection Module is an unsupervised machine learning module that is used for identifying rare items, events, or observations that raise suspicions by differing significantly from the majority of the data. Typically, the anomalous items translate to some kind of problem, such as bank fraud, a structural defect, medical problems, or errors. It provides several pre-processing features that prepare the data for modeling through the setup function. It has over 10 ready-to-use algorithms and several plots to analyze the performance of trained models.

Setup

This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function. It takes only one mandatory parameter: data. All the other parameters are optional.
from pycaret.datasets import get_data
data = get_data('anomaly')

from pycaret.anomaly import *
s = setup(data)
When the setup is executed, PyCaret's inference algorithm will automatically infer the data types for all features based on certain properties. The data types are usually inferred correctly, but this is not always the case. To handle this, PyCaret displays a prompt asking you to confirm the data types once you execute the setup. You can press enter if all data types are correct or type quit to exit the setup.
Ensuring that the data types are correct is really important in PyCaret, as it automatically performs multiple type-specific preprocessing tasks that are imperative for machine learning models.
Alternatively, you can also use numeric_features and categorical_features parameters in the setup to pre-define the data types.

Create Model

This function trains an unsupervised anomaly detection model. All the available models can be accessed using the models function.
iforest = create_model('iforest')
print(iforest)

models()

Analyze Model

plot_model(iforest, plot = 'tsne')

plot_model(iforest, plot = 'umap')

Assign Model

This function assigns anomaly labels to the dataset for a given model (1 = outlier, 0 = inlier).
result = assign_model(iforest)
result.head()
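The labeled frame can then be filtered to inspect only the flagged rows; a minimal sketch (the column names Anomaly and Anomaly_Score are assumptions about the assign_model output and may differ in your version):

# keep only rows flagged as outliers and rank them by score
# ('Anomaly' and 'Anomaly_Score' are assumed column names)
outliers = result[result['Anomaly'] == 1]
outliers.sort_values('Anomaly_Score', ascending = False).head()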

Predictions

This function generates anomaly labels using a trained model on the new/unseen dataset.
predictions = predict_model(iforest, data = data)
predictions.head()

Save the model

save_model(iforest, 'iforest_pipeline')
To load the model back in the environment:
loaded_model = load_model('iforest_pipeline')
print(loaded_model)

Natural Language Processing

PyCaret’s Natural Language Processing module is an unsupervised machine learning module that is used for training topic models on text data. Several techniques are used to analyze text data, and topic modeling is one of them. A topic model is a type of statistical model for discovering abstract topics in a collection of documents.

Setup

This function initializes the training environment and creates the text transformation pipeline. The setup function must be called before executing any other function.
# load dataset
from pycaret.datasets import get_data
data = get_data('kiva')
# print first document
print(data['en'][0])
# init setup
from pycaret.nlp import *
s = setup(data, target = 'en')

Create Model

This function trains an unsupervised topic model. All the available models can be accessed using the models function.
models()

To train a model:

lda = create_model('lda')
print(lda)
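The number of topics can also be fixed explicitly when training; a minimal sketch (num_topics is assumed to be the relevant create_model parameter in the NLP module):

# train an LDA model with a fixed number of topics
# (num_topics is an assumed parameter name; adjust if it differs)
lda_6 = create_model('lda', num_topics = 6)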

Analyze Model

plot_model(lda, plot = 'frequency')

plot_model(lda, plot = 'sentiment')
Alternatively, you can also use the evaluate_model function.
evaluate_model(lda)

Assign Model

This function assigns topic labels to the dataset for a given model.
lda_results = assign_model(lda)
lda_results.head()

Save the model

save_model(lda, 'my_lda_model')
To load the model back in the environment:
loaded_model = load_model('my_lda_model')

Association Rules Mining

PyCaret's Association Rules module is an unsupervised machine learning module that is used for discovering interesting relations between variables in a dataset. This module automatically transforms any transactional database into a shape that is acceptable for the apriori algorithm. Apriori is an algorithm for frequent itemset mining and association rule learning over transactional databases.

Setup

from pycaret.datasets import get_data
data = get_data('france')
from pycaret.arules import *
arules = setup(data, transaction_id = 'InvoiceNo', item_id = 'Description')

Create Model

model = create_model(metric = 'confidence', threshold = 0.3)
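The returned object is a pandas DataFrame of mined rules, so it can be inspected directly; a minimal sketch (the column names lift, confidence, and support are assumptions based on typical apriori output):

# show the strongest rules first
# ('lift' is an assumed column name in the rules DataFrame)
model.sort_values(by = 'lift', ascending = False).head()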

Analyze Model

plot_model(model, plot = '3d')

Time Series (beta)

NOTE: PyCaret's time series forecasting module is in beta. It is recommended to create a separate conda environment for it. You can install it with pip install pycaret-ts-alpha.
PyCaret's new time series module is now available in beta. Staying true to the simplicity of PyCaret, it is consistent with our existing API and fully loaded with functionalities: statistical testing, model training and selection (30+ algorithms), model analysis, automated hyperparameter tuning, experiment logging, deployment on the cloud, and more. All of this with only a few lines of code.

Setup

This function initializes the training environment and creates the transformation pipeline. The setup function must be called before executing any other function.
# loading dataset
from pycaret.datasets import get_data
data = get_data('airline')
from pycaret.time_series import *
s = setup(data, fh = 3, fold = 5, session_id = 123)

Compare Models

This function trains and evaluates the performance of all the estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
best = compare_models()
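For a quicker run, the comparison can be restricted to a subset of estimators; a minimal sketch (the include parameter and the model ids 'ets', 'arima', and 'naive' are assumptions carried over from the other modules' APIs):

# compare only a few classical forecasters
# ('include' and these model ids are assumptions; adjust to your version)
best_subset = compare_models(include = ['ets', 'arima', 'naive'])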

Analyze Model

plot_model(best, plot = 'forecast', data_kwargs = {'fh' : 24})

plot_model(best, plot = 'diagnostics')

plot_model(best, plot = 'insample')

Predictions

# finalize model (refit on the complete dataset)
final_best = finalize_model(best)
predict_model(final_best, fh = 24)

Save the model

save_model(final_best, 'my_final_best_model')

To load the model back in the environment:

loaded_model = load_model('my_final_best_model')
print(loaded_model)