Train

Training functions in PyCaret


compare_models

This function trains and evaluates the performance of all estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions (a brief sketch of these metric helpers follows the example below).

NOTE: This function is only available in the Classification and Regression modules.

Example

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models()
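
The metric helpers mentioned above can be used once setup has run. The following is a hedged sketch, not part of the original example; the exact add_metric arguments (in particular target = 'pred_proba') may differ slightly between PyCaret versions.

# list the metrics evaluated during cross-validation
get_metrics()

# add a custom metric (log loss from sklearn), then remove it again
from sklearn.metrics import log_loss
add_metric('logloss', 'Log Loss', log_loss, target = 'pred_proba', greater_is_better = False)
remove_metric('logloss')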

compare_models returns only the top-performing model, selected based on the criterion defined in the sort parameter: Accuracy for classification experiments and R2 for regression. You can change the sort order by passing the name of the metric you want to use for model selection.

Change the sort order

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models(sort = 'F1')

Notice that the sort order of the scoring grid has changed and that the best model returned by this function is now selected based on F1.

print(best)
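
The comparison grid displayed by compare_models can also be captured as a pandas.DataFrame with the pull function (described further down this page). A quick sketch:

# grab the comparison grid that compare_models just displayed
comparison_grid = pull()
print(comparison_grid.head())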

Compare only a few models

If you don't want to race the entire model library, you can compare only a few models of your choice by using the include parameter.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models(include = ['lr', 'dt', 'lightgbm'])

Alternatively, you can use the exclude parameter. This will compare all models except the ones passed in exclude.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models(exclude = ['lr', 'dt', 'lightgbm'])

Return more than one model

By default, compare_models returns only the top-performing model, but you can also get the top N models instead of just one.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models(n_select = 3)

Notice that there is no change in the displayed results; however, if you check the variable best, it now contains a list of the top 3 models instead of a single model as seen previously.

type(best)
# >>> list

print(best)
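
As an illustrative sketch (not part of the original example), each element of the returned list is a fitted model that can be used directly, for instance to score the hold-out set with predict_model:

# score the hold-out set with each of the top 3 models
for model in best:
    holdout_preds = predict_model(model)   # displays the hold-out metrics and returns predictions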

Set the budget time

If you are running short on time, you can set a fixed time budget (in minutes) for this function by using the budget_time parameter.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models(budget_time = 0.5)

Set the probability threshold

When performing binary classification, you can change the probability threshold or cut-off value for hard labels. By default, all classifiers use 0.5 as the threshold.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models(probability_threshold = 0.25)

Notice that all metrics except AUC are now different. AUC doesn't change because it is not based on the hard labels; everything else is, and the hard labels are now obtained with probability_threshold=0.25.

NOTE: This parameter is only available in the Classification module of PyCaret.

Disable cross-validation

If you don't want to evaluate models using cross-validation and would rather just train them and see the metrics on the test/hold-out set, you can set cross_validation=False.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models
best = compare_models(cross_validation=False)

The output looks similar, but the metrics are now different: instead of average cross-validated scores, they are now computed on the test/hold-out set.

Distributed training on a cluster

To scale on large datasets, you can run the compare_models function on a cluster in distributed mode using the parallel parameter. It leverages the Fugue abstraction layer to run compare_models on Spark or Dask clusters.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', n_jobs = 1)

# create pyspark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# import parallel back-end
from pycaret.parallel import FugueBackend

# compare models
best = compare_models(parallel = FugueBackend(spark))

Note that we need to set n_jobs = 1 in the setup for testing with local Spark because some models will already try to use all available cores, and running such models in parallel can cause deadlocks from resource contention.

For Dask, we can pass the string "dask" to FugueBackend and it will pull the available Dask client.

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', n_jobs = 1)

# import parallel back-end
from pycaret.parallel import FugueBackend

# compare models
best = compare_models(parallel = FugueBackend("dask"))

For the complete example and other features related to distributed execution, check this example. It also shows how to get the leaderboard in real time. In a distributed setting, this involves setting up an RPCClient, but Fugue simplifies that.
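
If no Dask client is running yet, you can start a local one before calling compare_models. This is a minimal sketch using the standard dask.distributed API, not part of the original example, and it assumes dask[distributed] is installed:

# start a local Dask cluster/client for FugueBackend("dask") to pick up
from dask.distributed import Client
client = Client()

best = compare_models(parallel = FugueBackend("dask"))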

create_model

This function trains and evaluates the performance of a given estimator using cross-validation. The output of this function is a scoring grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions. All the available models can be accessed using the models function.

Example

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# train logistic regression
lr = create_model('lr')

This function displays the performance metrics by fold, along with the average and standard deviation for each metric, and returns the trained model. By default it uses 10 folds, which can be changed globally in the setup function or locally within create_model.

Changing the fold param

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# train logistic regression
lr = create_model('lr', fold = 5)

The model returned is the same as above; however, the performance evaluation is now done using 5-fold cross-validation.

Model library

To check the list of available models in any module, you can use the models function.

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# check available models
models()
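
models() returns a pandas DataFrame indexed by the model ID strings accepted by create_model and by the include/exclude parameters of compare_models, so you can inspect it like any other DataFrame. A short sketch:

# list the IDs accepted by create_model and compare_models
all_models = models()
print(all_models.index.tolist())

# inspect a single entry, e.g. LightGBM
print(all_models.loc['lightgbm'])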

Models with custom param

When you run create_model('dt'), it trains a Decision Tree with all default hyperparameter settings. If you would like to change them, simply pass the hyperparameters as keyword arguments to create_model.

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# train decision tree
dt = create_model('dt', max_depth = 5)
# see models params
print(dt)

Access the scoring grid

The performance metrics/scoring grid you see after create_model is only displayed and is not returned. If you want to access that grid as a pandas.DataFrame, use the pull command after create_model.

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# train decision tree
dt = create_model('dt', max_depth = 5)

# access the scoring grid
dt_results = pull()
print(dt_results)
# check type
type(dt_results)
# >>> pandas.core.frame.DataFrame

# select only Mean
dt_results.loc[['Mean']]
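
Since the pulled grid is a regular DataFrame, plain pandas is enough to slice or persist it. A small illustrative sketch (the column names assume the default classification metrics):

# read a single cell, e.g. the mean AUC across folds
print(dt_results.loc['Mean', 'AUC'])

# persist the grid for later comparison
dt_results.to_csv('dt_cv_results.csv')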

Disable cross-validation

If you don't want to evaluate models using cross-validation and rather just train them and see the metrics on the test/hold-out set you can set the cross_validation=False.

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# train model without cv
lr = create_model('lr', cross_validation = False)

These are the metrics on the test/hold-out set, which is why you only see one row as opposed to the 12 rows (10 folds plus Mean and Std) in the original output. When you disable cross_validation, the model is trained only once, on the entire training dataset, and scored on the test/hold-out set.

Return train score

The default scoring grid shows the performance metrics on the validation set by fold. If you also want to see the performance metrics on the training set by fold, to examine over-fitting/under-fitting, you can use the return_train_score parameter.

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# train model and return train score
lr = create_model('lr', return_train_score = True)
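
To compare the train and validation scores side by side, you can pull the extended grid just like any other scoring grid. A brief sketch (the exact row labels may differ between PyCaret versions):

# grab the scoring grid shown above; it now contains both train and validation rows
lr_results = pull()
print(lr_results)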

Set the probability threshold

When performing binary classification, you can change the probability threshold or cut-off value for hard labels. By default, all classifiers use 0.5 as the threshold.

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# train model with 0.25 threshold
lr = create_model('lr', probability_threshold = 0.25)
# see the model
print(lr)

Train models in a loop

You can use the create_model function in a loop to train multiple models or even the same model with different configurations and compare their results.

import numpy as np
import pandas as pd

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# train models in a loop
lgbs  = [create_model('lightgbm', learning_rate = i) for i in np.arange(0.1,1,0.1)]
type(lgbs)
# >>> list

len(lgbs)
# >>> 9

If you also want to keep track of the metrics, as you would in most cases, this is how you can do it.

import numpy as np
import pandas as pd

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# start a loop
models = []
results = []

for i in np.arange(0.1,1,0.1):
    model = create_model('lightgbm', learning_rate = i)
    model_results = pull().loc[['Mean']]
    models.append(model)
    results.append(model_results)
    
results = pd.concat(results, axis=0)
results.index = np.arange(0.1,1,0.1)
results.plot()
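
To pick the configuration with the best cross-validated score from the collected results, standard pandas indexing is enough. A small sketch, assuming the default classification metric columns:

# learning rate with the highest mean CV Accuracy
best_lr = results['Accuracy'].idxmax()
print(best_lr)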

Train custom models

You can train your own custom models, or models from other libraries that are not part of PyCaret, as long as their API is consistent with sklearn. It will work like a breeze.

# install gplearn library
# pip install gplearn

# load dataset 
from pycaret.datasets import get_data 
diabetes = get_data('diabetes') 

# init setup
from pycaret.classification import * 
clf1 = setup(data = diabetes, target = 'Class variable')

# import custom model
from gplearn.genetic import SymbolicClassifier
sc = SymbolicClassifier()

# train custom model
sc_trained = create_model(sc)
type(sc_trained)
# >>> gplearn.genetic.SymbolicClassifier

print(sc_trained)

Write your own models

You can also write your own class with fit and predict methods, and PyCaret will be compatible with it. Here is a simple example:

# load dataset 
from pycaret.datasets import get_data 
insurance = get_data('insurance')

# init setup
from pycaret.regression import * 
reg1 = setup(data = insurance, target = 'charges')

# create custom estimator
import numpy as np
from sklearn.base import BaseEstimator
class MyOwnModel(BaseEstimator):
    
    def __init__(self):
        self.mean = 0
        
    def fit(self, X, y):
        self.mean = y.mean()
        return self
    
    def predict(self, X):
        return np.array(X.shape[0]*[self.mean])
        
# create an instance
my_own_model = MyOwnModel()

# train model
my_model_trained = create_model(my_own_model)
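
The trained custom model behaves like any other PyCaret model; for example, it can be passed to predict_model to score the hold-out set. A brief sketch (the prediction column names in the returned frame depend on the PyCaret version):

# generate predictions on the hold-out set with the custom model
holdout_preds = predict_model(my_model_trained)
print(holdout_preds.head())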
