This function trains and evaluates the performance of all estimators available in the model library using cross-validation. The output of this function is a scoring grid with average cross-validated scores. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions.
The compare_models function returns only the top-performing model based on the criteria defined in the sort parameter, which defaults to Accuracy for classification experiments and R2 for regression. You can change the sort order by passing the name of the metric on which you want to base model selection.
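For example, a minimal sketch using the sort and n_select parameters of compare_models (shown here for illustration): the first call changes the sort metric, while the second keeps the default sort but returns the top 3 models.

# sort the scoring grid by F1 instead of the default metric
best = compare_models(sort = 'F1')

# keep the default sort but return the top 3 models
best = compare_models(n_select = 3)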
Notice that there is no change in the results; however, if you check the variable best, it will now contain a list of the top 3 models instead of just one model, as seen previously.
type(best)
# >>> list

print(best)
Set the budget time
If you are running short on time and would like to set a fixed budget time for this function to run, you can do that by setting the budget_time parameter.
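A minimal sketch (budget_time is specified in minutes):

# limit the total run time of compare_models to 0.5 minutes
best = compare_models(budget_time = 0.5)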
When performing binary classification, you can change the probability threshold or cut-off value for hard labels. By default, all classifiers use 0.5 as the threshold.
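A minimal sketch, assuming the standard classification setup on the diabetes dataset used elsewhere in these examples:

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# compare models using a 0.25 probability threshold for hard labels
best = compare_models(probability_threshold = 0.25)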
Notice that all metrics except for AUC are now different. AUC doesn't change because it is not dependent on the hard labels; everything else is, and the hard labels are now obtained using probability_threshold=0.25.
NOTE: This parameter is only available in the Classification module of PyCaret.
Disable cross-validation
If you would rather just train the models and see the metrics on the test/hold-out set instead of evaluating them with cross-validation, you can set cross_validation=False.
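A minimal sketch:

# train all models without cross-validation and score on the hold-out set
best = compare_models(cross_validation = False)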
The output looks similar, but if you look closely the metrics are different: instead of average cross-validated scores, these are now the metrics on the test/hold-out set.
To scale to large datasets, you can run the compare_models function on a cluster in distributed mode using a parameter called parallel. It leverages the Fugue abstraction layer to run compare_models on Spark or Dask clusters.
Note that we need to set n_jobs = 1 in the setup for testing with local Spark because some models will already try to use all available cores, and running such models in parallel can cause deadlocks from resource contention.
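A minimal sketch for a local Spark session, assuming pyspark and fugue are installed:

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup with n_jobs = 1 to avoid resource contention
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', n_jobs = 1)

# create a local Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# distribute compare_models over the Spark cluster
from pycaret.parallel import FugueBackend
best = compare_models(parallel = FugueBackend(spark))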
For Dask, we can specify "dask" inside FugueBackend and it will pull the available Dask client.
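A minimal sketch, assuming Dask is installed so Fugue can pick up the available client:

# distribute compare_models over Dask
from pycaret.parallel import FugueBackend
best = compare_models(parallel = FugueBackend("dask"))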
For a complete walkthrough and other features related to distributed execution, check this example. It also shows how to get the leaderboard in real time; in a distributed setting, this involves setting up an RPCClient, but Fugue simplifies that.
create_model
This function trains and evaluates the performance of a given estimator using cross-validation. The output of this function is a scoring grid with CV scores by fold. Metrics evaluated during CV can be accessed using the get_metrics function. Custom metrics can be added or removed using the add_metric and remove_metric functions. All the available models can be accessed using the models function.
This function displays the performance metrics by fold, along with the mean and standard deviation for each metric, and returns the trained model. By default, it uses 10 folds, which can be changed either globally in the setup function or locally within create_model.
When you just run create_model('dt'), it will train a Decision Tree with all default hyperparameter settings. If you would like to change that, simply pass the hyperparameters in the create_model function.
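For instance, a minimal sketch that overrides one hyperparameter (max_depth is a scikit-learn Decision Tree parameter, used here for illustration):

# train a Decision Tree with a custom max_depth
dt = create_model('dt', max_depth = 5)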
The performance metrics/scoring grid you see after create_model is only displayed and is not returned. As such, if you want to access that grid as a pandas.DataFrame, you will have to use the pull command after create_model.
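A minimal sketch, assuming a Decision Tree was just trained:

# train model
dt = create_model('dt')

# capture the scoring grid displayed above as a DataFrame
dt_results = pull()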
# check type
type(dt_results)
# >>> pandas.core.frame.DataFrame

# select only Mean
dt_results.loc[['Mean']]
Disable cross-validation
If you would rather just train the model and see the metrics on the test/hold-out set instead of evaluating it with cross-validation, you can set cross_validation=False.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# train model without cv
lr = create_model('lr', cross_validation = False)
These are the metrics on the test/hold-out set. That's why you only see one row, as opposed to the 12 rows in the original output. When you disable cross_validation, the model is trained only once, on the entire training dataset, and scored using the test/hold-out set.
The default scoring grid shows the performance metrics on the validation set by fold. If you also want to see the performance metrics on the training set by fold, to examine over-fitting/under-fitting, you can use the return_train_score parameter.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# train model with train scores included in the grid
lr = create_model('lr', return_train_score = True)
Set the probability threshold
When performing binary classification, you can change the probability threshold or cut-off value for hard labels. By default, all classifiers use 0.5 as the threshold.
# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# train model with 0.25 threshold
lr = create_model('lr', probability_threshold = 0.25)
# see the model
print(lr)
Train models in a loop
You can use the create_model function in a loop to train multiple models, or even the same model with different configurations, and compare their results.
import numpy as np
import pandas as pd

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# train models in a loop
lgbs = [create_model('lightgbm', learning_rate = i) for i in np.arange(0.1, 1, 0.1)]
type(lgbs)
# >>> list

len(lgbs)
# >>> 9
In most cases you will also want to keep track of the metrics; this is how you can do it.
import numpy as np
import pandas as pd

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# start a loop
models = []
results = []

for i in np.arange(0.1, 1, 0.1):
    model = create_model('lightgbm', learning_rate = i)
    model_results = pull().loc[['Mean']]
    models.append(model)
    results.append(model_results)

results = pd.concat(results, axis=0)
results.index = np.arange(0.1, 1, 0.1)
results.plot()
Train custom models
You can use your own custom models for training, or models from other libraries that are not part of PyCaret. As long as their API is consistent with scikit-learn, it will work like a breeze.
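A minimal sketch using an estimator from an external library (assuming gplearn is installed; its SymbolicClassifier follows the scikit-learn API):

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable')

# import an sklearn-compatible estimator from an external library
from gplearn.genetic import SymbolicClassifier
sc = SymbolicClassifier()

# pass the estimator object directly to create_model
sc_trained = create_model(sc)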