Initialize

Initialize experiment in PyCaret

setup

This function initializes the experiment in PyCaret and creates the transformation pipeline based on the parameters passed to it. The setup function must be called before executing any other function. It takes two required parameters: data and target. All the other parameters are optional.

PyCaret 3.0 has two APIs. You can choose either one based on your preference; the functionality and experiment results are consistent across both.

Functional API

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data = diabetes, target = 'Class variable', session_id = 123)

OOP API

# load dataset
from pycaret.datasets import get_data
diabetes = get_data('diabetes')

# init setup
from pycaret.classification import ClassificationExperiment
clf1 = ClassificationExperiment()
clf1.setup(data = diabetes, target = 'Class variable', session_id = 123)
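
With the OOP API, the same functions are called as methods on the experiment object. As a minimal sketch, model training (shown later in this page with the Functional API) would look like this:

# model training with the OOP API
best_model = clf1.compare_models()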

Required Parameters

There are only two required parameters in the setup: the target and the data, which can be supplied either as data or as data_func:

  • data: dataframe-like = None

    Dataset with shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features. If data is not a pandas dataframe, it is converted to one using default column names.

  • data_func: Callable[[], DATAFRAME_LIKE] = None

    The function that generates data (the dataframe-like input). This is useful when the dataset is large and you need parallel operations such as compare_models. It avoids broadcasting a large dataset from the driver to workers. Note that exactly one of data and data_func must be set (see the sketch below this list).

  • target: int, str or sequence, default = -1

    If int or str, respectively the index or name of the target column in data. The default value selects the last column in the dataset. If sequence, it should have shape (n_samples,).

NOTE: The target parameter is not required in the pycaret.clustering and pycaret.anomaly modules.
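
For illustration, a minimal sketch of setup called with data_func instead of data. Any zero-argument callable returning a dataframe-like object works; the built-in get_data loader is used here just to keep the example self-contained:

# init setup with data_func instead of data
from pycaret.datasets import get_data
from pycaret.classification import *

# the callable is invoked where the data is needed, so the full
# dataframe is never broadcast from the driver to the workers
clf1 = setup(data_func = lambda: get_data('diabetes'), target = 'Class variable', session_id = 123)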

Experiment Logging

You can automatically track entire experiments in PyCaret. A parameter in the setup can be enabled to automatically track all the metrics, hyperparameters, and model artifacts. By default, PyCaret uses MLflow for experiment logging. Other available options are wandb, cometml, and dagshub.

Example

# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', log_experiment = True, experiment_name = 'diabetes1')

# model training
best_model = compare_models() 

Initialize the MLflow server on localhost (by default the UI is served at http://localhost:5000):

# init server
!mlflow ui

To learn more about experiment tracking in PyCaret, see this page.
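
log_experiment also accepts a logger name instead of True. A minimal sketch, assuming the wandb package is installed and configured (logger name strings follow the options listed above):

# init setup with Weights & Biases instead of the default MLflow
clf1 = setup(data, target = 'Class variable', log_experiment = 'wandb', experiment_name = 'diabetes1')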

Model Validation

There are quite a few parameters in the setup function that are not directly related to preprocessing or data transformation but are used as part of the model validation and selection strategy, such as train_size, fold_strategy, or fold (the number of cross-validation folds). To learn more about all the model validation and selection settings in the setup, see this page.
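
As a minimal sketch, a setup call that sets the split and cross-validation strategy explicitly (the values below are illustrative, not recommendations):

# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# init setup with an 80/20 split and 5-fold stratified cross-validation
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', train_size = 0.8, fold_strategy = 'stratifiedkfold', fold = 5, session_id = 123)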

GPU Support

With PyCaret, you can train models on GPU and speed up your workflow by 10x. To train models on GPU, simply pass use_gpu = True in the setup function. There is no change in the use of the API; however, in some cases, additional libraries have to be installed because they are not included in the default or the full version. To learn more about GPU support, see this page.
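
A minimal sketch, assuming the GPU-enabled libraries for your estimators are already installed:

# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# init setup with GPU training enabled
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', use_gpu = True, session_id = 123)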

Examples

To see the use of the setup in other modules of PyCaret, see below:

All the examples in the following sections are shown using the Functional API only.
