Other setup parameters
All other setup-related parameters
There are only two required parameters in the setup function.
- data: dataframe-like = None. Data set with shape (n_samples, n_features), where n_samples is the number of samples and n_features is the number of features. If data is not a pandas dataframe, it's converted to one using default column names.
- data_func: Callable[[], DATAFRAME_LIKE] = None. The function that generates data (the dataframe-like input). This is useful when the dataset is large and you need parallel operations such as compare_models. It can avoid broadcasting a large dataset from the driver to workers. Notice that one and only one of data and data_func must be set (see the sketch after this list).
- target: float, int, str or sequence, default = -1. If int or str, respectively the index or name of the target column in data. The default value selects the last column in the dataset. If sequence, it should have shape (n_samples,).
- index: bool, int, str or sequence, default = False. Handle indices in the data dataframe.
  - If False: Reset to RangeIndex.
  - If True: Keep the provided index.
  - If int: Position of the column to use as index.
  - If str: Name of the column to use as index.
  - If sequence: Array with shape=(n_samples,) to use as index.
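As noted above, exactly one of data and data_func may be provided. Here is a minimal sketch of the data_func pattern, assuming the diabetes dataset from pycaret.datasets; the load_data helper name is illustrative:

```python
# a sketch of data_func: the dataframe is produced by a callable instead of
# being passed (and broadcast) directly; load_data is an illustrative name
from pycaret.datasets import get_data
from pycaret.classification import setup

def load_data():
    # called lazily, so a large dataset need not be shipped from driver to workers
    return get_data('diabetes', verbose=False)

# exactly one of data / data_func may be set
clf = setup(data_func=load_data, target='Class variable')
```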
Experiment logging

PyCaret can automatically log entire experiments, including setup parameters, model hyperparameters, performance metrics, and pipeline artifacts. The default settings use MLflow as the logging backend; wandb, comet_ml, and dagshub are also available as backends. A parameter in the setup can be enabled to automatically track all the metrics, hyperparameters, and model artifacts.

- log_experiment: bool, str, BaseLogger, or list of str or BaseLogger, default = False. A (list of) PyCaret BaseLogger or str (one of mlflow, wandb, comet_ml, or dagshub) corresponding to a logger, to determine which experiment loggers to use. Setting to True will use just MLflow.
- experiment_name: str, default = None. Name of the experiment for logging. Ignored when log_experiment = False.
- experiment_custom_tags: dict, default = None. Dictionary of tag_name: String -> value: (String, but will be string-ified if not) passed to mlflow.set_tags to add new custom tags for the experiment.
- log_plots: bool or list, default = False. When set to True, certain plots are logged automatically in the MLflow server. To change the type of plots to be logged, pass a list containing plot IDs. Refer to the documentation of plot_model. Ignored when log_experiment = False.
- log_profile: bool, default = False. When set to True, the data profile is logged on the MLflow server as an html file. Ignored when log_experiment = False.
- log_data: bool, default = False. When set to True, the train and test datasets are logged as CSV files.
```python
# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', log_experiment = True, experiment_name = 'diabetes1')

# model training
best_model = compare_models()
```
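The logging options above can also be combined in a single call. A sketch, assuming 'auc' and 'confusion_matrix' as plot IDs from plot_model and a hypothetical custom tag:

```python
# a sketch combining the logging options; plot IDs follow plot_model,
# and the 'project' tag is a hypothetical example
clf2 = setup(
    data,
    target='Class variable',
    log_experiment=True,
    experiment_name='diabetes1',
    experiment_custom_tags={'project': 'demo'},
    log_plots=['auc', 'confusion_matrix'],  # log only these plots
    log_data=True,                          # log train and test splits as CSV
)
```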
To initialize the MLflow server, you must run the following command from within the notebook or from the command line. Once the server is initialized, you can track your experiment on http://localhost:5000.

```python
# init server
!mlflow ui
```

When no backend is configured, data is stored locally at ./mlruns. To configure the backend, use mlflow.set_tracking_uri before executing the setup function. It accepts:

- An empty string, or a local file path, prefixed with file:/. Data is stored locally at the provided file (or ./mlruns if empty).
- An HTTP URI like https://my-tracking-server:5000.
- A Databricks workspace, provided as the string "databricks" or, to use a Databricks CLI profile, "databricks://<profileName>".
```python
# set tracking uri
import mlflow
mlflow.set_tracking_uri('file:/c:/users/mlflow-server')

# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', log_experiment = True, experiment_name = 'diabetes1')
```
When using PyCaret on Databricks, the experiment_name parameter in the setup must include the complete path to storage. See the example below on how to log experiments when using Databricks:
```python
# load dataset
from pycaret.datasets import get_data
data = get_data('diabetes')

# init setup
from pycaret.classification import *
clf1 = setup(data, target = 'Class variable', log_experiment = True, experiment_name = '/Users/[email protected]/experiment-name-here')
```
Model selection

The following parameters in the setup can be used to configure the model selection process. They are not related to data preprocessing, but can influence how models are selected and validated.
- train_size: float, default = 0.7. The proportion of the dataset to be used for training and validation.
- test_data: dataframe-like or None, default = None. If not None, test_data is used as a hold-out set and the train_size parameter is ignored. The columns in data and test_data must match.
- data_split_shuffle: bool, default = True. When set to False, prevents shuffling of rows during train_test_split.
- data_split_stratify: bool or list, default = True. Controls stratification during the train_test_split. When set to True, it will stratify by the target column. To stratify on any other columns, pass a list of column names. Ignored when data_split_shuffle is False.
- fold_strategy: str or scikit-learn CV generator object, default = 'stratifiedkfold'. Choice of cross-validation strategy. Possible values are:
  - 'kfold'
  - 'stratifiedkfold'
  - 'groupkfold'
  - 'timeseries'
  - a custom CV generator object compatible with scikit-learn.
  For groupkfold, the column name must be passed in the fold_groups parameter. Example: setup(fold_strategy="groupkfold", fold_groups="COLUMN_NAME")
- fold: int, default = 10. The number of folds to be used in cross-validation. Must be at least 2. This is a global setting that can be overwritten at the function level by using the fold parameter. Ignored when fold_strategy is a custom object.
- fold_shuffle: bool, default = False. Controls the shuffle parameter of CV. Only applicable when fold_strategy is kfold or stratifiedkfold. Ignored when fold_strategy is a custom object.
- fold_groups: str or array-like, with shape (n_samples,), default = None. Optional group labels when groupkfold is used for the cross-validation. It takes an array with shape (n_samples,) where n_samples is the number of rows in the training dataset. When a string is passed, it is interpreted as the column name in the dataset containing group labels (see the sketch below).
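For illustration, a minimal sketch of group-aware cross-validation; the synthetic dataframe and the group_id column name are assumptions for the example:

```python
# a sketch of groupkfold in setup; the synthetic data and column names are illustrative
import numpy as np
import pandas as pd
from pycaret.classification import setup

rng = np.random.default_rng(123)
df = pd.DataFrame({
    'feature_1': rng.normal(size=200),
    'feature_2': rng.normal(size=200),
    'group_id': rng.integers(0, 10, size=200),  # rows sharing a group stay in one fold
    'target': rng.integers(0, 2, size=200),
})

clf = setup(
    df,
    target='target',
    fold_strategy='groupkfold',
    fold_groups='group_id',  # column containing the group labels
    fold=5,
)
```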
Other miscellaneous

The following parameters in the setup can be used to control other experiment settings, such as using GPU for training or setting the verbosity of the experiment. They do not affect the data in any way.
- n_jobs: int, default = -1. The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, set n_jobs = None.
- use_gpu: bool or str, default = False. When set to True, it will use GPU for training with algorithms that support it, and fall back to CPU if they are unavailable. When set to 'force', it will only use GPU-enabled algorithms and raise exceptions when they are unavailable. When False, all algorithms are trained using CPU only. GPU-enabled algorithms:
  - Extreme Gradient Boosting, requires no further installation
  - CatBoost Classifier, requires no further installation (GPU training is only enabled when data > 50,000 rows)
  - Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, Support Vector Machine, require cuML >= 0.15
- session_id: int, default = None. Controls the randomness of the experiment. It is equivalent to random_state in scikit-learn. When None, a pseudo-random number is generated. This can be used for later reproducibility of the entire experiment.
- verbose: bool, default = True. When set to False, the information grid is not printed.
- profile: bool, default = False. When set to True, an interactive EDA report is displayed.
- profile_kwargs: dict, default = {} (empty dict). Dictionary of arguments passed to the ProfileReport method used to create the EDA report. Ignored if profile is False.
- custom_pipeline: list of (str, transformer), dict or Pipeline, default = None. Additional custom transformers. If passed, they are applied to the pipeline last, after all the built-in transformers.
- custom_pipeline_position: int, default = -1. Position of the custom pipeline in the overall preprocessing pipeline. The default value adds the custom pipeline last.
- preprocess: bool, default = True. When set to False, no transformations are applied except for train_test_split and custom transformations passed in the custom_pipeline parameter. Data must be ready for modeling (no missing values, no dates, categorical data encoded) when preprocess is set to False.
- system_log: bool, str or logging.Logger, default = True. Whether to save the system logging file (as logs.log). If the input is a string, use that as the path to the logging file. If the input already is a logger object, use that one instead.
- memory: str, bool or Memory, default = True. Used to cache the fitted transformers of the pipeline.
  - If False: No caching is performed.
  - If True: A default temp directory is used.
  - If str: Path to the caching directory.
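A short sketch combining several of these settings; the StandardScaler step is an illustrative custom transformer, not a recommendation:

```python
# a sketch of the miscellaneous settings; the scaler step is illustrative
from sklearn.preprocessing import StandardScaler
from pycaret.datasets import get_data
from pycaret.classification import setup

data = get_data('diabetes')

clf = setup(
    data,
    target='Class variable',
    session_id=123,                                  # reproducible experiment
    n_jobs=None,                                     # run on a single processor
    custom_pipeline=[('scaler', StandardScaler())],  # appended after built-in transformers
    memory=False,                                    # no caching of fitted transformers
    verbose=False,                                   # suppress the information grid
)
```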