PyCaret Official Blog

Hands-on Tutorials

Time Series Anomaly Detection with PyCaret

A step-by-step tutorial on unsupervised anomaly detection for time series data using PyCaret

PyCaret: an open-source, low-code machine learning library in Python

👉 Introduction

This is a step-by-step, beginner-friendly tutorial on detecting anomalies in time series data using PyCaret's Unsupervised Anomaly Detection Module.

Learning Goals of this Tutorial

  • What anomaly detection is, and the types of anomaly detection.
  • Business use cases for anomaly detection.
  • Training and evaluating an anomaly detection model using PyCaret.
  • Labeling anomalies and analyzing the results.

👉 PyCaret

PyCaret is an open-source, low-code machine learning library and end-to-end model management tool built in Python for automating machine learning workflows. It is incredibly popular for its ease of use, simplicity, and ability to build and deploy end-to-end ML prototypes quickly and efficiently.
PyCaret is an alternative low-code library that can replace hundreds of lines of code with only a few. This makes the experiment cycle exponentially faster and more efficient.
PyCaret is simple and easy to use. All the operations performed in PyCaret are sequentially stored in a Pipeline that is fully automated for deployment. Whether it's imputing missing values, one-hot encoding, transforming categorical data, feature engineering, or even hyperparameter tuning, PyCaret automates all of it.
To learn more about PyCaret, check out their GitHub.

👉 Installing PyCaret

Installing PyCaret is very easy and takes only a few minutes. We strongly recommend using a virtual environment to avoid potential conflicts with other libraries.
PyCaretโ€™s default installation is a slim version of pycaret which only installs hard dependencies that are listed here.
```shell
# install slim version (default)
pip install pycaret

# install the full version
pip install pycaret[full]
```
When you install the full version of pycaret, all the optional dependencies as listed here are also installed.

👉 What is Anomaly Detection?

Anomaly Detection is a technique used for identifying rare items, events, or observations that raise suspicions by differing significantly from the majority of the data.
Typically, the anomalous items will translate to some kind of problem such as:
  • bank fraud,
  • structural defect,
  • medical problem,
  • errors, etc.
Anomaly detection algorithms can broadly be categorized into these groups:
(a) Supervised: Used when the dataset has labels identifying which transactions are anomalous and which are normal (this is similar to a supervised classification problem).
(b) Unsupervised: No labels are available; a model is trained on the complete data and assumes that the majority of the instances are normal.
(c) Semi-Supervised: A model is trained on normal data only (without any anomalies). When the trained model is used on new data points, it can predict whether each new point is normal or not (based on the distribution of the data learned by the trained model).
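To make the semi-supervised idea concrete, here is a minimal, illustrative sketch in plain Python (not PyCaret): we "train" on normal data only by learning its mean and standard deviation, then flag new points that fall far outside that distribution. The three-sigma threshold `k=3.0` is an assumption chosen for illustration.

```python
from statistics import mean, stdev

def fit_normal(values):
    # "train" on normal (anomaly-free) data only: learn its mean and spread
    return mean(values), stdev(values)

def is_anomaly(x, mu, sigma, k=3.0):
    # flag a new point lying more than k standard deviations from the mean
    return abs(x - mu) > k * sigma

normal_data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10]
mu, sigma = fit_normal(normal_data)
print(is_anomaly(10.5, mu, sigma))  # False: close to the training distribution
print(is_anomaly(50.0, mu, sigma))  # True: far outside it
```

Real semi-supervised detectors model the normal data far more richly, but the shape of the workflow is the same: fit on normal data, then score new points against what was learned.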
Anomaly Detection Business use-cases

👉 PyCaret Anomaly Detection Module

PyCaret's **Anomaly Detection** Module is an unsupervised machine learning module that is used for identifying rare items, events, or observations. It provides over 15 algorithms and several plots to analyze the results of trained models.

👉 Dataset

I will be using the NYC taxi passengers dataset that contains the number of taxi passengers from July 2014 to January 2015 at half-hourly intervals. You can download the dataset from here.
```python
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/nyc_taxi.csv')

data['timestamp'] = pd.to_datetime(data['timestamp'])

data.head()
```
Sample rows from the data
```python
# create moving-averages
data['MA48'] = data['value'].rolling(48).mean()
data['MA336'] = data['value'].rolling(336).mean()

# plot
import plotly.express as px
fig = px.line(data, x="timestamp", y=['value', 'MA48', 'MA336'], title='NYC Taxi Trips', template='plotly_dark')
fig.show()
```
value, moving_average(48), and moving_average(336)

👉 Data Preparation

Since algorithms cannot directly consume date or timestamp data, we will extract features from the timestamp and drop the actual timestamp column before training models.
```python
# drop moving-average columns
data.drop(['MA48', 'MA336'], axis=1, inplace=True)

# set timestamp to index
data.set_index('timestamp', drop=True, inplace=True)

# resample timeseries to hourly
data = data.resample('H').sum()

# create features from date
data['day'] = [i.day for i in data.index]
data['day_name'] = [i.day_name() for i in data.index]
data['day_of_year'] = [i.dayofyear for i in data.index]
# Timestamp.weekofyear was removed in pandas 2.0; isocalendar() is the portable spelling
data['week_of_year'] = [i.isocalendar()[1] for i in data.index]
data['hour'] = [i.hour for i in data.index]
# True for Monday-Friday (isoweekday() alone returns the day number 1-7, not a flag)
data['is_weekday'] = [i.isoweekday() <= 5 for i in data.index]

data.head()
```
Sample rows from data after transformations
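As an aside, the per-row list comprehensions above can also be written with vectorized `DatetimeIndex` accessors, which is more idiomatic pandas and faster on large frames. A small self-contained sketch (the three-row toy index here is an assumption standing in for the resampled taxi data):

```python
import pandas as pd

# toy hourly index standing in for the resampled taxi data
idx = pd.date_range('2014-07-01', periods=3, freq='h')
df = pd.DataFrame({'value': [10, 20, 30]}, index=idx)

# vectorized equivalents of the per-row list comprehensions
df['day'] = df.index.day
df['day_name'] = df.index.day_name()
df['day_of_year'] = df.index.dayofyear
df['week_of_year'] = df.index.isocalendar().week.to_numpy()
df['hour'] = df.index.hour
```

Either style produces the same feature columns; the accessor form simply pushes the loop into pandas.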

👉 Experiment Setup

Common to all modules in PyCaret, the setup function is the first and the only mandatory step to start any machine learning experiment in PyCaret. Besides performing some basic processing tasks by default, PyCaret also offers a wide array of pre-processing features. To learn more about all the preprocessing functionalities in PyCaret, you can see this link.
```python
# init setup
from pycaret.anomaly import *
s = setup(data, session_id = 123)
```
setup function in pycaret.anomaly module
Whenever you initialize the setup function in PyCaret, it profiles the dataset and infers the data types for all input features. In this case, you can see that day_name and is_weekday are inferred as categorical and the remaining features as numeric. You can press enter to continue.
Output from setup โ€” truncated for display

👉 Model Training

To check the list of all available algorithms:
```python
# check list of available models
models()
```
Output from models() function
In this tutorial, I am using Isolation Forest, but you can replace the ID 'iforest' in the code below with any other model ID to change the algorithm. If you want to learn more about the Isolation Forest algorithm, you can refer to this.
```python
# train model
iforest = create_model('iforest', fraction = 0.1)
iforest_results = assign_model(iforest)
iforest_results.head()
```
Sample rows from iforest_results
Notice that two new columns are appended: **Anomaly**, which contains the value 1 for outliers and 0 for inliers, and **Anomaly_Score**, a continuous value also known as the decision function (internally, the algorithm calculates the score from which the anomaly label is determined).
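The `fraction` parameter passed to create_model plays the role of an expected proportion of outliers. As an illustrative sketch of the general idea (plain Python, not PyCaret's internal logic), binary labels can be derived from continuous scores by flagging the top `fraction` of scores; `label_by_fraction` and its toy scores are hypothetical names for this example.

```python
def label_by_fraction(scores, fraction=0.1):
    # flag the highest-scoring `fraction` of points as anomalies (label 1)
    n_outliers = max(1, int(len(scores) * fraction))
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= threshold else 0 for s in scores]

scores = [0.10, 0.20, 0.15, 0.90, 0.05, 0.12, 0.11, 0.13, 0.14, 0.85]
print(label_by_fraction(scores, fraction=0.2))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
```

This is why changing `fraction` changes how many rows end up with Anomaly == 1: it moves the score threshold, not the scores themselves.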
```python
# check anomalies
iforest_results[iforest_results['Anomaly'] == 1].head()
```
Sample rows from iforest_results (filtered to Anomaly == 1)
We can now plot the anomalies on the graph to visualize them.
```python
import plotly.graph_objects as go

# plot value on y-axis and date on x-axis
fig = px.line(iforest_results, x=iforest_results.index, y="value", title='NYC TAXI TRIPS - UNSUPERVISED ANOMALY DETECTION', template='plotly_dark')

# create list of outlier_dates
outlier_dates = iforest_results[iforest_results['Anomaly'] == 1].index

# obtain y value of anomalies to plot
y_values = [iforest_results.loc[i]['value'] for i in outlier_dates]

fig.add_trace(go.Scatter(x=outlier_dates, y=y_values, mode='markers',
                         name='Anomaly',
                         marker=dict(color='red', size=10)))

fig.show()
```
NYC Taxi Trips โ€” Unsupervised Anomaly Detection
Notice that the model has picked up several anomalies around January 1st, New Year's Day. The model has also detected a couple of anomalies around January 18-22, when the North American blizzard (a fast-moving, disruptive blizzard) moved through the Northeast, dumping about 30 cm of snow in areas around New York City.
If you google the dates around the other red points on the graph, you will probably be able to find leads on why those points were picked up as anomalous by the model (hopefully).
I hope you will appreciate the ease of use and simplicity of PyCaret. In just a few lines of code and a few minutes of experimentation, I have trained an unsupervised anomaly detection model and labeled the dataset to detect anomalies in time series data.

Coming Soon!

Next week I will be writing a tutorial on training custom models in PyCaret using PyCaret Regression Module. You can follow me on Medium, LinkedIn, and Twitter to get instant notifications whenever a new tutorial is released.
There is no limit to what you can achieve using this lightweight workflow automation library in Python. If you find this useful, please do not forget to give us a ⭐️ on our GitHub repository.
To hear more about PyCaret, follow us on LinkedIn and YouTube.
Join us on our Slack channel. Invite link here.
