Time Series 101 - For beginners
Time series data is data collected on the same subject at different points in time: the GDP of a country by year, the stock price of a particular company over a period of time, or your own heartbeat recorded each second. In fact, anything you can capture repeatedly at different time intervals is time series data.
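As a minimal illustration (with made-up prices, not real market data), a time series in pandas is simply a sequence of values indexed by timestamps:

```python
import pandas as pd

# One subject (a hypothetical stock price) observed at regular
# daily intervals -- the defining shape of time series data.
dates = pd.date_range(start="2021-01-01", periods=5, freq="D")
prices = pd.Series([701.2, 705.6, 698.4, 710.1, 715.9], index=dates)
print(prices)
```

The datetime index is what distinguishes this from an ordinary list of numbers: the ordering in time carries information.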
See the example of time series data below: the chart shows the daily stock price of Tesla Inc. (ticker symbol: TSLA) for the last year. The y-axis on the right-hand side shows the value in US$ (the last point on the chart, $701.91, is the latest stock price as of the writing of this article on April 12, 2021).
Example of Time Series Data — Tesla Inc. (ticker symbol: TSLA) daily stock price 1Y interval.
On the other hand, more conventional datasets such as customer information, product information, company information, etc. which store information at a single point in time are known as cross-sectional data.
See the example below of a dataset that tracks America’s best-selling electric cars in the first half of 2020. Notice that instead of tracking the cars sold over a period of time, the chart below tracks different cars such as Tesla, Chevy, and Nissan in the same time period.
It is not hard to distinguish cross-sectional from time-series data, as the objectives of analysis for the two are widely different. In the first example, we were interested in tracking Tesla's stock price over a period of time, whereas in the latter, we wanted to compare different companies within the same time period, i.e. the first half of 2020.
However, a typical real-world dataset is likely to be a hybrid. Imagine a retailer like Walmart that sells thousands of products every day. If you analyze sales by product on a particular day, for example to find out the number one selling item on Christmas Eve, that is a cross-sectional analysis. If, instead, you want to track the sales of one particular item, such as the PS4, over a period of time (say, the last 5 years), that becomes a time-series analysis.
In short, the objectives of time-series and cross-sectional analysis are different, and a real-world dataset is likely to be a hybrid of both time-series and cross-sectional data.
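The Walmart example above can be sketched in a few lines of pandas. The dataset, column names, and numbers here are hypothetical, invented only to show the two ways of slicing a hybrid dataset:

```python
import pandas as pd

# Hypothetical retail sales data: many dates (time-series dimension)
# and many products (cross-sectional dimension).
sales = pd.DataFrame({
    "date":    ["2020-12-24", "2020-12-24", "2020-12-25", "2020-12-25"],
    "product": ["PS4", "TV", "PS4", "TV"],
    "units":   [120, 80, 45, 30],
})
sales["date"] = pd.to_datetime(sales["date"])

# Cross-sectional view: all products on a single day (Christmas Eve).
christmas_eve = sales[sales["date"] == "2020-12-24"]

# Time-series view: one product tracked across all days.
ps4_over_time = sales[sales["product"] == "PS4"].set_index("date")["units"]
```

The same table supports both analyses; what changes is whether you hold the time point fixed or the subject fixed.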
Time series forecasting is exactly what it sounds like: predicting future unknown values. However, unlike sci-fi movies, it's a little less thrilling in the real world. It involves collecting historical data, preparing it for algorithms to consume (an algorithm, simply put, is the math that goes on behind the scenes), and then predicting future values based on patterns learned from the historical data.
Can you think of a reason why companies, or anybody else, would be interested in forecasting future values of a time series (GDP, monthly sales, inventory, unemployment, global temperatures, etc.)? Let me give you some business perspective:
- A retailer may be interested in predicting future sales at an SKU level for planning and budgeting.
- A small merchant may be interested in forecasting sales by store, so it can schedule the right resources (more people during busy periods and vice versa).
- A software giant like Google may be interested in knowing the busiest hour of the day or busiest day of the week so that it can schedule server resources accordingly.
- The health department may be interested in predicting the cumulative COVID vaccinations administered, so that it can estimate the point at which herd immunity is expected to kick in.
Time series forecasting can broadly be categorized into the following categories:
- **Classical / Statistical Models** — Moving Averages, Exponential Smoothing, ARIMA, SARIMA, TBATS
- **Machine Learning** — Linear Regression, XGBoost, Random Forest, or any ML model with reduction methods
- **Deep Learning** — RNN, LSTM
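To make the classical category concrete, the simplest of these models, a moving average, forecasts the next period as the mean of the last k observations. A toy sketch (the series values are illustrative):

```python
import pandas as pd

# Toy monthly series to illustrate the simplest classical model:
# forecast the next value as the mean of the last k observations.
y = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])

k = 3
forecast = y.tail(k).mean()  # moving-average forecast for the next period
print(round(forecast, 2))
```

The more sophisticated statistical models (ARIMA, SARIMA, TBATS) build on this idea by also modeling trend, seasonality, and autocorrelation.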
This tutorial is focused on forecasting time series using machine learning. For this tutorial, I will use the Regression Module of an open-source, low-code machine learning library in Python called PyCaret. If you haven't used PyCaret before, you can get started quickly here. You don't need any prior knowledge of PyCaret to follow along with this tutorial.
PyCaret Regression Module is a supervised machine learning module used for estimating the relationships between a dependent variable (often called the ‘outcome variable’, or ‘target’) and one or more independent variables (often called ‘features’, or ‘predictors’).
The objective of regression is to predict continuous values such as sales amount, quantity, temperature, number of customers, etc. All modules in PyCaret provide many pre-processing features to prepare the data for modeling through the setup function. It has over 25 ready-to-use algorithms and several plots to analyze the performance of trained models.
```python
# read csv file
import pandas as pd
data = pd.read_csv('AirPassengers.csv')
data['Date'] = pd.to_datetime(data['Date'])

# create 12 month moving average
data['MA12'] = data['Passengers'].rolling(12).mean()

# plot the data and MA
import plotly.express as px
fig = px.line(data, x="Date", y=["Passengers", "MA12"], template='plotly_dark')
fig.show()
```
US Airline Passenger Dataset Time Series Plot with Moving Average = 12
Since machine learning algorithms cannot directly deal with dates, let’s extract some simple features from dates such as month and year, and drop the original date column.
```python
# extract month and year from dates
data['Month'] = [i.month for i in data['Date']]
data['Year'] = [i.year for i in data['Date']]

# create a sequence of numbers
import numpy as np
data['Series'] = np.arange(1, len(data) + 1)

# drop unnecessary columns and re-arrange
data.drop(['Date', 'MA12'], axis=1, inplace=True)
data = data[['Series', 'Year', 'Month', 'Passengers']]

# check the head of the dataset
data.head()
```
Sample rows after extracting features
```python
# split data into train-test set
train = data[data['Year'] < 1960]
test = data[data['Year'] >= 1960]

# check shape
train.shape, test.shape
# >>> ((132, 4), (12, 4))
```
I have manually split the dataset before initializing the setup. An alternative would be to pass the entire dataset to PyCaret and let it handle the split, in which case you would have to pass data_split_shuffle = False in the setup function to avoid shuffling the dataset before the split.
Now it’s time to initialize the setup function, where we will explicitly pass the training data, test data, and cross-validation strategy using the fold_strategy parameter.
```python
# import the regression module
from pycaret.regression import *

# initialize setup
s = setup(data=train, test_data=test, target='Passengers',
          fold_strategy='timeseries', numeric_features=['Year', 'Series'],
          fold=3, transform_target=True, session_id=123)

# compare all models and select the best one by cross-validated MAE
best = compare_models(sort='MAE')
```
Results from compare_models
The best model based on cross-validated MAE is **Least Angle Regression** (MAE: 22.3). Let's check the score on the test set.
```python
prediction_holdout = predict_model(best)
```
Results from predict_model(best) function
MAE on the test set is 12% higher than the cross-validated MAE. Not so good, but we will work with it. Let’s plot the actual and predicted lines to visualize the fit.
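For context, MAE (mean absolute error) is simply the average absolute difference between actual and predicted values. A minimal sketch with made-up numbers (these are not the actual model outputs from the article):

```python
# Hypothetical actual vs predicted passenger counts, for illustration only.
actual    = [417, 391, 419, 461, 472, 535]
predicted = [405, 400, 430, 450, 480, 520]

# MAE: mean of the absolute errors.
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
print(mae)  # -> 11.0
```

Because MAE is in the same units as the target (passengers), it is easy to interpret when comparing cross-validation and holdout performance.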
```python
# generate predictions on the original dataset
predictions = predict_model(best, data=data)

# add a date column in the dataset
predictions['Date'] = pd.date_range(start='1949-01-01', end='1960-12-01', freq='MS')

# line plot
fig = px.line(predictions, x='Date', y=["Passengers", "Label"], template='plotly_dark')

# add a vertical rectangle for test-set separation
fig.add_vrect(x0="1960-01-01", x1="1960-12-01", fillcolor="grey", opacity=0.25, line_width=0)
fig.show()
```
Actual and Predicted US airline passengers (1949–1960)
The grey backdrop towards the end is the test period (i.e. 1960). Now let's finalize the model, i.e. train the best model (Least Angle Regression) on the entire dataset, this time including the test set.
```python
final_best = finalize_model(best)
```
Now that we have trained our model on the entire dataset (1949 to 1960), let's predict five years out into the future, through 1964. To use our final model to generate future predictions, we first need to create a dataset consisting of the Month, Year, and Series columns for the future dates.
```python
future_dates = pd.date_range(start='1961-01-01', end='1965-01-01', freq='MS')

future_df = pd.DataFrame()
future_df['Month'] = [i.month for i in future_dates]
future_df['Year'] = [i.year for i in future_dates]
future_df['Series'] = np.arange(145, 145 + len(future_dates))

future_df.head()
```
Sample rows from future_df
Now, let’s use the future_df to score and generate predictions.
```python
predictions_future = predict_model(final_best, data=future_df)
```
Sample rows from predictions_future
```python
# concatenate historical data with future predictions and plot
concat_df = pd.concat([data, predictions_future], axis=0)
concat_df_i = pd.date_range(start='1949-01-01', end='1965-01-01', freq='MS')
concat_df.set_index(concat_df_i, inplace=True)

fig = px.line(concat_df, x=concat_df.index, y=["Passengers", "Label"], template='plotly_dark')
fig.show()
```
Actual (1949–1960) and Predicted (1961–1964) US airline passengers
There is no limit to what you can achieve using this lightweight workflow automation library in Python. If you find this useful, please do not forget to give ⭐️ on our GitHub repository.