Clustering Analysis in Power BI using PyCaret

How to implement Clustering in Power BI using PyCaret

by Moez Ali

In our last post, we demonstrated how to build an anomaly detector in Power BI by integrating it with PyCaret, thus allowing analysts and data scientists to add a layer of machine learning to their reports and dashboards without any additional license costs.

In this post, we will see how we can implement Clustering Analysis in Power BI using PyCaret. If you havenโ€™t heard about PyCaret before, please read this announcement to learn more.

Learning Goals of this Tutorial

  • What is Clustering? Types of Clustering.

  • Train and implement an unsupervised Clustering model in Power BI.

  • Analyze results and visualize information in a dashboard.

  • How to deploy the Clustering model in Power BI production?

Before we start

If you have used Python before, it is likely that you already have Anaconda Distribution installed on your computer. If not, click here to download Anaconda Distribution with Python 3.7 or greater.

Setting up the Environment

Before we start using PyCaretโ€™s machine learning capabilities in Power BI we have to create a virtual environment and install pycaret. Itโ€™s a three-step process:

โœ… Step 1 โ€” Create an anaconda environment

Open **Anaconda Prompt **from start menu and execute the following code:

conda create --name **myenv** python=3.7

โœ… Step 2 โ€” Install PyCaret

Execute the following code in Anaconda Prompt:

pip install pycaret

Installation may take 15โ€“20 minutes. If you are having issues with installation, please see our GitHub page for known issues and resolutions.

โœ…Step 3 โ€” Set Python Directory in Power BI

The virtual environment created must be linked with Power BI. This can be done using Global Settings in Power BI Desktop (File โ†’ Options โ†’ Global โ†’ Python scripting). Anaconda Environment by default is installed under:

C:\Users*username*\AppData\Local\Continuum\anaconda3\envs\myenv

What is Clustering?

Clustering is a technique that groups data points with similar characteristics. These groupings are useful for exploring data, identifying patterns and analyzing a subset of data. Organising data into clusters helps in identify underlying structures in the data and finds applications across many industries. Some common business use cases for clustering are:

โœ” Customer segmentation for the purpose of marketing.

โœ” Customer purchasing behavior analysis for promotions and discounts.

โœ” Identifying geo-clusters in an epidemic outbreak such as COVID-19.

Types of Clustering

Given the subjective nature of clustering tasks, there are various algorithms that suit different types of problems. Each algorithm has its own rules and the mathematics behind how clusters are calculated.

This tutorial is about implementing a clustering analysis in Power BI using a Python library called PyCaret. Discussion of the specific algorithmic details and mathematics behind these algorithms are out-of-scope for this tutorial.

In this tutorial we will use a K-Means algorithm which is one of the simplest and most popular unsupervised machine learning algorithms. If you would like to learn more about K-Means, you can read this paper.

Setting the Business Context

In this tutorial we will use the current health expenditure dataset from the World Health Organizationโ€™s*** ***Global Health Expenditure database. The dataset contains health expenditure as a % of National GDP for over 200 countries from year 2000 through 2017.

Our objective is to find patterns and groups in this data by using a K-Means clustering algorithm.

Source Data

๐Ÿ‘‰ Letโ€™s get started

Now that you have set up the Anaconda Environment, installed PyCaret, understand the basics of Clustering Analysis and have the business context for this tutorial, letโ€™s get started.

1. Get Data

The first step is importing the dataset into Power BI Desktop. You can load the data using a web connector. (Power BI Desktop โ†’ Get Data โ†’ From Web).

Link to csv file: https://github.com/pycaret/powerbi-clustering/blob/master/clustering.csv

2. Model Training

To train a clustering model in Power BI we will have to execute a Python script in Power Query Editor (Power Query Editor โ†’ Transform โ†’ Run python script). Run the following code as a Python script:

from **pycaret.clustering** import *
dataset = **get_clusters**(dataset, num_clusters=5, ignore_features=['Country'])

We have ignored the โ€˜Countryโ€™ column in the dataset using the ignore_features parameter. There could be many reasons for which you might not want to use certain columns for training a machine learning algorithm.

PyCaret allows you to hide instead of drop unneeded columns from a dataset as you might require those columns for later analysis. For example, in this case we donโ€™t want to use โ€˜Countryโ€™ for training an algorithm and hence we have passed it under ignore_features.

There are over 8 ready-to-use clustering algorithms available in PyCaret.

By default, PyCaret trains a **K-Means Clustering model **with 4 clusters. Default values can be changed easily:

  • To change the model type use the ***model ***parameter within get_clusters().

  • To change the cluster number, use the ***num_clusters ***parameter.

See the example code for K-Modes Clustering with 6 clusters.

from **pycaret.clustering **import *
dataset = **get_clusters**(dataset, model='kmodes', num_clusters=6, ignore_features=['Country'])

Output:

A new column which contains the cluster label is attached to the original dataset. All the year columns are then *unpivoted *to normalize the data so it can be used for visualization in Power BI.

Hereโ€™s how the final output looks like in Power BI.

3. Dashboard

Once you have cluster labels in Power BI, hereโ€™s an example of how you can visualize it in dashboard to generate insights:

You can download the PBIX file and the data set from our GitHub.

๐Ÿ‘‰ Implementing Clustering in Production

What has been demonstrated above was one simple way to implement Clustering in Power BI. However, it is important to note that the method shown above trains the clustering model every time the Power BI dataset is refreshed. This may be a problem for two reasons:

  • When the model is re-trained with new data, the cluster labels may change (eg: some data points that were labeled as Cluster 1 earlier might be labelled as Cluster 2 when re-trained)

  • You donโ€™t want to spend hours of time everyday re-training the model.

A more productive way to implement clustering in Power BI is to use a pre-trained model for generating cluster labels instead of re-training the model every time.

Training Model before-hand

You can use any Integrated Development Environment (IDE)or Notebook for training machine learning models. In this example, we have used Visual Studio Code to train a clustering model.

A trained model is then saved as a pickle file and imported into Power Query for generating cluster labels.

If you would like to learn more about implementing Clustering Analysis in Jupyter notebook using PyCaret, watch this 2 minute video tutorial:

Using the pre-trained model

Execute the below code as a Python script to generate labels from the pre-trained model.

from **pycaret.clustering **import *
dataset = **predict_model**('c:/.../clustering_deployment_20052020, data = dataset)

The output of this will be the same as the one we saw above. The difference is that when you use a pre-trained model, the label is generated on a new dataset using the same model instead of re-training the model.

Making it work on Power BI Service

Once youโ€™ve uploaded the .pbix file to the Power BI service, a couple more steps are necessary to enable seamless integration of the machine learning pipeline into your data pipeline. These include:

  • Enable scheduled refresh for the dataset โ€” to enable a scheduled refresh for the workbook that contains your dataset with Python scripts, see Configuring scheduled refresh, which also includes information about Personal Gateway.

  • Install the Personal Gateway โ€” you need a Personal Gateway installed on the machine where the file is located, and where Python is installed; the Power BI service must have access to that Python environment. You can get more information on how to install and configure Personal Gateway.

If you are Interested in learning more about Clustering Analysis, checkout our Notebook Tutorial.

PyCaret 1.0.1 is coming!

We have received overwhelming support and feedback from the community. We are actively working on improving PyCaret and preparing for our next release. PyCaret 1.0.1 will be bigger and better. If you would like to share your feedback and help us improve further, you may fill this form on the website or leave a comment on our GitHub or LinkedIn page.

Follow our LinkedIn and subscribe to our Youtube channel to learn more about PyCaret.

User Guide / Documentation GitHub Repository Install PyCaret Notebook Tutorials Contribute in PyCaret

Want to learn about a specific module?

As of the first release 1.0.0, PyCaret has the following modules available for use. Click on the links below to see the documentation and working examples in Python.

Classification Regression Clustering Anomaly Detection Natural Language Processing Association Rule Mining

Also see:

PyCaret getting started tutorials in Notebook:

Clustering Anomaly Detection Natural Language Processing Association Rule Mining Regression Classification

Would you like to contribute?

PyCaret is an open source project. Everybody is welcome to contribute. If you would like to contribute, please feel free to work on open issues. Pull requests are accepted with unit tests on dev-1.0.1 branch.

Please give us โญ๏ธ on our GitHub repo if you like PyCaret.

Medium : https://medium.com/@moez_62905/

LinkedIn : https://www.linkedin.com/in/profile-moez/

Twitter : https://twitter.com/moezpycaretorg1

Last updated