Topic Modeling in Power BI using PyCaret
Last updated
Last updated
In our last post, we demonstrated how to implement clustering analysis in Power BI by integrating it with PyCaret, thus allowing analysts and data scientists to add a layer of machine learning to their reports and dashboards without any additional license costs.
In this post, we will see how we can implement topic modeling in Power BI using PyCaret. If you haven’t heard about PyCaret before, please read this announcement to learn more.
What is Natural Language Processing?
What is Topic Modeling?
Train and implement a Latent Dirichlet Allocation model in Power BI.
Analyze results and visualize information in a dashboard.
If you have used Python before, it is likely that you already have Anaconda Distribution installed on your computer. If not, click here to download Anaconda Distribution with Python 3.7 or greater.
Before we start using PyCaret’s machine learning capabilities in Power BI we have to create a virtual environment and install pycaret. It’s a four-step process:
✅ Step 1 — Create an anaconda environment
Open **Anaconda Prompt **from start menu and execute the following code:
“powerbi” is the name of environment we have chosen. You can keep whatever name you would like.
✅ Step 2 — Install PyCaret
Execute the following code in Anaconda Prompt:
Installation may take 15–20 minutes. If you are having issues with installation, please see our GitHub page for known issues and resolutions.
✅Step 3 — Set Python Directory in Power BI
The virtual environment created must be linked with Power BI. This can be done using Global Settings in Power BI Desktop (File → Options → Global → Python scripting). Anaconda Environment by default is installed under:
C:\Users*username*\Anaconda3\envs\
✅Step 4 — Install Language Model
In order to perform NLP tasks you must download language model by executing following code in your Anaconda Prompt.
First activate your conda environment in Anaconda Prompt:
Download English Language Model:
Natural language processing (NLP) is a subfield of computer science and artificial intelligence that is concerned with the interactions between computers and human languages. In particular, NLP covers broad range of techniques on how to program computers to process and analyze large amounts of natural language data.
NLP-powered software helps us in our daily lives in various ways and it is likely that you have been using it without even knowing. Some examples are:
Personal assistants: Siri, Cortana, Alexa.
Auto-complete: In search engines (e.g: Google, Bing, Baidu, Yahoo).
Spell checking: Almost everywhere, in your browser, your IDE (e.g: Visual Studio), desktop apps (e.g: Microsoft Word).
Machine Translation: Google Translate.
**Document Summarization Software: **Text compactor, Autosummarizer.
Topic Modeling is a type of statistical model used for discovering abstract topics in text data. It is one of many practical applications within NLP.
A topic model is a type of statistical model that falls under unsupervised machine learning and is used for discovering abstract topics in text data. The goal of topic modeling is to automatically find the topics / themes in a set of documents.
Some common use-cases for topic modeling are:
Summarizing large text data by classifying documents into topics (the idea is pretty similar to clustering).
**Exploratory Data Analysis **to gain understanding of data such as customer feedback forms, amazon reviews, survey results etc.
**Feature Engineering **creating features for supervised machine learning experiments such as classification or regression
There are several algorithms used for topic modeling. Some common ones are Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Non-Negative Matrix Factorization (NMF). Each algorithm has its own mathematical details which will not be covered in this tutorial. We will implement a Latent Dirichlet Allocation (LDA) model in Power BI using PyCaret’s NLP module.
If you are interested in learning the technical details of the LDA algorithm, you can read this paper.
In order to get meaningful results from topic modeling text data must be processed before feeding it to the algorithm. This is common with almost all NLP tasks. The preprocessing of text is different from the classical preprocessing techniques often used in machine learning when dealing with structured data (data in rows and columns).
PyCaret automatically preprocess text data by applying over 15 techniques such as stop word removal, tokenization, lemmatization, bi-gram/tri-gram extraction etc. If you would like to learn more about all the text preprocessing features available in PyCaret, click here.
Kiva is an international non-profit founded in 2005 in San Francisco. Its mission is to expand financial access to underserved communities in order to help them thrive.
In this tutorial we will use the open dataset from Kiva which contains loan information on 6,818 approved loan applicants. The dataset includes information such as loan amount, country, gender and some text data which is the application submitted by the borrower.
Our objective is to analyze the text data in the ‘en’ column to find abstract topics and then use them to evaluate the effect of certain topics (or certain types of loans) on the default rate.
Now that you have set up the Anaconda Environment, understand topic modeling and have the business context for this tutorial, let’s get started.
The first step is importing the dataset into Power BI Desktop. You can load the data using a web connector. (Power BI Desktop → Get Data → From Web).
Link to csv file: https://raw.githubusercontent.com/pycaret/pycaret/master/datasets/kiva.csv
To train a topic model in Power BI we will have to execute a Python script in Power Query Editor (Power Query Editor → Transform → Run python script). Run the following code as a Python script:
There are 5 ready-to-use topic models available in PyCaret.
By default, PyCaret trains a **Latent Dirichlet Allocation (LDA) **model with 4 topics. Default values can be changed easily:
To change the model type use the ***model ***parameter within get_topics().
To change the number of topics, use the ***num_topics ***parameter.
See the example code for a Non-Negative Matrix Factorization model with 6 topics.
Output:
New columns containing topic weights are attached to the original dataset. Here’s how the final output looks like in Power BI once you apply the query.
Once you have topic weights in Power BI, here’s an example of how you can visualize it in dashboard to generate insights:
You can download the PBIX file and the data set from our GitHub.
If you would like to learn more about implementing Topic Modeling in Jupyter notebook using PyCaret, watch this 2 minute video tutorial:
If you are Interested in learning more about Topic Modeling, you can also checkout our NLP 101 Notebook Tutorial for beginners.
Follow our LinkedIn and subscribe to our Youtube channel to learn more about PyCaret.
User Guide / Documentation GitHub Repository Install PyCaret Notebook Tutorials Contribute in PyCaret
As of the first release 1.0.0, PyCaret has the following modules available for use. Click on the links below to see the documentation and working examples in Python.
Classification Regression Clustering Anomaly Detection Natural Language Processing Association Rule Mining
PyCaret getting started tutorials in Notebook:
Clustering Anomaly Detection Natural Language Processing Association Rule Mining Regression Classification
PyCaret is an open source project. Everybody is welcome to contribute. If you would like to contribute, please feel free to work on open issues. Pull requests are accepted with unit tests on dev-1.0.1 branch.
Please give us ⭐️ on our GitHub repo if you like PyCaret.
Medium : https://medium.com/@moez_62905/
LinkedIn : https://www.linkedin.com/in/profile-moez/
Twitter : https://twitter.com/moezpycaretorg1