Machine Learning in SQL using PyCaret Part I
Last updated
Last updated
This post is a step-by-step tutorial on how to train and deploy an Unsupervised Machine Learning Clustering model in SQL Server using **PyCaret(a low-code ML library in Python).**
Things we will cover in this article:
How to download and install SQL Server for free
How to create a new database and importing data into a database
How to enable and use Python scripting in database
How to train a clustering algorithm in order to assign cluster labels to each observation in the dataset
The go-to tools/ environments for performing ML experiments are Command-Line, IDEs, or Notebooks. However, such tools/environments may pose limitations when the data size gets very large, or when the ML model is required to be put in production. There has been a dire need to have the ability to programme and train models where data reside. MS SQL Server introduced this capability in their SQL Server version 2019. The distinct advantages of using SQL Server for Machine Learning are:
i. Extracting a large amount of data from the system is tedious and time-consuming. Conducting ML experiments on a server brings the code to data, rather than taking data to the code
ii. ML experiments are executed mostly in computer/cpu memory. Most of the machines hit a performance ceiling when training an ML algorithm on large data sets. ML on the SQL Server database avoids this
iii. It is easy to integrate and deploy ML Pipelines along with other ETL processes
SQL Server is a Microsoft relational database management system. As a database server, it performs the primary function of storing and retrieving data as requested by different applications. In this tutorial, we will use **SQL Server 2019 ****Developer** for machine learning by importing PyCaret library into SQL Server.
If you have used SQL Server before, it is likely that you have it installed and have access to the database. If not, **click here** to download SQL Server 2019 Developer or other edition.
Before using PyCaret functionality into SQL Server, you’ll need to install SQL Server and PyCaret. This is a multi-step process:
Download the SQL Server 2019 Developer Edition file “SQL2019-SSEI-Dev.exe”
Open the file and follow the instructions to install (recommended to use Custom install option)
Choose New SQL Server stand-alone installation
In the Instance Features option, select the features including “Python” under Machine Learning Services and Language Extensions and Machine Learning Server (Standalone)
Click “Accept” to provide consent to install Python
Installation may take 15–20 minutes
**Click here** or Open SQL Server Installation Center to download “SQL Server Management Tools” file “SSMS-Setup-ENU.exe”
Open “SSMS-Setup-ENU.exe” file to start the installation
Installation may take 5–10 minutes
Once you have everything installed, you will need to start an instance of the server. To do so, start SSMS. At the login stage, you’ll be asked to the name of the SQL Server that you can choose from the drop-down menu. Once a connection is established, you can see all the objects from the server. If you have downloaded SQL Server for the first time and you do not have a database to work with, you will need to create a new database first.
In the Object Explorer panel, right-click on Databases, and choose New Database
Enter the database name and other information
The setup may take 2–3 minutes including creating a database, user and setting ownership
You will now have to import a CSV file into a database using SQL Server Management Studio.
Create a table “jewellery” in the database
Right-click the database and select Tasks -> Import Data
For Data Source, select Flat File Source. Then use the Browse button to select the CSV file. Spend some time configuring the data import before clicking the **Next **button.
For Destination, select the correct database provider (e.g. SQL Server Native Client 11.0). Enter the Server name; check Use SQL Server Authentication, enter the Username, Password, and **Database **before clicking the **Next **button.
In the Select Source Tables and Views window, you can Edit Mappings before clicking the **Next **button.
Check Run immediately and click the **Next **button
Click the Finish button to run the package
We will run Python “inside” the SQL Server by using the **sp_execute_external_script **system stored procedure. To begin, you need to open a ‘New Query’. Execute the following query in your instance to enable the use of the procedure for remote script execution:
Note: Restart the instance before proceeding to the next steps.
Following SQL Statements can be executed to check the Python path and list installed packages.
Check Python Path:
List Installed Packages:
To install PyCaret package, open a command prompt and browse to the location of Python packages where SQL Server is installed. The default location is:
Navigate to “Scripts” directory and use pip command to install PyCaret package
Note: Make sure, you have access to the SQL Server directory to install package and/or change configurations. Otherwise, the package installation will fail.
Installation may take 5–10 minutes
Note: In case encounter issue about missing “lightgbm” module when running SQL script. Follow the instructions below:
i. Uninstall “lightgbm”
ii. Reinstall “lightgbm”
Execute the following SQL to verify the PyCaret installation from SQL Server:
Clustering is a machine learning technique that groups data points with similar characteristics. These groupings are useful for exploring data, identifying patterns, and analyzing a subset of data. Some common business use cases for clustering are:
✔ Customer segmentation for the purpose of marketing.
✔ Customer purchasing behaviour analysis for promotions and discounts.
✔ Identifying geo-clusters in an epidemic outbreak such as COVID-19.
In this tutorial, we will use the ‘**jewellery.csv’ **file that is available on PyCaret’s Github repository.
Run the following SQL code in SQL Server:
A new column ‘Cluster’ containing the label is attached to the original table.
By default, PyCaret trains a K-Means clustering model with 4 clusters (i.e. all the data points in the table are categorized into 4 groups). Default values can be changed easily:
To change the number of clusters you can use num_clusters parameter within get_clusters( ) function.
To change model type use model parameter within get_clusters( ).
See the following code for training K-Modes model with 6 clusters:
Following these steps, you can assign cluster value to every observation point in the jewellery dataset. You can use similar steps on other datasets too, to perform clustering on them.
In this post, we learnt how to build a clustering model using running a Python library (PyCaret) in SQL Server. Similarly, you can build and run other types of supervised and unsupervised ML models depending on the need of the business problem.
You can further check out the PyCaret website for documentation on other supervised and unsupervised experiments that can be implemented in a similar manner within SQL Server.
My future posts will be tutorials on exploring supervised learning techniques (regression/classification) using Python and Pycaret within a SQL Server.