# Machine Learning Model Deployment Using Streamlit

This article was published as a part of the Data Science Blogathon

Overview of Streamlit

If you have built ML models for real-time predictions and are wondering how to deploy them as web applications to increase their accessibility, you are in the right place: in this article you will see how to deploy models already built with Machine Learning or Deep Learning.

Article overview:

Understand the concept of Model Deployment.

Perform Model deployment using Streamlit for the dog-breed classifier.

Once you are done training the model, you have several options for deploying the project on the web, such as Flask, Django, and Streamlit.

Flask and Django are somewhat heavyweight, so understanding them well takes more than one article (we will cover them as well), but for now we will discuss Streamlit. So let's start with a question.

Why Streamlit?

Streamlit lets you create apps for your Machine Learning project using simple code.

It also supports hot-reloading that lets your app update live as you edit and save your file.

Using Streamlit, creating an app is very easy; adding a widget is as simple as declaring a variable.

There is no need to write a backend, define routes, or handle HTTP requests.

We will discuss how you can deploy a Deep Learning classifier using Streamlit. For this article, let's take a Dog Breed Classifier; you can check how to create a Dog Breed Classifier in the linked article.

Train your model and save feature_extractor.h5, dog_breed.h5, and dog_breeds_category.pickle.

feature_extractor.h5 is a saved model which will extract features from images,

dog_breed.h5 is another saved model which will be used for prediction.

dog_breeds_category.pickle will be used to convert class_num to class_label.
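To make the role of that pickle file concrete, here is a minimal sketch (the breed names are invented for illustration; the real file stores the class labels learned during training) of how an ordered label list is pickled and then used to convert a class index to a label:

```python
import os
import pickle
import tempfile

# Invented breed labels standing in for the real training classes
dog_breeds = ['beagle', 'boxer', 'pug']

# Save the label list the same way a training script might
path = os.path.join(tempfile.gettempdir(), 'dog_breeds_category.pickle')
with open(path, 'wb') as handle:
    pickle.dump(dog_breeds, handle)

# Load it back and convert a predicted class index to its label
with open(path, 'rb') as handle:
    loaded = pickle.load(handle)

class_num = 2                 # e.g. the argmax of the model's output
class_label = loaded[class_num]
print(class_label)            # pug
```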

Model Deployment Using Streamlit

Once you have all the required files, let's start with the Streamlit installation procedure and build a web application.

Installing Streamlit

pip install streamlit

Setting up the Project Structure for Model Deployment using Streamlit

Creating a Directory tree is not required but it is a good practice to organize your files and folders.


Start by creating a project_folder. Inside the project folder, create another folder named static and put all the downloaded files inside static; also create a folder named images inside static. Now create empty main.py and helper.py files and place them in the project directory.
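A quick sketch of that layout, created here under a temporary root so it is runnable anywhere (only the main.py, helper.py, static, and images names come from the text; the use of tempfile is illustrative):

```python
import os
import tempfile

# Build the described tree under a temporary root
project = os.path.join(tempfile.mkdtemp(), 'project_folder')
os.makedirs(os.path.join(project, 'static', 'images'))

# Empty starter files in the project directory
for fname in ('main.py', 'helper.py'):
    open(os.path.join(project, fname), 'w').close()

print(sorted(os.listdir(project)))   # ['helper.py', 'main.py', 'static']
```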

Create prediction Pipeline

Create a predictor function that takes an uploaded picture's path as input and gives the different dog breed classes as output.

The predictor function will handle all the image processing and model loading required for the prediction.

The predictor function will be coded in helper.py to keep our project structure ordered.

Let’s begin by loading all of the required libraries:

import cv2
import os
import numpy as np
import pickle
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import models, utils
import pandas as pd
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.python.keras import utils

Loading the saved model from the directory:

current_path = os.getcwd()  # getting the current path
dog_breeds_category_path = os.path.join(current_path, 'static/dog_breeds_category.pickle')
# loading class_to_num_category
predictor_model = load_model(r'static/dog_breed.h5')
with open(dog_breeds_category_path, 'rb') as handle:
    dog_breeds = pickle.load(handle)
# loading the feature extractor model
feature_extractor = load_model(r'static/feature_extractor.h5')

In the above code chunk, we are loading the different categories of Dog Breeds using the pickle file and then we are loading the weights file (.h5 file) that has training weights. Now you will be defining a predictor function that takes the image’s path as input and returns prediction.

def predictor(img_path):  # here image is file name
    img = load_img(img_path, target_size=(331, 331))
    img = img_to_array(img)
    img = np.expand_dims(img, axis=0)
    features = feature_extractor.predict(img)
    prediction = predictor_model.predict(features) * 100
    prediction = pd.DataFrame(np.round(prediction, 1), columns=dog_breeds).transpose()
    prediction.columns = ['values']
    prediction = prediction.nlargest(5, 'values')
    prediction = prediction.reset_index()
    prediction.columns = ['name', 'values']
    return prediction

In the above block of code, we have performed the following operations:-

First, the image path is passed to the predictor function.

Then it converts the image into a 4-D tensor array for prediction.

Then it passes the tensor to the feature_extractor model to extract features; these extracted features become the input for predictor_model.

Finally, it passes the extracted features to predictor_model, gets the final prediction, and converts it into a data frame to get the prediction in the desired format.

The predictor function returns the top 5 detected dog breeds with their prediction confidence in a data frame.

Now you have a function ready that takes an image path and returns a prediction, which we will call from our web app.
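The data-frame post-processing inside predictor can be sketched on a toy prediction row (the breed names and probabilities below are made up; the real columns come from dog_breeds_category.pickle):

```python
import numpy as np
import pandas as pd

# Invented class labels and a fake softmax output row (sums to 1)
dog_breeds = ['beagle', 'boxer', 'pug', 'husky', 'collie', 'akita']
prediction = np.array([[0.05, 0.40, 0.25, 0.15, 0.10, 0.05]]) * 100

# Same reshaping as predictor(): breeds as rows, keep the 5 largest
table = pd.DataFrame(np.round(prediction, 1), columns=dog_breeds).transpose()
table.columns = ['values']
table = table.nlargest(5, 'values').reset_index()
table.columns = ['name', 'values']
print(table)   # boxer (40.0) first, then pug, husky, collie, ...
```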

Creating Frontend

Our goal is to create a web app where we can upload a picture and then save that picture in the static/images directory for the prediction part.

Pipeline

Create an upload button and save uploaded pics in the directory.

The function predictor will take an uploaded image’s path as input and would give the output.

Show the uploaded image.

Show the top-5 predictions with their confidence percentage in a barplot.

After prediction, delete the uploaded picture from the directory

Frontend Streamlit code will be written in main.py, and to make use of the predictor function created in helper.py we need to import it into the main.py file. Let's check the code for the given pipeline.

from helper import *  # importing all the helper functions from helper.py
import streamlit as st
import os
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="darkgrid")
sns.set()
from PIL import Image

st.title('Dog Breed Classifier')

In the above code, we have first imported all the dependencies and then created an app with the title “Dog Breed Classifier”. It’s time to define a function to save uploaded images.

def save_uploaded_file(uploaded_file):
    try:
        with open(os.path.join('static/images', uploaded_file.name), 'wb') as f:
            f.write(uploaded_file.getbuffer())
        return 1
    except:
        return 0

This function saves the uploaded pics to the static/images folder.
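The same write-a-buffer logic can be exercised without Streamlit by letting io.BytesIO stand in for the UploadedFile object (the file name and bytes here are fabricated; both objects expose getbuffer()):

```python
import io
import os
import tempfile

upload_dir = tempfile.mkdtemp()                # stands in for 'static/images'
fake_upload = io.BytesIO(b'fake image bytes')  # stands in for the UploadedFile
fake_name = 'dog.png'                          # UploadedFile.name in the real app

# Mirror of save_uploaded_file(): write the buffer under the upload directory
saved_path = os.path.join(upload_dir, fake_name)
with open(saved_path, 'wb') as f:
    f.write(fake_upload.getbuffer())

print(os.path.exists(saved_path))              # True
```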

Create the upload button, display the uploaded image on the app, and call the predictor function we just created.

uploaded_file = st.file_uploader("Upload Image")  # text over upload button
if uploaded_file is not None:
    if save_uploaded_file(uploaded_file):
        # display the image
        display_image = Image.open(uploaded_file)
        st.image(display_image)
        prediction = predictor(os.path.join('static/images', uploaded_file.name))
        # deleting uploaded saved picture after prediction
        os.remove('static/images/' + uploaded_file.name)
        # drawing graphs
        st.text('Predictions :-')
        fig, ax = plt.subplots()
        ax = sns.barplot(y='name', x='values', data=prediction,
                         order=prediction.sort_values('values', ascending=False).name)
        ax.set(xlabel='Confidence %', ylabel='Breed')
        st.pyplot(fig)

Let’s discuss the above code:

You can write text anywhere in the program using the st.write() method.

os.remove() removes the uploaded file after prediction.

st.file_uploader(‘Upload Image’) creates an upload button.

Whatever file is uploaded will be passed to save_uploaded_file in order to save it.

For plotting the prediction bar-plot we are using seaborn in Streamlit.

sns.barplot() creates barplot.

Plotted bar-plot will be sorted according to their confidence percentage.
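The sorting used for the bar plot's order argument can be checked on a toy prediction frame (the names and percentages are invented):

```python
import pandas as pd

# Fake top-3 prediction frame in the same shape predictor() returns
prediction = pd.DataFrame({'name': ['pug', 'boxer', 'beagle'],
                           'values': [12.5, 60.0, 27.5]})

# Same expression passed to sns.barplot(order=...) in main.py
order = prediction.sort_values('values', ascending=False).name.tolist()
print(order)   # ['boxer', 'beagle', 'pug']
```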


Run Web App

Run the web app on your browser by running the command:-

streamlit run main.py

Here main.py is the file containing all the frontend code.


So far, we have built the web app using Streamlit, and it runs as a website on your local computer.

Model Deployment using Streamlit Over Internet

Deploying over the internet increases the accessibility of your application. After deployment, the app can be accessed from a mobile device or a computer anywhere in the world.

Streamlit gives you a Streamlit Share feature to deploy your Streamlit web app for free on the internet.

Streamlit Share takes care of the rest; you can see the deployment page on streamlit.io below.

I personally would not suggest deploying the model on Streamlit Share, as it is not very flexible.

You have multiple choices when it comes to hosting your model in the cloud; AWS, GCP, and Azure are some popular services nowadays.

Heroku is an online model-hosting service that lets you host the model in the cloud for free. Deploying apps on Heroku is more flexible; managing your app, package versions, and storage is a lot easier with Heroku.

Conclusion

Creating a Machine Learning model is not enough until you make it available for general use or to a specific client. If you are working for a client, you will probably deploy the model in the client's environment; but if your project needs to be publicly available, you should deploy it on the web. Streamlit is the best lightweight technology for web deployment.

With this, we come to the end of the article. I hope you enjoyed it and can now start creating beautiful apps yourself.

The whole code and architecture can be downloaded from this link.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.


Building Machine Learning Model Is Fun Using Orange

Introduction

With the growing need for data science, we need tools that take the difficulty out of doing data science and make it fun. Not everyone is willing to learn coding, even if they want to learn and apply data science. This is where GUI-based tools can come in handy.

Today, I will introduce you to another GUI based tool – Orange. This tool is great for beginners who wish to visualize patterns and understand their data without really knowing how to code.

In my previous article, I presented you with another GUI based tool KNIME. If you do not want to learn to code but still apply data science, you can try out any of these tools.

By the end of this tutorial, you’ll be able to predict which person out of a certain set of people is eligible for a loan with Orange!

Table of Contents:

Why Orange?

Setting up your System:

Creating your first Workflow

Familiarizing yourself with the basics

Problem Statement

Importing the data files

Understanding the data

How do you clean your data?

Training your first model

1. Why Orange?

Orange is a platform built for data mining and analysis via a GUI-based workflow. This means you do not have to know how to code to work with Orange: you can mine data, crunch numbers, and derive insights without writing a line.

You can perform tasks ranging from basic visuals to data manipulations, transformations, and data mining. It consolidates all the functions of the entire process into a single workflow.

The best part and the differentiator about Orange is that it has some wonderful visuals. You can try silhouettes, heat-maps, geo-maps and all sorts of visualizations available.

2. Setting up your System

Orange comes built-in with the Anaconda tool if you’ve previously installed it. If not, follow these steps to download Orange.

Step 2: Install the platform and set the working directory for Orange to store its files.

This is what the start-up page of Orange looks like. You have options that allow you to create new projects, open recent ones or view examples and get started.

Before we delve into how Orange works, let’s define a few key terms to help us in our understanding:

A widget is the basic processing point of any data manipulation. It can do a number of actions based on what you choose in your widget selector on the left of the screen.

A workflow is the sequence of steps or actions that you take in your platform to accomplish a particular task.

You can also go to “Example Workflows” on your start-up screen to check out more workflows once you have created your first one.

3. Creating Your First Workflow

This is your blank Workflow on Orange. Now, you’re ready to explore and solve any problem by dragging any widget from the widget menu to your workflow.

4. Familiarizing yourself with the basics

Orange is a platform that can help us solve most problems in Data Science today, covering topics that range from the most basic visualizations to training models. You can even evaluate models and perform unsupervised learning on datasets:

4.1 Problem

The problem we’re looking to solve in this tutorial is the practice problem Loan Prediction that can be accessed via this link on Datahack.

4.2 Importing the data files

We begin with the first and necessary step to understanding our data and making predictions: importing our data.

Step 3: Once you can see the structure of your dataset using the widget, go back by closing this menu.

Neat! Isn’t it?

Let’s now visualize some columns to find interesting patterns in our data.

4.3 Understanding our Data

The plot I’ve explored is a Gender by Income plot, with the colors set to the education levels. As we can see in males, the higher income group naturally belongs to the Graduates!

Although in females, we see that a lot of the graduate females are earning low or almost nothing at all. Any specific reason? Let’s find out using the scatterplot.

One possible reason I found was marriage. A huge number of graduates who were married were found to be in lower income groups; this may be due to family responsibilities or added effort. Makes perfect sense, right?

4.3.2 Distribution

What we see is a very interesting distribution: our dataset contains more married males than married females.

4.3.3 Sieve diagram

Let’s visualize using a sieve diagram.

This plot divides the sections of distribution into 4 bins. The sections can be investigated by hovering the mouse over it.

Let’s now look at how to clean our data to start building our model.

5. How do you clean your data?

Here, for cleaning purposes, we will impute missing values. Imputation is a very important step in understanding and making the best use of our data.

Here, I have selected the default method to be Average for numerical values and Most Frequent for text based values (categorical).

You can select from a variety of imputations like:

Distinct Value

Random Values

Remove the rows with missing values

Model-Based

6. Training your First Model

Beginning with the basics, we will first train a linear model encompassing all the features just to understand how to select and build models.

Step 1: First, we need to set a target variable to apply Logistic Regression on it.

Step 4: Once we have set our target variable, connect the clean data from the “Impute” widget as follows and place the “Logistic Regression” widget.

Ridge Regression:

Performs L2 regularization, i.e. adds a penalty equivalent to the square of the magnitude of the coefficients

Minimization objective = LS Obj + α * (sum of squares of coefficients)

Lasso Regression:

Performs L1 regularization, i.e. adds a penalty equivalent to the absolute value of the magnitude of the coefficients

Minimization objective = LS Obj + α * (sum of absolute values of coefficients)
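A tiny numeric sketch of the two penalty terms above (the coefficients and α below are arbitrary numbers, purely for illustration):

```python
# Arbitrary coefficients and regularization strength
coefs = [2.0, -3.0, 0.5]
alpha = 0.1

# Ridge adds alpha * sum of squared coefficients
l2_penalty = alpha * sum(c ** 2 for c in coefs)   # 0.1 * 13.25 = 1.325
# Lasso adds alpha * sum of absolute coefficients
l1_penalty = alpha * sum(abs(c) for c in coefs)   # 0.1 * 5.5  = 0.55

print(l2_penalty, l1_penalty)
```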

I have chosen Ridge for my analysis, you are free to choose between the two.

Step 8: To visualize the results better, drag and drop from the “Test and Score” widget to find the “Confusion Matrix” widget.

This way, you can test out different models and see how accurately they perform.

Let’s evaluate how a Random Forest would do. Change the modeling method to Random Forest and look at the confusion matrix.

Looks decent, but the Logistic Regression performed better.

We can try again with a Support Vector Machine.

Better than the Random Forest, but still not as good as the Logistic Regression model.

Sometimes the simpler methods are the better ones, isn’t it?

This is how your final workflow would look after you are done with the complete process.

For people who wish to work in groups, you can also export your workflows and send them to friends who can work alongside you!

The resulting file is of the (.ows) extension and can be opened in any other Orange setup.

End Notes

Orange is a platform that can be used for almost any kind of analysis but most importantly, for beautiful and easy visuals. In this article, we explored how to visualize a dataset. Predictive modeling was undertaken as well, using a logistic regression predictor, SVM, and a random forest predictor to find loan statuses for each person accordingly.

Hope this tutorial has helped you figure out aspects of the problem that you might not have understood or missed out on before. It is very important to understand the data science pipeline and the steps we take to train a model, and this should surely help you build better predictive models soon!


Why Is Machine Learning Important?

Machine learning can be considered a component of artificial intelligence and involves training the machine to be more intelligent in its operations. AI technology focuses on incorporating human intelligence while machine learning is focused on making the machines learn faster. So we can say that machine learning engineers can provide faster and better optimizations to AI solutions.

AI technology has had a massive impact on society and has transformed almost every industrial sector from planning to production. Thus machine learning engineers and experts are also of great value to this growing industry.

Why is Machine Learning So Useful?

Machine learning is comparatively new, yet it has existed for many years. Having recently gained a lot of attention, it is essential to many significant technological improvements.

When it comes to business operations, you can access a lot of data with the help of machine learning algorithms. Machine learning also offers more affordable data storage options that have made big data sets possible and accessible for organizations. It has also helped maximize the processing power of computers to be able to perform calculations and operations faster.

Wherever you find AI technology, you will find machine learning experts working to improve the efficiency and results of the AI technologies and machines involved.

Where can Machine Learning be Applied?

Machine learning has a lot of applications in a variety of tasks and operations. It plays a central role in collecting, analyzing, and processing large sets of data. It is not restricted to businesses and organizations; you have already interacted with it, though you might not be aware that you have been using machine learning technology. Here are a few examples from our daily lives.

Machine learning solutions are being incorporated into the medical sciences for better detection and diagnosis of diseases. Here is the interesting part: machine learning can even be used to keep a check on a person's emotional state with the help of a smartphone.

This technology is also widely used by manufacturers to minimize losses during operations and maximize production while reducing maintenance costs through timely predictions.

The banking industry is also utilizing machine learning to identify any fraudulent practices or transactions to avoid losses. Machine learning can also be used to give significant insights into financial data. This in turn results in better investments and better trades.

When it comes to transportation, the self-driving cars of Google and Tesla are powered by machine learning. Thus it can be extremely beneficial for autonomous driving and better interpretation.

What do Machine Learning Engineers do?

Why Pursue a Career in Machine Learning?

There are many reasons to pursue a career in machine learning. It is not only popular and in high demand, but also an interesting discipline where you can be innovative once you have acquired the necessary skills.

Wrapping Up

The aforementioned discussion describes the significant role of the growing machine learning and AI technology in the industrial and business sector and why you should consider pursuing a career in it.

10 Automated Machine Learning For Supervised Learning (Part 2)

This article was published as a part of the Data Science Blogathon

Introduction

This post will discuss 10 Automated Machine Learning (autoML) packages that we can run in Python. If you are tired of running lots of Machine Learning algorithms just to find the best one, this post might be what you are looking for. This post is the second part of this first post. The first part explains the general concept of Machine Learning from defining the objective, pre-processing, model creation and selection, hyperparameter-tuning, and model evaluation. At the end of that post, Auto-Sklearn is introduced as an autoML. If you are already familiar with Machine Learning, you can skip that part 1.

The main point of the first part is that we require a relatively long time and many lines of code to run all of the classification or regression algorithms before finally selecting or ensembling the best models. Instead, we can run AutoML to automatically search for the best model with a much shorter time and, of course, less code. Please find my notebooks for conventional Machine Learning algorithms for regression (predicting house prices) and classification (predicting poverty level classes) tasks in the table below. My AutoML notebooks are also in the table below. Note that this post will focus only on regression and classification AutoML while AutoML also can be applied for image, NLP or text, and time series forecasting.

Now, we will start discussing the 10 AutoML packages that can replace those long notebooks.

AutoSklearn

This autoML, as mentioned above, has been discussed before. Let’s do a quick recap. Below is the code for searching for the best model in 3 minutes. More details are explained in the first part.

Regression

!apt install -y build-essential swig curl
!pip install auto-sklearn
!pip install scipy==1.7.0
!pip install -U scikit-learn
from autosklearn.regression import AutoSklearnRegressor
from sklearn.metrics import mean_squared_error as MSE, mean_absolute_error

Running AutoSklearn

# Create the model
sklearn = AutoSklearnRegressor(time_left_for_this_task=3*60, per_run_time_limit=30, n_jobs=-1)
# Fit the training data
sklearn.fit(X_train, y_train)
# Sprint Statistics
print(sklearn.sprint_statistics())
# Predict the validation data
pred_sklearn = sklearn.predict(X_val)
# Compute the RMSE
rmse_sklearn = MSE(y_val, pred_sklearn)**0.5
print('RMSE: ' + str(rmse_sklearn))

Output:

auto-sklearn results:
  Dataset name: c40b5794-fa4a-11eb-8116-0242ac130202
  Metric: r2
  Best validation score: 0.888788
  Number of target algorithm runs: 37
  Number of successful target algorithm runs: 23
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 8
  Number of target algorithms that exceeded the memory limit: 6

RMSE: 27437.715258009852

Result:

# Scatter plot true and predicted values
plt.scatter(pred_sklearn, y_val, alpha=0.2)
plt.xlabel('predicted')
plt.ylabel('true value')
plt.text(100000, 400000, 'RMSE: ' + str(round(rmse_sklearn)))
plt.text(100000, 350000, 'MAE: ' + str(round(mean_absolute_error(y_val, pred_sklearn))))
plt.text(100000, 300000, 'R: ' + str(round(np.corrcoef(y_val, pred_sklearn)[0,1], 4)))
plt.show()

 Fig. 1 AutoSklearn Regression Result

Classification

Applying AutoML

!apt install -y build-essential swig curl
from autosklearn.classification import AutoSklearnClassifier
# Create the model
sklearn = AutoSklearnClassifier(time_left_for_this_task=3*60, per_run_time_limit=15, n_jobs=-1)
# Fit the training data
sklearn.fit(X_train, y_train)
# Sprint Statistics
print(sklearn.sprint_statistics())
# Predict the validation data
pred_sklearn = sklearn.predict(X_val)
# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_sklearn)))

Output:

auto-sklearn results:
  Dataset name: 576d4f50-c85b-11eb-802c-0242ac130202
  Metric: accuracy
  Best validation score: 0.917922
  Number of target algorithm runs: 40
  Number of successful target algorithm runs: 8
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 28
  Number of target algorithms that exceeded the memory limit: 4

Accuracy: 0.923600209314495

Result:

# Prediction results
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_val, pred_sklearn), index=[1,2,3,4], columns=[1,2,3,4]))
print('')
print('Classification Report')
print(classification_report(y_val, pred_sklearn))

Output:

Confusion Matrix
      1    2    3     4
1   123   14    3    11
2    17  273   11    18
3     3   17  195    27
4     5   15    5  1174

Classification Report
              precision    recall  f1-score   support
           1       0.83      0.81      0.82       151
           2       0.86      0.86      0.86       319
           3       0.91      0.81      0.86       242
           4       0.95      0.98      0.97      1199

    accuracy                           0.92      1911
   macro avg       0.89      0.86      0.88      1911
weighted avg       0.92      0.92      0.92      1911

Tree-based Pipeline Optimization Tool (TPOT)

TPOT is built on top of scikit-learn. TPOT uses a genetic algorithm to search for the best model according to the “generations” and “population size” parameters. The higher these two parameters are set, the longer it will take. Unlike AutoSklearn, we do not set a specific running time for TPOT. As its name suggests, after TPOT is run it exports lines of code containing a pipeline, from importing packages, splitting the dataset, creating the tuned model, and fitting the model, to finally predicting the validation dataset. The pipeline is exported in .py format.

In the code below, I set generations and population_size to 5. The output gives 5 generations with increasing “scoring”. I set the scoring to “neg_mean_absolute_error” and “accuracy” for the regression and classification tasks respectively. Neg_mean_absolute_error means Mean Absolute Error (MAE) in negative form. The algorithm chooses the highest scoring value, so negating the MAE makes the algorithm select the MAE closest to zero.
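The sign trick can be verified on the absolute errors of two hypothetical models (all numbers invented):

```python
# Absolute errors of two hypothetical models on three samples
errors_a = [3.0, 1.0, 2.0]
errors_b = [0.5, 1.5, 1.0]

mae_a = sum(errors_a) / len(errors_a)   # 2.0
mae_b = sum(errors_b) / len(errors_b)   # 1.0

# Maximizing the negated MAE picks the model whose MAE is closest to zero
best = max([('model_a', -mae_a), ('model_b', -mae_b)], key=lambda t: t[1])
print(best[0])   # model_b
```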

Regression

from tpot import TPOTRegressor
# Create model
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=123)
tpot = TPOTRegressor(generations=5, population_size=5, cv=cv,
                     scoring='neg_mean_absolute_error', verbosity=2,
                     random_state=123, n_jobs=-1)
# Fit the training data
tpot.fit(X_train, y_train)
# Export the result
tpot.export('tpot_model.py')

Output:

Generation 1 - Current best internal CV score: -20390.588131563232
Generation 2 - Current best internal CV score: -19654.82630417806
Generation 3 - Current best internal CV score: -19312.09139004322
Generation 4 - Current best internal CV score: -19312.09139004322
Generation 5 - Current best internal CV score: -18752.921100941825

Best pipeline: RandomForestRegressor(input_matrix, bootstrap=True, max_features=0.25, min_samples_leaf=3, min_samples_split=2, n_estimators=100)

Classification

from tpot import TPOTClassifier
# TPOT can be stopped earlier; it still gives the temporary best pipeline.
# Create the model
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=123)
tpot = TPOTClassifier(generations=5, population_size=5, cv=cv, scoring='accuracy',
                      verbosity=2, random_state=123, n_jobs=-1)
# Fit the training data
tpot.fit(X_train, y_train)
# Export the result
tpot.export('tpot_model.py')

Output:

Generation 1 - Current best internal CV score: 0.7432273262661955
Generation 2 - Current best internal CV score: 0.843824979278454
Generation 3 - Current best internal CV score: 0.8545565589146273
Generation 4 - Current best internal CV score: 0.8545565589146273
Generation 5 - Current best internal CV score: 0.859616978580465

Best pipeline: RandomForestClassifier(GradientBoostingClassifier(input_matrix, learning_rate=0.001, max_depth=2, max_features=0.7000000000000001, min_samples_leaf=1, min_samples_split=19, n_estimators=100, subsample=0.15000000000000002), bootstrap=True, criterion=gini, max_features=0.8500000000000001, min_samples_leaf=4, min_samples_split=12, n_estimators=100)

TPOT gives RandomForestRegressor for the regression task. As for the classification, it gives a stacking of GradientBoostingClassifier and RandomForestClassifier. All algorithms already have their hyperparameters tuned.

Here is how to check the scoring metrics on the validation data.

Regression

pred_tpot = results
# Scatter plot true and predicted values
plt.scatter(pred_tpot, y_val, alpha=0.2)
plt.xlabel('predicted')
plt.ylabel('true value')
plt.text(100000, 400000, 'RMSE: ' + str(round(MSE(y_val, pred_tpot)**0.5)))
plt.text(100000, 350000, 'MAE: ' + str(round(mean_absolute_error(y_val, pred_tpot))))
plt.text(100000, 300000, 'R: ' + str(round(np.corrcoef(y_val, pred_tpot)[0,1], 4)))
plt.show()

Output:

Fig. 2 TPOT regression result.

Classification

pred_tpot = results
# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_tpot)))
print('')
# Prediction results
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_val, pred_tpot), index=[1,2,3,4], columns=[1,2,3,4]))
print('')
print('Classification Report')
print(classification_report(y_val, pred_tpot))

Output:

Accuracy: 0.9246467817896389

Confusion Matrix
      1    2    3     4
1   117   11    7    16
2     6  288   10    15
3     2   18  186    36
4     5   12    6  1176

Classification Report
              precision    recall  f1-score   support
           1       0.90      0.77      0.83       151
           2       0.88      0.90      0.89       319
           3       0.89      0.77      0.82       242
           4       0.95      0.98      0.96      1199

    accuracy                           0.92      1911
   macro avg       0.90      0.86      0.88      1911
weighted avg       0.92      0.92      0.92      1911

Distributed Asynchronous Hyper-parameter Optimization (Hyperopt)

Hyperopt is usually used to optimize the hyperparameters of one model that has already been specified. For example, we decide to apply Random Forest and then run hyperopt to find the optimal hyperparameters for it; my previous post discussed that. This post is different in that it uses hyperopt to search for the best Machine Learning model automatically, not just to tune the hyperparameters. The code is similar but different.

The code below shows how to use hyperopt to run AutoML. A max evaluation of 50 and a trial timeout of 20 seconds are set. These determine how long the AutoML will work. As in TPOT, we do not set an overall time limit in hyperopt.

Regression

from hpsklearn import HyperoptEstimator
from hpsklearn import any_regressor
from hpsklearn import any_preprocessing
from hyperopt import tpe
from sklearn.metrics import mean_squared_error
# Create the model
hyperopt = HyperoptEstimator(regressor=any_regressor('reg'),
                             preprocessing=any_preprocessing('pre'),
                             loss_fn=mean_squared_error,
                             algo=tpe.suggest, max_evals=50, trial_timeout=20)
# Fit the data
hyperopt.fit(X_train, y_train)

Classification

from hpsklearn import HyperoptEstimator
from hpsklearn import any_classifier
from hpsklearn import any_preprocessing
from hyperopt import tpe
# Create the model
hyperopt = HyperoptEstimator(classifier=any_classifier('cla'),
                             preprocessing=any_preprocessing('pre'),
                             algo=tpe.suggest, max_evals=50, trial_timeout=30)
# Fit the training data
hyperopt.fit(X_train_ar, y_train_ar)

In the Kaggle notebook (linked in the table above), after every fit-and-predict cycle I show the validation results as a scatter plot, a confusion matrix, and a classification report. The code is almost always the same, with only small adjustments, so from this point onwards I will not repeat it in this post; the Kaggle notebook provides it.
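Since that evaluation code is omitted from here on, below is a minimal sketch of the repeated classification-evaluation step (a helper of my own, not copied from the notebook; it assumes the true labels and a prediction array are already defined):

```python
# Sketch of the evaluation step repeated after each AutoML fit.
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

def evaluate_classifier(y_val, pred, labels=(1, 2, 3, 4)):
    """Print the accuracy, confusion matrix, and classification report."""
    print('Accuracy: ' + str(accuracy_score(y_val, pred)))
    print('Confusion Matrix')
    print(pd.DataFrame(confusion_matrix(y_val, pred, labels=list(labels)),
                       index=labels, columns=labels))
    print('Classification Report')
    print(classification_report(y_val, pred))
```

The regression variant simply swaps these metrics for a scatter plot of true versus predicted values.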

To see which algorithm the AutoML search selected, use the code below. The results are ExtraTreesClassifier and XGBRegressor. Observe that it also searches over preprocessing techniques, such as the standard scaler and the normalizer.

# Show the models
print(hyperopt.best_model())

Regression

{'learner': XGBRegressor(base_score=0.5, booster='gbtree',
                         colsample_bylevel=0.6209369845565308,
                         colsample_bynode=1,
                         colsample_bytree=0.6350745975782562,
                         gamma=0.07330922089021298, gpu_id=-1,
                         importance_type='gain', interaction_constraints='',
                         learning_rate=0.0040826994703554555,
                         max_delta_step=0, max_depth=10, min_child_weight=1,
                         missing=nan, monotone_constraints='()',
                         n_estimators=2600, n_jobs=1, num_parallel_tree=1,
                         objective='reg:linear', random_state=3,
                         reg_alpha=0.4669165283261672,
                         reg_lambda=2.2280355282357056, scale_pos_weight=1,
                         seed=3, subsample=0.7295609371405459,
                         tree_method='exact', validate_parameters=1,
                         verbosity=None),
 'preprocs': (Normalizer(norm='l1'),),
 'ex_preprocs': ()}

Classification

{'learner': ExtraTreesClassifier(bootstrap=True, max_features='sqrt',
                                 n_estimators=308, n_jobs=1, random_state=1,
                                 verbose=False),
 'preprocs': (StandardScaler(with_std=False),),
 'ex_preprocs': ()}

AutoKeras

AutoKeras, as you might guess, is an AutoML specializing in Deep Learning, or neural networks. The “Keras” in the name gives the clue. AutoKeras helps find the best neural network architecture and hyperparameters for the prediction model. Unlike the other AutoML packages, AutoKeras does not consider tree-based, distance-based, or other Machine Learning algorithms.

Deep Learning is challenging not only for hyperparameter tuning but also for setting the architecture. Many ask how many neurons or layers are best to use. There is no clear answer. Conventionally, users must run and evaluate their Deep Learning architectures one by one before finally deciding which is best, which takes a long time and considerable resources. I wrote a post describing this here. AutoKeras can solve this problem.

To apply AutoKeras, I set max_trials to 8, so it will try to find the best deep learning architecture within a maximum of 8 trials. The number of epochs set while fitting the training dataset also affects the accuracy of the model.

Regression

!pip install autokeras
import autokeras

# Create the model
keras = autokeras.StructuredDataRegressor(max_trials=8)

# Fit the training dataset
keras.fit(X_train, y_train, epochs=100)

# Predict the validation data
pred_keras = keras.predict(X_val)

Classification

!pip install autokeras
import autokeras

# Create the model
keras = autokeras.StructuredDataClassifier(max_trials=8)

# Fit the training dataset
keras.fit(X_train, y_train, epochs=100)

# Predict the validation data
pred_keras = keras.predict(X_val)

# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_keras)))

To find the architecture of the AutoKeras search, use the following code.

# Show the built models
keras_export = keras.export_model()
keras_export.summary()

Regression

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 20)]              0
multi_category_encoding (Mul (None, 20)                0
dense (Dense)                (None, 512)               10752
re_lu (ReLU)                 (None, 512)               0
dense_1 (Dense)              (None, 32)                16416
re_lu_1 (ReLU)               (None, 32)                0
regression_head_1 (Dense)    (None, 1)                 33
=================================================================
Total params: 27,201
Trainable params: 27,201
Non-trainable params: 0

Classification

Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 71)]              0
multi_category_encoding (Mul (None, 71)                0
normalization (Normalization (None, 71)                143
dense (Dense)                (None, 512)               36864
re_lu (ReLU)                 (None, 512)               0
dense_1 (Dense)              (None, 32)                16416
re_lu_1 (ReLU)               (None, 32)                0
dropout (Dropout)            (None, 32)                0
dense_2 (Dense)              (None, 4)                 132
classification_head_1 (Softm (None, 4)                 0
=================================================================
Total params: 53,555
Trainable params: 53,412
Non-trainable params: 143

MLJAR

MLJAR is another great AutoML, and you will soon see why. To run MLJAR, I assign the arguments mode, eval_metric, total_time_limit, and features_selection. MLJAR infers whether the task is regression or classification from the eval_metric. The total_time_limit is how long, in seconds, we allow MLJAR to run; here, it takes 300 seconds, or 5 minutes, to find the best possible model. We can also specify whether to allow feature selection. The output then reports the algorithms used and how long each took to finish.

Regression

from supervised.automl import AutoML

# Create the model
mljar = AutoML(mode="Compete",
               eval_metric="rmse",
               total_time_limit=300,
               features_selection=True)

# Fit the training data
mljar.fit(X_train, y_train)

# Predict the validation data
mljar_pred = mljar.predict(X_val)

Classification

from supervised.automl import AutoML

# Create the model
mljar = AutoML(mode="Compete",
               eval_metric="accuracy",
               total_time_limit=300,
               features_selection=True)

# Fit the training data
mljar.fit(X_train, y_train)

# Predict the validation data
mljar_pred = mljar.predict(X_val)

The argument mode lets us tell MLJAR what we expect from it. There are 4 modes defining the purpose of the run. In the example code above, the “Compete” mode is used to win a competition by finding the best model through tuning and ensembling. The “Optuna” mode is used to find the best-tuned model with unlimited computation time. The “Perform” mode builds a Machine Learning pipeline for production. The “Explain” mode is used for data explanation.
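As a quick reference, the intent of the four modes can be summarized in a small lookup (a toy helper of my own, not part of the MLJAR API):

```python
# Summary of the four MLJAR modes described above (plain reference mapping,
# not a call into the mljar-supervised library).
MLJAR_MODES = {
    "Explain": "fast run with explanations, for initial data understanding",
    "Perform": "balanced speed vs. accuracy, for production pipelines",
    "Compete": "heavy tuning plus ensembling/stacking, for competitions",
    "Optuna":  "Optuna-driven search for the best-tuned model",
}

def pick_mode(goal: str) -> str:
    """Toy helper: choose a mode from a stated goal."""
    return {"eda": "Explain", "production": "Perform",
            "competition": "Compete", "tuning": "Optuna"}.get(goal, "Perform")
```

For example, `pick_mode("competition")` returns `"Compete"`, the mode used in the code above.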

The result of MLJAR is automatically reported and visualized. Unfortunately, Kaggle does not display the report after saving, so below is how it should look. The report compares the MLJAR results for every algorithm. We can see that the ensemble methods have the lowest MSE for the regression task and the highest accuracy for the classification task. Increasing the number of iterations lowers the MSE for the regression task and improves the accuracy of the classification task. (The leaderboard tables below actually have more rows, but they were cut.)

Regression

Fig. 3 MLJAR Report for regression

Fig. 4 MLJAR report for regression

Fig. 5 MLJAR report for classification (1). Fig. 6 MLJAR report for classification (2). 

 AutoGluon

AutoGluon requires users to format the training dataset using TabularDataset to recognize it. Users can then specify the time_limit allocation for AutoGluon to work. In the example code below, I set it to be 120 seconds or 2 minutes.

Regression

!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0"
!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor

# Prepare the data
Xy_train = X_train.reset_index(drop=True)
Xy_train['Target'] = y_train
Xy_val = X_val.reset_index(drop=True)
Xy_val['Target'] = y_val
X_train_gluon = TabularDataset(Xy_train)
X_val_gluon = TabularDataset(Xy_val)

# Fit the training data
gluon = TabularPredictor(label='Target').fit(X_train_gluon, time_limit=120)

# Predict the validation data
gluon_pred = gluon.predict(X_val)

Classification

!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0"
!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor

# Prepare the data
Xy_train = X_train.reset_index(drop=True)
Xy_train['Target'] = y_train
Xy_val = X_val.reset_index(drop=True)
Xy_val['Target'] = y_val
X_train_gluon = TabularDataset(Xy_train)
X_val_gluon = TabularDataset(Xy_val)

# Fit the training data
gluon = TabularPredictor(label='Target').fit(X_train_gluon, time_limit=120)

# Predict the validation data
gluon_pred = gluon.predict(X_val)

After finishing the task, AutoGluon reports the accuracy of each Machine Learning algorithm it has tried. The report is called a leaderboard. The table below actually has more columns, but I cut them for this post.

# Show the models
leaderboard = gluon.leaderboard(X_train_gluon)
leaderboard

Regression

model score_test score_val pred_time_test . . .

0 RandomForestMSE -15385.131260 -23892.159881 0.133275 . . .

1 ExtraTreesMSE -15537.139720 -24981.601931 0.137063 . . .

2 LightGBMLarge -17049.125557 -26269.841824 0.026560 . . .

3 XGBoost -18142.996982 -23573.451829 0.054067 . . .

4 KNeighborsDist -18418.785860 -41132.826848 0.135036 . . .

5 CatBoost -19585.309377 -23910.403833 0.004854 . . .

6 WeightedEnsemble_L2 -20846.144676 -22060.013365 1.169406 . . .

7 LightGBM -23615.121228 -23205.065207 0.024396 . . .

8 LightGBMXT -25261.893395 -24608.580984 0.015091 . . .

9 NeuralNetMXNet -28904.712029 -24104.217749 0.819149 . . .

10 KNeighborsUnif -39243.784302 -39545.869493 0.132839 . . .

11 NeuralNetFastAI -197411.475391 -191261.448480 0.070965 . . .

Classification

model score_test score_val pred_time_test . . .

0 WeightedEnsemble_L2 0.986651 0.963399 3.470253 . . .

1 LightGBM 0.985997 0.958170 0.600316 . . .

2 XGBoost 0.985997 0.956863 0.920570 . . .

3 RandomForestEntr 0.985866 0.954248 0.366476 . . .

4 RandomForestGini 0.985735 0.952941 0.397669 . . .

5 ExtraTreesEntr 0.985735 0.952941 0.398659 . . .

6 ExtraTreesGini 0.985735 0.952941 0.408386 . . .

7 KNeighborsDist 0.985473 0.950327 2.013774 . . .

8 LightGBMXT 0.984034 0.951634 0.683871 . . .

9 NeuralNetFastAI 0.983379 0.947712 0.340936 . . .

10 NeuralNetMXNet 0.982332 0.956863 2.459954 . . .

11 CatBoost 0.976574 0.934641 0.044412 . . .

12 KNeighborsUnif 0.881560 0.769935 1.970972 . . .

13 LightGBMLarge 0.627143 0.627451 0.014708 . . .

H2O

Similar to AutoGluon, H2O requires the training dataset in a certain format, called H2OFrame. To decide how long H2O will work, either max_runtime_secs or max_models must be specified. The names explain what they mean.

Regression

import h2o
from h2o.automl import H2OAutoML
h2o.init()

# Prepare the data
Xy_train = X_train.reset_index(drop=True)
Xy_train['SalePrice'] = y_train.reset_index(drop=True)
Xy_val = X_val.reset_index(drop=True)
Xy_val['SalePrice'] = y_val.reset_index(drop=True)

# Convert to H2O Frames
Xy_train_h2o = h2o.H2OFrame(Xy_train)
X_val_h2o = h2o.H2OFrame(X_val)

# Create the model
h2o_model = H2OAutoML(max_runtime_secs=120, seed=123)

# Fit the model
h2o_model.train(x=Xy_train_h2o.columns, y='SalePrice', training_frame=Xy_train_h2o)

# Predict the validation data
h2o_pred = h2o_model.predict(X_val_h2o)

Classification

import h2o
from h2o.automl import H2OAutoML
h2o.init()

# Convert to H2O Frames
Xy_train_h2o = h2o.H2OFrame(Xy_train)
X_val_h2o = h2o.H2OFrame(X_val)
Xy_train_h2o['Target'] = Xy_train_h2o['Target'].asfactor()

# Create the model
h2o_model = H2OAutoML(max_runtime_secs=120, seed=123)

# Fit the model
h2o_model.train(x=Xy_train_h2o.columns, y='Target', training_frame=Xy_train_h2o)

# Predict the validation data
h2o_pred = h2o_model.predict(X_val_h2o)
h2o_pred

For the classification task, the prediction output gives the probability value for each class alongside the predicted label. Below is an example of the classification output.

predict p1 p2 p3 p4

4 0.0078267 0.0217498 0.0175197 0.952904

4 0.00190617 0.00130162 0.00116375 0.995628

4 0.00548938 0.0156449 0.00867845 0.970187

3 0.00484961 0.0161661 0.970052 0.00893224

2 0.0283297 0.837641 0.0575789 0.0764503

3 0.00141621 0.0022694 0.992301 0.00401299

4 0.00805432 0.0300103 0.0551097 0.906826
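If you only need hard labels from a probability table like the one above, pandas can reduce it in one line. This is a sketch of my own: it assumes the H2O frame has first been converted to pandas (e.g. via `h2o_pred.as_data_frame()`) with the column layout shown.

```python
# Reduce per-class probabilities (columns p1..p4) to hard labels with pandas.
import pandas as pd

# Stand-in for h2o_pred.as_data_frame() with two of the rows shown above.
probs = pd.DataFrame(
    [[0.0078, 0.0217, 0.0175, 0.9529],
     [0.0048, 0.0162, 0.9701, 0.0089]],
    columns=['p1', 'p2', 'p3', 'p4'])

# idxmax over the probability columns gives 'p1'..'p4'; strip the 'p' prefix.
labels = probs.idxmax(axis=1).str.lstrip('p').astype(int)
```

Here `labels` comes out as 4 for the first row and 3 for the second, matching the `predict` column H2O itself reports.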

H2O reports its result by a simple table showing various scoring metrics of each Machine Learning algorithm.

# Show the model results
leaderboard_h2o = h2o.automl.get_leaderboard(h2o_model, extra_columns='ALL')
leaderboard_h2o

Regression output:

model_id mean_residual_deviance rmse mse mae rmsle …

GBM_grid__1_AutoML_20240811_022746_model_17 8.34855e+08 28893.9 8.34855e+08 18395.4 0.154829 …

GBM_1_AutoML_20240811_022746 8.44991e+08 29068.7 8.44991e+08 17954.1 0.149824 …

StackedEnsemble_BestOfFamily_AutoML_20240811_022746 8.53226e+08 29210 8.53226e+08 18046.8 0.149974 …

GBM_grid__1_AutoML_20240811_022746_model_1 8.58066e+08 29292.8 8.58066e+08 17961.7 0.153238 …

GBM_grid__1_AutoML_20240811_022746_model_2 8.91964e+08 29865.8 8.91964e+08 17871.9 0.1504 …

GBM_grid__1_AutoML_20240811_022746_model_10 9.11731e+08 30194.9 9.11731e+08 18342.2 0.153421 …

GBM_grid__1_AutoML_20240811_022746_model_21 9.21185e+08 30351 9.21185e+08 18493.5 0.15413 …

GBM_grid__1_AutoML_20240811_022746_model_8 9.22497e+08 30372.6 9.22497e+08 19124 0.159135 …

GBM_grid__1_AutoML_20240811_022746_model_23 9.22655e+08 30375.2 9.22655e+08 17876.6 0.150722 …

XGBoost_3_AutoML_20240811_022746 9.31315e+08 30517.5 9.31315e+08 19171.1 0.157819 …

Classification

model_id mean_per_class_error logloss rmse mse …

StackedEnsemble_BestOfFamily_AutoML_20240608_143533 0.187252 0.330471 0.309248 0.0956343 …

StackedEnsemble_AllModels_AutoML_20240608_143533 0.187268 0.331742 0.309836 0.0959986 …

DRF_1_AutoML_20240608_143533 0.214386 4.05288 0.376788 0.141969 …

GBM_grid__1_AutoML_20240608_143533_model_1 0.266931 0.528616 0.415268 0.172447 …

XGBoost_grid__1_AutoML_20240608_143533_model_1 0.323726 0.511452 0.409528 0.167713 …

GBM_4_AutoML_20240608_143533 0.368778 1.05257 0.645823 0.417088 …

GBM_grid__1_AutoML_20240608_143533_model_2 0.434227 1.10232 0.663382 0.440075 …

GBM_3_AutoML_20240608_143533 0.461059 1.08184 0.655701 0.429944 …

GBM_2_AutoML_20240608_143533 0.481588 1.08175 0.654895 0.428887 …

XGBoost_1_AutoML_20240608_143533 0.487381 1.05534 0.645005 0.416031 …

PyCaret

This is the longest AutoML code explored in this post. PyCaret does not need the features (X_train) and label (y_train) split. So, the code below only randomly splits the training dataset into another training dataset and a validation dataset. Preprocessing, such as filling in missing data or feature selection, is also not required. We then set up PyCaret by assigning the data, the target variable or label, the numeric and categorical imputation methods, whether to use normalization, whether to remove multicollinearity, and so on.

Regression

!pip install pycaret
from pycaret.regression import *

# Generate random numbers
val_index = np.random.choice(range(trainSet.shape[0]),
                             round(trainSet.shape[0]*0.2), replace=False)

# Split trainSet
trainSet1 = trainSet.drop(val_index)
trainSet2 = trainSet.iloc[val_index,:]

# Create the model
caret = setup(data=trainSet1, target='SalePrice', session_id=111,
              numeric_imputation='mean', categorical_imputation='constant',
              normalize=True, combine_rare_levels=True,
              rare_level_threshold=0.05, remove_multicollinearity=True,
              multicollinearity_threshold=0.95)

Classification

!pip install pycaret
from pycaret.classification import *

# Generate random numbers
val_index = np.random.choice(range(trainSet.shape[0]),
                             round(trainSet.shape[0]*0.2), replace=False)

# Split trainSet
trainSet1 = trainSet.drop(val_index)
trainSet2 = trainSet.iloc[val_index,:]

# Create the model
caret = setup(data=trainSet1, target='Target', session_id=123,
              numeric_imputation='mean', categorical_imputation='constant',
              normalize=True, combine_rare_levels=True,
              rare_level_threshold=0.05, remove_multicollinearity=True,
              multicollinearity_threshold=0.95)

After that, we can run PyCaret by specifying how many cross-validation folds we want. PyCaret for regression returns several models sorted by the best scoring metrics. The top models are Bayesian Ridge, Huber Regressor, Orthogonal Matching Pursuit, Ridge Regression, and Passive-Aggressive Regressor. The scoring metrics are MAE, MSE, RMSE, R2, RMSLE, and MAPE. PyCaret for classification also returns several models. The top models are the Extra Trees Classifier, Random Forest Classifier, Decision Tree Classifier, Extreme Gradient Boosting, and Light Gradient Boosting Machine. The tables below are truncated in both rows and columns; find the complete tables in the Kaggle notebook.

# Show the models
caret_models = compare_models(fold=5)

Regression

Model MAE MSE RMSE R2 …

br Bayesian Ridge 15940.2956 566705805.8954 23655.0027 0.9059 …

huber Huber Regressor 15204.0960 588342119.6640 23988.3772 0.9033 …

omp Orthogonal Matching Pursuit 16603.0485 599383228.9339 24383.2437 0.9001 …

ridge Ridge Regression 16743.4660 605693331.2000 24543.6840 0.8984 …

par Passive Aggressive Regressor 15629.1539 630122079.3113 24684.8617 0.8972 …

… … … … … … …

Classification

Model Accuracy AUC Recall Prec. …

et Extra Trees Classifier 0.8944 0.9708 0.7912 0.8972 …

rf Random Forest Classifier 0.8634 0.9599 0.7271 0.8709 …

dt Decision Tree Classifier 0.8436 0.8689 0.7724 0.8448 …

xgboost Extreme Gradient Boosting 0.8417 0.9455 0.7098 0.8368 …

lightgbm Light Gradient Boosting Machine 0.8337 0.9433 0.6929 0.8294 …

… … … … … … …

To create the top 5 models, run the following code.

Regression

# Create the top 5 models
br = create_model('br', fold=5)
huber = create_model('huber', fold=5)
omp = create_model('omp', fold=5)
ridge = create_model('ridge', fold=5)
par = create_model('par', fold=5)

Classification

# Create the top 5 models
et = create_model('et', fold=5)
rf = create_model('rf', fold=5)
dt = create_model('dt', fold=5)
xgboost = create_model('xgboost', fold=5)
lightgbm = create_model('lightgbm', fold=5)

To tune the selected model, run the following code.

Regression

# Tune the model, BR: Regression
br_tune = tune_model(br, fold=5)

# Show the tuned hyperparameters, for example for BR: Regression
plot_model(br_tune, plot='parameter')

Classification

# Tune the model, LightGBM: Classification
lightgbm_tune = tune_model(lightgbm, fold=5)

# Show the tuned hyperparameters, for example for LightGBM: Classification
plot_model(lightgbm_tune, plot='parameter')

PyCaret lets the users manually perform ensemble methods, like Bagging, Boosting, Stacking, or Blending. The below code performs each ensemble method.

Regression

# Bagging BR
br_bagging = ensemble_model(br_tune, fold=5)

# Boosting BR
br_boost = ensemble_model(br_tune, method='Boosting', fold=5)

# Stacking with Huber as the meta-model
stack = stack_models(caret_models_5, meta_model=huber, fold=5)

# Blending top models
caret_blend = blend_models(estimator_list=[br_tune, huber_tune, omp_tune, ridge_tune, par_tune])

Classification

# Bagging LightGBM
lightgbm_bagging = ensemble_model(lightgbm_tune, fold=5)

# Boosting LightGBM
lightgbm_boost = ensemble_model(lightgbm_tune, method='Boosting', fold=5)

# Stacking with ET as the meta-model
stack = stack_models(caret_models_5, meta_model=et, fold=5)

# Blending top models
caret_blend = blend_models(estimator_list=[lightgbm_tune, rf, dt])

Now, let’s choose blending models as the predictive models. The following code uses the blending models to predict the validation datasets.

Regression

# Predict the validation data
caret_pred = predict_model(caret_blend, data=trainSet2.drop(columns=['SalePrice']))

Classification

# Predict the validation data
pred_caret = predict_model(caret_blend, data=trainSet2.drop(columns=['Target']))

AutoViML

I run AutoViML in the notebook with many arguments assigned. Just like PyCaret, AutoViML does not need the features (X_train) and label (y_train) split. Users only need to split the data into a training dataset (trainSet1) and a validation dataset (trainSet2). Users can specify other parameters, like the scoring parameter, hyperparameters, feature reduction, boosting, binning, and so on.

Regression

!pip install autoviml
!pip install shap
from autoviml.Auto_ViML import Auto_ViML

# Create the model
viml, features, train_v, test_v = Auto_ViML(trainSet1, 'SalePrice',
                                            trainSet2.drop(columns=['SalePrice']),
                                            scoring_parameter='',
                                            hyper_param='RS',
                                            feature_reduction=True,
                                            Boosting_Flag=True,
                                            Binning_Flag=False,
                                            Add_Poly=0,
                                            Stacking_Flag=False,
                                            Imbalanced_Flag=True,
                                            verbose=1)

Classification

!pip install autoviml
!pip install shap
from autoviml.Auto_ViML import Auto_ViML

# Create the model
viml, features, train_v, test_v = Auto_ViML(trainSet1, 'Target',
                                            trainSet2.drop(columns=['Target']),
                                            scoring_parameter='balanced_accuracy',
                                            hyper_param='RS',
                                            feature_reduction=True,
                                            Boosting_Flag=True,
                                            Binning_Flag=False,
                                            Add_Poly=0,
                                            Stacking_Flag=False,
                                            Imbalanced_Flag=True,
                                            verbose=1)

After fitting the training data, we can examine what has been done from the output. For example, we can see that AutoViML was doing preprocessing by filling the missing data. Similar to MLJAR, AutoViML also gives the visualized report. For the regression task, it visualizes the scatter plot of true and predicted values using the XGBoost model. It also plots the prediction residual error in a histogram. From the two plots, we can observe that the model has relatively good accuracy and the residual error is a normal distribution.

The next report is ensemble method trials. We can find this report for both regression and classification tasks. The last graph displays the feature importance. We can see that the most important feature to predict house prices is exterior material quality, followed by overall material and finish quality, and size of the garage respectively.

Fig. 7 Scatter plot of true and predicted values, and residual error histogram.  Fig. 8 Feature importances.

As for the classification task, it does not visualize the scatter plot, but the ROC curve for each of the four prediction classes, micro-average, and macro-average as well as the AUC value. We can see that all of the classes have AUC values above 0.98. The next chart reports the iso-f1 curves as the accuracy metric. It also later gives the classification report.

Fig. 9 ROC curves and iso-f1 curves

Average precision score, micro-averaged over all classes: 0.97
Macro F1 score, averaged over all classes: 0.83
#####################################################
              precision    recall  f1-score   support

           0       0.84      0.97      0.90       963
           1       0.51      0.59      0.54       258
           2       0.18      0.11      0.14       191
           3       0.00      0.00      0.00       118

    accuracy                           0.73      1530
   macro avg       0.38      0.42      0.40      1530
weighted avg       0.64      0.73      0.68      1530

[[938  19   6   0]
 [106 151   1   0]
 [ 53 117  21   0]
 [ 16  12  90   0]]

The following code prints out the result of the AutoViML search. We can see that the best model is the XGBRegressor for the regression task and the XGBClassifier for the classification task.

viml

Regression

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, gamma=1, gpu_id=0,
             grow_policy='depthwise', importance_type='gain',
             interaction_constraints='', learning_rate=0.1, max_delta_step=0,
             max_depth=8, max_leaves=0, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=-1,
             nthread=-1, num_parallel_tree=1, objective='reg:squarederror',
             predictor='cpu_predictor', random_state=1, reg_alpha=0.5,
             reg_lambda=0.5, scale_pos_weight=1, seed=1, subsample=0.7,
             tree_method='hist', ...)

Classification

CalibratedClassifierCV(base_estimator=OneVsRestClassifier(
                           estimator=XGBClassifier(base_score=None, booster='gbtree',
                                                   colsample_bylevel=None, colsample_bynode=None,
                                                   colsample_bytree=None, gamma=None, gpu_id=None,
                                                   importance_type='gain',
                                                   interaction_constraints=None,
                                                   learning_rate=None, max_delta_step=None,
                                                   max_depth=None, min_child_weight=None,
                                                   missing=nan, monotone_constraints=None,
                                                   n_estimators=200, n_jobs=-1, nthread=-1,
                                                   num_parallel_tree=None,
                                                   objective='binary:logistic', random_state=99,
                                                   reg_alpha=None, reg_lambda=None,
                                                   scale_pos_weight=None, subsample=None,
                                                   tree_method=None, use_label_encoder=True,
                                                   validate_parameters=None, verbosity=None),
                           n_jobs=None),
                       cv=5, method='isotonic')

LightAutoML

Now, we have reached the 10th AutoML. LightAutoML is expected to be light, as its name suggests. Here, we set the task to 'reg' for regression, 'multiclass' for multiclass classification, or 'binary' for binary classification. We can also set the metric and loss in the task. I set the timeout to 3 minutes to let it find the best model. After a simple .fit and .predict, we already get the result.

Regression

!pip install openpyxl
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

# Create the model
light = TabularAutoML(task=Task('reg'), timeout=60*3, cpu_limit=4)
train_data = pd.concat([X_train, y_train], axis=1)

# Fit the training data
train_light = light.fit_predict(train_data, roles={'target': 'SalePrice', 'drop': []})

# Predict the validation data
pred_light = light.predict(X_val)

Classification

!pip install openpyxl
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

train_data = pd.concat([X_train, y_train], axis=1)

# Create the model
light = TabularAutoML(task=Task('multiclass'), timeout=60*3, cpu_limit=4)

# Fit the training data
train_light = light.fit_predict(train_data, roles={'target': 'Target'})

# Predict the validation data
pred_light = light.predict(X_val)

The results for the classification task are the predicted class and the probability of each of the classes. In other words, the result fulfills multiclass and multilabel classification expectations.

# Convert the prediction result into a dataframe
pred_light2 = pred_light.data
pred_light2 = pd.DataFrame(pred_light2, columns=['4','2','3','1'])
pred_light2 = pred_light2[['1','2','3','4']]
pred_light2['Pred'] = pred_light2.idxmax(axis=1)
pred_light2['Pred'] = pred_light2['Pred'].astype(int)
pred_light2.head()

1 2 3 4 Pred

0 0.00 0.01 0.00 0.99 4

1 0.00 0.00 0.00 1.00 4

2 0.00 0.04 0.00 0.96 4

3 0.00 0.01 0.98 0.01 3

4 0.02 0.38 0.34 0.27 2

Conclusion

We have now discussed 10 AutoML packages. Note that there are still more AutoML packages not covered in this post. AutoML can also be found beyond Python notebooks, in cloud computing services and software. Which AutoML is your favourite? Or do you have another AutoML in mind?

Fig. 10 AutoML. Are you still going to run 10 conventional Machine Learning algorithms, or will you prefer the 10 AutoML packages?


References

Fig3- Image by Author

Fig4-Image by Author

Fig5-Image by Author

Fig6-Image by Author

Fig10- sklearn, topot, hyperopt, autokeras, mljar, autogluon, h2o, pycaret, autoviml


The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Related

Introduction To Online Machine Learning: Simplified

Data is being generated in huge quantities everywhere. Twitter generates 12+ TB of data every day, Facebook generates 25+ TB every day, and Google generates much more than these quantities every day. Given that such data is produced daily, we need to build tools to handle data with high:

1. Volume: High volumes of data are stored today in every industry. Running conventional models on such huge data is infeasible.

2. Velocity: Data arrive at high speed and demand quicker learning algorithms.

3. Variety: Different sources of data have different structures, and all of them contribute to prediction. A good algorithm can take in such a variety of data.

A simple predictive algorithm like Random Forest on about 50 thousand data points with 100 dimensions takes 10 minutes to execute on a 12 GB RAM machine. Problems with hundreds of millions of observations are simply impossible to solve on such machines. Hence, we are left with only two options: use a stronger machine or change the way the predictive algorithm works. The first option is not always feasible. In this article we will learn about online learning algorithms, which are meant to handle data with such high volume and velocity on machines with limited performance.

How does On-line learning differ from batch learning algorithms?

If you are a starter in the analytics industry, almost everything you have heard of probably falls under the batch learning category. Let's try to visualize how the two differ from each other.

Batch learning algorithms take batches of training data to train a model, then predict test samples using the learned relationship. Online learning algorithms instead start with an initial guess model, pick up one observation at a time from the training population, and recalibrate the weights on each input parameter. Here are a few trade-offs in using the two algorithms.

Computationally much faster and more space efficient. In the online model, you are allowed to make exactly one pass on your data, so these algorithms are typically much faster than their batch learning equivalents, since most batch learning algorithms are multi-pass. Also, since you can’t reconsider your previous examples, you typically do not store them for access later in the learning procedure, meaning that you tend to use a smaller memory footprint.

Usually easier to implement. Since the online model makes one pass over the data, we end up processing one example at a time, sequentially, as they come in from the stream. This usually simplifies the algorithm, if you’re doing so from scratch.

More difficult to maintain in production. Deploying online algorithms in production typically requires that you have something constantly passing datapoints to your algorithm. If your data changes and your feature selectors are no longer producing useful output, or if there is major network latency between the servers of your feature selectors, or one of those servers goes down, or really, any number of other things, your learner tanks and your output is garbage. Making sure all of this is running ok can be a trial.

More difficult to evaluate online. In online learning, we can’t hold out a “test” set for evaluation because we’re making no distributional assumptions — if we picked a set to evaluate, we would be assuming that the test set is representative of the data we’re operating on, and that is a distributional assumption. Since, in the most general case, there’s no way to get a representative set that characterizes your data, your only option (again, in the most general case) is to simply look at how well the algorithm has been doing recently.

Usually more difficult to get “right”. As we saw in the last point, online evaluation of the learner is hard. For similar reasons, it can be very hard to get the algorithm to behave “correctly” on an automatic basis. It can be hard to diagnose whether your algorithm or your infrastructure is misbehaving.

In cases where we deal with huge data, we are left with no choice but to use online learning algorithms. The only other option is to do a batch learning on a smaller sample.
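The one-pass, one-observation-at-a-time idea can be sketched with scikit-learn's `partial_fit` (a toy stream of my own making, not tied to any particular production setup):

```python
# Online learning sketch: the model sees each streamed example exactly once
# and updates its weights immediately, never storing past observations.
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared up front

# Simulate a stream of 2000 single observations with a toy linear target.
for _ in range(2000):
    x = rng.normal(size=(1, 2))
    y = np.array([int(x[0, 0] + x[0, 1] > 0)])
    model.partial_fit(x, y, classes=classes)

# Evaluate on fresh draws from the same distribution.
X_test = rng.normal(size=(500, 2))
y_test = (X_test.sum(axis=1) > 0).astype(int)
acc = model.score(X_test, y_test)
```

Note that the loop never revisits an old observation, which is exactly the smaller memory footprint discussed above.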

Example Case to understand the concept

We want to predict the probability that it will rain today. We have a panel of 11 people who predict the class, rain or no rain, based on different parameters. We need to design an algorithm to predict the probability. Let us first initialize a few notations.

i are individual predictors

w(i)  is the weight given to the i th predictor

Initial w(i) for i in [1,11] are all 1

We will predict that it will rain today if the weighted vote for rain wins, i.e. if the sum of w(i) over predictors voting "rain" is at least the sum of w(i) over predictors voting "no rain".

Once we have the actual response of the target variable, we send feedback to the weights of all the predictors. In this case we take a very simple feedback mechanism: for every right prediction, we keep the weight of the predictor the same, while for every wrong prediction, we divide the weight of the predictor by 1.2 (the learning rate). Over time, we expect the model to converge to the right set of weights. We created a simulation with 1000 predictions made by each of the 11 predictors. Here is how our accuracy curve came out:

Each observation was processed one at a time to readjust the weights. We make predictions for future data points in the same way.
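The weighted-majority scheme above can be sketched in a few lines (a toy simulation of my own; the experts' individual accuracies are assumptions, not taken from the article's simulation):

```python
# Weighted majority: 11 predictors start at weight 1; every wrong predictor's
# weight is divided by the learning rate (1.2) after each observation.
import numpy as np

rng = np.random.default_rng(42)
n_experts, learning_rate = 11, 1.2
weights = np.ones(n_experts)
# Assumed per-expert accuracies, from worse-than-chance to quite good.
acc = np.linspace(0.4, 0.9, n_experts)

correct = 0
for _ in range(1000):
    truth = int(rng.integers(0, 2))  # 1 = rain, 0 = no rain
    # Each expert votes: right with probability acc[i], wrong otherwise.
    votes = np.where(rng.random(n_experts) < acc, truth, 1 - truth)
    # Predict rain if the 'rain' side carries at least as much total weight.
    prediction = int(weights[votes == 1].sum() >= weights[votes == 0].sum())
    correct += int(prediction == truth)
    # Feedback: penalize every wrong expert.
    weights[votes != truth] /= learning_rate
```

After the run, the more accurate experts end up with the larger weights, so the ensemble increasingly ignores the unreliable ones.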

End Notes

Online learning algorithms are widely used in the e-commerce and social networking industries. They are not only fast but also capable of capturing any new trend that becomes visible over time. A variety of feedback mechanisms and convergence schemes are available and should be selected as per the requirements. In some of the following articles, we will also take up a few practical examples of online learning applications.

Did you find the article useful? Have you used online learning algorithms before? Share any such experiences with us. Do let us know your thoughts about this article in the comments box below.

If you like what you just read and want to continue your analytics learning, subscribe to our emails, follow us on Twitter, or like our Facebook page.

Related

Deep Learning Vs. Machine Learning: The Ultimate Guide For 2023



Always wondered how Amazon provides the best recommendations based on your previous purchases, or how Siri finds any information you are looking for in a matter of seconds? From self-driving cars to voice assistants, deep learning has made it all possible. Artificial neural networks and deep learning are changing how we function, inside and outside our homes. But first, let us find out the differences between deep learning vs machine learning. 

How Does Deep Learning Work?

Deep learning is a subset of machine learning; machine learning, in turn, is a subset of artificial intelligence.

Continuous data analysis and bucketing help companies visualize information better. Deep learning algorithms draw intelligent conclusions by analyzing data using a logical approach. This is similar to how humans would analyze and visualize data but at a considerably faster rate and with a lower error rate. 

A multi-layered algorithm helps achieve logical structuring in deep learning, and such neural networks are the backbone of the field. Natural language processing (NLP), audio and speech recognition software, image recognition tools, and facial recognition are all applications of deep learning.

How Does Machine Learning Work?

Machine learning algorithms are not new. Humans have been trying to program machines to think in better, more logical ways for decades. However, the ability to apply complex algorithms to big-data applications in a cost- and time-efficient manner is a recent development, and companies that have harnessed this technology with skill and sophistication are already leading the race!

Machine learning experts have been focusing on helping computers learn and improve constantly from previous experience. Machine learning algorithms dig deep into data, identify patterns, and surface insights and trends, all with minimal human intervention. The core requirements for automating computers with machine learning are data-defined patterns, a defined set of rules, and clear objectives. Thousands of companies around the world have successfully transformed manual tasks, like bookkeeping or logging service requests, into fully automated processes using machine learning.

Key Differences: Deep learning vs machine learning 

| Deep Learning | Machine Learning |
| --- | --- |
| A subset of machine learning; machine learning has evolved to create deep learning. | A subset of artificial intelligence and a superset of deep learning; artificial intelligence has evolved to create machine learning. |
| Neural networks are used for data representation; big data with millions of data points is prevalent. | Structured data is used to build algorithms; datasets typically contain thousands of data points. |
| The output can range from numbers to free-form elements like text or sound. | The output is always numerical values only. |
| Data passes through multiple neural-network layers for processing, to interpret better and find trends. | Model functions are built using automated algorithms, which help the machine predict better from available data. |
| Solves complex machine learning issues and helps experts dive deeper, predicting efficiently with minimal human involvement. | Used to learn new things, identify trends, and stay ahead of the competition; much more human intervention is needed. |

Types of Machine Learning

Machine learning can be of four types namely supervised, semi-supervised, unsupervised, and reinforcement. 

Supervised –

As the name suggests, in supervised learning the machine is taught by example, using labelled data.

Semi-supervised –

Using a healthy mix of labelled and unlabelled data, machines are taught to label the unlabelled data and make the information more comprehensible.

Unsupervised –

The machine identifies patterns and structure in unlabelled data on its own, without explicit guidance.

Reinforcement –

A set of actions, parameters, and end values is provided, and the machine learns by trial and error which actions yield the best outcome.

Learn more about the four types of machine learning here.

Types of Deep Learning

A Deep Neural Network (DNN) is an Artificial Neural Network (ANN) with multiple layers between input and output. Moreover, the success of DNNs has led to lower error rates in speech and image recognition over the last decade.

There are three primary neural network types in deep learning:

1. Multi-Layer Perceptrons (MLP)

One of the most popular and basic feedforward artificial neural networks is the MLP. It comprises a series of fully connected layers. Every layer in an MLP applies a set of non-linear functions, and the weighted sum of all of a layer's fully connected outputs forms the input to the next layer, and so on. Modern deep learning architectures use the MLP as a building block while working to keep its computing requirements manageable.
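The "weighted sums passed through non-linear functions, layer by layer" idea can be sketched as a forward pass. The layer sizes, random weights, and the choice of ReLU as the non-linearity are illustrative assumptions, not from the article:

```python
# A minimal MLP forward pass: each fully connected layer computes a
# weighted sum of the previous layer's outputs and applies a non-linearity.
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    # A common non-linear activation: max(0, z) element-wise.
    return np.maximum(0.0, z)

# Fully connected layers: 4 inputs -> 8 hidden units -> 3 outputs.
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)

x = rng.normal(size=(1, 4))           # one input sample
h = relu(x @ W1 + b1)                 # hidden layer: non-linear function of weighted sums
out = h @ W2 + b2                     # output layer: weighted sum of hidden outputs

print(out.shape)  # (1, 3)
```

Stacking more such layers is what makes the network "deep".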

2. Convolutional Neural Networks (CNN)

Unlike MLPs, CNNs are most commonly used in computer vision. Each layer again applies a set of non-linear functions, but the weighted sums are not taken over the entire previous layer. Instead, each output is a weighted sum over a spatially nearby subset of the previous layer's outputs, which allows the same weights to be reused at every position. When a series of images or videos is fed in, the CNN learns to extract features from the input and classifies the output, enabling image recognition and object classification.
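The weight reuse described above can be sketched with a naive 2D convolution; the kernel and input values are illustrative assumptions:

```python
# A sketch of the weight reuse CNNs rely on: one small kernel is slid over
# every spatial position, so the SAME weights produce each output entry
# from a spatially nearby patch of the input.
import numpy as np

def conv2d(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]   # spatially nearby subset
            out[i, j] = np.sum(patch * kernel)  # same weights at every position
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, -1.0]])           # responds to horizontal change
features = conv2d(image, edge_kernel)
print(features.shape)  # (5, 4)
```

Here a single pair of weights scans the whole image, which is why CNNs need far fewer parameters than a fully connected layer over the same input.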

3. Recurrent Neural Networks (RNN)

This type of artificial neural network also works on sequential data, and it resolves the time-series limitation of the other architectures: the input at each step incorporates information from previous samples. The connections between nodes form a directed graph along a temporal sequence, unlike in a CNN or MLP. NLP is the most popular use case, as the RNN's ability to process inputs of varying lengths makes it ideal for language tasks such as modeling, embedding, or translation.

How to Improve Your Deep Learning vs Machine Learning Skills

By Manasa Ramakrishnan

Write to us at [email protected]
