

Machine learning can be considered a component of artificial intelligence: it involves training machines to operate more intelligently. AI broadly focuses on replicating human intelligence, while machine learning focuses on getting machines to learn from data faster. So we can say that machine learning engineers can deliver faster and better-optimized AI solutions.

AI technology has had a massive impact on society and has transformed almost every industrial sector from planning to production. Thus machine learning engineers and experts are also of great value to this growing industry.

Why is Machine Learning So Useful?

Machine learning may feel comparatively new, but it has existed for many years. Having recently gained a lot of attention, it is essential to many significant technological improvements.

When it comes to business operations, you can access a lot of data with the help of machine learning algorithms. Machine learning also offers more affordable data storage options that have made big data sets possible and accessible for organizations. It has also helped maximize the processing power of computers to be able to perform calculations and operations faster.

Wherever you find AI technology, you will find machine learning experts working to improve the efficiency and results of the AI technologies and machines involved.

Where can Machine Learning be Applied?

Machine learning has a lot of applications across a variety of tasks and operations. It plays a central role in collecting, analyzing, and processing large sets of data. It is not restricted to businesses and organizations: you have almost certainly interacted with machine learning already, perhaps without being aware of it. Here are a few examples you can relate to from daily life.

Machine learning solutions are being incorporated into the medical sciences for better detection and diagnosis of diseases. Here is the interesting part: machine learning can even be used to monitor a person's emotional state with the help of a smartphone.

This technology is also widely used by manufacturers to minimize losses during operations and maximize production, while reducing maintenance costs through timely predictions.

The banking industry is also utilizing machine learning to identify any fraudulent practices or transactions to avoid losses. Machine learning can also be used to give significant insights into financial data. This in turn results in better investments and better trades.

When it comes to transportation, the self-driving cars of Google or Tesla are powered by machine learning. It is therefore extremely beneficial for autonomous driving and for better interpretation of driving data.

What do Machine Learning Engineers do?

Why Pursue a Career in Machine Learning?

There are many reasons to pursue a career in machine learning. It is not only popular and in high demand, but also an interesting discipline where you can be innovative once you have acquired the necessary skills.

Wrapping Up

The discussion above describes the significant role of the growing machine learning and AI technology in industry and business, and why you should consider pursuing a career in the field.


Why Machine Learning Is Key To The Search Marketing Of Tomorrow

Advertising has changed a lot over the years.

There was a time when machine learning, automation, and software-based marketing tech stacks weren’t a “thing.”

But now we’re past the days of just radio, outdoor, print, and a handful of channels on TV.

There are hundreds of channels across physical and print media and online at present, including social, mobile, and video. Even TV has diversified into hundreds of cable channels on your remote control. And yet, digital ad revenue has gone on to surpass that of TV.

The dominance of digital is nothing new. Paid search marketing is becoming more data-focused than ever before.

So, What Do You Do with Big Data?

Why?

Is it because machine learning, automation, and software will completely replace savvy digital professionals and their creative ideas?

No. Far from it.

I believe that the future of digital will be a combination of smart marketers – like yourself – empowered by smart automation based on machine learning. As it happens, in a survey we recently ran on the subject, 97 percent of top digital marketing influencers (including speakers from AWeber, Oracle, and VentureBeat) agreed.

What Is Machine Learning & Why Is It Important?

Digital’s Data Problem in Three Parts

Data is a challenge in modern marketing. There’s significantly more of it than there used to be, and as marketing technology matures, it becomes capable of collecting even more on top of that.

1. Overload

Data overload is a known problem. There’s too much of it – an overwhelming abundance of it already.

Yet Oracle points out that digital data growth is expected to increase globally by 4,300 percent by 2023. This problem isn’t going away anytime soon.

2. Ownership

Veritas reports that 52 percent of all business data is “dark” (of dubious or completely unknown value), and projects that mismanaged data will cost businesses $3.3 trillion by 2023.

3. Integration

There’s also a problem with siloing. Most businesses collect data in different buckets that aren’t necessarily integrated directly with each other, or indeed, with their own in-house marketing tech stack.

Accenture reports that while three-quarters of all digital skills gaps (the gap between a team member’s current level of knowledge and the level they need to successfully use new tech and tactics) come from a lack of ownership, the remaining 25 percent come from a lack of integration.

And Then There’s the Changing Customer Journey

Advertising isn’t limited to a handful of channels. There are literally thousands of ways to reach customers, and pretty much all of them can be easily tuned out by an audience of increasingly demanding and disaffected customers who expect to have exactly what they’re looking for delivered to them instantly (and who will react poorly when it isn’t).

Research firm McKinsey breaks down the all-important consideration stage of the buying journey into four parts: “initial consideration; active evaluation, or the process of researching potential purchases; closure, when consumers buy brands; and postpurchase, when consumers experience them.”

The firm also finds that two-thirds of the touchpoints in the crucial evaluation stage are customer-driven, including browsing online reviews or soliciting word-of-mouth recommendations.

How Does Machine Learning Solve These Problems?

Machine learning can be used to rein in the challenge of data, particularly when combined with disciplines such as probability-based Bayesian statistics, regression modeling, and data science. One of its greatest strengths here is the ability to take data-driven insights and build predictive models.

These predictive models can, in turn, be used to proactively address points of peak buying interest, attrition, or other key moments observed in the customer buying journey.
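As a concrete (and deliberately simplified) illustration of that idea, here is a minimal sketch in Python. The engagement features, labels, and data are invented for illustration; this is not any vendor's actual implementation.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for historical customer data: three engagement
# features (e.g. visits, days since last order, support tickets) and a
# churn label derived from them plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 1] - X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A simple probabilistic classifier serves as the predictive model.
model = LogisticRegression().fit(X_train, y_train)

# Per-customer churn probabilities let marketers act before attrition happens.
churn_risk = model.predict_proba(X_test)[:, 1]
print(churn_risk[:5])

The point is not the particular algorithm but the workflow: historical behavior in, a probability of a key moment (here, churn) out, which marketers can then act on proactively.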

Examples of Machine Learning in Action

Let’s look at some examples of the way this technology is being used.

Chatbots & Voice Assistants

You may have noticed an increase in the use of conversational interfaces from major publishers such as Google, Amazon, Microsoft, Apple and Facebook in the form of chatbots and voice assistants (Alexa, Google Assistant, Siri and Cortana among others).

TOPBOTS notes that chatbots can have uses in unique, consumer-based contexts, such as event ticketing, health-related questions and the ever-important sports scores. These interfaces create a relevant and engaging user experience by supplying conversational responses based on historically-collected data – the most commonly-used or highly-searched terms.

Predicting & Preventing Customer Churn

A significantly deeper-funnel strategy at the post-purchase stage is to use machine learning to forecast common points of customer attrition.

Microsoft Azure and Urban Airship have both built predictive analytics models to determine the approximate timeframes and buying stages at which customers tend to most frequently churn. By projecting these important points in the future, these businesses are then able to proactively address common complaints before customers churn, driving higher retention and ultimately strengthening their businesses.

Natural Language Processing (NLP) and Semantic Distance Modeling

Takeaways

Machine learning isn’t necessarily a threat to marketers. On the contrary, it’s a powerful ally that’s making marketers’ lives easier while empowering them to predictively engage their customers in a highly relevant way.

Now, more than ever, it’s important to deliver the right message to the right customer at the right time – and with the power of machine learning, marketers are able to more accurately accomplish this goal by relying on actual data, rather than guesswork.

Game Playtesting: Why Is It Important?

With online games everywhere and many industries profiting from the continued popularity of digital games, game testing is vital. Before a game is officially released to potential users and marketplaces, it must be as close to flawless as possible on every platform it targets.

What does this mean? Games should run as smoothly as possible, minimizing the problems players encounter. The game playtesting stage is where testers play the game before its actual release.

If you're a game developer, the last thing you want is a game with negative reviews. Those reviews can come from slow loading times, notorious bugs, and endless error loops. Any game out for playtesting is under scrutiny and subject to evaluation.

Feedback and review

The feedback and review part comes from people who have played the game. As these individuals interact and engage with the game in progress, criticism is unavoidable. If you're a game designer, fret not: treat playtesting as an opportunity to re-examine the things you assumed were working.

Feedback from players or professional game testers reflects on the game, not on you personally.

Criticisms can reveal the kind of value that potential players can expect from the game. An important thing to note is that it’s never too late or too early to perform a playtest.

Game testing is a subset of the gaming industry. There are many professional testers and services that have the relevant skills to assess any game objectively. It’s more than just knowing that a game “is not good enough” or “boring” or “totally unengaging.”

Data accumulation

The bread and butter of any game playtesting is data gathering and study. Whether the tests come from a professional game tester or from strangers online willing to give the game a try, the data reveals a great deal about a game in development.

The data can tell any game developer how the game fares with test players. Is the registration part too long? Do players have a challenging time or easy time playing the game? How do players perform with the current interface of the game?

Accumulated data can pinpoint any need for adjustment, debugging, or simply changing the course of the game itself. Data tells you what to improve and how to improve any ongoing game.

If you’re a game developer, you may not be right about the game you designed and made, and it’s okay. Through data, you will learn more about the core of the game.


Avoiding potential cost

If you want to develop a game, balancing costs can become a real obstacle to keeping development smooth. Game playtesting is necessary in part because it helps offset potential costs. The last thing any development team wants is to spend weeks and thousands of dollars building game stages that don't work or don't sell.

Game playtesting helps catch significant glitches and impending bugs early. The earlier game developers catch errors, the less time it takes to correct them.

Imagine redoing weeks' worth of work to compensate for a lack of playtesting. Playtesting can save not only costs but also the long stretches of time spent redeveloping a game.


Conclusion

There are three reasons that game playtesting is vital for any game developer and designer out there.

First, player insight pinpoints a lot of potential and improvements for any developing game. Though it may be tough to get a lot of criticism, it helps shape the core competencies of any game.

Second, the data accumulation from test trials can reveal the more technical aspects of the game. Numbers and information can become a leading factor for the success or failure of any game.

Lastly, any game costs something from its inception through development. It may be time, effort, or money; more often than not, it is all three that a development team spends in creating something new for the playing market.

Game playtesting is not a way to question the confidence of game makers. Instead, it’s a step to ensure that what any game developer creates is something worthy of attention and praise.

Building Machine Learning Model Is Fun Using Orange

Introduction

With the growing need for data science talent, we need tools that take the difficulty out of doing data science and make it fun. Not everyone is willing to learn coding, even if they want to learn and apply data science. This is where GUI-based tools can come in handy.

Today, I will introduce you to another GUI based tool – Orange. This tool is great for beginners who wish to visualize patterns and understand their data without really knowing how to code.

In my previous article, I introduced you to another GUI-based tool, KNIME. If you do not want to learn to code but still want to apply data science, you can try either of these tools.

By the end of this tutorial, you’ll be able to predict which person out of a certain set of people is eligible for a loan with Orange!

Table of Contents:

Why Orange?

Setting up your System:

Creating your first Workflow

Familiarizing yourself with the basics

Problem Statement

Importing the data files

Understanding the data

How do you clean your data?

Training your first model

1. Why Orange?

Orange is a platform built for data mining and analysis through a GUI-based workflow. This means you do not have to know how to code to use Orange to mine data, crunch numbers, and derive insights.

You can perform tasks ranging from basic visuals to data manipulations, transformations, and data mining. It consolidates all the functions of the entire process into a single workflow.

The best part, and Orange's differentiator, is that it offers some wonderful visuals. You can try silhouettes, heat maps, geo maps, and all sorts of other available visualizations.

2. Setting up your System

Orange comes built-in with the Anaconda tool if you’ve previously installed it. If not, follow these steps to download Orange.

Step 2: Install the platform and set the working directory for Orange to store its files.

This is what the start-up page of Orange looks like. You have options that allow you to create new projects, open recent ones or view examples and get started.

Before we delve into how Orange works, let’s define a few key terms to help us in our understanding:

A widget is the basic processing point of any data manipulation. It can do a number of actions based on what you choose in your widget selector on the left of the screen.

A workflow is the sequence of steps or actions that you take in your platform to accomplish a particular task.

You can also go to “Example Workflows” on your start-up screen to check out more workflows once you have created your first one.

3. Creating Your First Workflow

This is your blank Workflow on Orange. Now, you’re ready to explore and solve any problem by dragging any widget from the widget menu to your workflow.

4. Familiarizing yourself with the basics

Orange is a platform that can help us solve most data science problems today, covering topics that range from the most basic visualizations to training models. You can even evaluate models and perform unsupervised learning on datasets:

4.1 Problem

The problem we’re looking to solve in this tutorial is the practice problem Loan Prediction that can be accessed via this link on Datahack.

4.2 Importing the data files

We begin with the first and most necessary step in understanding our data and making predictions: importing the data.

Step 3: Once you can see the structure of your dataset using the widget, go back by closing this menu.

Neat! Isn’t it?

Let’s now visualize some columns to find interesting patterns in our data.

4.3 Understanding our Data

The plot I've explored is a Gender by Income plot, with the colors set to education levels. As we can see, among males the higher income group naturally belongs to the graduates!

Among females, however, many graduates are earning little or almost nothing at all. Any specific reason? Let's find out using the scatter plot.

One possible reason I found was marriage. A huge number of graduates who were married were found to be in lower income groups; this may be due to family responsibilities or added commitments. Makes sense, right?

4.3.2 Distribution

What we see is a very interesting distribution: our dataset contains more married males than married females.

4.3.3 Sieve diagram

Let’s visualize using a sieve diagram.

This plot divides the distribution into 4 bins. Each section can be investigated by hovering the mouse over it.

Let’s now look at how to clean our data to start building our model.

5. How do you clean your data?

Here, for cleaning purposes, we will impute missing values. Imputation is a very important step in understanding and making the best use of our data.

Here, I have selected the default method to be Average for numerical values and Most Frequent for text based values (categorical).

You can select from a variety of imputations like:

Distinct Value

Random Values

Remove the rows with missing values

Model-Based
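Outside the GUI, the same imputation logic can be sketched in a few lines of scikit-learn; the column names and values below are invented for illustration and are not from the loan dataset.

import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame with missing values in a numeric and a categorical column.
df = pd.DataFrame({
    "ApplicantIncome": [5000, None, 3200, 4100],
    "Gender": ["Male", "Female", None, "Female"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# Average for numerical values, Most Frequent for categorical values,
# mirroring the defaults chosen in the Impute widget above.
df[num_cols] = SimpleImputer(strategy="mean").fit_transform(df[num_cols])
df[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[cat_cols])
print(df)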

6. Training your First Model

Beginning with the basics, we will first train a linear model encompassing all the features just to understand how to select and build models.

Step 1: First, we need to set a target variable to apply Logistic Regression to.

Step 4: Once we have set our target variable, connect the cleaned data from the “Impute” widget to the “Logistic Regression” widget as follows.

Ridge Regression:

Performs L2 regularization, i.e. adds penalty equivalent to square of the magnitude of coefficients

Minimization objective = LS Obj + α * (sum of square of coefficients)

Lasso Regression:

Performs L1 regularization, i.e. adds penalty equivalent to absolute value of the magnitude of coefficients

Minimization objective = LS Obj + α * (sum of absolute value of coefficients)
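Expressed as equations (standard notation, not Orange-specific symbols), with coefficients β and regularization strength α, the two objectives are:

\text{Ridge (L2):}\quad \min_{\beta}\; \sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^{\top}\beta\bigr)^2 \;+\; \alpha \sum_{j=1}^{p} \beta_j^{2}

\text{Lasso (L1):}\quad \min_{\beta}\; \sum_{i=1}^{n}\bigl(y_i - \mathbf{x}_i^{\top}\beta\bigr)^2 \;+\; \alpha \sum_{j=1}^{p} \lvert\beta_j\rvert

The first term is the least-squares objective ("LS Obj") and the second term is the penalty on the coefficients.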

I have chosen Ridge for my analysis, you are free to choose between the two.

Step 8: To visualize the results better, drag and drop from the “Test and Score” widget to find the “Confusion Matrix” widget.

This way, you can test out different models and see how accurately they perform.

Let’s evaluate how a Random Forest would do. Change the modeling method to Random Forest and look at the confusion matrix.

Looks decent, but the Logistic Regression performed better.

We can try again with a Support Vector Machine.

Better than the Random Forest, but still not as good as the Logistic Regression model.

Sometimes the simpler methods are the better ones, aren’t they?

This is how your final workflow would look after you are done with the complete process.

For people who wish to work in groups, you can also export your workflows and send them to friends who can work alongside you!

The resulting file has the .ows extension and can be opened in any other Orange setup.

End Notes

Orange is a platform that can be used for almost any kind of analysis but most importantly, for beautiful and easy visuals. In this article, we explored how to visualize a dataset. Predictive modeling was undertaken as well, using a logistic regression predictor, SVM, and a random forest predictor to find loan statuses for each person accordingly.

Hope this tutorial has helped you figure out aspects of the problem that you might not have understood or missed out on before. It is very important to understand the data science pipeline and the steps we take to train a model, and this should surely help you build better predictive models soon!


UX vs. UI Design: Which Is More Important and Why?


According to a 2023 LinkedIn report, 88% of online consumers are less likely to return to a website after a bad experience. Satisfactory experiences directly translate to revenue. But as more companies move online, capturing user attention is becoming more challenging. Also, how do you translate a user experience into a seamless design that is not only functional but also appealing? This is why User Experience (UX) and User Interface (UI) design are important. While the debate around the relative merits of UX vs UI design continues, the answer clearly lies in achieving synergy. Let’s find out how these two fields overlap and how to best utilize their joint potential.

What is UX Design? Why is UX Important?

In the digital world, UX is defined as the entire range of feelings and experiences of a user while interacting with a product to accomplish certain tasks. UX design, therefore, refers to the whole set of elements that influence these interactions. User-centric design thinking strives to incorporate relevant, accessible, and convenient experiences in every aspect of the product with which the user interacts. 

UX design is considered to be the backbone of product success. The importance of UX designers lies in their two-pronged approach towards a product: the first prong is ensuring the product meets its target users’ needs, and the second is creating a seamless experience for the consumer while achieving those outcomes. Collaborative market research and experimentation help UX designers define the ease and user-friendliness of the overall interaction with a service or product.


What is UI Design? Why is UI Important?

User Interface (UI) design deals with a specific aspect of user experience—how an interface behaves, feels, and looks as consumers navigate it. The best UI design attracts the least attention and maximizes seamless interaction to achieve user goals. The success of any software system depends on how transparent or efficient these interactions are, as UI is the primary medium between humans and technology.

UI designers understand the user journey related to a product or service and align it with long-term business goals. They translate the vision proposed by UX designers by mapping user behavior and defining a user flow, or, in other words, the technical requirements of a path taken by users to accomplish a specific task. A successful visual interface should be intuitive, highly responsive, insightful, visually appealing, and easy-to-use.

How Do UX Design and UI Design Work Together?

When dealing with a digital product’s visual and functional aspects, UX and UI designers play the central role of agreeing upon a shared vision. This involves 

Developing a specific style guide

Finding a balance between design and other business areas

Optimizing the workflow requirements 

Instilling a coherent product experience. 

Rather than contrasting the importance of UX vs UI design, it is more useful to see their collaboration as central to presenting ideas to stakeholders and maintaining design documentation for each project.


UX Designer Job Description vs UI Designer Job Description

UI designers share the same end goals as UX designers – to deliver products that meet consumer needs and are also pleasant to use. But understanding the different criteria in the UX vs UI design job descriptions is essential to grasping their respective work processes.

Job description of a UX designer includes the ability to:

Explain user research and customer segmentation to different stakeholders

Categorize customers into appropriate user personas

Consult with clients to understand product-specific goals.

Create product prototypes, screen flows, and wireframes

Assist UI designers to engage in a singular vision

Job description of a UI designer includes the skills to:

Engage in intense collaboration with project managers and software engineers to define product development direction

Note the requirements from the UX designer in the product development process

Translate concepts into simple designs and solve design roadblocks accordingly

Defend the designs in front of stakeholders

Find more effective ways to engage users and bring them into the discussion

Differences Between UI and UX, with Examples

Skills

Comparing UX vs UI design skill sets allows you to understand their different job roles. 

To get hired as a UI designer, you need to master the following:

Basic principles of interactive design

Color and typography theory

Developing style guides

Wireframing

Coding skills 

Software proficiency in Java/C++/Python

HTML and CSS training

Software proficiency in Sketch, Adobe XD, InVision Studio, among others

To get hired as a UX designer, here’s a list of the essential skills required:

Critical thinking

Continuous learning and empathetic thinking

Strong understanding of the market

Strong collaboration and communication skills

UX writing

Prototyping and wireframing

Visual communication

Basic knowledge of UI design

Business acumen

Basics of coding

Salary

According to Glassdoor, the average annual salary of a UX designer in the U.S. is $95,578, while that of a UI designer is $86,847. 

Education

UX designers can come from both technical and non-technical backgrounds. Candidates with psychology, media science, journalism, computer science, or market analytics backgrounds are suitable for the job role. Meanwhile, due to their tightly defined job role, UI designers usually belong to technical backgrounds. They tend to have a degree in graphic design, motion graphics, animation and illustration, digital design, or interactive design.


Tasks And Responsibilities: What Do They Do?

The various tasks of a UX designer include:

Research to identify user trends, goals, and behaviors

Mapping the consumer journey for better user experience and product analysis

Building prototypes to finalize designs and test products 

Identifying the pain points in the user journey and user flow

Categorizing products according to user personas of the target market

Stakeholder collaboration and communication with other members of the team

The everyday tasks of a UI designer include:

Finding out efficient ways to enhance user control

Organizing page layouts and selecting typefaces for designs

Finalizing the language and information architecture for each screen

Developing wireframes to demonstrate samples of the final design

Collaborating with developers to assist in translating designs into functional products 

Is There a UI/UX Designer Role?

With the changing conception of design, the roles of designers are also evolving. As Jonathan Widawski, CEO of Maze, puts it: “We shouldn’t talk about UX vs UI [design]. Instead, it should be UX and UI because they overlap and complement each other.” Nowadays, recruiters are looking for a mix of skills from both the fields of UI and UX. The bottom line is that aspiring candidates should have skills in both areas to increase their chances of recruitment.

How to Know if UI or UX is a Better Fit for You

If you lean towards understanding user psychology and studying trends to define an entire experience, you could explore UX design as a career option. On the other hand, if design is your forte and you prefer exploring different styles of design thinking, your field of choice should be UI design. To excel in this ever-evolving field, you can sign up for Emeritus’ comprehensive online product, design and innovation courses, which are created in collaboration with top global universities.

By Bishwadeep Mitra


10 Automated Machine Learning For Supervised Learning (Part 2)

This article was published as a part of the Data Science Blogathon

Introduction

This post will discuss 10 Automated Machine Learning (AutoML) packages that we can run in Python. If you are tired of running lots of Machine Learning algorithms just to find the best one, this post might be what you are looking for. It is the second part of a two-part series: the first part explains the general concept of Machine Learning, from defining the objective, through pre-processing, model creation and selection, and hyperparameter tuning, to model evaluation. At the end of that post, Auto-Sklearn is introduced as an AutoML. If you are already familiar with Machine Learning, you can skip Part 1.

The main point of the first part is that we require a relatively long time and many lines of code to run all of the classification or regression algorithms before finally selecting or ensembling the best models. Instead, we can run AutoML to automatically search for the best model with a much shorter time and, of course, less code. Please find my notebooks for conventional Machine Learning algorithms for regression (predicting house prices) and classification (predicting poverty level classes) tasks in the table below. My AutoML notebooks are also in the table below. Note that this post will focus only on regression and classification AutoML while AutoML also can be applied for image, NLP or text, and time series forecasting.

Now, we will start discussing the 10 AutoML packages that can replace those long notebooks.

 AutoSklearn

This autoML, as mentioned above, has been discussed before. Let’s do a quick recap. Below is the code for searching for the best model in 3 minutes. More details are explained in the first part.

Regression

!apt install -y build-essential swig curl
!pip install auto-sklearn
!pip install scipy==1.7.0
!pip install -U scikit-learn
from autosklearn.regression import AutoSklearnRegressor
# Alias mean_squared_error as MSE, since the snippets below call MSE()
from sklearn.metrics import mean_squared_error as MSE, mean_absolute_error

Running AutoSklearn

# Create the model
sklearn = AutoSklearnRegressor(time_left_for_this_task=3*60, per_run_time_limit=30, n_jobs=-1)
# Fit the training data
sklearn.fit(X_train, y_train)
# Sprint Statistics
print(sklearn.sprint_statistics())
# Predict the validation data
pred_sklearn = sklearn.predict(X_val)
# Compute the RMSE
rmse_sklearn = MSE(y_val, pred_sklearn)**0.5
print('RMSE: ' + str(rmse_sklearn))

Output:

auto-sklearn results:
  Dataset name: c40b5794-fa4a-11eb-8116-0242ac130202
  Metric: r2
  Best validation score: 0.888788
  Number of target algorithm runs: 37
  Number of successful target algorithm runs: 23
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 8
  Number of target algorithms that exceeded the memory limit: 6
RMSE: 27437.715258009852

Result:

# Scatter plot true and predicted values
plt.scatter(pred_sklearn, y_val, alpha=0.2)
plt.xlabel('predicted')
plt.ylabel('true value')
plt.text(100000, 400000, 'RMSE: ' + str(round(rmse_sklearn)))
plt.text(100000, 350000, 'MAE: ' + str(round(mean_absolute_error(y_val, pred_sklearn))))
plt.text(100000, 300000, 'R: ' + str(round(np.corrcoef(y_val, pred_sklearn)[0,1], 4)))
plt.show()

 Fig. 1 AutoSklearn Regression Result

Classification

Applying AutoML

!apt install -y build-essential swig curl
from autosklearn.classification import AutoSklearnClassifier
# Create the model
sklearn = AutoSklearnClassifier(time_left_for_this_task=3*60, per_run_time_limit=15, n_jobs=-1)
# Fit the training data
sklearn.fit(X_train, y_train)
# Sprint Statistics
print(sklearn.sprint_statistics())
# Predict the validation data
pred_sklearn = sklearn.predict(X_val)
# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_sklearn)))

Output:

auto-sklearn results:
  Dataset name: 576d4f50-c85b-11eb-802c-0242ac130202
  Metric: accuracy
  Best validation score: 0.917922
  Number of target algorithm runs: 40
  Number of successful target algorithm runs: 8
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 28
  Number of target algorithms that exceeded the memory limit: 4
Accuracy: 0.923600209314495

Result:

# Prediction results
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_val, pred_sklearn), index=[1,2,3,4], columns=[1,2,3,4]))
print('')
print('Classification Report')
print(classification_report(y_val, pred_sklearn))

Output:

Confusion Matrix
      1    2    3     4
1   123   14    3    11
2    17  273   11    18
3     3   17  195    27
4     5   15    5  1174

Classification Report
              precision    recall  f1-score   support
           1       0.83      0.81      0.82       151
           2       0.86      0.86      0.86       319
           3       0.91      0.81      0.86       242
           4       0.95      0.98      0.97      1199
    accuracy                           0.92      1911
   macro avg       0.89      0.86      0.88      1911
weighted avg       0.92      0.92      0.92      1911

Tree-based Pipeline Optimization Tool (TPOT)

TPOT is built on top of scikit-learn. It uses a genetic algorithm to search for the best model according to the “generations” and “population size” settings: the higher these two parameters are set, the longer the search will take. Unlike AutoSklearn, we do not set a specific running time for TPOT. As its name suggests, once TPOT has run, it exports a pipeline as lines of code covering everything from importing packages and splitting the dataset to creating the tuned model, fitting it, and predicting the validation dataset. The pipeline is exported in .py format.

In the code below, I set generations and population_size to 5. The output shows 5 generations with increasing “scoring”. I set the scoring to “neg_mean_absolute_error” and “accuracy” for the regression and classification tasks respectively. Neg_mean_absolute_error is the Mean Absolute Error (MAE) in negative form; the algorithm selects the highest scoring value, so making the MAE negative makes it select the MAE closest to zero (for example, −5 > −10, so an MAE of 5 outranks an MAE of 10).

Regression

from tpot import TPOTRegressor
from sklearn.model_selection import RepeatedStratifiedKFold
# Create the model
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=123)
tpot = TPOTRegressor(generations=5, population_size=5, cv=cv,
                     scoring='neg_mean_absolute_error', verbosity=2,
                     random_state=123, n_jobs=-1)
# Fit the training data
tpot.fit(X_train, y_train)
# Export the result
tpot.export('tpot_model.py')

Output:

Generation 1 - Current best internal CV score: -20390.588131563232
Generation 2 - Current best internal CV score: -19654.82630417806
Generation 3 - Current best internal CV score: -19312.09139004322
Generation 4 - Current best internal CV score: -19312.09139004322
Generation 5 - Current best internal CV score: -18752.921100941825
Best pipeline: RandomForestRegressor(input_matrix, bootstrap=True, max_features=0.25, min_samples_leaf=3, min_samples_split=2, n_estimators=100)

Classification

from tpot import TPOTClassifier
from sklearn.model_selection import RepeatedStratifiedKFold
# TPOT can be stopped early; it still returns the temporary best pipeline.
# Create the model
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=123)
tpot = TPOTClassifier(generations=5, population_size=5, cv=cv,
                      scoring='accuracy', verbosity=2,
                      random_state=123, n_jobs=-1)
# Fit the training data
tpot.fit(X_train, y_train)
# Export the result
tpot.export('tpot_model.py')

Output:

Generation 1 - Current best internal CV score: 0.7432273262661955
Generation 2 - Current best internal CV score: 0.843824979278454
Generation 3 - Current best internal CV score: 0.8545565589146273
Generation 4 - Current best internal CV score: 0.8545565589146273
Generation 5 - Current best internal CV score: 0.859616978580465
Best pipeline: RandomForestClassifier(GradientBoostingClassifier(input_matrix, learning_rate=0.001, max_depth=2, max_features=0.7000000000000001, min_samples_leaf=1, min_samples_split=19, n_estimators=100, subsample=0.15000000000000002), bootstrap=True, criterion=gini, max_features=0.8500000000000001, min_samples_leaf=4, min_samples_split=12, n_estimators=100)

For the regression task, TPOT gives RandomForestRegressor as the best pipeline. For classification, it gives a stack of GradientBoostingClassifier and RandomForestClassifier. All algorithms already have their hyperparameters tuned.

Here is how to see the scoring metrics on the validation data.

Regression

pred_tpot = results
# Scatter plot true and predicted values
plt.scatter(pred_tpot, y_val, alpha=0.2)
plt.xlabel('predicted')
plt.ylabel('true value')
plt.text(100000, 400000, 'RMSE: ' + str(round(MSE(y_val, pred_tpot)**0.5)))
plt.text(100000, 350000, 'MAE: ' + str(round(mean_absolute_error(y_val, pred_tpot))))
plt.text(100000, 300000, 'R: ' + str(round(np.corrcoef(y_val, pred_tpot)[0,1], 4)))
plt.show()

Output:

Fig. 2 TPOT regression result.

Classification

pred_tpot = results
# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_tpot)))
print('')
# Prediction results
print('Confusion Matrix')
print(pd.DataFrame(confusion_matrix(y_val, pred_tpot), index=[1,2,3,4], columns=[1,2,3,4]))
print('')
print('Classification Report')
print(classification_report(y_val, pred_tpot))

Output:

Accuracy: 0.9246467817896389

Confusion Matrix
      1    2    3     4
1   117   11    7    16
2     6  288   10    15
3     2   18  186    36
4     5   12    6  1176

Classification Report
              precision    recall  f1-score   support
           1       0.90      0.77      0.83       151
           2       0.88      0.90      0.89       319
           3       0.89      0.77      0.82       242
           4       0.95      0.98      0.96      1199
    accuracy                           0.92      1911
   macro avg       0.90      0.86      0.88      1911
weighted avg       0.92      0.92      0.92      1911

Distributed Asynchronous Hyper-parameter Optimization (Hyperopt)

Hyperopt is usually used to optimize the hyperparameters of one model that has already been specified. For example, we might decide to apply Random Forest and then run hyperopt to find the optimal hyperparameters for that Random Forest; my previous post discussed this. This post is different in that it uses hyperopt to search for the best Machine Learning model automatically, not just to tune the hyperparameters of a chosen one. The code is similar, but not identical.
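For contrast, here is a minimal sketch of that conventional single-model use: tuning only a Random Forest with hyperopt's fmin. The search space and synthetic data are illustrative assumptions, not the notebook's actual setup.

from hyperopt import fmin, tpe, hp, Trials
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the training set.
X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# Hyperparameter space for one pre-chosen model family only.
space = {
    "n_estimators": hp.choice("n_estimators", [100, 200, 400]),
    "max_depth": hp.choice("max_depth", [4, 8, 16, None]),
}

def objective(params):
    model = RandomForestRegressor(random_state=0, **params)
    # cross_val_score returns negative MAE; negate it so fmin minimizes plain MAE.
    neg_mae = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error").mean()
    return -neg_mae

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=20, trials=Trials())
print(best)  # indices of the chosen options in each hp.choice list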

The code below shows how to use hyperopt to run AutoML. A maximum of 50 evaluations and a trial timeout of 20 seconds are set; these determine how long the AutoML will work. As with TPOT, we do not set a total time limit in hyperopt.

Regression

from hpsklearn import HyperoptEstimator
from hpsklearn import any_regressor
from hpsklearn import any_preprocessing
from hyperopt import tpe
from sklearn.metrics import mean_squared_error
# Create the model
hyperopt = HyperoptEstimator(regressor=any_regressor('reg'),
                             preprocessing=any_preprocessing('pre'),
                             loss_fn=mean_squared_error,
                             algo=tpe.suggest, max_evals=50, trial_timeout=20)
# Fit the data
hyperopt.fit(X_train, y_train)

Classification

from hpsklearn import HyperoptEstimator
from hpsklearn import any_classifier
from hpsklearn import any_preprocessing
from hyperopt import tpe
# Create the model
hyperopt = HyperoptEstimator(classifier=any_classifier('cla'),
                             preprocessing=any_preprocessing('pre'),
                             algo=tpe.suggest, max_evals=50, trial_timeout=30)
# Fit the training data
hyperopt.fit(X_train_ar, y_train_ar)

In the Kaggle notebook (in the table above), every time I finish fitting and predicting, I show the validation results as a scatter plot, a confusion matrix, and a classification report. The code is almost the same each time, with only small adjustments, so from this point onwards I will not repeat it in this post; the Kaggle notebook provides it.

To see the algorithms from the AutoML search result, use the code below. The results are ExtraTreesClassifier and XGBRegressor. Observe that it also searches over preprocessing techniques, such as the standard scaler and the normalizer.

# Show the models
print(hyperopt.best_model())

Regression

{'learner': XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=0.6209369845565308, colsample_bynode=1, colsample_bytree=0.6350745975782562, gamma=0.07330922089021298, gpu_id=-1, importance_type='gain', interaction_constraints='', learning_rate=0.0040826994703554555, max_delta_step=0, max_depth=10, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=2600, n_jobs=1, num_parallel_tree=1, objective='reg:linear', random_state=3, reg_alpha=0.4669165283261672, reg_lambda=2.2280355282357056, scale_pos_weight=1, seed=3, subsample=0.7295609371405459, tree_method='exact', validate_parameters=1, verbosity=None), 'preprocs': (Normalizer(norm='l1'),), 'ex_preprocs': ()}

Classification

{'learner': ExtraTreesClassifier(bootstrap=True, max_features='sqrt', n_estimators=308,
                                 n_jobs=1, random_state=1, verbose=False),
 'preprocs': (StandardScaler(with_std=False),),
 'ex_preprocs': ()}

AutoKeras

AutoKeras, as you might guess, is an autoML specializing in Deep Learning or Neural networks. The “Keras” in the name gives the clue. AutoKeras helps in finding the best neural network architecture and hyperparameters for the prediction model. Unlike the other AutoML, AutoKeras does not consider tree-based, distance-based, or other Machine Learning algorithms.

Deep Learning is challenging not only for the hyperparameter tuning but also for the architecture setting. Many ask how many neurons or layers are best to use, and there is no clear answer. Conventionally, users must run and evaluate their Deep Learning architectures one by one before finally deciding which one is best, which takes a lot of time and resources. I wrote a post describing this here. AutoKeras, however, can solve this problem.
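To make that contrast concrete, here is a hand-written Keras baseline of the sort AutoKeras searches over automatically. The layer sizes, dropout rate, learning rate, and input width are illustrative guesses, not what AutoKeras will actually pick.

import tensorflow as tf

n_features = 20  # assumed width of the feature matrix

# Every choice below (number of layers, units, dropout, optimizer settings)
# is exactly what AutoKeras tries to decide for us.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(n_features,)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Dense(1),  # single output for a regression head
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse", metrics=["mae"])
# model.fit(X_train, y_train, epochs=100, validation_data=(X_val, y_val))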

To apply AutoKeras, I set max_trials to 8, so it will try to find the best deep learning architecture within a maximum of 8 trials. The number of epochs set while fitting the training dataset also affects the accuracy of the model.

Regression

!pip install autokeras
import autokeras
# Create the model
keras = autokeras.StructuredDataRegressor(max_trials=8)
# Fit the training dataset
keras.fit(X_train, y_train, epochs=100)
# Predict the validation data
pred_keras = keras.predict(X_val)

Classification

!pip install autokeras
import autokeras
# Create the model
keras = autokeras.StructuredDataClassifier(max_trials=8)
# Fit the training dataset
keras.fit(X_train, y_train, epochs=100)
# Predict the validation data
pred_keras = keras.predict(X_val)
# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val, pred_keras)))

To find the architecture of the AutoKeras search, use the following code.

# Show the built models
keras_export = keras.export_model()
keras_export.summary()

Regression

Model: "model"
Layer (type)                  Output Shape    Param #
=======================================================
input_1 (InputLayer)          [(None, 20)]    0
multi_category_encoding (Mul  (None, 20)      0
dense (Dense)                 (None, 512)     10752
re_lu (ReLU)                  (None, 512)     0
dense_1 (Dense)               (None, 32)      16416
re_lu_1 (ReLU)                (None, 32)      0
regression_head_1 (Dense)     (None, 1)       33
=======================================================
Total params: 27,201
Trainable params: 27,201
Non-trainable params: 0

Classification

Model: "model"
Layer (type)                  Output Shape    Param #
=======================================================
input_1 (InputLayer)          [(None, 71)]    0
multi_category_encoding (Mul  (None, 71)      0
normalization (Normalization  (None, 71)      143
dense (Dense)                 (None, 512)     36864
re_lu (ReLU)                  (None, 512)     0
dense_1 (Dense)               (None, 32)      16416
re_lu_1 (ReLU)                (None, 32)      0
dropout (Dropout)             (None, 32)      0
dense_2 (Dense)               (None, 4)       132
classification_head_1 (Softm  (None, 4)       0
=======================================================
Total params: 53,555
Trainable params: 53,412
Non-trainable params: 143

MLJAR

MLJAR is another great AutoML, and you will soon see why. To run MLJAR, I assign the arguments mode, eval_metric, total_time_limit, and features_selection. MLJAR infers whether the task is regression or classification from the eval_metric. The total_time_limit is how long we allow MLJAR to run, in seconds; here it gets 300 seconds, or 5 minutes, to find the best possible model. We can also specify whether to allow feature selection. The output then reports the algorithms used and how long each took to finish.

Regression

from supervised.automl import AutoML # Create the model mljar = AutoML(mode="Compete", eval_metric="rmse", total_time_limit=300, features_selection=True) # Fit the training data mljar.fit(X_train, y_train) # Predict the training data mljar_pred = mljar.predict(X_val)

Classification

from supervised.automl import AutoML # Create the model mljar = AutoML(mode="Compete", eval_metric="accuracy", total_time_limit=300, features_selection=True) # Fit the training data mljar.fit(X_train, y_train) # Predict the training data mljar_pred = mljar.predict(X_val)

The “mode” argument lets us tell MLJAR what it is expected to do. There are 4 modes defining the purpose of running MLJAR. In the example code above, the mode “Compete” is used for winning a competition: it finds the best model through tuning and ensembling methods. The mode “Optuna” is used to find the best-tuned model with unlimited computation time. The mode “Perform” builds a Machine Learning pipeline for production. The mode “Explain” is used for data explanation.

The result of MLJAR is automatically reported and visualized. Unfortunately, Kaggle does not display the report after saving, so below is how it should look. The report compares the MLJAR results for every algorithm. We can see the ensemble methods have the lowest MSE for the regression task and the highest accuracy for the classification task, and that increasing the number of iterations lowers the MSE for regression and improves the accuracy for classification. (The leaderboard tables below actually have more rows, but they were cut.)

Regression

Fig. 3 MLJAR Report for regression

Fig. 4 MLJAR for regession

Fig. 5 MLJAR report for classification (1). Fig. 6 MLJAR report for classification (2). 

 AutoGluon

AutoGluon requires users to format the training dataset using TabularDataset to recognize it. Users can then specify the time_limit allocation for AutoGluon to work. In the example code below, I set it to be 120 seconds or 2 minutes.

Regression

!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0"
!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor
# Prepare the data
Xy_train = X_train.reset_index(drop=True)
Xy_train['Target'] = y_train
Xy_val = X_val.reset_index(drop=True)
Xy_val['Target'] = y_val
X_train_gluon = TabularDataset(Xy_train)
X_val_gluon = TabularDataset(Xy_val)
# Fit the training data
gluon = TabularPredictor(label='Target').fit(X_train_gluon, time_limit=120)
# Predict the validation data
gluon_pred = gluon.predict(X_val)

Classification

!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0"
!pip install autogluon
from autogluon.tabular import TabularDataset, TabularPredictor
# Prepare the data
Xy_train = X_train.reset_index(drop=True)
Xy_train['Target'] = y_train
Xy_val = X_val.reset_index(drop=True)
Xy_val['Target'] = y_val
X_train_gluon = TabularDataset(Xy_train)
X_val_gluon = TabularDataset(Xy_val)
# Fit the training data
gluon = TabularPredictor(label='Target').fit(X_train_gluon, time_limit=120)
# Predict the validation data
gluon_pred = gluon.predict(X_val)

After finishing the task, AutoGluon can report the accuracy of each Machine Learning algorithm it tried. The report is called a leaderboard. The tables below actually have more columns, but I cut them for this post.

# Show the models
leaderboard = gluon.leaderboard(X_train_gluon)
leaderboard

Regression

model score_test score_val pred_time_test . . .

0 RandomForestMSE -15385.131260 -23892.159881 0.133275 . . .

1 ExtraTreesMSE -15537.139720 -24981.601931 0.137063 . . .

2 LightGBMLarge -17049.125557 -26269.841824 0.026560 . . .

3 XGBoost -18142.996982 -23573.451829 0.054067 . . .

4 KNeighborsDist -18418.785860 -41132.826848 0.135036 . . .

5 CatBoost -19585.309377 -23910.403833 0.004854 . . .

6 WeightedEnsemble_L2 -20846.144676 -22060.013365 1.169406 . . .

7 LightGBM -23615.121228 -23205.065207 0.024396 . . .

8 LightGBMXT -25261.893395 -24608.580984 0.015091 . . .

9 NeuralNetMXNet -28904.712029 -24104.217749 0.819149 . . .

10 KNeighborsUnif -39243.784302 -39545.869493 0.132839 . . .

11 NeuralNetFastAI -197411.475391 -191261.448480 0.070965 . . .

Classification

model score_test score_val pred_time_test . . .

0 WeightedEnsemble_L2 0.986651 0.963399 3.470253 . . .

1 LightGBM 0.985997 0.958170 0.600316 . . .

2 XGBoost 0.985997 0.956863 0.920570 . . .

3 RandomForestEntr 0.985866 0.954248 0.366476 . . .

4 RandomForestGini 0.985735 0.952941 0.397669 . . .

5 ExtraTreesEntr 0.985735 0.952941 0.398659 . . .

6 ExtraTreesGini 0.985735 0.952941 0.408386 . . .

7 KNeighborsDist 0.985473 0.950327 2.013774 . . .

8 LightGBMXT 0.984034 0.951634 0.683871 . . .

9 NeuralNetFastAI 0.983379 0.947712 0.340936 . . .

10 NeuralNetMXNet 0.982332 0.956863 2.459954 . . .

11 CatBoost 0.976574 0.934641 0.044412 . . .

12 KNeighborsUnif 0.881560 0.769935 1.970972 . . .

13 LightGBMLarge 0.627143 0.627451 0.014708 . . .

H2O

Similar to AutoGluon, H2O requires the training dataset in a certain format, called H2OFrame. To decide how long H2O will work, either max_runtime_secs or max_models must be specified. The names explain what they mean.

Regression

import h2o
from h2o.automl import H2OAutoML
h2o.init()
# Prepare the data
Xy_train = X_train.reset_index(drop=True)
Xy_train['SalePrice'] = y_train.reset_index(drop=True)
Xy_val = X_val.reset_index(drop=True)
Xy_val['SalePrice'] = y_val.reset_index(drop=True)
# Convert to H2O Frames
Xy_train_h2o = h2o.H2OFrame(Xy_train)
X_val_h2o = h2o.H2OFrame(X_val)
# Create the model
h2o_model = H2OAutoML(max_runtime_secs=120, seed=123)
# Fit the model
h2o_model.train(x=Xy_train_h2o.columns, y='SalePrice', training_frame=Xy_train_h2o)
# Predict the validation data
h2o_pred = h2o_model.predict(X_val_h2o)

Classification

import h2o
from h2o.automl import H2OAutoML
h2o.init()
# Convert to H2O Frames
Xy_train_h2o = h2o.H2OFrame(Xy_train)
X_val_h2o = h2o.H2OFrame(X_val)
Xy_train_h2o['Target'] = Xy_train_h2o['Target'].asfactor()
# Create the model
h2o_model = H2OAutoML(max_runtime_secs=120, seed=123)
# Fit the model
h2o_model.train(x=Xy_train_h2o.columns, y='Target', training_frame=Xy_train_h2o)
# Predict the validation data
h2o_pred = h2o_model.predict(X_val_h2o)
h2o_pred

For the classification task, the prediction result is a multiclass output: alongside the predicted class, it gives a probability value for each class. Below is an example of the classification.

predict p1 p2 p3 p4

4 0.0078267 0.0217498 0.0175197 0.952904

4 0.00190617 0.00130162 0.00116375 0.995628

4 0.00548938 0.0156449 0.00867845 0.970187

3 0.00484961 0.0161661 0.970052 0.00893224

2 0.0283297 0.837641 0.0575789 0.0764503

3 0.00141621 0.0022694 0.992301 0.00401299

4 0.00805432 0.0300103 0.0551097 0.906826
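If you want to work with these predictions in pandas (to build a confusion matrix, say), the prediction frame can be converted back. A minimal sketch, continuing from the h2o_pred computed above and assuming the columns shown in the table:

# Convert the H2O prediction frame to a pandas DataFrame and pull the hard labels.
pred_df = h2o_pred.as_data_frame()   # columns: predict, p1, p2, p3, p4
labels = pred_df["predict"].astype(int)
print(labels.value_counts())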

H2O reports its result by a simple table showing various scoring metrics of each Machine Learning algorithm.

# Show the model results
leaderboard_h2o = h2o.automl.get_leaderboard(h2o_model, extra_columns='ALL')
leaderboard_h2o

Regression output:

model_id mean_residual_deviance rmse mse mae rmsle …

GBM_grid__1_AutoML_20240811_022746_model_17 8.34855e+08 28893.9 8.34855e+08 18395.4 0.154829 …

GBM_1_AutoML_20240811_022746 8.44991e+08 29068.7 8.44991e+08 17954.1 0.149824 …

StackedEnsemble_BestOfFamily_AutoML_20240811_022746 8.53226e+08 29210 8.53226e+08 18046.8 0.149974 …

GBM_grid__1_AutoML_20240811_022746_model_1 8.58066e+08 29292.8 8.58066e+08 17961.7 0.153238 …

GBM_grid__1_AutoML_20240811_022746_model_2 8.91964e+08 29865.8 8.91964e+08 17871.9 0.1504 …

GBM_grid__1_AutoML_20240811_022746_model_10 9.11731e+08 30194.9 9.11731e+08 18342.2 0.153421 …

GBM_grid__1_AutoML_20240811_022746_model_21 9.21185e+08 30351 9.21185e+08 18493.5 0.15413 …

GBM_grid__1_AutoML_20240811_022746_model_8 9.22497e+08 30372.6 9.22497e+08 19124 0.159135 …

GBM_grid__1_AutoML_20240811_022746_model_23 9.22655e+08 30375.2 9.22655e+08 17876.6 0.150722 …

XGBoost_3_AutoML_20240811_022746 9.31315e+08 30517.5 9.31315e+08 19171.1 0.157819 …

Classification

model_id mean_per_class_error logloss rmse mse …

StackedEnsemble_BestOfFamily_AutoML_20240608_143533 0.187252 0.330471 0.309248 0.0956343 …

StackedEnsemble_AllModels_AutoML_20240608_143533 0.187268 0.331742 0.309836 0.0959986 …

DRF_1_AutoML_20240608_143533 0.214386 4.05288 0.376788 0.141969 …

GBM_grid__1_AutoML_20240608_143533_model_1 0.266931 0.528616 0.415268 0.172447 …

XGBoost_grid__1_AutoML_20240608_143533_model_1 0.323726 0.511452 0.409528 0.167713 …

GBM_4_AutoML_20240608_143533 0.368778 1.05257 0.645823 0.417088 …

GBM_grid__1_AutoML_20240608_143533_model_2 0.434227 1.10232 0.663382 0.440075 …

GBM_3_AutoML_20240608_143533 0.461059 1.08184 0.655701 0.429944 …

GBM_2_AutoML_20240608_143533 0.481588 1.08175 0.654895 0.428887 …

XGBoost_1_AutoML_20240608_143533 0.487381 1.05534 0.645005 0.416031 …

PyCaret

This is the longest AutoML code that this post explores. PyCaret does not need the features (X_train) and label (y_train) to be split apart, so the code below only randomly splits the training dataset into a new training dataset and a validation dataset. Preprocessing, such as filling in missing data or feature selection, is not required beforehand either. We then set up PyCaret by assigning the data, the target variable or label, the numeric imputation method, the categorical imputation method, whether to use normalization, whether to remove multicollinearity, and so on.

Regression

!pip install pycaret
from pycaret.regression import *
# Generate random numbers
val_index = np.random.choice(range(trainSet.shape[0]), round(trainSet.shape[0]*0.2), replace=False)
# Split trainSet
trainSet1 = trainSet.drop(val_index)
trainSet2 = trainSet.iloc[val_index,:]
# Create the model
caret = setup(data=trainSet1, target='SalePrice', session_id=111,
              numeric_imputation='mean', categorical_imputation='constant',
              normalize=True, combine_rare_levels=True, rare_level_threshold=0.05,
              remove_multicollinearity=True, multicollinearity_threshold=0.95)

Classification

!pip install pycaret
from pycaret.classification import *
# Generate random numbers
val_index = np.random.choice(range(trainSet.shape[0]), round(trainSet.shape[0]*0.2), replace=False)
# Split trainSet
trainSet1 = trainSet.drop(val_index)
trainSet2 = trainSet.iloc[val_index,:]
# Create the model
caret = setup(data=trainSet1, target='Target', session_id=123,
              numeric_imputation='mean', categorical_imputation='constant',
              normalize=True, combine_rare_levels=True, rare_level_threshold=0.05,
              remove_multicollinearity=True, multicollinearity_threshold=0.95)

After that, we can run PyCaret by specifying how many cross-validation folds we want. PyCaret for regression returns several models sorted by the best scoring metrics (MAE, MSE, RMSE, R2, RMSLE, and MAPE); the top models are Bayesian Ridge, Huber Regressor, Orthogonal Matching Pursuit, Ridge Regression, and Passive-Aggressive Regressor. PyCaret for classification also returns several models; the top models are the Extra Trees Classifier, Random Forest Classifier, Decision Tree Classifier, Extreme Gradient Boosting, and Light Gradient Boosting Machine. The tables below are truncated in rows and columns; find the complete tables in the Kaggle notebook.

# Show the models
caret_models = compare_models(fold=5)

Regression

Model MAE MSE RMSE R2 …

br Bayesian Ridge 15940.2956 566705805.8954 23655.0027 0.9059 …

huber Huber Regressor 15204.0960 588342119.6640 23988.3772 0.9033 …

omp Orthogonal Matching Pursuit 16603.0485 599383228.9339 24383.2437 0.9001 …

ridge Ridge Regression 16743.4660 605693331.2000 24543.6840 0.8984 …

par Passive Aggressive Regressor 15629.1539 630122079.3113 24684.8617 0.8972 …

… … … … … … …

Classification

Model Accuracy AUC Recall Prec. …

et Extra Trees Classifier 0.8944 0.9708 0.7912 0.8972 …

rf Random Forest Classifier 0.8634 0.9599 0.7271 0.8709 …

dt Decision Tree Classifier 0.8436 0.8689 0.7724 0.8448 …

xgboost Extreme Gradient Boosting 0.8417 0.9455 0.7098 0.8368 …

lightgbm Light Gradient Boosting Machine 0.8337 0.9433 0.6929 0.8294 …

… … … … … … …

To create the top 5 models, run the following code.

Regression

# Create the top 5 models
br = create_model('br', fold=5)
huber = create_model('huber', fold=5)
omp = create_model('omp', fold=5)
ridge = create_model('ridge', fold=5)
par = create_model('par', fold=5)

Classification

# Create the top 5 models
et = create_model('et', fold=5)
rf = create_model('rf', fold=5)
dt = create_model('dt', fold=5)
xgboost = create_model('xgboost', fold=5)
lightgbm = create_model('lightgbm', fold=5)
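A note on the stacking code further below: it refers to caret_models_5, a list of the top-5 models from the notebook. One way to build such a list (an assumption on my part about how it was created, although compare_models does support this) is to ask compare_models to return the top N models directly:

# Return the top 5 models as a list instead of only the single best one
caret_models_5 = compare_models(n_select=5, fold=5)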

To tune the selected model, run the following code.

Regression

# Tune the models, BR: Regression
br_tune = tune_model(br, fold=5)
# Show the tuned hyperparameters, for example for BR: Regression
plot_model(br_tune, plot='parameter')

Classification

# Tune the models, LightGBM: Classification
lightgbm_tune = tune_model(lightgbm, fold=5)
# Show the tuned hyperparameters, for example for LightGBM: Classification
plot_model(lightgbm_tune, plot='parameter')
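If you also want to keep the cross-validation score grid that tune_model prints, rather than only viewing it, PyCaret provides a pull() helper that returns the most recently displayed table as a dataframe. A minimal sketch (the variable name is just illustrative):

# Capture the CV score grid printed by the last command (here: tune_model)
tuned_scores = pull()
print(tuned_scores)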

PyCaret lets users manually apply ensemble methods such as bagging, boosting, stacking, and blending. The code below performs each ensemble method.

Regression

# Bagging BR
br_bagging = ensemble_model(br_tune, fold=5)
# Boosting BR
br_boost = ensemble_model(br_tune, method='Boosting', fold=5)
# Stacking with Huber as the meta-model (caret_models_5 holds the top-5 models)
stack = stack_models(caret_models_5, meta_model=huber, fold=5)
# Blending top models (huber_tune, omp_tune, ridge_tune, par_tune are tuned the same way as br_tune above)
caret_blend = blend_models(estimator_list=[br_tune, huber_tune, omp_tune, ridge_tune, par_tune])

Classification

# Bagging LightGBM
lightgbm_bagging = ensemble_model(lightgbm_tune, fold=5)
# Boosting LightGBM
lightgbm_boost = ensemble_model(lightgbm_tune, method='Boosting', fold=5)
# Stacking with ET as the meta-model (caret_models_5 holds the top-5 models)
stack = stack_models(caret_models_5, meta_model=et, fold=5)
# Blending top models
caret_blend = blend_models(estimator_list=[lightgbm_tune, rf, dt])

Now, let’s choose the blended models as the predictive models. The following code uses them to predict the validation datasets.

Regression

# Predict the validation data
caret_pred = predict_model(caret_blend, data=trainSet2.drop(columns=['SalePrice']))
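As a quick sanity check of the blended regression model, we can compare its predictions with the SalePrice values still available in trainSet2. A minimal sketch, assuming PyCaret 2.x, where predict_model appends the predictions in a 'Label' column:

import numpy as np
from sklearn.metrics import mean_squared_error

# Compare the blended model's predictions ('Label' column in the predict_model output)
# with the held-out SalePrice values from trainSet2
rmse = np.sqrt(mean_squared_error(trainSet2['SalePrice'], caret_pred['Label']))
print('Validation RMSE:', round(rmse, 2))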

Classification

# Predict the validation data
pred_caret = predict_model(caret_blend, data=trainSet2.drop(columns=['Target']))

AutoViML

I run AutoViML in the notebook by assigning many arguments. Just like PyCaret, AutoViML does not require splitting the features (X_train) from the label (y_train). Users only need to supply the training dataset (trainSet1) and the validation dataset (trainSet2). Users can specify other parameters, such as the scoring parameter, the hyperparameter search method, feature reduction, boosting, binning, and so on.

Regression

!pip install autoviml
!pip install shap
from autoviml.Auto_ViML import Auto_ViML

# Create the model
viml, features, train_v, test_v = Auto_ViML(trainSet1, 'SalePrice', trainSet2.drop(columns=['SalePrice']),
                                            scoring_parameter='', hyper_param='RS',
                                            feature_reduction=True, Boosting_Flag=True,
                                            Binning_Flag=False, Add_Poly=0, Stacking_Flag=False,
                                            Imbalanced_Flag=True, verbose=1)

Classification

!pip install autoviml
!pip install shap
from autoviml.Auto_ViML import Auto_ViML

# Create the model
viml, features, train_v, test_v = Auto_ViML(trainSet1, 'Target', trainSet2.drop(columns=['Target']),
                                            scoring_parameter='balanced_accuracy', hyper_param='RS',
                                            feature_reduction=True, Boosting_Flag=True,
                                            Binning_Flag=False, Add_Poly=0, Stacking_Flag=False,
                                            Imbalanced_Flag=True, verbose=1)
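Auto_ViML returns the fitted model (viml), the selected feature names (features), and the modified train and validation frames (train_v and test_v). A quick, hedged way to inspect what it produced (the exact contents of these frames vary by AutoViML version, so treat this as a sketch):

# Inspect the Auto_ViML outputs
print(features[:10])                  # first few selected feature names
print(train_v.shape, test_v.shape)    # modified training and validation frames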

After fitting the training data, we can examine from the output what has been done. For example, we can see that AutoViML preprocessed the data by filling in missing values. Similar to MLJAR, AutoViML also produces visual reports. For the regression task, it visualizes a scatter plot of the true and predicted values from the XGBoost model, and it plots the prediction residual errors in a histogram. From the two plots, we can observe that the model is reasonably accurate and that the residual errors are approximately normally distributed.

The next report covers the ensemble method trials; it is produced for both the regression and classification tasks. The last graph displays the feature importances. We can see that the most important feature for predicting house prices is the exterior material quality, followed by the overall material and finish quality, and the garage size.

Fig. 7 Scatter plot of true and predicted values, and residual error histogram.

Fig. 8 Feature importances.

As for the classification task, it does not visualize a scatter plot; instead, it plots the ROC curve for each of the four prediction classes, plus the micro-average and macro-average curves, together with the AUC values. We can see that all of the classes have AUC values above 0.98. The next chart reports the iso-F1 curves as the accuracy metric, and a classification report follows.

Fig. 9 ROC curves and iso-f1 curves

Average precision score, micro-averaged over all classes: 0.97
Macro F1 score, averaged over all classes: 0.83
#####################################################
              precision    recall  f1-score   support

           0       0.84      0.97      0.90       963
           1       0.51      0.59      0.54       258
           2       0.18      0.11      0.14       191
           3       0.00      0.00      0.00       118

    accuracy                           0.73      1530
   macro avg       0.38      0.42      0.40      1530
weighted avg       0.64      0.73      0.68      1530

[[938  19   6   0]
 [106 151   1   0]
 [ 53 117  21   0]
 [ 16  12  90   0]]

The following code prints out the result of the AutoViML search. We can see that the best model is an XGBRegressor for the regression task and an XGBClassifier (wrapped in a calibrated one-vs-rest classifier) for the classification task.

viml

Regression

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.7, gamma=1, gpu_id=0,
             grow_policy='depthwise', importance_type='gain',
             interaction_constraints='', learning_rate=0.1, max_delta_step=0,
             max_depth=8, max_leaves=0, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=100, n_jobs=-1, nthread=-1,
             num_parallel_tree=1, objective='reg:squarederror',
             predictor='cpu_predictor', random_state=1, reg_alpha=0.5,
             reg_lambda=0.5, scale_pos_weight=1, seed=1, subsample=0.7,
             tree_method='hist', ...)

Classification

CalibratedClassifierCV(base_estimator=OneVsRestClassifier(estimator=XGBClassifier(base_score=None,
    booster='gbtree', colsample_bylevel=None, colsample_bynode=None,
    colsample_bytree=None, gamma=None, gpu_id=None, importance_type='gain',
    interaction_constraints=None, learning_rate=None, max_delta_step=None,
    max_depth=None, min_child_weight=None, missing=nan,
    monotone_constraints=None, n_estimators=200, n_jobs=-1, nthread=-1,
    num_parallel_tree=None, objective='binary:logistic', random_state=99,
    reg_alpha=None, reg_lambda=None, scale_pos_weight=None, subsample=None,
    tree_method=None, use_label_encoder=True, validate_parameters=None,
    verbosity=None), n_jobs=None), cv=5, method='isotonic')

LightAutoML

Now, we have reached the 10th AutoML package. LightAutoML is expected to be lightweight, as its name suggests. Here, we set the task to 'reg' for regression, 'multiclass' for multiclass classification, or 'binary' for binary classification. We can also set the metric and loss inside the task. I set the timeout to 3 minutes to let it search for the best model. After a simple fit_predict and predict, we already have the result.

Regression

!pip install openpyxl
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

# Create the model
light = TabularAutoML(task=Task('reg'), timeout=60*3, cpu_limit=4)
train_data = pd.concat([X_train, y_train], axis=1)
# Fit the training data
train_light = light.fit_predict(train_data, roles={'target': 'SalePrice', 'drop': []})
# Predict the validation data
pred_light = light.predict(X_val)
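To check how close these regression predictions are to the truth, we can score them against the validation labels. A minimal sketch, assuming y_val holds the SalePrice values that correspond to X_val; for the 'reg' task, pred_light.data is an (n, 1) NumPy array:

import numpy as np
from sklearn.metrics import mean_squared_error

# Score the LightAutoML regression predictions against the held-out labels
rmse = np.sqrt(mean_squared_error(y_val, pred_light.data[:, 0]))
print('Validation RMSE:', round(rmse, 2))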

Classification

!pip install openpyxl
from lightautoml.automl.presets.tabular_presets import TabularAutoML
from lightautoml.tasks import Task

train_data = pd.concat([X_train, y_train], axis=1)
# Create the model
light = TabularAutoML(task=Task('multiclass'), timeout=60*3, cpu_limit=4)
# Fit the training data
train_light = light.fit_predict(train_data, roles={'target': 'Target'})
# Predict the validation data
pred_light = light.predict(X_val)

The results for the classification task are the probability of each class, from which the predicted class can be derived. In other words, the output covers both the per-class probabilities and the hard class predictions.

# Convert the prediction result into a dataframe
pred_light2 = pred_light.data
pred_light2 = pd.DataFrame(pred_light2, columns=['4','2','3','1'])
pred_light2 = pred_light2[['1','2','3','4']]
pred_light2['Pred'] = pred_light2.idxmax(axis=1)
pred_light2['Pred'] = pred_light2['Pred'].astype(int)
pred_light2.head()

1 2 3 4 Pred

0 0.00 0.01 0.00 0.99 4

1 0.00 0.00 0.00 1.00 4

2 0.00 0.04 0.00 0.96 4

3 0.00 0.01 0.98 0.01 3

4 0.02 0.38 0.34 0.27 2
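To quantify these predictions, we can compare the Pred column with the held-out labels. A minimal sketch, assuming y_val contains the true classes encoded as the integers 1 to 4:

from sklearn.metrics import accuracy_score, classification_report

# Score the LightAutoML class predictions against the validation labels
print(accuracy_score(y_val, pred_light2['Pred']))
print(classification_report(y_val, pred_light2['Pred']))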

Conclusion

We have now discussed 10 AutoML packages. Note that there are still more AutoML packages that this post has not covered. AutoML is also available beyond Python notebooks, for example as cloud-computing services or standalone software. Which AutoML is your favourite? Or do you have another AutoML in mind?

Fig. 10 AutoML packages. Are you still going to run 10 conventional Machine Learning models, or do you prefer the 10 AutoML packages?

About Author

References

Fig. 3 - Image by Author

Fig. 4 - Image by Author

Fig. 5 - Image by Author

Fig. 6 - Image by Author

Fig. 10 - sklearn, tpot, hyperopt, autokeras, mljar, autogluon, h2o, pycaret, autoviml

Connect with me here.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

