Key Differences Between Data Science And Data Analytics


Data Science and Data Analytics: Learn what makes each type of analysis different

The rise of Big Data has spawned two new industry buzzwords: Data Science and Data Analytics. Today, the entire globe significantly contributes to tremendous data growth, hence the term “Big Data.”

Texts, emails, tweets, user queries on search engines, social media chatter, data created by IoT and connected devices: everything we do online is Big Data. The data produced by the digital world every day is so enormous and complicated that standard data processing and analysis technologies cannot handle it. This is where Data Science and Data Analytics come into play. Because Big Data, Data Science, and Data Analytics are all still developing fields, the two terms are frequently used interchangeably. The leading cause of the confusion is that both data scientists and data analysts deal with Big Data, and the substantial differences between the two roles fuel the data science versus data analytics debate.

Data science and data analytics both deal with Big Data, but each uses a different strategy. Data analytics falls under the umbrella of data science. Data science combines mathematics, statistics, computer science, information science, machine learning, and artificial intelligence; it encompasses data mining, data inference, predictive modeling, and the creation of ML algorithms, all aimed at discovering patterns in large datasets and turning them into practical business strategies. Data analytics, by contrast, leans primarily on statistics, mathematics, and statistical analysis. Data science looks for novel, original questions that might spur business innovation, while data analytics seeks answers to existing questions and determines how to act on them inside an organization to promote data-driven decision-making.

Data scientists and data analysts also use data differently. Data scientists clean, analyze, and evaluate data to derive insights using a combination of mathematical, statistical, and machine-learning approaches, building sophisticated data modeling processes from ML algorithms, predictive models, custom analyses, and prototypes. Data analysts gather large amounts of data, organize it, and analyze it to find pertinent patterns, evaluating data sets to detect trends and draw conclusions. Following the analytical phase, they present their findings using data visualization techniques such as charts and graphs, translating detailed results into business-savvy language that both technical and non-technical people in a company can understand. This is another distinction between data analytics and data science. Both roles perform varying degrees of data gathering, cleansing, and analysis to provide valuable insights for data-driven decision-making, so their duties frequently overlap, leaving people to wonder: are data analytics and data science the same?

Data scientists’ responsibilities

to prepare and clean data and verify its accuracy.

to perform exploratory data analysis on massive datasets.

to create ETL pipelines for data mining.

to perform statistical analysis using ML methods such as decision trees, random forests, logistic regression, and KNN (see the sketch just after this list).

to create helpful ML libraries and automate code.

to employ machine learning techniques and algorithms to derive business insights.

to find new trends in data and anticipate business outcomes.
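As a concrete illustration of the modeling duties above, here is a minimal sketch of training and comparing two of the classifiers named in the list. It uses scikit-learn on synthetic data; both the library choice and the dataset are illustrative assumptions, not part of the original article.

# Minimal sketch: two of the classifiers named above, trained on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a cleaned, verified business dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(n_estimators=100)):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{type(model).__name__}: accuracy = {acc:.3f}")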

The duties of data analysts

to gather and analyze data.

to find essential trends in a dataset.

to carry out SQL data querying.

to experiment with various analytical techniques, including descriptive, diagnostic, prescriptive, and predictive analytics.

to present the gathered data using data visualization tools like Tableau, IBM Cognos Analytics, etc. (a minimal Python sketch of this workflow follows this list).
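Here is a minimal Python sketch of that gather-analyze-visualize loop, assuming a hypothetical sales.csv file with month and revenue columns (the file and column names are illustrative, not from the article):

import pandas as pd
import matplotlib.pyplot as plt

sales = pd.read_csv('sales.csv')                    # gather the data
monthly = sales.groupby('month')['revenue'].sum()   # find the essential trend
monthly.plot(kind='bar', title='Revenue by month')  # visualize it
plt.show()

In practice an analyst might do the same aggregation in SQL and hand the result to Tableau or Power BI; the steps are the same.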

Data scientists need to be experts in programming (Python, R, SQL), predictive modeling, and machine learning, as well as in mathematics and statistics. Data analysts must be knowledgeable in database administration and visualization, data mining, data modeling, data warehousing, and data analysis. Data scientists and analysts alike need a strong sense of logic and good problem-solving skills. The required skill sets are another distinction between data analytics and data science.

A data analyst must be:

well-versed in SQL databases and Excel.

competent with various technologies, including SAS, Tableau, and Power BI.

skilled at programming in R or Python.

adept at visualizing data.

To be a data scientist, one must be:

well-versed in multivariate calculus, linear algebra, probability, and statistics.

proficient in R, Python, Java, Scala, Julia, SQL, and MATLAB programming.

adept at managing databases, handling data, and using machine learning.

knowledgeable about using extensive data systems like Hadoop and Apache Spark.


Data Analytics Vs. Data Science

Data analytics and data science are closely related technologies, yet significant differences exist between them.

Data analytics concentrates on processing and performing statistical analysis of existing datasets to answer specific questions. Data science, in contrast, focuses on the larger picture of data, and involves creating new models and systems to build an overall portrait of a given data universe.

In essence, data science takes a “larger view” than data analytics. But both data methodologies involve interacting with big data repositories to gain important insights.

For more information, also see: What is Big Data Analysis

| | Data Science | Data Analytics |
| --- | --- | --- |
| Scope | Macro | Micro |
| Skills | ML software development; predictive analytics; engineering and programming | BI tools; statistical analysis; data mining; data modeling |
| Goal | To extract knowledge and insights from data | To gain insights and make decisions based on data |
| Popular tools | Python, ML, Tableau, SQL | SQL, Excel, Tableau |

As noted, while data analytics and data science are closely related, they perform separate tasks. Some more detail:

Data analytics analyzes defined data sets to give actionable insights for a company’s business decisions. The process extracts, organizes, and analyzes data to transform raw data into actionable information. Once the data is analyzed, professionals can find suggestions and recommendations for a company’s next steps.

Data analytics is a form of business intelligence that helps companies remain competitive in today’s data-driven market sectors.

For more on data analytics: Best Data Analysis Methods

Data science is the process of assembling data stores, conceptualizing data frameworks, and building all-encompassing models to drive the deep analysis of data.

Data science uses technologies that include statistics, machine learning, and artificial intelligence to build models from huge data sets. It helps businesses answer deeper questions about trends and data flow, often allowing a company to make business forecasts with the results.

Given the complexity of data science, it’s no surprise that the technology and tools that drive this process are constantly – and rapidly – evolving, as they are with data analytics.

For more on data science: Data Science Market Trends

Both data analytics and data science are essential disciplines for companies seeking to find maximum benefit from their data repositories. Among the benefits:

Streamline operations: Data analytics can gather and analyze a company’s data to find where current production is slowing, and can improve efficiency by helping the company predict future delays.

Mitigate risks: Data analytics can help companies see and understand their risks, and can help them take preventative measures as well.

Discover unknown patterns: Data science can find overall patterns within a company’s collection of data that can potentially benefit it. Analyzing these larger, systemic patterns can help a business understand its workflow better, which can support major business changes.

Company innovation: With data science, a company can find foundational problems that it previously did not fully recognize. This deep insight may benefit the company at several different levels of operation.

Real-time optimization: The larger vision offered by data science enables businesses to react to change quickly; an overall systemic view offers great guidance.

For more information: Data Science & Analytics Predictions, Trends, & Forecasts

Both disciplines also come with challenges. On the data analytics side:

Lack of communication within teams: Team members and executives may control their data yet lack the expertise to extract granular insight from it. Without a data analyst, a company could miss information from different teams.

Low quality of data: A company’s decisions can be negatively affected if low-quality or incompletely prepared data enters the process.

Privacy concerns: Similar to data science, there are problems with privacy while using data analytics. If a company or professional does not govern sensitive information in a compliant manner, the data can be compromised.

On the data science side:

Domain knowledge required: Using data science requires a company or staffer to maintain significant knowledge of the field as it grows and changes, which means that companies must allot budget for hiring and training qualified professionals.

Unexpected results: Occasionally, data science processes cannot incorporate or mine data that is considered "arbitrary," meaning data that is not recognized by the system for any reason. Because a data scientist may not know which data is recognized, data problems can go under the radar.

Data privacy: As with data analytics, if data is handled without careful standards, large datasets become more susceptible to cybersecurity and privacy problems.

Companies need to select the optimum tools to use data analytics and data science most effectively. See below for examples of some leading tools:

Here are the top six data analytics tools and what they can do for a business:

Tableau: Collects and combines multiple data inputs and offers a dashboard display with visual data mining.

Microsoft Power BI: AI and ML functionality powering augmented analytics and image analytics.

Qlik: AI and ML features, approachable deep-data capabilities, and data mining.

ThoughtSpot: Search-based query interface and augmented analytics, ranging from comparative analysis to anomaly detection.

Sisense: Cloud-native infrastructure, great scalability, container technology, caching engine, and augmented data prep features.

TIBCO: Streaming analytics, data mining, augmented analytics, and natural language user interface.

Here are the top six data science tools and what they can do for a business:

When researching which data analytics and data sciences tools to buy, it is important to understand that data analytics and data science work in combination with one another – meaning that more than one software tool may be needed to create the optimum data strategy.

In some cases this means buying both data solutions from one vendor, but this isn’t necessary. It also works to buy “best of breed” from two different – competing – vendors. Just make sure to do an extensive trial run with both applications working in concert, to ensure that the combination creates the ideal result.

Data science and data analytics are separate disciplines, but both are crucially important to businesses.

For businesses looking to increase their understanding of data and how it can help their organizations, data analytics and data science play contrasting yet complementary roles. They are different, but they are both essential.

Therefore, businesses must understand the differing roles of data analytics and data science, and be prepared to select tools for each discipline that work well in combination.

Structured Vs. Unstructured Data: Key Differences Explained

Structured data consists of clearly defined data types with patterns that make them easily searchable, while unstructured data—“everything else”—is composed of data that is usually not as easily searchable, including formats like audio, video, and social media postings.

Structured data analytics is a mature process and technology, whereas unstructured data analytics is a nascent industry with a lot of new investment in research and development.

For corporations, the structured versus unstructured data issue comes down to deciding whether to invest in analytics for unstructured data, and determining whether the two can be aggregated into better business intelligence.

For more information, also see: Data Management Platforms

| Structured Data | Unstructured Data |
| --- | --- |
| Organized information | Diverse, loosely structured information |
| Quantitative | Qualitative |
| Requires less storage | Requires more storage |
| Not flexible | Flexible |
| Example: ID codes in databases | Example: videos and images |

Structured data usually resides in relational databases (RDBs). Fields store fixed-format data like phone numbers, Social Security numbers, or ZIP codes, and records can also contain variable-length text strings like names, making the data a simple matter to search.

Data may be human- or machine-generated, as long as the data is created within an RDB structure. This format is eminently searchable, both with human-generated queries and via algorithms using types of data and field names, such as alphabetical or numeric, currency, or date.

Common relational database applications with structured data include airline reservation systems, inventory control, sales transactions, and ATM activity. Structured Query Language (SQL) enables queries on this type of structured data within relational databases.
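As a small, self-contained illustration of such a query, here is a sketch using Python's built-in sqlite3 module; the reservations table and its rows are hypothetical:

import sqlite3

# Structured data: fixed, typed fields in a relational table.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE reservations (name TEXT, flight TEXT, seats INTEGER)')
conn.executemany('INSERT INTO reservations VALUES (?, ?, ?)',
                 [('Ada', 'BA117', 2), ('Grace', 'UA90', 1)])
for row in conn.execute('SELECT name, seats FROM reservations WHERE flight = ?', ('BA117',)):
    print(row)  # ('Ada', 2)
conn.close()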

Some relational databases store or point to unstructured data, as in customer relationship management (CRM) applications. The integration can be awkward at best, since memo fields do not lend themselves to traditional database queries. Still, most CRM data is structured.

For more information, also see: Top Data Warehouse Tools

Because structured data is organized, it is commonly stored in data warehouses, where it can be accessed easily by the businesses that choose to use it.

Structured data is organized, making it easy for a company to find exactly what it is looking for. With this method, a company can begin using the data instantly.

Due to the organization style of structured data, it offers less flexibility and fewer varied use cases.

Structured data is stored in specific spaces within data warehouses. While accessing the data is easy, scaling can be difficult, and changes within data warehouses can become hard to manage. Using cloud data centers helps with these storage problems.

Data centers or other storage for structured data can become expensive, adding to the burden of managing structured data. Again, cloud storage is recommended, but significant work may still be required to keep the data properly maintained.

Common examples of structured data include:

ZIP codes

Phone numbers

Email addresses

ATM activity

Inventory control

Student fee payment databases

Airline reservation and ticketing

Popular tools for working with structured data include:

Google’s Structured Data Testing Tool

Yandex Structured Data Validator

Markle’s Schema Markup Generator

SEO SiteCheckup

Bing Markup Validator

Google Email Markup Tester

RDF Translator

JSON-LD Playground

Schema Markup Generator

Microdata Tool

For more information, also see: What is Big Data Analysis

Unstructured data is essentially everything else. Unstructured data has an internal structure but is not structured via predefined data models or schema. It may be textual or non-textual and human- or machine-generated. It may also be stored within a non-relational database like NoSQL.
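To make "internal structure without a predefined data model" concrete, here is a short sketch that recovers structure from a raw email using Python's standard library; the message itself is invented:

from email import message_from_string

# A raw email is unstructured text, yet its headers and body can be
# recovered without any relational schema.
raw = "From: ada@example.com\nSubject: Quarterly numbers\n\nRevenue was up this quarter."
msg = message_from_string(raw)
print(msg['From'])        # ada@example.com
print(msg['Subject'])     # Quarterly numbers
print(msg.get_payload())  # the body text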

Typical human-generated unstructured data includes:

Email: Message field

Social Media: Data from Facebook, Twitter, and LinkedIn.

Websites: YouTube, Instagram, and photo sharing sites.

Mobile Data: Text messages and locations.

Communications: Chat, IM, phone recordings, and collaboration software.

Media: MP3, digital photos, and audio and video files.

Business Applications: Microsoft Office documents and productivity applications.

Typical machine-generated unstructured data includes:

Satellite Imagery: Weather data, landforms, and military movements.

Scientific Data: Oil and gas exploration, space exploration, seismic imagery, and atmospheric data.

Digital Surveillance: Surveillance photos and video.

Sensor Data: Traffic, weather, and oceanographic sensors.

Use cases for unstructured data are significantly more numerous than for structured data due to its flexibility. From social media posts to scientific data, unstructured data gives companies the flexibility to use the data how they want.

When a company has more unstructured data than structured data, there is more data to work with. Unstructured data may be difficult to analyze, but with processing, a company can benefit from it.

Because unstructured data can be stored in data lakes, a business can save money on how it chooses to store the data.

Despite its flexibility, unstructured data is more difficult to analyze in its raw form.

Unstructured data cannot easily be managed by standard business tools. Its inconsistent nature makes it more difficult to handle than structured data.

Unstructured data comes in many different forms, such as medical records, social media posts, and emails, and this variety can make analysis challenging.

Common examples of unstructured data include:

Text files

Email

Social media

Website

Mobile data

Communications

Media

Business applications

Satellite imagery

Scientific data

Digital surveillance

Sensor data

Popular tools for working with unstructured data include:

MonkeyLearn

Microsoft Excel

Google Sheets

RapidMiner

KNIME

Power BI

Tableau

MongoDB Charts

Apache Hadoop

Apache Spark

For more information, also see: Top Data Analytics Tools 

Semi-structured data maintains internal tags and markings that identify separate data elements, which enables data analysts to determine information grouping and hierarchies. Both documents and databases can be semi-structured. This type of data represents only about 5–10% of the data pie, but it has critical business use cases when combined with structured and unstructured data.

Email is a huge use case, but most semi-structured development centers on easing data transport issues. Sharing sensor data is a growing use case, as are web-based data sharing and transport, including electronic data interchange (EDI), many social media platforms, document markup languages, and NoSQL databases.
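A quick sketch of what those internal tags look like in practice, using a JSON record (the record is hypothetical); the tags name each element, so grouping and hierarchy can be inferred without a rigid table schema:

import json

record = '{"sensor": "temp-01", "readings": [21.5, 21.7, 22.0], "unit": "C"}'
data = json.loads(record)  # tags become dictionary keys
print(data['sensor'], max(data['readings']), data['unit'])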

New tools are available to analyze unstructured data, particularly for specific use cases. Most of these tools are based on machine learning. Structured data analytics can use machine learning as well, but the massive volume and wide variety of unstructured data make it a necessity.

A few years ago, analysts using keywords and key phrases could search unstructured data and get a decent idea of what the data involved. E-discovery was and is a prime example of this approach. However, unstructured data has grown so dramatically that users need to employ analytics that not only work at compute speeds but also automatically learn from their activity and user decisions.

Natural language processing (NLP), pattern sensing and classification, and text-mining algorithms are all common examples, as are document relevance analytics, sentiment analysis, and filter-driven web harvesting.

Unstructured data analytics with machine-learning intelligence allows organizations to do the following:

Failed compliance can cost companies millions of dollars in fees, litigation, and lost business. Pattern recognition and email threading analysis software search massive amounts of email and chat data for potential noncompliance.

A recent example in this area is Volkswagen, which might have avoided huge fines and reputational hits by using analytics to monitor communications for suspicious messages.
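A toy version of such a communications scan, assuming a handful of invented messages and a hand-picked pattern list; real compliance tools rely on trained models and email-threading analysis rather than simple regular expressions:

import re

messages = [
    "Let's keep this off the record",
    "Attached is the Q3 report",
    "Delete these emails before the audit",
]
# Naive noncompliance pattern; a stand-in for far more sophisticated models.
pattern = re.compile(r"off the record|delete these emails", re.IGNORECASE)
flagged = [m for m in messages if pattern.search(m)]
print(flagged)  # the two suspicious messages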

Text analytics and sentiment analysis let analysts review the positive and negative results of marketing campaigns, or even identify online threats. This level of analytics is far more sophisticated than simple keyword search, which can only report basics, like how often posters mention the company name during a new campaign.

New analytics also include context:

Was the mention positive or negative?

Were posters reacting to each other?

What was the tone of reactions to executive announcements?

The automotive industry, for example, is heavily involved in analyzing social media, since car buyers often turn to other posters to guide their car buying experience. Analysts use a combination of text mining and sentiment analysis to track auto-related user posts on Twitter and Facebook.
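One possible sketch of such a pipeline, scoring invented posts with NLTK's VADER sentiment analyzer; the posts and the library choice are assumptions, since the article does not name a specific tool:

# Requires: pip install nltk, plus a one-time nltk.download('vader_lexicon').
from nltk.sentiment import SentimentIntensityAnalyzer

posts = [
    "Loving the new model's mileage!",
    "Service center wait times are awful.",
]
sia = SentimentIntensityAnalyzer()
for post in posts:
    score = sia.polarity_scores(post)['compound']  # -1 (negative) to +1 (positive)
    print(f"{score:+.2f}  {post}")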

Machine learning analytics tools quickly work on massive amounts of documents to analyze customer behavior.

A major magazine publisher applied text mining to hundreds of thousands of articles, analyzing each separate publication by the popularity of major subtopics. Then, it extended analytics across all of its content properties to see which overall topics got the most attention by customer demographic.

The analytics ran across hundreds of thousands of pieces of content across all publications, and cross-referenced hot topic results by segments. The result was a rich education on which topics were most interesting to distinct customers, and which marketing messages resonated most strongly with them.

For more information, also see: The Data Analytics Job Market 

Aside from being stored in a relational database versus stored outside of one, the biggest difference between structured and unstructured data is the ease of analysis. Mature analytics tools exist for structured data, but analytics tools for mining unstructured data are nascent and developing.

The “versus” in unstructured data versus structured data does not denote conflict between the two. Customers select one or the other not based on their data structure, but on the applications that use them: relational databases for structured data and most any other type of application for unstructured data.

Data Science Vs Data Visualization

Differences Between Data Science and Data Visualization

Data science is the art of interpreting data and extracting useful information from it, whereas data visualization is the representation of that data. The two cannot be considered completely separate entities; they are bound together, with data visualization being a subset of data science. The differences between them therefore come down to application, tools, process, required skills, and significance.


The best everyday example of data science is Amazon’s shopping recommendations. The machine learns about a user’s web activity, then interprets and manipulates that data to give the best recommendations based on the user’s interests and shopping choices. To build these recommendations, data scientists represent (visualize) the user’s web activity and analyze it to provide the best choices for the user, and this is where data visualization comes into the picture.

Data science and data visualization are not two different entities; they are bound to each other. Data science is not a single process, method, or workflow; it is the combined effect of many smaller pieces dealing with the data, be it data mining techniques, EDA, modeling, or representation.

Example: an incident or story from daily life could be conveyed as a speech, but when it is represented visually, its real value is established and understood.

Visualization is not only for presenting final outcomes; it also applies to understanding the raw data. It is always better to represent the data in order to gain insight into how to solve the problem or extract meaningful information that influences the system.

Let’s say we want to predict iPhone sales for the year 2023.

How exactly can one predict future sales? What are the prerequisites, how confident is the prediction, and what is the error rate? All of these questions are answered and justified using data science.

Key factors include recent changes in the organization, recent market value, and customer reviews of past sales.

How can one get more insight from the historical data? The best way is to visualize it.

Data visualization plays a key role in two stages:

One, the initial phase of analytics: represent the available data and decide which attributes and parameters to use to build a predictive model. This stimulates the data scientist to approach the solution in various ways. In our example, the historical data representation determines which historical years are best suited for analysis, and this is decided based on the visualization.

Two, the outcome: the prediction results for the year 2023 have to be represented in a way that reaches the world, for example a comparison between iPhone and Google Pixel sales for the upcoming years. This leads to better decision-making for organizations.

Back to the iPhone analysis: the historical data has to be analyzed to pick the attributes that have a significant impact on the prediction (such as sales by location, season, or age group). Then choose suitable algorithms (linear regression, decision trees, and support vector machines, to mention a few), train the model on the historical data, and generate the prediction for the upcoming year. This is a high-level picture of the processes involved in data science.
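To make that picture concrete, here is a minimal sketch fitting one of the simpler models just mentioned, linear regression, to invented yearly sales figures; the numbers are purely illustrative, and a real forecast would use far richer features (location, season, reviews):

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical unit sales (in millions) by year.
years = np.array([[2018], [2019], [2020], [2021], [2022]])
sales = np.array([100, 110, 105, 120, 125])

model = LinearRegression().fit(years, sales)
pred = model.predict([[2023]])[0]
print(f"Predicted 2023 sales: {pred:.0f}M units")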

Head-to-Head Comparison Between Data Science and Data Visualization

Below are the top seven points of comparison between Data Science and Data Visualization:

Key Differences Between Data Science and Data Visualization

Data science comprises multiple statistical solutions for solving a problem, whereas visualization is a technique that data scientists use to analyze data and represent it at the endpoint.

Data science is about algorithms that train the machine (automation: the machine simulates a human in order to cut down on manual processes; it is about observation and interpretation of activity). Data visualization is about graphs, plotting, and choosing the best representation.

Data Science and Data Visualization Comparison Table

Below is a list of points that describe the comparison between Data Science and Data Visualization:

| Basis for comparison | Data Science | Data Visualization |
| --- | --- | --- |
| Concept | Insights about the data: explanation, prediction, facts | Representation of the data (be it the source or the results) |
| Application / use cases | Next World Cup prediction, automated cars | Organization metrics |
| Who does this? | Data scientists, data analysts, mathematicians | Data scientists, UI/UX designers |
| Tools | Python, MATLAB, R (to mention a few) | Tableau, SAS, Power BI, d3.js (to mention a few); Python and R have plotting libraries as well |
| Process | Data harvesting, data mining, data munging, data cleansing, modeling, measurement | Representing the data as charts or graphs |
| Significance | Many organizations rely on data science results for decision-making | Helps data scientists understand the data source and how to solve the problem or provide recommendations |
| Skills | Statistics, algorithms | Data analysis and plotting techniques |

Conclusion

There are many perspectives on data science. Put simply, it is about how to solve a problem in various cases, be it prediction, categorization, recommendation, or sentiment analysis. In a nutshell, all of these can be accomplished through a statistical way of problem-solving that combines machine learning, deep learning, neural networks, NLP, data munging, and more.

Data visualization adds a key ingredient to solving these problems. It is, in layman’s terms, the photograph for your script.


Exploratory Data Analysis And Visualization Techniques In Data Science


Exploratory Data Analysis (EDA) is the process of describing data by means of statistical and visualization techniques in order to bring its important aspects into focus for further analysis. This involves inspecting the dataset from many angles, describing and summarizing it without making any assumptions about its contents.

“Exploratory data analysis is an attitude, a state of flexibility, a willingness to look for those things that we believe are not there, as well as those we believe to be there” – John W. Tukey

Exploratory data analysis is a significant step to take before diving into statistical modeling or machine learning, to ensure the data is really what it is claimed to be and that there are no obvious errors. It should be part of data science projects in every organization.


Why is Exploratory Data Analysis important?

Just like everything in this world, data has its imperfections. Raw data is usually skewed, may have outliers, or may have too many missing values. A model built on such data results in sub-optimal performance. In a hurry to get to the machine learning stage, some data professionals either skip the exploratory data analysis process entirely or do a very mediocre job. This is a mistake with many implications: generating inaccurate models, generating accurate models on the wrong data, failing to create the right types of variables in data preparation, and using resources inefficiently.

In this article, we’ll be using Pandas, Seaborn, and Matplotlib libraries of Python to demonstrate various EDA techniques applied to Haberman’s Breast Cancer Survival Dataset.

Dataset description and attribute information:

Patient’s age at the time of operation (numerical).

Year of operation (year — 1900, numerical).

Number of positive axillary nodes detected (numerical).

Survival status (class attribute): 1 = the patient survived 5 years or longer; 2 = the patient died within 5 years.

Attributes 1, 2, and 3 form our features (independent variables), while attribute 4 is our class label (dependent variable).

Let’s begin our analysis . . .

1. Importing libraries and loading data

Import all necessary packages —

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

Load the dataset in pandas dataframe —

df = pd.read_csv('haberman.csv', header = 0)
df.columns = ['patient_age', 'operation_year', 'positive_axillary_nodes', 'survival_status']

2. Understanding data

df.head()

Output:

Shape of the dataframe —

df.shape

There are 305 rows and 4 columns. But how many data points for each class label are present in our dataset?

df['survival_status'].value_counts()

Output:

The dataset is imbalanced as expected.

Out of a total of 305 patients, the number of patients who survived over 5 years post-operation is nearly 3 times the number of patients who died within 5 years.

df.info()

Output:

All the columns are of integer type.

No missing values in the dataset.

2.1 Data preparation

Before we proceed to statistical analysis and visualization, note that the original class labels, 1 (survived 5 years or longer) and 2 (died within 5 years), are not descriptive, so we map them to 'yes' and 'no'.

df['survival_status'] = df['survival_status'].map({1: "yes", 2: "no"})

2.2 General statistical analysis

df.describe()

Output:

On average, patients were operated on at around 52 years of age.

On average, 4 positive axillary nodes were detected per patient.

As indicated by the 50th percentile, the median of positive axillary nodes is 1.

As indicated by the 75th percentile, 75% of the patients have 4 or fewer positive nodes detected.

If you look closely, there is a significant difference between the mean and the median of the positive axillary nodes. This is because there are outliers in our data, and the mean is influenced by their presence.
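A quick illustration of that effect on made-up numbers:

import numpy as np

x = np.array([1, 1, 2, 3, 40])   # one outlier
print(np.mean(x), np.median(x))  # 9.4 vs 2.0: the mean is pulled toward the outlier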

2.3 Class-wise statistical analysis

survival_yes = df[df['survival_status'] == 'yes']
survival_yes.describe()

Output:

survival_no = df[df['survival_status'] == 'no']
survival_no.describe()

Output:

From the above class-wise analysis, it can be observed that —

The average age at which the patient is operated on is nearly the same in both cases.

Patients who died within 5 years on average had about 4 to 5 positive axillary nodes more than the patients who lived over 5 years post-operation.

Note that, all these observations are solely based on the data at hand.

3. Uni-variate data analysis

3.1 Distribution Plots

Uni-variate analysis, as the name suggests, is an analysis carried out by considering one variable at a time. Let's say our aim is to correctly determine the survival status given the features: patient's age, operation year, and positive axillary node count. Which of these three variables is most useful for distinguishing between the class labels 'yes' and 'no'? To answer this, we'll plot distribution plots (also called probability density function, or PDF, plots) with each feature on the X-axis. The values on the Y-axis represent the normalized density.

1. Patient’s age

sns.FacetGrid(df, hue = "survival_status").map(sns.distplot, "patient_age").add_legend()
plt.show()

Output:

Among all age groups, patients aged 40-60 are the most numerous.

There is a high overlap between the class labels. This implies that the survival status of the patient post-operation cannot be discerned from the patient’s age.

2. Operation year

sns.FacetGrid(df, hue = "survival_status").map(sns.distplot, "operation_year").add_legend()
plt.show()

Output:

Just like the above plot, here too, there is a huge overlap between the class labels suggesting that one cannot make any distinctive conclusion regarding the survival status based solely on the operation year.

3. Number of positive axillary nodes

sns.FacetGrid(df, hue = "survival_status").map(sns.distplot, "positive_axillary_nodes").add_legend()
plt.show()

Output:

This plot looks interesting! Although there is a good amount of overlap, here we can make some distinctive observations –

Patients having 4 or fewer axillary nodes — A very good majority of these patients have survived 5 years or longer.

Patients having more than 4 axillary nodes — the likelihood of survival is found to be less as compared to the patients having 4 or fewer axillary nodes.

But our observations must be backed by some quantitative measure. That's where Cumulative Distribution Function (CDF) plots come into the picture.

The area under the plot of PDF over an interval represents the probability of occurrence of the random variable in the given interval. Mathematically, CDF is an integral of PDF over the range of values that a continuous random variable takes. CDF of a random variable at any point ‘x’ gives the probability that a random variable will take a value less than or equal to ‘x’.

counts, bin_edges = np.histogram(survival_yes['positive_axillary_nodes'], density = True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label = 'CDF Survival status = Yes')

counts, bin_edges = np.histogram(survival_no['positive_axillary_nodes'], density = True)
pdf = counts/sum(counts)
cdf = np.cumsum(pdf)
plt.plot(bin_edges[1:], cdf, label = 'CDF Survival status = No')
plt.legend()
plt.xlabel("positive_axillary_nodes")
plt.grid()
plt.show()

Output:

Some of the observations that could be made from the CDF plot —

Patients having 4 or fewer positive axillary nodes have about 85% chance of survival for 5 years or longer post-operation, whereas this number is less for the patients having more than 4 positive axillary nodes. This gap diminishes as the number of axillary nodes increases.
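That figure can also be checked directly from the dataframe rather than read off the CDF plot:

# Fraction of survivors among patients with 4 or fewer positive axillary nodes;
# the 'yes' share should come out at roughly 0.85.
low_nodes = df[df['positive_axillary_nodes'] <= 4]
print(low_nodes['survival_status'].value_counts(normalize=True))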

3.2 Box plots and Violin plots

Box plot, also known as box and whisker plot, displays a summary of data in five numbers: minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum data values.

A violin plot displays the same information as the box and whisker plot; additionally, it also shows the density-smoothed plot of the underlying distribution.

Let’s make the box plots for our feature variables –

plt.figure(figsize = (15, 4))
plt.subplot(1,3,1)
sns.boxplot(x = 'survival_status', y = 'patient_age', data = df)
plt.subplot(1,3,2)
sns.boxplot(x = 'survival_status', y = 'operation_year', data = df)
plt.subplot(1,3,3)
sns.boxplot(x = 'survival_status', y = 'positive_axillary_nodes', data = df)
plt.show()

Output:

The patient age and the operation year plots show similar statistics.

The isolated points seen in the box plot of positive axillary nodes are the outliers in the data. Such a high number of outliers is kind of expected in medical datasets.

Violin Plots –

plt.figure(figsize = (15, 4))
plt.subplot(1,3,1)
sns.violinplot(x = 'survival_status', y = 'patient_age', data = df)
plt.subplot(1,3,2)
sns.violinplot(x = 'survival_status', y = 'operation_year', data = df)
plt.subplot(1,3,3)
sns.violinplot(x = 'survival_status', y = 'positive_axillary_nodes', data = df)
plt.show()

Output:

Violin plots in general are more informative as compared to the box plots as violin plots also represent the underlying distribution of the data in addition to the statistical summary. In the violin plot of positive axillary nodes, it is observed that the distribution is highly skewed for class label = ‘yes’, while it is moderately skewed for ‘no’. This indicates that –

For the majority of patients (in both the classes), the number of positive axillary nodes detected is on the lesser side. Of which, patients having 4 or fewer positive axillary nodes are more likely to survive 5 years post-operation.

These observations are consistent with our observations from previous sections.

4. Bi-variate data analysis

4.1 Pair plot

Next, we shall plot a pair plot to visualize the relationship between the features in a pairwise manner. A pair plot enables us to visualize both distributions of single variables as well as the relationship between pairs of variables.

sns.set_style('whitegrid')
sns.pairplot(df, hue = 'survival_status')
plt.show()

Output:

In the pair plot, the plots on the upper and lower halves of the diagonal are the same, only with the axes interchanged, so they essentially convey the same information; analyzing either half would suffice. The plots on the diagonal differ from the rest: these are kernel-density-smoothed histograms representing the univariate distribution of each feature.

As we can observe in the above pair plot, there is a high overlap between any two features and hence no clear distinction can be made between the class labels based on the feature pairs.

4.2 Joint plot

While the Pair plot provides a visual insight into all possible correlations, the Joint plot provides bivariate plots with univariate marginal distributions.

sns.jointplot(x = 'patient_age', y = 'positive_axillary_nodes', data = df)
plt.show()

Output:

The pair plot and the joint plot reveal that there is no correlation between the patient’s age and the number of positive axillary nodes detected.

The histogram on the top edge indicates that patients are more likely to be operated on between the ages of 40 and 60 than in other age groups.

The histogram on the right edge indicates that the majority of patients had fewer than 4 positive axillary nodes.

4.3 Heatmap

Heatmaps are used to observe correlations among the feature variables. This is particularly important when trying to determine feature importance in regression analysis. Although correlated features may not impact the predictive performance of the model, they can complicate post-modeling analysis.

Let’s see if there exist any correlation among our features by plotting a heatmap.

sns.heatmap(df.corr(), cmap = 'YlGnBu', annot = True)
plt.show()

Output:

The values in the cells are Pearson’s R values which indicate the correlation among the feature variables. As we can see, these values are nearly 0 for any pair, so no correlation exists among any pair of variables.

5. Multivariate analysis with Contour plot

A contour plot is a graphical technique for representing a 3-dimensional surface by plotting constant z slices, called contours, in a 2-dimensional format. It enables us to consolidate information from the third dimension into a flat 2-D chart.

Plotting a contour plot using the seaborn library, with patient's age on the x-axis and operation year on the y-axis:

sns.jointplot(x = 'patient_age', y = 'operation_year', data = df, kind = 'kde', fill = True)
plt.show()

Output:

From the above contour plot,  it can be observed that the years 1959–1964 witnessed more patients in the age group of 45–55 years.

Conclusion

In this article, we learned some common steps involved in exploratory data analysis. We also saw several types of charts and plots and what information each of them conveys. And this is not all: I encourage you to play with the data, come up with different kinds of visualizations, and observe what insights you can extract from it.

About me

Hi, I am Pratik Nabriya, a Data Scientist currently employed with an Analytics & AI firm based in Noida. My key skills include machine learning, deep learning, NLP, time-series analysis, and SQL, and I am familiar with working in cloud environments. I love to write blogs and articles in my spare time and share my learnings with fellow data professionals.



Top 10 Data Science Platforms That Cash The Analytics Code

Data science platforms are must-have tools for any business enterprise that aspires to scale up its frontiers. A data science platform is essentially a software hub around which all data science functionality, such as data exploration, integration from various sources, coding, and model building, is performed. Data science platforms are programmed to train and test models and to deploy the results to solve real-life business problems.

Data science platforms are a massive hit, driving business revenues to new heights. This can be seen in the fact that the global data science platform market is expected to grow at a CAGR of around 39.2% over the next decade, reaching approximately $385.2 billion by 2025. With such varied data science platforms available, one question is often asked and debated: which are the top data science platforms that let you use the best tools for the job at hand?

According to Burtch Works, a leading data science and analytics recruitment agency, 62% of analytics professionals prefer to code in R or Python over the legacy solution SAS. Choosing among open-source solutions like Jupyter and RStudio and closed platforms that rely on proprietary solutions can be a daunting task; business enterprises should rely on the data science platforms that best serve their needs and allow them to use packages and languages as per their requirements. Here are the top data science platforms that are most used and liked in the business world; in short, these are the data science platforms where most of the analytics code is written.

Alteryx is a computer software company headquartered in Irvine, California. Alteryx Analytics offers business intelligence and predictive analytics products used for data science and analytics. It is a closed platform, and pricing varies from $3,995 per user, per year (for a 3-year subscription to Alteryx Designer) to $5,194 per user, per year (for a 1-year subscription to Alteryx Designer). Another offering, the cloud-based Alteryx Analytics Gallery, costs $1,950 per year, per user under a one-year contract and $1,500 per year, per user under a three-year contract. Alteryx technology partners include Tableau, Microsoft, Amazon Web Services, and Qlik (provider of QlikView and Qlik Sense). Alteryx Analytics is deployed by popular names including Johnson & Johnson, Hyatt, Unilever, and Audi, among others.

TIBCO Statistica is increasingly relied upon by business enterprises to solve complex problems. The platform lets users create innovative models with the latest deep learning, predictive, prescriptive, AI, and analytical techniques. Its capabilities include comprehensive analytics algorithms, including regression, clustering, decision trees, neural networks, and machine learning, all accessible through built-in nodes. TIBCO Statistica offers data access through Apache Hadoop databases and data preparation via an automated data health check node. Users can apply reusable analytic workflow templates and integrate open-source R, Python, C#, and Scala scripts to upgrade analytic workflows. While TIBCO Statistica for Windows comes with a free 30-day trial, the Analyst, Modeler, and Data Scientist servers come with a price tag.

With over six million users worldwide, Anaconda is a free and open-source distribution of the Python and R programming languages. Anaconda products include Anaconda Distribution and Anaconda Enterprise. While Anaconda Distribution helps users install and manage packages, dependencies, and environments for 1,400+ Python/R data science packages, Anaconda Enterprise helps business enterprises harness data science, machine learning, and artificial intelligence capabilities through model development, training, and deployment. Anaconda is used extensively by National Grid (a British multinational electricity and gas utility company) to reduce maintenance costs and improve the safety and reliability of its electric transmission assets.

The Databricks Unified Analytics Platform was developed by the creators of Apache Spark. The Databricks workspace gives users a platform to manage the entire analytic process, from ETL to model training and deployment, through shared notebooks, simplified production jobs, and ecosystem integration. The platform prepares clean data in real time, ready for training ML models for AI applications. Databricks is available for a 14-day free trial; for Databricks Basic, Databricks Data Engineering, and Databricks Data Analytics, users pay per Databricks Unit (DBU) based on the workloads they run.
