Stock Price Analysis With Python
Stock price analysis with Python helps investors understand the risk of investing in the stock market. A company’s stock price reflects its valuation and performance, which in turn influence demand and supply in the market. Technical analysis of stocks is a vast field, and this article provides only an overview of it. Analyzing stock prices with Python can help investors decide when to buy or sell a stock. The article is meant as a starting point for investors who want to analyze the stock market and understand its volatility. So, let’s dive into stock price analysis with Python.
Libraries Used in Stock Price Analysis With Python
The following libraries must be installed beforehand; they can be installed with pip. Each library and its application is listed below.
Yahoo Finance (yfinance): To download stock data
Pandas: To handle data frames in Python
Numpy: Numerical Python
Matplotlib: Plotting graphs
import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
!pip install yfinance
import yfinance as yf
%matplotlib inline
Data Description
We downloaded daily stock price data using the Yahoo Finance API. The data covers the date range set in the download call further down and captures Open, High, Low, Close, and Volume.
Open: The price of the stock when the market opens in the morning
Close: The price of the stock when the market closes in the evening
High: Highest price the stock reached during that day
Low: Lowest price at which the stock traded on that day
Volume: The total number of shares traded on that day
Here, we take three companies as an example: TCS, Infosys, and Wipro, which are industry leaders in IT services.
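start = "2014-01-01"
end = "2023-01-01"
# Caution: unlike 'WIPRO.NS', the symbols 'TCS' and 'INFY' carry no '.NS' suffix,
# so they may not refer to the NSE listings (on NSE these would be 'TCS.NS' and 'INFY.NS').
# The symbols are kept as in the original so that the discussion of the graphs still applies.
tcs = yf.download('TCS', start=start, end=end)
infy = yf.download('INFY', start=start, end=end)
wipro = yf.download('WIPRO.NS', start=start, end=end)
Exploratory Analysis for Stock Price Analysis With Python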
The graph above shows the opening stock prices of the three companies as a line graph drawn with the matplotlib library. It shows that Wipro's price is higher than that of the other two companies. However, we are interested not in the absolute prices but in how these stocks fluctuate over time.
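Because the absolute price levels differ, one way to compare how the series move over time is to rebase each of them to a common starting value. The sketch below is not part of the original analysis; the function name `rebase` and the sample prices are made up for illustration.

```python
def rebase(prices, base=100.0):
    """Scale a price series so that its first value equals `base`."""
    first = prices[0]
    return [base * p / first for p in prices]

# Hypothetical opening prices for two stocks at very different price levels
stock_a = [400.0, 410.0, 390.0, 420.0]
stock_b = [20.0, 21.0, 19.0, 22.0]

rebased_a = rebase(stock_a)
rebased_b = rebase(stock_b)
print(rebased_a)
print(rebased_b)
```

After rebasing, both series start at 100, so a 5% rise and a 10% rise can be read directly off the last values regardless of the original price level.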
tcs['Volume'].plot(label = 'TCS', figsize = (15,7))
infy['Volume'].plot(label = "Infosys")
wipro['Volume'].plot(label = 'Wipro')
plt.title('Volume of Stock traded')
plt.legend()
The graph shows the volume traded for each company; Infosys shares are traded more than the other two IT stocks.
#Market Capitalisation
tcs['MarktCap'] = tcs['Open'] * tcs['Volume']
infy['MarktCap'] = infy['Open'] * infy['Volume']
wipro['MarktCap'] = wipro['Open'] * wipro['Volume']
tcs['MarktCap'].plot(label = 'TCS', figsize = (15,7))
infy['MarktCap'].plot(label = 'Infosys')
wipro['MarktCap'].plot(label = 'Wipro')
plt.title('Market Cap')
plt.legend()
Volume or stock price alone does not provide a comparison between companies. Here we have plotted Volume * Share price to better compare them. Strictly speaking, this product approximates the value traded on a day rather than market capitalisation (share price times shares outstanding); the column name from the original code is kept. The graph suggests that Wipro is traded on the higher side.
Moving Averages for Stock Price Analysis With Python
Stock prices are highly volatile and change quickly with time. To observe a trend or pattern, we can use the 50-day and 200-day moving averages.
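As a rough sketch of what a rolling mean computes, independent of pandas, the standard-library version below averages the last `window` values at each position. The helper name `rolling_mean` and the sample prices are assumptions made for illustration, and a window of 3 is used only to keep the example short.

```python
def rolling_mean(values, window):
    """Simple moving average; None until a full window is available."""
    result = []
    for i in range(len(values)):
        if i + 1 < window:
            result.append(None)  # pandas would show NaN at these positions
        else:
            chunk = values[i + 1 - window : i + 1]
            result.append(sum(chunk) / window)
    return result

prices = [10, 12, 11, 13, 15, 14]  # made-up opening prices
ma3 = rolling_mean(prices, 3)
print(ma3)
```

A longer window smooths more but reacts later, which is why a short and a long average are usually plotted together.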
tcs['MA50'] = tcs['Open'].rolling(50).mean()
tcs['MA200'] = tcs['Open'].rolling(200).mean()
tcs['Open'].plot(figsize = (15,7))
tcs['MA50'].plot()
tcs['MA200'].plot()
Scatter Plot Matrix
data = pd.concat([tcs['Open'],infy['Open'],wipro['Open']],axis = 1)
data.columns = ['TCSOpen','InfosysOpen','WiproOpen']
scatter_matrix(data, figsize = (8,8), hist_kwds= {'bins':250})
The graph above combines a histogram for each company with a scatter plot for each pair of companies. From the graph, we can see that Wipro stock shows a loose linear correlation with Infosys.
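The visual impression of a linear relationship can be put into a number with the Pearson correlation coefficient. The following is a minimal sketch using only the standard library; the function `pearson` and the three short series are hypothetical and are not taken from the downloaded data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 8.0, 9.8]   # roughly linear in a
c = [5.0, 3.0, 4.0, 1.0, 2.0]   # tends to fall as a rises

print(pearson(a, b))
print(pearson(a, c))
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.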
Percentage Increase in Stock Value
The percentage increase in stock value is the change in the stock's price relative to the previous day. The larger the value, whether positive or negative, the more volatile the stock is.
#Volatility
tcs['returns'] = (tcs['Close']/tcs['Close'].shift(1)) - 1
infy['returns'] = (infy['Close']/infy['Close'].shift(1)) - 1
wipro['returns'] = (wipro['Close']/wipro['Close'].shift(1)) - 1
tcs['returns'].hist(bins = 100, label = 'TCS', alpha = 0.5, figsize = (15,7))
infy['returns'].hist(bins = 100, label = 'Infosys', alpha = 0.5)
wipro['returns'].hist(bins = 100, label = 'Wipro', alpha = 0.5)
plt.legend()
The histogram of daily percentage changes is widest for TCS, which indicates that TCS is the most volatile of the three stocks compared.
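The width of such a histogram can also be summarised as a single number, the standard deviation of the daily returns. The sketch below uses only the standard library; `daily_returns` and both price series are made up for illustration and do not come from the article's data.

```python
import statistics

def daily_returns(closes):
    """Fractional change of each close relative to the previous close."""
    return [today / yesterday - 1 for yesterday, today in zip(closes, closes[1:])]

calm = [100.0, 101.0, 100.5, 101.5, 101.0]    # hypothetical quiet stock
jumpy = [100.0, 108.0, 97.0, 110.0, 99.0]     # hypothetical volatile stock

calm_vol = statistics.stdev(daily_returns(calm))
jumpy_vol = statistics.stdev(daily_returns(jumpy))
print(calm_vol, jumpy_vol)
```

The series with the larger standard deviation corresponds to the wider histogram.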
Conclusion
The above analysis can be used to understand a stock’s short-term and long-term behaviour. A decision support system can be built on it to help choose which stock to pick from an industry, for low risk and low gain or high risk and high gain, depending on the risk appetite of the investor.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Twitter Sentiment Analysis Using Python Programming.
Sentiment analysis is the process of estimating the sentiment of people who give feedback on an event, either through written text or through oral communication. Oral communication first has to be converted to written text so that it can be analysed by a Python program. The sentiment expressed may be positive or negative. By assigning weights to the different words in the text, we calculate a numeric value that gives a mathematical evaluation of the sentiment.
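The idea of weighting words and combining them into one number can be sketched with a tiny hand-written lexicon. This is only an illustration of the principle: the `WEIGHTS` table, the function names and the averaging rule are assumptions, and real libraries such as TextBlob use much larger lexicons and more elaborate rules.

```python
# Hypothetical word weights; positive words score above 0, negative words below 0
WEIGHTS = {
    "good": 1.0, "great": 2.0, "happy": 1.5,
    "bad": -1.0, "terrible": -2.0, "sad": -1.5,
}

def polarity(text):
    """Average weight of the known words in the text; 0.0 if none are known."""
    words = text.lower().split()
    scores = [WEIGHTS[w] for w in words if w in WEIGHTS]
    return sum(scores) / len(scores) if scores else 0.0

def label(text):
    score = polarity(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(label("The service was great"))
print(label("a terrible and sad day"))
print(label("the train leaves at noon"))
```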
Usefulness
Customer Feedback − It is vital for a business to know its customers’ opinion about its products or services. When customer feedback is available as written text on Twitter, we can run sentiment analysis to find out programmatically whether the overall feedback is positive or negative and take corrective action.
Political Campaigns − For political opponents it is vital to know the reaction of the people to whom they are delivering a speech. If feedback from the public can be gathered through online platforms such as social media, we can judge the public's response to a specific speech.
Government Initiatives − When the government implements new schemes from time to time, it can judge the response to a new scheme by taking public opinion. Often the public express their praise or anger through Twitter.
Approach
Below we list the steps required to build the sentiment analysis program in Python.
First we install Tweepy and TextBlob. These modules help us gather the data from Twitter as well as extract and process the text.
Authenticating to Twitter. We need to use the API keys so that the data can be extracted from Twitter.
Then we classify the tweets into positive and negative tweets based on the text in the tweet.
Example
import re
import tweepy
from tweepy import OAuthHandler
from textblob import TextBlob

class Twitter_User(object):
    def __init__(self):
        # Placeholder credentials; replace them with your own API keys
        consumer_key = '1ZG44GWXXXXXXXXXjUIdse'
        consumer_secret = 'M59RI68XXXXXXXXXXXXXXXXV0P1L6l7WWetC'
        access_token = '865439532XXXXXXXXXX9wQbgklJ8LTyo3PhVDtF'
        access_token_secret = 'hbnBOz5XXXXXXXXXXXXXefIUIMrFVoc'
        try:
            self.auth = OAuthHandler(consumer_key, consumer_secret)
            self.auth.set_access_token(access_token, access_token_secret)
            self.api = tweepy.API(self.auth)
        except Exception:
            print("Error: Authentication Failed")

    def pristine_tweet(self, twitter):
        # The body of this method is missing in the source. As an assumption,
        # it strips @mentions, URLs and non-alphanumeric characters before scoring.
        return ' '.join(re.sub(r"(@[A-Za-z0-9_]+)|(\w+://\S+)|([^0-9A-Za-z \t])", " ", twitter).split())

    def Sentiment_Analysis(self, twitter):
        audit = TextBlob(self.pristine_tweet(twitter))
        # set sentiment
        # The first condition is missing in the source; polarity > 0 is assumed.
        if audit.sentiment.polarity > 0:
            return 'positive'
        # Note: as written, a polarity of exactly 0 is labelled 'negative' and a
        # polarity below 0 returns None, which is why the two percentages in the
        # output do not add up to 100. A stricter labelling would test polarity < 0.
        elif audit.sentiment.polarity == 0:
            return 'negative'

    def tweet_analysis(self, query, count = 10):
        twitter_tweets = []
        try:
            # Tweepy 3.x call; in Tweepy 4 the method is search_tweets and the
            # exception class is tweepy.TweepyException
            get_twitter = self.api.search(q = query, count = count)
            for tweets in get_twitter:
                inspect_tweet = {}
                inspect_tweet['text'] = tweets.text
                inspect_tweet['sentiment'] = self.Sentiment_Analysis(tweets.text)
                # Skip exact duplicates such as repeated retweets
                if inspect_tweet not in twitter_tweets:
                    twitter_tweets.append(inspect_tweet)
            return twitter_tweets
        except tweepy.TweepError as e:
            print("Error : " + str(e))

def main():
    api = Twitter_User()
    twitter_tweets = api.tweet_analysis(query = 'Ram Nath Kovind', count = 200)
    Positive_tweets = [tweet for tweet in twitter_tweets if tweet['sentiment'] == 'positive']
    print("Positive tweets percentage: {} %".format(100*len(Positive_tweets)/len(twitter_tweets)))
    Negative_tweets = [tweet for tweet in twitter_tweets if tweet['sentiment'] == 'negative']
    print("Negative tweets percentage: {} %".format(100*len(Negative_tweets)/len(twitter_tweets)))
    print("\n\nPositive_tweets:")
    for tweet in Positive_tweets[:10]:
        print(tweet['text'])
    print("\n\nNegative_tweets:")
    for tweet in Negative_tweets[:10]:
        print(tweet['text'])

if __name__ == "__main__":
    main()
Output
Running the above code gives us the following result −
Positive tweets percentage: 48.78048780487805 %
Negative tweets percentage: 46.34146341463415 %
Positive_tweets:
RT @heartful_ness: "@kanhashantivan presents a model of holistic living. My deep & intimate association with this organisation goes back to…
RT @heartful_ness: Heartfulness Guide @kamleshdaaji welcomes honorable President of India Ram Nath Kovind @rashtrapatibhvn, honorable first…
RT @DrTamilisaiGuv: Very much pleased by the affection shown by our Honourable President Sri Ram Nath Kovind and First Lady madam Savita Ko…
RT @BORN4WIN: Who became the first President of India from dalit community? A) K.R. Narayanan B) V. Venkata Giri C) R. Venkataraman D) Ram…
Negative_tweets:
RT @Keyadas63: What wuld those #empoweredwomen b termed who reach Hon HC at the drop of a hat But Demand #Alimony Maint? @MyNation_net @vaa…
RT @heartful_ness: Thousands of @heartful_ness practitioners meditated with Heartfulness Guide @kamleshdaaji at @kanhashantivan & await the…
RT @TurkeyinDelhi: Ambassador Sakir Ozkan Torunlar attended the Joint Session of Parliament of #India and listened the address of H.E. Shri…
Determining The Market Price Of Old Vehicles Using Python
This article was published as a part of the Data Science Blogathon.
Introduction
OLX Group is a Dutch-domiciled online marketplace that over 300 million people use every month for buying, selling, and exchanging products and services ranging from cars, furniture, and electronics to jobs and services listings.
Scenario
We will attempt to determine the market price for a car that we would like to sell. The details of our car are as follows:
Make and Model – Swift Dzire
Year of Purchase – 2009
Km Driven – 80,000
Current Location – Rajouri Garden
Approach
Our approach to addressing the issue would be as follows:
1. Search for all the listings on the OLX platform for the same make and model of our car.
2. Extract all the relevant information and prepare the data.
3. Use the appropriate variables to build a machine learning model that, based on certain inputs, is able to determine the market price of a car.
4. Input the details of our car to fetch the price that we should put on our listing.
WARNING! Please refer to the robots.txt of the respective website before scraping any data. In case the website does not allow scraping of what you want to extract, please send an email to the web administrator before proceeding.
Stage 1 – Search
We will start with importing the necessary libraries
In order to automatically search for the relevant listing and extract the details, we will use Selenium
import selenium
from selenium import webdriver as wb
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
For basic data wrangling, format conversion and cleaning we will use pandas, numpy, datetime and time
import pandas as pd
import numpy as np
import datetime
import time
from datetime import date as dt
from datetime import timedelta
For building our model, we will use Linear Regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
We first create a variable called ‘item’, to which we assign the name of the item we want to sell, and a variable called ‘location’ for the area we are interested in.
item = 'Swift Dzire'
location = 'Rajouri Garden'
Next, we open the OLX website using the Chrome driver and search for Swift Dzire in the location we are interested in.
Source: Olx.in
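driver = wb.Chrome(r"PATH WHERE CHROMEDRIVER IS SAVED\chromedriver.exe")
# The call that opens the OLX website (driver.get with the site URL) is missing from the source.
# find_element(By.XPATH, ...) is used because newer Selenium releases no longer provide find_element_by_xpath.
location_box = '//*[@id="container"]/header/div/div/div[2]/div/div/div[1]/div/div[1]/input'
search_box = '//*[@id="container"]/header/div/div/div[2]/div/div/div[2]/div/form/fieldset/div/input'
driver.find_element(By.XPATH, location_box).clear()
driver.find_element(By.XPATH, location_box).send_keys(location)
time.sleep(5)
driver.find_element(By.XPATH, search_box).send_keys(item)
time.sleep(5)
while True:
    try:
        # The body of this try block is missing from the source. It presumably waited
        # for the "load more" control and clicked it, so that a TimeoutException ends
        # the loop once no further results can be loaded.
        raise NotImplementedError("restore the missing load-more step here")
    except TimeoutException:
        break
Now that we have loaded all the results, we will extract all the information that we can potentially use to determine the market price. A typical listing looks like this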
Source: OLX
Stage 2 – Data Extraction and Preparation
From this we will extract the following and save the information to an empty dataframe called ‘df’:
1. Maker name
2. Year of purchase
3. Km driven
4. Location
5. Verified Seller or not
6. Price
df = pd.DataFrame()
n = 200
# Common XPath prefix of the i-th listing card
card = '//*[@id="container"]/main/div/div/section/div/div/div[4]/div[2]/div/div[3]/ul/li[{}]/a'
for i in range(1,n):
    base = card.format(i)
    try:
        make = pd.Series(driver.find_element(By.XPATH, base + '/div[1]/div[2]/div[2]').text)
        det = driver.find_element(By.XPATH, base + '/div[1]/div[2]/div[1]').text
        year = pd.Series(det.split(' - ')[0])
        km = pd.Series(det.split(' - ')[1])
        price = pd.Series(driver.find_element(By.XPATH, base + '/div[1]/div[2]/span').text)
        det2 = driver.find_element(By.XPATH, base + '/div[1]/div[2]/div[3]').text
        location = pd.Series(det2.split('\n')[0])
        date = pd.Series(det2.split('\n')[1])
        try:
            verified = pd.Series(driver.find_element(By.XPATH, base + '/div[2]/div/div[1]/div/div/div').text)
        except Exception:
            # No verified badge on this listing
            verified = 0
    except Exception:
        # Skip positions that do not hold a complete listing
        continue
    df_temp = pd.DataFrame({'Car Model':make,'Year of Purchase':year,'Km Driven':km,'Location':location,'Date Posted':date,'Verified':verified,'Price':price})
    # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
    df = pd.concat([df, df_temp], ignore_index=True)
Within the obtained dataframe, we first do some basic data cleaning: we remove the commas from Price and Km Driven and convert them to integers.
df['Price'] = df['Price'].str.replace(",","").str.extract(r'(\d+)', expand=False)
df['Km Driven'] = df['Km Driven'].str.replace(",","").str.extract(r'(\d+)', expand=False)
df['Price'] = df['Price'].astype(float).astype(int)
df['Km Driven'] = df['Km Driven'].astype(float).astype(int)
As you can see in the image above, listings put up on the same day show ‘Today’ instead of a date. Similarly, items listed one day earlier show ‘Yesterday’. For dates shown as ‘4 days ago’ or ‘7 days ago’, we extract the first part of the string, convert it to an integer and subtract that many days from today’s date to get the actual date of posting. We convert such strings into proper dates because our objective is to create a variable called ‘Days Since Posting’ from them.
today = datetime.datetime.now().date()
def date_convert(date_to_convert):
    # Convert a relative or short date label into a date object
    value = str(date_to_convert)
    if value == 'Today':
        return today
    if value == 'Yesterday':
        return today - timedelta(days=1)
    if value.endswith(' days ago'):
        # Each row uses its own day count (the original reused the first row's count for all rows)
        return today - timedelta(days=int(value.split(' ')[0]))
    try:
        # Labels such as 'Mar 05' carry no year; the current year is assumed here
        return datetime.datetime.strptime(value, '%b %d').replace(year=today.year).date()
    except ValueError:
        # Leave values that cannot be parsed unchanged, as the original loop did
        return date_to_convert
df['Date Posted'] = df['Date Posted'].apply(date_convert)
df['Days Since Posting'] = (pd.to_datetime(today) - pd.to_datetime(df['Date Posted'])).dt.days
Once created, we will convert this along with ‘Year of Purchase’ to integers.
df['Year of Purchase'] = df['Year of Purchase'].astype(float).astype(int)
df['Days Since Posting'] = df['Days Since Posting'].astype(float).astype(int)
Further, we will convert the verified seller column into a 0/1 flag
df['Verified'] = np.where(df['Verified']==0,0,1)
Finally, we will get the following dataframe.
The ‘Location‘ variable in its current form cannot be used in our model given that it’s categorical in nature. Thus, to be able to make use of it, we will first have to transform this into dummy variables and then use the relevant variable in our model. We convert this to dummy variables as follows:
df = pd.get_dummies(df,columns=['Location'])
Stage 3 – Model Building
As we have got our base data ready, we will now proceed toward building our model. We will use ‘Year of Purchase’, ‘Km Driven’, ‘Verified’, ‘Days Since Posting’ and ‘Location_Rajouri Garden’ as our input variables and ‘Price’ as our target variable.
X = df[['Year of Purchase','Km Driven','Verified','Days Since Posting','Location_Rajouri Garden']]
y = df[['Price']]
We will use a 25% test dataset size and fit the Linear Regression model on the training set.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)
model = LinearRegression().fit(X_train,y_train)
We check the training and test set scores. For a regression model, score() returns the coefficient of determination (R²) rather than a classification accuracy.
print("Training set accuracy",model.score(X_train,y_train))
print("Test set accuracy",model.score(X_test,y_test))
Let’s check out the summary of our model
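The number reported by `score` for a scikit-learn regressor is the coefficient of determination, R². As a sketch of what that quantity measures, the standard-library function below computes it by hand; `r_squared` and the price lists are hypothetical and are not results from the model above.

```python
def r_squared(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [200000, 250000, 300000, 350000]      # made-up listing prices
predicted = [210000, 240000, 310000, 340000]   # made-up model predictions

print(r_squared(actual, predicted))
```

A value of 1 means the predictions match exactly, and a value of 0 means the model does no better than always predicting the mean price.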
Stage 4 – Predicting the Market Price
Finally, we will use the details of our own car and feed them into the model. Let’s revisit the input variable details we have for our own car
Year of Purchase – 2009
Km Driven – 80,000
Verified – 0
Days Since Posting – 0
Location-Rajouri Garden – 1
So far we are not a verified seller and would have to use 0 for the relevant feature. However, as we saw in our model summary, the coefficient for ‘Verified’ is positive, i.e., being a verified seller should enable us to list our vehicle at a higher price. Let’s test this both ways: first as a non-verified seller and then as a verified seller.
print("Market price for my car as a non-verified seller would be Rs.",int(round(model.predict([[2009,80000,0,0,1]]).flatten()[0])))
print("Market price for my car as a verified seller would be Rs.",int(round(model.predict([[2009,80000,1,0,1]]).flatten()[0])))
Conclusion
Thus, we saw how we could use the various capabilities of Python to determine the market price of items we want to sell on an online marketplace like OLX, Craigslist, or eBay. We extracted information from all similar listings in our area and built a basic machine learning model, which we used to predict the price to be set based on the features of our vehicle. Further, we also learned that it would be better to list our vehicle as a verified seller on OLX. Being a verified seller would fetch us a 17% higher price as compared to being a non-verified seller.
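How a positive coefficient on a 0/1 feature shifts a linear prediction can be sketched as follows. Every number in this block is invented for illustration and none of them is a coefficient of the fitted model; only the feature names are taken from the article.

```python
# Hypothetical intercept and coefficients, NOT the fitted model's values
intercept = -50_000_000.0
coefficients = {
    "Year of Purchase": 25_000.0,
    "Km Driven": -0.5,
    "Verified": 30_000.0,
    "Days Since Posting": -100.0,
    "Location_Rajouri Garden": 5_000.0,
}

def predict(features):
    """Linear prediction: intercept plus the weighted sum of the features."""
    return intercept + sum(coefficients[name] * value for name, value in features.items())

car = {
    "Year of Purchase": 2009,
    "Km Driven": 80000,
    "Verified": 0,
    "Days Since Posting": 0,
    "Location_Rajouri Garden": 1,
}

non_verified = predict(car)
verified = predict({**car, "Verified": 1})
print(non_verified, verified, verified - non_verified)
```

With all other inputs held fixed, switching the flag from 0 to 1 changes the prediction by exactly the coefficient of that flag.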
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
How Statistical Analysis Is Performed With Advantage?
What is Statistical Analysis?
Statistical analysis is the scientific way to collect and preprocess data and apply a set of statistical methods to discover insights or underlying patterns in it. With the increase in cheap data and growing bandwidth, we are now sitting on a ton of structured and unstructured data. Along with the need for acquiring and maintaining this huge amount of data, one main challenge is to deal with the noise and convert the data into a meaningful form. Statistical analysis offers a set of statistical methodologies and tools to address the problem.
How Statistical Analysis is Performed?
Statistical analysis is a vast field of data analysis in itself. Let us discuss the most common approaches to statistical data analysis:
Searching for Central Tendency
While working with structured data, getting an idea of the central tendency of the data set is often the preliminary step. Suppose you are analyzing the salary data of an organization. You may then be interested in questions such as: what is the average salary of a manager who has worked in the organization for 3 years with a given qualification? The following are used as measures of central tendency.
Mean: The mean is the average of all the data points. For salaries, the mean is the total salary divided by the number of data points.
Median: The median is the 50th percentile of the data. When we are seeking information like a typical salary, the median is a more robust measure, because it is less sensitive to outliers.
Mode: The mode is the most frequent value in the list of numbers. Suppose we are dealing with the list of numbers [12, 33, 44, 55, 67, 55, 8, 55]; here the mode will be 55.
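The three measures can be computed for the list used in the mode example with Python's standard `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

numbers = [12, 33, 44, 55, 67, 55, 8, 55]

mean_value = statistics.mean(numbers)      # sum of the values divided by their count
median_value = statistics.median(numbers)  # middle of the sorted values
mode_value = statistics.mode(numbers)      # most frequent value

print(mean_value, median_value, mode_value)
```

Because the list has an even number of values, the median is the average of the two middle values of the sorted list, 44 and 55.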
Searching for Dispersion
Standard Deviation: The standard deviation quantifies how much the data points vary from their central value (dispersion). The lower the value, the closer the data points lie to the central value.
Variance: The variance is the square of the standard deviation. It gives us the spread (variability) of the data. While working with high-dimensional data, we often need to reduce the dimensionality or find the important variables of the data set. In such situations, we transform the axes in such a way that maximum variability is preserved. These new, rotated axes are called the principal components. We choose the N most important components (the axes with high variance) from the rotated components.
Interquartile Range (IQR): The interquartile range is the range of data between the 25th and 75th percentile values of the data set. We use box plots, violin plots, etc. to analyze the IQR graphically.
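The dispersion measures can be sketched with the same standard `statistics` module. The population versions (`pvariance`, `pstdev`) are used here as an assumption, and the quartiles come from the module's default interpolation method, so other tools may report slightly different quartile values for the same data.

```python
import math
import statistics

numbers = [12, 33, 44, 55, 67, 55, 8, 55]

variance = statistics.pvariance(numbers)   # population variance
std_dev = statistics.pstdev(numbers)       # population standard deviation

# Three cut points that split the data into four equal-probability groups
q1, q2, q3 = statistics.quantiles(numbers, n=4)
iqr = q3 - q1

print(variance, std_dev, iqr)
print(math.isclose(std_dev ** 2, variance))
```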
Regression Problems
Advantages of Using Statistical Analysis
In the era of Big Data, when implementing any machine learning use case, how we choose the sample from the huge data lake is of the utmost importance. Statistical analysis helps us determine the proper sampling methodology (e.g. random sampling, random sampling without replacement, stratified sampling) and reduce sampling bias.
For example, suppose we are dealing with a binary classification problem where 80% of the data points belong to class A and only 20% belong to class B. If we want to perform a statistical test with samples from the population, we must ensure the samples are also in an 80:20 ratio (80% class A: 20% class B).
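A sample that keeps the 80:20 class ratio can be drawn by sampling each class separately, which is the idea behind stratified sampling. The sketch below uses only the standard library; the function name, the toy population and the 10% sampling fraction are assumptions made for illustration.

```python
import random

def stratified_sample(items, label_of, fraction, seed=0):
    """Draw the same fraction from every class so the class ratio is preserved."""
    rng = random.Random(seed)
    groups = {}
    for item in items:
        groups.setdefault(label_of(item), []).append(item)
    sample = []
    for members in groups.values():
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))  # sampling without replacement
    return sample

# Toy population: 80 points of class A and 20 points of class B
population = [("A", i) for i in range(80)] + [("B", i) for i in range(20)]
sample = stratified_sample(population, label_of=lambda item: item[0], fraction=0.1)

count_a = sum(1 for label, _ in sample if label == "A")
count_b = sum(1 for label, _ in sample if label == "B")
print(count_a, count_b)
```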
Be it sampling or decision making, the basis of statistical analysis is historical data. This makes statistical data analysis more acceptable as an industry standard than manual processes of data analysis.
Why Do We Need Statistical Analysis?
The main goal of statistical analysis is to find valuable insights in the data, which may be used to discover industry trends, measure customer attrition for a product or service, make valuable business decisions, etc.
From collecting data to finding its underlying patterns, statistical analysis is the base of all data-driven methodologies and classical machine learning.
Scope of Statistical Analysis
The following are the points that explain the scope of Statistical Analysis:
In today’s world, more and more Industries are switching to data-based decision-making systems instead of classical deterministic rule-based approaches.
From the industry point of view, statistical analysis is widely used to solve business problems across domains like Manufacturing, Insurance, Banking and Finance, Automobile, etc.
From a technical perspective, statistical analysis helps to solve linear regression, time series forecasting, predictive analysis, etc.
Conclusion
In this article, we have discussed various aspects of statistical data analysis, such as methodologies, the need for it, and the scope of its use cases. Statistical analysis is a very old area of study that lays the foundation for modern machine learning and data-driven business models. The practical implementation of statistical analysis methodologies differs based on the type of use case and industry.
Sentiment Analysis With Lstm And Torchtext With Code And Explanation
In this article, we will see every detail you need to know for sentiment analysis with an LSTM network using the torchtext library. We will see how to use the spaCy tokenizer in the torchtext data classes and how to use the tabular dataset and the bucket iterator. We will use an embedding matrix, with or without pre-trained GloVe embeddings, as input, and we will also see how to process text data of different lengths in a batch with pack_padded_sequence. You can use these techniques in your own problem.
What are Field and LabelField?
In sentiment data, we have text data and labels (sentiments). torchtext provides its own data types for text processing in NLP. The text data uses the data type Field, and the data type for the class is LabelField. In older versions of torchtext, you can import these data types from torchtext.data, but in the new version you will find them in torchtext.legacy.data. You can find detailed information for Field here.
Some important arguments of these data types that you will use are ‘tokenize’, ‘use_vocab’, ‘batch_first’, ‘include_lengths’, ‘sequential’, and ‘lower’. Let’s first understand the argument tokenize. In simple words, tokenization is the process of splitting a sentence into words or smaller units. You can set tokenize in several ways: by defining your own tokenizer function, by creating one with torchtext's get_tokenizer, or by using the built-in tokenizer of Field. First, we will install spaCy; then we will see the tokenizer function.
pip install spacy
python -m spacy download en_core_web_sm
import spacy
# Load the English model that the tokenizer function below relies on
spacy_en = spacy.load('en_core_web_sm')
# Build tokenizer
def tokenizer(text):
    return [token.text for token in spacy_en.tokenizer(text)]
You can also define the tokenizer with torchtext's get_tokenizer (another way to define it):
from torchtext.data.utils import get_tokenizer
tokenizer = get_tokenizer('spacy', language='en_core_web_sm')
Let’s see the output of either of the tokenizers we defined above. Both give the same result.
print(tokenizer("I can't run whole day"))
Output: ['I', 'ca', "n't", 'run', 'whole', 'day']
After defining the tokenizer, you can pass it into your Field. Field is the data type for your input text. For the purposes of this article, let’s define some sample data in a CSV file.
import torch
from torchtext.legacy import data  # use `from torchtext import data` on older releases
TEXT = data.Field(tokenize=tokenizer, use_vocab=True, lower=True, batch_first=True, include_lengths=True)
LABEL = data.LabelField(dtype=torch.long, batch_first=True, sequential=False)
fields = [('text', TEXT), ('label', LABEL)]
In the code above, the text input is sequential data, and the sequential argument is True by default, so there is no need to pass it for TEXT; we pass sequential=False only for the label field. The include_lengths argument returns the length of each sentence in a batch; we will see this in more detail in the BucketIterator section of this article. We can also let Field do the tokenization itself, without using any of the tokenizer functions defined above:
TEXT = data.Field(use_vocab=True, lower=True, tokenize='spacy', tokenizer_language='en_core_web_sm', batch_first=True, include_lengths=True)
TabularDataset for the Project
training_data = data.TabularDataset(
    path='sample.csv',
    format='csv',
    fields=fields,
    skip_header=True,
)
for example in training_data.examples:
    print(example.text, example.label)
Output:
['she', 'good'] 1
['he', 'is', 'sad'] 2
['i', 'am', 'very', 'happy'] 1
As always, we split the data into training and test data, as we would with train_test_split from Sklearn. Here TabularDataset has a split function itself, and we will use that function to split our data with a random state:
import random
# SEED is assumed to be an integer defined earlier. random.seed() returns None,
# so the generator is seeded first and its state is then passed to split().
random.seed(SEED)
train_data, val_data = training_data.split(split_ratio=0.7, random_state=random.getstate())
Glove Embedding for Sentiment Analysis LSTM TorchText
Up to this point, we have read our data and converted it into a TabularDataset. Now we will see how to use embeddings with this data. Here are some basic notes on embeddings, in case you are not familiar with them. A neural net only deals with numbers. An embedding maps each word to an integer, and there is a vector corresponding to each integer. Refer to the image below: suppose we have 10k words in our dictionary and have assigned each word a value between 1 and 10k.
Create a zero vector of dimension 10k. Now suppose you want to represent the word “man”: because its value is 1 in the dictionary (refer to the image below), put 1 at the first index of the vector and keep the others at zero. Such vectors are called one-hot encoded vectors, and the problem with them is their dimension. If we have 2B words in our dictionary, we have to make a 2B-dimensional vector.
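The difference between a one-hot vector and a dense vector can be sketched with a toy vocabulary. Everything in this block is made up for illustration: the four words, the 0-based indices (the text above counts from 1) and the hand-written 2-dimensional vectors, which stand in for learned embeddings.

```python
# Toy vocabulary; indices start at 0 here
vocab = {"man": 0, "woman": 1, "king": 2, "queen": 3}

def one_hot(word):
    """Vector as long as the vocabulary with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector

# Hypothetical dense vectors; real embeddings such as GloVe are learned from text
dense = {
    "man": [0.9, 0.1],
    "woman": [0.8, 0.3],
    "king": [0.7, 0.9],
    "queen": [0.6, 1.0],
}

print(one_hot("man"))   # grows with the vocabulary size
print(dense["man"])     # fixed size regardless of the vocabulary size
```

The one-hot vector needs one position per word in the vocabulary, while the dense vector keeps the same small dimension however many words are added.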
To overcome this problem we use dense vectors, and GloVe is one such approach that provides a dense vector for each word. Here we will download and use pre-trained GloVe embeddings in our problem. You can download the GloVe vectors using torchtext, and all the dimensional details can be found at this link.
from torchtext.vocab import Vectors
vectors = Vectors(name='glove.6B.50d.txt')
TEXT.build_vocab(train_data, vectors=vectors, max_size=10000, min_freq=1)
LABEL.build_vocab(train_data)

In the above code, we initialized the vectors and built the vocabulary of our training data with them. That is, we get a vector for every known token (word) in the data set. We can also restrict the size of the vocabulary. If you do not have the GloVe text file, use the following code to download the vectors. The cache argument stores the downloaded file for future use, so there is no need to download the same file again and again.
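import os
from torchtext.vocab import GloVe
cache = '.vector_cache'
if not os.path.exists(cache):
    os.mkdir(cache)
# The class is named GloVe. The '6B' set matches the 50-dimensional file used above;
# the '840B' set is distributed only in 300 dimensions.
vectors = GloVe(name='6B', dim=50, cache=cache)
When you have built the vocabulary, you can check out the dictionary. Here the data is small, so all tokens can be printed for demonstration purposes.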
print(list(TEXT.vocab.stoi.items()))
output: [('<unk>', 0), ('<pad>', 1), ('am', 2), ('good', 3), ('happy', 4), ('he', 5), ('i', 6), ('is', 7), ('sad', 8), ('she', 9), ('very', 10)]
As you may have noticed, we have two extra tokens, UNK and PAD, whose indices are 0 and 1. If you want to see the vector corresponding to token=’good’, you can do so with the code below.
print(TEXT.vocab.vectors[TEXT.vocab.stoi['good']])
Here TEXT.vocab.vectors contains 50-dimensional vectors for 11 different tokens. TEXT.vocab.stoi converts a string to an integer (index). The vectors for UNK and PAD are always zero vectors. The values are not printed here because they would take up too much space, but you can play around with them. Next we get the device type, because it is going to be used in the BucketIterator.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BucketIterator for Sentiment Analysis LSTM TorchText
Before the code part of BucketIterator, let’s understand the need for it. This iterator rearranges our data so that sequences of similar length fall in one batch, in descending order of sequence length (seq_len = number of tokens in a sentence). If we have texts of length [4,6,8,5] and we want to split this data into two batches, the BucketIterator will split it into [8,6] and [5,4].
Figure 3: BucketIterator for one batch
Arranging data in descending order is required for efficient calculations. The question marks in Figure 3 mark empty positions: should we replace them with PAD tokens? You will get the answer later in this article. BucketIterator keeps sentences of similar length in one batch, which reduces the overhead of padding tokens from a computational point of view. First, see how to code the BucketIterator:
BATCH_SIZE = 2
train_itr, val_itr = BucketIterator.splits(
    (train_data, val_data),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.text),
    device=device,
    shuffle=True,
    sort_within_batch=True,
    sort=False
)

I hope every argument is self-explanatory here; we passed a batch size of 2. Choose the batch size wisely, as it is a crucial hyper-parameter and its value also depends on how much data fits in your GPU/CPU memory. We did not sort the entire data set, but we did sort the data samples within a batch (sort_within_batch=True). See how our batches look:
for batch_no, batch in enumerate(train_itr):
    text, batch_len = batch.text
    print(text, batch_len)
    print(batch.label)

output:
(tensor([[ 6, 2, 10, 4], [ 5, 7, 8, 1]]), tensor([4, 3]))
tensor([0, 1])

Each batch contains the token ids and labels. Here we also got the length of each sentence in the batch, because we passed include_lengths as true in the TEXT Field. If you have more sentences of different lengths, you will see BucketIterator arrange the data very nicely.
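To make the bucketing idea concrete, here is a minimal plain-Python sketch. It is not torchtext's implementation (the real BucketIterator also shuffles and buckets in chunks); the helper names bucket_batches and padding_needed are made up for illustration. It only shows why grouping similar lengths reduces the number of PAD tokens for the lengths [4, 6, 8, 5] used above.

```python
def bucket_batches(lengths, batch_size):
    """Group sequence lengths into batches of similar length, longest first."""
    ordered = sorted(lengths, reverse=True)
    return [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]


def padding_needed(batches):
    """Count PAD slots when every batch is padded to its longest sequence."""
    return sum(max(batch) - n for batch in batches for n in batch)


lengths = [4, 6, 8, 5]
bucketed = bucket_batches(lengths, batch_size=2)
# Batches taken in the original order, without bucketing
naive = [lengths[i:i + 2] for i in range(0, len(lengths), 2)]

print(bucketed)                  # [[8, 6], [5, 4]]
print(padding_needed(bucketed))  # 3
print(padding_needed(naive))     # 5
```

With bucketing, only 3 padded positions are needed instead of 5, and the gap tends to widen as the spread of sentence lengths grows.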
Basics of LSTM Model

Long short-term memory (LSTM) is a member of the RNN family. An RNN learns sequential relationships, which is why RNNs work well in NLP: the next token carries some information from the previous tokens. LSTM can learn longer sequences compared to a plain RNN or GRU. Example: “I am not going to say sorry, and this is not my fault.”
Here the same person who does not want to say sorry is also confident of not being guilty. To understand such logic, the network has to be capable of learning the relationship between the first word and the last word of a sentence if necessary. For longer sentences, the network has to understand the relevant relationships between all words and the order of the sequence (which token comes next in the sentence).
The LSTM performs very well here and remembers longer dependencies in the sequence, thanks to its ability to remember relevant information and forget irrelevant information in a sequence. You can explore this article for more details; you will get all the RNN basics.
Input Shape and Hidden
The input can be given in two ways: 1. Sequence first: (Sequence Length, Batch Size, Input Dimension) 2. Batch first: (Batch Size, Sequence Length, Input Dimension). We will use the second format here. We have already defined the batch size in the BucketIterator, the sequence length is the number of tokens in each (padded) sequence of a batch, and the input dimension is the GloVe vector dimension, which is 50 in our case.
The hidden shape is (Number of Directions * Number of Layers, Batch Size, Hidden Size). Sentiment information can be extracted using a bi-directional LSTM, so the number of directions is 2. We will use 2 stacked LSTM layers, so the number of layers is also 2. We have already discussed the batch size, and for the hidden size you can choose a suitable value such as 8, 16, 32, 64, etc.
Figure 4: Input shape for LSTM(RNN)
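The shape rules above can be checked with a small framework-free sketch. The function lstm_shapes is a hypothetical helper, not part of PyTorch; it simply applies the formulas from the text for a batch-first, bidirectional, 2-layer LSTM. The output shape is included as an extra, based on how a batch-first LSTM returns unpacked outputs.

```python
def lstm_shapes(batch_size, seq_len, input_dim, hidden_size, num_layers, bidirectional):
    """Return the tensor shapes a batch-first LSTM expects and produces."""
    num_directions = 2 if bidirectional else 1
    return {
        # (Batch Size, Sequence Length, Input Dimension)
        "input": (batch_size, seq_len, input_dim),
        # (Number of Directions * Number of Layers, Batch Size, Hidden Size)
        "hidden": (num_directions * num_layers, batch_size, hidden_size),
        # per-token outputs concatenate both directions
        "output": (batch_size, seq_len, num_directions * hidden_size),
    }


shapes = lstm_shapes(batch_size=2, seq_len=4, input_dim=50,
                     hidden_size=64, num_layers=2, bidirectional=True)
for name, shape in shapes.items():
    print(name, shape)
```

Note that the hidden shape keeps the batch size in the second position even when the input is batch first.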
Model

class SentimentClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden, n_label, n_layers):
        super(SentimentClassifier, self).__init__()
        self.hidden = hidden
        self.n_layers = n_layers
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=n_layers,
                            bidirectional=True, batch_first=True)  # dropout=0.2
        self.fc = nn.Linear(hidden * 2, n_label)

    def forward(self, input, actual_batch_len):
        embed_out = self.embed(input)
        # initial states live on the same device as the input batch
        hidden = torch.zeros(self.n_layers * 2, input.shape[0], self.hidden, device=input.device)
        cell = torch.zeros(self.n_layers * 2, input.shape[0], self.hidden, device=input.device)
        pack_out = nn.utils.rnn.pack_padded_sequence(
            embed_out, actual_batch_len, batch_first=True)
        out_lstm, (hidden, cell) = self.lstm(pack_out, (hidden, cell))  # dropout
        # concatenate the last layer's forward and backward hidden states
        hidden = torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1)
        out = self.fc(hidden)
        return out

VOCAB_SIZE = len(TEXT.vocab)
EMBEDDING_DIM = TEXT.vocab.vectors.shape[1]
HIDDEN = 64
NUM_LABEL = 4  # number of classes
NUM_LAYERS = 2
model = SentimentClassifier(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN, NUM_LABEL, NUM_LAYERS)

This is our model; do not worry, we will break this code down step by step. VOCAB_SIZE is the total number of tokens in the data set, EMBEDDING_DIM is the GloVe vector dimension (50 here), HIDDEN is 64, NUM_LABEL is the number of classes, and NUM_LAYERS is 2, meaning 2 stacked LSTM layers. First, we defined the embedding layer, which maps each token in the vocabulary to a dense vector; this is why we map the total vocabulary size to the vector dimension. See an example of a Torch embedding where we have only 2 tokens in the vocabulary and want to transform each into a 4-dimensional vector:
emb = nn.Embedding(2, 4)  # size of vocab = 2, vector len = 4
print(emb.weight)

output:
tensor([[ 0.2626, -0.7775, -0.7230, 0.6391], [-0.7772, 0.4914, -0.9622, 1.2316]], requires_grad=True)

In the above output, the first and second rows are the 4-dimensional embedding vectors for emb(0) [token 1] and emb(1) [token 2] respectively. The second thing we defined in the classifier is the LSTM layer, which maps the vector (embedding dimension) to the hidden size. You can also pass dropout to the LSTM for regularization. Finally, we defined a fully connected layer that outputs our desired number of classes; the input to this linear transformation is two times the hidden size. Why two times the hidden size? Because this is a bidirectional LSTM, and we concatenate the final hidden states from the forward and backward directions of the last LSTM layer.
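As a rough illustration of the two ideas above, the following plain-Python sketch treats the embedding as a lookup table and shows why the linear layer receives twice the hidden size. All numbers are made up for illustration and the names (embedding_table, embed, forward_hidden, backward_hidden) are hypothetical; a real nn.Embedding holds trainable weights.

```python
# A toy embedding table: 2 tokens in the vocabulary, 4 dimensions each
embedding_table = [
    [0.1, 0.2, 0.3, 0.4],  # row 0: vector for token id 0
    [0.5, 0.6, 0.7, 0.8],  # row 1: vector for token id 1
]


def embed(token_ids):
    """Look up one row of the table per token id."""
    return [embedding_table[token_id] for token_id in token_ids]


sentence_vectors = embed([1, 0, 1])
print(sentence_vectors)

# Final hidden states of the last layer, one per direction (hidden size 3 here)
forward_hidden = [0.2, -0.1, 0.7]
backward_hidden = [0.4, 0.9, -0.3]

# The classifier input is the concatenation, so its size is 2 * hidden size
classifier_input = forward_hidden + backward_hidden
print(len(classifier_input))  # 6
```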
Time to discuss what we did in the forward method of the SentimentClassifier class. We pass two arguments: the input (batched data) and the number of tokens in each sequence of the batch. First we pass the input to the embedding layer we created, but wait: this embedding is not aware of the GloVe embedding we just downloaded. If you do not want to use any pretrained embedding, just go ahead (the embedding parameters are then learned from scratch); otherwise, use the following code to copy the existing vector for each token we have.
model.embed.weight.data.copy_(TEXT.vocab.vectors)
print(model.embed.weight)

Output:
tensor([[ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
        [ 0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
        [-0.2660, 0.4732, 0.3187, ..., -0.1116, -0.2955, -0.2576],
        ...,
        [ 0.1777, 0.1764, 0.0684, ..., 0.1164, -0.0368, 0.1446],
        [ 0.4121, 0.0792, -0.4929, ..., 0.0564, 0.1322, -0.5023],
        [ 0.5183, 0.0194, 0.0089, ..., 0.2638, -0.0442, -0.3650]])

The first two vectors are zero vectors, as they represent the UNK and PAD tokens (as we saw in the GloVe embedding section). Copying the pre-trained embedding helps our model converge much faster, because the tokens are already well positioned in a high-dimensional space. So do not forget to copy the existing vectors from the pre-trained embedding.
The hidden and cell states need to be reset for the first token of every new sentence in the LSTM, which is why we initialize them to zero before passing them to the LSTM. If we do not set the hidden and cell states to zero, Torch does it for us, so this step is optional here. We used pack_padded_sequence, and the question is why. As you may remember, we saw question marks in Figure 3 for empty tokens; scroll up if you missed them.

pack_padded_sequence

We then used pack_padded_sequence on the embedding output. BucketIterator grouped sequences of similar length into one batch in descending order of sequence length, and this ordering is essential for pack_padded_sequence. The pack_padded_sequence function returns new batches built from the existing batch. I will give you all the basics through code:
Figure 5: Batch creation pack_padded_sequence
data: tensor([[ 6, 2, 10, 4], [ 9, 3, 1, 1]])  # 1 is the padded token
len: tensor([4, 2])

Let's take a batch of two sentences: (1) “I am very happy” and (2) “She good”. The token ids are written above, with lengths [4, 2]. The pack_padded_sequence function converts the data into per-time-step batches of sizes [2, 2, 1, 1], as shown in Figure 5. Let us understand this with a small code example in which we pass the embedding output to pack_padded_sequence together with the list of sequence lengths [4, 2].
for batch in train_itr:
    text, lengths = batch.text  # avoid shadowing the built-in len
    emb = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
    emb.weight.data.copy_(TEXT.vocab.vectors)
    emb_out = emb(text)
    pack_out = nn.utils.rnn.pack_padded_sequence(emb_out, lengths, batch_first=True)
    rnn = nn.RNN(EMBEDDING_DIM, 4, batch_first=True)
    out, hidden = rnn(pack_out)

If we print the hidden here we will get:

Hidden Output:
[[[ 0.9451, -0.9984, -0.4613, 0.9768],
  [ 0.9672, -0.9905, -0.1192, 0.9983]]]

If we print the complete output we will get:
rnn_output:
[[ 0.9092, -0.9358, -0.8513, 0.9401],
 [ 0.8691, -0.9776, 0.5006, 0.1485],
 [ 0.8109, -0.9987, 0.9487, 0.9641],
 [ 0.9672, -0.9905, -0.1192, 0.9983],
 [ 0.9926, -0.9055, -0.5543, 0.9884],
 [ 0.9451, -0.9984, -0.4613, 0.9768]]

Refer to Figure 5 for this explanation (focus on the purple-lined tokens). The hidden state of the last token summarizes the sentiment of the sentence. The first hidden output corresponds to the last token (“happy”) of the first sequence, and in the rnn_output list it is the last (6th) entry. The second-to-last (5th) rnn_output belongs to the intermediate token “very” and is of no use here. The last hidden output belongs to the last token of the second sequence (“good”), and it is the 4th rnn_output. As the sequence lengths and the data set grow, pack_padded_sequence saves a lot of computation because padded positions are never processed. You can transform the output back to its original padded form by running the following line; I leave this part for you to analyze.
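The ordering of the packed outputs can be reproduced with a short plain-Python sketch. This is not how PyTorch implements pack_padded_sequence; pack and last_step_positions are hypothetical helpers that only mimic the time-major layout for the two example sentences, using words instead of token ids for readability.

```python
def pack(padded, lengths):
    """Return (packed_items, batch_sizes) in time-major order, skipping padding."""
    packed, batch_sizes = [], []
    for t in range(max(lengths)):
        # keep only the sequences that still have a real token at time step t
        step = [seq[t] for seq, n in zip(padded, lengths) if t < n]
        packed.extend(step)
        batch_sizes.append(len(step))
    return packed, batch_sizes


def last_step_positions(lengths):
    """Index in the packed list of each sequence's final real token."""
    positions = []
    for row, n in enumerate(lengths):
        # items emitted during all earlier time steps
        before = sum(sum(1 for m in lengths if m > t) for t in range(n - 1))
        # earlier rows that are still active at this sequence's last step
        rank = sum(1 for m in lengths[:row] if m > n - 1)
        positions.append(before + rank)
    return positions


padded = [["I", "am", "very", "happy"],
          ["She", "good", "<pad>", "<pad>"]]
lengths = [4, 2]

packed, batch_sizes = pack(padded, lengths)
print(packed)                        # ['I', 'She', 'am', 'good', 'very', 'happy']
print(batch_sizes)                   # [2, 2, 1, 1]
print(last_step_positions(lengths))  # [5, 3]
```

The positions 5 and 3 (zero-based) are the 6th and 4th packed outputs, which are the rows that match the two hidden outputs printed above.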
print(nn.utils.rnn.pad_packed_sequence(out, batch_first=True))

We have now covered everything we need to know: we have the data in our hands, the model is ready, and we have copied the GloVe embedding to our model's embedding. Finally, we will define some hyper-parameters and then start training.
Calculate Loss
opt = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()
model.to(device)

We defined CrossEntropyLoss (multi-class) as the loss function because we have 4 output classes, and we used Adam as the optimizer. If you remember, we passed the data to the device in the BucketIterator, so if you have CUDA, call model.to(device) as well, because the data and the model must be in the same memory, either CPU or GPU. Now we will define functions to calculate the loss and accuracy of our model.
def accuracy(preds, y):
    # index of the highest score per sample is the predicted class
    _, preds = torch.max(preds, dim=1)
    acc = torch.sum(preds == y) / len(y)
    return acc

def calculateLoss(model, batch, criterion):
    text, text_len = batch.text
    # pack_padded_sequence expects the lengths on the CPU
    preds = model(text, text_len.to('cpu'))
    loss = criterion(preds, batch.label)
    acc = accuracy(preds, batch.label)
    return loss, len(batch.label), acc

The accuracy function consists of simple Torch operations that match our predictions against the actual labels. In calculateLoss we pass the input to our model; the only thing to note is that we move the batch sequence lengths (text_len in the code above) to the CPU before calling the model.
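For readers who want to see the accuracy logic without tensors, here is a minimal plain-Python sketch of the same idea: take the index of the highest score per sample and compare it with the label. The scores and labels are invented for illustration, and argmax and accuracy_from_scores are hypothetical helper names.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_scores(scores, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    preds = [argmax(row) for row in scores]
    correct = sum(pred == label for pred, label in zip(preds, labels))
    return correct / len(labels)


# Three samples, four classes (made-up scores)
scores = [
    [0.1, 2.0, -1.0, 0.3],   # predicted class 1
    [1.5, 0.2, 0.0, -0.4],   # predicted class 0
    [0.0, 0.1, 0.2, 3.0],    # predicted class 3
]
labels = [1, 0, 2]

acc = accuracy_from_scores(scores, labels)
print(round(acc, 4))  # 0.6667
```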
Epoch Loop
N_EPOCH = 100
for i in range(N_EPOCH):
    model.train()
    train_len, train_acc, train_loss = 0, [], []
    for batch_no, batch in enumerate(train_itr):
        opt.zero_grad()
        loss, blen, acc = calculateLoss(model, batch, criterion)
        # store plain floats so the running totals do not keep the autograd graph
        train_loss.append(loss.item() * blen)
        train_acc.append(acc.item() * blen)
        train_len = train_len + blen
        loss.backward()
        opt.step()
    train_epoch_loss = np.sum(train_loss) / train_len
    train_epoch_acc = np.sum(train_acc) / train_len

    model.eval()
    with torch.no_grad():
        val_results = [calculateLoss(model, batch, criterion) for batch in val_itr]
    loss, batch_len, acc = zip(*val_results)
    loss = [l.item() for l in loss]
    acc = [a.item() for a in acc]
    epoch_loss = np.sum(np.multiply(loss, batch_len)) / np.sum(batch_len)
    epoch_acc = np.sum(np.multiply(acc, batch_len)) / np.sum(batch_len)
    print('epoch:{}/{} epoch_train_loss:{:.4f},epoch_train_acc:{:.4f}'
          ' epoch_val_loss:{:.4f},epoch_val_acc:{:.4f}'.format(
              i + 1, N_EPOCH, train_epoch_loss, train_epoch_acc, epoch_loss, epoch_acc))

If you are new to Torch: we use three important pieces of functionality: (1) zero_grad() to set all gradients to zero, (2) loss.backward() to compute the gradients, and (3) opt.step() to update the parameters. All three are needed only during training, so we wrap the evaluation phase in torch.no_grad().
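The epoch metrics in the loop are averages weighted by batch size, which matters when the last batch is smaller than the others. A minimal sketch of that calculation, with made-up batch losses and a hypothetical helper name epoch_average:

```python
def epoch_average(batch_values, batch_sizes):
    """Average per-batch means, weighting each batch by its number of samples."""
    total = sum(value * size for value, size in zip(batch_values, batch_sizes))
    return total / sum(batch_sizes)


batch_losses = [0.50, 0.20, 0.80]  # mean loss of each batch (invented numbers)
batch_sizes = [2, 2, 1]            # the last batch is smaller

weighted = epoch_average(batch_losses, batch_sizes)
unweighted = sum(batch_losses) / len(batch_losses)

print(round(weighted, 4))    # 0.44
print(round(unweighted, 4))  # 0.5
```

The unweighted mean overstates the loss here because the single-sample batch counts as much as the two-sample batches.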
Conclusion

Wow, we have completed this article, and it's time for you to get hands-on with your own data set. In my experience, sentiment analysis is used heavily in many real-world industry applications. I hope this article has improved your understanding. See you next time with another interesting NLP article.
All the images used in this article are designed by the author.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
Python String Count() With Examples
Python count
count() is a built-in string method in Python. It returns the total count of a given element in a string. The counting runs from the start of the string to the end. It is also possible to specify the start and end index between which the search should take place.
The syntax for Python String Count()

Python count function syntax:

string.count(char or substring, start, end)

Parameters of Python Syntax
Char or substring: The single character or substring you want to search for in the given string. count() returns the number of times this character or substring occurs in the given string.
start: (optional) The index from which the search begins. If not given, the search starts from 0. For example, if you want to search for a character starting from the middle of the string, pass the start value to the count function.
end: (optional) The index at which the search ends. If not given, the search continues to the end of the string. For example, if you do not want to scan the entire string and want to limit the search to a specific point, pass the end value to the count function, and it will search only up to that point.
Return Value

The count() method returns an integer value, i.e., the count of the given element in the given string. It returns 0 if the value is not found in the given string.
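A few quick checks make the parameter and return-value rules concrete. This is a small sketch using only the built-in str.count(); the sample strings are arbitrary. It shows that start and end behave like slice indices, that matches are counted without overlapping, and that the search is case-sensitive.

```python
text = "Hello World"

# start/end behave like slice indices: the end index is not included
in_range = text.count("o", 0, 5)
same_as_slice = text[0:5].count("o")
print(in_range, same_as_slice)   # 1 1

# only a start index: search from index 5 to the end of the string
from_middle = text.count("o", 5)
print(from_middle)               # 1

# a value that does not occur returns 0; the search is case-sensitive
missing = text.count("z")
lowercase_h = text.count("h")
print(missing, lowercase_h)      # 0 0

# occurrences are counted without overlapping
non_overlapping = "aaaa".count("aa")
print(non_overlapping)           # 2
```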
Example 1: Count Method on a String

The following example shows how the count() function works on a string.
str1 = "Hello World"
str_count1 = str1.count('o')  # counting the character 'o' in the given string
print("The count of 'o' is", str_count1)
str_count2 = str1.count('o', 0, 5)
print("The count of 'o' using start/end is", str_count2)

Output:

The count of 'o' is 2
The count of 'o' using start/end is 1

Example 2: Count occurrence of a character in a given string

The following example counts the occurrences of a character in a given string, both over the whole string and within a start/end index range.
str1 = "Welcome to Guru99 Tutorials!"
str_count1 = str1.count('u')  # counting the character 'u' in the given string
print("The count of 'u' is", str_count1)
str_count2 = str1.count('u', 6, 15)
print("The count of 'u' using start/end is", str_count2)

Output:

The count of 'u' is 3
The count of 'u' using start/end is 2

Example 3: Count occurrence of substring in a given string

The following example counts the occurrences of a substring in a given string, both over the whole string and within a start/end index range.
str1 = "Welcome to Guru99 - Free Training Tutorials and Videos for IT Courses"
str_count1 = str1.count('to')  # counting the substring 'to' in the given string
print("The count of 'to' is", str_count1)
str_count2 = str1.count('to', 6, 15)
print("The count of 'to' using start/end is", str_count2)

Output:

The count of 'to' is 2
The count of 'to' using start/end is 1

Summary:
count() is a built-in method in Python. It returns the count of a given element in a list or a string.
In the case of a string, the counting runs from the start of the string to the end. It is also possible to specify the start and end index between which the search should take place.
The count() method returns an integer value.