Author: Yogesh Kulkarni
Source:
https://www.analyticsvidhya.com/blog/2017/01/sentiment-analysis-of-twitter-posts-on-chennai-floods-using-python
Thanks Eran Shlomo for posting it to Linkedin
Converted and annotated into an IPython notebook by Samy Zafrany
The best way to learn data science is to do data science. No second thought about it!
One of the ways I do this is to continuously look for interesting work done by other community members. Once I understand a project, I redo or improve it on my own. Honestly, I can’t think of a better way to learn data science.
As part of my search, I came across a study on sentiment analysis of the Chennai Floods on Analytics Vidhya. I decided to perform sentiment analysis of the same study using Python and add it here. Well, what can be better than building onto something great?
To get acquainted with the 2015 Chennai Floods crisis, you can read the complete study here. This study was done on a set of social interactions limited to the first two days of the Chennai Floods in December 2015.
The objective of this article is to understand the different subjects of interaction during the floods using Python. Grouping similar messages together, with emphasis on predominant themes (rescue, food, supplies, ambulance calls), can help the government and other authorities act in the right manner during the crisis.
# These are css/html styles for good looking ipython notebooks
from IPython.core.display import HTML
css = open('style-notebook.css').read()
HTML('<style>{}</style>'.format(css))
A typical tweet is mostly a text message within a limit of 140 characters. #hashtags convey the subject of the tweet, whereas @user mentions seek the attention of that user. Forwarding is denoted by ‘rt’ (retweet) and is a measure of a tweet's popularity. One can like a tweet by marking it a ‘favorite’.
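As a small aside (not part of the original study), the #hashtags and @user mentions can be pulled out of a raw tweet text with simple regular expressions:
import re
# A made-up tweet used only for illustration
tweet = "rt @ChennaiRains: need boats near Kotturpuram #ChennaiFloods #ICanAccommodate"
print(re.findall(r'#\w+', tweet))   # hashtags -> ['#ChennaiFloods', '#ICanAccommodate']
print(re.findall(r'@\w+', tweet))   # mentions -> ['@ChennaiRains']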
About 6,000 tweets with the ‘#ChennaiFloods’ hashtag were collected between 1st and 2nd Dec 2015. Jefferson Henrique’s GetOldTweets utility (got) was used in Python 2.7 to collect the older tweets. One can store the tweets either in a csv file or in a database like MongoDB for further processing.
The got Python module can be downloaded from: https://github.com/Jefferson-Henrique/GetOldTweets-python
Note that you will also need to download MongoDB from https://www.mongodb.com and run a MongoDB server on your PC.
import got, codecs
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
tweetCriteria = got.manager.TweetCriteria().setQuerySearch('ChennaiFloods').setSince("2015-12-01").setUntil("2015-12-02").setMaxTweets(6000)
def streamTweets(tweets):
    # Insert each collected tweet as a document into the MongoDB collection
    for t in tweets:
        obj = {"user": t.username, "retweets": t.retweets, "favorites": t.favorites,
               "text": t.text, "geo": t.geo, "mentions": t.mentions,
               "hashtags": t.hashtags, "id": t.id, "permalink": t.permalink}
        tweetind = collection.insert_one(obj).inserted_id
got.manager.TweetManager.getTweets(tweetCriteria, streamTweets);
Tweets stored in MongoDB can be accessed from another Python script. The following example shows how the whole DB was converted to a Pandas DataFrame.
import pandas as pd
from pymongo import MongoClient
client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
df = pd.DataFrame(list(collection.find()))
Take a look at the first 6 records of this DataFrame:
df.head(6)
Once in DataFrame format, it is easier to explore the data. Here are a few examples:
# FreqDist = Frequency Distribution of words list
from nltk import FreqDist
hashtags = []
for hs in df["hashtags"]: # Each entry may contain multiple hashtags. We need to split.
hashtags += hs.split(" ")
hashtag_dist = FreqDist(hashtags)
hashtag_dist.plot(10)
As seen in the study, the most used tags were “#chennairains” and “#ICanAccommodate”, apart from the original query tag “#ChennaiFloods”.
users = df["user"].tolist()
users_dist = FreqDist(users)
users_dist.plot(10)
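Another quick exploration (a small sketch beyond the original code; it assumes the 'retweets' field stored above is numeric) is to list the most retweeted tweets:
# Top 5 tweets by retweet count
df.sort_values("retweets", ascending=False)[["user", "retweets", "text"]].head(5)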
All tweets are processed to remove unnecessary things like links, non-English words, stopwords, punctuation, etc. We employ the power of the nltk.tokenize and nltk.corpus modules:
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
import re, string
import nltk
tweets_texts = df["text"].tolist()
stopwords = stopwords.words('english')
english_vocab = set(w.lower() for w in nltk.corpus.words.words())
def process_tweet_text(tweet):
    if tweet.startswith('@null'):
        return "[Tweet not available]"
    tweet = re.sub(r'\$\w*', '', tweet)                           # Remove tickers
    tweet = re.sub(r'https?:\/\/.*\/\w*', '', tweet)              # Remove hyperlinks
    tweet = re.sub(r'[' + string.punctuation + ']+', ' ', tweet)  # Remove punctuation like 's
    twtok = TweetTokenizer(strip_handles=True, reduce_len=True)
    tokens = twtok.tokenize(tweet)
    # Keep lowercased tokens that are not stopwords, longer than 2 characters, and in the English vocabulary
    tokens = [i.lower() for i in tokens
              if i not in stopwords and len(i) > 2 and i in english_vocab]
    return tokens
words = []
for tw in tweets_texts:
    words += process_tweet_text(tw)
# How many words do we have?
len(words)
# Take a look at the first 15 words
words[0:15]
The words are plotted again to find the most frequently used terms. A few simple words repeat more often than others: ’help’, ‘people’, ‘stay’, ’safe’, etc.
[('twitter', 1026), ('pic', 1005), ('help', 569), ('people', 429), ('safe', 274)]
These are immediate reactions and responses to the crisis.
Some infrequent terms are [('fit', 1), ('bible', 1), ('disappear', 1), ('regulated', 1), ('doom', 1)].
words_dist = FreqDist(words)
words_dist.plot(16)
Collocations are words that are frequently found together. They can be bi-grams (two words together), trigrams (three words), or more general n-grams (n words).
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words, 5)
finder.apply_freq_filter(5)
for word1, word2 in finder.nbest(bigram_measures.likelihood_ratio, 10):
    print(word1, word2)
These depict the disastrous situation, like “stay safe” and “rescue team”, and even a commonly used Hindi phrase “pani pani” (lots of water).
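The same nltk collocations API also supports trigrams; the short sketch below (an addition to the original code) looks for frequent three-word phrases in the same word list:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
tri_finder = TrigramCollocationFinder.from_words(words)
tri_finder.apply_freq_filter(5)   # keep trigrams occurring at least 5 times
for w1, w2, w3 in tri_finder.nbest(trigram_measures.likelihood_ratio, 10):
    print(w1, w2, w3)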
In such crisis situations, lots of similar tweets are generated. They can be grouped together in clusters based on closeness or ‘distance’ amongst them. Artem Lukanin has explained the process in detail here. The TF-IDF method is used to vectorize the tweets, and then cosine distance is measured to assess their similarity.
Each tweet is pre-processed and added to a list. The list is fed to a TF-IDF vectorizer to convert each tweet into a vector. Each value in the vector depends on how many times a word or term appears in the tweet (TF) and on how rare it is amongst all tweets/documents (IDF). The small example below gives a feel for the TF-IDF matrix it generates.
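As a quick, self-contained illustration (an addition to the original study; the three toy sentences are made up), the TF-IDF matrix for a handful of short texts can be shown as a dense DataFrame:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
toy_texts = ["need help rescue", "need food water", "stay safe chennai"]   # hypothetical examples
toy_vec = TfidfVectorizer()
toy_matrix = toy_vec.fit_transform(toy_texts)
# Rows are texts, columns are terms; a high value means the term is frequent
# in that text but rare across the collection
pd.DataFrame(toy_matrix.toarray(), columns=toy_vec.get_feature_names())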
cleaned_tweets = []
for tw in tweets_texts:
    words = process_tweet_text(tw)
    # Form sentences from the processed words
    cleaned_tweet = " ".join(w for w in words if len(w) > 2 and w.isalpha())
    cleaned_tweets.append(cleaned_tweet)
df['CleanTweetText'] = cleaned_tweets
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True, ngram_range=(1,3))
tfidf_matrix = tfidf_vectorizer.fit_transform(cleaned_tweets)
feature_names = tfidf_vectorizer.get_feature_names() # num phrases
from sklearn.metrics.pairwise import cosine_similarity
dist = 1 - cosine_similarity(tfidf_matrix)
print(dist)
from sklearn.cluster import KMeans
num_clusters = 3
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)
clusters = km.labels_.tolist()
df['ClusterID'] = clusters
print(df['ClusterID'].value_counts())
We obtained 3 clusters:
The top words used in each cluster can be computed as follows:
#sort cluster centers by proximity to centroid
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(num_clusters):
    print("Cluster {}: Words:".format(i))
    for ind in order_centroids[i, :10]:
        print(' %s' % feature_names[ind])
The result is:
The Doc2Vec methodology available in the gensim package is used to vectorize the tweets, as follows:
import gensim
from gensim.models.doc2vec import TaggedDocument
taggeddocs = []
tag2tweetmap = {}
for index, i in enumerate(cleaned_tweets):
    if len(i) > 2:  # skip (nearly) empty tweets
        tag = u'SENT_{:d}'.format(index)
        sentence = TaggedDocument(words=gensim.utils.to_unicode(i).split(), tags=[tag])
        tag2tweetmap[tag] = i
        taggeddocs.append(sentence)
model = gensim.models.Doc2Vec(taggeddocs, dm=0, alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(60):
    if epoch % 20 == 0:
        print('Training epoch: %s' % epoch)
    model.train(taggeddocs)
    model.alpha -= 0.002            # decrease the learning rate
    model.min_alpha = model.alpha   # fix the learning rate, no decay
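Note that the Doc2Vec call above uses the gensim API current at the time of writing (e.g. the 'size' argument). A rough equivalent for newer gensim releases (4.x), where the argument is 'vector_size' and train() needs explicit counts, might look like this sketch:
# Sketch for gensim >= 4.0 (argument and attribute names differ from the code above)
model = gensim.models.Doc2Vec(vector_size=20, dm=0, alpha=0.025, min_alpha=0.025, min_count=0)
model.build_vocab(taggeddocs)
model.train(taggeddocs, total_examples=model.corpus_count, epochs=60)
# In gensim 4.x the learned tweet vectors live in model.dv (instead of model.docvecs)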
Once the trained model is ready, the tweet vectors available in the model can be clustered using K-means.
from sklearn.cluster import KMeans
dataSet = model.docvecs.doctag_syn0   # matrix of learned tweet (document) vectors
kmeansClustering = KMeans(n_clusters=6)
centroidIndx = kmeansClustering.fit_predict(dataSet)
topic2wordsmap = {}
for i, val in enumerate(dataSet):
    tag = model.docvecs.index_to_doctag(i)   # map the row index back to its tweet tag
    topic = centroidIndx[i]
    if topic not in topic2wordsmap:
        topic2wordsmap[topic] = []   # first tweet seen for this topic
    for w in tag2tweetmap[tag].split():
        topic2wordsmap[topic].append(w)
for i in topic2wordsmap:
    words = topic2wordsmap[i]
    print("Topic {} has words {}".format(i, words[:5]))
dataSet.shape
This article shows how to implement the Capstone Chennai Floods study using Python and its libraries. With this tutorial, one can get an introduction to various Natural Language Processing (NLP) workflows such as accessing Twitter data, pre-processing text, exploration, clustering and topic modeling.