by Tony Guo
Last class we covered some Python basics and worked with structured data. This time, we are going to work with unstructured data – texts! The terms “Text Mining” and “Text Analytics” are often used interchangeably and refer to the extraction of data or information from text. The overarching goal is, essentially, to turn unstructured text into structured data for analysis, via application of natural language processing (NLP) and analytical methods.
Python is a great language for text processing. The NLTK (Natural Language Toolkit) package offers a practical way to build Python programs that work with human language data.
Whether it is personal opinions and updates from Twitter and Facebook or formal reviews from Yelp and Amazon, social media is one of the biggest sources of text data. This information is mostly public, and there are different ways for us to retrieve it. In fact, an important step in text mining on social media is collecting and parsing the text.
In this class, we are going to “mine” our data from Twitter.
Twitter APIs allow users to interact with Twitter data; responses are available in JSON format. There are two kinds of Twitter APIs:
1) The REST API provides access to read and write Twitter data: author a new Tweet, read author profile and follower data, and more. This is ideal for one-off Twitter searches or posts.
2) The Streaming API gives developers low-latency access to Twitter’s global stream of Tweet data. This is ideal for monitoring or processing Tweets continuously in real time.
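Because responses arrive as JSON, Python's built-in json module is enough to pull fields out of a tweet. The payload below is a hand-written, trimmed-down sample for illustration only, not real API output; its field names (text, lang, user, screen_name, followers_count) follow the layout of a standard REST API tweet object.

```python
import json

# Hand-written sample shaped like a REST API tweet object (illustrative only).
sample_response = '''
{
  "created_at": "Wed Oct 10 20:19:24 +0000 2018",
  "id": 1050118621198921728,
  "text": "Learning text mining with Python!",
  "lang": "en",
  "user": {"screen_name": "example_user", "followers_count": 42}
}
'''

tweet = json.loads(sample_response)  # JSON text -> nested Python dict
author = tweet["user"]["screen_name"]
followers = tweet["user"]["followers_count"]
print(author, followers, tweet["text"])
```

Once parsed, a tweet is an ordinary nested dictionary, so the rest of the work is plain Python.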
Both APIs require OAuth for authorized access. This means that in order to collect tweets for analysis, we will need to create an account on the Twitter developer site and generate credentials for use with the Twitter API. Please follow the steps below and create a developer account before class. You will need it for the in-class exercise.
To create a Twitter developer account:
1. Go to https://dev.twitter.com/user/login and log in with your Twitter user name and password. If you do not yet have a Twitter account, click the Sign up link that appears under the Login button.
2. If you have not yet used the Twitter developer site, you’ll be prompted to authorize the site to use your account. Click Authorize app to continue.
3. Go to the Twitter applications page at https://dev.twitter.com/apps and click Create a new application.
4. Follow the on-screen instructions. For the application Name, Description, and Website, you can specify any text — you’re simply generating credentials to use with this tutorial, rather than creating a real application.
5. Twitter displays the details page for your new application. Click the Key and Access Tokens tab to collect your Twitter developer credentials. You’ll see a Consumer key and Consumer secret. Make a note of these values; you’ll need them for the in-class exercise. You may want to store your credentials in a text file.
6. At the bottom of the page click Create my access token. Make a note of the Access token and Access token secret values that appear, or add them to the text file you created in the preceding step.
After completing the 6 steps, you should have 4 credential keys. For example, my keys were:
consumer_key = "6D3ipSSZkhrWlD4KTrYKbHqZx"
consumer_secret = "iTUIWNQdPApazt6xQqwQNhfT9caGhZB9pIdxOFeKVQGoPmIIi4"
access_token = "18549299-tSVu3Avf816WINeoqD5F5b0hEFptlQLHK4yltMIaj"
access_token_secret = "LoO0mtcabeVEjCkEKaiG9I1z5Kco5DnMBPrHbREA0A6pi"
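If you keep your credentials in a text file, as suggested in step 5, a few lines of Python can read them back so they never have to be pasted into your script. This is a minimal sketch: the file name twitter_credentials.txt and the helper load_credentials are made up for illustration, and the file is assumed to use the same name = "value" layout shown above. The demonstration writes placeholder values to a temporary file so that it runs anywhere.

```python
import os
import tempfile


def load_credentials(path):
    """Read name = "value" lines from a text file into a dict."""
    credentials = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            name, value = line.split("=", 1)
            credentials[name.strip()] = value.strip().strip('"')
    return credentials


# Demonstration with placeholder values written to a temporary file.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "twitter_credentials.txt")
    with open(demo_path, "w", encoding="utf-8") as handle:
        handle.write('consumer_key = "YOUR_CONSUMER_KEY"\n')
        handle.write('consumer_secret = "YOUR_CONSUMER_SECRET"\n')
    demo = load_credentials(demo_path)

print(sorted(demo))
```

Keeping the keys in a separate file also makes it easier to share your script without sharing your credentials.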
To connect to the Twitter APIs in Python, we are going to use a package called tweepy. You can read its documentation if you want to get familiar with tweepy before the class.
In-Class Assignment
Please complete the following as part of your work for this class. There are 8 questions and 1 optional bonus question.
Submit the assignment answers as a well-commented Python script named: TweetMining-[YourStudentId].py
Question 1 (Connecting to Twitter)
Connect to the Twitter API with your credential keys using the Tweepy package. Do a search on Twitter with a keyword of your choice.
Work around Tweepy’s limitations (i.e. your code should be able to return more than 100 tweets). Try to collect at least 1,000 tweets, excluding retweets and non-English tweets.
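One common way past the 100-tweet limit is to request successive pages, each time asking only for tweets older than the oldest one already seen. The sketch below shows that loop together with the retweet and language filters. It does not call Twitter: fetch_page is a hypothetical stand-in for your real search call, and a fake in-memory timeline is used so the example runs offline. The retweeted_status and lang fields are the ones found on a standard tweet object.

```python
def collect_tweets(fetch_page, target=1000, page_size=100):
    """Page backwards through search results until `target` tweets are kept.

    fetch_page(max_id, count) is a hypothetical callable returning a list of
    tweet dicts with ids <= max_id (or the newest tweets when max_id is None).
    """
    kept, max_id = [], None
    while len(kept) < target:
        page = fetch_page(max_id, page_size)
        if not page:
            break  # no older results left
        for tweet in page:
            is_retweet = "retweeted_status" in tweet
            if not is_retweet and tweet.get("lang") == "en":
                kept.append(tweet)
        # Next request: only tweets strictly older than the oldest one seen.
        max_id = min(tweet["id"] for tweet in page) - 1
    return kept[:target]


# Fake data source standing in for the Twitter search call (illustration only).
fake_timeline = [
    {"id": i, "lang": "en" if i % 5 else "fr", "text": "tweet %d" % i}
    for i in range(2500, 0, -1)
]
for tweet in fake_timeline[::7]:
    tweet["retweeted_status"] = {}  # mark every 7th tweet as a retweet


def fake_fetch_page(max_id, count):
    older = [t for t in fake_timeline if max_id is None or t["id"] <= max_id]
    return older[:count]


tweets = collect_tweets(fake_fetch_page)
print(len(tweets))
```

In your own script, the fake fetcher would be replaced by a function that wraps the Tweepy search call; the paging and filtering logic stays the same.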
Question 2 (Saving results as a dataframe)
Save your results as a dataframe. Your columns should include at least the following:
User Name, Number of Followers, Date and Time of Tweet, and the actual Tweet.
You can create more columns to save fields of your interest.
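A convenient intermediate step is to flatten each tweet into one dictionary per row, with one key per column. The sketch below assumes tweets are available as dictionaries shaped like the standard tweet object; the sample data and the column names (UserName, Followers, CreatedAt, Text) are made up for illustration.

```python
from datetime import datetime


def tweet_to_row(tweet):
    """Flatten one tweet dict into the columns asked for above."""
    return {
        "UserName": tweet["user"]["screen_name"],
        "Followers": tweet["user"]["followers_count"],
        # created_at looks like "Wed Oct 10 20:19:24 +0000 2018"
        "CreatedAt": datetime.strptime(
            tweet["created_at"], "%a %b %d %H:%M:%S %z %Y"
        ),
        "Text": tweet["text"],
    }


# Hand-written sample tweets (illustrative only).
sample_tweets = [
    {
        "created_at": "Wed Oct 10 20:19:24 +0000 2018",
        "text": "Learning text mining with Python!",
        "user": {"screen_name": "example_user", "followers_count": 42},
    },
    {
        "created_at": "Thu Oct 11 08:05:00 +0000 2018",
        "text": "NLTK makes tokenizing easy",
        "user": {"screen_name": "another_user", "followers_count": 7},
    },
]

rows = [tweet_to_row(tweet) for tweet in sample_tweets]
# A list of dicts like this can be passed to pandas.DataFrame(rows),
# which creates one column per key.
print(rows[0])
```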
Question 3 (Text preprocessing – Link, User name, and Hashtag removal)
Write a function that takes the tweet text as input, and strips any links, user names, and hashtags used in the tweet.
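Regular expressions are a natural fit for this kind of cleanup. The following is one possible sketch, not the only valid answer: the function name strip_entities and the three patterns are illustrative choices, and it assumes that "stripping" a hashtag means removing the whole tag rather than just the # sign.

```python
import re

URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")  # links
MENTION_PATTERN = re.compile(r"@\w+")                # user names
HASHTAG_PATTERN = re.compile(r"#\w+")                # hashtags


def strip_entities(text):
    """Remove links, @user names, and #hashtags, then tidy the whitespace."""
    for pattern in (URL_PATTERN, MENTION_PATTERN, HASHTAG_PATTERN):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


cleaned = strip_entities(
    "Thanks @nltk_org! Great intro to #NLP https://t.co/abc123 :)"
)
print(cleaned)
```

Links are removed first so that a # or @ inside a URL is not treated as a hashtag or user name.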
Question 4 (Text preprocessing – Punctuation and Symbol removal)
Modify the function in Question 3 so that it also converts the text to lowercase and removes punctuation and any symbol in the list of symbols provided below.
symbols = [",", "~", "`", "!", "%", "$", "^", "&", "*", "(", ")", "+", "=", "{", "}", "[", "]", "|", "?"]
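One way to approach this step is with a translation table, which deletes a whole set of characters in a single pass. The sketch below is shown as a standalone function (remove_symbols is an illustrative name) and assumes that "punctuation" means the ASCII punctuation in Python's string.punctuation.

```python
import string

symbols = [",", "~", "`", "!", "%", "$", "^", "&", "*", "(", ")", "+", "=",
           "{", "}", "[", "]", "|", "?"]

# Every character to delete: ASCII punctuation plus the listed symbols.
chars_to_remove = set(string.punctuation) | set(symbols)
removal_table = str.maketrans("", "", "".join(chars_to_remove))


def remove_symbols(text):
    """Lowercase the text, delete unwanted characters, tidy the whitespace."""
    return " ".join(text.lower().translate(removal_table).split())


result = remove_symbols("Wow!!! Python (and NLTK) = 100% fun, right?")
print(result)
```

Because the apostrophe counts as punctuation here, a word like "don't" becomes "dont", which is the form used in the stopword list in Question 5.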
Question 5 (Text preprocessing – Stopword removal)
Modify the function in Question 4, so that it also removes any stopword in the list of common stopwords provided below.
stop_words = ["a", "about", "above", "after", "again", "against", "all", "am", "an", "and", "any", "are", "arent", "as", "at", "b", "be", "because", "been", "before", "being", "below", "between", "both", "but", "by", "can", "cant", "cannot", "could", "couldnt", "did", "didnt", "didnot", "do", "does", "doesnt", "doesnot", "doing", "dont", "down", "during", "each", "ever", "few", "for", "from", "further", "get", "getting", "got", "gotten", "had", "hadnt", "has", "hasnt", "have", "havent", "having", "he", "her", "here", "heres", "hers", "herself", "him", "himself", "his", "how", "hows", "i", "ill", "im", "ive", "if", "in", "into", "is", "isnt", "it", "its", "itself", "lets", "me", "more", "most", "much", "my", "myself", "no", "nor", "not", "never", "of", "off", "on", "once", "only", "or", "other", "ought", "our", "ours", "ourselves", "out", "over", "own", "really", "r", "rt", "same", "she", "shes", "should", "shouldnt", "so", "some", "such", "than", "that", "thats", "the", "their", "theirs", "them", "themselves", "then", "there", "theres", "these", "they", "theyre", "think", "thinking", "this", "those", "thought", "through", "to", "too", "under", "until", "up", "very", "was", "wasnt", "wasnot", "we", "were", "werent", "what", "whats", "when", "whens", "where", "wheres", "which", "while", "who", "whos", "whom", "why", "whys", "will", "with", "wont", "would", "wouldnt", "you", "youre", "your", "yours", "yourself", "yourselves"]
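Stopword removal comes down to splitting the text into words and keeping only those that are not in the list. The sketch below uses a short excerpt of the list to stay readable and assumes the text has already been lowercased and stripped of punctuation; remove_stop_words is an illustrative name.

```python
# Short excerpt of the full stopword list, stored as a set for fast lookups.
stop_words = {"a", "about", "is", "it", "the", "to", "very", "rt"}


def remove_stop_words(text, stop_words):
    """Keep only the words that are not in the stopword collection."""
    return " ".join(word for word in text.split() if word not in stop_words)


filtered = remove_stop_words("rt it is very easy to mine the tweets", stop_words)
print(filtered)
```

Converting the list to a set is optional, but membership tests on a set stay fast even with a long stopword list and thousands of tweets.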
Question 6 (Text preprocessing – Stemming)
Modify the function in Question 5, so that it also stems the words.
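Stemming reduces related word forms to a common root, so that "counting" and "counts" are tallied together. NLTK ships ready-made stemmers such as PorterStemmer (from nltk.stem import PorterStemmer), which is the better choice for the assignment. To stay free of dependencies, the sketch below uses a deliberately naive suffix-stripping rule instead. It is not the Porter algorithm and makes mistakes that Porter avoids; it is only meant to show the idea.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order; illustrative only


def naive_stem(word):
    """Strip one common suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_text(text):
    """Apply the naive stemmer to every whitespace-separated word."""
    return " ".join(naive_stem(word) for word in text.split())


stemmed = stem_text("mining tweets and counting words")
print(stemmed)
```

Note how "mining" is cut down to "min": crude rules like these over-strip, which is why a proper stemmer is worth using in practice.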
Question 7 (Dataframe Modification)
Run your function on all the tweets you collected and save the processed tweets in a new column named “ProcessedText”.
Question 8 (Word Frequencies)
Use the NLTK package to compute the word frequencies in the processed texts. Print out the 20 most frequent words and the number of times each appears.
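NLTK's FreqDist class is built on Python's collections.Counter and offers the same most_common method, so the counting logic can be sketched with the standard library alone. The processed texts below are made-up examples standing in for your “ProcessedText” column.

```python
from collections import Counter

# Made-up processed texts standing in for the ProcessedText column.
processed_texts = [
    "python text mine",
    "text mine twitter",
    "python twitter text",
]

# Count every word across all texts.
word_counts = Counter(
    word for text in processed_texts for word in text.split()
)

# Print up to 20 of the most frequent words with their counts.
for word, count in word_counts.most_common(20):
    print(word, count)
```

For the assignment itself, nltk.FreqDist can be fed the same stream of words and queried the same way.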
Question 9 (bonus)
Make a visualization of the processed text (e.g. a word cloud), or some other plot of the variables you saved in your dataframe, using what you learned in the previous Python session. Write down some insights about your keyword that you discovered on Twitter.