I applied the Naïve Bayes Classifier method previously described to the Amazon food review data. The results were encouraging but unfortunately very slow to come by: the algorithm took about 19 hours to run for the first set of results below, and 43 hours for the second set (both runs used only 5,000 rows). The benefit of this approach, compared to the trial-and-error method of checking for specific words and their counts across different rating levels, is that the algorithm detects words with predictive power for you; the presence of ALL the words is considered rather than just what we can come up with from a bit of surface-level digging. Continue reading
Category Archives: Text analytics
A Review: Social Media Mining an Introduction
A great read on social media mining and text analytics is readily available online under the title Social Media Mining: An Introduction. The book is written by Reza Zafarani, Mohammad Ali Abbasi, and Huan Liu and published by Cambridge University Press; the online draft is dated April 20, 2014. A link to the book is found at: http://dmml.asu.edu/smm/SMM.pdf Continue reading
Exploratory Analysis with Text Mining
We wanted to explore whether there is any correlation between the review score and the actual comment in the Amazon Food Reviews. It will be interesting to see how accurately the review scores reflect what the users actually think about the product.
We first used SQL to extract the “review_text” for each review score. Then, using Python, symbols were removed and the frequency of each word was collected. We looked through the high frequency words and chose some meaningful words to further investigate in SQL. With SQL, we counted the instances of those meaningful key words in the “review_text”. We extracted the results from SQL to Excel to combine the similar words/categories.
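The Python step described above (removing symbols, then counting word frequencies) might look roughly like the sketch below. This is not the script we actually ran: the function name `word_frequencies` and the sample review texts are made up for illustration, and the cleaning rule (keep only letters, digits, and whitespace) is an assumption.

```python
import re
from collections import Counter


def word_frequencies(reviews):
    """Count how often each word occurs across a list of review texts."""
    counts = Counter()
    for text in reviews:
        # Lowercase, then replace every character that is not a letter,
        # digit, or whitespace with a space (assumed cleaning rule).
        cleaned = re.sub(r"[^a-z0-9\s]", " ", text.lower())
        counts.update(cleaned.split())
    return counts


# Hypothetical review texts, not taken from the dataset.
sample_reviews = [
    "Great taste, great price!",
    "Terrible... the taste was awful.",
]

freq = word_frequencies(sample_reviews)
print(freq.most_common(3))
```

Note that this simple rule also splits contractions ("don't" becomes "don" and "t"), so a real run would probably need a more careful tokenizer.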
Naïve Bayes Classifier for Document Classification
Naïve Bayes Classifiers are a family of simple probabilistic classifiers which apply Bayes’ theorem with a strong (naïve) assumption of independence between features. In the context of text analytics, the assumption is that the occurrences of words in a ‘document’ (article, email, post, tweet, etc.) are independent of each other given the document’s class. Although this is clearly a very naïve assumption, Naïve Bayes has been shown to produce very strong results. One way of compensating for its shortcoming is to study bigrams or trigrams (sets of two or three consecutive words at a time), which helps to capture some of the dependence in the text.
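To make the idea concrete, here is a minimal sketch of a multinomial Naïve Bayes classifier with add-one (Laplace) smoothing and an optional bigram feature. It is a generic textbook version, not necessarily the exact variant used in our posts; the function names (`tokenize`, `train`, `predict`) and the four training documents are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text, use_bigrams=False):
    """Split text into lowercase words, optionally adding bigram tokens."""
    words = text.lower().split()
    if use_bigrams:
        # Pair each word with its successor, e.g. "not good" -> "not_good".
        return words + [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words


def train(docs, use_bigrams=False):
    """docs is a list of (text, label) pairs; returns the fitted counts."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text, use_bigrams)
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def predict(text, model, use_bigrams=False):
    """Return the label with the highest log posterior score."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        # Log prior: share of training documents carrying this label.
        score = math.log(class_counts[label] / total_docs)
        total_tokens = sum(token_counts[label].values())
        for tok in tokenize(text, use_bigrams):
            if tok not in vocab:
                continue  # ignore tokens never seen in training
            # Independence assumption: add one log-likelihood per token,
            # with add-one smoothing so unseen pairs never give log(0).
            count = token_counts[label][tok]
            score += math.log((count + 1) / (total_tokens + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training documents, not taken from the Amazon data.
training = [
    ("good tasty snack", "pos"),
    ("really good coffee", "pos"),
    ("not good at all", "neg"),
    ("bad stale taste", "neg"),
]

unigram_model = train(training)
bigram_model = train(training, use_bigrams=True)
print(predict("good coffee", unigram_model))
print(predict("not good", bigram_model, use_bigrams=True))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together, which matters once documents are longer than a few words.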
Naïve Bayes essentially works as follows: Continue reading
Exploring Reviews from Descriptive to Mining
Our analysis began by using SQL to count how many reviews there were for each review score, 1 through 5, and to find the proportion of reviews that fell into each category. Below is a summary of the number of comments for each rating. The total number of comments in the system is 568,454. Continue reading
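A count-and-proportion query of this kind can be sketched with Python's built-in `sqlite3` module. The table name `reviews`, the column name `score`, and the four sample rows are assumptions made for illustration; the real database and schema may differ.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reviews (score INTEGER, review_text TEXT)")
conn.executemany(
    "INSERT INTO reviews (score, review_text) VALUES (?, ?)",
    [
        (5, "Loved it"),
        (5, "Would buy again"),
        (4, "Pretty good"),
        (1, "Stale on arrival"),
    ],
)

# Count reviews per score and divide by the overall total.
# Multiplying by 1.0 forces floating-point division in SQLite.
rows = conn.execute(
    """
    SELECT score,
           COUNT(*) AS n,
           COUNT(*) * 1.0 / (SELECT COUNT(*) FROM reviews) AS proportion
    FROM reviews
    GROUP BY score
    ORDER BY score
    """
).fetchall()

for score, n, proportion in rows:
    print(score, n, proportion)

conn.close()
```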
Amazon Foods Reviews Activity – by Tank Brigade
Following our previous post on how to create a word cloud in R, we have decided to try those techniques out. By inspecting the data, we observed that some words seemed to appear more often in reviews with higher scores, while some others were more likely to appear in reviews with lower scores. Our initial hypothesis is that some words are positively correlated with good reviews and some words are negatively correlated with good reviews. Continue reading
Word Play
Our analysis began by getting a feel for the data supplied from Web data: Amazon Fine Foods reviews (https://snap.stanford.edu/data/web-FineFoods.html). This meant filtering the file by including simple positive words and excluding negative and neutral words and phrases. Below is a list of the words that were included in each of these categories. Continue reading
The Text Mining ‘tm’ Package in R
Tank Group – Haider Shah, Tony Guo and Chris Pang
To perform text mining in R, there is a useful package called ‘tm’ which provides several functions for text handling, processing and management. The package is built around the concept of a ‘corpus’, a collection of text documents to operate upon. Text can be stored either in memory in R via a Volatile Corpus or in an external data store, such as a database, via a Permanent Corpus.
Example: Building a Word Cloud from Twitter Feeds
Continue reading
Let’s talk about Text Analytics
Text analytics has been attracting increasing attention from the OR-analytics community as the amount of information stored as textual data grows. Just think about the amount of data stored in emails, news articles and social media, not to mention contact-center notes, surveys, feedback forms, and so forth. Some estimate that the text analytics market will grow between 25% and 40% during the next 5 years.[1][2] But let’s make sure we are talking about the same thing when we talk about text analytics.
So, what is text analytics? Continue reading