Applying Naive Bayes to Text Mining

I applied the Naïve Bayes classifier method previously described to the Amazon food review data, and the results were encouraging, but unfortunately very slow to come by: the algorithm took about 19 hours to run for the first set of results below, and 43 hours for the second set (both used only 5000 rows). The benefit of this approach over the trial-and-error method of checking for specific words and their counts across different rating levels is that the algorithm detects words with predictive power for you; ALL of the words are considered, rather than just the ones we can come up with from a bit of surface-level digging.
Approach
The approach I took was to first extract the review text and review score columns from the database, then apply some preprocessing to the review text to remove problematic punctuation such as commas and quotation marks.
I then wrote a Python script that read in the data and applied a Naïve Bayes classifier from a Python package called TextBlob (http://textblob.readthedocs.org/en/dev/).
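As a sketch, that cleanup step might look like the following (the helper name and the exact character set are my own illustrative assumptions; in practice I did the cleanup with find-and-replace, as described below):

```python
import re

def clean_review(text):
    """Drop characters TextBlob cannot handle (e.g. the registered
    trademark symbol), then strip commas and quotation marks that
    would interfere with the comma-separated tuple format."""
    text = text.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"[,\"']", "", text)

print(clean_review('Great\u00ae soup, "really" tasty'))
```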
Installing TextBlob required the following steps:

  1. Download and unpack the source file from the Install page on the TextBlob website
  2. Then, open the command prompt, change directory to the TextBlob folder, and run the setup.py file
  3. This run will fail because TextBlob needs a set of “corpora” files. Python will return a line of code which you will need to run.
  4. After that, all you need is to have two lines in your Python script to run the classifier:
    a. from textblob import TextBlob
    b. from textblob.classifiers import NaiveBayesClassifier

TextBlob requires certain data structures to run correctly; I used the CSV format, which needs to be structured as follows: Dataset = [(‘Text1’, ‘Rating1’), (‘Text2’, ‘Rating2’), …], that is, a list of comma-separated tuples of strings. This data set must also be split into a training set and a test set. Loading the data in also required eliminating other characters, as TextBlob could not handle certain non-ASCII characters. Fortunately, Python returns an error with the code of the problematic character, which I then searched online to find the actual symbol as it would appear in the text file (such as the registered trademark symbol). I removed these with a simple find and replace. In the future, it might be easier to develop a script in Python (or find a premade one) that checks for all non-ASCII characters and replaces them with appropriate substitutes.
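The tuple format and the train/test split described above can be sketched as follows (the sample reviews and the 80/20 split fraction are my own illustrative choices, not the actual data):

```python
import random

def split_dataset(dataset, test_fraction=0.2, seed=0):
    """Shuffle a list of (text, rating) tuples and split it into
    a training set and a test set."""
    data = dataset[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

# Illustrative stand-in for the cleaned review data
reviews = [("Great product love it", "5"),
           ("Worst purchase ever", "1"),
           ("It was ok nothing special", "3"),
           ("Would not recommend", "2"),
           ("Arrived quickly works fine", "4")]
train, test = split_dataset(reviews)
```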

Next, TextBlob is given the training set to train the classification algorithm, and that algorithm is then given the test set to predict what it thinks the rating should be, given ALL of the words in the text. TextBlob provides an accuracy calculation, as well as a list of the top “most informative features”. I wrote a small printout function that writes the text, the actual rating, and the predicted rating to a CSV file.

Results Part 1: Original distribution of ratings (run time: 19 hours)
Preliminary results were quite good: the algorithm predicted the rating exactly about 62% of the time, and was off by 3 or 4 stars only 14% of the time – a pretty reliable result for such a naïve classification method. However, from the output I could see that this high accuracy came primarily from most of the reviews (86%) simply being classified as 5s, because the data set contained so many 5-star ratings. I addressed this in Results Part 2 below.

Looking at some of the predictions that were off by 4 stars was interesting: all but one of them were 5-star predictions that should actually have been 1s. For instance, the author of the following review was probably just confused about how the rating system works: “This a good soup. It is a different tasting tomato soup but good. The price was great. Thank You”. In another case, the algorithm was tricked by sentence structure: “I would highly recommend you do not purchase this product”. The algorithm saw “recommend” but failed to account for “do not”, because Naïve Bayes assumes words are independent.

Results 1

Some of the “most informative features” were also interesting. For instance, the algorithm was able to tell me that if the review contains the word “worst”, the odds that the product has a 1-star rating rather than a 5-star rating are 24 to 1, and that if the review contains the word “expires”, the odds of a 2-star rating rather than a 5-star rating are 30.9 to 1. The key benefit here is that I did not have to identify these words – the algorithm did it for me.
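These odds ratios come from comparing how likely each word is under the competing rating classes. A simplified sketch of the idea (the Laplace smoothing and the exact formula are my own assumptions about the underlying calculation, not taken from TextBlob’s output):

```python
def odds_ratio(count_a, total_a, count_b, total_b):
    """Ratio of the smoothed probabilities that a word appears in a
    review of class A versus class B; large values mean the word
    strongly signals class A."""
    p_a = (count_a + 1) / (total_a + 2)   # Laplace (add-one) smoothing
    p_b = (count_b + 1) / (total_b + 2)
    return p_a / p_b

# e.g. "worst" in 23 of 100 one-star reviews, 0 of 100 five-star reviews
print(odds_ratio(23, 100, 0, 100))
```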

Results Part 2: Equal distribution of ratings (run time: 43 hours)
I then made a data set with 1000 entries of each rating, still 5000 in total. The accuracy dropped to 32.6%, but the results were more impressive because the predictions were based more on the understanding of the text than on the prior probability of each rating. The results were also still fairly robust: 67% were within 1 star of the true rating, and only 17% were off by 3 or 4 stars.
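Building that balanced data set amounts to sampling the same number of reviews per rating level; a sketch (the function name and sampling details are my own illustrative choices):

```python
import random
from collections import defaultdict

def balanced_sample(dataset, per_class, seed=0):
    """Draw the same number of (text, rating) rows from each rating
    level, so the classifier cannot lean on the skewed prior toward
    5-star reviews."""
    by_rating = defaultdict(list)
    for text, rating in dataset:
        by_rating[rating].append((text, rating))
    rng = random.Random(seed)
    sample = []
    for rows in by_rating.values():
        sample.extend(rng.sample(rows, per_class))
    return sample
```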

In this case, the classifications that were off by 4 stars showed similar patterns to the first run: many came from customers who did not understand the rating system, and others from reviews that said very positive things about something other than the product in order to make a contrasting point.

Results 2

The words identified as being informative were more interesting in this case. For instance, if the review contains the word “ok”, the odds of the product getting 3 stars rather than 5 are 13.5 to 1. Thus the algorithm was able to identify “neutral” words, not just strongly positive or negative ones. The other features also make sense and are self-explanatory.

1 thought on “Applying Naive Bayes to Text Mining”

  1. Rene Lagos

    Very interesting experiment, Alex. Sounds like this is a good tool to explore the text and just see what’s in there (provided you have a comparison metric such as review score). Perhaps the run time could be reduced by repeating the experiment with a number of smaller samples.
