Our analysis began with SQL queries that counted the number of reviews for each review score from 1 through 5 and computed the proportion of reviews falling into each category. Below is a summary of the number of comments for each rating; the total number of comments in the system is 568454.
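A minimal sketch of that per-score count, assuming the same table and column names used in the keyword query later in this section (Bootcamp_2015.dbo.Foods_reviews with a review_score column):
-- Count reviews per score (1-5) and each score's share of all reviews
SELECT review_score,
       COUNT(*) AS review_count,
       CAST(COUNT(*) AS float) / SUM(COUNT(*)) OVER () AS proportion
FROM Bootcamp_2015.dbo.Foods_reviews
GROUP BY review_score
ORDER BY review_score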
We also wanted to know the number of unique users and products in the system, and found 74257 products and 256056 users in total. On average, each product has 7.65513 comments and each user wrote 2.22001 comments. This could help in deciding whether a targeted marketing campaign specialized for a specific product or customer profile is worthwhile.
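A query along the following lines produces those figures; the product_id and user_id column names are assumptions, since the actual column names are not shown in this section:
-- Distinct products and users, plus average comments per product and per user
SELECT COUNT(DISTINCT product_id) AS num_products,
       COUNT(DISTINCT user_id) AS num_users,
       CAST(COUNT(*) AS float) / COUNT(DISTINCT product_id) AS avg_comments_per_product,
       CAST(COUNT(*) AS float) / COUNT(DISTINCT user_id) AS avg_comments_per_user
FROM Bootcamp_2015.dbo.Foods_reviews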
We wanted to identify which reviews people felt were the most helpful, because these comments may provide a good training set of words for classifying specific review scores. We ranked the comments with more than 95% approval in their helpfulness ratings and at least 100 people voting on whether or not they found the review helpful. As a starting point for the most basic set of training words, we considered only the review summary field of the data.
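A sketch of that helpfulness filter, assuming the votes are stored as a numerator (people who found the review helpful) and a denominator (total people who voted); the helpfulness_numerator, helpfulness_denominator, and review_summary column names are illustrative:
-- Review summaries with >95% helpfulness approval from at least 100 voters
SELECT review_summary,
       review_score,
       CAST(helpfulness_numerator AS float) / helpfulness_denominator AS helpfulness_ratio
FROM Bootcamp_2015.dbo.Foods_reviews
WHERE helpfulness_denominator >= 100
  AND CAST(helpfulness_numerator AS float) / helpfulness_denominator > 0.95
ORDER BY helpfulness_ratio DESC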
We then chose the following key words from the above comments for further analysis. We counted the comments containing each key word and computed the average rating of those comments. This gives a sense of the polarity of each word and provides a basis for future analysis.
The code we used is:
-- Number of comments containing the key word and their average score
SELECT COUNT(review_text) AS num_comments,
       AVG(CAST(review_score AS float)) AS avg_score
FROM Bootcamp_2015.dbo.Foods_reviews
WHERE review_text LIKE '%great%'
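Queries like this one can also be combined into a single pass over a keyword list; the sketch below uses a table value constructor, and the keywords other than 'great' are placeholders rather than the actual words we selected:
-- One pass over several key words: comment counts and average scores per word
SELECT k.keyword,
       COUNT(r.review_text) AS num_comments,
       AVG(CAST(r.review_score AS float)) AS avg_score
FROM Bootcamp_2015.dbo.Foods_reviews r
JOIN (VALUES ('great'), ('terrible'), ('delicious')) AS k(keyword)
  ON r.review_text LIKE '%' + k.keyword + '%'
GROUP BY k.keyword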
Woes with Python:
There were numerous traps with Python that we fell into. In particular, many of the modules and toolkits available for text analytics do not yet support Python 3.4, so it is advisable to avoid Python 3.4 for this purpose until more modules and platforms support it. As for which toolkits and modules appeared to be most popular, NLTK (the Natural Language Toolkit), NumPy, and TextBlob are all highly rated and, from the code we observed, relatively user friendly. [1][2][3]
When downloading NLTK, after several attempts to make it run properly, we found that the standard NumPy installer supports only a 32-bit binary installation. After some research we were able to find a 64-bit binary build of NumPy, but it is only available for Python 2.7, which goes back to our point above. [4]
While using TextBlob, we found it necessary to have the NLTK package installed; however, we were unable to get multiple modules to load in a single Python shell session, i.e. after running the setup for NLTK we could not simultaneously load another module in the same shell. Hence, if we loaded the setup needed to import TextBlob, the shell reported that the NLTK module could not be found, and if we loaded the setup needed to import NLTK we could not import TextBlob. We were unable to resolve this issue, but we believe it stems from using Python 3.4, for which these modules are not fully supported.
[1] http://www.nltk.org/
[2] http://www.numpy.org/
[3] http://textblob.readthedocs.org/en/dev/
[4] http://www.lfd.uci.edu/~gohlke/pythonlibs/#numpy