Word Play

Our analysis began by getting a feel for the data in the Web data: Amazon Fine Foods reviews dataset (https://snap.stanford.edu/data/web-FineFoods.html). To do this, we filtered the file, keeping entries that contained simple positive words and dropping those that contained negative words or neutral phrases. The words and phrases in each category are listed below.

Positive: Good, excellent, awesome, love, great, perfect, unbeatable, favorite, yum, wonderful

Negative: Gross, waste, yuck, disgusting, hate, poor, horrible/horrid (condensed into the rough root ‘horri’), worst, bad, mistake, awful, inconvenient, complain, not (with a space after it, so that words that merely contain ‘not’ are not matched), love star(bucks), love suma(tra)

Neutral Phrases: not good, not great, not the best, least favorite, not my favorite, not the worst
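The filter described above can be sketched roughly as follows. This is a minimal illustration, not our exact script: it assumes case-insensitive substring matching, and the names POSITIVE, NEGATIVE, NEUTRAL, contains_any, and is_clearly_positive are made up for this example.

```python
# Word lists copied from the categories above, lowercased.
POSITIVE = ["good", "excellent", "awesome", "love", "great", "perfect",
            "unbeatable", "favorite", "yum", "wonderful"]
NEGATIVE = ["gross", "waste", "yuck", "disgusting", "hate", "poor", "horri",
            "worst", "bad", "mistake", "awful", "inconvenient", "complain",
            "not ",  # trailing space so that "nothing" is not matched
            "love star", "love suma"]
NEUTRAL = ["not good", "not great", "not the best", "least favorite",
           "not my favorite", "not the worst"]


def contains_any(text, terms):
    """Return True if any term occurs in text (assumed: substring match, case-insensitive)."""
    lowered = text.lower()
    return any(term in lowered for term in terms)


def is_clearly_positive(text):
    """Keep text that has a positive word and no negative word or neutral phrase."""
    if contains_any(text, NEGATIVE) or contains_any(text, NEUTRAL):
        return False
    return contains_any(text, POSITIVE)
```

Plain substring matching is a blunt tool: ‘bad’ also matches inside ‘badge’, for instance, which is one reason the word lists needed repeated refinement.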

We also examined how often specific keywords appear at each rating. Very few people say ‘delicious’ or ‘yummy’ when they dislike a product: of the 52,268 entries with a review rating of 1, only 3.19% included either word. For a rating of 2 the figure is 3.95%, and it rises to 10.14% for a rating of 5. This suggests that reviewers are less likely to say ‘delicious’ or ‘yummy’ when they dislike a product.

In a similar way, we studied the words ‘good’, ‘bad’, ‘expensive’, ‘cheap’, and ‘at all’. Usage of ‘good’ and ‘bad’ was fairly even across all ratings, because these words can be used in a positive, negative, or neutral way; we tried to mitigate this variation by excluding misleading phrases that contain them. The words ‘cheap’ and ‘expensive’ also show up relatively evenly across all ratings. However, the statements that include them vary more than those that include ‘good’ or ‘bad’ and are therefore harder to account for, so we did not filter the data on these two words.
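As a rough sketch of how such a per-rating keyword share can be computed, the snippet below counts, for each rating, the fraction of entries whose text contains at least one of the given keywords. The function name keyword_share_by_rating and the toy sample are invented for illustration, and case-insensitive substring matching is an assumption.

```python
from collections import Counter


def keyword_share_by_rating(reviews, keywords):
    """Percentage of reviews per rating that contain at least one keyword.

    reviews: iterable of (score, text) pairs.
    """
    totals = Counter()
    hits = Counter()
    for score, text in reviews:
        totals[score] += 1
        lowered = text.lower()
        if any(keyword in lowered for keyword in keywords):
            hits[score] += 1
    return {score: 100.0 * hits[score] / totals[score] for score in sorted(totals)}


# Toy data, made up for this example (not taken from the dataset).
sample = [
    (1, "Stale and bland."),
    (1, "Not delicious at all."),
    (5, "Delicious!"),
    (5, "So yummy, will buy again."),
    (5, "Arrived quickly."),
    (5, "Yummy snack."),
]
shares = keyword_share_by_rating(sample, ["delicious", "yummy"])
```

On the real data, the same calculation for rating 1 is 1,669 matching entries out of 52,268, or about 3.19%.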

The table below shows the total number of reviews at each rating, along with the count and percentage of reviews at that rating that contain each keyword.

Total reviews: 568,447

| Keyword | Review Score = 1 | Review Score = 2 | Review Score = 3 | Review Score = 4 | Review Score = 5 | All scores |
| --- | --- | --- | --- | --- | --- | --- |
| TOTAL | 52,268 | 29,768 | 42,639 | 80,654 | 363,118 | 568,447 |
| good | 10,990 (21.03%) | 8,045 (27.03%) | 13,688 (32.10%) | 28,777 (35.68%) | 96,304 (26.52%) | 157,804 |
| bad | 5,356 (10.25%) | 2,674 (8.98%) | 3,936 (9.23%) | 4,674 (5.80%) | 12,524 (3.45%) | 29,164 |
| expensive | 2,024 (3.87%) | 1,122 (3.77%) | 1,928 (4.52%) | 3,965 (4.92%) | 13,420 (3.70%) | 22,459 |
| cheap | 2,468 (4.72%) | 1,241 (4.17%) | 1,838 (4.31%) | 3,196 (3.96%) | 13,154 (3.62%) | 21,897 |
| delicious/yummy | 1,669 (3.19%) | 1,177 (3.95%) | 1,947 (4.57%) | 5,856 (7.26%) | 36,811 (10.14%) | 47,460 |
| at all | 3,351 (6.41%) | 2,059 (6.92%) | 1,922 (4.51%) | 3,096 (3.84%) | 10,903 (3.00%) | 21,331 |

First, we searched for highly rated entries (4.0 or 5.0) and low-rated entries (1.0 or 2.0), keeping an entry if it contained one of the positive sentiments and excluding it if it contained a negative or neutral sentiment. We refined this process as more ‘good’ and ‘bad’ words were added. Along the way, however, a more interesting phenomenon came to our attention: many low-rated entries included positive sentiments. Some examples:

| Review Score | Sentiment (review_summary or review_text) |
| --- | --- |
| 1.0 | Love the tea |
| 1.0 | Best Buy! |
| 1.0 | Heaven! |
| 1.0 | My 10 year old loves this |
| 1.0 | Good taste and quality |
| 2.0 | Wonderful tea, cute tin….Wow! The tea is actually really good. |

The filters were applied to both the review summary and the review text so that none of our positive, negative, or neutral statements were overlooked. We found that 51.48% of entries rated 2.0 or lower actually contained quite positive or mixed sentiments. The cause is unknown. Perhaps some reviewers treat 1.0 as the best possible rating, or perhaps 1.0 is the default rating value. These ‘positive’ sentiments could also simply be sarcasm, which would explain the low rating. Note that many mixed reviews threw off our algorithm, as did words used in both positive and negative ways. We may refine this method to find a more accurate value if time permits.
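One way to compute such a share is sketched below. It is only an illustration under stated assumptions: entries are dictionaries with score, review_summary, and review_text keys, the two text fields are joined before filtering, and looks_positive stands in for whatever positive/negative filter is used. The function name, the stand-in predicate, and the toy rows are all hypothetical.

```python
def share_low_rated_but_positive(entries, looks_positive, max_score=2.0):
    """Percentage of entries rated max_score or lower whose combined text passes the filter."""
    low_rated = [e for e in entries if e["score"] <= max_score]
    if not low_rated:
        return 0.0
    flagged = [
        e for e in low_rated
        # Assumption: summary and text are checked together as one string.
        if looks_positive(e["review_summary"] + " " + e["review_text"])
    ]
    return 100.0 * len(flagged) / len(low_rated)


def toy_filter(text):
    """Stand-in predicate: contains 'love' and does not contain 'not '."""
    lowered = text.lower()
    return "love" in lowered and "not " not in lowered


# Toy rows, made up for this example (not taken from the dataset).
entries = [
    {"score": 1.0, "review_summary": "Love the tea", "review_text": "Smooth and fragrant."},
    {"score": 1.0, "review_summary": "Stale", "review_text": "Arrived past its date."},
    {"score": 2.0, "review_summary": "Okay", "review_text": "I would not buy it again, though my kids love it."},
    {"score": 2.0, "review_summary": "We love these", "review_text": "Crunchy."},
    {"score": 5.0, "review_summary": "Love it", "review_text": "Will reorder."},
]
share = share_low_rated_but_positive(entries, toy_filter)
```

Joining the two fields means a negative phrase in the text can veto a positive summary. Checking each field separately would flag more entries, so the choice affects the resulting percentage.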

Appendix

 

 

 
