Following our previous post on how to create a word cloud in R, we decided to try those techniques out. While inspecting the data, we observed that some words seemed to appear more often in reviews with higher scores, while others were more likely to appear in reviews with lower scores. Our initial hypothesis is that some words are positively correlated with good reviews and some are negatively correlated with them.
The tools we used include Excel, R, SQL, and Tableau. We aggregated our data by review score: a good review had a score of 4 or 5, an okay review had a score of 3, and a bad review had a score of 1 or 2. Unfortunately, due to time constraints we were unable to analyze the entire dataset, so a random sample of 100,000 reviews was used to generate the following word clouds. This involved a heavy amount of work transforming and cleaning the review text.
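As a rough illustration of the bucketing and sampling step, here is a minimal sketch in Python (not the R and SQL code behind this post). The function names, the fixed seed, and the tiny made-up review list are all assumptions for the example.

```python
import random

def score_category(score):
    """Map a 1-5 review score to the three buckets used in this post."""
    if score >= 4:
        return "good"
    if score == 3:
        return "okay"
    return "bad"

def sample_reviews(reviews, k, seed=0):
    """Draw a random sample of k reviews without replacement.

    The seed is an assumption, added only to make the example repeatable.
    """
    if k >= len(reviews):
        return list(reviews)
    return random.Random(seed).sample(reviews, k)

# Made-up (score, text) pairs standing in for the real dataset.
reviews = [
    (5, "Great coffee, would buy again"),
    (4, "Good taste"),
    (3, "It was fine"),
    (2, "Stale and bland"),
    (1, "I paid for nothing"),
]

sampled = sample_reviews(reviews, 3)
labeled = [(score_category(score), text) for score, text in sampled]
```

With the real data, `k` would be 100,000 and `reviews` would come from the full review table.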
Our findings are summarized by the following data visualizations.
Words like ‘taste’, ‘coffee’, and ‘good’ appear in all word clouds because they are high-frequency words in the sample dataset. We then computed, for each word, the probability that a review containing it falls into each review score category, and picked the top 15 words for both good and bad reviews.
For instance, the word ‘paid’ appears in 95 reviews, all of which had a score under 2. So the probability of the word ‘paid’ being in a BAD review was 100%. Our interesting findings are summarized in the infographic below.
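The probability calculation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code used for the post: the tokenizer, the `min_reviews` cutoff, and the sample data are hypothetical. Each word is counted at most once per review, and its probability for a category is the share of reviews containing that word which fall into the category.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and return its set of distinct alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def word_category_probability(labeled_reviews, min_reviews=1):
    """For each word, compute the share of reviews containing it per category.

    labeled_reviews is an iterable of (category, text) pairs.
    min_reviews is an assumed cutoff to skip words seen in too few reviews.
    """
    total = Counter()
    by_category = defaultdict(Counter)
    for category, text in labeled_reviews:
        for word in tokenize(text):
            total[word] += 1
            by_category[category][word] += 1
    probs = {}
    for word, n in total.items():
        if n >= min_reviews:
            probs[word] = {cat: by_category[cat][word] / n for cat in by_category}
    return probs

def top_words(probs, category, n=15):
    """Return the n words most associated with a category (ties broken alphabetically)."""
    ranked = sorted(probs, key=lambda w: (-probs[w].get(category, 0.0), w))
    return ranked[:n]

# Made-up labeled reviews for illustration only.
labeled = [
    ("bad", "I paid too much"),
    ("bad", "Paid for stale coffee"),
    ("good", "Good coffee"),
    ("good", "Tastes good"),
]

probs = word_category_probability(labeled)
top_good = top_words(probs, "good", n=2)
```

In this toy data, ‘paid’ only occurs in bad reviews, so its bad-review probability is 1.0, while ‘coffee’ is split evenly between good and bad. A word seen in only a handful of reviews can reach 100% by chance, which is why the sketch includes a minimum-count cutoff.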
Stay tuned for more interesting insights from the Tank Brigade in future blog posts.
Nice data visualizations! It would also be interesting to know the probability of having a determined score given that a review contains a specific word.