Text mining in R

For a mark of up to 85%, produce a document (Word, HTML, or PDF) using R Markdown that reproduces the steps we covered in class:

  • load any packages you need
  • read in the review data set (a subset of the Yelp Dataset Challenge data)
  • flag each review as “positive” or “negative” sentiment
  • make a corpus out of the whole set of texts and clean as needed
  • make a document term matrix out of the corpus
  • generate a word cloud: use your judgement about how many words to include or what minimum frequency to require, and state your observations about the word cloud.
  • generate a comparison word cloud for positive and negative reviews
    • first make two documents, one containing the text of all negative reviews and one containing the text of all positive reviews
    • then make those two documents into a corpus
    • then make a term document matrix out of that corpus
    • state your observations about the comparison word cloud
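
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the code used in class: it assumes the tm and wordcloud packages are installed, uses a four-row toy data frame in place of the real review file, and assumes the columns are called `stars` and `text`. The 4-star cutoff for "positive" and the helper name `clean_corpus` are likewise assumptions made for the example.

```r
library(tm)         # corpus handling and cleaning
library(wordcloud)  # wordcloud() and comparison.cloud()

# Toy stand-in for the review data; the column names `stars` and `text`
# are assumptions, so adjust them to match the file you read in.
reviews <- data.frame(
  stars = c(5, 1, 4, 2),
  text  = c("Great pizza and friendly staff",
            "Terrible service and cold food",
            "Friendly staff, great sushi",
            "Cold pizza, rude staff"),
  stringsAsFactors = FALSE
)

# Flag sentiment with an assumed cutoff: 4 or 5 stars counts as positive.
reviews$sentiment <- ifelse(reviews$stars >= 4, "positive", "negative")

# Hypothetical helper that applies the same cleaning steps to any corpus.
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  tm_map(corpus, stripWhitespace)
}

# Corpus and document term matrix for the whole set of reviews.
all_corpus <- clean_corpus(VCorpus(VectorSource(reviews$text)))
dtm <- DocumentTermMatrix(all_corpus)
word_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

# Word cloud; min.freq and max.words are the judgement calls to justify.
wordcloud(names(word_freq), word_freq,
          min.freq = 1, max.words = 100, random.order = FALSE)

# Two documents: all negative text and all positive text.
by_sentiment <- c(
  negative = paste(reviews$text[reviews$sentiment == "negative"], collapse = " "),
  positive = paste(reviews$text[reviews$sentiment == "positive"], collapse = " ")
)

# Corpus of the two documents, then a term document matrix (terms in rows).
pair_corpus <- clean_corpus(VCorpus(VectorSource(by_sentiment)))
tdm <- as.matrix(TermDocumentMatrix(pair_corpus))
colnames(tdm) <- names(by_sentiment)

# Comparison cloud: each word is drawn on the side where it is relatively more common.
comparison.cloud(tdm, max.words = 100, random.order = FALSE)
```

On the real data, raise `min.freq` well above 1 and consider adding domain words that appear everywhere (for example "restaurant" or "food") to the stop word list, since they crowd out more informative terms.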

For a mark of up to 100%, go deeper! See what else you can learn about the Yelp data. (If you like, you can go back to the full Yelp dataset and select a different subset; the files I used for cleaning and pre-processing are in the shared folder W:\SAUD\COE\coeprojects\_Resources-Tutorials\BAMS 580D 2016\text analytics. You could also use a completely different data set: Twitter is one of the easiest sources; see instructions here. Just make sure that what you’re planning is realistic. You could submit something basic for this assignment, and then enter the Yelp dataset challenge for real later.)

You need to do some text analytics, but it’s fine to also do analysis on other structured data like rating or location or type of restaurant. Some ideas for questions you could ask about the Yelp data:

  • Do people use different words about pizza restaurants than about sushi restaurants?
  • How do individual users’ average ratings compare? Are some users habitually more negative than others?
  • What words occur most often in reviews that are flagged with anger, anticipation, disgust, and other specific emotions?
  • How do different emotions correlate with a user’s star rating of a particular restaurant?
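
As one possible starting point for the question about users' average ratings, the sketch below uses base R only. The data frame is a toy example, and the column names `user_id` and `stars` are assumptions, not necessarily the names in the Yelp files.

```r
# Toy ratings; `user_id` and `stars` are assumed column names.
ratings <- data.frame(
  user_id = c("u1", "u1", "u1", "u2", "u2", "u3"),
  stars   = c(5, 4, 5, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Mean rating and number of reviews for each user.
user_mean <- tapply(ratings$stars, ratings$user_id, mean)
user_n    <- tapply(ratings$stars, ratings$user_id, length)

# Overall mean across all reviews, used as the comparison baseline.
overall_mean <- mean(ratings$stars)

user_summary <- data.frame(
  user_id           = names(user_mean),
  mean_stars        = as.numeric(user_mean),
  n_reviews         = as.integer(user_n),
  diff_from_overall = as.numeric(user_mean) - overall_mean,
  stringsAsFactors  = FALSE
)

# Sort so the most negative reviewers come first.
user_summary <- user_summary[order(user_summary$mean_stars), ]
print(user_summary)
```

Users with only one or two reviews will produce noisy averages, so it is worth filtering on `n_reviews` before drawing conclusions about who is habitually negative.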