Author Archives: cpang

Saving Results of Python Classifiers to Disk

If you build Naive Bayes classifiers with packages such as NLTK, you may notice that training on a large data set can take hours. To avoid losing these results between work sessions, you can save the trained classifier to a disk file using the pickle commands listed further below.
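
As a rough illustration of the idea, here is a minimal sketch using Python's standard pickle module. The helper names save_classifier and load_classifier, the file name, and the stand-in dictionary used in place of a real trained classifier are all assumptions made for this example, not taken from the original post; a trained NLTK classifier object could be passed in the same way.

```python
import os
import pickle
import tempfile


def save_classifier(classifier, path):
    """Serialize a trained classifier object to a file on disk."""
    with open(path, "wb") as f:
        pickle.dump(classifier, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_classifier(path):
    """Read a previously saved classifier object back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in for a trained model (assumption for illustration only).
# In practice this would be the object returned by the training step.
trained = {
    "labels": ["pos", "neg"],
    "feature_counts": {"great": 12, "awful": 7},
}

# Hypothetical file location; any writable path would do.
path = os.path.join(tempfile.gettempdir(), "naive_bayes_classifier.pickle")

save_classifier(trained, path)      # end of one work session
restored = load_classifier(path)    # start of the next session
```

The files are opened in binary mode ("wb" and "rb") because pickle writes bytes, not text. Only unpickle files you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.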


Amazon Foods Reviews Activity – by Tank Brigade

Following our previous post on how to create a word cloud in R, we decided to try those techniques out. Inspecting the data, we observed that some words seemed to appear more often in reviews with higher scores, while others were more likely to appear in reviews with lower scores. Our initial hypothesis is that some words are positively correlated with good reviews and others are negatively correlated with them.

The Text Mining ‘tm’ Package in R

Tank Group – Haider Shah, Tony Guo and Chris Pang

The ‘tm’ package for R provides functions for text handling, processing and management. It is built around the concept of a ‘corpus’, which is a collection of text documents to operate on. A corpus can be held in memory in R as a volatile corpus (VCorpus) or kept in an external data store such as a database as a permanent corpus (PCorpus).

Example: Building a Word Cloud from Twitter Feeds