Author Archives: ajdueck

Applying Naive Bayes to Text Mining

I applied the Naïve Bayes Classifier method previously described to the Amazon food review data. The results were encouraging but very slow to come by: the algorithm took about 19 hours to run for the first set of results below and 43 hours for the second set (both contained only 5000 rows). The benefit of this approach, compared to the trial-and-error method of checking for specific words and their counts across different rating levels, is that the algorithm detects words with predictive power for you; the presence of ALL the words is considered, rather than just the ones we can come up with from a bit of surface-level digging.

Naïve Bayes Classifier for Document Classification

Naïve Bayes Classifiers are a family of simple probabilistic classifiers which apply Bayes’ theorem with a strong (naïve) assumption of independence between features. In the context of text analytics, the assumption is that the occurrences of words in a ‘document’ (article, email, post, tweet, etc.) are independent of each other, given the class of the document. Although this is clearly a very naïve assumption, Naïve Bayes often produces strong results in practice, particularly for text classification. One way of compensating for its shortcoming is to study bigrams or trigrams (sets of two or three consecutive words at a time), which helps to capture some of the dependence in the text.
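To make the independence assumption and the bigram idea concrete, here is a minimal Python sketch of a multinomial Naïve Bayes classifier. It is an illustration only, not the code used for the results in these posts: the class name `TinyNaiveBayes`, the helper `ngrams`, the add-one (Laplace) smoothing, and the four toy reviews are all assumptions made up for the example. Under the independence assumption, the score of a class is simply its log prior plus the sum of the log probabilities of each token, and passing `n=2` switches the tokens from single words to bigrams.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n=1):
    """Split text into lowercase word n-grams (n=1 gives single words)."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


class TinyNaiveBayes:
    """Hypothetical multinomial Naive Bayes with add-one smoothing."""

    def __init__(self, n=1):
        self.n = n  # size of the n-grams used as features

    def fit(self, docs, labels):
        self.doc_counts = Counter(labels)          # documents per class
        self.token_counts = defaultdict(Counter)   # token counts per class
        for doc, label in zip(docs, labels):
            self.token_counts[label].update(ngrams(doc, self.n))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def log_scores(self, doc):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in ngrams(doc, self.n):
                if tok in self.vocab:              # skip unseen tokens
                    # Independence: per-token log-likelihoods just add up.
                    score += math.log((counts[tok] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Invented toy reviews, not taken from the Amazon food review data.
docs = [
    "great taste love it",
    "love this great product",
    "terrible taste awful",
    "awful product do not buy",
]
labels = ["pos", "pos", "neg", "neg"]

unigram_model = TinyNaiveBayes(n=1).fit(docs, labels)
bigram_model = TinyNaiveBayes(n=2).fit(docs, labels)
```

Calling `unigram_model.predict("great product love")` scores each class from single words, while `bigram_model.predict("great taste")` scores it from word pairs. Bigrams capture some word order at the cost of a much larger and sparser vocabulary, which is one reason run time and data requirements grow quickly.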

Naïve Bayes essentially works as follows: Continue reading