Tag Archives: Python

Orange Interface Installation Instruction

Exploratory analysis and classification trees in Orange

Orange is a powerful data mining package for Python. Its interface is easy to use and offers quite good visualization, especially for classifiers and rules. Since installing the interface is not entirely straightforward, we put together instructions for groups who want to use this software.


Instruction document: Orange Installation
Official website: http://orange.biolab.si/getting-started/

Saving Results of Python Classifiers to Disk

If you are building Naive Bayes classifiers using packages such as NLTK, you may notice that with a large training set it can take hours to run. To avoid losing these results between work sessions, you can save your trained classifier to a disk file using the pickle commands listed below.
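As a minimal sketch of the pickle approach: the snippet below saves an object to disk and loads it back. Here a plain dict stands in for the trained classifier so the example is self-contained; an NLTK `NaiveBayesClassifier` object pickles the same way, and the filename `classifier.pickle` is just an illustrative choice.

```python
import pickle

# Placeholder standing in for a trained classifier object
# (e.g. the result of nltk.NaiveBayesClassifier.train(...))
classifier = {"positive": 0.7, "negative": 0.3}

# Save the trained classifier to disk at the end of a work session
with open("classifier.pickle", "wb") as f:
    pickle.dump(classifier, f)

# In a later session, load it back instead of retraining
with open("classifier.pickle", "rb") as f:
    restored = pickle.load(f)

print(restored == classifier)  # True
```

Note that pickle files are Python-version sensitive and should only be loaded from sources you trust.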

Continue reading

Applying Naive Bayes to Text Mining

I applied the Naïve Bayes Classifier method previously described to the Amazon food review data, and the results were encouraging, but unfortunately very slow to come by – the algorithm took about 19 hours to run for the first set of results below, and 43 hours for the second set (both contained only 5000 rows). The benefit of this approach over the trial-and-error method of checking specific words and their counts across different rating levels is that the algorithm detects words with predictive power for you; the presence of ALL the words is considered, rather than just what we can come up with from a bit of surface-level digging. Continue reading

Naïve Bayes Classifier for Document Classification

Naïve Bayes Classifiers are a family of simple probabilistic classifiers which apply Bayes’ theorem with a strong (naïve) assumption of independence between features. In the context of text analytics, the assumption is that the prevalence of each word in a ‘document’ (article, email, post, tweet, etc.) is independent of the others. Although this is clearly a very naïve assumption, Naïve Bayes has been shown to produce very strong results. One way of compensating for this shortcoming is to use bigrams or trigrams (sets of two or three consecutive words), which helps to capture some of the dependence in the text.
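To illustrate the bigram idea, here is a short pure-Python sketch of extracting consecutive word pairs from a tokenized sentence (NLTK provides the same via `nltk.bigrams()`). A bigram feature like ('not', 'good') captures a negation that the single words 'not' and 'good' would miss under the independence assumption.

```python
def bigrams(tokens):
    """Return the list of consecutive word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))

tokens = "the food was not good".split()
print(bigrams(tokens))
# [('the', 'food'), ('food', 'was'), ('was', 'not'), ('not', 'good')]
```

These pairs can then be used as features in the classifier alongside (or instead of) individual words.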

Naïve Bayes essentially works as follows: Continue reading