Naïve Bayes Classifier for Document Classification

Naïve Bayes classifiers are a family of simple probabilistic classifiers which apply Bayes’ theorem with a strong (naïve) assumption about the independence between features. In the context of text analytics, the assumption is that the occurrences of words in a ‘document’ (article, email, post, tweet, etc.) are independent of each other given the document’s class. Although this is clearly a very naïve assumption, Naïve Bayes has been shown to produce very strong results. One way of compensating for this shortcoming is to use bigrams or trigrams (sequences of two or three consecutive words) as features, which helps to capture some of the dependence in the text.

Naïve Bayes essentially works as follows:

  1. Pre-process the data to count how many times each word appears in each document
  2. Use this training set to estimate the likelihood of each unique word appearing as many times as it did, given that the document belongs to a certain class (such as “spam” or “not spam”)
  3. Similarly, estimate the prior probability of a document belonging to each class, as well as the probability of each word appearing as many times as it did in a document, regardless of class
  4. Finally, compute the conditional probability that the document belongs to a certain class: multiply the per-word likelihoods together with the prior probability of the class, then divide by the probability of the words appearing as many times as they did

In summary, the posterior probability of a document being spam = (prior probability of spam) × (likelihood of each word appearing as many times as it did, given that the document is spam) / (the probability of each word appearing as many times as it did). This way, high-incidence words like “the” and “a” receive “discounted” weights in the probability calculation, while rarer words like “fantastic” or “terrible” are taken as better predictors of the document’s class.
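To make these steps concrete, here is a minimal sketch of a word-count Naïve Bayes spam classifier in Python. The tiny training corpus, the train/classify function names, and the Laplace (add-one) smoothing used to avoid zero probabilities are illustrative assumptions, not part of the original description.

    import math
    from collections import Counter, defaultdict

    def train(docs):
        """docs: list of (word_list, label) pairs. Returns priors and
        Laplace-smoothed per-class word likelihoods."""
        class_counts = Counter(label for _, label in docs)
        word_counts = defaultdict(Counter)     # per-class word frequencies
        vocab = set()
        for words, label in docs:
            word_counts[label].update(words)
            vocab.update(words)
        priors = {c: n / len(docs) for c, n in class_counts.items()}
        likelihoods = {}
        for c in class_counts:
            total = sum(word_counts[c].values())
            # add-one smoothing keeps unseen words from zeroing the product
            likelihoods[c] = {w: (word_counts[c][w] + 1) / (total + len(vocab))
                              for w in vocab}
        return priors, likelihoods, vocab

    def classify(words, priors, likelihoods, vocab):
        """Pick the class with the highest log-posterior; the denominator
        p(D) is the same for every class, so it can be dropped."""
        scores = {}
        for c, prior in priors.items():
            scores[c] = math.log(prior) + sum(
                math.log(likelihoods[c][w]) for w in words if w in vocab)
        return max(scores, key=scores.get)

    docs = [("win money now".split(), "spam"),
            ("meeting at noon".split(), "not spam"),
            ("win a free prize now".split(), "spam"),
            ("lunch meeting tomorrow".split(), "not spam")]
    priors, likelihoods, vocab = train(docs)
    print(classify("free money".split(), priors, likelihoods, vocab))  # spam

Summing logarithms instead of multiplying raw probabilities avoids numeric underflow on long documents; the derivation in the theory section below makes the same move.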

Naïve Bayes Classification Theory

The i-th word of a given document occurs in a document from class C with probability:
p(w_i|C)
Assuming the words occur independently given the class, a given document D containing all the words w_i then has the probability:
p(D|C) = ∏_i p(w_i|C)
We want to know the probability that a given document D belongs to a given class C, which is p(C|D). By Bayes’ theorem:
p(C|D) = p(C)/p(D) × p(D|C)
For example, assume that there are two mutually exclusive classes, S and ¬S. Then we have:
p(D|S) = ∏_i p(w_i|S)
and
p(D|¬S) = ∏_i p(w_i|¬S)
Then we can derive p(S|D) and p(¬S|D) as:
p(S|D) = p(S)/p(D) × ∏_i p(w_i|S)
p(¬S|D) = p(¬S)/p(D) × ∏_i p(w_i|¬S)
Dividing one by the other cancels the common factor p(D):
p(S|D)/p(¬S|D) = p(S)/p(¬S) × ∏_i [p(w_i|S)/p(w_i|¬S)]
Taking the logarithm of this formula turns the product into a sum:
ln[p(S|D)/p(¬S|D)] = ln[p(S)/p(¬S)] + ∑_i ln[p(w_i|S)/p(w_i|¬S)]
When ln[p(S|D)/p(¬S|D)] > 0, it means that p(S|D) > p(¬S|D) and the document can be classified into class S.
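As a quick illustration of this decision rule, the snippet below computes the log-odds for a toy two-word document; the word probabilities and priors are made-up values chosen only to show the arithmetic, not estimates from any real corpus.

    import math

    # Hypothetical smoothed word likelihoods (illustrative values only)
    p_w_spam     = {"free": 0.05,  "money": 0.04,  "meeting": 0.002}
    p_w_not_spam = {"free": 0.005, "money": 0.004, "meeting": 0.03}
    p_spam, p_not_spam = 0.4, 0.6    # assumed class priors

    doc = ["free", "money"]
    log_odds = math.log(p_spam / p_not_spam)
    for w in doc:
        log_odds += math.log(p_w_spam[w] / p_w_not_spam[w])

    # log_odds > 0 means p(S|D) > p(not S|D), so classify as spam
    print(round(log_odds, 2), "-> spam" if log_odds > 0 else "-> not spam")

Here the prior term contributes ln(0.4/0.6) ≈ −0.41 against spam, but each of the two word terms contributes ln(10) ≈ 2.30 in favour, so the document is classified as spam.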

Applications of Naïve Bayes Classification

Digit Recognition: Naïve Bayes classification can identify digits by treating the pixels of an image as features and assigning a posterior probability to the image being each possible digit. In the simplest case an image consists of black and white pixels. The probabilities are first estimated on a training sample and later applied to new images to identify the digit.
Spam Classification: Represents each message as a vector of attributes (words) and evaluates it as spam or non-spam based on the presence of those attributes in the email.
Medical Diagnosis: Uses patient data to estimate the likelihood that a patient will develop a certain disease.
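The digit-recognition setup can be sketched in a few lines with scikit-learn; the choice of scikit-learn, its bundled 8×8 digits dataset, and the binarization threshold are all assumptions for illustration rather than anything prescribed by the description above.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB

    # Load 8x8 grayscale digit images and binarize each pixel so every
    # feature is black/white, matching the description above
    digits = load_digits()
    X = (digits.data > 7).astype(int)    # pixel intensities range 0-16
    X_train, X_test, y_train, y_test = train_test_split(
        X, digits.target, test_size=0.25, random_state=0)

    model = BernoulliNB()                # one Bernoulli likelihood per pixel
    model.fit(X_train, y_train)
    print("accuracy:", model.score(X_test, y_test))

BernoulliNB builds exactly the kind of per-pixel likelihood table described above: for each digit class it estimates the probability that each pixel is white, and combines those probabilities with the class priors via Bayes’ theorem.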
