Text analytics has been increasingly attracting the attention of the OR-analytics community as the amount of information stored as textual data increases. Just think about the amount of data stored in emails, news articles and social media, not to mention contact-center notes, surveys, feedback forms, and so forth. Some estimates put the growth of the text analytics market at between 25% and 40% over the next five years.[1][2] But let's make sure we are talking about the same thing when we talk about text analytics.
So, what is text analytics?
If defining analytics is a bit of a challenge, defining text analytics is an even greater one. But, basically, text analytics transforms unstructured text into data that can be analyzed with traditional techniques to obtain business insights. [3][4]
The distinctive characteristic of text analytics is that it works with unstructured text. This kind of data has no predefined model of organization, resulting in irregularities and ambiguities that make it difficult to extract meaning from. (That said, unstructured text often appears alongside structured data, such as the dates and demographic information commonly found in records and surveys.)
The transformation of unstructured text into useful insights is done by applying methods from fields such as linguistics, statistics and computer science (most notably Natural Language Processing). Some of these techniques, several of which are combined in the sketch after this list, are:
- Tokenizing: decomposing sentences into individual words or phrases.
- Filtering: removing uninformative words from the analysis, such as “the”, “a”, or other words specific to the domain of the analysis.
- Stemming: reducing words to their stem to identify different grammatical forms of words.
- Tagging: adding keywords that help describe the data, making searching it easier.
- Indexing: using indexes to quickly locate keywords without having to search every row in a database.
- Word frequency analysis: creating a matrix of frequencies that enumerates the number of times that each word occurs in each entry or document.
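To make these steps concrete, here is a minimal Python sketch that tokenizes, filters, crudely stems and counts word frequencies for a couple of invented documents. It uses no external libraries; the stop-word list and suffix rules are toy stand-ins for what packages such as NLTK or scikit-learn provide.

```python
from collections import Counter

documents = [
    "The customer complained about the billing charges on the bill.",
    "Customer asked about upgrading the data plan.",
]

STOP_WORDS = {"the", "a", "an", "about", "on", "of"}  # toy filter list

def tokenize(text):
    """Tokenizing: split a sentence into lowercase word tokens."""
    return [w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")]

def crude_stem(word):
    """Stemming (very rough stand-in): strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Word frequency analysis: one Counter of term counts per document
frequency_matrix = []
for doc in documents:
    tokens = tokenize(doc)                                # tokenizing
    tokens = [w for w in tokens if w not in STOP_WORDS]   # filtering
    tokens = [crude_stem(w) for w in tokens]              # stemming
    frequency_matrix.append(Counter(tokens))

for counts in frequency_matrix:
    print(counts)
```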
Once text is indexed and quantified, traditional statistical and data mining methods kick in. Exploratory data analysis, statistical classification, cluster analysis or predictive modelling techniques can help to make sense out of the text. For instance, topic modelling [5][6] tries to discover the topics of a text collection using probabilistic models based on the statistics of the words in the data. Sentiment analysis aims to determine the attitude of text comments towards a topic, for example to gauge public opinion on political topics based on Twitter entries. Which technique is appropriate depends largely on the objective of the analysis.
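As a rough sketch of the topic-modelling idea, the example below assumes scikit-learn is installed and fits a two-topic latent Dirichlet allocation (LDA) model to a few invented documents; the documents, the choice of two topics, and the use of get_feature_names_out (recent scikit-learn versions) are illustrative assumptions, not a prescription.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the phone bill was too high this month",
    "charged twice on my bill, need a refund",
    "great coverage and fast data speeds",
    "network speed and coverage are excellent",
]

# Build a document-term frequency matrix, dropping English stop words
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Fit a two-topic LDA model to the frequency matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Show the highest-weighted words for each discovered topic
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top_words = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"Topic {topic_idx}: {top_words}")
```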
In a recent COE project we used agent notes from a call center to identify customers complaining about their phone bills and then analyzed the characteristics of this type of customer. Text analytics allowed us to interpret unstructured text from agent notes and to process a massive amount of data to better understand customers’ behavior.
Interpreting unstructured text is a challenging problem that Text Analytics tackles. It’s not easy work but its potential rewards are worthwhile.
We chose to look at text analytics applications in the insurance business because insurance companies deal with unstructured and ambiguous data from many different sources: the text data could be in police reports, medical records, underwriter notes, and so on. Different agents use different abbreviations to record incidents, which makes the text very challenging to parse. Insurance companies also try to keep costs down, so they have limited resources for going through the details of every claim. As a result, there is a high risk of undetected fraud and of missed opportunities in subrogation. In addition, insurers use text analytics in their daily operations to sort claims and direct the work flow to the proper department (auto, house, medical); a small sketch of this routing idea appears after the list below.
The techniques used in the insurance industry are as follows:
– In daily operations, they use tokenization, stemming, filtering and tagging (https://www.casact.org/education/specsem/f2008/handouts/ellingsworth.pdf).
– In subrogation, they use tokenization, tagging and indexing. The main goal of the analysis is to identify subrogation opportunities and recover costs from third parties. As mentioned in an online article, “5% of claims that should go to subrogation don’t” (http://www.insurancetech.com/3-insurance-business-applications-for-text-analytics/a/d-id/1314975?).
– In fraud detection, the company can use text analytics to do background screening at the point of sale and to monitor changes in the customer’s profile over the life of the policy (http://www.insurancetech.com/3-places-to-look-for-underwriting-fraud-risk/a/d-id/1314903?).
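To illustrate the claim-routing idea mentioned above, here is a hypothetical keyword-tagging sketch in Python; the department keyword lists are invented for the example, and a real insurer would derive them from domain expertise or from a trained classifier.

```python
# Hypothetical keyword lists used to tag a claim note with a department
DEPARTMENT_KEYWORDS = {
    "auto":    {"vehicle", "collision", "windshield", "bumper"},
    "house":   {"roof", "flood", "burglary", "fire"},
    "medical": {"injury", "hospital", "whiplash", "physician"},
}

def route_claim(note):
    """Return the department whose keywords appear most often in the note."""
    words = set(note.lower().split())
    scores = {dept: len(words & kws) for dept, kws in DEPARTMENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "manual review"

print(route_claim("Insured reports roof damage after hail and minor flood in basement"))
print(route_claim("Rear-ended at a light, bumper and windshield damaged"))
```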
Naïve Bayes Classifier for Document Classification
Introduction
Naïve Bayes Classifiers are a family of simple probabilistic classifiers which apply Bayes’ theorem with a strong (naïve) assumption of independence between features. In the context of text analytics, the assumption is that the words occurring in a ‘document’ (article, email, post, tweet, etc.) occur independently of each other. Although this is clearly a very naïve assumption, Naïve Bayes has been shown to produce very strong results. One way of compensating for this shortcoming is to work with bigrams or trigrams (sets of two or three consecutive words at a time), which helps to capture some of the dependence in the text.
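For instance, given a list of tokens, bigrams and trigrams can be produced in a couple of lines of Python (the sentence here is an invented example):

```python
tokens = "the offer is not a scam".split()

bigrams = list(zip(tokens, tokens[1:]))               # pairs of adjacent words
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))  # triples of adjacent words

print(bigrams[:3])   # [('the', 'offer'), ('offer', 'is'), ('is', 'not')]
print(trigrams[:2])  # [('the', 'offer', 'is'), ('offer', 'is', 'not')]
```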
Naïve Bayes essentially works as follows:
- Pre-process the data to count how many times each word in a document appears in that document.
- Use this training set to calculate the likelihood of each unique word appearing as many times as it did, given that the document is of a certain class (such as “spam” or “not spam”).
- Similarly, calculate the prior probability of a document being of a certain class, as well as the probability of each word appearing as many times as it did in a document, regardless of class.
- Finally, calculate the conditional probability that a document is of a certain class: multiply the independent likelihoods of each unique word appearing as many times as it did by the prior probability of the class, and divide by the probability of the words appearing as many times as they did.
In summary, the posterior probability of a document being spam = (prior probability of spam) x (likelihood of having each word so many times given that it’s spam) / (the probability of having each word so many times). This way, high-incidence words like “the” and “a” receive “discounted” weights in the probability calculation, and rarer words like “fantastic” or “terrible” will be taken as better predictors of the document’s class.
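To make this concrete with purely illustrative numbers: suppose a document contains only the words “free” and “the”, the prior probability of spam is 0.5, and the training data gives p(“free”|spam) = 0.05 versus p(“free”|not spam) = 0.005, and p(“the”|spam) = 0.4 versus p(“the”|not spam) = 0.5. The unnormalized posterior for spam is then 0.5 × 0.05 × 0.4 = 0.01, while for not-spam it is 0.5 × 0.005 × 0.5 = 0.00125; the shared denominator cancels when comparing the two, so the document comes out roughly eight times more likely to be spam, driven almost entirely by the rarer word “free”.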
Naïve Bayes Classification Theory
The i-th word wi of a given document occurs in documents from class C with probability:
p(wi|C)
Assuming the words occur independently, a given document D that contains all the words wi has, given a class C, the probability:
p(D|C) = ∏_i p(wi|C)
We want to know the probability that a given document D belongs to a given class C, which is p(C|D).
By Bayes’ theorem we have: p(C|D) = p(C)/p(D) × p(D|C).
For example, assume there are two mutually exclusive classes, S (say, spam) and ¬S (not spam). Then we have:
p(D|S) = ∏_i p(wi|S)
and
p(D|¬S) = ∏_i p(wi|¬S)
Then we can derive p(S|D) and p(¬S|D) as:
p(S|D) = p(S)/p(D) × ∏_i p(wi|S)
p(¬S|D) = p(¬S)/p(D) × ∏_i p(wi|¬S)
Dividing one by the other:
p(S|D)/p(¬S|D) = p(S)/p(¬S) × ∏_i [p(wi|S)/p(wi|¬S)]
Taking the logarithm of this formula gives:
ln[p(S|D)/p(¬S|D)] = ln[p(S)/p(¬S)] + Σ_i ln[p(wi|S)/p(wi|¬S)]
When ln[p(S|D)/p(¬S|D)]>0, it means that p(S|D)>p(¬S|D) and this document can be classified into class S.
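As a minimal Python sketch of this decision rule, the example below trains on a tiny invented corpus and uses add-one (Laplace) smoothing so that unseen words do not zero out the product; it is an illustration of the log-ratio formula above, not a production classifier.

```python
import math
from collections import Counter

# Tiny invented training set of (document, label) pairs
training = [
    ("win cash now win prize", "S"),               # spam
    ("cheap meds win money", "S"),                 # spam
    ("meeting agenda attached", "notS"),           # not spam
    ("lunch plans for the team meeting", "notS"),  # not spam
]

word_counts = {"S": Counter(), "notS": Counter()}
doc_counts = Counter()
for text, label in training:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = set(word_counts["S"]) | set(word_counts["notS"])
totals = {c: sum(word_counts[c].values()) for c in word_counts}

def log_odds_spam(document):
    """Compute ln[p(S|D)/p(notS|D)] = ln[p(S)/p(notS)] + sum_i ln[p(wi|S)/p(wi|notS)]."""
    score = math.log(doc_counts["S"] / doc_counts["notS"])  # prior ratio
    for word in document.split():
        # Add-one smoothing over the vocabulary for each class
        p_spam = (word_counts["S"][word] + 1) / (totals["S"] + len(vocab))
        p_ham = (word_counts["notS"][word] + 1) / (totals["notS"] + len(vocab))
        score += math.log(p_spam / p_ham)
    return score

new_doc = "win a cash prize now"
score = log_odds_spam(new_doc)
print("classified as S (spam)" if score > 0 else "classified as notS", round(score, 2))
```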
Applications of Naïve Bayes Classification
Digit Recognition: a Naïve Bayes classifier can identify digits by treating the pixels of an image (for instance, black and white pixels) as features and assigning a posterior probability that the image shows each possible digit. The probability rules are first learned on a training sample and later applied to new images.
Spam Classification: messages are represented as vectors of word occurrences and evaluated as spam or non-spam based on which attributes or words appear in the email.
Medical Diagnosis: patient data is used to estimate the likelihood of a patient being affected by a certain disease.
References
http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Document_classification
http://burakkanber.com/blog/machine-learning-naive-bayes-1/
http://users.utcluj.ro/~igiosan/Resources/PRS/L8/lab_08e.pdf
http://arxiv.org/pdf/cs/0006013.pdf
http://www.ijarcce.com/upload/2014/may/IJARCCE9E%20%20a%20rupali%20%20Heart%20Disease%20Prediction.pdf
Part-of-Speech Tagging - Methods of Text Analytics
Part-of-Speech tagging, a.k.a. POST, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase, sentence, or paragraph. POST arose from the difficulty caused by ambiguity of meaning in unstructured text: even simple words can have multiple meanings, so context is the key to identifying which definition applies [1]. One of the leading pioneers in statistical NLP (natural language processing) was Dr. Fred Jelinek. POST grew out of various approaches to speech recognition and, in particular, originally relied on a rule-based approach built on linguistic expertise. In the 1970s, however, IBM found that a data-driven approach worked much better; this approach was later taken up by Google and is now one of the dominant ways of attacking such problems. In 1990, less than 5% of papers at an ACL conference used statistical methods for text analytics; now around 95% of papers use these methods. This is indicative of a major paradigm shift, shows where text analytics is going, and explains why it is becoming such a major field in the analytics industry.
Algorithms
POS-tagging algorithms are categorized into two groups: rule-based and stochastic.
E. Brill’s tagger [2] is an example of a rule-based algorithm. It is one of the first and most widely used English POS-taggers, developed by Eric Brill in 1995. The algorithm is an “error-driven transformation-based tagger” that aims to minimize error: a tag is assigned to each word and then corrected using predefined rules. The initial assignment is the most frequent tag for the word, and an unknown word is initially tagged as a noun. The tagger applies a learned set of patterns instead of optimizing a statistical quantity.
Hidden Markov Models (HMMs) [1] are an example of a stochastic approach. An HMM tagger estimates from a tagged corpus how likely each tag is to follow another (transition probabilities) and how likely each word is to be emitted by a given tag (emission probabilities); for example, it might learn that a determiner such as “the” is very likely to be followed by a noun or adjective. The tagger then assigns the tag sequence that is most probable under these estimates.
There are also related algorithms such as the Viterbi algorithm and the Baum-Welch algorithm (also known as the forward-backward algorithm), which are used to decode and train HMMs, as well as Constraint Grammar approaches.
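As a quick illustration, NLTK ships a ready-made statistical tagger; the sketch below assumes NLTK is installed and that its tokenizer and tagger resources have been downloaded (resource names can vary slightly between NLTK versions), and the printed tags are what we would expect rather than guaranteed output.

```python
import nltk

# One-time downloads of the tokenizer and tagger models
# (names may differ slightly across NLTK versions)
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "She saw the old saw in the shed"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Expected along the lines of:
# [('She', 'PRP'), ('saw', 'VBD'), ('the', 'DT'), ('old', 'JJ'),
#  ('saw', 'NN'), ('in', 'IN'), ('the', 'DT'), ('shed', 'NN')]
```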
Application
POS tagging is useful for a large number of applications: it is the first analysis step in many syntactic parsers; it is required for the correct lemmatization of words like saw, which is ambiguous between the noun saw and the past-tense form of the verb to see; and it is used in information extraction, speech synthesis, lexicographic research, term extraction, and many other applications.
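A toy sketch of why the tag matters for lemmatization: the exception table below is invented for illustration, but a dictionary-backed lemmatizer works on roughly this (word, tag) to lemma principle.

```python
# Invented lemma lookup keyed on (word, part-of-speech tag)
LEMMA_EXCEPTIONS = {
    ("saw", "VERB"): "see",    # past tense of "to see"
    ("saw", "NOUN"): "saw",    # the cutting tool
    ("left", "VERB"): "leave",
    ("left", "NOUN"): "left",
}

def lemmatize(word, tag):
    """Return the lemma for a (word, tag) pair, falling back to the word itself."""
    return LEMMA_EXCEPTIONS.get((word.lower(), tag), word.lower())

print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```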
Accuracy
The typical accuracy of POS taggers is between 95% and 98%, depending on the tag set, the size of the training corpus, the coverage of the lexicon, and the similarity between training and test data. This is impressive, but an accuracy of 96% means that a 20-word sentence is correctly tagged with a probability of just 0.96^20 ≈ 44%.[3]
References
[1] http://www.cl.cam.ac.uk/teaching/1213/L100/clark_lectures/clark_lecture2.pdf
[2] http://en.wikipedia.org/wiki/Brill_tagger
[3] http://www.coli.uni-saarland.de/~schulte/Teaching/ESSLLI-06/Referenzen/Tagging/schmid-hsk-tag.pdf
Tag You’re It
We thought it fitting in our discussions of gang cohesiveness that we would talk about tagging. Now we aren’t talking about the sometimes beautiful, often ugly and annoying, spray painted art pieces you find on concrete surfaces everywhere. But instead we are talking about beautifully tagging data with meaningful metadata.
Why do we tag? Because we want to share our content with the online masses in a way that makes it easily found by relevant searches. You can think of Search Engine Optimization and the choice of good keywords as the primary drivers of this form of tagging. For a business trying to market its services and products, this is hugely important.
The difference between tagging and just using keywords is that the tag may not be included in the text itself, whereas the keyword always is. This is especially useful when the meaning of the text is highly context-dependent and unclear from its face value. For example, if a call center is trying to parse its agent comments for resolved tickets, it may need to classify each ticket as a request for information, a customer bill complaint, or a request for services.
Tagging allows relevant information to be pulled from several different systems with ease. Take Twitter, for example: tweets from different users can all be browsed by topic without the users having to coordinate with one another. This is a highly flexible and efficient system because the tags are user generated. However, there are some considerations we should note, primarily that user-generated tags and their meanings evolve with use. For example, the tag ‘Sick’ evolved from meaning illness to meaning cool or awesome in some contexts. Conversely, different tags such as #COOL, #AWESOME and #WHOOO can all mean the same thing. For this reason, other text analytics techniques may be used to aggregate tags, as in the sketch below.
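As a small illustration of aggregating synonymous tags, the synonym map below is invented; in practice it might be curated by hand or learned from tag co-occurrence statistics.

```python
# Invented synonym map for collapsing tags that mean the same thing
TAG_SYNONYMS = {
    "cool": "awesome",
    "whooo": "awesome",
    "awesome": "awesome",
}

def normalize_tag(raw_tag):
    """Lowercase a tag, strip the leading '#', and collapse known synonyms."""
    tag = raw_tag.lstrip("#").lower()
    return TAG_SYNONYMS.get(tag, tag)

tags = ["#COOL", "#AWESOME", "#WHOOO", "#Sick"]
print([normalize_tag(t) for t in tags])  # ['awesome', 'awesome', 'awesome', 'sick']
```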
Sometimes tags can be misleading and cause unwanted data to be included in the analysis. For example, when we use our best judgment to tag data there is always a chance we misunderstood the meaning of the data. As analysts we can only aim to reduce this type of error through communication with the data source, but it is time consuming and costly to reduce these errors.
All in all, we think tags are a great way to describe data. Tags are essentially metadata, and describing data with data is common practice in today’s world. We aren’t quite sure how to create this metadata, and that would be a good topic for another group.