The Text Mining ‘tm’ Package in R

Tank Group – Haider Shah, Tony Guo and Chris Pang

The ‘tm’ package provides functions for text handling, processing and management in R. It is built around the concept of a ‘corpus’, which is a collection of text documents to operate upon. A corpus can be held in memory in R as a volatile corpus (VCorpus) or kept on an external data store such as a database as a permanent corpus (PCorpus).

Example: Building a Word Cloud from Twitter Feeds

Data Import

In this example, tweets from a Twitter account are retrieved with the twitteR package and then stored in a corpus. This is shown below using a sample Twitter feed.

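> library(twitteR)
> library(tm)
> # fetch up to 100 tweets (twitteR needs an authenticated session first)
> rdmTweets <- userTimeline("rdatamining", n=100)
> # convert the list of tweets into a single data frame
> df <- do.call("rbind", lapply(rdmTweets, as.data.frame))
> # build a corpus from the tweet text
> myCorpus <- Corpus(VectorSource(df$text))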
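Because userTimeline needs Twitter credentials, it can help to try the corpus-building step offline first. The following is a minimal sketch, assuming the tm package is installed; the three sample strings are made up and simply stand in for df$text.

```r
library(tm)

# Hypothetical sample texts standing in for df$text
sampleText <- c("Text mining with R is fun",
                "R and data mining examples",
                "Word clouds from tweets")

# VCorpus keeps the documents in memory
sampleCorpus <- VCorpus(VectorSource(sampleText))

length(sampleCorpus)          # number of documents in the corpus
content(sampleCorpus[[1]])    # text of the first document
```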

Transformations

First, a number of transformations are applied to the corpus to simplify it: text is converted to lowercase, and punctuation, numbers and stop words are removed.

The tm_map function in the ‘tm’ package applies a transformation function to every document in a corpus. Functions such as ‘tolower’, ‘removePunctuation’ and ‘removeNumbers’ can be passed to tm_map to perform these operations (in recent versions of tm, base R functions such as tolower are wrapped in content_transformer).

> # content_transformer() wraps base functions such as tolower (needed in tm 0.6 and later)
> myCorpus <- tm_map(myCorpus, content_transformer(tolower))
> myCorpus <- tm_map(myCorpus, removePunctuation)
> myCorpus <- tm_map(myCorpus, removeNumbers)
> # add two extra stop words, and keep "r" by removing it from the stop word list
> myStopwords <- c(stopwords("english"), "available", "via")
> # setdiff() is safe even when "r" is not in the list
> myStopwords <- setdiff(myStopwords, "r")
> myCorpus <- tm_map(myCorpus, removeWords, myStopwords)
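To see what these transformations do to a single document, here is a rough base R sketch that mimics the same steps on one string. The sample text and the short stop word list are made up for illustration; this is not how tm implements the steps internally.

```r
# Hypothetical sample tweet
txt <- "Available via R: 3 NEW Examples, see the docs!"

txt <- tolower(txt)                    # convert to lowercase
txt <- gsub("[[:punct:]]", "", txt)    # strip punctuation
txt <- gsub("[[:digit:]]", "", txt)    # strip numbers

# Tiny hand-picked stop word list for illustration (not tm's English list)
stopList <- c("available", "via", "the", "see")

# Split into words, drop empty strings and stop words, then rejoin
words <- strsplit(txt, "\\s+")[[1]]
words <- words[nzchar(words) & !(words %in% stopList)]
cleaned <- paste(words, collapse = " ")
cleaned
```

Note that "r" survives only because it is not in the stop word list, which is the reason the example above takes it out of myStopwords.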

Stemming

To further consolidate the words in the corpus, we then apply stemming. Stemming reduces words to a common ‘root’ form, mainly by stripping suffixes. For instance, “example” and “examples” are both stemmed to “exampl”.

The function ‘stemDocument’ can be passed to tm_map to perform stemming.

> dictCorpus <- myCorpus
> myCorpus <- tm_map(myCorpus, stemDocument)
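The stemming example from the text can be checked on plain words. This is a small sketch, assuming the SnowballC package is installed (it provides the stemmer that stemDocument relies on).

```r
library(SnowballC)

# wordStem() applies the Snowball stemmer to each word in a character vector
stems <- wordStem(c("example", "examples"), language = "english")
stems
```

Both words should come back as the same stem, which is why they are counted as one term after stemming.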

Dealing with Spelling Mistakes

If your documents contain spelling mistakes, you can handle these by creating a synonym list that associates each incorrect spelling with the correct one. This list can then be passed to the tm_map function together with ‘replaceSynonyms’ to replace the incorrect spellings with the correct spelling.
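If that function is not available in your installed version of tm, the same idea can be sketched in base R. The lookup table and the helper name fixSpelling below are hypothetical, made up for illustration.

```r
# Hypothetical lookup: names are misspellings, values are the correct spelling
corrections <- c(minning = "mining", anlysis = "analysis")

# Replace each misspelling with its correction, matching whole words only
fixSpelling <- function(x, lookup) {
  for (wrong in names(lookup)) {
    pattern <- paste0("\\b", wrong, "\\b")   # \\b marks a word boundary
    x <- gsub(pattern, lookup[[wrong]], x, perl = TRUE)
  }
  x
}

fixed <- fixSpelling("text minning and anlysis in r", corrections)
fixed
```

A helper like this should be usable on a corpus by wrapping it in content_transformer and passing it to tm_map, with the lookup table given as an extra argument.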

Building the Word Cloud

To perform tasks such as clustering, classification and association analysis, we first need to build a document-term matrix.

Two main functions build this matrix: TermDocumentMatrix (terms as rows, documents as columns) and DocumentTermMatrix (documents as rows, terms as columns).

> # keep terms of any length so that the one-letter term "r" is not dropped
> myDtm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))
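The difference between the two orientations is easier to see on a tiny hand-built example. This is a base R sketch with three made-up documents; tm stores its matrices in a sparse format, so this only illustrates the layout.

```r
# Hypothetical mini collection of already cleaned documents
docs <- c(d1 = "r text mining", d2 = "text mining examples", d3 = "r examples")

tokens <- strsplit(docs, " ")
terms <- sort(unique(unlist(tokens)))

# Count each term per document: rows are terms, columns are documents
tdm <- sapply(tokens, function(tk) tabulate(match(tk, terms), nbins = length(terms)))
dimnames(tdm) <- list(terms, names(docs))
tdm

# Transposing gives the document-term orientation: rows are documents
dtm <- t(tdm)
dtm
```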

To find the words with high frequency in our dataset, we can use the function findFreqTerms.

> findFreqTerms(myDtm, lowfreq=10)
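Conceptually, this amounts to summing each term's counts across all documents and keeping the terms whose total reaches the threshold. A base R sketch with made-up counts:

```r
# Hypothetical term-document counts: rows are terms, columns are documents
m <- matrix(c(4, 6, 3,
              1, 0, 2,
              5, 5, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("mining", "cloud", "r"), c("d1", "d2", "d3")))

# Total occurrences of each term over all documents
termTotals <- rowSums(m)

# Keep terms that occur at least 10 times in total
frequent <- names(termTotals[termTotals >= 10])
frequent
```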

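To find words that are correlated with a specific word, we can use the findAssocs function. The third argument is the lower correlation limit, so only terms with a correlation of at least 0.30 are returned.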

> findAssocs(myDtm, "r", 0.30)
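As far as we understand it, the association score is the correlation between two terms' count vectors across the documents. The sketch below reproduces that idea in base R on made-up counts; it is an illustration under that assumption, not tm's actual implementation.

```r
# Hypothetical term-document counts: rows are terms, columns are documents
m <- matrix(c(1, 2, 3, 4,
              2, 4, 6, 8,
              4, 3, 2, 1,
              0, 1, 0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(c("r", "mining", "excel", "cloud"),
                            c("d1", "d2", "d3", "d4")))

target <- "r"
others <- setdiff(rownames(m), target)

# Pearson correlation of each other term with the target term
assoc <- sapply(others, function(term) cor(m[target, ], m[term, ]))

# Keep only terms at or above the lower correlation limit
strong <- assoc[assoc >= 0.30]
strong
```

Here "mining" rises and falls together with "r" and "cloud" is weakly related, while "excel" moves in the opposite direction and is filtered out.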

To build a word cloud based on our results, we can use another R package called ‘wordcloud’ and pass in our dataset.

> library(wordcloud)
> m <- as.matrix(myDtm)
> # calculate the frequency of words
> v <- sort(rowSums(m), decreasing=TRUE)
> myNames <- names(v)
> # relabel "miners" as "mining" for display
> k <- which(names(v)=="miners")
> myNames[k] <- "mining"
> d <- data.frame(word=myNames, freq=v)
> wordcloud(d$word, d$freq, min.freq=3)
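One thing to watch for: if a relabelled term already exists under the new name, the data frame ends up with two rows for the same word. A possible refinement, sketched in base R with made-up frequencies, is to add the counts together before building the data frame.

```r
# Hypothetical term frequencies, sorted as in the example above
v <- c(mining = 12, r = 9, miners = 4, data = 3)

# Relabel "miners" as "mining"
myNames <- names(v)
myNames[myNames == "miners"] <- "mining"

# Sum the frequencies of terms that now share a label
merged <- sapply(split(v, myNames), sum)
merged <- sort(merged, decreasing = TRUE)

d <- data.frame(word = names(merged), freq = unname(merged))
d
```

The resulting d has one row per word and can be passed to wordcloud in the same way as before.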

References
http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
http://www.rdatamining.com/examples/text-mining
http://stackoverflow.com/questions/24443388/stemming-with-r-text-analysis
