In a world awash with text, is there a way of automating the process of extracting meaningful information from unstructured text data such as blog entries, reviews on websites, tweets, or other documents? The answer is yes…sort of. Because humans are always coming up with new figures of speech, misspelling words, and using words ironically (“United lost my bags! Best day ever! Thanks, United! I didn’t want to go to the beach anyway!”), no automated meaning-detection procedure will ever have 100% accuracy. However, there are a number of general approaches to analysing such data. Some methods depend on understanding the structure of sentences (identifying verbs and nouns, for example); some methods depend on recognizing patterns to extract pieces of information like addresses or phone numbers; some methods look at the similarity of documents to each other to group them by topic. Here we’re going to be focussing on a bag-of-words approach, which takes words out of their context within sentences or documents and treats each word as its own object.
There are several steps in taking an approach like this; rough R code sketches for each step appear after the list below.
- Gather and format data. Another blog entry details how to download data from Twitter. In class we’ll be using data that’s already formatted as a .csv file. A collection of texts is usually called a corpus (plural: corpora); in R this is a special data structure.
- Clean the text. Steps you can take include…
- removing numbers
- removing punctuation
- changing all letters to lowercase: this is to avoid “great” and “Great” and “GREAT” being counted as three different words.
- removing “stopwords”: these are words like pronouns that occur frequently in practically every document. There are situations in which (for example) studying pronoun use yields interesting insights, but most of the time they’re a distraction.
- “stemming” words. This means removing endings on words (“running”, “runs”, and “run” would all become “run”) so that you can count the total number of occurrences of the root word rather than the occurrences of each of its similar forms.
- Make a document-term matrix: each row corresponds to a document in the corpus, and each column corresponds to a word that occurs at least once in the corpus. The value of a cell is the count of the number of times that term occurs in that document. A term-document matrix is the same except rows and columns are swapped. For most corpora the document-term matrix is very sparse: most words appear in only a handful of documents, so the great majority of cells are zero.
- Count the number of times particular words appear in the entire corpus. You can also make visualizations of these counts: either bar graphs or (same idea but flashier) word clouds.
- Measure associations of words: if a certain word appears in a document, how likely is it that another word also appears?
- Assign one or more “sentiments” to each word. A word like “horrible” might be counted as expressing disgust and/or anger, as well as being categorized as negative; a word like “lovely” might be counted as expressing joy, as well as being categorized as positive. Then by counting the number of words in different categories you can calculate a total score for each document in any of several sentiment categories.
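As a rough sketch of the first step, here is one way to build a corpus from a CSV file using the tm package. The file name (tweets.csv) and its text column are placeholder names for illustration, not the class data set.

```r
# Sketch: read a CSV of texts and turn it into a tm corpus.
# "tweets.csv" and the "text" column are hypothetical names.
library(tm)

raw <- read.csv("tweets.csv", stringsAsFactors = FALSE)
corpus <- Corpus(VectorSource(raw$text))

inspect(corpus[1:3])  # peek at the first few documents
```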
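The cleaning steps listed above map roughly onto a chain of tm_map() calls (the SnowballC package supplies the stemmer). Lowercasing comes before stopword removal so that capitalized stopwords are caught too.

```r
library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument

corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)  # tidy up the gaps left behind
```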
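Building the document-term matrix (or its transpose) from the cleaned corpus is then a single call; printing the object reports its dimensions and sparsity.

```r
library(tm)

dtm <- DocumentTermMatrix(corpus)
dtm                                  # reports documents, terms, and sparsity
tdm <- TermDocumentMatrix(corpus)    # same counts with rows and columns swapped

# Optionally drop very rare terms; 0.99 keeps terms present in
# at least roughly 1% of documents (the threshold is a judgment call).
dtm_small <- removeSparseTerms(dtm, 0.99)
```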
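Corpus-wide counts come from summing the columns of the document-term matrix; for the flashier plot, one option (an assumption here, not the only choice) is the wordcloud package with RColorBrewer for colours. The frequency cut-off of 25 is arbitrary.

```r
library(wordcloud)
library(RColorBrewer)

# Total count of each term across the whole corpus (fine for a class-sized DTM).
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
head(freq, 10)                     # ten most frequent terms

barplot(head(freq, 10), las = 2)   # bar graph of the same counts

# Word cloud of terms that appear at least 25 times.
wordcloud(names(freq), freq, min.freq = 25, colors = brewer.pal(8, "Dark2"))
```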
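For word associations, tm's findAssocs() answers the question directly: given a term, which other terms' per-document counts are correlated with it? The term “delay” and the 0.3 correlation floor below are purely illustrative.

```r
# Terms whose per-document counts correlate with "delay" at 0.3 or above.
findAssocs(dtm, "delay", corlimit = 0.3)
```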
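One way to do the sentiment step (an assumption; other lexicons and packages exist) is the NRC emotion lexicon via the syuzhet package, which tags words with emotions such as anger, disgust, and joy as well as overall positive/negative. It works on the raw texts read in the first sketch.

```r
library(syuzhet)

# One row per document; columns count words tagged with each emotion,
# plus overall "negative" and "positive" totals.
sentiments <- get_nrc_sentiment(raw$text)
head(sentiments)

# A simple overall score per document: positive words minus negative words.
raw$score <- sentiments$positive - sentiments$negative
```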