[11.3] Algorithms of Predictive Text

by markpepe

Click on the tweet below to see the full thread.

For this task, I expected to string together a set of words to form a coherent thought, but after I typed the prompt, the predictive text led me to nonsensical options. At one point, I was given options in either Spanish or Italian. In this case, predictive text did not capture how I would express myself. There likely isn’t enough data to properly predict what I would usually say.

Stoop & van den Bosch (n.d.) give a clear explanation of how predictive text works:

“To be able to make useful predictions, a text predictor needs as much knowledge about language as possible, often done by machine learning […] This works by looking at the last few words you wrote and comparing these to all groups of words seen during the training phase. It outputs the best guess of what followed groups of similar words in the past.”

In their article, they discuss how an algorithm called k-Nearest Neighbours predicts text using Twitter. The algorithm looks at all of an author’s past tweets to build a database, and then uses an approach called context-sensitive prediction, which depends on similar groups of words being available, on a list of words frequently used by the author, and on limiting the pool of available words based on words already used. The algorithm also models “friends” on Twitter, taking into account conversations, mentions, and similar accounts. The authors mention that the algorithm will create accurate predictive text for Lady Gaga and Justin Timberlake because they are likely to tweet about similar, overlapping topics.
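To make that a bit more concrete, here is a rough Python sketch of what context-sensitive prediction could look like: it matches the last few words you typed against contexts seen in past tweets and, when nothing matches, falls back on the author’s most frequently used words. The tiny “tweet” corpus and the two-word context window are my own illustrative assumptions, not the authors’ actual data or method, and a real k-Nearest Neighbours predictor would compare contexts far more flexibly than this exact-match version does.

```python
from collections import Counter, defaultdict

# Illustrative stand-in for an author's past tweets (not real data).
past_tweets = [
    "happy to be back in the studio today",
    "back in the studio working on new music",
    "so happy to see everyone at the show tonight",
    "working on new music for the tour",
]

CONTEXT_SIZE = 2  # compare the last two words against contexts seen in training


def build_model(tweets):
    """Map each word context to a count of the words that followed it."""
    contexts = defaultdict(Counter)
    author_freq = Counter()  # overall word frequencies for this author
    for tweet in tweets:
        words = tweet.split()
        author_freq.update(words)
        for i in range(len(words) - CONTEXT_SIZE):
            context = tuple(words[i:i + CONTEXT_SIZE])
            next_word = words[i + CONTEXT_SIZE]
            contexts[context][next_word] += 1
    return contexts, author_freq


def predict(contexts, author_freq, text, k=3):
    """Return up to k candidate next words for the text typed so far."""
    words = text.split()
    context = tuple(words[-CONTEXT_SIZE:])
    if context in contexts:
        # Context-sensitive prediction: words that followed this exact context before.
        candidates = contexts[context]
    else:
        # Fall back to the words this author uses most often.
        candidates = author_freq
    return [word for word, _ in candidates.most_common(k)]


contexts, author_freq = build_model(past_tweets)
print(predict(contexts, author_freq, "back in the"))  # e.g. ['studio']
print(predict(contexts, author_freq, "hello there"))  # falls back to frequent words
```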

When I finished writing this task on Twitter, I couldn’t help but think of Reddit’s r/subredditsimulator. In this subreddit, only bots post and comment to each other, but each bot is a representation of its assigned subreddit. “It’s not a perfect recreation of Reddit, but an adequate caricature of its worst tendencies” (Khalid, 2019). It can be funny, sarcastic, mean-spirited, helpful, and reflective, and it can also uncannily echo the real internet (Khalid, 2019).

r/subredditsimulator uses OpenAI’s GPT-2 language model. “GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text” (OpenAI, 2019). GPT-2 was trained on 40GB of internet text, a data set of 8 million web pages, which allows it to answer questions, summarize, and translate. In a paper by OpenAI, GPT-2 was able to answer questions like “who wrote the book Origin of Species?” or “who is the quarterback of the Green Bay Packers?” It answered them correctly with a high degree of probability, over 80%; the answers are Charles Darwin and Aaron Rodgers. The questions it got wrong, though, had closely associated answers. Largest state in the US by landmass? It answered California; it’s Alaska, but California is the largest state by population. Good guess! Another one was who plays Ser Davos in Game of Thrones? GPT-2 answered Peter Dinklage, but it’s Liam Cunningham. Peter Dinklage is a good guess, because he is strongly associated with Game of Thrones, in my opinion. Factoid-style question answering is how OpenAI tests what information is contained in the language model (Radford et al., n.d.). Take a look at Janelle Shane’s Twitter thread below for an example of r/subredditsimulator.
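To show what “predict the next word, given all of the previous words” looks like in practice, here is a small sketch that loads the publicly released GPT-2 model through the Hugging Face transformers library and samples a continuation of a factoid-style question. The choice of library, prompt, and sampling settings are my own assumptions for illustration; this is not the exact setup OpenAI or r/subredditsimulator used.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the publicly released (small) GPT-2 model and its tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Who wrote the book Origin of Species?"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Generate a continuation by repeatedly predicting the next token.
output = model.generate(
    input_ids,
    max_length=40,        # total length including the prompt
    do_sample=True,       # sample instead of always taking the single best guess
    top_k=50,             # restrict sampling to the 50 most likely next tokens
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```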

Bringing it back to our task, the predictive text option on the iPhone is meant to help the user type a bit quicker and send that message out faster. This is an example of the virtual assistants that Shannon Vallor speaks of, “to aid our daily performance […] to carry out tasks under our direction” (Santa Clara University, 2018). For that purpose it works well, but for stringing a set of words together to form a coherent thought, it doesn’t work so well.

References