[11.3] Algorithms of Predictive Text
by markpepe
Click on the tweet below to see the full thread.
The following set of tweets is an assignment for ETEC 540 Task 11: Algorithms of Predictive Text. #UBCMET #edtech
— Mark Pepe (@MarkMPepe) July 25, 2022
For this task, I expected to string together a set of words that formed a coherent thought, but after I typed the prompt, the predictive text led me only to nonsensical options. At one point, it suggested words in Spanish or Italian. In this case, predictive text did not capture how I express myself; there likely isn’t enough data about my writing to properly predict what I would usually say.
Stoop & van den Bosch (n.d.) give a clear explanation of how predictive text works:
“To be able to make useful predictions, a text predictor needs as much knowledge about language as possible, often done by machine learning […] This works by looking at the last few words you wrote and comparing these to all groups of words seen during the training phase. It outputs the best guess of what followed groups of similar words in the past.”
In their article, they discuss how an algorithm called k-Nearest Neighbours predicts text on Twitter. The algorithm builds a database from a user’s past tweets and then uses an approach called context-sensitive prediction, which depends on similar groups of words being available, on a list of words the author frequently uses, and on limiting the pool of candidate words based on the words already typed. The algorithm also models “friends” on Twitter, taking into account conversations, mentions, and similar accounts. The authors mention that the algorithm can create accurate predictive text for Lady Gaga and Justin Timberlake because the two are likely to tweet about similar and overlapping topics.
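The core idea of looking at the last few words typed and comparing them to groups of words seen during training can be sketched in a few lines. This is a minimal toy illustration, not the authors’ actual kNN implementation: it assumes a fixed context length `k` and a plain-text training corpus, and simply counts which words followed each context.

```python
from collections import Counter, defaultdict

def train(corpus, k=2):
    """Map each k-word context seen in the corpus to a Counter of the
    words that followed it (the 'training phase')."""
    model = defaultdict(Counter)
    words = corpus.split()
    for i in range(len(words) - k):
        context = tuple(words[i:i + k])
        model[context][words[i + k]] += 1
    return model

def predict(model, recent_words, k=2, n=3):
    """Return the n most frequent next words for the last k words typed."""
    context = tuple(recent_words[-k:])
    followers = model.get(context)
    if not followers:
        return []  # this context never appeared during training
    return [word for word, _ in followers.most_common(n)]

corpus = ("the cat sat on the mat "
          "the cat sat on the chair "
          "the dog sat on the mat")
model = train(corpus, k=2)
print(predict(model, ["sat", "on"]))   # ['the']
print(predict(model, ["on", "the"]))   # ['mat', 'chair']
```

Restricting the candidate pool, as the article describes, would amount to filtering `followers` against a list of the author’s frequent words before ranking.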
When I finished writing this task on Twitter, I couldn’t help but think of Reddit’s r/subredditsimulator. In this subreddit, only bots post and comment to each other, and each bot is a representation of its assigned subreddit. “It’s not a perfect recreation of Reddit, but an adequate caricature of its worst tendencies” (Khalid, 2019). It can be funny, sarcastic, mean-spirited, helpful, and reflective, and it can uncannily echo the real internet (Khalid, 2019).
r/subredditsimulator uses OpenAI’s GPT-2 language model. “GPT-2 is trained with a simple objective: predict the next word, given all of the previous words within some text” (OpenAI, 2019). GPT-2 was trained on 40 GB of internet text, a data set of eight million web pages, which allows it to answer questions, summarize, and translate. In a paper by OpenAI, GPT-2 answered questions like “Who wrote the book Origin of Species?” and “Who is the quarterback of the Green Bay Packers?” correctly and with a high degree of probability, over 80%. (The answers are Charles Darwin and Aaron Rodgers.) Even the questions it got wrong had closely associated answers. The largest state in the U.S. by landmass? It answered California; the answer is Alaska, but California is the largest state by population. Good guess! Another was, who plays Ser Davos in Game of Thrones? GPT-2 answered Peter Dinklage, but it’s Liam Cunningham. Peter Dinklage is a good guess, in my opinion, because he is strongly associated with Game of Thrones. Factoid-style question answering is how OpenAI tests what information is contained in the language model (Radford et al., n.d.). Take a look at Janelle Shane’s Twitter thread below for an example of r/subredditsimulator.
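GPT-2 itself is a large transformer network, but its training objective, predicting the next word given all previous words, can be illustrated with a toy stand-in. The sketch below is an assumption-laden simplification (a longest-suffix lookup over a tiny corpus, nothing like GPT-2’s actual architecture); it only shows what “a probability distribution over the next word, conditioned on the history” means.

```python
from collections import Counter

def next_word_distribution(history, corpus_words):
    """Toy stand-in for a language model: find the longest suffix of the
    history that occurs in the training text, count what followed it,
    and return those counts normalized into probabilities."""
    for length in range(len(history), 0, -1):
        suffix = tuple(history[-length:])
        followers = Counter()
        for i in range(len(corpus_words) - length):
            if tuple(corpus_words[i:i + length]) == suffix:
                followers[corpus_words[i + length]] += 1
        if followers:
            total = sum(followers.values())
            return {w: c / total for w, c in followers.items()}
    return {}  # history never seen at any length

corpus = "who wrote the book ? darwin wrote the book".split()
dist = next_word_distribution(["wrote", "the"], corpus)
print(dist)  # {'book': 1.0}
```

A real model assigns some probability to every word in its vocabulary rather than only to words it has literally seen after this context, which is why GPT-2 can produce plausible-but-wrong answers like “California” or “Peter Dinklage.”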
There’s now a subreddit that’s populated entirely by neural nets who are themselves simulating other subreddits.
They’re sometimes kind to each other, sometimes awful, and they keep trying to ban each other’s posts.https://t.co/Xv0WVNrw00 pic.twitter.com/NMfoVp9pms
— Janelle Shane (@JanelleCShane) June 5, 2019
Bringing it back to our task, the predictive text option on the iPhone is meant to help the user type a bit more quickly and send that message out faster. It is an example of the virtual assistants that Shannon Vallor speaks of, built “to aid our daily performance […] to carry out tasks under our direction” (Santa Clara University, 2018). For that purpose it works well, but for stringing together a set of words to form a coherent thought, it doesn’t work so well.
References
- Khalid, A. (2019, June 5). This AI-powered subreddit has been simulating the real thing for years. Engadget.
- OpenAI. (2019, February 14). Better Language Models and Their Implications. OpenAI.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (n.d.). Language Models are Unsupervised Multitask Learners.
- Santa Clara University. (2018). Lessons from the AI Mirror [Shannon Vallor].
- Stoop, W., & van den Bosch, A. (n.d.). How algorithms will know what you’ll type next. The Pudding.