Speech-to-Text Task: revisited

Voice to Unicode to 159 scripted languages

The multi-modal Speech-to-Text (STT) function harnesses three powerful technologies of meaning-making: language, writing, and digital technology. While computers have had this function for a while, it has gained popularity since being incorporated into mobile devices and through advances in voice recognition software, Unicode use for text data, big data analysis, and machine learning. Built into search engines, digital platforms, and other software, STT will alter the established power dynamics between language and literacy.

Orality

The permanence of the essential orality of language (Ong, 2002) arises from the universal motivation and desire of humans to connect and communicate with others (Vygotsky, 1978). Through this socialization from birth onward, individuals “hearing the voices” (Hadley, 2019) surrounding them pick up a language. Therefore, the child’s perception of the world is not limited to visual input but is broadened through interaction with speech (Vygotsky, 1978). Vygotsky added that “the most significant moment in the course of intellectual development, which gives birth to the purely human forms of practical and abstract intelligence, occurs when speech and functional activity, two previously completely independent lines of development, converge” (p. 24).

Language is the universal motivation and desire of humans to connect and communicate with others.
(Vygotsky, 1978) (Public Domain Picture)

For centuries, through this orality, cultures have passed their collective wisdom, worldviews, and histories to the next generations. Moreover, this social interaction went beyond conversation and debate to other oral traditions such as stories, tales, songs, poetry, or tap dance to relay information. Orality takes place in real time with the listener inside the experience. However, this mode of passing on wisdom requires that the listener hear the sounds as they transpire, attach meaning to the phonemes heard, hold them in memory for some permanence, and possess the skills to pass them forward orally. “Oral tradition has no residue or deposit. When an often-told oral story is not being told, all that exists of it is the potential in certain human beings to tell it” (Ong, 2002, p. 11). As Gnanadesikan (2011) noted, the story is lost forever once its traces vanish from the last memory.

Literacy

Clay tablets inscribed with Linear B script, from the Mycenaean palace of Pylos. The text records the distribution of bovine, pig, and deer hides to shoe and saddle makers. Dated from 1450 BC.
(Wikipedia: Creative Commons)

Writing expresses language and is secondary to it (Gnanadesikan, 2011); however, the coding of phonemes into graphemes eventually came to be prized over orality because writing stores and transfers information externally (Schmandt-Besserat, 2009). Thus, “[the] representation succeeds because [it] enables their users to do more” (Schmandt-Besserat, 2009, p. 22). Nevertheless, throughout human history, only a tiny fraction of the world’s cultures have produced writing. According to Ong (2002), writing goes beyond merely transcribing oral performances, as some of these cultures produce strictly organized sequential compositions that stimulate cogitation, inquiry, and critique. Moreover, the introduction of the printing press standardized spelling, punctuation, and grammar and created grapholects such as standard Arabic, English, Spanish, and Mandarin over the numerous dialects found within their respective oral traditions. As a result, Scholes (1992) argued that standardized texts built fences for the privileged few and marginalized the rest, leading to alienation and social stratification. For example, in Canadian culture, the marginalized included women, speakers of regional dialects, Indigenous peoples, and racialized subgroups. “Marginalized voices have found expressions in forms too humble for canonization or [have been] already discarded” (Scholes, 1992, p. 152).

The Residue transcribed by the Speech-To-Text

This task reminds me of an activity I had to complete way back but I was studying at McGill in the Tesla department we had to interview an individual for a half an hour transcribe the conversation and then we had to analyze it and the point of the exercise was to note that when people talk there’s a lot of unknowns and ours they backtrack they self correct as well as wander the conversation one day the funny thing is that I used to do that activity when I was in speech I also assessor we would interview individuals for 15 minutes the test was three parts the first part was a question answer about familiar of subjects do you what kind of music do you like or do you like tonight sky of the day sky the second part we would ask us give the student a question and the students would need to speak for two minutes and it soliloquy or monologue and the third part was open Open conversation where I would ask a question and the the test he would respond and I could push the conversation as I needed to go to see how far they went they could go without falling apart basically in the interesting thing was we assessed him on how linear their conversation was did they have logic how much logic was in the conversation did they have transitions was the subject well developed for vocabulary we looked at how extensive and complete was the cowboy was it concise with an exact for grammar we looked at the density of grandma did an interview did it interfere with with the understanding did they have a simple grammatical structures like a past tense in articles and do they have complex grammatical structures and finally for the for the pronunciation we were looking at how easy was it first understanding individual plus the ability to move the voice up-and-down to give meaning through nuances of the voice note in doing this activity five minutes is a long time I know what the students two minutes when I practice with my students two minutes is huge even for me to speak it speak for two 
minutes is a long chunk because usually we we have feedback from other people so it’s hard to keep talking five minutes if you see a play angles premise that writing affects how we we use language that in reality we are checking the individuals on how well do they speak like writing I’ve had the pleasure and the experience to learn from individuals who have not spent a lot of time in school who are not literate in their own language and as refugees some of them had not much experience with printed words a few of these students come to mind there was one one younger younger students who he already could speak probably 10 languages from Africa and he was he was functioning in Canada he had gained employment he stayed at the job for four I think five years his boss would text him the information that he needed for the day first day assignments he would take this he would take his text and send it to his cousin who is in in Saskatoon and his cousin would phone them back and relay the message to him and then the student would dictate a message to his boss and I’m not so sure that the boss didn’t realize that he didn’t read it all now we think that just because they don’t read that they cannot function in with technology but he was quite good with a cell phone for this for them the cell phone was a speaking tool the phone allowed him to connect to other people or Lee and he was he was good at picking out the visuals on the phone he would memorize the symbols or the the icons that the curves nothing to know what buttons he needed to to hit we think that because they are illiterate that they are not functioning but some of them function quite well do you have a greater awareness of the environment around them they can reach faces is always a open book they can they they are not in a linear thought pattern so then they’re not stuck to one one part they can see they see a wider range and he wasn’t the only one I saw who managed to cope in Canada without having having 
language from his homeland I had another gentleman who had lived in the jungles in Burma Schools are not part of their life because he was spending the time surviving and watching for the army because were in the army cut them off guard and they would they would kill whoever that they caught so he finally got out of Burma was it a refugee camp in Thailand just across from the Bernie’s border his family was still back in Vermont so he would have to go back in the country on a regular basis taking his medicine and rice for the people who are still hiding in the jungle from the army and he said the army would sit on the on the the river banks waiting as a support to pick up anyone who was crossing over now working with him in Canada was quite interesting he had only been I think he’s been in school for one year and in his Burmese life and he didn’t really get the understanding of what the value of paper and Reagan was he used to drive on the link teaches crazy because he was in his own space his own world and on his own time and talked about how illiterate people do not have a sense of before before as a concept of writing and I’m not so sure of that because many students who didn’t have a close relation to to literacy because he spent a whole house trying to survive it is hard to read if you’re running for your life books are not the things that they carry they get rid of everything that’s not required yet the student still knew about the past they were trying to forget their past but they knew it existed they understood that people have been killed and they had gone to her is what they were trying was not remember the Horse what they seem to be lacking was a belief in the future they didn’t have dreams and they didn’t have goals they just lived in the present and I would imagine it is important to live in the present if you trying to stay alive I Contessa with the students I have today who are literate highly educated both in English and they’re in their first 
language yet these students have an inability to speak which is really interesting because speaking is the primary mode of communication for humans yet they speak by reading and they listen by reading and it makes me have to figure out how to give them an activity where they listen when they cannot look it up cc or transcribe transcriptions because you can never listen if you never listen and you can never speak if you’re always reading script in reality this to the two groups remind me of each other there’s a flipside of each other and it maybe it comes down to believing that you’re capable of doing it the story becomes more interesting when we are in the other technologies of school or education and digital technology mobile technology to name a few how to “that’s funny that’s all folks

What I meant to Say …

This task reminds me of an activity I had to complete way back while studying at McGill in the TESL Department. We had to interview an individual for half an hour, transcribe the conversation, and then we had to analyze it. The point of the exercise was to note that when people talk, there’s a lot of uhs and uhms; they backtrack, self-correct and wander in the conversation.

The funny thing is that I used to do that same activity when I was an IELTS speaking examiner. I would interview individuals for 15 minutes. The test was in three parts: the first part was questions and answers about familiar subjects such as what kind of music do you like or do you like the night sky or the day sky. In the second part, we would give the examinee a prompt. After that, they would need to speak about it for two minutes, a soliloquy or monologue. The third part was an open-ended conversation where I would ask a question. Then the testee would respond; I pushed the conversation as needed to see how far they could go without falling apart. The interesting thing was I assessed them on how linear their discussion was. Did they have logic in the conversation? Did they have transitions? Was the subject well developed? For vocabulary, I looked at how extensive and complete it was. Was it concise and exact? For grammar, I looked at the density of grammar errors. Did they interfere with understanding? Did they have simple grammatical structures like past tense and articles or complex grammatical structures? Finally, for pronunciation, I was looking at how easy it was to understand the individual plus their ability to move the voice up and down to give meaning through nuances of the voice.

A note about doing this activity is that five minutes is a long time. I know that the time feels enormous when the testees talk for two minutes or students practice for two minutes. But, even for me, speaking for two minutes is a lengthy chunk because usually, we have feedback from other people, so it’s much harder to keep talking for five minutes…

Orality through Digital Technology

Speech-To-Text (STT) augments language. The software converts phonemes not into graphemes, as writing does, but into numeric values. These numeric values correspond to characters defined in the Unicode Standard. According to Wikipedia (2021), 144,697 encoded characters allow the software’s interpretation of the phonemes to be represented digitally in 159 modern or historical scripts. For example, once decoded, DeLuca’s (2014) Shetlandic recitation can be recorded symbolically using any combination of the 159 scripts while it happens in real time, leaving behind the transcript as the residue. Good examples of STT functions are texting or obtaining an auto-generated transcript, where writing conventions are not necessary.
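To make the link between characters and numeric values concrete, here is a minimal Python sketch. It only illustrates how Unicode assigns a number (a code point) to each character and how that number is stored as bytes; it says nothing about how any particular STT engine works internally. The helper name `describe` and the sample string are my own assumptions for illustration.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name, UTF-8 bytes in hex) for each character."""
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"             # the numeric value Unicode assigns
        name = unicodedata.name(ch, "UNKNOWN")      # the official character name, if any
        utf8_hex = ch.encode("utf-8").hex()         # how the value is stored as bytes
        rows.append((ch, code_point, name, utf8_hex))
    return rows


# Illustrative sample: one letter each from the Latin, Arabic, and Han scripts.
for row in describe("Aع中"):
    print(row)
```

The same lookup works in reverse: `chr(0x41)` returns the letter A. The number is what the device stores; the script shown on screen is one rendering of it.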

Wikitongues: Christine Speaking Shetlandic (DeLuca, 2014)

Open Transcripts: English
(auto-generated)

Lack of Writing Conventions

However, as STT provides a verbatim transcription of the phonemes, the document lacks English writing conventions such as capitalization, punctuation, correct spelling, sentence use, and paragraphing.

Furthermore, as written communication is, for the most part, asynchronous, the author is not available for clarification. Consequently, the writer deals with the audience’s envisioned misunderstandings and questions before publishing. Hence, there is the expectation of adherence to prescriptive spelling, grammar, and format; anything less is considered substandard and therefore less worthy.

Accordingly, the written format introduces the concept and adds the controlling perspective, which the writer develops linearly, sequentially, and logically (Ong, 2002; Escobar et al., 1994). Each clause, like each sentence and paragraph, contains one thought unless connected with conjunctions or prepositions. Next, the writer adds signage through words and phrases that help the reader transition through the argument. Examples, explanations, and background information give greater context to improve comprehension and retention. Finally, the writer recapitulates the concept, thought development, and conclusion to aid the reader in recalling the journey’s main points.

A Scroll of Unpunctuated Stream of Thought

My transcript was an unpunctuated stream of thought in a scroll-like document, like the stream-of-consciousness writing of Kafka or Gao Xingjian.

Speech, after all, is a primary psychological tool used for reciprocal social interaction (Vygotsky, 1978). Thus oral English traditions can organically meander as the participants mediate and co-construct meaning during the synchronous encounter.

In addition, speakers use a range of pronunciation features such as pauses, speed, and tone to add the affective component into the mix. The word choice is also different. Speakers use more straightforward language overall, with fewer syllables. A speaker usually interacts in real time; thus, the speaker creates and corrects while simultaneously delivering the ideas. Therefore, it is not unusual to have false starts, backtracking, self-corrections, fillers, and a higher density of errors, resulting in a lack of accuracy in the STT transcripts. A clean recording requires that the speaker speak more slowly and in a more monotone or automated fashion.

STT does not consistently pick up accented speech or variance from the dialect that the programmers have designated as the standard. However, my phone promises that I can train the system to recognize my voice. Moreover, it is also unable to pick up the suprasegmental features used to impart meaning. For example, the speaker must instruct the software to use standardized writing conventions, such as punctuation, whereas in conversation a speaker only needs to add a long pause. Thus, capitalization and punctuation are lacking unless explicitly dictated. Another difficulty is the inability to pick up unstressed syllables such as suffixes, or function words like auxiliaries and prepositions, that speakers tend to reduce or minimize in their speech patterns.
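The idea of dictating punctuation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the command vocabulary in `SPOKEN_COMMANDS` and the function name `apply_dictation_commands` are hypothetical, and real dictation software defines its own commands and handles them in more sophisticated ways.

```python
import re

# Hypothetical command vocabulary (an assumption for illustration only).
SPOKEN_COMMANDS = {
    "new paragraph": "\n\n",
    "question mark": "?",
    "period": ".",
    "comma": ",",
}


def apply_dictation_commands(transcript):
    """Replace spoken punctuation commands with symbols, then capitalize sentence starts."""
    text = transcript
    for phrase, symbol in SPOKEN_COMMANDS.items():
        # Drop the space before the command so the mark attaches to the previous word.
        text = re.sub(r"\s*\b" + re.escape(phrase) + r"\b", symbol, text)
    # Capitalize the first letter of the text and of each sentence that follows . ? or !
    text = re.sub(
        r"(^|[.?!]\s+)([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )
    return text


raw = "do you like the night sky question mark i like the day sky period"
print(apply_dictation_commands(raw))
# prints: Do you like the night sky? I like the day sky.
```

Note the limitation: a sketch like this cannot tell the command "period" from the ordinary word in "a long period of time". A human listener resolves that difference through the speaker's pauses and intonation, which is exactly the suprasegmental information the transcript loses.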

Scripted or Memorized Orality

When a speaker delivers a written text orally (e.g., news, lectures, stories) or a memorized, prescripted speech like MacDonald’s (2016) TEDx Talk, the oral communication resembles a written text, with relatively low error density, organized structure, transitions, and more precise terms. Also, with practice, the person delivering a presentation, story, or recitation can create the illusion that it is not prescripted by modulating the voice and including the appropriate body language. The well-rehearsed, known text will resemble a written text and have greater exactness in its content. Ironically, when many English as a Second Language (ESL) students are asked to create a speaking sample, they submit a written text read aloud, unaware that a read text and a spoken text do not resemble each other.

Afford Access

Turning voice into text and text into voice increases accessibility for many marginalized individuals who lack the literacy skills to function through print. In my discourse with my phone’s STT system, I give an example of one of my students. He is illiterate and does not understand the purpose of the alphabet. Nevertheless, he is also a polyglot, knowing many sub-Saharan African languages and a few around South Africa. He is knowledgeable about the mobile phone, which is an extension of orality for him. He uses the voice-to-text function to reply to his employer daily. Now that he knows how to use the text-to-voice function on his phone, he no longer requires a scribe to interpret his boss’s texts communicating the work assignments. There are many others within our community for whom this combination of technologies (digital, writing, and speaking) would allow greater access to society.

Major African Languages in 2019
(Maps & Gabriel)

Will Speech-to-Text equalize language with literacy? First, it improves access to communication for the unstandardized margins of society. For example, in the “What is in Your Bag” post, I mentioned my privilege. Growing up on the edge of Canada’s bush country and being a reluctant reader, my privilege arose from having literate parents and having access to books in my home. Literacy provided options that were not available to my female peers. While I went on to university, many of them became mothers at the beginning of their teens.

Motherhood brings its rewards; nonetheless, their potential was short-changed by the reduced options available to them. At that time, many years ago, it was about having access to books; now, it is about accessing technology and affording the internet. Many female students from developing countries or marginalized communities whom I have instructed have a similar story of motherhood at a young age, like my peers of the past. Their education moves through improved literacy, not only with books but also with technology, which allows them to gain further education, employment, and access to the larger community. Changing their lives changes the options for their children. Perhaps it even allows them to dream.

References

Hadley, A. (2019, January 11). New Indigenous language app targets “21st-century” learners. CBC News. https://www.cbc.ca/news/canada/thunder-bay/indigenous-language-app-1.4970376

DeLuca, C. (2014, September 21). Wikitongues: Christine Speaking Shetlandic. https://youtu.be/m0EwquC6wBU

Escobar, A., Hess, D., Isabel, L., Silbey, W., Strathern, M., & Sutz, J. (1994). Welcome to Cyberia: Notes on the anthropology of Cyberculture and comments and reply. Current Anthropology, 35(3), 211–231. https://www.jstor.org/stable/2744194

Gnanadesikan, A. E. (2011). “The First IT Revolution.” In The writing revolution: Cuneiform to the internet (Vol. 25, pp. 1–10). John Wiley & Sons.

MacDonald, A. (2016, February 26). Oral Tradition in the Age of Smart Phones. https://youtu.be/egO_46P894k

Ong, W. J. (2002). Chapter one: The Orality of language. In Orality and literacy: The technologizing of the word (pp. 1–11). Routledge.

Schmandt-Besserat, D. (2009). “Origins and Forms of Writing.” In Bazerman, C. (Ed.). Handbook of research on writing: History, society, school, individual, text. New York, NY: Routledge.

Scholes, R. (1992). Canonicity and Textuality. Ed. Joseph Gibaldi. An Introduction to Scholarship in Modern Language and Literatures (2nd ed., pp. 138–158). Modern Languages Association of America.

Timpe-Laughlin, V., Sydorenko, T., & Daurio, P. (2020). Using spoken dialogue technology for L2 speaking practice: What do teachers think? Computer Assisted Language Learning. https://doi.org/10.1080/09588221.2020.1774904

Wikipedia contributors. (2021, October 12). List of Unicode characters. In Wikipedia, The Free Encyclopedia. Retrieved 22:26, October 13, 2021, from https://en.wikipedia.org/w/index.php?title=List_of_Unicode_characters&oldid=1049571352

Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.

Zada, S. A. (2020, August 27). What is Unicode? And why do I need it? A simple version. https://youtu.be/EGtcgMlyBhU
