speech-to-text – Tech, Text & Thoughts Regarding

In our third week of ETEC540, we were tasked with relating an unscripted narrative to a chosen voice-to-text application, recording the outcome, and analyzing the degree to which the result deviated from English language conventions. We were also instructed to observe what we believed to be ‘right’ and ‘wrong’ within the recorded text, and to make an intentional link between the distinctions of oral and written storytelling.

I had fun with this experiment, and employed the voice-to-text program SpeechNotes (https://speechnotes.co/) in a number of scenarios. I recorded myself narrating a portion of my lesson on The Alchemist to my class, I documented a phone conversation between my partner and me to observe how accurately voice-to-text could transcribe speech heard through a separate technology, and I chronicled a conversation I had with a colleague at work.

There are some surface-level connections between my work and that of my colleagues: Manize Nayani and I both used SpeechNotes, while Olga Kanapelka used the Dictation tool on her Windows computer. We all recognized that dictating each punctuation mark aloud to the program would have drastically changed the meaning of the text, but agreed that this should not be a necessary step. Regardless, one of the most commonly agreed upon ‘mistakes’ in the voice-to-text scenario was the absence of grammatical and structural conventions. These absences show up most often in basic punctuation and mechanics such as commas, periods, and capitalization, and they give credence to the assertion that voice-to-text technology cannot yet adequately infer those written symbols from oral speech. Both Olga and Manize reflected on this idea, and went on to suggest that many structural components of writing were also missing from the text. For example, one of the more difficult aspects of comprehending a voice-to-text block of writing is that ideas are not organized into sentences or paragraphs. Comparing our voice-to-text products makes it clear that, no matter which tool is used, grammatical and structural conventions remain scarce. The lack of these conventions, coupled with the inability to punctuate, makes it difficult to interpret the narrative essence of the text.

There are, however, some deeper connections between Olga, Manize, and me: our voice-to-text bodies of writing were each shaped by the influence of an accent. All three of us reflected on the accuracy of spelling and the level of comprehension within our bodies of text. We all touched on the degree to which accents played a role in meaning-making within speech-to-text outputs, both in the sense of the program understanding what was spoken and in the sense of ensuring the written product was intelligible.

Manize revealed that English is her second language, as she moved to Vancouver from Mumbai, India some years ago. She seems to imply that many of the words picked up incorrectly were a result of her accent. She also suggests that having a scripted story would have permitted her to speak with more clarity and would have decreased the number of spelling mistakes. Similarly, Olga discloses that English is her second language and specifies that English vowels are the most difficult for her to pronounce. When prompted to think about how the written output would differ had it been scripted, Olga seemed to suggest the same idea as Manize: that a script would have aided clarity and cohesion, ultimately resulting in a more readable text.

Olga provides a clear example of how her accent directly affects the voice-to-text transcription program:

Olga was clear and intentional about how her accent could be misconstrued by the program. This was interesting to me, and indicated that voice-to-text technologies do not listen for context; they simply listen for sound. In other words, they listen, but they do not hear. On a separate but related note, I find it ironic that many of our chosen AI voices (think GPS units) can be set to speak in a plethora of accents from across the world, yet voice technology struggles to decipher accented speech. I wonder, for example, if the technology behind the Australian GPS voice could effectively transcribe a true Australian accent.

Although English is my primary language, and I do not speak with an accent (although some here in Vancouver think I speak with an Ontario or ‘Toronto’ accent), I recorded a conversation with a colleague of mine who speaks with a very thick English accent. The results were astounding in comparison to my original spoken narrative. Perhaps it was the fact that this was a conversation in which more than one person was talking, or that my colleague’s accent made it difficult for the voice-to-text program to discern what was truly being said, but the entirety of the text is blatantly incoherent. It was a stark contrast to the texts of my two colleagues, which, despite scattered errors in spelling and coherence, were predominantly intelligible.

Ultimately, it seems as if we all agree there is a certain level of flexibility when it comes to oral storytelling. Despite the mnemonic element required in retelling a narrative, the story does not necessarily follow a strict sequential structure. Verbal strategies like emphasis, energy, intonation, volume, and pace can all contribute to the (in)effectiveness of orality, while in written narratives these elements are much more limited. Perhaps they do appear in writing, but in a fundamentally distinct way (punctuation?). I would even go as far as saying that an accent bestows a narrative with more character and authenticity. Moreover, there is a certain level of grammatical forgiveness in orality: audiences are much more lenient when it comes to the variety of ‘mistakes’. There is no deleting an oral story, but there can be correction.

Bauman, R., & Sherzer, J. (Eds.). (1989). Explorations in the Ethnography of Speaking (2nd ed., Studies in the Social and Cultural Foundations of Language). Cambridge: Cambridge University Press. doi:10.1017/CBO9780511611810

Gnanadesikan, A. E. (2011). The first IT revolution. In The writing revolution: Cuneiform to the internet (Vol. 25, pp. 1-10). John Wiley & Sons.

The Boy and the Spoon

Truthfully, I went to great lengths to experiment with this task. I used speech-to-text technology to record various conversations I had with colleagues at work and to analyze if and how those conversations evolved. Those results were quite funny. I recorded a phone conversation with my partner to see how accurately speech-to-text would pick up speech output from other technologies. It was more accurate than I anticipated. Ultimately, I decided to use a recording of myself narrating a story told within The Alchemist to my English class (they were thrilled I was able to involve them in this task).

I’ve taught this book a number of times and so the story about the boy and the spoon is one that I am quite familiar with and can recite from memory. It’s a story about balance and how that balance contributes to happiness in our lives. I typically close with asking my students what the oil is representative of and we have a discussion about what this story means, and how it can be applied practically. The text of my narrative is as follows:

So I want to tell you a story that appears in The Alchemist the story about happiness it's a story that's good that gets related by Santiago throw to Journey so there is a young boy who lives in a village and he wakes up and he philosophizes about life and decides that he wants to find the answer to what is happiness and how do I achieve it so we asked his that he asked his father where he can find the answer to this question his father tells him that he can ask the wise man there's a wise man that lives not too far from their Village and he would need to track in Journey to see the wise man and ask him the secret to happiness so the boy undertakes this journey outside of his town walks a long way down ashley finds this Grand Palace if he thinks that you walk up to the Palace knock on ask the wise man a question and you'll find the secret to happiness but that's not what happened in fact he walks into the palace and he sees this man surrounded by Merchants Travelers journeyman they're all talking and having a diet after a long wait the boy lee has his opportunity to ask the man of the secret to happiness and the man responds to him son before I answer your question I want you to take this spoon to give the boys spoon fills it with oil and he says I want you to walk around my palace with this spoon and not drop not one drop of oil so the boy grease and he walks around the palace he spends 4 hours walking around this beautiful palace tapestries paintings and fountains Gardens and all the things he walks through and finally he comes back to the wise man with his spoon full of oil and the wise man says well what did you see and the boy cannot respond he didn't see anything he was focussed his entire time on not spilling the oil so the man says well I can't answer your question just yet in fact I need you to continue to walk around my Palace and I want you to come back in 4 hours time and tell me all the beautiful things that you've seen in my Palace so the boy 
continues with the spoon walks around the palace sees the beautiful tapestries the forest the gardens The Fountains the paintings and sculptures finally comes back relates to the old wise man what he had seen and the old old wise man asked him what happened to the oil because in the process of him taking in all the beauty and experience around him he spilled all the oil and he says son that is the secret to happiness the secret to happiness is balance balancing the oil and with all of the things and experiences that surround you in life what is the oil

Analysis

When we attend to the deviations within this body of text, the most glaring issue is the absence of basic English grammatical conventions. Rarely, if at all, do we see periods, commas, quotation marks, question marks, or exclamation marks. Paragraphs are not used at all to space out and organize ideas. Capitalization is used haphazardly, and there are a number of ‘misheard’ or missing words that fundamentally change the context of the story or undermine its plausibility. Perhaps this is the English teacher in me, but the entire story is one long run-on sentence. The absence of these grammatical conventions is what I would consider ‘wrong’ with this text when we look at it from a purely textual perspective. These deviations from the customary processes of written English diminish the significance, impact, and overall meaning of the narrative when it is read. By comparison, when we speak to one another, we do not announce a comma, period, or any other grammatical symbol aloud; rather, it is expressed, implied, and embedded within our spoken language conventions. Speech-to-text technology cannot infer this, at least not accurately. To indicate a period or comma, the speaker must say its name aloud for the technology to pick up on it. Imagine speaking like that to another human being…

Perhaps the mistake is mine, in not effectively speaking the language needed to operate speech-to-text technology successfully.

When I related this story to my class, I was animated in its retelling, used various modes of intonation, gesture, and volume, and injected fierce emotion to keep my audience engaged. It does not seem that this recording captured these elements at all. I’m not entirely sure I consider this a mistake, but I certainly feel like this recording did not do my performance justice. This experiment revealed to me the relationship between grammatical conventions and spoken emotion, body language, and intonation. Speech-to-text technology, and writing itself, are unable to effectively capture the elements used in spoken storytelling. Despite the symbols used to appeal to these sentiments, it doesn’t seem like our writing can ever truly re-animate the aspects of the spoken word, although it may be able to inform the way we read the written word. Walter Ong mentions this idea in his book Orality and Literacy: The Technologizing of the Word:

“It would seem inescapably obvious that language is an oral phenomenon. Human beings communicate in countless ways, making use of all their senses, touch, taste, smell, and especially sight, as well as hearing … Some non-oral communication is exceedingly rich—gesture, for example. Yet in a deep sense language, articulated sound, is paramount” (Ong, 2002).

Body language and other nonverbal cues account for a large share of effective human communication: hand and facial gestures, voice volume and intonation, the unconscious reading of facial muscles, and eye contact, among various other things. These are also aspects of communication that cannot be effectively paralleled in text-based communication. Despite the rise of emoji usage, icons meant to convey the unspoken and emotional fundamentals of communication can’t convey these aspects to the same degree. As Ong suggests, there is something unique about human articulated sound that results in deeper meaning.

An interesting aspect of the story I chose is that it is a narrative that has been written down; it can be found within The Alchemist. I decided to relate this story from memory, and it’s revealing to see the differences. Had I chosen to tell the story using the written version, I feel my retelling would have been more measured and rhythmic; I would essentially have been following a script of symbols annotating for me the ways in which the story should be related. Recalling the story from memory, I felt that I was afforded a lot more freedom to narrate it as I saw fit. I was able to repeat certain aspects and emphasize important plot points with visual or verbal gestures; ultimately, the story became my own to tell.

Similarly, it was interesting to see that when the narrative text was compared to the recording of my colleagues and me having a conversation about the different assessment strategies we use as humanities and science teachers, the narrative recording was far more accurate. I think it warrants a mention that my colleague has an incredibly strong English accent, and I’m convinced this played a major role in the inconsistencies of the following text recording. This is another aspect of speech-to-text technology that could be perceived as ‘flawed’: its inability to pick up on accented language or dialectal speech.

As someone who speaks dialectal Italian, I am not surprised that accents and dialects can lead to miscommunication. When I speak to other Italians, my utterances are often met with looks of confusion or laughter; my dialect is not seen as the true or purest form of the language. Thus, I wonder if the same can be said for the following recording: my colleague is heavily accented, and this gave way to severe confusion when reading our recorded conversation.

Okay so basically give you copper sources i'm going to go out to eat on some questions on the sources cited fast one strand Moltres chase I was wondering the true then anisocytosis be by Sousa Center might as well stop by if you detox metals to write summary about you understand about similar Behavior drivers I meant turn you up spelling punctuation that's all that they remarked on I was an English teacher. So it's has nothing to do with the content where am I 3 secret right now we didn't surprise you points so i'm not response you should read through the achievements on side honestly each section using the comments the qualities that make demonstrate as instructed during standardizing you can. Why are these statements heavy guy assessment objectives identify and cut right explicit and implicit information ideas Lexington 5 evidence from the text explain, what is use language instructor to achieve effects in infants read this using relevance of a terminology cuz of views this is more about the structure can the rating men it is both content pupusas all this is what you have in a response receptive. 
Play relevant summary some attempts of summary limited summary hot mop English papers and the arts for that reason like is so much judgment made on your point or there is but I mean you do follow some type of criteria right it goes back to what we were talking about there with with the strands in the the compass talking about right so if if a student is able to relate to me that he understands Canta knowledge if you can apply it and like make inferences like that saying there then that would be the inquiry if you can apply it to something outside extra-textual or even something else in the text there's application and communication just overall is he able to to to express himself in writing when you have those pillars I feel like it's a lot easier to understand what you're looking for mean enjoyed English stop the stop of The Green Mile what was The Green Mile yet was when Tom Hanks it all the old Tom Hanks is in the whose wagon wheels are killer man they're so killer


LINK 3 – DEVIATIONS IN CONVENTIONS: VOICE-TO-TEXT AND THE ACCENT

Task 3 – The Boy and The Spoon: A Speech-To-Text Analysis