Task 3: Voice to text task

Here is my unscripted and unedited five-minute story. It was input as voice, and ‘translated’ from voice to text using the Dictation function of the Messages app on an iPhone.

OK we are currently on the way to my favourite restaurant and I am going to narrate what I see along the way and don’t worry I’m not driving so safe we’re currently on SW. Marine Dr. crossing main street and we’re heading towards Coquitlam and I was also thinking if there’s not much going on along the way I will also tell the story of how I came across this restaurant my favourite restaurant so about a year ago I was watching Netflix and came across a Korean TV show it’s called let’s eat to where they eat Korean food and it looked really good so I googled where we can find good Korean good authentic Korean food in Vancouver and I came across three restaurants in Coquitlam they are on North Road and we’re heading to one of the three right now I guess going back to describing what I see are we are currently crossing Fraser Street and there is a gas station on there right now with the gas is at 2:25.9 right now which I guess is in the worst but also not very good what else can I talk about I guess this is kind of like a Vlog except in audio so I guess this is an a a log oh my gosh OK OK oh we are I don’t know what this is we are very close to ninth Street there’s also not too much going on there’s a bear traffic but not too bad I see I just saw a kayak on top of a car so that was interesting I’m not I don’t think it’s a very good story but this is what else what else can I talk about the restaurant I can talk about my favourite dishes at the Korean restaurant so there is a soup it’s called a hangover soup I really like that there is also a Korean rice wine I also like that a lot there is what else do I have also like a hot like a hot plate dish I order and then we’ve also tried different dishes as well and another thing I really like about going to this restaurant is it’s actually right next to one of my favourite grocery stores H Mart so afterwards if it’s not too late we can also do some grocery shopping and I will be all set in terms of food for the week so I am seeing here that I’m at four minutes and 30 seconds so thank you for listening to my first audio log and I am I am going to stop the recording now

My thoughts and reflections below are based on my understanding of this week’s readings.

How does the text deviate from conventions of written English?

I’m someone who tends to plan out the points I want to make next even as I am speaking, so considering that this task requires an unscripted story, I decided to situate myself in an environment where I would have to tell the story as it happens. The text above was captured when I was in a moving car, describing what I saw along the way.

As a result, I believe this text — unscripted, oral language translated into a written format using voice-to-text software — deviates from the conventions of written English in that it is more of a stream of consciousness, whereas written English is typically more intentional and structured.

To begin, whether I’m communicating in oral or written language, the language I use is “the product of [our] culture” (Haas, 2009, p. 15). My thinking itself is influenced by my understanding of the culture through language. Beyond the technology of writing, this involves considering the other technologies used within our culture, and how these technologies and writing mutually influence one another. For example, in my text I referenced “vlog”, which refers to “[a] blog composed of posts in video form” (Oxford University Press, 2021), a product that requires the understanding and use of technologies such as video production equipment and social media. I believe this reiterates Scribner and Cole’s (1981) “claim that the practice of literacy is itself deeply contextualized” (as cited in Haas, 2009, p. 19).

In reflecting on oral versus written language, a few themes jump out at me:

1) Translation from consciousness to text

Since my speech above, or “voicing” as Ong (2002, p. 13) might call it, was intended to be unscripted, I tried my best not to think about what I would say in my next sentence, and forced myself to blurt out my thoughts as they came to mind. Ong (2002) asserted that “[s]peech is inseparable from our consciousness” (p. 9), and Gnanadesikan (2009) that “[w]riting is therefore a process of translating time into space” (p. 3). In these terms, I think of this text as my consciousness and thoughts over time, which were voiced; this voicing was in turn captured as text by voice-to-text software, where text is a form of space.

For written text, consciousness is translated by the mind into writing, where decisions are made in terms of what is written and how it is written. Consciousness, conversely, is also influenced by the understanding of writing, as “writing is a technology that restructures consciousness of individuals who use it” (Ong, 1982, as cited in Schmandt-Besserat & Erard, 2009, p. 20).

2) Past versus present

Writing creates “a disjunction between past and present” (Goody & Watt, 1968, as cited in Haas, 2009, p. 11), and with this transcribed text, my thoughts over that five-minute period in the past can now be revisited at any time in the present, in the same way as written language. Similarly, since “information only exist[s] if someone could remember it” (Gnanadesikan, 2009, p. 2), this text can now serve as documentation of my thoughts at that point in time; for example, I don’t believe I would’ve noticed or remembered the kayak if not for this voicing and the resulting text.

3) Perceptions of scholarly work

I can imagine that if my unscripted, oral text were compared to this written post, my oral text might be regarded “as beneath serious scholarly attention” (Ong, 2002, p. 8), whether that is appropriate or otherwise, as Ong (2002) warned. This post, on the other hand, may be considered more scholarly.

4) Content

There are details in my text that I would not have normally included in my writing. For example, further to my reference above to the word “vlog”, I recall that as I voiced the word, I experienced Ong’s (2002) hypothesis that “[a] literate person … will normally … have some image … of the spelled-out word” (Ong, 2002, pp. 11-12). I pictured how it compounds the words “video” and “log”, and attempted to replace the word “video” with “audio” as I thought it would be more appropriate. This was captured in my voicing, but would most likely not have made it into my written language.

What is “wrong” in the text? What is “right”?

The “mistakes” in the text are marked and corrected below. Each mis-transcribed word or phrase appears in square brackets, followed by an arrow and its correction:

OK, we are currently on the way to my favourite restaurant and I am going to narrate what I see along the way. And don’t worry, I’m not driving, so being safe. We’re currently on SW. Marine Dr. crossing Main Street and we’re heading towards Coquitlam. And I was also thinking if there’s not much going on along the way, I will also tell the story of how I came across this restaurant, my favourite restaurant. So about a year ago, I was watching Netflix and came across a Korean TV show (it’s called Let’s Eat [to → 2]) where they eat Korean food and it looked really good, so I googled where we can find good Korean good, authentic Korean food in Vancouver, and I came across three restaurants in Coquitlam. They are on North Road and we’re heading to one of the three right now. I guess going back to describing what I see, are we are currently crossing Fraser Street and there is a gas station on [there → the] right now [with → and] the gas is at [2:25.9 → 225.9] right now, which I guess [is in → isn’t] the worst but also not very good. What else can I talk about? I guess this is kind of like a Vlog except in audio, so I guess this is an a “a-log”. Oh my gosh. OK. OK. Oh we are, I don’t know what this is, we are very close to [ninth → Knight] Street. There’s also not too much going on. There’s a [bear → bit of] traffic but not too bad. I see, I just saw a kayak on top of a car, so that was interesting. I’m not, I don’t think [it’s → this is] a very good story but this is, what else, what else can I talk about? The restaurant. I can talk about my favourite dishes at the Korean restaurant. So there is a soup, it’s called a hangover soup, I really like that. There is also a Korean rice wine; I also like that a lot. There is, what else do I have, also like a hot like a hot plate dish I order, and then we’ve also tried different dishes as well. And another thing I really like about going to this restaurant is it’s actually right next to one of my favourite grocery stores, H Mart, so afterwards if it’s not too late we can also do some grocery shopping, and I will be all set in terms of food for the week. So I am seeing here that I’m at four minutes and 30 seconds, so thank you for listening to my first audio log and I am I am going to stop the recording now.

I believe what is “right” with the text is that it captured most of the words I used. However, as for what is “wrong”, I’m reminded of Gnanadesikan’s (2009) assertion that with writing, “much information about the actual speech is lost, such as intonation and emotional content” (Gnanadesikan, 2009, p. 9). The text looks more polished than I anticipated. Some elements that would’ve contributed to the text being less polished are missing, such as:

- Pauses: I took pauses throughout as I was waiting for my next thoughts.
- The slight panic in my voice: I think there was a slight panic in my voice from trying to come up with thoughts and words, which is completely lost in the text.
- Laughter: I immediately regretted trying to come up with a compound word for “audio log” and started laughing to the point I was tearing up. I thought some form of that would be captured, but aside from the “Oh my gosh. OK. OK.”, there is very little trace of that in the text. (And if I were to be honest, I considered redoing my voicing because of that part, but I ended up keeping it simply because I thought it would be a good point to write about for this post!)

I also wonder how similar or different (or how much more “wrong” or “right”) the resulting text would’ve been if I had used different voice-to-text software. In terms of hardware, I was limited to my phone since I decided to narrate my journey, and I wonder if my desktop computer would’ve produced different results.

What are the most common “mistakes” in the text and why do you consider them “mistakes”?

Aside from the lack of punctuation, I think the software did a decent job and any mistakes were minor, such as capturing “is in the worst” instead of “isn’t the worst”, “there’s a bear traffic” instead of “there’s a bit of traffic”, and “ninth Street” instead of “Knight Street”. I consider these to be mistakes in the sense that they differ from what I believe I said and/or do not make sense in the context of the sentences.

The cases of “is in”/“isn’t” and “bear”/“bit of” make me wonder to what extent the software considers phonetics versus syntax. Reflecting on how I pronounce “isn’t” and “bit of”, I noticed that I tend to drop the “t” in “isn’t” and pronounce “bit of” closer to “bi’ov”, so I can see how they could sound like “is in” and “bear”, respectively.

I also found it interesting that, in reference to gas prices, the text was formatted as “2:25.9” instead of “225.9”. I wonder if it might have something to do with Canadian gas prices being listed in cents per litre. Since Apple is based in the United States (Apple, n.d.), where gas prices are listed in dollars per gallon, perhaps the software learned from datasets that are primarily American, in which “225.9” made little sense as a gas price. This makes me think of the Vygotskian theory that language is “the product of [our] culture” (Haas, 2009, p. 15).

What if you had “scripted” the story? What difference might that have made?

If I had scripted the story, I think I would’ve planned a much more logical flow in my storytelling, as opposed to the jumpiness in my text above. I also would like to think that I would’ve used fewer instances of “so”, “also”, “I guess”, and “what else”; in reading over my text, I noticed there are lots of these in my voicing!

In what ways does oral storytelling differ from written storytelling?

I think oral storytelling differs from written storytelling primarily in that oral storytelling could include context that comes from “tone, cadence, and tempo” as well as “intonation and emotional content” (Peña, 2022; Gnanadesikan, 2009, p. 9), which could give the audience clues on the intentions of the words presented. Similarly, as Haas (2009) pointed out, “Ong, Havelock, and Goody … each [identified the contrast between] the decontextualization of the spatial form (writing) with the contextual richness of the temporal form (speech)” (Haas, 2009, p. 12).

On the other hand, I think it is possible that written storytelling allows more room for the audience’s imagination, which could be interesting if more creative or personal interpretations of the text would strengthen the meaning of the text for the audience.

As a side note, the concepts of ‘unscripted versus scripted stories’ and ‘oral versus written storytelling’ remind me of a class discussion I participated in during my undergrad in Visual Art, on painting versus photography. I recall we discussed how painting is about making decisions on what goes within the frame, in the context of the artist’s interpretation of the world; whereas for photography, it’s about making decisions on what goes within and stays outside the frame, in the context of the real world. I think there are parallels between unscripted-scripted, oral-written storytelling, and painting-photography, in that the first of each set focuses on individual consciousness, and the second of each set focuses on (re)interpreting this consciousness in the world.

References

Apple. (n.d.). Contacting Apple. https://www.apple.com/contact/

Gnanadesikan, A. E. (2009). The first IT revolution. In The writing revolution: Cuneiform to the Internet (pp. 1-12). John Wiley & Sons. https://doi.org/10.1002/9781444304671

Haas, C. (2009). The technology question. In Writing technology: Studies on the materiality of literacy (pp. 3-23). Routledge. https://doi.org/10.4324/9780203811238

Ong, W. J. (2002). The orality of language. In Orality and literacy: The technologizing of the word (pp. 5-15). Routledge. https://doi.org/10.4324/9780203426258

Oxford University Press. (2021). Vlog, n. Oxford English Dictionary. Retrieved June 5, 2022, from https://www.oed.com/view/Entry/37710857?rskey=BUR6AY&result=1&isAdvanced=false#eid

Peña, E. (2022). [3.2] Before writing: Mapping the psychodynamics of orality. In ETEC 540: Text Technologies: The Changing Spaces of Reading and Writing. The University of British Columbia.

Schmandt-Besserat, D., & Erard, M. (2009). Origins and forms of writing. In C. Bazerman (Ed.), Handbook of research on writing: History, society, school, individual, text (pp. 7-24). Taylor & Francis Group.

2 Replies to “Task 3: Voice to text task”

jessica presta says:

June 8, 2022 at 8:52 pm

Hello Jocelyn!

So lovely to be connected again through this course. I also really enjoy the layout you are using with your blog posts. How are you managing to get these icons into your text? I also enjoyed your approach to the ‘what’s in my bag’ task with the Genially interactive image. Thank you for sharing your creativity!

Your voice to text story was quite entertaining! I could not help but giggle at the part where you said ‘Oh my gosh, OK, OK’. I am grateful you left that in! It definitely added that component of emotion to your storyline.

I wonder how effective these voice-to-text tools are for individuals who have accents or who are English language learners (ELL)?

1. Jocelyn Chan says:
  
  June 10, 2022 at 10:07 am
  
  Hi Jessica! Thanks so much for stopping by. I’m very happy we’re in the same course — I’m looking forward to your posts!
  
  For the icons, I have a paid account for Noun Project (https://thenounproject.com/) so I just have to search for the icons and change them to the colour that matches my site!
  
  Thank you for your kind words on my ‘what’s in my bag’ task. I had tons of fun building the Genially. :)
  
  I’m glad that part made you giggle! I’m thinking I probably wouldn’t have a very successful career in audio logging.
  
  I really appreciate your consideration of the effectiveness of voice-to-text tools for those who don’t use what may be considered the ‘typical’ varieties of North American English. I did a quick search and it looks like it is still an issue (Koenecke et al., 2020; Mengesha et al., 2021). This also makes me think back to one of the readings from ETEC 531 on the topic that AI may be ‘racist’, because “the raw data that [they] are using to learn and make decisions about the world reflect deeply ingrained cultural prejudices and structural hierarchies” (Benjamin, 2019, p. 59).
  
  References
  
  Benjamin, R. (2019). Race after technology: Abolitionist tools for the New Jim Code. Polity Press.
  
  Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., & Goel, S. (2020). Racial disparities in automated speech recognition. PNAS, 117(14), 7684-7689. https://doi.org/10.1073/pnas.1915768117
  
  Mengesha, Z., Heldreth, C., Lahav, M., Sublewski, J., & Tuennerman, E. (2021). “I don’t think these devices are very culturally sensitive.”—Impact of automated speech recognition errors on African Americans. Frontiers in Artificial Intelligence, 4. https://doi.org/10.3389/frai.2021.725911