Task 3: Voice to Text

Context:

  • I decided to record 5 minutes of myself rehearsing with my Grade 9 Concert Band class. We were doing warmups and talking about concepts such as breathing, airflow and articulation.
  • I used the dictation feature on the 2022 version of Microsoft Word with the built-in microphone on my Dell work laptop.
  • I teach with a microphone through an external sound system, which also affects the output of the dictation feature.
  • The ensemble may be playing their instruments while I am talking through the microphone.
  • I have inserted my Speech-To-Text (STT) transcript for your reference here: ETEC 540 Task 3 Transcript

How does the text deviate from conventions of written English?

The dictation is extremely inaccurate and unfiltered compared with the conventions of written English. From grammar to comprehension, the software needs to improve in several areas before it can be used successfully. If a third party read over the dialogue, I believe that person would not be able to comprehend what is occurring in the conversation without prior context. The dictation is also a testament to how spoken English differs from written (“proper”) English. There is cultural slang, along with incomplete sentences that my Grade 3 classroom teacher would be disappointed in. The way people think about language when speaking (at least in my case) is different from the thought process behind written English. It opened my eyes to some of the tendencies I have when speaking “without thinking”. In written English, there can be a third-party narrator guiding the audience and providing context. I find that role critical to understanding text in the English language.

Amalia E. Gnanadesikan notes that writing can represent phonemes and syllables, but it does not record actual speech, and the same holds for dictation. Speech, in this sense, includes emotion and subtle nuances such as tone, intonation, tempo and pitch.

“Technicalities aside, an important point here about these abstract phonemes and syllables is that although writing represents information about how words are pronounced, it does not record the identifying details of any individual utterance of those words. It records language, but not actual speech. Even in cases of dictation or courtroom stenographer, much information about the actual speech is lost, such as intonation and emotional content.” (Gnanadesikan, 2008)

As a music educator, I find it interesting that many software programs can already analyze and interpret intonation and pitch in real time. For example, SmartMusic allows a student to perform etudes while the software flags incorrect pitch, tuning, rhythm and volume. With this knowledge, I believe it is possible for speech-to-text dictation to improve. It may simply require more development focus on the software.

I can also understand that, since dictation software is not commonly used by the general population, there may not be a pressing need to improve it.

What is “wrong” in the text? What is “right”?

Wrongs:

  • The text is quite broken in relation to “proper” English. A written sentence usually conveys a complete thought; in my transcript, however, there were often fragments and a lack of cohesion throughout the entire script.
  • There were many grammatical errors, from tenses to sentence structure. I wonder whether they come from my own speech or from the software's interpretation of it.
  • Surprisingly, there were words that were spelled incorrectly. This may be due to the vocabulary used during our music ensemble's rehearsal time. Words that are uncommon in everyday conversation were most often the ones misspelled.

What was right:

  • The one thing that felt “right” was that the software kept pace with my speech throughout the entire 5 minutes. There were moments where the ensemble would play while I was talking and, surprisingly, the software kept up. At no point did “noise pollution” stop the software from typing out my speech. The technology is also faster than manual transcription.

Overall thoughts:

Initially, as I was doing my readings for the week, I believed that STT technologies would be quite useful in terms of accessibility. The idea of doing the 5-minute recording came from the concept of using STT for individuals with hearing impairments and/or language barriers. After a recording of only 5 minutes, my feeling is that STT technology is simply unreliable. There are accuracy concerns, technical barriers, a learning curve to use the software efficiently, and hardware limitations as well. STT is a technology that I would categorize as “niche”. It is like the car brand Lucid putting a drone in the hood of their electric car. The advertisement showed all the unique capabilities of having a drone while driving; in practice, however, it is not reliable to use. The potential uses of STT technology seem almost unlimited, from accessibility support to connecting with cultures around the world without language barriers. However, the technology is not yet at a point where the general population can use it reliably. Perhaps there are useful situations that I have not been exposed to yet, but as a consumer, after 5 minutes of use, I was turned off from it. To end on a bright note, I am excited to see STT technology advance and be applied in a variety of contexts. I could see it being used in many settings, from a classroom to a courthouse to even a hospital. In the future, STT may be able to break language barriers, and that thought is truly exciting.

References:

Gnanadesikan, A. E. (2008). The writing revolution: Cuneiform to the internet (Vol. 8). John Wiley & Sons.

Ong, W. J. (2002). Chapter 1: The orality of language. In Orality and literacy: The technologizing of the word (pp. 5-16). Routledge.