Press "Enter" to skip to content

ETEC 540 Task 3: Voice to Text


For this activity, I recorded a conversation between myself and my three-year-old son, Samuel, following a busy morning of activities. I used Microsoft Word’s Dictate feature with a high-sensitivity microphone connected to pick up the conversation. To keep the exchange genuinely spontaneous, I told my three-year-old that the computer was listening to us talk, wanted to hear what he had to say, and would be writing it down in words, which produced significant interest from the toddler. Here is the unaltered transcription of our conversation.

How does the text deviate from the conventions of written English?

Well, the first and most obvious deviation is the lack of punctuation and formal grammar. The speech-to-text software failed to pick up on natural pauses that would mark the end of a sentence, and it did not use paragraphs to separate ideas. The second obvious standout is the failure to identify a second speaker; rather than registering a change in tone, volume, and cadence, the software simply transcribed whatever words it could catch, regardless of who was speaking. Descriptive language is also missing; a written rendering of this rolling dialogue would be filled with description that attempts to stand in for the absent non-verbal and tonal cues.

What is “wrong” in the text? What is “right”?

While efforts were made to ensure that Samuel was speaking loudly, the dictation software clearly had a great deal of difficulty picking identifiable words out of the toddler’s lisp and young-sounding voice. What is most obviously “wrong” is that a significant portion of Samuel’s responses were not transcribed in full; only a few words appear, often repeated, in gaps where he actually spoke several sentences. The sharp contrast between my deeper, slower voice and the cadence of his toddler speech creates a noticeable discord in the flow of the text when read above; all of the awkward and repetitive portions are where he was speaking (loudly) at the computer microphone. As such, rhythm, tempo, loudness, and pitch are lost in translation.

Beyond the second-speaker problem, the lack of tonal markers really separates this text from the feeling and atmosphere of the original conversation. The exchange was light-hearted and slow, with tonal variations ranging from excitement to joy and laughter, but these qualities are entirely lost in the direct translation to text. The laughter (right into the microphone) was not captured; losing such a meaningful interpersonal signal through the medium significantly shifts how the content may be received.

What works well is the reflection of slang and everyday vocabulary. The narrative is presented in unfiltered, everyday language, flowing naturally to convey ideas directly and concisely. The capitalization of “OK” reflects most of the natural topic-switching and acknowledges the signals being received. It also reflects the common language of the two speakers’ daily context, between father and son, where “Ok, so what then” is an established response formula used dozens of times (if not more) each day.

What difference would scripting make?

Scripting would rob the text of its natural flow, everyday language, and sentence structure. It would not be easy to write, and the result would likely sound significantly more formal and less spontaneous, especially given the communication differences between a toddler and an adult. It would be difficult to author authentic, “unfiltered”-sounding toddler responses as an adult, even for my own child.

I suggest that scripting would allow for a clearer transmission of the message and could include more vivid descriptors to support a fuller understanding of the many aspects of the exchange. However, the authenticity and realism of the language between the two speakers would be sacrificed in favour of the intended message.

In what ways does oral storytelling differ from written storytelling?

Aside from what has already been mentioned, a significant number of variables change between the two modes of storytelling. All visual and minor linguistic cues are either lost or become somewhat flatter and perhaps “forced.” Responsive, real-time facial expressions are missing, alongside tonal fluctuations, hesitations, sighs, posture (and its shifting), pacing, natural pauses, errors, and backtracking. These are the differences that stand out, but they are far from an exhaustive list.

In completing this exercise, I’ve come to appreciate how thoroughly written and prepared communication is pruned compared to oral communication. The dynamics of oral communication are so numerous and so particular that it is impossible to capture them all in text without an overabundance of descriptors so bulky that the result would be hard to read fluidly. When mediated by technology, at least in its current common state as experienced in this task, an ugly in-between is reached that demonstrates the strengths of neither method.
