Story, as produced by the speech recognition model
Finding something to talk about for five minutes unstructured is difficult, but thankfully something happened this week that made it a little bit easier. It’s a two-parter. The first part is that we had two exchange students coming from New Zealand. They’re staying with a couple of our students’ parents’ homes. They’re going to be working and coming to the courses, they’re going to school, they’re going to experience the life of our school, they’re going to experience Canadian life as well. They’ve never been to Canada so this is a new experience for them. They’re going to take our courses, they’re going to do homework, they’re going to take tests, they’re going to experience the Canadian school system, an education system which is close to their own but also has quite a few differences. And they’re going to also at the same time meet new people and sightsee around the city. And it’s been very interesting to watch these two kids come in and just kind of already fit well with the area of the school, fit well with the environment. And it seems like they’re already students in our school, the culture is different, but you wouldn’t notice difference if you were walking by. And of course, you know, they’re taking different courses from New Zealand, and different cultures in New Zealand, and different experiences in New Zealand, and having them experience the Canadian world around them, you know,
We had to make them accounts for our intranet, we had to make them accounts for Schoology to get them ready with everything, give them an email address. They are officially students at our school and it’s amazing to see them fit so quickly and well with our school and just enjoy, see them enjoy their times here at our school. The second part of the story is We actually had another group of New Zealand students, but a soccer team, coming and visiting our school on a North American tour. They’re visiting North America, they’re going around different schools in the United States and Canada, and they’re just sightseeing and they’re also participating in some soccer events.
So, our school was playing them in a friendly team battle on this past Friday and they were a formidable foe to fight against.
Usually our school is pretty good with soccer, so they usually win their games, but they tied with this team because these guys were really good. These guys were top-notch in terms of the style of their play in soccer. And they also stopped by an assembly organized by our elementary school kids. And the elementary school kids had a lot of questions to ask this New Zealand soccer team. They asked them what their favorite food is, they asked them what their favorite sport is, What’s the difference, what’s their favorite type of experiences in New Zealand, and what if they notice any similarities of those experiences in New Zealand and in Canada. And our elementary students introduced them to Terry Fox and what he stood for and his experience was and what he did for cancer research and they also participate in the Terry Fox run that we did at the school.
Thinking about Whisper and the Story
Having recorded myself speaking, I ran the recording through a recently released AI model, Whisper (using its medium model size), to see what it could do with the audio. As usual with this model, the results in terms of word recognition were quite good; that is, the words I intended to speak were, almost always, reflected on the page. I should say that my own frame of reference comes from working, some years ago, with assistive technology. Though I mainly concentrated on visually impaired users, some had difficulty typing and wished to use dictation in addition to a screen reader. Helping them with dictation was often very difficult: they had to train the software to understand them, and that training required them to read back text which was not accessible to the screen reading software. Even once the training was done, the interaction between the two technologies often caused problems with correction, and, of course, the more the technology misunderstood words, the more issues arose in correcting it. These problems lessened over time and with further training of the program, but were never completely eliminated. The model I am using here can, like those older programs, be run locally; however, it was not trained on my voice at all. What the model got most right was the recognition of actual words, and this is the most surprising thing to me about speech recognition with these fairly new models: I am used to dictation programs which get far more wrong.
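For context, the local run described above can be done with the open-source openai-whisper Python package. This is a minimal sketch, not the exact commands I used; it assumes the package (and its ffmpeg dependency) is installed, and the file name recording.wav is invented here for illustration:

```python
def transcribe(path: str, model_size: str = "medium") -> str:
    """Transcribe an audio file locally with a Whisper model of the given size."""
    # Imported lazily so the sketch can be read without the package installed;
    # "whisper" here is the open-source openai-whisper package, an assumption.
    import whisper
    model = whisper.load_model(model_size)  # downloads the weights on first use
    result = model.transcribe(path, language="en")
    return result["text"]
```

Calling something like `transcribe("recording.wav")` then yields the raw transcript text, punctuation and all, which is what appears at the top of this post.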
The text’s punctuation, presentation, and formatting are not nearly so good, judged by the conventions of writing. The model attempted to split the text into chunks and, while this worked fairly well for sentences, it worked less well for presentation and formatting. Further, the punctuation, even within sentences, was by no means perfect. For example, I said that “They’re staying with a couple of our students’ parents.” The model produced “They’re staying with a couple of our students’ parents’ homes”. I am not sure where it got “homes”, and the apostrophe after “parents” probably came from that same mistranscription. This would probably have been worse with several speakers: having tried this model with many speakers in one file, I have found that it attempts to distinguish one speaker from another but usually does not do so accurately against the audio recording. Again, it would sometimes understand “a” as “the”, and part of the story simply was not transcribed at all, though the microphone was working. These are the errors I would classify as mistakes: places where the text does not match the recorded audio in some way.
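Mismatches like these can be surfaced automatically by diffing the intended wording against the model’s output word by word. A small sketch using Python’s standard difflib, with the two example sentences taken from the story above (the diffing itself is my own illustration, not anything Whisper provides):

```python
import difflib

# What I intended to say versus what the model transcribed.
intended = "They're staying with a couple of our students' parents."
produced = "They're staying with a couple of our students' parents' homes."

# ndiff marks words only in the first string with "- " and words only
# in the second with "+ "; unchanged words are filtered out here.
diff = difflib.ndiff(intended.split(), produced.split())
changes = [d for d in diff if d.startswith(("+ ", "- "))]
print(changes)
```

Running this flags the substituted “parents.” and the inserted “homes.”, which is exactly the kind of mistake described above, though a word-level diff would of course also need audio-aligned ground truth to be useful at scale.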
The other issues, and possible mistakes, in the text arise from my own speaking into a microphone. Had I scripted the text, it would certainly have been far less repetitious. It is also worth noting that, had I been telling the story to another person, I would have expected, and probably wanted, to be interrupted and asked questions. I would also have been watching my listeners’ interest and attention to learn which details they wished to hear about, so that I could dwell on those. Dictation stands in a middle ground between discussing with a person or group and writing with an instrument: the atmosphere which produces the give and take of conversation, even if there is only one speaker most of the time, is missing. Telling a story differs from writing it in several ways, of course: it is less formal, unscripted, and unformed until it is spoken. But telling a story in one set of circumstances also differs from telling a similar story in another, depending on the listener, whether that listener is a person, a group, or an artificial listener like a computer.