Final Project: Describing Communication Technologies

The PDF version: Text and Speech Recognition Technologies.



OCR (Optical Character Recognition)

History (Chaudhuri et al., 2016)

The inception of character recognition technology occurred in the mid-1940s with the advent of digital computers. Initial efforts in automating character recognition focused on machine-printed text or a limited set of clearly defined handwritten text and symbols.

By the mid-1950s, Optical Character Recognition (OCR) machines became commercially available.

The OCR systems introduced between 1960 and 1965 were commonly called first-generation OCR, characterized by constrained letter shapes specifically designed for machine reading.

In the mid-1960s and early 1970s, second-generation reading machines emerged. These systems recognized regular machine-printed characters and possessed capabilities for recognizing hand-printed characters.

The mid-1970s saw the introduction of third-generation OCR systems, marked by significant advances in hardware technology that achieved low-cost, high-performance objectives. This led to the development of sophisticated OCR machines catering to a broader user base.

Despite the commercial availability of OCR machines from the 1950s onward, only a few thousand systems had been sold by 1986, primarily because of their high cost.

Substantial progress in OCR systems occurred during the 1990s, leveraging new development tools and methodologies empowered by the continuous growth of information technologies. In the early 1990s, combining image processing, pattern recognition techniques, and artificial intelligence methodologies significantly enhanced OCR capabilities.

Today, advancements continue with more powerful computers and precise electronic equipment, including scanners, cameras, and tablets.

Applications

Assistance for the visually impaired: OCR, coupled with speech synthesis systems, empowers individuals with visual impairments to comprehend printed documents (Chaudhuri et al., 2016).

Automated license plate recognition: Several systems for automatically reading car number plates are available. Unlike many other OCR applications, the input is not a naturally bilevel image and must be captured by a very fast camera (Chaudhuri et al., 2016).

Automated cartography: Using computer technology and algorithms to create, analyze, and interpret maps. Recognizing characters from maps poses unique challenges due to the intertwined symbols and text that appear at varying angles and fonts.

Language translation: This application facilitates the translation of printed or handwritten text into different languages.

Banking applications: The primary application of OCR is found in the banking industry. The system verifies the customers’ identities by comparing their signatures to patterns stored in a reference database (Chaudhuri et al., 2016). Moreover, OCR proves highly beneficial in ATMs, where designated customers can use their mobile phones to scan and deposit checks (Sarika et al., 2021).

Document digitization: Document digitization uses OCR to convert scanned images of documents into machine-readable digital text. According to Sarika et al. (2021), it is primarily employed to modernize libraries and provide online services.

Challenges (Awel & Abidi, 2019)

Many OCR techniques face accuracy problems for the following reasons:

Complex scenes: Separating text from non-textual content (buildings, paintings, and so on) in the input data complicates preprocessing, thereby impacting character recognition.

Varying lighting conditions: Images captured by cameras are susceptible to the influence of varying light conditions and shadows, complicating the task of detecting and segmenting characters.

Skewness and rotation: Photography with a camera is often affected by incorrect image angles, leading to inaccurate results.

Blurring and degradation: Blurring and degradation occur when pictures are taken from a distance, capture moving subjects, or lack proper focus.

Diverse fonts and styles: Connected or overlapping characters, as in Arabic script or italic type, make it difficult to accurately detect and segment words into individual characters.

Multilingual settings: Multilingual text, and especially languages with very large character sets such as Chinese, presents unique challenges.

Damaged documents: When dealing with input documents that are extremely aged and damaged, the presence of extensive noise often results in the unintentional loss of essential content or characters.
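The noise sensitivity described above is easy to see even in the simplest recognition scheme. The sketch below is a toy template matcher in the spirit of the early constrained-font OCR machines, not any production algorithm: glyphs are 5×5 binary grids, classification picks the template with the smallest Hamming distance, and a few flipped pixels immediately worsen the match score. All glyph patterns are invented for illustration.

```python
# Toy template-matching recognizer: glyphs are 5x5 binary grids ('#' = ink),
# and an input is classified by the template with the fewest differing pixels.
# The glyph patterns are invented illustrations, not a real font.

TEMPLATES = {
    "T": ["#####", "..#..", "..#..", "..#..", "..#.."],
    "L": ["#....", "#....", "#....", "#....", "#####"],
    "O": ["#####", "#...#", "#...#", "#...#", "#####"],
}

def hamming(a, b):
    """Count the pixel positions where two 5x5 glyphs differ."""
    return sum(ra[i] != rb[i] for ra, rb in zip(a, b) for i in range(5))

def recognize(glyph):
    """Return the (label, distance) of the closest template."""
    return min(((c, hamming(glyph, t)) for c, t in TEMPLATES.items()),
               key=lambda pair: pair[1])

# A clean "T" matches its template exactly.
clean_t = ["#####", "..#..", "..#..", "..#..", "..#.."]
print(recognize(clean_t))   # ('T', 0)

# Flip a few pixels to simulate a degraded scan: the label is still "T",
# but the distance (a crude confidence proxy) worsens. Heavier damage
# would eventually flip the decision entirely.
noisy_t = ["####.", ".##..", "..#..", "..#..", "....."]
print(recognize(noisy_t))   # ('T', 3)
```

Raw pixel comparison like this degrades quickly under the distortions listed above, which is one reason modern systems learn features from data instead.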

Educational implications

Text digitization: OCR transforms printed or handwritten text into machine-readable digital content. This process facilitates the establishment of digital libraries, providing students and educators with effortless access to extensive information.

Accessibility and inclusivity: OCR plays a pivotal role in the utilization of text-to-speech technologies and various assistive tools, enhancing the accessibility of educational materials for individuals with visual or learning disabilities.

Document administration: OCR simplifies the management of extensive document collections within educational institutions, improving administrative efficiency and minimizing the time dedicated to manual document processing.

Automated grading: OCR can streamline the grading and assessment procedures, saving time for educators while facilitating quicker feedback to students.

Language learning: Utilizing OCR can enhance student language acquisition by facilitating the translation of printed or handwritten text into various languages.


ASR (Automatic Speech Recognition)

History (Wang et al., 2019)

In 1952, Bell Labs in the United States achieved a ground-breaking milestone by developing the first truly comprehensive speech recognizer.

A rudimentary voice-activated typewriter and a speaker-independent vowel recognizer were developed over the following years. During this period, speech recognition systems were limited to recognizing single words or vowels.

In the 1960s, Japanese laboratories demonstrated that specialized hardware could be built for speech recognition tasks. Noteworthy examples included “the vowel recognizer of Suzuki and Nakata…, the phoneme recognizer of Sakai and Doshita…, and the digit recognizer of NEC Laboratories” (p. 2). Kyoto University’s efforts laid the groundwork for future continuous speech recognition systems.

Around the 1970s, the development of linear prediction, dynamic programming, and the Linear Predictive Coding (LPC) cepstrum fueled the rapid evolution of speech recognition for speaker-specific tasks with isolated words and small vocabularies.
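The dynamic-programming technique behind those 1970s isolated-word recognizers is commonly identified with dynamic time warping (DTW), which aligns two feature sequences even when the same word is spoken at different speeds. Below is a minimal sketch; the 1-D integer "features" are invented stand-ins for real acoustic vectors such as LPC coefficients.

```python
# Minimal dynamic time warping (DTW): find the cheapest monotone alignment
# between two feature sequences, so a slowly spoken word still matches its
# template. The feature values are toy numbers, not real acoustic features.

def dtw(a, b):
    """Return the minimal cumulative alignment cost between sequences a and b."""
    INF = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

# A stretched-out utterance of the same "word" aligns at zero cost...
template = [1, 3, 4, 3, 1]
slow_utterance = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]
print(dtw(template, slow_utterance))   # 0.0

# ...while a different "word" remains expensive no matter how it is warped.
other = [5, 5, 5, 5, 5]
print(dtw(template, other))            # 13.0
```

An isolated-word recognizer of this era would simply compute the DTW cost against every stored word template and pick the cheapest one.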

Researchers then began to extend speech recognition to speaker-independent tasks but met serious difficulties with existing technologies. In the mid-1980s, statistical hidden Markov model (HMM) technology gained widespread attention and application in speech recognition, marking significant progress. The SPHINX system, in particular, achieved a breakthrough in Large Vocabulary Continuous Speech Recognition (LVCSR) and stands as a milestone.

During the 1990s and early 2000s, extensive research was conducted on the HMM-GMM framework, which dominated the field of speech recognition until the application of deep learning techniques.
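To make the HMM framework concrete, the sketch below implements Viterbi decoding, the dynamic-programming search at the core of HMM-based recognition, for an invented two-state model (silence vs. speech frames). All probabilities are toy values for illustration, not trained parameters.

```python
# Minimal Viterbi decoder for a two-state HMM: given a sequence of
# observations, recover the most likely hidden-state sequence.
# States, observations, and probabilities are invented toy values.

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for `obs`."""
    # best[t][s] = (probability, previous state) of the best path ending in s
    best = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        best.append({})
        for s in states:
            p, prev = max(
                (best[t - 1][r][0] * trans_p[r][s] * emit_p[s][obs[t]], r)
                for r in states)
            best[t][s] = (p, prev)
    # Backtrack from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = best[t][state][1]
        path.append(state)
    return list(reversed(path))

states = ("sil", "speech")                  # silence vs. speech frames
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.7, "speech": 0.3},
           "speech": {"sil": 0.2, "speech": 0.8}}
emit_p = {"sil": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.2, "loud": 0.8}}

print(viterbi(["quiet", "loud", "loud"], states, start_p, trans_p, emit_p))
# ['sil', 'speech', 'speech']
```

In a real HMM-GMM recognizer the states correspond to sub-phoneme units and the emission probabilities come from Gaussian mixtures, but the decoding search is the same idea at a much larger scale.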

More recently, deep learning has brought notable improvements and new developments in speech recognition. In 2011, a research team from Microsoft Research introduced the context-dependent DNN-HMM (CD-DNN-HMM) system, demonstrating significant performance gains compared to traditional frameworks.

Applications

Voice command systems: Technologies that enable users to interact with electronic devices or software using spoken commands, providing a hands-free way to control and operate devices, applications, or services. For example, intelligent virtual assistants like Siri, Google Assistant, and Alexa (Vadwala et al., 2017).

Dictation systems: ASR is used in dictation applications to convert spoken words into written text. Google’s speech-to-text service and Apple’s dictation feature are good examples of dictation systems.

Accessibility services: ASR contributes to accessibility features, making technology more inclusive for individuals with disabilities who may struggle with traditional text input methods (Fendji et al., 2022).

Telecommunications: ASR is integral to Interactive Voice Response (IVR) systems, commonly used in customer service and support. Ibrahim and Varol (2020) claimed that ASR allows verbal interaction between users and automated systems and directs users to appropriate operators based on their needs.

Transcription services: ASR is widely used in transcription services, automating the process of converting spoken content into written form. For example, in the healthcare industry, medical transcriptionists can capture reports verbally instead of hand typing (Ibrahim & Varol, 2020).

Language learning: ASR in language learning apps helps users improve their pronunciation by analyzing their spoken words and providing instant feedback. Simulated conversations with virtual characters or AI chatbots also allow learners to practice speaking and listening skills.

Challenges (Vadwala et al., 2017)

To attain high accuracy, speech recognition systems must cope with challenges associated with:

Vocabulary: The size of a system’s vocabulary influences its complexity, processing demands, and accuracy. Applications requiring very large dictionaries need broad vocabulary coverage to reach high accuracy.

Channel variability: Channel variability concerns how the acoustic signal reaches the system. Challenges arise from noise that changes over time, diverse types of microphones, and other factors that alter the content of the captured sound wave.

Utterance approach: Whether words are articulated individually or in a connected fashion matters. For example, an isolated-word ASR system will perform very poorly on multi-word inputs.

Utterance style: All humans speak differently, with personal vocabulary, distinctive emphasis, and emotion. Natural speech, whether spontaneous or extemporaneous, includes disfluencies and poses a greater recognition challenge than continuous read speech.

Speaker model: Every speaker has a distinctive voice. Speaker-independent systems offer greater flexibility but are harder to develop and yield lower accuracy than speaker-dependent systems, which are designed for a specific speaker.

Educational implications

Remote Learning: ASR empowers seamless real-time communication and feedback between students and teachers, enhancing virtual classrooms and fostering interactive learning environments.

Accessibility and inclusivity: ASR plays a crucial role in improving accessibility for students facing disabilities, especially those with challenges related to speech or language, such as dyslexia (Ibrahim & Varol, 2020).

Multilingual education: ASR systems can facilitate multilingual education by offering language assistance and feedback across various languages. This is especially advantageous in educational environments characterized by linguistic diversity.

Automated grading: ASR can be utilized to automate the evaluation of spoken assignments or presentations. This frees up time for educators and ensures students receive prompt and consistent feedback.

Language learning: Utilizing ASR offers students real-time feedback on pronunciation and fluency, elevating language acquisition and improving speaking skills.


References

Awel, M. A., & Abidi, A. I. (2019). Review on optical character recognition. International Research Journal of Engineering and Technology (IRJET), 6(6), 3666–3669.

Chaudhuri, A., Mandaviya, K., Badelia, P., & Ghosh, S. K. (2016). Optical character recognition systems. In Studies in fuzziness and soft computing (pp. 9–41). https://doi.org/10.1007/978-3-319-50252-6_2

Fendji, J. L. K. E., Tala, D. C., Yenke, B. O., & Atemkeng, M. (2022). Automatic speech recognition using limited vocabulary: A survey. Applied Artificial Intelligence, 36(1), 2095039.

Ibrahim, H., & Varol, A. (2020, June). A study on automatic speech recognition systems. In 2020 8th International Symposium on Digital Forensics and Security (ISDFS) (pp. 1-5). IEEE.

OCRology. (2021, December 10). A quick history of OCR. Medium. https://medium.com/ocrology/a-quick-history-of-optical-character-recognition-ocr-c916d58e2170

Sarika, N., Sirisala, N., & Velpuru, M. S. (2021, January). CNN based optical character recognition and applications. In 2021 6th International conference on inventive computation technologies (ICICT) (pp. 666-672). IEEE.

Vadwala, A. Y., Suthar, K. A., Karmakar, Y. A., Pandya, N., & Patel, B. (2017). Survey paper on different speech recognition algorithms: Challenges and techniques. International Journal of Computer Applications, 175(1), 31–36.

Wang, D., Wang, X., & Lv, S. (2019). An overview of end-to-end automatic speech recognition. Symmetry, 11(8), 1018. https://doi.org/10.3390/sym11081018
