The UBC Digitization Centre has created more than 50 collections, all available through the Open Collections website. Our collections are diverse in format, content and language.

Materials that are not in English, or that are not written in the Latin alphabet, can be harder to access and search. But technology can help us minimize these barriers.

Laura Ferris and Rebecca Dickson, from the Digitization Centre, have worked out a process to generate searchable transcripts for non-Latin text. The idea originated from an article about a workshop on Optical Character Recognition (OCR) for Bangla. The workshop concluded that Google Drive was the most accurate tool for generating transcripts of non-Latin text.

With that information in hand, Ferris and Dickson started to explore Google Drive to create an automated workflow for transcribing batches of items.

Are you interested in trying the workflow out for yourself? If so, follow the instructions that Rebecca prepared and give it a try!

  1. Access Google Drive, create a “New folder” and rename it
  2. Create a Google Sheet inside the folder
  3. Open the Sheet, click on “Share”, then “Receive shared link”, and look in the link for the sheet identifier (the string of letters and numbers between “/d/” and “/edit”)
  4. In the Sheet, under the “Tools” menu, click “Script editor”
  5. Paste the content from “gs” into the script editor
  6. Update the “folderName” with the name of your folder (defined in step 1)
  7. Update the “sheetId” with the identifier that you found in step 3
  8. Click the “clock” icon to set up a trigger and select the options: “extractTextOnOpen”, “From spreadsheet” and “On open”
  9. Save the script and close the script editor
  10. Upload JPEG images to the folder (you can check out the sample items prepared for this work)
  11. Open the spreadsheet and wait for Google to do the work!
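
Step 3 is the easiest place to make a copy-and-paste mistake, so here is a minimal Python sketch for pulling the sheet identifier out of a shared link. It is not part of Rebecca's script: the function name `extract_sheet_id` is hypothetical, the sample URL is made up, and the sketch assumes identifiers contain only letters, digits, hyphens and underscores.

```python
import re
from typing import Optional

# Matches the identifier segment that follows "/d/" in a Google Sheets link.
# Assumption: identifiers use only letters, digits, hyphens and underscores.
_SHEET_ID_PATTERN = re.compile(r"/d/([A-Za-z0-9_-]+)")


def extract_sheet_id(url: str) -> Optional[str]:
    """Return the identifier that follows /d/ in the link, or None if absent."""
    match = _SHEET_ID_PATTERN.search(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up link for illustration only; it does not point to a real sheet.
    example = "https://docs.google.com/spreadsheets/d/1AbC-dEf_123/edit?usp=sharing"
    print(extract_sheet_id(example))
```

The value it returns is what you would paste into “sheetId” in step 7. If the function returns nothing, the link is probably not a spreadsheet link, and it is worth repeating step 3.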

If you would like to see Laura and Rebecca’s presentation on the topic, check out their slides. If you have questions, feel free to contact us.

Sources:

A workshop on Optical Character Recognition for Bangla (British Library)

OCR for non-English language text (Pixelating)

Pixelating-ocr (GitHub)
