Text Analysis Using a Non-English Script

The digital humanities scholar and historian Thomas Mullaney (Stanford University) has argued that there is an “Asia deficit” within digital humanities (DH), rooted in the platforms and digital tools that form the field’s foundation. Digital databases and text corpora – the “raw material” of text mining and computational text analysis – are far more abundant for English and other Latin alphabetic scripts than they are for non-Latin orthographies. Although text mining, an emerging area in DH, enables researchers to work with textual content, its methods are often not applicable to texts in languages such as Chinese because of differences in language structure. In Western languages, words are usually delimited by white space or punctuation, whereas the lack of punctuation and whitespace in Chinese texts represents one of many significant barriers to entry in this area of research.
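To make the segmentation problem concrete, here is a minimal sketch in Python. It is an illustration only, not the method used in the talk or in Daxue 2.0: the function names, the four-word toy lexicon, and the sample sentence are all assumptions made for this example. It contrasts plain whitespace splitting with forward maximum matching, a simple dictionary-based approach to Chinese word segmentation.

```python
def whitespace_tokens(text):
    """Split on whitespace, as is common for English and other Latin-script text."""
    return text.split()


def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching (hypothetical helper for illustration).

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character if nothing matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Toy lexicon, assumed for this example; real segmenters use large dictionaries
# or statistical models.
TOY_LEXICON = {"我们", "研究", "中文", "文本"}

english = "we study Chinese texts"
chinese = "我们研究中文文本"  # roughly: "we study Chinese texts"

print(whitespace_tokens(english))                # four separate words
print(whitespace_tokens(chinese))                # one undivided string
print(forward_max_match(chinese, TOY_LEXICON))   # four dictionary words
```

Whitespace splitting returns the entire Chinese sentence as a single token, while the dictionary-based pass recovers four words. Greedy matching of this kind can mis-segment ambiguous strings, which is one reason Chinese text analysis usually calls for dedicated tools.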

Minghui Yu, Programmer Analyst at UBC IT, has been conducting research in text analysis for a number of years, including a TLEF-funded research project called Daxue 2.0, and will review some tools that illustrate the current state of non-DH text analysis.


Thursday, November 16th, 2017 at 12:00PM – 2:00PM.


Registration online.

The Web as Infinite Archive: Why we turned to Machine Learning, Distributed Computing, and Interdisciplinary Collaboration to understand the Recent Past


The continually growing volume of cultural heritage held in web archives is a vast resource awaiting the use of researchers in fields as varied as history, political science, sociology, and computer science. While web archives have been collected and saved since 1996, scholarly use has lagged due to the sheer scale of the data that confronts potential users. In this talk, Ian Milligan argues that interdisciplinary collaboration, bringing together librarians, computer scientists, and historians, offers the best path forward. Drawing on two case studies, one using web archives to systematically explore the transition between two US Presidential Administrations, and the other on developing scholarly infrastructure for the study of web archives, this talk highlights efforts to “unleash” web archives. This is the fourth talk of the UBC History Department’s Colloquium Series 2017 – 2018, Histories on the Edge.

Dr. Ian Milligan is an associate professor in History at the University of Waterloo. He describes himself as a digital historian of Canada.


Thursday, November 9, 2017 – 12:30 to 14:00

Location
Buchanan Tower 1197
1873 E Mall
Vancouver, BC V6T 1Z1
Canada

OCR for Non-English Language Text


This Pixelating Mixer will demonstrate how the Digital Himalaya project is generating searchable transcripts for non-English materials, and the surprisingly accessible tool that makes it possible. Come learn about how staff at the Digitization Centre discovered this process, how it is being implemented, and try it for yourself. Notes and slides from this session can be accessed online.

Presenters: Rebecca Dickson and Laura Ferris


Facilitator(s): Larissa Ringham, Susan Atkey, Allan Cho

We provide soft chairs, tables, wireless internet, and interesting people to talk to, collaborate with, and bounce ideas off of. You bring your laptops, DH projects, and ideas. This is an open event – drop in and out as your schedule allows. Please bring your laptop if possible for this workshop, as this will be a hands-on session.
