Weird Science?

I’ve been encountering scholarly work on virtual distance learning and library service since my library tech program – in particular, articles about Second Life. It seems to be the common thread on the topic. Second Life sticks out in the search strategy Dennis Beck and Ross Perkins report in their 2016 “Review of Educational Research Methods in 3D3C Worlds” in the Handbook on 3D3C Platforms:

“Online worlds, virtual worlds, Multi-User Virtual Environments, Massively Multiplayer Online Role Playing Games, Virtual Reality, Second Life, Online worlds, role playing games, Cyberspace, and Immersive Worlds” (217)

This kind of research is probably the only area where Second Life overshadows World of Warcraft. My experience in “virtual worlds” tends to be more of the WoW sort, so what really struck me in Beck and Perkins’s conclusions was their call for more longitudinal studies of programs in virtual spaces.

Second Life is unusual in a lot of respects, but most of all in that it is over ten years old. Virtual worlds are unstable environments. I am reminded of the ESRB’s (the American rating agency for computer games) universal boilerplate for these environments: “game experience may change during online play.” They probably mean somebody might swear, but there’s a deeper truth in it. Developer Brad McQuaid, godfather of EverQuest, was quoted in a 2007 interview with the New York Times summing up this feature of online worlds:

“People ask me, ‘are you launching a finished game?’ and the answer is ‘no, we’re launching a game that is good enough to launch, but it’s not finished.’ And that’s why I love these games: because they should never be finished.”

This is just as true for virtual worlds that are not “games” – Second Life being the prime exemplar – which still depend on new content over time; Second Life’s model is simply more “social” in the sense that the users provide this content.

Not only does the nature and population of a virtual space change from year to year, often it disappears entirely. Servers are expensive! MMORPG.com maintains one of the more complete lists of lost worlds. For those with dates, many did not last even a year. How do you design a reproducible study in an environment that might not exist by the time your study is published? By the time your study is completed?

When the servers running a virtual world ultimately go down, by and large that world – and your access to it – ceases to be, with no real record except memories and screenshots. There was a joint effort in 2010 between the Library of Congress, Stanford, and Urbana-Champaign to create archiving standards for Preserving Virtual Worlds. Ironically, almost nothing survives online of the project except the final report. While they established initial standards and methodologies for archiving virtual worlds, the report emphasized the continuing difficulties in doing this. Second Life was a major test case for the project, which aimed to archive a single island; while the methods developed “could potentially allow us to reinstantiate the island in another virtual environment platform, in practice our efforts can only be described as partially successful at best” (96).

 


Look who’s talking

I feel remiss that, in a blog called “socially speaking” with a computational linguistics joke up there in the banner, I haven’t actually touched on any of the linguistics of social media. Let’s fix that:

 

A major field in the study of language is corpus linguistics. Its methodology revolves around the creation and use of large databases, called corpora, containing thousands if not millions of transcribed utterances and passages of written material. Corpora are typically indexed down to the word and heavily encoded with metadata to allow researchers to search for subsets in the data that can be used to test a hypothesis about the use of language.
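To make that concrete, here’s a toy sketch of the idea – a corpus as word-indexed records with metadata you can filter on. This is purely illustrative (real corpora like COCA use far richer encoding schemes, not Python dictionaries):

```python
# A toy "corpus": each text is a record of tokens plus metadata.
# Illustrative only -- real corpus encoding is far more elaborate.
corpus = [
    {"year": 1992, "genre": "newspaper", "words": ["the", "mayor", "said", "today"]},
    {"year": 2005, "genre": "fiction",   "words": ["she", "said", "nothing"]},
    {"year": 2005, "genre": "academic",  "words": ["results", "suggest", "nothing"]},
]

def search(corpus, word, **meta):
    """Count occurrences of `word` in texts matching the metadata filters."""
    hits = 0
    for text in corpus:
        if all(text.get(key) == value for key, value in meta.items()):
            hits += text["words"].count(word)
    return hits

print(search(corpus, "said"))                   # across the whole corpus
print(search(corpus, "said", genre="fiction"))  # restricted to one genre
```

The metadata filters are the point: being able to slice by year, genre, or speaker is what lets a researcher pull exactly the subset of language their hypothesis is about.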

One of the largest handcrafted corpora is COCA, the Corpus of Contemporary American English. COCA was developed around 2008 by researchers at Brigham Young University, and continues to grow. The size of COCA is only possible because of the volume of American English text available online (it was originally built with, of all things, Internet Explorer), but COCA doesn’t actually include any natively online content. The corpus was built as a retrospective, balanced, and American corpus. It archives data going back to 1990, and splits the data in each year evenly between the five genres it includes. In 1990 there simply wasn’t enough internet communication to make up an equal percentage of the data, especially if you limited it to American sources (if you could even tell), so it was declared out-of-scope for the project.

Still, COCA is a behemoth. It has 520 million words from sources spanning 25 years, divided evenly between transcribed speech, fiction, popular magazines, newspapers, and academic journals. The corpus comprises some 190,000 texts in total. Use of the data is free to the public; you can check out their search interface here. For most of my linguistic training, it was one of the best English-language corpora available, if not the best.

Compare that to this Facebook corpus a group of researchers generated just for their own research. It comprises 700 million words contributed by 75,000 volunteers (15.4 million Facebook status updates). They also got every volunteer to take a personality test. I can’t even.

[Visualization: word use by age and gender]

They’ve published some neat visualizations for their data on the links between word use, personality, age, and gender. It brings new meaning to “word cloud.”

The power in these corpora is how easily they can be produced, and how easily their contents can be statistically manipulated and compared. Researchers are not only distributing their data sets, they’re sharing the code they used to collect them! (One such code release amusingly attempts to coin the term “tworpus” for a Twitter corpus.)
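That ease of manipulation is worth a quick illustration. Once a corpus is just machine-readable word lists, comparing how often a word shows up in, say, transcribed speech versus status updates takes only a few lines. The two “samples” below are made up for the sketch:

```python
from collections import Counter

# Two tiny invented samples standing in for real corpora.
speech = "well I mean you know I think it went well".split()
status = "omg best day ever lol omg".split()

def relative_freq(tokens):
    """Each word's frequency as a share of the sample's total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}

speech_freq = relative_freq(speech)
status_freq = relative_freq(status)

# Compare a word's rate across the two samples (0 if absent).
for word in ["omg", "well"]:
    print(word, speech_freq.get(word, 0), status_freq.get(word, 0))
```

Normalizing to relative frequency is what makes corpora of wildly different sizes – 520 million words versus 700 million – directly comparable.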

 

 
