Student club at the iSchool@UBC

Liveblogging CAIS 2008: Assessing a Genre Based Approach to Online Government Information

Luanne Freund and Christina Nilsen, UBC SLAIS

[Please note: These are merely my notes on the presentation, taken live while the presentation was in progress and edited for sense afterwards. They are not a verbatim transcription of the presentation, and any errors are mine. Please contact the researchers directly for more information on their work. — JRD]

[Written on the blackboard] “It is a very sad thing that nowadays there is so little useless information.”

Notes on Christina Nilsen’s part of the presentation:

Does anyone know the source of this quote? [Oscar Wilde]

The problem is the abundance of information and the difficulty of finding the information we need. Government initiatives have made information available online, and it is generally high-quality information, but online information causes another problem: people have difficulty finding what they’re looking for in the huge amount of information available. Most people use standard search services like Google to find government information.

Our project looks at a genre-based approach to metadata.

The first question is, what is “genre”? Most people think of literature when they think of the concept of “genre”. Instead it might be useful to look at genre as “context-specific communicative acts” — functional genre theory looks at genre in organizational settings.

Document genres are not necessarily a form: it’s the communicative act itself that informs the genre; a genre has shared form, purpose and content aspects.

Our interest in genre is in discovering whether genre can help members of the community find information. Because genre is socially constructed, it carries context, and thus can enhance findability. We think the genre approach is specifically useful in the government context.

The Government of Canada has a “type” metadata element, which is optional in the GC metadata standard. It was based on Dublin Core and is intended to support resource discovery. It’s defined as the ‘nature or genre of a resource’, and its intent is to help with the management of websites as well as to help users narrow their search for information.

The GC type taxonomy is consistent with DC recommendations in that it uses a controlled vocabulary and an aggregation scheme. More than one type can be applied, and types can be post-coordinated. There are fifty different types of documents.

Notes on Luanne Freund’s part of the presentation:

Each type might have an example, for instance “assessment” has a couple of examples and a very brief explanation.

We wanted to find out how the GC type taxonomy is being used, and whether it is useful for people. We used an off-the-shelf web crawler to crawl a sample of 10,000 gc.ca pages, which is relatively easy to do because we were focusing on a single domain. We found that 22% of pages in our sample had GC type metadata, so there is some use of it. A further 14.5% had a gc.type field, but it was blank, so it was included in the template but just not filled out. 34 of the 50 type values in the GC schema were being used, and the values that weren’t being used at all were very specific genres, like music and sound, that weren’t likely to be used in this sample anyway. There were also 85 values that weren’t from the type schema, which is fine because you can choose other schemas. There was also a fair amount of error in those different values.
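As a rough illustration of the counting step described above, here is a minimal sketch of how a crawled page might be sorted into "filled", "blank", or "absent". It is not the researchers' tool (they used an off-the-shelf crawler), and it assumes the field appears in the page as a `<meta name="gc.type" content="...">` tag; the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TypeMetaParser(HTMLParser):
    """Collects the content of every <meta> tag whose name is the type field."""

    def __init__(self, field_name="gc.type"):  # field name is an assumption
        super().__init__()
        self.field_name = field_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None when an attribute has no value.
        attr_map = {key.lower(): (value or "") for key, value in attrs}
        if attr_map.get("name", "").lower() == self.field_name:
            self.values.append(attr_map.get("content", "").strip())


def classify_page(html):
    """Return 'filled', 'blank', or 'absent' for one page's type metadata."""
    parser = TypeMetaParser()
    parser.feed(html)
    if not parser.values:
        return "absent"  # no type field in the page at all
    return "filled" if any(parser.values) else "blank"
```

Running `classify_page` over every page in a sample and tallying the three outcomes would give figures comparable to the 22% filled and 14.5% blank reported in the talk.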

There is uneven distribution of gc.type data. In Service Canada there was 100% use of type metadata, so they probably have an information management policy that requires that. Health Canada was 88% and Environment Canada was only 1%.

The top values were:

  • fact sheet
  • home page
  • resource list
  • report
  • media release
  • contact information
  • organizational description
  • text
  • service
  • administrative page

Fact sheets were 10% of the values, and the top 10 made up about 60% of the usage. Text is a value in DC but not really reflective of genre, and it’s not a value in the GC scheme because it’s not reflective of content.
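Shares like these are simple to compute once the type values have been collected. The sketch below shows one way to do it with a frequency count; the function name is hypothetical and the sample list is made up for illustration, not taken from the study.

```python
from collections import Counter


def value_shares(type_values, top_n=10):
    """Return the top_n values with their shares, plus their combined share."""
    counts = Counter(type_values)
    total = sum(counts.values())
    top = [(value, count / total) for value, count in counts.most_common(top_n)]
    return top, sum(share for _, share in top)


# Made-up sample, not the study's data.
sample = ["fact sheet"] * 2 + ["report"] + ["home page"]
top, combined = value_shares(sample, top_n=2)
```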

It’s an interesting starting point.

Our second study looked at how useful this schema would be if we were government document workers. We manually collected a sample of 400 gc.ca documents, using the first 20 results from Google within the gc.ca domain for queries in the areas of health and environment. We used simple queries like “healthy meal plans”, “noise in the workplace”, and “gripe water”.

We immediately ran into consistency issues. We took the first 20 documents, went home and tagged them, and met again, and we found that there was almost no overlap. Genre is a socially constructed concept, not a real or clear feature of a document, so the likelihood of reaching really high levels of consistency is low in general, but we were really not in the same ballpark.

One reason is ambiguity in the schema, along with the large number of possible values and the difficulty of determining part/whole relationships. Genre is a hierarchical concept: a book and a chapter are both genres, but one fits within the other. We had to determine levels of granularity. So we added more detail to disambiguate tags, added structure to the tags, and reorganized the schema so that related tags were grouped together. I’ve switched to calling them “tags” but it’s still the same thing, genre values. We lumped together different tags to add context. After a few more iterations we had reached a reasonable level of overlap. Once we had that set up, the student researchers classified the documents again.
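The talk did not say how overlap between taggers was measured. As one possible sketch, since more than one type can be applied to a document, the agreement between two taggers could be estimated as the average Jaccard overlap of their tag sets. The function names and the dictionary layout (document id mapped to a list of tags) are assumptions for illustration.

```python
def jaccard(tags_a, tags_b):
    """Jaccard overlap of two tag collections: |A and B| / |A or B|."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 1.0  # both taggers left the document untagged
    return len(a & b) / len(a | b)


def mean_overlap(tagger_a, tagger_b):
    """Average Jaccard overlap over the documents both taggers labelled."""
    shared = tagger_a.keys() & tagger_b.keys()
    if not shared:
        raise ValueError("no documents in common")
    return sum(jaccard(tagger_a[d], tagger_b[d]) for d in shared) / len(shared)
```

A value near 0 would correspond to the "almost no overlap" of the first round, and re-running the measure after each revision of the schema would show whether the changes were helping.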

About 10% of documents were left untagged. A lot of those were lead-in pages, so that might need to be added as a type. But in general most of the documents were possible to tag. 66% were tagged with a single value, which implies there is a distinct characterization of those documents. 40% were assigned a sub-genre. 17% were assigned a meta-genre; you can post each “chunk” of a document as a separate document, which has implications for using search engines: if your search result leads you to the references page of a report, that’s not so useful.

There were 34 type values used out of 50. They were not the same 34 that we found in the first study, but the set was similar. The top values were:

  • fact sheet
  • resource list
  • reference material
  • report
  • guide
  • media release
  • news publication
  • educational material
  • promotional
  • terminology
  • standard

It was a Zipf-like distribution with a long tail.

… Here are some examples. This was “mercury poisoning”. This result was a fact sheet, this was a newsletter, this was a resource list that links to much more in-depth reports and information. The implication is that if we can somehow identify who the searcher is, what their context is, we can promote the genre that might be most useful for them.

Another example, for this query, “safety antenna radiation”, there was a lot of variance in the types of results. This query, “seniors home renovation”, was similarly varied.
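The idea of promoting the genre most useful to a particular searcher could be sketched as a simple re-ranking step. This is only an illustration, not something the researchers described building: the result fields (`title`, `genre`, `score`), the function name, and the fixed `boost` value are all hypothetical.

```python
def rerank_by_genre(results, preferred_genres, boost=0.5):
    """Sort results by score, adding a bonus when the genre is preferred.

    results: list of dicts with hypothetical keys 'title', 'genre', 'score'.
    """
    preferred = set(preferred_genres)

    def boosted(result):
        bonus = boost if result["genre"] in preferred else 0.0
        return result["score"] + bonus

    # Highest boosted score first; the original list is left unchanged.
    return sorted(results, key=boosted, reverse=True)


# Made-up results for illustration only.
results = [
    {"title": "A", "genre": "report", "score": 1.0},
    {"title": "B", "genre": "fact sheet", "score": 0.8},
]
for_general_public = rerank_by_genre(results, {"fact sheet"})
```

Here a searcher assumed to prefer fact sheets would see result B first, while a searcher with no stated preference would get the original order.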

What have we learned?

Type metadata is being used in about 1/5 of Government of Canada (gc.ca) web pages

Prevalence of the use of metadata varies by department

The GC type schema is heavily used

A very small number of genres identify the majority of information, but there is a long tail of less common genres. If you’re a proponent of the ‘long tail’, that very specific information can be valuable also. If you want ‘musical notation’, for instance, you really want that specifically, and not other types of documents.

Genre variation occurs within topics and domains, and consistent tagging is difficult to achieve. I’m not that into the idea of manual tagging, so my tendency would be to look towards automated systems to decrease the variation. I’d also like to see it be more robust.

Here’s something to think about: Service Canada is 100% tagged by genre, but on the Service Canada website’s search page, there is no way for visitors to make use of this metadata.

Future work:

Most of all, what we need to do is some user studies. We need to talk to people and see if they care and understand, and if they know what the genres are. The long-term goal would be to develop a genre-enabled search engine for government information.

Question from the audience: When you were looking at these, did you find that you generally agreed with how government documents were tagged, or were they tagged kind of lackadaisically because tagging would not be the person’s main focus?

Answer: We found some overlap, but not much. But we found it was complete guesswork ourselves, so I wouldn’t call it lackadaisical. People did sometimes seem to just “throw something in there”.

Comment from Luanne Freund: If you’re interested in genre, the current ASIS&T Bulletin has a section on Genre with a number of articles about it.
