This is part of a series on web archiving at the UBC Library. For all posts about web archiving, please see https://digitize.library.ubc.ca/tag/web-archiving/

From May 2017 to April 2018, as part of my work as a Digital Projects Student Librarian and a Professional Experience Student at the Digitization Centre, I worked with Larissa Ringham to develop the University of British Columbia Web Archive Collection. This post should give you a sense of why we embarked on this project, some of the major challenges associated with doing large-scale web archiving, and how we adapted our work to address those challenges.


Preserving UBC’s web-based institutional memory

In the past, a major part of UBC’s history was documented in physical records like letters and photographs. After 25 years of public access to the internet, though, many major University activities are now documented on websites instead. This presents new challenges for organizations like the Library that have an interest in preserving content related to institutional memory. With that in mind, the Digitization Centre decided to develop a new Web Archive collection focused on websites created by UBC faculty, staff, students, administrators, and other community members.

 

Scaling up from small thematic collections to a large domain crawl

Since the Library started archiving websites in 2013, most collections have been created around a central theme (e.g. the 2017 BC Wildfires collection). These thematic collections have usually included fewer than 100 target websites (a.k.a. “seeds”) and averaged about 25GB in size. Each site is often crawled individually, and each capture is, ideally, checked by a human for quality issues like missing images.

The universe of UBC-affiliated websites exists on a much larger scale. When we initially ran week-long test crawls of the ubc.ca domain, each of them resulted in about 500GB of captured data from almost 200,000 unique hosts (e.g. library.ubc.ca). We quickly realized we needed to find a way to scale up our workflows to deal with a collection this large.

Selection, appraisal, and scoping: How do we identify and prioritize high-value “seed” websites within UBC.ca, as well as flag content to be excluded?

As of Summer 2017, when we started our test crawls on ubc.ca, Archive-It test crawls could run for a maximum of 7 days. That meant our tests would time out before capturing the full extent of the UBC domain, leaving some content undiscovered. We were also unable to find a comprehensive, regularly updated list of UBC websites that would help us make sure no important sites or subdomains were missed.

Additionally, if saved, each one of our test crawls of UBC.ca would have been large enough to use up our Archive-It storage budget for the year. Even with a one-time doubling of our storage to facilitate this project, we needed to do work to exclude large data-driven sites.

Quality Assurance: How can we identify and fix important capture issues in a scalable way?

Manually clicking through and visually examining millions of captured web pages would not be possible without an army of Professional Experience students with iron wrists that are miraculously immune to tendonitis.

 

New workflows, who dis?

After a lot of trial and error, this is how we went about our first successful capture of the ubc.ca domain.

A two-pronged approach to selection, appraisal, and scoping

First, using the results of our first few test crawls, I created a list of all of the third-level *.ubc.ca subdomains our crawler found on its journey. Some of these were immediately flagged as out of scope (e.g. canvas.ubc.ca) or defunct (e.g. wastefree.ubc.ca). The rest were classified by the type of organization or campus activity associated with the subdomain. Sites associated with major academic or administrative bodies (e.g. president.ubc.ca) were added to a list of high-priority websites and set up as individual seeds in Archive-It, where they would be crawled one at a time and reviewed for issues in closer detail. For each subsequent test crawl, I ran a Python script that compared the results with our master subdomain list, flagging any new third-level subdomains for assessment.
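The original script isn’t reproduced here, but a minimal sketch of that comparison step could look something like the following. The file names are hypothetical, and it assumes the crawl’s host report and the master list are exported as plain text files with one hostname per line.

```python
# Sketch of the subdomain-comparison step (not the original script).
# Assumes two plain-text files, one hostname per line:
#   hosts_report.txt  - hosts discovered by the latest test crawl
#   master_list.txt   - subdomains already assessed
# Both file names are hypothetical.

def third_level_subdomain(host: str):
    """Return the third-level *.ubc.ca subdomain for a host, if any.

    e.g. 'blogs.library.ubc.ca' -> 'library.ubc.ca'
    """
    parts = host.lower().strip(".").split(".")
    if len(parts) >= 3 and parts[-2:] == ["ubc", "ca"]:
        return ".".join(parts[-3:])
    return None

def load_hosts(path: str):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

if __name__ == "__main__":
    crawled = {third_level_subdomain(h) for h in load_hosts("hosts_report.txt")}
    crawled.discard(None)
    known = load_hosts("master_list.txt")

    # Anything the crawler found that isn't on the master list yet
    # gets flagged for selection and appraisal.
    for subdomain in sorted(crawled - known):
        print("NEW:", subdomain)
```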

Second, I consulted lists of UBC websites that we felt could reflect their level of usage and/or value for institutional memory. While it’s not always comprehensive or up to date, UBC does have an existing website directory. It’s especially helpful for identifying high-priority sites that aren’t third-level *.ubc.ca subdomains, and for double-checking that we’re capturing the websites of all major academic and governing bodies. In addition, we used a tool in Wolfram Alpha to get a list of the most-visited *.ubc.ca subdomains. This list helped us identify commonly encountered subdomains that exist for functional reasons rather than for hosting content (e.g. authentication.ubc.ca).

Screenshot of the Google spreadsheet used for tracking potential, selected, and excluded websites. Each row corresponds to a website, and contains data about what stage it's at in the archiving process.

Archive-It’s selection and appraisal tracking capabilities are limited, so we export data from the service and consolidate it with our appraisal tracking in this spreadsheet full of VLOOKUP horrors.
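If we ever outgrow the spreadsheet, the same consolidation could be scripted. As a rough sketch only (the file and column names below are made up, not our actual export or tracking fields), a pandas join does the same job as the VLOOKUPs:

```python
# Hypothetical alternative to the VLOOKUP step: join an Archive-It
# seed export with our appraisal-tracking sheet on the seed URL.
# File and column names are illustrative, not the actual spreadsheet's.
import pandas as pd

seeds = pd.read_csv("archiveit_seed_export.csv")   # e.g. columns: url, crawl_status, last_crawled
tracking = pd.read_csv("appraisal_tracking.csv")   # e.g. columns: url, priority, appraisal_stage

merged = tracking.merge(seeds, on="url", how="left", indicator=True)

# Seeds we are tracking that don't appear in Archive-It yet.
not_yet_added = merged[merged["_merge"] == "left_only"]
print(not_yet_added[["url", "priority", "appraisal_stage"]])
```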

More sustainable, targeted quality assurance

Following our existing QA workflow, I started by manually examining captures of our high-priority seeds for problems like missing images. This time, though, I carefully tracked the issues I encountered and the scoping rules we added to fix them. Pretty quickly, patterns emerged that allowed me to start addressing capture issues in bulk.

An initial capture of artscoop.ubc.ca with missing look and feel files, rendered as plain text links with no formatting.

The live Arts Co-op website (artscoop.ubc.ca), with its look and feel files and image assets intact.

For example, captures of WordPress sites using the UBC Common Look and Feel were all missing similar CSS files. Once I found a set of rules that fixed the problem on one impacted site, it could be set at the collection level, where it would apply to all affected sites, including those that aren’t high-enough priority to check manually and where we might otherwise never have noticed the issue.
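For sites that don’t get a manual review, even a small script can help surface this class of problem. The helper below is a hypothetical sketch, not part of our actual workflow: given the playback URL of a capture, it lists any stylesheets the archived page references that don’t resolve.

```python
# Hypothetical QA helper: given the playback URL of an archived page,
# list any stylesheets the capture references that don't resolve.
# Requires the third-party packages `requests` and `beautifulsoup4`.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def missing_stylesheets(capture_url: str):
    page = requests.get(capture_url, timeout=30)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")

    missing = []
    for link in soup.select('link[rel~="stylesheet"]'):
        href = link.get("href")
        if not href:
            continue
        css_url = urljoin(capture_url, href)
        # A 404 in playback usually means the asset was never captured.
        if requests.head(css_url, timeout=30, allow_redirects=True).status_code == 404:
            missing.append(css_url)
    return missing

# Example call (the playback URL is illustrative only):
# print(missing_stylesheets("https://wayback.archive-it.org/.../artscoop.ubc.ca/"))
```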

 

The results and next steps

As of today, the Library has captured and made publicly available 164 UBC websites in its Archive-It collection. These captures add up to over 700GB of data! This number continues to grow thanks to regularly scheduled re-crawls of important sites like the Academic Calendar (crawled quarterly) and the main University domain at www.ubc.ca (crawled monthly). The next step will be to establish regular crawl schedules for other high-priority sites, bearing in mind how our other web archiving projects affect our crawl budget.

Our internal list of “seeds” also includes an additional 397 UBC websites that are sitting at various stages of the web-archiving process. Many of these have already been crawled, and the captures will be made available on the Archive-It collection page after some additional quality assurance and description work.


Get in touch

Are you the owner of a UBC-affiliated website that you think should be preserved? Read more about our Web Archiving Work, or get in touch with us at digitization.centre@ubc.ca.

 

This is a series on web archiving at the UBC Library. For all posts about web archiving, please see https://digitize.library.ubc.ca/tag/web-archiving/

From the new report by the United Nations Intergovernmental Panel on Climate Change, to the new NAFTA (USMCA) agreement, to Vancouver’s housing crisis, government information is all around us. Historically, government information was sent to academic libraries via depository agreements, but with the phasing out of print publishing in favor of born-digital publications, the majority of these deposit agreements have ceased.

Born digital information can be taken down as quickly as it is published and government information is no exception. Websites are removed for a variety of reasons including the site being seen as outdated, perceived national security issues, changes in administrations or organizational and departmental website guidelines. Canada’s federal Guidelines on Implementing the Standard on Web Accessibility includes a section on website links perceived to be redundant, outdated or trivial (ROT). What may be trivial according to government guidelines could be of value to researchers, historians or the general public, which is where the importance of web archiving comes in.

Since 2013, archiving government websites has been at the forefront of UBC Library’s web archiving initiatives. One of the Library’s first web archiving projects involved archiving federal government websites. In 2013, the federal government announced that the government’s web presence would be consolidated from over 1500 websites down to essentially one – canada.ca. Librarians were warned that the merger would result in the removal of valuable information, including reports and data, which wouldn’t be transferred to the new site. Due to the enormous scale of the project, UBC Library collaborated with other academic libraries across Canada to quickly archive nine federal departments, including Citizenship and Immigration Canada, Canadian Heritage, the National Research Council, Elections Canada and the Canadian Human Rights Council. These sites are now preserved and viewable on the Library’s Archive-It collection page.

Canadian Government Information – Digital Preservation Network (CGI-DPN)

The federal government website project was initiated by the Canadian Government Information – Digital Preservation Network (CGI-DPN), a national collaborative web archiving group established in 2012, of which UBC is a partner.

Modelling itself on the U.S. Digital Federal Depository Library Program (FDLP), CGI-DPN uses LOCKSS to distribute copies of replicated Canadian government information in secure dispersed locations including British Columbia.

The CGI-DPN web archive includes copies of the Depository Services Program E-collection, at-risk government websites of all jurisdictions (federal, provincial, municipal) as well as thematic collections. UBC is a LOCKSS node for the CGI-DPN and participates in curating various collections for the project. The collections are all available via https://archive-it.org/organizations/700

Municipal government collection

Along with archiving federal websites, we have also partnered with Simon Fraser University and the University of Victoria to capture local municipal content. UBC Library archived 132 municipal websites which are hosted on the University of Victoria’s British Columbia Local Governments Archive-it collection.

One of the benefits of archiving sites and curating a collection is that the content is all located in one place. Some cities, like the City of Vancouver, archive their own web domain, but a researcher would have to visit each site individually rather than viewing all the collections in one account. In some cases, these municipalities also view archiving purely as preservation and keep their collections “dark”, closed to the public.

Challenges

The challenges of web archiving government content include copyright issues as well as the necessity of working in an agile environment. Copyright for government websites varies from province to province, as each province and territory interprets Crown copyright differently. Some governments allow their domains to be archived while others do not; the Province of British Columbia is one that does not allow its site to be archived.

Websites can come down very quickly and sometimes we only have hours or days to capture this content. Working collaboratively with other institutions across British Columbia and Canada has allowed us to preserve material that would have otherwise disappeared forever.

Current government collections we are actively engaged in archiving include the BC local government elections, impacts of the legalization of marijuana, and Vancouver’s recently announced rapid transit projects.

We always welcome suggestions, so if you have any ideas for government collections please fill out our web archiving proposal form!  https://digitize.library.ubc.ca/work/

By Susan Paterson, Government Publications Librarian

This is a series on web archiving at the UBC Library. For all posts about web archiving, please see https://digitize.library.ubc.ca/tag/web-archiving/

The Digital Initiatives unit at UBC Library offers an opportunity every term for students to complete a Professional Experience project in web archiving. During the summer of 2018, I had the chance to work with them on their web archiving initiatives.

The Professional Experience was great in many ways. I had the opportunity to learn about web archiving, from understanding its importance to performing quality assurance on crawled web pages. During the term, I focused on creating a web archiving collection of sites related to Marijuana Legalization in Canada, and more specifically in British Columbia. The collection will enable people to access web content created around the topic, like awareness campaigns, and to see different perspectives on it.

 

Why web archiving?

Developing a working knowledge of web archiving seemed like a great opportunity. New data and content are posted to the internet every day and a great portion of it is made of information with a short life cycle—social media and news, for example.

Along with the massive production of information, there is also information loss. Websites, web pages, and links just stop working because someone decided the information was no longer useful, or because the content was moved from one site to another. Web archiving is one way to prevent that loss. Archiving websites and web pages is a process that involves information curation, copyright, crawling the web, quality assurance, and a lot of troubleshooting.

 

Application

Web archiving initiatives are growing. Not only are academic and public libraries investing in web archiving, but companies and cities are as well. Libraries have created collections to serve different purposes and preserve information on specific topics like institutional memory, elections, natural disasters, landmark laws, politics, educational purposes, and more.

Companies that use web archiving services tend to do so for two main reasons: competitiveness and litigation purposes. For example, a company may want to defend against or build legal cases based on what is published on the internet, or to preserve statements and information released about a competitor.

As a future librarian, I perceive several opportunities with web archiving, due to the profile of our profession. In general, librarians are experts when it comes to monitoring information, content curation, metadata, users’ needs, copyright, and technology. Those are some of the skills and knowledge needed to work with web archiving in the mentioned contexts.

In turn, web archives are a useful tool for librarians as they can help in many ways, for example:

  • Reducing the amount of work needed to update broken links on Research Guides (for example, by pointing them at Wayback Machine captures; see the sketch after this list)
  • Making it easier to find new resources to substitute ones that are no longer available for access
  • Ensuring access to great resources on the web, without worrying if they will still be available
  • Registering information that is easily lost, like social media and news
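To make the first point concrete: the Internet Archive has a public availability endpoint that returns the closest Wayback Machine capture of a URL, which could be used to suggest replacements for dead guide links. A minimal sketch, with a placeholder list of links:

```python
# Sketch: suggest Wayback Machine replacements for broken links,
# using the Internet Archive's public availability API.
# The list of guide links below is a placeholder.
import requests

def closest_snapshot(url: str):
    """Return the URL of the closest Wayback Machine capture, if one exists."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}, timeout=30)
    resp.raise_for_status()
    snapshot = resp.json().get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot and snapshot.get("available") else None

guide_links = ["http://example.com/some-report"]  # placeholder

for link in guide_links:
    try:
        live = requests.head(link, timeout=30, allow_redirects=True)
        broken = live.status_code >= 400
    except requests.RequestException:
        broken = True
    if broken:
        print(link, "->", closest_snapshot(link) or "no capture found")
```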

 

Challenges

While web archiving is full of opportunities, it is also full of challenges. The main ones in working with web archiving are:

  • To perform quality assurance (QA): web crawlers sometimes have trouble collecting information from websites with interactive content, for example. Figuring out how to scope and define crawl rules so that captured pages display properly can be challenging.
  • To balance archived content against data budgets: finding the ideal scope and crawl rules helps, but it isn’t everything. Deciding how much data to save (and therefore how much to invest) versus how much of a website or web page to capture is a challenge of its own.

 

Recommendations

A Professional Experience in web archiving is an excellent opportunity to learn about the topic, get hands-on experience, work with professionals from the field, and strengthen your resume. The position will enable you not only to learn about web archiving, but also to exercise and improve your skills in time and project management, reporting, and working autonomously.

If you are interested in learning more about web archiving, check out these resources:

 

Written by Paula Arasaki, MLIS student at UBC

This is a series on web archiving at the UBC Library using the Internet Archive’s Archive-it service. For all posts about web archiving, please see https://digitize.library.ubc.ca/tag/web-archiving/

Proposal

Web archiving projects can be proposed by anyone: community members, students, researchers, librarians, etc. We have also completed larger collaborative projects that were proposed by a group of libraries, such as the 2017 B.C. Wildfires collection. Each proposal is reviewed by Digitization Centre librarians as well as subject-matter experts that we identify as collaborative or consulting partners. For example, the Site C Dam is a hydroelectric project managed by B.C. Hydro, a Crown corporation, and has a significant impact on the province’s First Nations groups. For this reason, the proposal for our B.C. Hydro Site C Dam web archiving collection was evaluated and later developed with UBC Library’s Government Publications Librarian Susan Paterson and Aboriginal Engagement Librarian Sarah Dupont.

Evaluation

To be selected as a candidate for web archiving, websites are evaluated based on their risk of disappearance, originality, availability in other web archiving collections, and copyright considerations … among other factors. While we are currently updating our collection policy for web archiving collections, potential collections and websites are evaluated based on their intrinsic value and significance to researchers, students, and the broader community. We aim to capture content that is relevant to the needs of the wide range of subject areas taught and researched at UBC, as well as content that contributes to the institutional memory of the university.

Resources

Web archiving projects are resource-intensive. Websites are assessed for how extensive we anticipate the captured content will be, and therefore how much of our subscription’s data storage will be used. We also consider how much time the project will take, and who is available to undertake the work within the required time frame. Some projects need to be responsive to current events unfolding in real time – such as a political rally, or a catastrophic event such as an earthquake – with resources required to identify the content and set up the crawls immediately. While we are fortunate in having had a number of students from the iSchool interested in working on our web archiving projects, they may already be committed to another project.

Technical considerations

Websites are constructed in many different ways, with a range of elements that dictate how a site behaves. Before starting a project we consider archive-friendliness: how easily the content, structure, functionality, and presentation of a site can be captured with Archive-it. Dynamic content – anything that relies on human interaction to function, or that contains database-driven content – can be problematic. Sites built with JavaScript or Flash often leave web crawlers unable to capture certain elements on a page. While there are ways to customize crawls in certain cases, a successful crawl can take time to construct and success is not always guaranteed.

Metadata

Aside from capturing the web content itself, we create metadata for each collection and each website (or “seed”) that we capture; this often includes a description for the collection and the seed, as well as the creator of the seed and subject terms for the content, such as Environmental protection for a seed in the Site C Dam collection.

This metadata provides context for why the seed was included in the collection, and helps users discover the content relevant to their interests when searching in Archive-it.
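Purely as an illustration of the kind of record involved (the field names and values below are examples in the spirit of the description above, not the exact Archive-It metadata schema or a real record), seed-level metadata might look like:

```python
# Illustrative seed-level metadata record; field names and values are
# examples only, not the actual Archive-It schema or a real seed.
seed_metadata = {
    "url": "http://example-site-c-seed.ca/",  # placeholder seed URL
    "title": "Example Site C Dam advocacy site",
    "description": "Captured as part of the B.C. Hydro Site C Dam collection.",
    "creator": "Example organization",
    "subject": ["Environmental protection"],
}
```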

 

By Larissa Ringham, Digital Projects Librarian

Estimates put the current size of the internet at around 1.2 million terabytes and 150 billion pages. Sites go up, sites come down, pages are removed, content changes continuously. And an increasing amount of this information is available only online. You might not care if you can no longer access the comments about watching someone’s grass grow, but you may be concerned one day to find that a political candidate no longer has their statement of opposition on an important local topic up on their campaign website – a statement that you desperately need for your research.

Fortunately much of the content on the web today is captured by the Internet Archive, which harvests and makes available web content through its Wayback Machine. Sites are crawled by the Wayback Machine for archiving on an irregular schedule; depending on a variety of factors such as how heavily a site is linked, the Internet Archive web crawlers may crawl a site several times a day – or only once every few months. Web content can change so frequently that unless you can specify exactly when the content on a specific site is captured, there is a chance that information will be lost forever. The Wayback Machine does what it can, but it has billions of web pages to try to crawl.

Enter Archive-it. Archive-it is a subscription web archiving service that the Internet Archive created in order to give organizations like the UBC Library the ability to harvest, build, and preserve collections of digital content on demand. This service gives us control over what we crawl and how often, and allows us to apply the metadata that will permit users to find our archived web content more easily. And information can now be pulled out of our collections for analysis using Archive-it’s API. The sites we harvest are available on our institution’s Archive-it home page, and are added to the Wayback Machine’s own site crawls so that our information is full-text searchable, and freely available to anyone in the world at any time.
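As a small example of pulling capture data out for analysis, the sketch below queries the Wayback Machine’s public CDX index for recent captures of www.ubc.ca. Note that this uses the general Wayback Machine index rather than Archive-It’s own partner APIs, and the query parameters shown are just one reasonable starting point.

```python
# Sketch: list a few Wayback Machine captures of www.ubc.ca via the
# public CDX API (not Archive-It's partner APIs).
import requests

resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={"url": "www.ubc.ca", "output": "json", "limit": 5, "from": "2018"},
    timeout=30,
)
resp.raise_for_status()

rows = resp.json()          # first row is the header
if rows:
    header, captures = rows[0], rows[1:]
    for capture in captures:
        record = dict(zip(header, capture))
        print(record["timestamp"], record["original"], record["statuscode"])
```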

We started web archiving in 2013, when a group of university libraries – including UBC – began crawling the Canadian federal government websites collaboratively in order to capture content important to Canadians that was scheduled for removal online. Since then, we have created nine collections of archived web content, with three more under active development. These collections are representative of the research interests of UBC and its community, and include such topics as the BC Hydro Site C Dam project and First Nations and Indigenous Communities websites, as well as the University of British Columbia websites themselves.

Over the next few weeks we will be exploring some aspects of our web archiving work at UBC, and will hear from some of our library partners and past students who have done work in the area. Stay tuned for posts on developing web archiving projects, archiving government web content, and the technical limitations of web archiving.

See all posts related to web archiving: https://digitize.library.ubc.ca/tag/web-archiving/.

 

By Larissa Ringham, Digital Projects Librarian

Have you ever been in a situation where you’re looking for information on the internet, find a super useful and interesting website that seems to contain a lot of relevant information, and then when you click… you get an error page! This may be because that website no longer exists or because that page was removed.

Online content can be very ephemeral, and information can be easily lost. Because of this, there is an initiative called the Internet Archive, a non-profit organization dedicated to saving websites and web pages from all over the world with the goal of providing access to those pages for future generations.

The Internet Archive began in 1996, when people realized that there was a lot of content on the web, freely available, but that no one was taking steps to preserve it and ensure future access to it. It is common for digital content to simply disappear or change over time.

The Internet Archive offers two main services:

  • Archive-it: a web archiving service that helps organizations harvest, build and preserve collections of digital content
  • Wayback Machine: the interface through which users can access the archived collections

Due to the growing importance of web archiving, several organizations from all over the world have begun archiving their own web content and other sites they deem important. In the context of universities, much of this work is being done by libraries. UBC has its own web archiving program, which is currently being run out of the Digitization Centre.

 

UBC Web Archiving Initiative

The UBC Web Archiving Initiative was created in 2013 with the purpose of preserving, and ensuring access to, web content that contributes to the fulfillment of the institution’s mission. The content should fulfill at least one of the following criteria:

  • Be of interest to the University
  • Contribute to research, learning and teaching
  • Be associated with UBC’s corporate memory

Thus, the following types of websites are considered within the scope of our collection:

  • Research, public or governmental interest
  • Historical or geographically local significance
  • Complementary to relevant existing collections
  • Content produced by the University or affiliated organizations

Currently, UBC has nine collections in its web archive.

If you have suggestions for web content that should be archived, feel free to fill out the form! If you want to learn more about web archiving, check out About the Internet Archive (Internet Archive) and Web-archiving (DPC Technology).

 

Sources:

About Archive-it (Archive-it)

UBC Library web archiving (Slideshare)

Understanding web archive access and use with Google Analytics: lessons and questions from the Federal Depository Library Program (Archive-it)

Web archiving FAQ (UBC Library)

Work with us (UBC Wiki)
