Finding Web Archiving Support Through a Virtual Community Co-Working Space

This post was written by Amanda Greenwood, Archivist of Historical Collections at the University of Virginia’s Claude Moore Health Sciences Library.

Frustrated with the difficulties of managing our web archives programs, Corinne Chatnik, the Digital Collections & Preservation Librarian at Union College, and I would often meet via Zoom to chat about web archiving. At first, these meetings were time we blocked off to hold each other accountable for our web archives work, with the calendar invites serving as reminders to check our test crawls. Soon, however, we began actively web archiving during these sessions, sharing our screens and troubleshooting issues we had each encountered. Eventually, we decided to open our meetings to a broader audience and create a collaborative, community-based group, in the hope of bringing together those who work on web archives. Thus, Webjam: A Web Archiving Jamboree was formed.

As members of the Society of American Archivists Web Archiving Section Steering Committee, we often discuss how managing web archives programs is an isolating task. As the solo managers of the web archives programs at our respective institutions, Corinne and I know this firsthand: being the only archivist with web archiving skills means we have no team to collaborate with, so we often look for outside support. Web archiving requires constant maintenance and labor because dynamic and interactive web content changes constantly, and finding time to work on it consistently can be challenging alongside our other work responsibilities. We discovered that Webjam, which meets quarterly, helps us stay on top of our web archives duties.

At our last Webjam meeting, folks with various levels of web archives experience joined to ask questions; assist others with crawling and scoping issues; and share tips, tools, and technology that assist them with their web archives programs. If you are interested, please join us at our next Webjam meeting, where practitioners at all levels of web archiving experience are encouraged to participate. Even if you don’t have questions, you are welcome to (virtually!) hang out, meet fellow practitioners, and work on your web archives collections!

What: Webjam: A Web Archiving Jamboree (Q3 meeting)

When: Sep 24, 2024 11:00 AM Eastern Time (US and Canada)

What we’ll do: Actively web archive, ask questions, chat, and offer support.

Who: Hosted by Corinne Chatnik (Union College) and Amanda Greenwood (University of Virginia). Questions or comments? Email [email protected].

How: Fill out this Google form and you’ll be sent the Zoom meeting information a couple of days prior to the meeting.

Web Archiving Roundup: April-June 2024

Events: 

  • If you missed our recent coffee chat on WARC-GPT, web archiving, and AI, you can view the blog post with the recording and slides here
  • Registration is now open for the Web Archiving Section’s joint annual section meeting with the Public Library Archives and Special Collections section on Wednesday, July 17, 1:00-2:30pm (CT). This virtual event is open to all and will be centered on community web archiving efforts to preserve underrepresented, at-risk voices. Register now

Here are a few quick links on interesting web archiving topics we’ve found recently: 

  • “On the Technical vs. Public Accessibility of Historical Web Content as Patent Prior Art” (live URL | archived URL)
  • “End of Term Web Archive – Preserving the Transition of a Nation” (live URL | archived URL)
  • “When Online Content Disappears” (live URL | archived URL)
    • “38% of webpages that existed in 2013 are no longer accessible a decade later” – Pew Research Center
  • “Making Web Archives More Accessible: Insights from a GAAD Perspective” (live URL | archived URL)
  • “As China’s Internet Disappears, ‘We Lose Parts of Our Collective Memory’” (live URL | archived URL)
  • “Saving the First Draft of History” (live URL | archived URL)
  • “From Hyperlinks to Queer Histories: Uncovering LGBTQ Web Archives” (live URL | archived URL)

News: 

  • “Internet Archive and the Wayback Machine under DDoS cyber-attack” (live URL | archived URL)
  • “Paramount Global Erases Archives of MTV Website, Wipes Music, Culture History After 30 Plus Years” (live URL | archived URL)

**Note: Our section created a Zotero library for web archiving resources, so you can find all the resources featured in our monthly web archiving roundups in one place! You do not need to create an account to access our Zotero library. Each month, we’ll be updating our Zotero library with new links. 

What is an archived URL? An archived URL, or permalink, is a stable record of a website with a unique, permanent URL that points to the archived copy, so you can reference the source even if the original disappears from the web.
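As a rough illustration of what such a permanent URL can look like, here is a minimal sketch that assumes the capture lives in the Internet Archive's Wayback Machine, whose archived URLs combine a 14-digit UTC timestamp with the original address. The links in this roundup may use other services, and the helper name `wayback_url` and the example address are hypothetical.

```python
from datetime import datetime, timezone

# Public prefix used by Wayback Machine replay URLs.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, captured_at: datetime) -> str:
    """Build a Wayback Machine-style archived URL for a capture time.

    `captured_at` should be timezone-aware; it is converted to UTC and
    formatted as YYYYMMDDhhmmss, the timestamp form the Wayback Machine uses.
    """
    timestamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical example address and capture time, for illustration only.
example = wayback_url(
    "https://example.org/news",
    datetime(2024, 6, 14, 16, 0, 0, tzinfo=timezone.utc),
)
print(example)
```

A URL built this way only resolves if a capture of that page actually exists in the archive.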

Coffee Chat Recap: WARC-GPT, Web Archiving, and AI

On June 14, 2024, the Web Archiving Section hosted Matteo Cargnelutti and Kristi Mukk from the Harvard Library Innovation Lab for a demo and discussion about WARC-GPT: an experimental open-source Retrieval Augmented Generation tool for exploring collections of WARC files using AI.

Blog post: https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/
WARC-GPT on Github: https://github.com/harvard-lil/warc-gpt

In addition to walking through the tool and explaining how it works, they discussed their experience testing out WARC-GPT with a small thematic collection of URLs, and shared their concept of “Librarianship of AI” and how web archivists can respond to this AI moment.

Recording

View the transcript.

Slide Deck

Thank you to all the attendees for an engaging discussion, and stay tuned for more coffee chats in the future!

Upcoming Coffee Chat: “WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI”

Join the SAA Web Archiving Section on Friday, June 14, 12:00-1:00pm Eastern for a discussion with Matteo Cargnelutti and Kristi Mukk from the Harvard Library Innovation Lab about web archiving and AI! 

Description:

Can the techniques used to ground and augment the responses provided by Large Language Models be used to help explore web archive collections? That question led us to develop and release WARC-GPT: an experimental open-source Retrieval Augmented Generation tool for exploring collections of WARC files using AI. WARC-GPT functions as a highly-customizable boilerplate the web archiving community can use to explore the intersection between web archiving and AI. Specifically, WARC-GPT is a RAG pipeline, which allows for the creation of a knowledge base out of a set of WARC files, which is later used to help answer questions asked to a Large Language Model (LLM) of the user’s choosing. In this session, we will demo the tool and explain how it works, discuss our experience testing it out so far, and share our perspective on how web archivists can respond to this AI moment.
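To make the general retrieval-augmented generation idea concrete, here is a toy sketch. It is not WARC-GPT's code: it assumes the page text has already been extracted from WARC records, it swaps in simple word-count similarity where a real pipeline would typically use an embedding model and a vector store, and it stops at building the prompt rather than calling an LLM. All function names and sample pages are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(text: str) -> Counter:
    """Represent text as word counts (a stand-in for a learned embedding)."""
    return Counter(tokenize(text))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_knowledge_base(pages: dict[str, str]) -> list[tuple[str, str, Counter]]:
    """Index each archived page (URL -> extracted text) for later retrieval."""
    return [(url, text, vectorize(text)) for url, text in pages.items()]


def retrieve(kb, question: str, k: int = 2) -> list[tuple[str, str]]:
    """Return the k passages most similar to the question."""
    q = vectorize(question)
    ranked = sorted(kb, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(url, text) for url, text, _ in ranked[:k]]


def build_prompt(question: str, passages: list[tuple[str, str]]) -> str:
    """Ground the question in retrieved passages before sending it to an LLM."""
    context = "\n".join(f"[{url}] {text}" for url, text in passages)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical extracted page text, standing in for WARC record contents.
pages = {
    "https://example.org/a": "The library opened a new reading room in 2019.",
    "https://example.org/b": "Campus dining hours changed during the summer term.",
    "https://example.org/c": "The reading room holds rare maps and manuscripts.",
}
kb = build_knowledge_base(pages)
question = "What does the reading room hold?"
top = retrieve(kb, question, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

The prompt that reaches the model contains only the two passages about the reading room, which is the grounding step that lets answers point back to specific archived pages.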

Blog post: https://lil.law.harvard.edu/blog/2024/02/12/warc-gpt-an-open-source-tool-for-exploring-web-archives-with-ai/
WARC-GPT on Github: https://github.com/harvard-lil/warc-gpt

Registration:

Please register in advance for this meeting: https://harvard.zoom.us/meeting/register/tJElceGrrDwpGdBcPae2eMWwrv4j_TqDmLcT

After registering, you will receive a confirmation email containing information about joining the meeting.

This presentation will be recorded, and the recording link will be made available afterward.

Upcoming Webinar: Oral History Indexing (OHI)

The SAA Web Archiving Section invites you to a virtual coffee chat with Doug Lambert of the University at Buffalo on Friday, April 26th at 12:00pm Eastern. Doug will discuss Oral History Indexing (OHI) and its relation to web archives, AI, and system interoperability.

Description:

“Oral history indexing (OHI) is a set of practices that emerged in recent decades for digital content management and multimedia presentation of large audio/video (A/V) collections. Driven by the desire to publish complete interviews (and made possible by computer-based media), oral historians working with web developers introduced a variety of custom interfaces for A/V access centered around thematically defined passages within digital files. Akin to an indexed book, OHI systems allow cross-referencing to specific points within media documents, describe content through natural language, and promote users browsing and exploring across collections. Different than transcript-based models for mapping content, OHI focuses on access to A/V directly and dynamically through meaningful metadata elements like segments, summaries, and terms controlled and uncontrolled, all methodically structured at the timecode level. Since the 1990’s, a range of approaches, methodologies and system attributes evolved led by a variety of libraries, museums, and other institutions. I inventoried some of these practices in 2023 in the Oral History Review and began to characterize a distinctive phase in OHI—a pre-AI era of software-enabled but fully human-performed indexing processes.

The 20+ year OHI enterprise as I defined it is arguably much more advanced than any sub-file level content management practices known in A/V web archiving, yet OHI sorely lacks metadata standardization or strategic system interoperability. In this talk, I will expand my 2023 inventory to characterize the fundamental elements of OHI across systems—a necessary baseline for future development. I will also talk about how the next generation of OHI will incorporate AI tools and I will invite a discussion on how the future of these practices can benefit from the involvement of experienced web archivists.”

Oral History Indexing by Doug Lambert: https://www.tandfonline.com/doi/full/10.1080/00940798.2023.2235000

Registration:

Please register in advance for this meeting: https://union.zoom.us/meeting/register/tJctcu2qrz0oH9FlXi_uhYVhvG7sFDiVHILW

After registering, you will receive a confirmation email containing information about joining the meeting.

Web Archiving Roundup: February/March 2024

Note: Our section created a Zotero library for web archiving resources, so you can find all the resources featured in our monthly web archiving roundups in one place! You do not need to create an account to access our Zotero library. Each month, we’ll be updating our Zotero library with new links. 

Here are a few quick links on interesting web archiving topics we’ve found recently: 

News: 

**What is an archived URL? An archived URL, or permalink, is a stable record of a website with a unique, permanent URL that points to the archived copy, so you can reference the source even if the original disappears from the web.

Web Archiving Roundup: January 2024

Note: Our section now has a Zotero library for web archiving resources, so you can find all the resources featured in our monthly web archiving roundups in one place! You do not need to create an account to access our Zotero library. Each month, we’ll be updating our Zotero library with new links. 

Here are a few quick links on interesting web archiving topics we’ve found recently: 

  • “Exploiting the untapped functional potential of Memento aggregators beyond aggregation” (live URL | archived URL)
  • “Migrating to pywb at the National Library of Luxembourg” (live URL | archived URL)
  • “Archival HTTP Redirection Retrieval Policies” (live URL | archived URL)
  • “Using Wayback Machine and Google Analytics to Uncover Disinformation Networks” (live URL | archived URL)
  • “Examining the Challenges in Archiving Instagram” (live URL | archived URL)
  • “Continuity and discontinuity in web archives: a multi-level reconstruction of the firsttuesday community through persistences, continuity spaces and web cernes” (live URL | archived URL)
  • “Updates to Memento Damage” (live URL | archived URL)
  • “Auditing Web Archiving Livestreams” (live URL | archived URL)
  • “Mozilla Report: How Common Crawl’s Data Infrastructure Shaped the Battle Royale over Generative AI” (live URL | archived URL)

New tools: 

News: 

  • “Google will no longer back up the Internet: Cached webpages are dead” (live URL | archived URL)
  • “DISCMASTER rises again” (live URL | archived URL). DISCMASTER offers users the ability to perform semantic search of thousands of shareware and compilation CD-ROMs at the Internet Archive (note: the site is not run by the Internet Archive)

A Tale of Two Tools: Archiving the University at Buffalo’s Web Periodicals

This post was written by Grace Trimper, Digital Archives Technician at the University at Buffalo.

University Archives has been collecting alumni magazines, school and departmental newsletters, student newspapers, and literary magazines for decades – likely since the department was established in 1964. For a long time, this meant getting on snail mail lists or walking to departments to pick up the print issue of whichever publication was just released. These days, collecting periodicals also looks like clicking “download” or copying and pasting a URL into a web archiving tool.

Our work with digital periodicals ramped up in 2020, partially because the pandemic caused more and more publications to be delivered online. Most of UB’s periodicals are now available both in print and as PDFs, which makes preservation relatively straightforward: we download the PDF, create Dublin Core metadata, and ingest the package into our digital preservation system, Preservica.

We also saw schools, departments, and organizations begin publishing strictly web-based content without print or PDF surrogates. One of the first web periodicals we added to our collection was “The Baldy Center Magazine,” the semesterly publication of the Baldy Center for Law and Social Policy. The Digital Archivist set up a web crawl using Preservica’s Web Crawl and Ingest workflow, which uses Internet Archive’s Heritrix web crawler to capture websites and their various subpages.

The main benefit of this approach is convenience. We set the workflow to make a predetermined number of hops and capture a maximum depth, so we can crawl the magazine and its linked pages without capturing UB’s entire website. Other than checking the resulting web archive file (WARC), there’s not much else we need to do once the workflow is run. When the web crawl is complete, the WARC automatically ingests into the collection’s working folder, and we can link it to the finding aid without any other intermediary steps.
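The effect of a hop limit can be pictured with a small sketch. This is not Preservica's or Heritrix's configuration; it is a plain breadth-first traversal over a hypothetical in-memory link graph, and the page names and the `crawl_within_hops` helper are made up for illustration.

```python
from collections import deque


def crawl_within_hops(
    links: dict[str, list[str]], seed: str, max_hops: int
) -> dict[str, int]:
    """Return {page: hops_from_seed} for every page reachable within max_hops."""
    hops = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue  # at the hop limit: capture the page but do not follow its links
        for target in links.get(page, []):
            if target not in hops:  # skip pages that are already scheduled
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops


# Hypothetical link graph: a magazine that links out to the wider university site.
site = {
    "magazine/": ["magazine/issue-1", "magazine/issue-2"],
    "magazine/issue-1": ["magazine/issue-1/article", "university/home"],
    "university/home": ["university/admissions"],
}
captured = crawl_within_hops(site, "magazine/", max_hops=2)
print(captured)
```

With a limit of two hops, the magazine, its issues, and pages linked directly from an issue are captured, while pages deeper in the university site are left out.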

However, we did notice the Heritrix web crawler struggles with some university webpages and does not always capture images and multimedia. The captured magazine looked almost right – it was just missing a few of the pictures. We learned that the renderer the system uses has trouble intercepting URLs for some of the images throughout the website. This is a known limitation, and the construction of UB’s website can be more complex than the system can handle.

We ran into similar obstacles when running weekly crawls on the website for UB’s student newspaper, The Spectrum. As an independent newspaper, its website is not hosted by the university. At first, this had some benefits: the Heritrix crawler didn’t have the image problems we saw with UB-hosted sites, and the rendered WARCs looked basically identical to the live site.

Then, they redid their website. Our captures haven’t looked good since. Certain fonts are wrong, and the navigation menu expands to cover 90% of the page any time you try to click on or read an article. It seems to be another issue with the renderer, so we continue to crawl the site as usual to make sure we don’t miss out on preserving new content.

Even though the Heritrix crawler worked well for some of our collections, it was costing us time and energy that could be spent on other projects. We needed another option. Enter Conifer.

Conifer is a free and open-source web archiving tool with a web interface. We have had better luck capturing images and multimedia with Conifer and Webrecorder. Like The Spectrum, UB’s undergraduate literary magazine, NAME, is hosted independently of UB’s website. Its construction is relatively straightforward: there’s a webpage for each issue with links to contributions, and the website is full of images. There are also a couple of technologically unique works on the site, including a link to a JavaScript multimedia piece.

When I crawled the site for preservation in UB’s Poetry Collection, Conifer had no problem capturing any of these, and the resulting WARCs display perfectly in our public user interface.

This approach doesn’t come without its drawbacks. Where the Web Crawl and Ingest workflow in Preservica is convenient and automatic, using Conifer can be tedious. First, there is no setting and forgetting; if you want to capture a complex website with various links and subpages, you must start the application and then open each link to every page you want to capture. If you have too many tabs open, the application can randomly stop in the middle of a crawl, leaving you to start all over again. On top of that, we have the extra steps of downloading, unzipping, and ingesting the WARC, plus manually copying and pasting the URL into the asset’s metadata before the captured page will display in the digital preservation system.

No approach has been perfect thus far, and I don’t expect it will be for a while. Web archiving technology is constantly growing and improving, and how we attack web archiving depends heavily on the material. But with the tools we have available to us, we can preserve important pieces of UB’s history we wouldn’t have been able to before. And that’s kind of the point, isn’t it?

Upcoming Webinar: WARC your Email: How Web Archives Works for Email Preservation

Join the Web Archiving Section on Jan 26, 2024 at noon EST for a discussion with Gregory Wiedeman about web archiving and email preservation!

Preserving email is a challenge. Not only are there proprietary and loosely structured formats, but CSS or image content within email might be hosted on external servers. PDFs are useful for access, but the lack of data structure in PDFs limits future use. Since email is built on web technologies, web archives can help ensure long term use of email, and WARCs may be an effective technical solution for email preservation. This talk will overview and demonstrate how the mailbagit tool converts email export files to WARCs and discuss the strengths of web archives for email preservation. Gregory Wiedeman is the university archivist in the M.E. Grenander Department of Special Collections & Archives at the University at Albany, SUNY where he helps ensure long-term access to the school’s public records. He oversees collecting, processing, and reference for the University Archives and supports born-digital collecting, web archives, and systems implementation for the department’s outside collecting areas.
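The externally hosted content mentioned in the talk description can be illustrated with a small sketch. This is not how mailbagit works; it uses only Python's standard library to list the remote images and stylesheets that an HTML email depends on, which are the pieces a plain export can lose. The sample message and the helper names are hypothetical.

```python
from email import message_from_string
from html.parser import HTMLParser
from urllib.parse import urlparse


class ExternalResourceFinder(HTMLParser):
    """Collect http(s) URLs referenced by <img src> and <link href> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "img" and attr_map.get("src"):
            candidate = attr_map["src"]
        elif tag == "link" and attr_map.get("href"):
            candidate = attr_map["href"]
        else:
            return
        # Embedded parts use schemes such as cid:, so only remote URLs are kept.
        if urlparse(candidate).scheme in ("http", "https"):
            self.urls.append(candidate)


def external_resources(raw_email: str) -> list[str]:
    """Return remote resource URLs found in the HTML parts of an email."""
    message = message_from_string(raw_email)
    finder = ExternalResourceFinder()
    for part in message.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            finder.feed(payload.decode(charset, errors="replace"))
    return finder.urls


# Hypothetical message: one remote stylesheet, one remote image, one embedded image.
SAMPLE = """\
From: news@example.org
To: reader@example.org
Subject: Newsletter
MIME-Version: 1.0
Content-Type: text/html; charset="utf-8"

<html><head><link rel="stylesheet" href="https://cdn.example.org/style.css"></head>
<body><img src="https://cdn.example.org/banner.png"><img src="cid:logo123"></body></html>
"""

remote = external_resources(SAMPLE)
print(remote)
```

Anything in that list lives outside the message itself, so it has to be captured while it is still online if the email is to render the same way later.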

Please register in advance for this meeting:
https://union.zoom.us/meeting/register/tJYvce2rrj0rH92yDNkKDGjPD0UCu8BxmHX8

After registering, you will receive a confirmation email containing information about joining the meeting.

When the Past Becomes Present: Reparative Description in NYU’s Web Collections

This post was written by Lizzy Zarate, Web Archives Student Assistant for NYU Archival Collections Management and Student Member of the Web Archiving Steering Committee. She is currently completing an MA in Archives & Public History at NYU.

Among the technical elements involved in web archiving, it’s easy to neglect the importance of description. It doesn’t take long to notice that an archived website is not playing videos or that the images are missing. It is harder to discern what is missing in a website’s description. Unlike a physical document, most websites are constantly changing. As such, it can be difficult to write descriptions that persist over time. Many of the websites in NYU’s collections were first captured in the 2010s; naturally, society has changed, and descriptive language should evolve with it. This consideration led me to wonder: how can we engage in reparative description work for NYU’s web collections?

In February of 2022, with the help of Web Archivist Nicole Greenhouse, I began researching best practices for inclusive and anti-oppressive description. While the resources I discovered were extremely helpful, I wasn’t able to find much guidance specifically geared towards web archives. Granted, many of the practices from traditional archival description can and should be applied, but there is still the problem of describing materials that can rapidly and drastically change at any time. With this consideration in mind, I utilized resources such as the Digital Transgender Archive Style Guide and Anti-Racist Description Resources to inform my work.1 As I began to comb through the web collections, it became clear that much of the reparative description would focus on revising languages of exclusion. 

Here’s one example. The Communications Workers of America website has been crawled 324 times since 2007. This is how the website was originally described:

“CWA, America’s largest communications and media union, represents over 700,000 men and women in both private and public sectors.”

The use of “men and women” implicitly erases nonbinary people and other individuals who don’t identify with these categories. To figure out a course of action, I started by visiting the most recent version of the archived website to read their current organizational biography, which referred to members as workers rather than in terms of their gender. Next, I used the history of the website itself to verify that the language I was using was faithful to the organization’s history. In the earliest crawls, the website had also used “men and women” to refer to its members. Using the archive, I was able to determine that this was changed in 2015.

Screenshot of CWA's "Profile & History" webpage captured in 2008.

CWA’s webpage in 2008 refers to its members as “men and women”

Screenshot of CWA's "About CWA" webpage captured in 2016.

CWA’s webpage in 2016 refers to its members as “workers”

Because of this change, I felt it was appropriate to broaden the language used in this description. Here is how I revised it:

“CWA, America’s largest communications and media union, represents over 700,000 workers in both private and public sectors.”

This is a small shift in wording, yet it has larger implications for the archive: our descriptive practices should not default to language such as “men and women” when we’re really just talking about a group of people, gender identity irrelevant. The value of archiving CWA’s website is to document the history of labor organizations. In this case, the language that was initially used actually ends up being a distraction from the primary function of the description. Much has been written in the field about archival silences. For web collections, this is present not only in whose websites we choose to collect, but in how we represent them.

Many of the descriptions I flagged belonged to entries in the Student Organizations collection. It appeared that most of the descriptions in this collection were reproduced from the organization’s own pages at the time of their first capture, which raised a few questions about gender-inclusive description. If a club for women referred to itself as an “all-female” group in 2013, what obligation did we have to preserve such language in 2023, if at all? Given that student members had written their own descriptions, what authority did I have to define the stance of their organization? After all, these descriptions were written by students who may have changed their views since then, but I am also a student. What if the work I was doing ended up later being seen as inadequate, the same way I was labeling theirs? I wasn’t entirely sure how to proceed.

In most cases, I tried to look up the club’s current page on the live web. Many had updated their information with trans-inclusive and gender-equitable language, so I could revise the description without qualms. Still, a few organizations remained that had either gone inactive or still retained this language. For these instances, I decided to keep the language as it was while adding quotation marks as needed. As written in Archival Collections Management’s Statement on Harmful Language, “While we have control over description of our collections, we cannot alter the content.”2 Making these changes avoided misrepresenting the position of the organization while clarifying that the language used did not necessarily align with the stance of our department. 

Working through this issue forced me to confront the idea that I had less power over the archive as a student worker. The choices that I made would directly impact how users interact with NYU’s web collections. They would also indirectly reflect ACM’s position on these topics. Consequently, I had to take responsibility for the choices I made in reparative description. I did so with the understanding that all description is iterative and no language could ever perfectly represent all the voices of one community.

Reparative description is often discussed in the classroom, but to engage in it practically as a student worker in web archives has helped clarify my own personal ethos as an archivist. The work is ongoing with no clear endpoint, but it is important to make the time and space for it within our daily work. As Dorothy Berry writes in “The House Archives Built”, “Our descriptive systems are often the first interaction patrons have with our institutions, and when the language and systems feel alienating, patrons will take what they need and leave the rest.”3 By repairing harmful descriptions where we see them, we can remove an unnecessary barrier to access for users of web archives.

References:

  1. “Anti-Racist Description Resources,” Archives for Black Lives in Philadelphia, Oct 2019. Accessed Dec 2023. https://archivesforblacklives.files.wordpress.com/2019/10/ardr_final.pdf.
    “DTA Style Guide,” Cailin Roles and Eamon Schlotterback, Fall 2020. Accessed 1 Dec 2023. https://docs.google.com/document/d/1qou1h4DLFQEZg4BIvXiEpGy_TI3rDnrJsPXCsRL-Ki8/edit.
  2. “Inclusive and Reparative Work”, Archival Collections Management, NYU Libraries, updated 4 Dec 2023. Accessed 4 Dec 2023. https://guides.nyu.edu/archival-collections-management/inclusive
  3. “The House Archives Built,” Dorothy Berry, 22 June 2021. Accessed 1 Dec 2023. https://www.uproot.space/features/the-house-archives-built
