Conifer Blog | Self-Hosted Archival Embeds

The embedded web page is everywhere. Many websites, social media in particular, allow users to embed content (text posts, videos, map widgets, or full social media timelines) into other pages located outside of their respective originating services. This practice has become a cornerstone of online discussion culture and criticism, political journalism, and brand-building, and it helps provide services to visitors of a page.

Ever since political discourse largely moved to social media, news sites have embedded tweets, Facebook posts, or other social media items directly into their articles.

But what if the embedded content changes after the fact? Depending on the social media platform, text posts might be edited, comments might be added or deleted, the whole account might change ownership, resulting in modified user names or profile images, or the content might be removed altogether, because of changed platform policies or by choice of the creator. In many cases, this will cause confusion about why the item was embedded in the first place, as the embedding page, for instance a piece of political journalism, will provide context that no longer applies to the item discussed. A common workaround for that problem has been forgoing embeds in favor of screenshots. However, screenshots lack the interactive depth and fidelity that a full embed provides.

In 2015, Rhizome published a blog post on archiving Vines in collaboration with the National Football Museum. This post references and embeds videos which at the time of publication were at high risk of being deleted from the Vine platform. An earlier version of Webrecorder was used to capture the Vine embeds.

Today, 5 of the 17 embedded videos are no longer online, and since 2017 the Vine platform has been in an uncertain state of availability, as it was officially discontinued. Had the article used the standard process of embedding Vines, about 30% of the content would no longer be available due to “embed rot,” and the remaining 70% would be at high risk of vanishing in the near future.

In 2015, the embedded Vines were preserved by publishing the blog post on rhizome.org, capturing the post including all embeds with an early version of webrecorder.io, and finally changing the embed codes in the live blog post to point to the archived version of the post at webrecorder.io.

From the perspective of an online publisher who wishes to maintain important embedded items, creating a partnership with a web archiving service provider or starting their own in-house web archiving program (for instance by deploying an internal instance of Webrecorder) might be quite a leap from just using standard embeds or screenshots.

But what if small web archival “snippets,” such as a single social media embed, could be directly served from the publisher’s own domain and rendered in the reader’s browser?

Client-side web archive rendering, introduced in a previous blog post, opens up interesting new possibilities: “micro web archives” can be handled like other types of files (images, MP3 audio, or PDF documents); web archival snippets can be bundled and managed together with news articles or any online publication that they’re referenced from; captures of social media embeds can be presented side-by-side with the live version of these resources, or several versions of the same resource can be embedded and contextualized.

Archival Embeds

A proof-of-concept of archival embeds is presented in this post, using a few examples from popular sites that provide embedding functionality.

The approach takes inspiration from the Memento Project’s Robust Links, which describes a way to “decorate” hyperlinks or entire pages with metadata pointing to web archives that should hold a copy of a resource at an exact point of revision. What if, instead of decorating only links, we could also decorate embeds and point to a copy of the embed in a micro web archive?

The prototype presented here suggests doing this via HTML <template> tags and custom data- attributes.

For example, code provided by Twitter to embed a single tweet looks like this:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Archiving a 5000-post Instagram account using <a href="https://twitter.com/webrecorder_io?ref_src=twsrc%5Etfw">@webrecorder_io</a> <a href="https://t.co/zzmzun5bKO">pic.twitter.com/zzmzun5bKO</a></p>&mdash; Michael Connor (@michael_connor) <a href="https://twitter.com/michael_connor/status/697466853900337152?ref_src=twsrc%5Etfw">February 10, 2016</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

A tool—even a browser extension—could take the embed code as input, create the WARC file, and produce the decorated archival embed code:

<template data-digest="855e512c5369e4274f96f58f2cdcf0fde4e718837021c7833083303b580ee832" data-archive-name="tweet" data-width="800px" data-height="550px" data-archive-file="./embeds/warcs/embedtest1.warc.gz">
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Archiving a 5000-post Instagram account using <a href="https://twitter.com/webrecorder_io?ref_src=twsrc%5Etfw">@webrecorder_io</a> <a href="https://t.co/zzmzun5bKO">pic.twitter.com/zzmzun5bKO</a></p>&mdash; Michael Connor (@michael_connor) <a href="https://twitter.com/michael_connor/status/697466853900337152?ref_src=twsrc%5Etfw">February 10, 2016</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
</template>

Either the WARC file would automatically become an asset of the target publication, or it could be manually uploaded. Authors would then use the archival embed code instead of the one provided by Twitter.
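The decorating step on its own is small enough to sketch. The following is a minimal illustration, not the tool itself: the function names `decorateEmbed` and `escapeAttr` and the shape of the `meta` object are assumptions made for this example, and only the data- attribute names are taken from the markup shown above. Creating the WARC file is out of scope here.

```javascript
// Hypothetical helper names; only the data- attribute names come from the
// archival embed markup shown in this post.

// Escape a value so it is safe inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/"/g, "&quot;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// embedHtml: the embed code exactly as copied from the platform.
// meta: { digest, archiveName, width, height, archiveFile } (assumed shape).
function decorateEmbed(embedHtml, meta) {
  const attrs = [
    ["data-digest", meta.digest],
    ["data-archive-name", meta.archiveName],
    ["data-width", meta.width],
    ["data-height", meta.height],
    ["data-archive-file", meta.archiveFile],
  ]
    .map(([name, value]) => `${name}="${escapeAttr(value)}"`)
    .join(" ");
  // The original embed code is kept untouched inside the <template> element.
  return `<template ${attrs}>\n${embedHtml}\n</template>`;
}

// Usage with the attribute values from the example above and a placeholder
// standing in for the platform's embed code.
const decorated = decorateEmbed(
  '<blockquote class="twitter-tweet">(embed code as provided by the platform)</blockquote>',
  {
    digest: "855e512c5369e4274f96f58f2cdcf0fde4e718837021c7833083303b580ee832",
    archiveName: "tweet",
    width: "800px",
    height: "550px",
    archiveFile: "./embeds/warcs/embedtest1.warc.gz",
  }
);
console.log(decorated);
```

Because the platform's embed code is passed through unchanged, the same markup can later serve as the live version of the embed.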

The creation of such a tool will be a topic for another post, but the end results might look like the examples below.

Examples of Archival Embeds from Common Sites

The examples work best in Chrome and Firefox, and the latest versions of Safari, but are still in the prototype stage. Each example provides an archived and a live version of an embed, selectable as tabs. Please try reloading if the examples don’t render at first.

If you’re having trouble seeing the embeds, please refer to this screencast GIF.

1. Twitter: Archived Tweet vs Live Tweet

In this tweet sample, the archived and live versions should look the same unless the live tweet was changed.

2. Twitter: Archived Tweet vs Deleted Tweet

Tweets cannot be edited, but they can be deleted. If an embedded tweet is deleted, the embedding page has no way to know, but an archived version can still persist:

3. Facebook: Archived Post vs Modified Post

Facebook allows altering posts after publication. The following example allows comparison between an archived and a live version:

4. Google Maps: Archived Map vs Live Map

Mapping services offer lots of possibilities to add custom data, like routes, landmarks, and descriptions. The API to inject these custom items might change; an archived snippet of a map embed can help preserve the work invested in the customization.

In other cases, names of places or the accuracy of map data can be a topic of discussion. An archived snippet of a map can provide comparison with the live map.

5. Instagram: Archived Video Post vs Live Post

Here’s an additional example of a more complex embed with video, along with the original:

Embeds in WARCs and Client-Side Rendering

All of the captured embeds in these examples are loaded from regular WARC files that can be handled by Webrecorder.io, https://wab.ac/, or any other WARC processing tool. The WARCs are stored on the same domain this blog is hosted on, blog.webrecorder.io.

Rendering of the archived embeds happens via JavaScript in the browser, using the replay system available more generally on wab.ac. This JavaScript framework is loaded from the blog’s domain as well and currently consists of the following four script files:

embedlookup.js – Main script which initializes archival embed <template> markup.
sw.js – The Service Worker script from wabac.js
brotliDecode.js – Brotli Decoder for Service Worker
wombat.js – client-side replay from https://github.com/webrecorder/wombat

There are no external dependencies on webrecorder.io or any other site. Everything is hosted as static files, with no server-side computation required.
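As a rough illustration of the first thing a script like embedlookup.js has to do, the sketch below collects the decorated <template> elements on a page and reads their data- attributes. It is not the actual embedlookup.js: the function name `findArchivalEmbeds` and the returned object shape are assumptions for this example. Registering the Service Worker and rendering the replay are left out.

```javascript
// Hypothetical sketch, not the actual embedlookup.js.
// root: a Document or Element, i.e. anything offering querySelectorAll.
function findArchivalEmbeds(root) {
  const templates = root.querySelectorAll("template[data-archive-file]");
  // The dataset property exposes data-archive-file as dataset.archiveFile,
  // data-archive-name as dataset.archiveName, and so on.
  return Array.from(templates, (template) => ({
    template,
    archiveFile: template.dataset.archiveFile,
    archiveName: template.dataset.archiveName,
    digest: template.dataset.digest,
    width: template.dataset.width,
    height: template.dataset.height,
  }));
}

// In a browser, list every archival embed found on the current page.
if (typeof document !== "undefined") {
  for (const embed of findArchivalEmbeds(document)) {
    console.log(embed.archiveName, embed.archiveFile);
  }
}
```

Keeping the lookup separate from the rendering means a page without any decorated templates would not need to load the replay scripts at all.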

Trusting Web Archives

The system provides a proof-of-concept for self-hosted archival embeds, a sort of “micro web archive.” But if anyone can host a micro web archive, how can one verify that the content is accurate and has not been tampered with? The WARC format itself provides no means of verification, so additional measures will need to be taken. Built-in verification methods are a precondition for distributed web archives to be taken seriously. Future work in this area should focus on exploring ways to make web archives verifiable and more trustworthy.
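One very basic measure can be sketched with the data-digest attribute from the archival embed markup. The example below assumes that the attribute holds a hex-encoded SHA-256 digest of the WARC file as stored; the post does not state what the digest covers, so this is an assumption, and the function names `sha256Hex` and `matchesDigest` are hypothetical. Such a check can only detect a mismatch between the markup and the file. Since both are served by the same publisher, it does not establish trust on its own.

```javascript
// Browsers and recent Node.js versions expose Web Crypto as a global;
// older Node.js versions need the fallback to the built-in crypto module.
const subtle = globalThis.crypto?.subtle ?? require("node:crypto").webcrypto.subtle;

// Returns the SHA-256 digest of `bytes` (ArrayBuffer or typed array)
// as a lowercase hex string.
async function sha256Hex(bytes) {
  const hash = await subtle.digest("SHA-256", bytes);
  return Array.from(new Uint8Array(hash), (b) =>
    b.toString(16).padStart(2, "0")
  ).join("");
}

// Compares the digest of the file bytes with the expected hex digest.
// Assumption: the expected value is a hex-encoded SHA-256 of the whole file.
async function matchesDigest(bytes, expectedHex) {
  return (await sha256Hex(bytes)) === expectedHex.toLowerCase();
}

// Possible browser usage for one decorated <template> element:
//   const response = await fetch(template.dataset.archiveFile);
//   const ok = await matchesDigest(await response.arrayBuffer(),
//                                  template.dataset.digest);
```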