Announcing pywb 2.0 release!
We’re happy to announce that pywb 2.0, an updated release of the Python open-source web archiving engine that powers Webrecorder, is finally available!
The 2.0 release of pywb represents a major refactoring and improvement of pywb, which has become the core engine that powers much of Webrecorder’s functionality. New documentation is also available!
Initially, pywb started out as a “Python Wayback” machine but has grown into a flexible, customizable framework for creating and replaying web archives.
Now, pywb has evolved to be a core component of the Webrecorder web archiving stack, while continuing to function as a powerful standalone web archiving tool.
New Features and Improvements
At its core, pywb provides state-of-the-art web archive replay. The server-side and client-side rewriting systems have been overhauled to support modern websites, including highly complex web applications. The server-side rewriting includes special improvements for handling HTML5-based video, JSONP requests, and non-GET HTTP requests, while the client-side system includes many overrides to support modern JS.
An extensible fuzzy matching rule system is designed to support replay of dynamically built URLs that may change from one viewing to another.
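To give a feel for what fuzzy matching means in practice, here is a minimal sketch of the general idea, not pywb's actual rule system. It assumes that a URL differs between viewings only in a few throwaway query parameters (such as a cache-busting timestamp), drops those, and compares what is left. The parameter names in VOLATILE_PARAMS and both function names are made up for illustration.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

# Hypothetical set of query parameters that change on every page load.
VOLATILE_PARAMS = {"_", "callback", "cb", "nocache"}


def fuzzy_key(url):
    """Reduce a URL to a key that ignores volatile query parameters."""
    parts = urlsplit(url)
    # Keep only the stable parameters, sorted so ordering does not matter.
    stable = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name not in VOLATILE_PARAMS
    )
    return f"{parts.netloc.lower()}{parts.path}?{urlencode(stable)}"


def fuzzy_lookup(requested_url, archived_urls):
    """Return the archived URLs whose stable key matches the requested URL."""
    wanted = fuzzy_key(requested_url)
    return [url for url in archived_urls if fuzzy_key(url) == wanted]
```

With this sketch, a request for http://example.com/api?_=1600000000&id=7 would match an archived http://example.com/api?id=7&_=1500000000, because only the ignored "_" parameter differs.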
A major addition to pywb is the ability to record as well as replay web archives.
For example, simply by running pywb --record --live, pywb can also act as a web capture/recording tool.
The recorder functionality forms the basis of Webrecorder’s symmetrical approach to web archiving.
pywb also supports dynamic updates to collections, metadata and UI templates. Indexes can be updated automatically when WARC files are added. New collections can be added and updated simply by adding files to a directory, with no restart of pywb required. An automatic “all” collection can aggregate results from multiple collections, while maintaining collection provenance.
Together with support for the Memento protocol, pywb can also be used to provide a full-fledged “Memento Aggregator”, aggregating resources from remote and local archives, and creating fallback chains for more complex behavior.
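The aggregation and fallback behavior described above can be pictured with a small sketch. This is an illustration under stated assumptions rather than pywb's implementation: each source is a hypothetical (name, lookup) pair, where lookup returns a list of 14-digit capture timestamps for a URL, and a fallback chain is simply a list of source groups tried in order.

```python
def aggregate(sources, url):
    """Merge captures from every source, tagging each with its origin."""
    merged = []
    for name, lookup in sources:
        for timestamp in lookup(url):
            # Record where the capture came from to preserve provenance.
            merged.append({"timestamp": timestamp, "source": name})
    # 14-digit timestamps sort chronologically as plain strings.
    return sorted(merged, key=lambda capture: capture["timestamp"])


def fallback(chain, url):
    """Return results from the first group of sources that has any capture."""
    for sources in chain:
        results = aggregate(sources, url)
        if results:
            return results
    return []
```

For instance, a chain of [[local], [remote]] would answer from the local archive when it has a capture and only consult the remote archive otherwise, while aggregate([local, remote], url) would interleave captures from both and keep track of which archive each one came from.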
The architecture of pywb is split into a few distinct components, including a Warcserver, Recorder (recording system) and Rewriter (rewriting system).
The Warcserver component can be run as a standalone server, and is a successor to the CDX Server API, providing an API for querying the web archive index, as well as retrieving full archival (WARC or ARC) records.
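As a rough illustration of what an index entry carries, the sketch below parses a single index line, assuming the CDXJ layout (a SURT-style URL key, a 14-digit timestamp, then a JSON block of fields). The sample values and the function names are invented for the example, and the exact set of fields in a real index may differ.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into its URL key, timestamp and JSON fields."""
    urlkey, timestamp, fields = line.rstrip("\n").split(" ", 2)
    record = json.loads(fields)
    record["urlkey"] = urlkey
    record["timestamp"] = timestamp
    return record


def record_location(record):
    """Return (filename, offset, length) pointing at the archived record."""
    return record["filename"], int(record["offset"]), int(record["length"])


# Made-up sample line, for illustration only.
sample = (
    'com,example)/ 20170306040348 '
    '{"url": "http://example.com/", "status": "200", '
    '"filename": "example.warc.gz", "offset": "0", "length": "1228"}'
)
entry = parse_cdxj_line(sample)
print(entry["url"], entry["timestamp"], record_location(entry))
```

The filename, offset and length fields are what connect an index query to the second half of the API: they say where in which WARC or ARC file the full record can be read.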
Thank you to Contributors
We also wanted to include a few shoutouts to contributors. pywb is an open-source project, and while we are the main developers, it is always great to receive contributions from outside developers.
We want to include a shoutout to John Berlin from ODU WS-DL for an important contribution to the robustness of client-side rewriting.
We also want to thank Fernando-Melo from arquivo.pt for contributing a much improved query UI, one of the many new features in this release.
And also props to Rebecca Cremona from Harvard LIL for fixing a vexing bug related to URL rewriting of image srcset attributes.
What else can you do with pywb? Thanks to new SOCKS proxy support, it can now be used to record and browse Tor sites. For a working example that does just that, try pywb-recorder-tor from our friend and colleague Raffaele Messuti.
Documentation
This version of pywb includes the most detailed documentation yet, but it is still a work in progress. A few outstanding features still need to be documented (tracked by this issue).
If you find anything that you’d like explained better, please let us know by opening an issue on GitHub. And if you would like to contribute (documentation or additional fixes), feel free to open an issue and a pull request.