Archive for the ‘Open Source’ Category

Archive-It and LOCKSS Interoperability!

July 21, 2009

The Archive-It team is excited to announce a successful transfer of Archive-It data from the Internet Archive data center into the LOCKSS network. The transfer was part of an Andrew W. Mellon Foundation project with the University of Rochester.

We are excited to be able to provide these and other preservation options to Archive-It partners as we increase the interoperability of the Archive-It service. If you are interested in learning more, please contact the Archive-It team. More information about the LOCKSS system can be found at www.lockss.org.

WARC File Format Published as an International Standard

June 3, 2009

An exciting announcement from the International Internet Preservation Consortium regarding the preservation file format generated by the Heritrix web crawler (used for all Archive-It and Internet Archive partner crawls):

The International Internet Preservation Consortium is pleased to
announce the publication of the WARC file format as an international
standard: ISO 28500:2009, Information and documentation — WARC file
format.

[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]

For many years, heritage organizations have tried to find the most
appropriate ways to collect and keep track of World Wide Web material
using web-scale tools such as web crawlers. At the same time, these
organizations were concerned with the requirement to archive very large
numbers of born-digital and digitized files. There was a need for a
container format that permits one file, simply and safely, to carry a very large
number of constituent data objects (of unrestricted type, including many
binary types) for the purpose of storage, management, and exchange.
Another requirement was that the container need only minimal knowledge
of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage
and store billions of resources collected from the web and elsewhere. It
is an extension of the ARC format
[http://www.archive.org/web/researcher/ArcFileFormat.php ], which has
been used since 1996 to store files harvested from the web. The WARC format
offers new possibilities, notably the recording of HTTP request headers,
the recording of arbitrary metadata, the allocation of an identifier for
every contained file, the management of duplicates and of migrated
records, and the segmentation of the records. WARC files are intended to
store every type of digital content, whether retrieved by HTTP or another
protocol.
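As an illustration of the container layout these features imply, here is a minimal sketch (Python, assuming WARC/1.0 header conventions; not official tooling) that assembles a single response record:

```python
# Minimal sketch of one WARC "response" record: a version line, named
# header fields, a blank line, the captured HTTP payload, and two
# trailing CRLFs so records can be concatenated into one file.
import uuid
from datetime import datetime, timezone

def build_warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Assemble one WARC/1.0 response record as raw bytes."""
    headers = [
        "WARC/1.0",
        "WARC-Type: response",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    head = ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8")
    # Two trailing CRLFs terminate the record.
    return head + http_payload + b"\r\n\r\n"

record = build_warc_response_record(
    "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhi")
```

The per-record identifier and payload length are what let tools manage duplicates and segmented records as described above.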

The motivation to extend the ARC format arose from the discussion and
experiences of the International Internet Preservation Consortium [
http://netpreserve.org/ ], whose core mission is to acquire, preserve
and make accessible knowledge and information from the Internet for
future generations. The IIPC Standards Working Group put forward to ISO
TC46/SC4/WG12 a draft presenting the WARC file format. The draft was
accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the
Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener,
collaborated closely with IIPC experts to improve the original draft.
WG12 will continue to maintain [http://bibnum.bnf.fr/WARC/ ] the
standard and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the
WARC format. It will help web archiving enter the mainstream
activities of heritage institutions and other branches by fostering the
development of new tools and ensuring the interoperability of
collections. Several applications are already WARC compliant, such as
the Heritrix [http://crawler.archive.org/ ] crawler for harvesting, the
WARC tools [http://code.google.com/p/warc-tools/ ] for data management
and exchange, the Wayback Machine
[http://archive-access.sourceforge.net/projects/wayback/ ], NutchWAX
[http://archive-access.sourceforge.net/projects/nutch/ ] and other
search tools [http://code.google.com/p/search-tools/ ] for access. The
international recognition of the WARC format and its applicability to
every kind of digital object will provide strong incentives to use it
within and beyond the web archiving community.

A press release is available on the IIPC website:
http://netpreserve.org/press/pr20090601.php

General information about the IIPC can be found at:
http://netpreserve.org

———————–
Abbie Grotke
Library of Congress
IIPC Communications Officer
netpreserve.org

Internet Archive at OSCON

July 24, 2008

Tomorrow, at the O’Reilly Open Source Convention in Portland, I’ll be presenting a session about our open source web archiving tools. Full details:

Build Your Own Web Archive: archive.org’s Open Source Tools to Crawl, Access & Search Web Captures
Gordon Mohr (Internet Archive, Web Group)
11:35am Friday, 07/25/2008
Web Applications
Location: E145

The Internet Archive, with support from other libraries around the world, has helped develop a collection of open source tools in Java to support web archiving. These include the Heritrix archival web crawler, “Wayback” for replaying historic web content, and extensions to Nutch for web archive full-text search. This session will explain the design and capabilities of these tools, and quickly demo their use for the creation of a small personal web archive.

Heritrix has been designed for faithful and complete content archiving but has also found use in other web search contexts. Wayback allows URL-based lookup and follow-up browsing of archived web content. Nutch, as applied to archival web crawls, allows Google-style full-text search of web content, including the same content as it changes over time. Together, they provide everything necessary to archive and access accurate historical records of web-published content.

Also: last month James Turner of O’Reilly Media spoke to me in advance of OSCON. You can read or hear the interview at: Gordon Mohr Takes Us Inside the Internet Archives.

Access to Around the World in 2 Billion Pages!

January 2, 2008

Thanks to a generous grant from the Mellon Foundation, Internet Archive completed a 2 billion page web crawl in 2007. This is the largest web crawl ever attempted by Internet Archive. The project was designed to take a global snapshot of the Web.

Please browse through the resulting collection.

Special thanks to the memory institutions who contributed URLs to the crawl. The crawl began with 18,000 websites from over 60 countries.

Internet Archive and Zotero

December 19, 2007

Internet Archive and Zotero will be joining forces thanks to a grant from the Andrew W. Mellon Foundation. Here are a few links to more information about this collaboration.

Dan Cohen’s Digital Humanities Blog
The Mason Gazette
The Chronicle of Higher Education

More details to follow as we begin our exciting work with Zotero.

Heritrix 2.0.0 Has Arrived!

December 7, 2007

The 2.0.0-RC1 Heritrix release includes full functionality from phase two of the smart crawler project as well as a major refactoring of the crawler interfaces. The goal of smart crawler phase 2 was to improve Heritrix capabilities for prioritizing URLs and sites, both via manual operator configuration and as an output of automated between-crawl analysis.

See below for more details on this release.

Release notes, with instructions to download and install, are at:

http://webteam.archive.org/confluence/display/Heritrix/2.0.0-RC1+Release+Notes

Four notable differences in Heritrix 2 are:

(1) A more rigorous separation of the Web UI from the ‘crawl engine’, giving greater flexibility to control crawlers remotely.
(2) A new settings system, easing module development and offering new opportunities for dynamic configuration construction.
(3) A new mechanism for custom override settings for sets of related URIs, extending beyond Heritrix 1.x’s domain-centric overrides.
(4) A new system for ordering URIs within a single URI-queue, and for allocating frontier effort among different URI-queues, based on assigned integer ‘precedence’ values.
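The precedence idea in point (4) can be sketched in miniature. This is illustrative Python, not Heritrix’s actual Java frontier; the assumption that a lower precedence value is served first follows the common priority-queue convention:

```python
# Toy frontier: each URI is scheduled with an integer 'precedence';
# the crawler draws the lowest precedence value first, with FIFO
# tie-breaking via a monotonically increasing counter.
import heapq
import itertools

class Frontier:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # preserves insertion order on ties

    def schedule(self, uri: str, precedence: int) -> None:
        heapq.heappush(self._heap, (precedence, next(self._seq), uri))

    def next_uri(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

f = Frontier()
f.schedule("http://example.com/deep/page", precedence=5)
f.schedule("http://example.com/", precedence=1)
f.schedule("http://example.com/about", precedence=1)
```

An operator or between-crawl analysis would assign the precedence values; the frontier simply honors them when allocating crawl effort.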

A tutorial of starting a basic crawl in the changed web UI is available at:

http://webteam.archive.org/confluence/display/Heritrix/2.0+Tutorial

Other updated documentation is not yet available but material will be improved on the wiki on an ongoing basis. Most settings and components from 1.x versions remain, though the on-disk settings format and job directory layout have changed somewhat. We are especially interested in whether people are able to use the web UI to duplicate and successfully launch crawl configurations equivalent to what they relied upon in 1.x.

Internet Archive at IWAW

June 21, 2007

On June 23rd Internet Archive will be presenting at the International Web Archiving Workshop (IWAW) in Vancouver.

Brad will be starting off the day of sessions with a presentation on the Wayback Machine. Here is the abstract from the paper Brad is presenting.

‘Wayback’ is an open-source, Java software package for browser-based access of archived web material, offering a variety of operation modes and opportunities for extension. In its basic, usual configuration it can both list available URL captures by date and offer recursive archive browsing starting from any capture. Advanced configurations offer better performance for challenging archived material and improved navigation.

‘Wayback’ is implemented as a collection of loosely coupled alternate implementations of core modules, for which an overview of each is provided. The functionality and implementation is also contrasted with its inspiration and predecessor, the Internet Archive’s classic public Wayback Machine software, and other ways of accessing archived web material. Finally, future directions for improvement are outlined.
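The URL-plus-date addressing the abstract describes can be made concrete with a small sketch. The host, port, and path prefix here are assumptions (deployments configure their own); the 14-digit timestamp convention follows the public Wayback Machine:

```python
# Sketch of Wayback-style replay addressing: a capture is identified
# by the original URL plus a YYYYMMDDhhmmss timestamp in the path.
from datetime import datetime

def replay_url(base: str, capture_time: datetime, original_url: str) -> str:
    ts = capture_time.strftime("%Y%m%d%H%M%S")  # 14-digit capture timestamp
    return f"{base}/{ts}/{original_url}"

url = replay_url("http://localhost:8080/wayback",
                 datetime(2007, 6, 23, 12, 0, 0),
                 "http://www.example.com/")
```

Listing captures by date then amounts to enumerating the timestamps stored for one URL; recursive browsing rewrites links within a replayed page to stay inside the archive.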

After 4pm Gordon will be giving updates on IA’s tool and format developments.

Please come and introduce yourself if you are attending the workshop!

Heritrix 1.12.0 – Crawling Smarter

March 17, 2007

We are excited to announce the release of Heritrix 1.12.0. It is available for download on sourceforge.

Release 1.12.0 is the first of several planned releases enhancing Heritrix with “smart crawler” functionality. The smart crawler project is a joint effort between Internet Archive, British Library, Library of Congress, the Bibliothèque Nationale de France and members of the IIPC (International Internet Preservation Consortium). This is the first year of a multi-year project.

The first stage of smart crawler aims to detect and avoid crawling duplicate content when crawling sites at regular intervals. The new release of Heritrix addresses this in two ways: first, by using a conditional GET when fetching pages from HTTP servers; second, if the responding server does not support conditional GET, Heritrix compares the new content hash with what has previously been crawled. Additional de-duplication features will be added later this year.
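The two strategies can be sketched simply. This is illustrative Python with a hypothetical helper, not Heritrix’s actual Java fetch-processor code: a 304 response means the server honored the conditional GET; otherwise the body’s digest is compared against the digest stored by the prior crawl.

```python
# Sketch of crawl-time de-duplication: conditional GET first,
# content-hash comparison as the fallback.
import hashlib

def is_duplicate(prev_record: dict, status_code: int, body: bytes) -> bool:
    """prev_record carries the digest stored for this URL by the prior crawl."""
    if status_code == 304:  # server honored If-Modified-Since / If-None-Match
        return True
    return hashlib.sha1(body).hexdigest() == prev_record.get("sha1")

prev = {"sha1": hashlib.sha1(b"<html>same</html>").hexdigest()}
```

Either way, a detected duplicate need not be stored again, which is what saves space on recurring crawls of slowly changing sites.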

Release 1.12.0 also includes updated WARC readers and writers to match the latest revision of the specification, 0.12 revision H1.12-RC1. WARC is the next generation archiving file format, a revision of the Internet Archive ARC file format. Please see the release notes for more information about these and other included features and bug fixes.

Subsequent phases of the smart crawler project will also focus on enhanced URL prioritization and crawling that is sensitive to the rate at which individual web pages change.

As always, all Heritrix code is open source. We are proud to help support the open source community. If you would like to get more involved or contribute code to Heritrix visit crawler.archive.org.

Around the World in 2 Billion Pages

March 9, 2007

In December 2006, Internet Archive was honored to receive a grant from the Mellon Foundation for our ongoing development of the Heritrix web crawler. Using this grant, Internet Archive will be embarking on a 2 billion page web crawl this summer. This will be the largest web crawl we have ever attempted.

We are currently seeking URL submissions for this historic crawl from libraries and archives as well as other cultural and memory institutions. We especially want international web content from a large variety of countries, geographic regions, and language bases.

Please help us gather this content! You will need a login name/password to contribute URLs. Please email aroundtheworld at archive.org for an invitation.

Confusion at The Register and Slashdot about the Wayback Machine

February 27, 2007

A recent story in The Register, as picked up by Slashdot, has created some mistaken impressions about the Internet Archive web archive.

To help clear things up:

Nothing related to the site www.iowaconsumercase.org has been “pulled” from the Wayback Machine.

As noted in our FAQ, it currently takes 6-12 months for crawled material to reach the Wayback Machine. According to whois domain information the iowaconsumercase.org domain was only registered on January 5, 2007. So, this site has not yet even had time to appear in our archive. We are working to reduce this lag, but for now: check back midyear.

If you try to access this site in the Wayback Machine (wayback: iowaconsumercase.org), you will currently see an accurate message reflecting that the site is simply not in the archive: “Sorry, no matches.”

If this site had been blocked by a site owner’s robots.txt file, as some have speculated, the message shown would reflect this. Here’s an example of a robots.txt exclusion. As of this writing, the iowaconsumercase.org site does not return any robots.txt.
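For reference, an exclusion of this kind is expressed with a short robots.txt file at the site root. A minimal hypothetical example, assuming the site owner targets the Archive’s documented `ia_archiver` user-agent:

```
# Hypothetical robots.txt that would exclude a site from the
# Internet Archive's crawler (user-agent: ia_archiver)
User-agent: ia_archiver
Disallow: /
```

No such file exists for iowaconsumercase.org as of this writing, which is why the robots.txt explanation does not apply here.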

If this site had been excluded by a site owner or copyright holder request, as others have speculated, the message shown would reflect that. Here’s an example of a publisher request exclusion.

As of this writing, no party to the Iowa case has made any such requests of the Internet Archive. Indeed, it would be silly for them to do so, since such a request would be both premature in one sense and too late in another. Premature, because it will be several months before any January content could appear in the Wayback Machine. Too late, because Groklaw, and possibly others, have indicated they will be hosting complete copies of the case material.

We are a small nonprofit library, with limited resources, and it is not part of the mission of the Internet Archive to preserve or offer access to material against the wishes of the material’s publisher or rightsholder. When the Internet Archive receives a bona fide request from a site owner or copyright holder, we handle it in accordance with the policies described and linked to in our FAQ. (See especially this, this, and our exclusion policy, which was created in collaboration with academic and legal experts.)

Other archives may have other policies. In partnership with institutions worldwide, the Internet Archive has created a number of free, open source tools that can be used by anyone to create their own web archives, including the Heritrix crawler, a new Wayback Machine, and the NutchWAX tools for full-text search of web archives with the Nutch search engine. These tools are now used by libraries and archives around the world.

The Internet Archive shares the concerns of The Register reporter and Slashdot commenters about the preservation of historically significant web content. We hope this post has clarified what hasn’t happened in this particular case, as well as how to understand what would happen in other situations. We archive what we can, while we can, but could always use a hand — effective web archiving will benefit from diverse approaches by many independent actors.