Archive for March, 2007

Iraq War Anniversary

March 19, 2007

With the anniversary of the Iraq War, it's hard not to think about the start of the war in 2003. One way to look back is to review news coverage of the start of the war in the Wayback Machine at http://www.archive.org. Listed below is a small sampling of news and government home pages from the first day of the war and the week following:

Arab News, a Saudi Arabian daily
Australia News Network
BBC
CNN
India Times
New York Times
San Francisco Chronicle

White House home page
Department of Defense

The general web archive is always growing. Check out our FAQ for information about submitting your own site to the collection.

Heritrix 1.12.0 – Crawling Smarter

March 17, 2007

We are excited to announce the release of Heritrix 1.12.0. It is available for download on SourceForge.

Release 1.12.0 is the first of several planned releases enhancing Heritrix with “smart crawler” functionality. The smart crawler project is a joint effort between Internet Archive, British Library, Library of Congress, the Bibliothèque Nationale de France and members of the IIPC (International Internet Preservation Consortium). This is the first year of a multi-year project.

The first stage of the smart crawler project aims to detect and avoid crawling duplicate content when crawling sites at regular intervals. The new release of Heritrix addresses this in two ways. First, it uses a conditional GET when fetching pages from HTTP servers. Second, if the responding server does not support conditional GET, Heritrix compares the hash of the new content with the hash of what was previously crawled. Additional de-duplication features will be added later this year.
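The two checks described above can be sketched roughly as follows. This is an illustration in Python, not Heritrix's own Java code: the function names are hypothetical, and the use of SHA-1 for the content hash is an assumption. Only the If-None-Match and If-Modified-Since request headers and the 304 Not Modified status are standard HTTP.

```python
import hashlib
from typing import Dict, Optional


def conditional_headers(etag: Optional[str],
                        last_modified: Optional[str]) -> Dict[str, str]:
    """Build conditional GET headers from validators saved on the previous crawl."""
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def content_digest(body: bytes) -> str:
    """Hash the response body (SHA-1 is an assumption made for this sketch)."""
    return hashlib.sha1(body).hexdigest()


def is_duplicate(status: int, body: bytes,
                 previous_digest: Optional[str]) -> bool:
    """Decide whether a fetched page repeats what was stored on the last crawl."""
    # A 304 means the server honored the conditional GET: nothing changed.
    if status == 304:
        return True
    # No earlier digest on record, so this is the first capture of the page.
    if previous_digest is None:
        return False
    # Fallback for servers that ignore conditional GET: compare content hashes.
    return content_digest(body) == previous_digest
```

A crawler following this pattern would send the headers from conditional_headers with each revisit, then pass the response status and body to is_duplicate before deciding whether to store a new copy.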

Release 1.12.0 also includes updated WARC readers and writers to match the latest revision of the specification, 0.12 revision H1.12-RC1. WARC is the next-generation archiving file format, a revision of the Internet Archive ARC file format. Please see the release notes for more information about these and other included features and bug fixes.

Subsequent phases of the smart crawler project will also focus on enhanced URL prioritization and crawling that is sensitive to the rate at which individual web pages change.

As always, all Heritrix code is open source. We are proud to help support the open source community. If you would like to get more involved or contribute code to Heritrix, visit crawler.archive.org.

Around the World in 2 Billion Pages

March 9, 2007

In December 2006, Internet Archive was honored to receive a grant from the Mellon Foundation for our ongoing development of the Heritrix web crawler. Using this grant, Internet Archive will be embarking on a 2 billion page web crawl this summer. This will be the largest web crawl we have ever attempted.

We are currently seeking URL submissions for this historic crawl from libraries and archives as well as other cultural and memory institutions. We especially want international web content from a large variety of countries, geographic regions and language bases.

Please help us gather this content! You will need a login name and password to contribute URLs. Please email aroundtheworld at archive.org for an invitation.