December 19, 2007

The Internet Archive and Zotero will be joining forces thanks to a grant from the Andrew W. Mellon Foundation. Here are a few links to more information about this collaboration.

More details to follow as we begin our exciting work with Zotero.

December 7, 2007

The Heritrix 2.0.0-RC1 release includes the full functionality from phase two of the smart crawler project, as well as a major refactoring of the crawler interfaces. The goal of phase two was to improve Heritrix's capabilities for prioritizing URLs and sites, both through manual operator configuration and as an output of automated between-crawl analysis.

See below for more details on this release.

Release notes, with instructions to download and install, are at:

Four notable differences in Heritrix 2 are:

(1) A more rigorous separation of the Web UI from the ‘crawl engine’, giving greater flexibility to control crawlers remotely.
(2) A new settings system, easing module development and offering new opportunities for dynamic configuration construction.
(3) A new mechanism for custom override settings for sets of related URIs, extending beyond Heritrix 1.x’s domain-centric overrides.
(4) A new system for ordering URIs within a single URI-queue, and for allocating frontier effort among different URI-queues, based on assigned integer ‘precedence’ values.
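
To make item (4) a little more concrete, here is a minimal sketch of precedence-based scheduling. It is not Heritrix code and does not use any Heritrix API: the class PrecedenceFrontier, its method names, and the convention that a lower integer is served first are all assumptions made for illustration only.

```python
import heapq
from collections import defaultdict
from itertools import count


class PrecedenceFrontier:
    """Toy frontier (hypothetical, not Heritrix's implementation).

    URIs are grouped into named queues. Each queue and each URI carries an
    integer precedence; this sketch assumes lower values are served first.
    """

    def __init__(self):
        # queue name -> heap of (uri_precedence, sequence, uri)
        self._queues = defaultdict(list)
        # queue name -> integer precedence, fixed when the queue is first seen
        self._queue_precedence = {}
        # sequence counter so equal precedences keep insertion order
        self._seq = count()

    def schedule(self, queue, uri, uri_precedence=1, queue_precedence=1):
        """Add a URI to a queue, creating the queue if needed."""
        self._queue_precedence.setdefault(queue, queue_precedence)
        heapq.heappush(self._queues[queue], (uri_precedence, next(self._seq), uri))

    def next_uri(self):
        """Return the next URI to crawl, or None when nothing is left."""
        ready = [name for name, items in self._queues.items() if items]
        if not ready:
            return None
        # Pick the queue with the lowest precedence value; break ties by the
        # insertion sequence of the URI at the head of each queue.
        best = min(
            ready,
            key=lambda name: (self._queue_precedence[name], self._queues[name][0][1]),
        )
        return heapq.heappop(self._queues[best])[2]
```

In this sketch, a queue with precedence 1 is drained before a queue with precedence 2, and within a queue a URI with precedence 1 is handed out before one with precedence 5. A real crawler would also need politeness delays and a way to change precedence between crawls, which are left out here.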

A tutorial on starting a basic crawl in the changed web UI is available at:

Other updated documentation is not yet available, but material will be improved on the wiki on an ongoing basis. Most settings and components from 1.x versions remain, though the on-disk settings format and job directory layout have changed somewhat. We are especially interested in whether people are able to use the web UI to duplicate and successfully launch crawl configurations equivalent to what they relied upon in 1.x.