Heritrix 2.0.0 Has Arrived!

The 2.0.0-RC1 Heritrix release includes the full functionality from phase two of the smart crawler project, as well as a major refactoring of the crawler interfaces. The goal of phase two was to improve Heritrix's capabilities for prioritizing URLs and sites, both through manual operator configuration and as an output of automated between-crawl analysis.

See below for more details on this release.

Release notes, with instructions to download and install, are at:

http://webteam.archive.org/confluence/display/Heritrix/2.0.0-RC1+Release+Notes

Four notable differences in Heritrix 2 are:

(1) A more rigorous separation of the Web UI from the ‘crawl engine’, giving greater flexibility to control crawlers remotely.
(2) A new settings system, easing module development and offering new opportunities for dynamic configuration construction.
(3) A new mechanism for custom override settings for sets of related URIs, extending beyond Heritrix 1.x’s domain-centric overrides.
(4) A new system for ordering URIs within a single URI-queue, and for allocating frontier effort among different URI-queues, based on assigned integer ‘precedence’ values.
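To make item (4) more concrete, here is a minimal sketch of precedence-based scheduling. It is not Heritrix code: the class name `PrecedenceFrontier`, its methods, the default value of 5, and the convention that a lower integer is served first are all assumptions made for illustration. It also drains the best queue completely before moving on, which ignores politeness delays and queue rotation that a real crawler would apply.

```python
import heapq
from collections import defaultdict

DEFAULT_QUEUE_PRECEDENCE = 5  # hypothetical default, chosen for this sketch only


class PrecedenceFrontier:
    """Toy frontier: URIs live in named queues, and both the URIs and the
    queues carry an integer precedence (assumed: lower value = served first)."""

    def __init__(self):
        # queue key -> heap of (uri_precedence, insertion_order, uri)
        self._queues = defaultdict(list)
        # queue key -> integer precedence of the whole queue
        self._queue_precedence = {}
        # monotonically increasing counter, keeps FIFO order among equal precedences
        self._seq = 0

    def set_queue_precedence(self, key, precedence):
        self._queue_precedence[key] = precedence

    def schedule(self, key, uri, precedence):
        """Add a URI to the queue named `key` with the given precedence."""
        heapq.heappush(self._queues[key], (precedence, self._seq, uri))
        self._seq += 1
        self._queue_precedence.setdefault(key, DEFAULT_QUEUE_PRECEDENCE)

    def next_uri(self):
        """Return the next URI to crawl, or None if every queue is empty."""
        ready = [k for k, q in self._queues.items() if q]
        if not ready:
            return None
        # Pick the non-empty queue with the best precedence; break ties by name.
        best = min(ready, key=lambda k: (self._queue_precedence[k], k))
        _, _, uri = heapq.heappop(self._queues[best])
        return uri


if __name__ == "__main__":
    frontier = PrecedenceFrontier()
    frontier.set_queue_precedence("example.org", 1)
    frontier.schedule("example.com", "http://example.com/a", 3)
    frontier.schedule("example.org", "http://example.org/deep", 7)
    frontier.schedule("example.org", "http://example.org/", 1)
    while (uri := frontier.next_uri()) is not None:
        print(uri)
```

In this sketch the two levels are independent: the queue precedence decides which queue receives effort, and the per-URI precedence decides the order inside that queue.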

A tutorial on starting a basic crawl in the revised web UI is available at:

http://webteam.archive.org/confluence/display/Heritrix/2.0+Tutorial

Other updated documentation is not yet available, but the material on the wiki will be improved on an ongoing basis. Most settings and components from the 1.x versions remain, though the on-disk settings format and the job directory layout have changed somewhat. We are especially interested in whether people can use the web UI to duplicate and successfully launch crawl configurations equivalent to those they relied upon in 1.x.

2 Responses to “Heritrix 2.0.0 Has Arrived!”

  1. Paul Jack Says:

    Hi, could you please post your request to the archive crawler mailing list? It’s at archive-crawler@yahoogroups.com, and it’s the official way to ask for help on Heritrix. Thanks!

  2. over Says:

    First of all, congratulations on your work!
    Now I’m trying to run the new Heritrix 2.0.0 RC1,
    and when I run it following the tutorial I get this error:

    $ sudo $HERITRIX_HOME/bin/heritrix -r -a admin
    WARNING: No $HERITRIX_HOME/conf/jmxremote.password found.
    WARNING: Disabling remote JMX.
    mié dic 12 21:16:16 CET 2007 Starting heritrix[: 195: ne: unexpected operator
    .kill: 195: No such process

    [: 195: ne: unexpected operator
    .kill: 195: No such process

    [: 195: ne: unexpected operator
    .kill: 195: No such process

    [: 195: ne: unexpected operator
    .
    Any idea what causes it? Yesterday I tried version 1.12.1 and it worked great, with no errors.
    Regards!
