Heritrix 1.12.0 – Crawling Smarter

We are excited to announce the release of Heritrix 1.12.0. It is available for download on SourceForge.

Release 1.12.0 is the first of several planned releases enhancing Heritrix with “smart crawler” functionality. The smart crawler project is a joint effort between Internet Archive, British Library, Library of Congress, the Bibliothèque Nationale de France and members of the IIPC (International Internet Preservation Consortium). This is the first year of a multi-year project.

The first stage of smart crawler aims to detect and avoid crawling duplicate content when crawling sites at regular intervals. The new release of Heritrix addresses this in two ways. First, it uses a conditional GET when fetching pages from HTTP servers. Second, if the responding server does not support conditional GET, Heritrix compares the hash of the new content with the hash recorded for the previous crawl. Additional de-duplication features will be added later this year.
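For readers unfamiliar with the two checks, here is a minimal sketch of the idea in Python. It is not Heritrix's actual code (Heritrix is written in Java); the names `PreviousFetch`, `conditional_headers` and `is_duplicate` are hypothetical, and the use of SHA-1 for the content hash is an assumption made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PreviousFetch:
    """What a crawler might remember about the last visit to a URL (hypothetical record)."""
    etag: Optional[str]           # ETag header from the earlier response, if any
    last_modified: Optional[str]  # Last-Modified header from the earlier response, if any
    sha1: str                     # hex digest of the earlier response body (assumed SHA-1)


def conditional_headers(prev: PreviousFetch) -> Dict[str, str]:
    """Build the request headers that turn a plain GET into a conditional GET."""
    headers = {}
    if prev.etag:
        headers["If-None-Match"] = prev.etag
    if prev.last_modified:
        headers["If-Modified-Since"] = prev.last_modified
    return headers


def is_duplicate(prev: PreviousFetch, status: int, body: bytes) -> bool:
    """Decide whether a new response repeats what was already crawled."""
    # Check 1: the server honoured the conditional GET and says nothing changed.
    if status == 304:
        return True
    # Check 2: the server sent a full response, so compare content hashes.
    return hashlib.sha1(body).hexdigest() == prev.sha1
```

A crawler would send `conditional_headers(prev)` with its request and then pass the response status and body to `is_duplicate`. A 304 response saves the transfer entirely, while the hash comparison only saves storage, since the body has already been downloaded.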

Release 1.12.0 also includes updated WARC readers and writers to match the latest revision of the specification, 0.12 revision H1.12-RC1. WARC is the next-generation archiving file format, a revision of the Internet Archive ARC file format. Please see the release notes for more information about these and other included features and bug fixes.

Subsequent phases of the smart crawler project will also focus on enhanced URL prioritization and crawling that is sensitive to the rate at which individual web pages change.

As always, all Heritrix code is open source. We are proud to help support the open source community. If you would like to get more involved or contribute code to Heritrix, visit crawler.archive.org.


3 Responses to “Heritrix 1.12.0 – Crawling Smarter”

  1. Gojomo Says:

    Ozh,

    Heritrix won’t try anchor-text, but will try strings in Javascript and some form element values — if those strings look like they might be relative URIs. (Chiefly, that means internal slashes or dots.) Sometimes that’s necessary to discover important content which is only linked to via Javascript, but it will also often generate 404s.

    (If this doesn’t explain the behavior you’re seeing, please post additional details, like an example URI.)

    – Gordon @ IA

  2. Ozh Says:

    This bot is a joke, right ?
    Bots with user agents containing string ‘heritrix’ generate an awful lot of 404 on my sites. Why ? Because they just can’t parse properly HTML : if on ‘/blog/‘ there’s a link with anchor text ‘go there‘, guess what ? The bot will try to index ‘/blog/go+there‘ … How smart is that ?

    My take on this bot :
    RewriteCond %{HTTP_USER_AGENT} heritrix
    RewriteRule .* - [F,L]

    Thanks.

  3. Terabanitoss Says:

    Hi all!
    You are The Best!!!
    Bye

Comments are closed.

