Archive for February, 2007

Confusion at The Register and Slashdot about the Wayback Machine

February 27, 2007

A recent story in The Register, as picked up by Slashdot, has created some mistaken impressions about the Internet Archive’s web archive.

To help clear things up:

Nothing related to the site www.iowaconsumercase.org has been “pulled” from the Wayback Machine.

As noted in our FAQ, it currently takes 6-12 months for crawled material to reach the Wayback Machine. According to whois domain information, the iowaconsumercase.org domain was only registered on January 5, 2007. So, this site has not yet even had time to appear in our archive. We are working to reduce this lag, but for now: check back midyear.

If you try to access this site in the Wayback Machine (wayback: iowaconsumercase.org), you will currently see an accurate message reflecting that the site is simply not in the archive: “Sorry, no matches.”

If this site had been blocked by a site owner’s robots.txt file, as some have speculated, the message shown would reflect this. Here’s an example of a robots.txt exclusion. As of this writing, the iowaconsumercase.org site does not return any robots.txt.
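For concreteness, here is the kind of robots.txt that produces that message. The Internet Archive’s crawler identifies itself as ia_archiver, so a site owner who wants a site kept out of the Wayback Machine publishes rules along these lines (a minimal illustration, not taken from any party in this case):

    User-agent: ia_archiver
    Disallow: /

With that file at the site’s root, the Wayback Machine shows a blocked-site message for the site rather than archived pages.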

If this site had been excluded by a site owner or copyright holder request, as others have speculated, the message shown would reflect that. Here’s an example of a publisher request exclusion.

As of this writing, no party to the Iowa case has made any such requests of the Internet Archive. Indeed, it would be silly for them to do so, since such a request would be both premature in one sense and too late in another. Premature, because it will be several months before any January content could appear in the Wayback Machine. Too late, because Groklaw, and possibly others, have indicated they will be hosting complete copies of the case material.

We are a small nonprofit library, with limited resources, and it is not part of the mission of the Internet Archive to preserve or offer access to material against the wishes of the material’s publisher or rightsholder. When the Internet Archive receives a bona fide request from a site owner or copyright holder, we handle it in accordance with the policies described and linked to in our FAQ. (See especially this, this, and our exclusion policy, which was created in collaboration with academic and legal experts.)

Other archives may have other policies. In partnership with institutions worldwide, the Internet Archive has created a number of free, open source tools that can be used by anyone to create their own web archives, including the Heritrix crawler, a new Wayback Machine, and the NutchWAX tools for full-text search of web archives with the Nutch search engine. These tools are now used by libraries and archives around the world.
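For a concrete sense of what these tools read and write: Heritrix stores crawled pages in ARC container files, in which each record is a one-line, space-separated header (URL first, capture timestamp third, body length in bytes last) followed by the document bytes. Here is a minimal sketch of walking those records, assuming an uncompressed ARC file (production archives are usually gzip-compressed) and nothing beyond that documented header layout; it is an illustration, not part of any of the tools named above:

    import sys

    def read_arc_records(path):
        # Walk an uncompressed ARC file: each record starts with a
        # space-separated header line, then exactly that many body bytes.
        with open(path, "rb") as f:
            while True:
                header = f.readline()
                if not header:
                    break                 # end of file
                if not header.strip():
                    continue              # blank line between records
                fields = header.decode("latin-1").split()
                url, date = fields[0], fields[2]
                body = f.read(int(fields[-1]))  # the archived document
                yield url, date, body

    # Print a one-line summary of every record in the file named on the
    # command line.
    for url, date, body in read_arc_records(sys.argv[1]):
        print(date, len(body), url)

This reads the version block at the head of the file as an ordinary record (its URL field begins with filedesc://), which is convenient for a summary like this one.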

The Internet Archive shares the concerns of The Register reporter and Slashdot commenters about the preservation of historically significant web content. We hope this post has clarified what hasn’t happened in this particular case, as well as how to understand what would happen in other situations. We archive what we can, while we can, but could always use a hand — effective web archiving will benefit from diverse approaches by many independent actors.

Warrick, a tool for recovering websites

February 22, 2007

Anyone who has used the Wayback Machine to recover web material they thought lost will be interested to know about Warrick, a free and open source tool for reconstructing websites from publicly available caches of old content. From the website:

Warrick is a command-line utility for reconstructing or recovering a website when a back-up is not available. Warrick will search the Internet Archive, Google, MSN, and Yahoo for stored pages and images and will save them to your filesystem. Warrick is most effective at finding cached content in search engines in the first several days after losing the website since the cached versions of pages tend to disappear once the search engine re-crawls your site and can no longer find the pages. Running Warrick multiple times over a period of several days or weeks can increase the number of recovered files because the caches fluctuate daily (especially Yahoo’s). Internet Archive’s repository is at least 6-12 months out of date, and therefore you will only find content from them if your website has been around at least that long. If they don’t have your website archived, you might want to run Warrick again in 6-12 months.

Warrick was created by Frank McCown, a PhD student at Old Dominion University. Thanks, Frank!

If you do lose web material and find yourself in need of Warrick, here’s an important tip: run it as soon as possible after you notice the loss, since it consults a number of search-engine caches that are likely to be both more recent and more ephemeral than the Archive’s public collection.

Indeed, try to run Warrick even before you begin rebuilding the site at its original URLs, because as soon as search engines see new content at those URLs, they will start replacing their cached versions with it. (It seems that while URLs return ‘404 – not found’ errors, the search-engine caches retain the last real content served, at least for a while.)
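A quick manual supplement while Warrick runs: the Wayback Machine’s standard URL scheme can be probed directly, since requesting http://web.archive.org/web/<timestamp>/<url> redirects to the capture closest to that timestamp when one exists. Below is a minimal sketch of that single lookup, not a substitute for (or any part of) Warrick; the helper name and the error handling are our own assumptions:

    import sys
    import urllib.error
    import urllib.request

    def fetch_wayback_copy(url, timestamp="2007"):
        # The classic Wayback URL form redirects to the capture nearest
        # the requested timestamp; a miss comes back as an HTTP error.
        wayback_url = "http://web.archive.org/web/%s/%s" % (timestamp, url)
        try:
            with urllib.request.urlopen(wayback_url) as response:
                return response.read()
        except urllib.error.HTTPError:
            return None  # nothing archived for this URL

    page = fetch_wayback_copy(sys.argv[1])
    if page is not None:
        sys.stdout.buffer.write(page)  # redirect stdout to a file to save it
    else:
        print("No archived copy found", file=sys.stderr)

Recovering a whole site this way means repeating the lookup for every URL you can remember or discover, which is exactly the legwork Warrick automates across several caches at once.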

Call for papers for the 7th International Web Archiving Workshop

February 22, 2007

Julien Masanès, program chair for the 2007 International Web Archiving Workshop, recently issued the workshop call for papers. From the call:

——————————————————
Objectives:

The main international event in this domain since 2001, IWAW will take place in the third week of June, this year in Canada (date and location to be confirmed).
The workshop will provide a cross-domain overview of active research and practice in all domains concerned with the acquisition, maintenance, and preservation of digital objects for long-term access, with a particular focus on web archiving and on studies of the effective use of such archives.

——————————————————
Important Dates:

Paper submission: May 1st, 2007
Notification of acceptance: May 15th, 2007
Camera-ready copy due: June 11th, 2007

Details for format and submission will be posted on iwaw.net soon.

——————————————————
Topics:

Case studies:
• Web Archiving Projects,
• Digital Archeology,
• Cyberculture Studies,
• Web Metrics,
• Web Publishing Models.

Data acquisition:
• Harvesting Technology, Focused Crawling,
• Deep Web Capture,
• Site Architecture Migration,
• Authenticity Control of Captured Documents,
• Acquisition of Dynamic Objects,
• Submission Systems,
• Data Ingest,
• Automated Metadata Capture.

Storage Models and Architecture:
• Hierarchical Storage Models,
• Redundant Storage,
• Distributed Storage,
• Storage Media Migration,
• Cost Models,
• Media Life-Time Analysis.

Digital Preservation:
• Conversion/Migration Strategies,
• Emulation Approaches,
• Data Abstraction Technologies,
• Self-Aware Objects,
• Testbeds, File Format Repositories,
• Document Functionality and Behaviour.

Access:
• Access Provision,
• Navigation,
• Web Indexing,
• Collection Analysis,
• Information Retrieval,
• Interface Models.

Policy and Social Issues:
• Economics of Information,
• Intellectual Property Rights,
• Challenges and Caveats of Web Archives,
• Scenarios and Visions,
• Privacy Aspects.

The Internet Archive has been a frequent contributor to past IWAW events and is looking forward to this year’s event — the first time it has been held in North America.

Archive-It 2.4!

February 14, 2007

On February 8, 2007, we released Archive-It 2.4!

The Archive-It service is an on-demand web archiving service used primarily by US state archives, state libraries, and university libraries. Subscribers can create, manage, search, and preserve collections of archived web pages. All collections are publicly accessible and full-text searchable at www.archive-it.org.

The production service turned 1 year old on February 8. In the first year of service, there have been 4 major feature releases and we have 3 more planned for 2007. These constant improvements are due in large part to feedback from our Archive-It partners.

New features in 2.4 include an annual crawl frequency and a dormant collection state. You can see a demo of the service at one of our upcoming informational webinars: Feb 20, Mar 6, and Mar 20, all at 11am PST. Email archive-it at archive.org for more information.