A recent story in The Register, as picked up by Slashdot, has created some mistaken impressions about the Internet Archive's web archive.
To help clear things up:
Nothing related to the site www.iowaconsumercase.org has been “pulled” from the Wayback Machine.
As noted in our FAQ, it currently takes 6-12 months for crawled material to reach the Wayback Machine. According to whois domain information, the iowaconsumercase.org domain was registered only on January 5, 2007, so this site has not yet had time to appear in our archive. We are working to reduce this lag, but for now: check back midyear.
If you try to access this site in the Wayback Machine (wayback: iowaconsumercase.org), you will currently see an accurate message reflecting that the site is simply not in the archive: “Sorry, no matches.”
If this site had been blocked by a site owner’s robots.txt file, as some have speculated, the message shown would reflect this. Here’s an example of a robots.txt exclusion. As of this writing, the iowaconsumercase.org site does not return any robots.txt.
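For readers who want to see what a robots.txt exclusion looks like in practice, here is a minimal Python sketch using the standard library's `urllib.robotparser`. It is an illustration only, not a description of how the Wayback Machine evaluates exclusions internally. The `ia_archiver` user-agent name, the sample rules, the example.org URL, and the helper name `is_blocked` are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, user_agent: str = "ia_archiver") -> bool:
    """Return True if the given robots.txt text disallows `url` for `user_agent`.

    Hypothetical helper for illustration; the default user-agent name is an
    assumption, not something stated in this post.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# A sample robots.txt that excludes the whole site for one crawler.
EXCLUDING_RULES = """\
User-agent: ia_archiver
Disallow: /
"""

# A site that serves no robots.txt rules at all (as with the site discussed here).
NO_RULES = ""

if __name__ == "__main__":
    page = "http://example.org/page.html"
    print("excluded site blocked:", is_blocked(EXCLUDING_RULES, page))
    print("no robots.txt blocked:", is_blocked(NO_RULES, page))
```

With no rules present, nothing is disallowed, which matches the situation described above: a site without a robots.txt cannot be missing from an archive because of a robots.txt exclusion.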
If this site had been excluded by a site owner or copyright holder request, as others have speculated, the message shown would reflect that. Here’s an example of a publisher request exclusion.
As of this writing, no party to the Iowa case has made any such requests of the Internet Archive. Indeed, it would be silly for them to do so, since such a request would be both premature in one sense and too late in another. Premature, because it will be several months before any January content could appear in the Wayback Machine. Too late, because GrokLaw, and possibly others, have indicated they will be hosting complete copies of the case material.
We are a small nonprofit library, with limited resources, and it is not part of the mission of the Internet Archive to preserve or offer access to material against the wishes of the material’s publisher or rightsholder. When the Internet Archive receives a bona fide request from a site owner or copyright holder, we handle it in accordance with the policies described and linked to in our FAQ. (See especially this, this, and our exclusion policy, which was created in collaboration with academic and legal experts.)
Other archives may have other policies. In partnership with institutions worldwide, the Internet Archive has created a number of free, open source tools that can be used by anyone to create their own web archives, including the Heritrix crawler, a new Wayback Machine, and the NutchWAX tools for full-text search of web archives with the Nutch search engine. These tools are now used by libraries and archives around the world.
The Internet Archive shares the concerns of The Register's reporter and Slashdot commenters about the preservation of historically significant web content. We hope this post has clarified what hasn’t happened in this particular case, as well as how to understand what would happen in other situations. We archive what we can, while we can, but could always use a hand: effective web archiving will benefit from diverse approaches by many independent actors.