Author Archive

Wayback Machine & Web Archiving Open Thread, April 2011

April 7, 2011

Anything you want to know or discuss about the Wayback Machine or the Internet Archive’s web archive? This is the place!

What do you want to know about the Wayback Machine and Internet Archive web archive? Do you have problems, concerns, suggestions? This is the place!

If your comment is a question, please check the classic Wayback Machine Frequently-Asked-Questions (FAQ) or new Wayback Machine FAQ site to see if your question has already been addressed before posting.

A few other things to note before posting:

Everything else? Fire away!

Updated Wayback Machine in Beta Testing

January 24, 2011

A new, improved version of the Wayback Machine, with an updated interface and fresher index of archived content, is now available for public testing at:

http://waybackmachine.org

Note that during the beta test period, the availability and functionality of the new service will fluctuate as issues are discovered and addressed.

The classic Wayback Machine will remain in concurrent operation for a period, for comparing functionality, but may not receive any further index updates. (It received its last major update in 2008, with only small piecemeal updates since.) So, please use the new site for accessing material from recent years. For a mixture of technical and policy reasons, most material will still appear 6 months or more after collection.

For more information, see the new Wayback Frequently-Asked-Questions (FAQ) site.

Thank you for your patience while this long-awaited update was under development!

Last Open Thread of 2010 (November-December)!

November 24, 2010

Time again for an open thread for your questions and comments about the Wayback Machine and Internet Archive web archiving!

If your comment is a question, please check the Wayback Machine Frequently-Asked-Questions (FAQ) to see if your question has already been addressed there before posting.

Also, I’m about to shut down the older forum, which has been very hard to guard against spam and hard to search or link to, so we may see an influx of new commenters here.

A few key things to note before you post:

Everything else? Fire away!

(The next new open thread should be started in January.)

Wayback Machine & Web Archiving Open Thread, September 2010

September 7, 2010

Time for another open thread!

What do you want to know about the Wayback Machine and Internet Archive web archive? Do you have problems, concerns, suggestions? This is the place!

If your comment is a question, please check the Wayback Machine Frequently-Asked-Questions (FAQ) to see if your question has already been addressed before posting.

A few other things to note before posting:

Everything else? Fire away!

Wayback Machine & Web Archiving Open Thread, July 2010

July 6, 2010

Anything you want to know or discuss about the Wayback Machine or the Internet Archive’s web archive? This is the place!

We’re trying something new here. Our classic forum is a bit clunky by modern discussion standards. For posters, it’s hard to browse or search the archives, leading to a lot of repetitive questions. For IA staff, the options for participation and moderation are limited.

This is the first of what’s planned as a new ‘open thread’ each month, here on this blog. For the month, it’s where feedback, discussion, and questions about the web archive and Wayback Machine should be directed.

If your comment is a question, please check the Wayback Machine Frequently-Asked-Questions (FAQ) to see if your question has already been addressed before posting.

A few other things to note before posting:

Everything else? Fire away!

Searching the dawn of the 21st Century

October 7, 2008

What was the web of the past really like?

Last Tuesday, Google unveiled a unique new web search, 2001 Google, as part of their 10th birthday celebration.

Using an actual archived version of their search engine index from January 2001, the service answers queries more-or-less how Google did back then — same results, same ranking, same summary ‘snippets’.

But of course, many of those result pages have changed or disappeared entirely since then — and that’s where the Internet Archive’s Wayback Machine comes in. For many of the 2001 search results, the best or only view comes from the Wayback Machine, which Google has helpfully provided in lieu of the usual ‘cached version’ links.

The combination of authentic Google search and the Wayback’s giant web archive is more powerful than either alone: finding needles lost in the Wayback haystack, showing actual prior rankings/popularity of pages for real queries, and highlighting material that would have been lost forever without purposeful public-interest archiving.

We thank Google for this chance to work together and highlight our web archive. Google plans to leave the 2001 search up for one month, and we’ll talk more about what we’ve learned from this service in a future blog post.

In the meantime, try the 2001 Google Search!

Internet Archive at OSCON

July 24, 2008

Tomorrow, at the O’Reilly Open Source Convention in Portland, I’ll be presenting a session about our open source web archiving tools. Full details:

Build Your Own Web Archive: archive.org’s Open Source Tools to Crawl, Access & Search Web Captures
Gordon Mohr (Internet Archive, Web Group)
11:35am Friday, 07/25/2008
Web Applications
Location: E145

The Internet Archive, with support from other libraries around the world, has helped develop a collection of open source tools in Java to support web archiving. These include the Heritrix archival web crawler, “Wayback” for replaying historic web content, and extensions to Nutch for web archive full-text search. This session will explain the design and capabilities of these tools, and quickly demo their use for the creation of a small personal web archive.

Heritrix has been designed for faithful and complete content archiving but has also found use in other web search contexts. Wayback allows URL-based lookup and follow-up browsing of archived web content. Nutch, as applied to archival web crawls, allows Google-style full-text search of web content, including the same content as it changes over time. Together, they provide everything necessary to archive and access accurate historical records of web-published content.

Also: last month James Turner of O’Reilly Media spoke to me in advance of OSCON. You can read or hear the interview at: Gordon Mohr Takes Us Inside the Internet Archives.

Worldwide Wayback Machine Updated: 25% Larger

July 2, 2007

Our primary web archive — the Worldwide Wayback Machine accessible at web.archive.org — has just finished a major index update, meaning that many new months of recent web crawls are now viewable.

Some material from as late as April 2007 is live, and the overall index has grown by about 25%. If you wanted more recent material for any site, give your lookups another try.

Confusion at The Register and Slashdot about the Wayback Machine

February 27, 2007

A recent story in The Register, as picked up by Slashdot, has created some mistaken impressions about the Internet Archive web archive.

To help clear things up:

Nothing related to the site www.iowaconsumercase.org has been “pulled” from the Wayback Machine.

As noted in our FAQ, it currently takes 6-12 months for crawled material to reach the Wayback Machine. According to whois domain information, the iowaconsumercase.org domain was only registered on January 5, 2007. So, this site has not yet even had time to appear in our archive. We are working to reduce this lag, but for now: check back midyear.
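To make the timing concrete, here is a rough sketch of the arithmetic behind "check back midyear". It assumes the 6-12 month lag quoted above and treats the whois registration date as the earliest possible crawl date; the `add_months` helper is purely illustrative and is not part of any Internet Archive tool. Real crawls would happen some time after registration, so the actual window could start later.

```python
import calendar
from datetime import date


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month
    (for example, August 31 plus 6 months gives the end of February).
    """
    total = start.month - 1 + months
    year = start.year + total // 12
    month = total % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


# Registration date from the whois record mentioned in the post.
registered = date(2007, 1, 5)

# Assumed lag between crawl and availability: 6 to 12 months.
earliest = add_months(registered, 6)
latest = add_months(registered, 12)

print(f"Earliest plausible appearance: {earliest.isoformat()}")
print(f"End of the 12-month window:    {latest.isoformat()}")
```

Under these assumptions, nothing from a site first registered in January could be expected in the Wayback Machine before the middle of the year.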

If you try to access this site in the Wayback Machine (wayback: iowaconsumercase.org), you will currently see an accurate message reflecting that the site is simply not in the archive: “Sorry, no matches.”

If this site had been blocked by a site owner’s robots.txt file, as some have speculated, the message shown would reflect this. Here’s an example of a robots.txt exclusion. As of this writing, the iowaconsumercase.org site does not return any robots.txt.
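For readers who want to check for themselves whether a given robots.txt would exclude a crawler, the following is a minimal sketch using Python's standard `urllib.robotparser`. It is not the Archive's own exclusion logic. The `ia_archiver` user-agent string, the example.org URL, and the `is_blocked` helper are assumptions chosen for illustration, and the robots.txt text is supplied inline so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if `robots_txt` disallows `user_agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# A hypothetical robots.txt that excludes one crawler from the whole site.
EXAMPLE_ROBOTS = """\
User-agent: ia_archiver
Disallow: /
"""

page = "http://www.example.org/page.html"

# The named crawler is excluded by the rules above.
blocked = is_blocked(EXAMPLE_ROBOTS, "ia_archiver", page)

# A crawler with no matching rule is not excluded.
other_blocked = is_blocked(EXAMPLE_ROBOTS, "otherbot", page)

# An empty robots.txt excludes nobody.
empty_blocked = is_blocked("", "ia_archiver", page)

print(blocked, other_blocked, empty_blocked)
```

A site that returns no robots.txt at all, as described above, corresponds most closely to the empty case: there are no rules, so there is no exclusion to report.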

If this site had been excluded by a site owner or copyright holder request, as others have speculated, the message shown would reflect that. Here’s an example of a publisher request exclusion.

As of this writing, no party to the Iowa case has made any such requests of the Internet Archive. Indeed, it would be silly for them to do so, since such a request would be both premature in one sense and too late in another. Premature, because it will be several months before any January content could appear in the Wayback Machine. Too late, because GrokLaw, and possibly others, have indicated they will be hosting complete copies of the case material.

We are a small nonprofit library, with limited resources, and it is not part of the mission of the Internet Archive to preserve or offer access to material against the wishes of the material’s publisher or rightsholder. When the Internet Archive receives a bona fide request from a site owner or copyright holder, we handle it in accordance with the policies described and linked to in our FAQ. (See especially this, this, and our exclusion policy, which was created in collaboration with academic and legal experts.)

Other archives may have other policies. In partnership with institutions worldwide, the Internet Archive has created a number of free, open source tools that can be used by anyone to create their own web archives, including the Heritrix crawler, a new Wayback Machine, and the NutchWAX tools for full-text search of web archives with the Nutch search engine. These tools are now used by libraries and archives around the world.

The Internet Archive shares the concerns of The Register reporter and Slashdot commenters about the preservation of historically significant web content. We hope this post has clarified what hasn’t happened in this particular case, as well as how to understand what would happen in other situations. We archive what we can, while we can, but could always use a hand: effective web archiving will benefit from diverse approaches by many independent actors.

Warrick, a tool for recovering websites

February 22, 2007

Anyone who has used the Wayback Machine to recover web material they thought lost will be interested to know about Warrick, a free and open source tool for reconstructing websites using publicly-available caches of old content. From the website:

Warrick is a command-line utility for reconstructing or recovering a website when a back-up is not available. Warrick will search the Internet Archive, Google, MSN, and Yahoo for stored pages and images and will save them to your filesystem. Warrick is most effective at finding cached content in search engines in the first several days after losing the website since the cached versions of pages tend to disappear once the search engine re-crawls your site and can no longer find the pages. Running Warrick multiple times over a period of several days or weeks can increase the number of recovered files because the caches fluctuate daily (especially Yahoo’s). Internet Archive’s repository is at least 6-12 months out of date, and therefore you will only find content from them if your website has been around at least that long. If they don’t have your website archived, you might want to run Warrick again in 6-12 months.

Warrick was created by Frank McCown, a PhD student at Old Dominion University. Thanks, Frank!

If you do face a loss of web material and find yourself in need of Warrick to recover it, here’s an important tip: run Warrick as soon as possible after the loss is noticed, as it consults a number of search-engine caches which are likely to be both more recent and more ephemeral than the Archive’s public collection.

Indeed, you should try to run Warrick even before starting to reconstruct your website in place at the original URLs, because as soon as search engines see new content at the same URLs, they’ll start replacing their cached versions with the new content. (It seems that when URLs are responding with ‘404 – not found’ errors, the search engine caches retain the last real content returned, at least for a while.)