Anyone who has used the Wayback Machine to recover web material they thought lost will be interested in Warrick, a free and open-source tool for reconstructing websites from publicly available caches of old content. From the website:
Warrick is a command-line utility for reconstructing or recovering a website when a back-up is not available. Warrick will search the Internet Archive, Google, MSN, and Yahoo for stored pages and images and will save them to your filesystem. Warrick is most effective at finding cached content in search engines in the first several days after losing the website since the cached versions of pages tend to disappear once the search engine re-crawls your site and can no longer find the pages. Running Warrick multiple times over a period of several days or weeks can increase the number of recovered files because the caches fluctuate daily (especially Yahoo’s). Internet Archive’s repository is at least 6-12 months out of date, and therefore you will only find content from them if your website has been around at least that long. If they don’t have your website archived, you might want to run Warrick again in 6-12 months.
Warrick was created by Frank McCown, a PhD student at Old Dominion University. Thanks, Frank!
If you do lose web material and need Warrick to recover it, here’s an important tip: run Warrick as soon as possible after you notice the loss, because it consults a number of search-engine caches that are likely to be both more recent and more ephemeral than the Archive’s public collection.
Indeed, you should try to run Warrick even before starting to reconstruct your website in place at the original URLs, because as soon as search engines see new content at the same URLs, they’ll start replacing their cached versions with the new content. (It seems that when URLs are responding with ‘404 – not found’ errors, the search engine caches retain the last real content returned, at least for a while.)
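Since the Warrick site suggests running the tool several times over days or weeks, you may end up with several partial recoveries to combine. The following is only a rough sketch of one way to do that, and it is not part of Warrick: it assumes you saved each run into its own directory, and the function name merge_runs and the directory layout are made up for illustration. It keeps the first copy of each file it sees, on the assumption that earlier runs are more likely to hold the original content.

```python
import shutil
from pathlib import Path


def merge_runs(run_dirs, merged_dir):
    """Copy files from each run directory into merged_dir.

    Hypothetical helper, not part of Warrick. Runs are processed in the
    order given, and a file already present in merged_dir is never
    overwritten, so the earliest run that recovered a file wins.
    Returns the relative paths that were newly added.
    """
    merged = Path(merged_dir)
    added = []
    for run_dir in run_dirs:
        run = Path(run_dir)
        for src in sorted(run.rglob("*")):
            if not src.is_file():
                continue  # skip directories; they are created as needed
            rel = src.relative_to(run)
            dest = merged / rel
            if dest.exists():
                continue  # an earlier run already supplied this file
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)
            added.append(rel.as_posix())
    return added
```

Calling merge_runs(["run-day1", "run-day4"], "recovered") with two such (hypothetical) directories would fill "recovered" with everything from the first run plus any files that only the second run found. If you would rather prefer the most recent copy, pass the directories in reverse order.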