Wayback Machine & Web Archiving Open Thread, July 2010

Anything you want to know or discuss about the Wayback Machine or the Internet Archive’s web archive? This is the place!

We’re trying something new here. Our classic forum is a bit clunky by modern discussion standards. For posters, it’s hard to browse or search the archives, leading to a lot of repetitive questions. For IA staff, the options for participation and moderation are limited.

This is the first of what’s planned as a new ‘open thread’ each month, here on this blog. For the month, it’s where feedback, discussion, and questions about the web archive and Wayback Machine should be directed.

If your comment is a question, please check the Wayback Machine Frequently-Asked-Questions (FAQ) to see if your question has already been addressed before posting.

A few other things to note before posting:

Everything else? Fire away!

22 Responses to “Wayback Machine & Web Archiving Open Thread, July 2010”

  1. jeroxma1977 Says:

    I replied earlier to this thread with concerns that sites couldn’t be removed once added to the Archive.Bibalex.org mirror.

    Well, I have good news. For anybody wondering, it is possible to have sites removed from Archive.Bibalex.org (the Wayback Machine’s mirror). It’s just not very clear how it’s done. (It’s pretty clear how a site is removed from the original Wayback Machine on Archive.org’s FAQ page…)

    What you need to do is click on the envelope located under their main banner. This will open a form you can use to contact them.

    The form will ask for your email address and name. Fill these out as you wish. In the “Feedback on:” drop-down, select projects. In the category drop-down, select “Advancing Science and Technology”. In the title drop-down, select “The Internet Archive”.

    It took them around 2 or 3 months to reply to me and have my websites removed but they did so without question. I recommend linking to the removed sites in the original Wayback Machine, so as to avoid any unnecessary lag in communication.

    Perhaps these instructions could be reposted somewhere so people like me don’t have this problem again.

  2. kennethyap1202 Says:

    How can I upload my Ning site into the Wayback Machine? I am one of those who will be affected by the shutdown of their free service, so I want to upload my website so that I can still see it even if my site is one of those closed down.

    • gojomo Says:

      The Wayback Machine doesn’t accept direct content uploads, but if a site is in imminent danger of going away we’d like to hear about it and can often do an extra crawl to improve/update whatever’s already archived for that site.

      What’s the cutoff date for free Ning sites, and is there a master directory we could consult anywhere?

  3. zakrhino Says:

    Hi Gojomo, I have a website that has been around for a good eight or so years. When I go to the Wayback search, it says it was last archived on March 24, 2008. It’s been around three years, and I was wondering why the archive would not update when the site has been around for so long?
    Thanks

  4. I.M.O.G. Says:

    Taking a moment to just say thanks!

    I was originally a bit concerned about our site in relation to the Internet Archive. Overclockers.com, in spite of a fairly consistent crawl history going all the way back to 1999, hasn’t had any updates since undergoing a complete site redesign. After viewing the FAQ and reading the responses here, it would appear there’s nothing to worry about… And now that I’m looking at other similar sites in the archive, it appears they are also seeing similar trends to us in the past year or two.

    So here’s to letting you know your work is valuable and appreciated. Thanks again!

  5. jimnixon Says:

    Hi there,

    I’m guessing this is the place to post technical questions, since my e-mail inquiry returned a form letter directing me to the forums, and the forums redirected me here?

    At the risk of sounding like a dime-a-dozen internet drama diva, I’m in desperate need of assistance in accessing the archives of a certain website on the Wayback Machine. There are many, many crawls listed for the site in the time frame I need, and lots of pages have been archived, apparently. But they all seem to just link to “Data Retrieval Errors”. Sometimes after repeated clicking I’m able to get through to the site (maybe one time out of fifty), but almost everything I click on just brings me to a Data Retrieval Error, from that point or from when I’m first trying to access the archive.

    The archived version of this website has information that could prove a big legal matter in my favor… so although I realize this is a free volunteer-based project, I’d sure appreciate some help. Is the Data Retrieval Error an issue that will be fixed? Do I need to provide more information? Can anyone help me?

    • gojomo Says:

      ‘Data Retrieval Errors’ for certain URLs or URL ranges are often temporary failures that should clear up in a few minutes or hours.

      For brief periods, an error message may be cached even if a forced retry would succeed. In Firefox (and perhaps other browsers), you can force a fresh (non-cached) result by holding ‘shift’ while clicking the ‘reload’ browser button.

      If they persist more than one day, they might be indicative of a deeper problem I’d be happy to investigate. You can email me more details if you’re not comfortable posting here. My email is what you might guess from my handle here and the archive.org domain.
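
      For anyone retrying from a script rather than a browser, the forced reload described above can be approximated by sending request headers that ask caches not to serve a stored copy. The following is a minimal sketch only: the helper name and the example URL are hypothetical, and whether the Wayback front-end cache honors these headers is an assumption, not something confirmed in this thread.

```python
import urllib.request


def build_uncached_request(url: str) -> urllib.request.Request:
    """Build a GET request that asks caches to revalidate instead of replaying a stored response."""
    return urllib.request.Request(
        url,
        headers={
            "Cache-Control": "no-cache",  # HTTP/1.1 directive
            "Pragma": "no-cache",         # older HTTP/1.0 equivalent
        },
    )


# Hypothetical archived-page URL, used purely for illustration.
req = build_uncached_request("http://web.archive.org/web/2010/http://example.com/")
```

      Passing `req` to `urllib.request.urlopen` would perform the actual fetch; the sketch stops short of that so it can be read and run without network access.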

  6. jeroxma1977 Says:

    Hey Gordon.

    I think there needs to be a way to remove sites from the mirror @ archive.bibalex.org.

    I’ve had my personal blog removed from the Wayback Machine; however, it still appears at the bibalex mirror. I’ve not been able to contact them. I’ve sent a few emails, but have received no reply. I’ve also sent emails from multiple addresses, in case one was blocked by a spam filter – no reply. I’ve called them by phone, but they were not able to assist me. There doesn’t seem to be a clear-cut way to remove sites from the bibalex mirror, if there is one at all. This can be problematic for webmasters like me who would opt not to have their content archived. (You can’t place a robots.txt on a public blogging site, and some don’t even support meta tags…)

    • jeroxma1977 Says:

      Also, I have a question sort of regarding the subject.

      When you guys do eventually do a new index update with Archive.Bibalex.org, will the list of ‘blocked’ sites sync?

      (So that the sites blocked from Archive’s Wayback will be current with Bibalex’s at the time of the update.)

      • jeroxma1977 Says:

        (gonna bump, I think you might have scanned over this. Thanks in advance for replying.)

  7. videogamer555 Says:

    When I type in a page I want at the Wayback Machine, it takes like ONE TO FIVE WHOLE MINUTES for the page to load! When I first started using this service, pages loaded at the normal rate they would have loaded right from the regular internet originally. So I thought this was the addition of an intentional slowing of the connection for non-members. That’s when I registered. But it still runs just as slow. I wish it would run as fast as it did when I discovered this wonderful service a year ago.

    Why the slow down? Really, what’s going on here?

    • gojomo Says:

      I can assure you there’s no intentional slowing of Wayback page display for anyone.

      Occasionally, a few of the ~30 machines serving index-lookup and content-replay duties get swamped by traffic — often by automated crawlers. When this happens, the timeouts before another machine is tried (or a temporary failure is reported) can be 30 seconds or more; if the page you’re trying to display is built of many dozens of resources (inline images, scripts, etc.), several of these can stack up, causing the worst-case loading times.

      It takes a while — too long — for the systems to recover, but usually the problem passes in hours (or at worst in cases requiring special attention, days). And, for any archived pages that become popular, a front-end cache is in place to make loads by subsequent visitors fairly fast.

      This classic Wayback setup — only changed a little over the years, and straining under the service’s popularity and size — will be replaced this year by a new architecture that should offer better performance. In the meantime, we’re sorry for the occasional delays. If you see a problem persisting for more than a day, let us know with as much detail as possible in case it’s a novel failure requiring special investigation.
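
      To make the arithmetic behind those stacked timeouts concrete, here is a rough back-of-the-envelope sketch. The 30-second timeout comes from the explanation above; the function name and the figure of 6 parallel connections per browser are assumptions chosen for illustration, not measured values.

```python
import math


def worst_case_load_seconds(resources: int, timeout_s: float = 30.0, parallel: int = 6) -> float:
    """Upper bound on page load time if every resource fetch waits out one full timeout.

    Resources are assumed to be fetched in batches of `parallel`, and each
    batch is assumed to stall for `timeout_s` before it completes or fails.
    """
    batches = math.ceil(resources / parallel)
    return batches * timeout_s


# A page with 36 inline resources: 6 batches * 30 s each
print(worst_case_load_seconds(36))  # 180.0
```

      Under these assumptions a page with a few dozen resources lands at around three minutes, which is in the same range as the one-to-five-minute loads reported above.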

      • videogamer555 Says:

        I was wondering something. In this switch over, is it possible that you guys will lose some of the sites archived? Possibly forever if the site doesn’t exist on its own anymore? Maybe file transfer could have a glitch? Maybe the index.html file for a given archived site might get lost? That’s what I’m worried about now that you talk of upgrading.

        There are a couple sites I liked that went down a while ago, and I only found them again through this web archive service you have. I really liked the sites in question, and the thought that they might get lost makes me nervous. Is it possible for you to transfer the pictures, javascript, html, other resources etc for a given site to me so I could keep a copy of it on my own computer safely until your server upgrade is complete? I’ll give you my email address:
        videogamer555@gmail.com
        send me a message to that email address, and I’ll reply with the archived site I’m interested in having a copy of on my own computer for safe keeping until your archiving site’s server is upgraded.

        Thanks in advance.

        • gojomo Says:

          The launch of new Wayback software won’t actually involve moving any of the archived content; it will be a new index tier and front-end for accessing the same permanent records.

          (Data can be lost on occasion from hardware failures and software glitches, though during physical migrations of data to new generations of hardware or new datacenters, we watch this fairly closely to minimize the incidence.)

          We don’t provide a bulk dump service. You are welcome to save off copies of pages or sites using your own tools, provided it is legal in your jurisdiction for you to do so with the content in question.

          You may also want to look at the ‘Warrick’ service/software, for reconstructing websites from public cache info (including the Internet Archive, Google cache, and other sources). See: http://warrick.cs.odu.edu/

  8. sabineeller Says:

    I deal with languages and normally maintain links to glossaries on Delicious, but more and more often I find that these sometimes very valuable glossaries simply vanish. Nobody actually has the time to copy and paste every glossary into a database on their own computer, so I believe it would be useful to be able to submit links easily while navigating the web, and then to be able to see which are “my links” (the glossaries or whatever else I passed to the Internet Archive) so that I can search them.

  9. stbalbach Says:

    Search string length appears to have a hard limit of around 1640 characters. Is there any hope of, or workaround for, increasing the length? Firefox can handle URLs over 100,000 characters, and old versions of IE 2,048.

    Here’s a sample long search string you can cut-n-paste into the search field; it will return an error because it’s slightly over 1640 characters:

    mediatype:(texts) -contributor:gutenberg AND (subject:”Stevenson, Robert Louis, 1850-1894″ OR subject:”Stevenson, R. L. (Robert Louis), 1850-1894″ OR subject:”Stevenson, Robert L. (Robert Louis), 1850-1894″ OR subject:”Stevenson, Robert Louis” OR subject:”Stevenson, R. L. (Robert Louis)” OR subject:”Stevenson, Robert L. (Robert Louis)” OR subject:”Robert Louis Stevenson” OR subject:”Robert L. Stevenson” OR subject:”R. L. Stevenson” OR creator:”Stevenson, Robert Louis, 1850-1894″ OR creator:”Stevenson, Robert Louis, Sir, 1850-1894″ OR creator:”Stevenson, R. L. (Robert Louis), 1850-1894″ OR creator:”Stevenson, Robert L. (Robert Louis), 1850-1894″ OR creator:”Stevenson, Robert Louis” OR creator:”Stevenson, R. L. (Robert Louis)” OR creator:”Stevenson, Robert L. (Robert Louis)” OR creator:”Robert Louis Stevenson” OR creator:”Robert L. Stevenson” OR creator:”R. L. Stevenson” OR title:”Robert Louis Stevenson” OR title:”Robert L. Stevenson” OR title:”R. L. Stevenson” OR description:”Robert Louis Stevenson” OR description:”Robert L. Stevenson” OR description:”R. L. Stevenson” OR description:”Stevenson, Robert Louis” OR description:”Stevenson, R. L. (Robert Louis)” OR description:”Stevenson, Robert L. (Robert Louis)” OR subject:”Chesterton, Gilbert Keith, 1874-1936″ OR subject:”Chesterton, G. K. (Gilbert Keith), 1874-1936″ OR subject:”Chesterton, Gilbert K. (Gilbert Keith), 1874-1936″ OR subject:”Chesterton, Gilbert Keith” OR subject:”Chesterton, G. K. (Gilbert Keith)” OR subject:”Chesterton, Gilbert K. (Gilbert Keith)” OR subject:”Gilbert Keith Chesterton” OR subject:”Gilbert K. Chesterton” OR subject:”G. K. Chesterton” OR creator:”Chesterton, Gilbert Keith, 1874-1936″ OR creator:”Chesterton, Gilbert Keith, Sir, 1874-1936″ OR creator:”Chesterton, G. K. (Gilbert Keith), 1874-1936″ OR creator:”Chesterton, Gilbert K. (Gilbert Keith), 1874-1936″ OR creator:”Chesterton, Gilbert Keith” OR creator:”Chesterton, G. K. (Gilbert Keith)” OR creator:”Chesterton, Gilbert K. (Gilbert Keith)” OR creator:”Gilbert Keith Chesterton” OR creator:”Gilbert K. Chesterton” OR creator:”G. K. Chesterton” OR title:”Gilbert Keith Chesterton” OR title:”Gilbert K. Chesterton” OR title:”G. K. Chesterton” OR description:”Gilbert Keith Chesterton” OR description:”Gilbert K. Chesterton” OR description:”G. K. Chesterton” OR description:”Chesterton, Gilbert Keith” OR description:”Chesterton, G. K. (Gilbert Keith)” OR description:”Chesterton, Gilbert K. (Gilbert Keith)”)

    Here is a very long search string that does work. It is 1633 characters in length; anything over about 1640 or 1650 characters stops working.

    mediatype:(texts) -contributor:gutenberg AND (subject:”Stevenson, Robert Louis, 1850-1894″ OR subject:”Stevenson, R. L. (Robert Louis), 1850-1894″ OR subject:”Stevenson, Robert L. (Robert Louis), 1850-1894″ OR subject:”Stevenson, Robert Louis” OR subject:”Stevenson, R. L. (Robert Louis)” OR subject:”Stevenson, Robert L. (Robert Louis)” OR subject:”Robert Louis Stevenson” OR subject:”Robert L. Stevenson” OR subject:”R. L. Stevenson” OR creator:”Stevenson, Robert Louis, 1850-1894″ OR creator:”Stevenson, Robert Louis, Sir, 1850-1894″ OR creator:”Stevenson, R. L. (Robert Louis), 1850-1894″ OR creator:”Stevenson, Robert L. (Robert Louis), 1850-1894″ OR creator:”Stevenson, Robert Louis” OR creator:”Stevenson, R. L. (Robert Louis)” OR creator:”Stevenson, Robert L. (Robert Louis)” OR creator:”Robert Louis Stevenson” OR creator:”Robert L. Stevenson” OR creator:”R. L. Stevenson” OR title:”Robert Louis Stevenson” OR title:”Robert L. Stevenson” OR title:”R. L. Stevenson” OR description:”Robert Louis Stevenson” OR description:”Robert L. Stevenson” OR description:”R. L. Stevenson” OR description:”Stevenson, Robert Louis” OR description:”Stevenson, R. L. (Robert Louis)” OR description:”Stevenson, Robert L. (Robert Louis)” OR subject:”Chesterton, Gilbert Keith, 1874-1936″ OR subject:”Chesterton, G. K. (Gilbert Keith), 1874-1936″ OR subject:”Chesterton, Gilbert K. (Gilbert Keith), 1874-1936″ OR subject:”Chesterton, Gilbert Keith” OR subject:”Chesterton, G. K. (Gilbert Keith)” OR subject:”Chesterton, Gilbert K. (Gilbert Keith)” OR subject:”Gilbert Keith Chesterton” OR subject:”Gilbert K. Chesterton”)

    I use scripts to automatically create search strings for use on external links at Wikipedia. Because of the wide variety of metadata on Internet Archive it requires lots of conditional statements to get an accurate search and the search string limit is presenting a problem finding all occurrences of a particular author or work.
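
    One possible workaround, offered only as a sketch, is to have the generating script split the OR-clauses across several shorter queries and merge the results on the client side. Because every clause sits inside a single OR group, the union of the smaller queries should match the same items as the one long query. The function names below are hypothetical, the 1640 default is taken from the observation above, and whether the limit applies to the raw or the URL-encoded string is an open assumption.

```python
def render(prefix: str, clauses: list[str]) -> str:
    """Join a fixed prefix and a group of OR-clauses into one search string."""
    return f"{prefix} AND ({' OR '.join(clauses)})"


def split_query(prefix: str, clauses: list[str], limit: int = 1640) -> list[str]:
    """Pack OR-clauses greedily into queries that are each at most `limit` characters.

    A single clause that is too long on its own is still emitted as its own
    query; this sketch does not try to shorten it.
    """
    queries: list[str] = []
    current: list[str] = []
    for clause in clauses:
        candidate = current + [clause]
        if current and len(render(prefix, candidate)) > limit:
            # Adding this clause would overflow, so close the current query.
            queries.append(render(prefix, current))
            current = [clause]
        else:
            current = candidate
    if current:
        queries.append(render(prefix, current))
    return queries


prefix = "mediatype:(texts) -contributor:gutenberg"
clauses = [
    'creator:"Stevenson, Robert Louis"',
    'creator:"Robert Louis Stevenson"',
    'creator:"Chesterton, Gilbert Keith"',
    'creator:"G. K. Chesterton"',
]
for query in split_query(prefix, clauses, limit=120):
    print(len(query), query)
```

    The trade-off is one request per chunk plus de-duplication of items that match clauses in more than one chunk, so this suits scripts better than a single clickable link.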

    • gojomo Says:

      stbalbach, I can’t directly address your comments, as they pertain to the collections at the Internet Archive other than the Wayback Machine. I’ve pointed the team which handles download MIME types and collection-search at your comments, and will let you know if they tell me a better place to discuss these issues.

  10. stbalbach Says:

    Archive.org is not sending the proper MIME type when downloading a DjVu file from the main work page. Instead, the MIME type is set to “text”. Thus the LizardTech DjVu plugin doesn’t recognize it as a DjVu file, as it depends on a correct MIME type to function.

    Apparently Firefox got around this problem by including a built-in DjVu viewer that uses the file extension to determine file type. However the built-in Firefox DjVu viewer is poor quality and old, the LizardTech plugin (linked above) is preferable with more features and better quality.

    Another way around the problem is by going to “HTTP:Files” off the main work page and clicking on the DjVu file there – apparently there Archive.org does send the proper MIME type (see MIME type under Tools/Page Info (Firefox)).

    I’d like to see the MIME type set to DjVu when downloading a DjVu file from the main work page, so that the LizardTech DjVu plugin viewer works properly from the main page.
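
    As an illustration of the behavior being requested, a server can pick the Content-Type from the file extension. This is a minimal sketch using Python’s standard `mimetypes` module, not a description of how Archive.org actually serves files; the function name and sample filename are hypothetical, and `image/vnd.djvu` is the IANA-registered type for DjVu.

```python
import mimetypes

# Private registry so the process-wide mimetypes tables are left untouched.
_types = mimetypes.MimeTypes()
_types.add_type("image/vnd.djvu", ".djvu")  # register DjVu explicitly


def content_type_for(filename: str) -> str:
    """Return the Content-Type to send for `filename`, falling back to a generic binary type."""
    guessed, _encoding = _types.guess_type(filename)
    return guessed or "application/octet-stream"


print(content_type_for("treasureisland.djvu"))
```

    A plugin that keys on the MIME type would then see a DjVu-specific type instead of a generic text type.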

  11. sage Says:

    A few thoughts:

    1. I’d like to see the Internet Archive accept RSS and Atom feeds for updates.

    2. I’d like to be able to add sites directly to the Internet Archive without going through Alexa or DMOZ. The DMOZ process seems pretty broken to me.

    3. I’d like to see a status message on each webpage’s archive page mentioning when it was last spidered, even if the archive isn’t available yet. I have a blog that hasn’t received a new archive page since June 2007. I have no way of knowing if it’s just pending processing, or if there is some error and archive.org isn’t seeing my newer updates.

    • gojomo Says:

      Hi, sage. Regarding (1) and (2), we’re kicking around some ideas that would allow people to directly nominate web pages into the archive for accelerated collection, but we’re not ready to announce anything yet. The news should appear at this blog first if and when something is offered.

      Regarding (3), for much of the current crawl inputs, we only know we have content when it’s indexed for public display. (Until then, it may not even be on our servers or inventoried.)

      But, for another subset, we could offer some indication like you suggest. (And a theoretical future archive-on-request service could offer a relatively quick confirmation that material was collected.) So we’ll keep this idea in mind for future software updates.

Comments are closed.

