Last Open Thread of 2010 (November-December)!

Time again for an open thread for your questions and comments about the Wayback Machine and Internet Archive web archiving!

If your comment is a question, please check the Wayback Machine Frequently-Asked-Questions (FAQ) to see if your question has already been addressed there before posting.

Also, I’m about to shut down the older forum, which has been very hard to protect against spam and hard to search or link to, so we may see an influx of new commenters here.

A few key things to note before you post:

Everything else? Fire away!

(The next new open thread should be started in January.)

10 Responses to “Last Open Thread of 2010 (November-December)!”

  1. scorpil Says:

    Hi. Thank you for a great piece of open source software.

    I am using Wayback Machine on my system. Heritrix is constantly crawling a few sites (nearly 40), and Wayback Machine with a modified user interface is used for browsing the results. But there is one little problem: WM’s internal database (Berkeley DB) goes down once in a while, so I can’t check any site with WM. I tend to think the problem is with BDB, because restoring it from a backup copy seems to repair the problem. I can also just rebuild the index, but at this point the database is pretty big (nearly 50GB), so rebuilding takes 5 days. I’ve been trying to locate the problem for a week now, but still can’t determine which part of the system is working incorrectly.

    Could you please suggest some ways to find the solution, or even suggest what the problem may be?

  2. JungleGeorge Says:

    I notice many archive pages are being redirected to flashlink.com, for example http://web.archive.org/web/20030425001945/ao.stratics.com/content/guides/general/aggroguide.shtml

    In Chrome it shows the original page for a second before sending me to http://www.flashlink.com/bf.php?s=85&h=ao.stratics.com&u=/content/guides/general/aggroguide.shtml

    Is there some setting I can set to stop this?

    • gojomo Says:

      When this happens, it is because the original site, at the viewed date, gave either a redirection response, or included in-page refresh or Javascript directives, which sent the user to another URL. Sometimes, our replay-rewriting can catch these, but not always – so you can wind up at a live current website (which may then itself bounce you someplace else) rather than the matching Wayback archived page.

      One thing that may help in some cases (though at the cost of breaking other in-page and Wayback functionality) is to turn off Javascript in your browser.

      You can also pay close attention to the raw source of the page, or the series of requests made by the browser (for example by using functionality like ‘view source’ or in-browser developer tools), and then manually craft the real target URL in the right era.

      (For example, if you try to visit http://web.archive.org/web/20030303040404/example.com and the page bounces you to http://newexample.net in the live web, you could hand-craft the URL http://web.archive.org/web/20030303040404/newexample.net to see the target URL in the target era, if any.)
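
      The hand-crafting step described above can be sketched in a few lines of Python. This is only an illustration: the helper names `wayback_url` and `retarget` are made up here, and the sketch assumes the 14-digit timestamp form (YYYYMMDDhhmmss) and the http://web.archive.org/web/ prefix seen in the URLs in this thread. It only builds the URL; whether a capture of the target exists for that era is a separate question.

```python
import re

# Prefix taken from the archived URLs quoted in this thread.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_url(timestamp: str, target: str) -> str:
    """Build a Wayback URL for `target` as of `timestamp`.

    Assumption: timestamps are 14 digits (YYYYMMDDhhmmss), as in the
    examples in this thread.
    """
    if not re.fullmatch(r"\d{14}", timestamp):
        raise ValueError(f"expected a 14-digit timestamp, got {timestamp!r}")
    # Drop the scheme so the result matches the form used in the example.
    target = re.sub(r"^https?://", "", target)
    return f"{WAYBACK_PREFIX}{timestamp}/{target}"


def retarget(archived_url: str, live_redirect: str) -> str:
    """Reuse the timestamp of `archived_url` for the URL you were bounced to."""
    match = re.match(r"^https?://web\.archive\.org/web/(\d{14})/", archived_url)
    if match is None:
        raise ValueError(f"not a recognised Wayback URL: {archived_url!r}")
    return wayback_url(match.group(1), live_redirect)


if __name__ == "__main__":
    print(retarget(
        "http://web.archive.org/web/20030303040404/example.com",
        "http://newexample.net",
    ))
```

      Pasting the printed URL into the browser is the same as editing the address bar by hand; the script just saves retyping the timestamp.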

      Future revisions to the Wayback Machine will somewhat expand the cases that can be handled automatically – and make it more obvious when you’ve been bounced – but some cases will continue to slip through.

      – Gordon @ IA

  3. cooking11 Says:

    Hi,

    It’s pretty quiet in here. Hope everyone is having a good holiday.

    I have a question related to robots.txt exclusion. If we are moving some sites to parked pages that do not allow robots.txt, is it possible to permanently remove the archived content for the domain? As of now, removing the robots.txt makes the content reappear.

    Thank you. Merry Christmas!

  4. yahudeejay Says:

    SAME IMPORTANT QUESTION AGAIN:

    How do I get the newest results from your WAYBACK machine for my website [URL]?
    I’m looking particularly for June 30 – September 30, 2009 – BUT NOT VISIBLE.

    The last I can see is:
    Aug 22, 2008

    Yahu Pawul
    editor for [URL]

    YOUR ANSWER:

    —– Original Message —–
    From: “Internet Archive”
    To: “YAHUDEEJAY”
    Sent: Monday, July 19, 2010 11:42 PM
    Subject: Re: ask

    We are currently in the process of updating the Wayback Machine to new
    interface. It is expected to go live at some point this summer/fall,
    whereupon, updates will begin for a great deal of records from
    2008-2009. It is likely that if a site was being crawled before, it will
    be included in those updates, though we can’t say for certain.

    Regards,
    The Internet Archive Team

    YOUR ANSWER ON YOUR FORUM:
    http://www.archive.org/iathreads/post-view.php?id=303838 .

    • gojomo Says:

      Unfortunately, the answer is the same, but the rollout has continued to experience delays. I’m sorry I can’t be more specific; the only estimate I’ve heard is ‘soon’.

      – Gordon @ IA

  5. P.Q.R. Theorist Says:

    Hi Gordon, congratulations on the great work you are doing. Do you have any information as to when we can expect to see more of the content archived since mid-2008 becoming accessible on the Wayback Machine?

    • gojomo Says:

      It’s long overdue, and being actively worked on, but the only estimate I can relay is ‘soon’. Sorry I can’t provide a more specific or encouraging answer.

      – Gordon @ IA

Comments are closed.