Wayback Machine & Web Archiving Open Thread, September 2010

Time for another open thread!

What do you want to know about the Wayback Machine and Internet Archive web archive? Do you have problems, concerns, suggestions? This is the place!

If your comment is a question, please check the Wayback Machine Frequently Asked Questions (FAQ) to see if your question has already been addressed before posting.

A few other things to note before posting:

Everything else? Fire away!

21 Responses to “Wayback Machine & Web Archiving Open Thread, September 2010”

  1. naesten Says:

    I was just wondering why you try to append your URL-rewriting script to XML documents. For example, at the moment, http://web.archive.org/web/20091027213704/http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul comes up with a dreaded “yellow page of doom” in Mozilla-based browsers, because appending the script results in malformed XML (that is, text that is *not* XML at all).

    • gojomo Says:

      It’s just a bug that will be fixed in a future version of the Wayback Machine. In the meantime, sorry!

  2. bitingsparrow Says:

    First off, thank you so much for the wayback machine!

    We’d like to use the JSON service (http://www.archive.org/help/json.php) but cannot for the life of us figure out how to specify a URL using the advanced search form (http://www.archive.org/advancedsearch.php#raw).

    For instance, searching for wikipedia.org using the main search form returns the expected results (http://web.archive.org/web/*/wikipedia.org). When you go to the advanced search form, there is no way to specify that wikipedia.org is the URL we are interested in. Using the “URL” field from the “Custom Fields” drop-down yields no results.

    By extension, using the following query:
    http://www.archive.org/advancedsearch.php?q=url:(http://www.wikipedia.org)&fl%5B%5D=title&rows=1&output=json

    yields no results.

    The reason we want to use the JSON service is that we expect it to be fast and to impose less strain on the archive.org servers, and, above all, because we are really only interested in the title associated with the URL. If there is no easy JSON way to accomplish this, we’d have to resort to fetching http://web.archive.org/web/wikipedia.org and parsing the “Title” out of the HTML response.
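
    Something like this rough Python sketch is what that fallback would look like (untested, and assuming the Wayback replay page keeps the original page's <title> element intact):

    import re
    import urllib.request

    def wayback_title(url):
        # http://web.archive.org/web/<url> redirects to the most recent capture,
        # so fetch that and pull the <title> out of the returned HTML.
        with urllib.request.urlopen("http://web.archive.org/web/" + url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else None

    print(wayback_title("wikipedia.org"))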

    • redth Says:

      Agreed… This is basically what I’m after with my previous comment. I want to be able to search by URL on the advanced search, although in my case I want anything in a given domain, using wildcards…

    • gojomo Says:

      The site search and json/’advancedsearch’ interfaces are for ‘items’ in our other media collections — like books, audio, movies, etc.

      They don’t include the web collection’s sites or individual URL/captures.

      You must know a starting URL to view web collection content. The only sort of open-ended search available is that adding a trailing ‘*’ to a URL will show a list of URLs that start with the same string. For example:

      http://web.archive.org/web/*/en.wikipedia.org/wiki/App*

      (Only trailing ‘*’ will work.)
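
      If you want to consume that listing programmatically, a rough sketch along these lines may help (illustrative only; the regex is a guess at the listing markup, which could change):

      import re
      import urllib.request

      def archived_urls_with_prefix(prefix):
          # A trailing '*' asks the Wayback Machine to list every archived URL
          # that starts with the given string.
          listing = urllib.request.urlopen(
              "http://web.archive.org/web/*/" + prefix + "*").read().decode(
              "utf-8", errors="replace")
          # Pull the original URLs out of the rewritten /web/<date>/<url> links.
          return sorted(set(re.findall(r'/web/(?:\*|\d{14}[a-z_]*)/([^"\'>\s]+)', listing)))

      for url in archived_urls_with_prefix("en.wikipedia.org/wiki/App"):
          print(url)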

  3. redth Says:

    Do you have an API?

    I’d like to use an API to find out the oldest date for any pages archived under a domain or any of its subdomains. So I want to know the oldest date you have a page archived for, say, http://mydomain.com/* or http://*.mydomain.com/*

    • gojomo Says:

      See my response to bitingsparrow’s later comment; only a trailing ‘*’ works, to list other URLs that begin with the same string. Other queries aren’t possible.

      API access is a wishlist item for the redeployment of the Wayback on new software, but the first version of such an API would just offer easier-to-parse results similar to traditional views — by exact URL or URL prefix. Building the deeper indexes for things like finding URLs by fragments (or internal wildcards) would be a separate later project.
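
      In the meantime, the closest you can get to “oldest capture date for an exact URL” is scraping the capture-list page yourself. A rough sketch (it leans on the 14-digit YYYYMMDDhhmmss timestamps in Wayback links, and the page layout could change, so treat it as illustrative):

      import re
      import urllib.request
      from datetime import datetime

      def oldest_capture(url):
          # The '*' date wildcard returns the list of captures for one exact URL.
          listing = urllib.request.urlopen(
              "http://web.archive.org/web/*/" + url).read().decode(
              "utf-8", errors="replace")
          # Capture links embed 14-digit timestamps; the smallest is the earliest.
          stamps = re.findall(r"/web/(\d{14})", listing)
          return datetime.strptime(min(stamps), "%Y%m%d%H%M%S") if stamps else None

      print(oldest_capture("http://mydomain.com/"))  # placeholder domain from the question above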

      • redth Says:

        Thanks gojomo

        It would be nice to have an API, even if it doesn’t do internal wildcard matching at first. I can’t help but think it would put less strain on the servers, too, by returning a simple JSON set instead of a full HTML document.

        Right now I’m finding the service to be very unreliable. Sometimes I’ll get an HTTP response error; other times it works fine. I’d imagine this is due to heavy load on the web server(s).

        Glad to hear this is a possibility in the future anyway; I’ll keep watching!

  4. pkosanke Says:

    I’ve viewed our old website many times, but today I’m getting the message, “Publisher could not be verified,” and am not “taken back.” Is there a way around this? Thanks (love your site).

    • gojomo Says:

      I’m not familiar with that message, and would need more details about the URL(s) affected to investigate.

      • pkosanke Says:

        The website is [checked offthread]. After posting my inquiry, I tried it again and got through, but today (11/24) I’m again having a problem. This time I’m asked if I want to save the file or find a program to open it, and neither works of course.

        • gojomo Says:

          I tried the URL you suggested at a variety of dates — at least one for each year we’ve archived — and did not see a similar prompt/error. Almost every page displayed.

          I did get one ‘wbcgi’ message, which indicates one of the 20+ backend servers is having temporary difficulties that usually clear up in an hour to a day. Perhaps with your software, this same error is triggering a different message? Other than trying again later, forcing a total browser refresh — with the ‘reload’/’refresh’ button, possibly even holding down the shift key while pressing it — can sometimes cause the error to be discarded by our caches and the real content to be returned sooner.

          If the problem persists, please feel free to forward an exact full URL from the web address bar (including the date) for further investigation.

  5. onlineeducationprogram Says:

    If a site is going to go away or get shut down, where do we send the link so it can be crawled and archived? We have clients who usually last about a year.

    • gojomo Says:

      We don’t have a formal or automatic way to nominate sites, but if the site is notable and you know it will soon disappear, you can recommend it via one of these ‘open threads’. (Even if we trap or edit the comment as possible URL-spam, we will see the original comment and URL.)

      Please give as much advance warning as possible and make note of any definitive last-chance dates. Note also that collecting a site at a polite pace can take days, weeks, or longer, and we can’t guarantee any site will be archived at all, or on any specific timeframe.

  6. myndzi Says:

    I have seen a trend that is bothering me in a few sites I tried to look up over the past months. It seems that the robots.txt exclusion is causing the Wayback Machine to prevent access to old archived copies of sites (that were crawled when robots.txt did not disallow it).

    Lots of domain campers / holding pages are blocking archive.org (or, more likely, all robots) from crawling them.

    What this amounts to is that archive.org is starting to deny access to sites that no longer exist – exactly the sites people are so often trying to access – and for no good reason that I can see! (None of the ones I have encountered appear to be requested removals, just domains that have dropped and been picked up by some camper.)

    I sincerely hope the content from such sites hasn’t been deleted, but even if it has, is there any hope of getting this problem remedied? If not, I fear many archives will disappear off the face of the Wayback Machine as time moves on…

    • gojomo Says:

      We’re aware of the problem. A big challenge in addressing it is that the patchwork of domain-name registrars, private-registrations, domain-sales, and shared-responsibility servers means there’s no reliable single source for ascertaining the ‘title’ to a website (or portion of a website) over time.

      Trusting the current robots.txt as a retroactive indicator of the desires of a website’s ‘owner’ through all time has been a necessary simplifying (and staff-time-saving) assumption. It’s reasonably well grounded in the practice of other web services (like search engines) which use either robots.txt or the presence of other files to prove webmaster intent. But this assumption worked better when the history of the web was shorter and fewer domains had changed hands. Now, as you’ve noticed, many people have found their own content, which they really *want* in the Wayback, blocked by the inadvertent rules of a later domain owner.

      The new Wayback software, due to be rolled out to the public worldwide Wayback Machine later this year, offers us some new options for more fine-grained control by era — but it will still be a while before we figure out the proper policies (both technical and procedural) to enable this in the main public archive. We may need to consider new adjuncts to the ‘robots.txt’ standard, or take guidance from customary DMCA practices, to determine how someone may request removal or restoration of some-but-not-all periods of material.

      • myndzi Says:

        Thanks for the thoughtful reply.

        It seems to me that the simplest solution would be to block archives only from the time the robots.txt comes into existence onward. After all, you already have a mechanism in place to allow users to request that their site archive be removed. The way I understand it, robots.txt is meant to control the behavior of web-crawling bots – that is, active consumption of bandwidth and server resources – so that site owners can politely request that automated programs leave them be. That doesn’t seem to be the same concept as a request not to mirror or archive the content, for which no particular solution exists to my knowledge. I am not familiar with the standards (such as they are) for robots.txt content, though.

        Another possibility would be to retrieve domain registration information along with crawling the site (do you already do that?), and block archive access only for the “term” of the current domain registration entry. (A new owner would then mean a new registration record / origin date, so it would be easy to sort that out as a different “site” from the previous one.)

        Lastly, I suppose you could write some moderately complicated code to compare the actual content to see whether it has changed drastically, in order to determine whether or not it is a “new” site.

        Of all of these, I prefer the first (simply don’t crawl, but don’t prevent archive access to previous crawls) and don’t see that it goes against the intent of robots.txt’s existence, either.
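
        To make that first option concrete, here is a toy sketch of what date-scoped blocking could mean (purely illustrative, nothing to do with the Archive's real code; the dates are made up):

        from datetime import datetime

        def viewable_captures(capture_dates, block_first_seen):
            # Keep captures made before the blocking robots.txt was first observed;
            # only captures from that point onward are hidden from playback.
            return [d for d in capture_dates if d < block_first_seen]

        captures = [datetime(2000, 5, 1), datetime(2003, 7, 12), datetime(2009, 1, 3)]
        block_first_seen = datetime(2008, 6, 1)  # hypothetical date the disallow appeared
        print(viewable_captures(captures, block_first_seen))  # the 2000 and 2003 captures stay visible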

        Anyway, I am glad that it is a known problem and there are plans to deal with it. I am grateful for the service and have found it quite useful on many occasions!

      • William J. Croft Says:

        Gordon, hi. I would have to agree with myndzi’s comment:

        “Of all of these, I prefer the first (simply don’t crawl, but don’t prevent archive access to previous crawls) and I don’t see that it goes against the intent of robots.txt’s existence, either.”

        This sounds like the type of software fix that could be done in a very short amount of time. (One day?) Please consider it. I’m sure many others are noticing this disruption in access to their prior content.

        And given that the vast majority of these domain-name “campers” are just mindless robotic attempts to capitalize on prior domain names, I consider this spam-like and disgraceful. Not the kind of “personal choice” you want to be promoting.

        Regards,

        William Croft

      • naesten Says:

        The quickest fix might be to not actually delete the content because of robots.txt, but merely render it inaccessible without administrator intervention (e.g. by renaming it or moving it to a different directory tree).

        (Then again, that might not be a terribly quick fix, depending on how the storage is organized.)

        This wouldn’t be a complete fix, but at least would prevent any permanent damage while you figure out your actual policy.

  7. esolutions2 Says:

    We need some guidance. After opening a page from a site archive that we were interested in, the entire page was covered with text that said “Huge Domains.” Is there a way to stop this? Is it a virus? It makes it impossible to actually view the archived page.

    Any help would be appreciated.

    • gojomo Says:

      Without the http address (URL), I’m only guessing what might be happening. But there are two main possibilities:

      (1) that’s how the site actually looked at a prior date;

      (2) the appearance is an artifact of the site changing design or ownership over time, or including design elements that the Wayback Machine cannot effectively ‘replay’ from our archive.

      The second is more likely, but in defense of the first possibility: there are truly broken and bizarrely unreadable webpages of every imaginable kind somewhere in our giant archive!

      Here comes a lot of inside detail, which I hope will be useful to casual web users and descriptive enough for web experts:

      Regarding the second possibility: there could be a background image, or script/frame fetch, or other bit of active content (Flash/Java/etc.), that is coming from a different era or different site than the primary/original content of interest. There are two primary ways this can happen:

      (1) ‘time skew’

      Even on a frequently and deeply-collected site, all the different independent resources that make up one ‘page’ may be collected hours, days, weeks, or months apart. The site could change design or ownership over that time, and so the page you see may be a mosaic of disparate elements. Less-frequently or less-deeply collected sites, or sites that could not be fully collected due to technical limitations (such as crawler-blocking robots.txt), could be assembling resources together from years apart — the Wayback is always doing a ‘nearest’ date match, but ‘nearest’ could be years earlier or later.

      (As an extreme but plausible example: a page in year 2000 might have tried to use a background image, but had a robots.txt that prohibited crawlers from reading the ‘images/background.jpg’ resource, so we have no copies of it and can never render the page perfectly. In 2003, the site changes hands, removes the robots.txt, and — by coincidence or design — starts serving a different image under the same name. Now, it can be archived. Finally, in 2010, trying to view the year 2000 page will find the 2003 image as the nearest-match for the needed background, resulting in a mixed display. If the 2003 image was a bit of advertising from a domain-name-sales outfit, it might result in something like what you’ve described.)

      (2) ‘replay leaking to live web’

      Sometimes when viewing pages in the Wayback, some of what you’re seeing is coming from the ‘live web’ of right now, rather than the archive. This, too, can happen for two reasons:

      (a) intentionally, by design — when we see a request to display an exact dated version of a URL we have no copies of, our servers make a quick attempt to get the current version, and will then show that if possible. (It is, after all, the nearest date we have.) These fetches will eventually become part of the archive as well, so to some extent browsing recent pages in the Wayback helps patch holes in the archive where the original resources still exist.

      (b) unintentionally, because of technical limitations — some active content (most commonly interactive Adobe Flash, Java, and JavaScript) loads web resources in a manner that is hard for us to intercept and direct back to the archive. As a result, a page you’re viewing from the archive can sometimes request, and display, some resources from the live current web. In some cases, this content can then redirect your browser to another page outside our archive entirely — so always pay close attention to the browser address bar, with the ‘http’ address, to confirm you’re viewing the pages at our archive that you intended.

      What can be done?

      Well, to fully understand where every element of a composite page is coming from — which era, and which websites — you have to look ‘behind the curtain’ at the details of the HTTP traffic your web browser sends over the net. A little bit of this is possible by right-clicking individual page images for more info or single-resource view options, or by looking at the page’s ‘view source’ and visiting each address individually. To do it really well takes a web developer’s debugging tools, like Firebug on Firefox or comparable tools in other browsers. These tools, or hand-editing, can also be used to strip inadvertently included elements from a page, making it more readable or marginally more likely to reflect its original appearance on a particular date.
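
      For the programmatically inclined, a quick way to get a similar overview is to scan the archived page's HTML for its rewritten resource links. A rough sketch, assuming the usual /web/<14-digit date>/<original URL> rewriting (resources loaded at runtime by script or Flash won't show up here, which is exactly the live-web-leak case above):

      import re
      import urllib.request
      from urllib.parse import urlparse

      def capture_sources(wayback_page_url):
          # List the (timestamp, host) pairs an archived page pulls resources from,
          # to spot elements coming from a different era or a different site.
          html = urllib.request.urlopen(wayback_page_url).read().decode(
              "utf-8", errors="replace")
          refs = re.findall(r'/web/(\d{14})[a-z_]*/(https?://[^"\'\s>]+)', html)
          return sorted({(stamp, urlparse(url).netloc) for stamp, url in refs})

      # Substitute the replay URL of the page you are inspecting:
      for stamp, host in capture_sources(
              "http://web.archive.org/web/20091027213704/http://www.mozilla.org/"):
          print(stamp, host)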

      I know that’s a lot to digest — but I hope it sheds some light on what’s happening when the Wayback shows confusing/mixed content.

      - Gordon @ IA


