Archive for the ‘Search’ Category

Searching the dawn of the 21st Century

October 7, 2008

What was the web of the past really like?

Last Tuesday, Google unveiled a unique new web search, 2001 Google, as part of their 10th birthday celebration.

Using an actual archived version of their search engine index from January 2001, the service answers queries more or less as Google did back then: same results, same ranking, same summary ‘snippets’.

But of course, many of those result pages have changed or disappeared entirely since then — and that’s where the Internet Archive’s Wayback Machine comes in. For many of the 2001 search results, the best or only view comes from the Wayback Machine, which Google has helpfully provided in lieu of the usual ‘cached version’ links.

The combination of authentic Google search and the Wayback’s giant web archive is more powerful than either alone: finding needles lost in the Wayback haystack, showing actual prior rankings/popularity of pages for real queries, and highlighting material that would have been lost forever without purposeful public-interest archiving.

We thank Google for this chance to work together and highlight our web archive. Google plans to leave the 2001 search up for one month, and we’ll talk more about what we’ve learned from this service in a future blog post.

In the meantime, try the 2001 Google Search!

Internet Archive at OSCON

July 24, 2008

Tomorrow, at the O’Reilly Open Source Convention in Portland, I’ll be presenting a session about our open source web archiving tools. Full details:

Build Your Own Web Archive: archive.org’s Open Source Tools to Crawl, Access & Search Web Captures
Gordon Mohr (Internet Archive, Web Group)
11:35am Friday, 07/25/2008
Web Applications
Location: E145

The Internet Archive, with support from other libraries around the world, has helped develop a collection of open source tools in Java to support web archiving. These include the Heritrix archival web crawler, “Wayback” for replaying historic web content, and extensions to Nutch for web archive full-text search. This session will explain the design and capabilities of these tools, and quickly demo their use for the creation of a small personal web archive.

Heritrix has been designed for faithful and complete content archiving but has also found use in other web search contexts. Wayback allows URL-based lookup and follow-up browsing of archived web content. Nutch, as applied to archival web crawls, allows Google-style full-text search of web content, including the same content as it changes over time. Together, they provide everything necessary to archive and access accurate historical records of web-published content.
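As a rough illustration of the URL-based lookup described above, the sketch below builds a replay address for the public Wayback Machine from an original URL and a target date. It assumes the commonly seen `web.archive.org/web/<14-digit timestamp>/<original URL>` address form; the helper name `wayback_url` is made up for this example and is not part of the Wayback software.

```python
from datetime import datetime

# Assumption: the public Wayback Machine replay prefix.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Build a replay URL asking for the capture nearest to `when`.

    Hypothetical helper for illustration only. The timestamp is
    formatted as YYYYMMDDhhmmss and placed before the original URL.
    """
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Ask for a capture of example.com from around mid-January 2001.
    print(wayback_url("http://example.com/", datetime(2001, 1, 15)))
```

If no capture exists at exactly the requested moment, the service typically redirects to the closest one it holds, so the timestamp works as a hint rather than an exact key.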

Also: last month James Turner of O’Reilly Media spoke to me in advance of OSCON. You can read or hear the interview at: Gordon Mohr Takes Us Inside the Internet Archives.

OAIster.org Now Featuring Archive-It Collections

June 4, 2007

Archive-It recently introduced an OAI-PMH metadata feed for all Archive-It collections. The feed has been submitted to the OAIster catalog and harvested, so you will now see hits from Archive-It collections in your OAIster search results.
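For readers unfamiliar with OAI-PMH, a harvester such as OAIster collects metadata by issuing plain HTTP requests with a `verb` parameter. The sketch below only builds a `ListRecords` request URL for Dublin Core (`oai_dc`) records. The base URL is a placeholder, not the actual Archive-It feed address, and the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Placeholder endpoint: the real Archive-It feed address is not given here.
EXAMPLE_BASE_URL = "https://example.org/oai"


def oai_list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL (hypothetical helper).

    `verb` and `metadataPrefix` are required by the protocol for
    ListRecords; `set` is optional and narrows the harvest to one set.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    print(oai_list_records_url(EXAMPLE_BASE_URL))
```

A real harvester would then fetch this URL, parse the XML response, and follow any `resumptionToken` it contains to page through the remaining records.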

We are also planning to add support for the SRU protocol to our search engine very soon. The Archive-It team is very excited about providing new ways for our partners and their patrons to access their Archive-It collections.

Crawl Data Delivered to Bibliothèque nationale de France

May 17, 2007

On April 10, 2007, we delivered our third annual contract crawl to the Bibliothèque nationale de France (BnF). The collections included a 2006 crawl of the .fr domain and a historical collection spanning March to June of 2005, totaling more than 324 million documents.

New to the 2006 collection was a NutchWAX full-text index of the .fr domain, representing one of the largest deployments of a searchable web archive.

The collections were delivered on a 40-node Petabox storage cluster, complementing BnF’s existing 80-node cluster, which the Web Team installed in 2005 and 2006. With this delivery, BnF now owns and operates the third-largest Petabox installation in the world (after the Internet Archive and the Library of Alexandria).

Petabox Racks in BNF Repository

Internet Archive and BNF installation/crawl team