Archive for May, 2007

Deadline for 2 Billion Page Crawl is Tomorrow!

May 17, 2007

Friday, May 18 (tomorrow) is the last day to submit URLs to the upcoming 2 Billion Page web crawl. Please be sure to submit your seeds by the end of the day.

Write to aroundtheworld at for a submission login or for more information.

Thank you for your participation in this historic crawl!


Crawl Data Delivered to Bibliothèque nationale de France

May 17, 2007

On April 10, 2007, we delivered our third annual contract crawl to the Bibliothèque nationale de France (BnF). The collections included a 2006 crawl of the .fr domain and a historical collection spanning March to June of 2005, totaling more than 324 million documents.

New to the 2006 collection was a NutchWAX full-text index of the .fr domain, representing one of the largest deployments of a searchable web archive.

The collections were delivered on a 40-node Petabox storage cluster, complementing BnF’s existing 80-node cluster, which the Web Team installed in 2005 and 2006. With this delivery, BnF now owns and operates the third largest Petabox installation in the world (after the Internet Archive and the Library of Alexandria).

Petabox Racks in BnF Repository

Internet Archive and BnF installation/crawl team