K-12 Web Archiving Program, 2008-2009

July 7, 2009 by

The website for the first full year of our K-12 Web Archiving Program is now available online.

For the 2008/2009 school year, Internet Archive, the Library of Congress and California Digital Library collaborated on a program that explores archiving the Web from the perspective of students in elementary, middle and high school.

Using the Archive-It service, students from ten different schools selected born-digital content from the Web to create “time capsules” to represent their world. By allowing students to identify sites that will be preserved for the long term, the program gives teens and younger students a chance to identify and document their cultural history and the world that’s important to them. Unlike time capsules of tangible objects, which usually remain hidden for decades or centuries, the resulting Web collections are immediately visible and publicly accessible on the Archive-It website, with full-text search for study and analysis.

For the 2009/2010 school year we hope to broaden the program’s outreach to additional schools around the country. To get involved and/or learn more, please send us your information through this request form. Applications will be available mid to late July.

Check Us Out on Free Government Information

June 16, 2009 by

The Archive-It team is guest blogging throughout the month of June on the Free Government Information blog. Please come on over and take a look at our posts or follow our feed.

The entire Free Gov Info blog is an excellent resource for news and information. Please check it out!

University of Melbourne’s Award-Winning Web Archiving Program

June 11, 2009 by

The University of Melbourne was recently recognized for its excellent web archiving program at the Sir Rupert Hamer Records Management Awards.

Each year the Public Records Advisory Council (PRAC) of Victoria, Australia offers the Sir Rupert Hamer Records Management Awards, recognizing excellence and innovation in records management in the Victorian public sector. The Awards are named after Sir Rupert Hamer, who was the Victorian Premier when the Public Records Act was passed in 1973 and when Public Record Office Victoria opened its first office and repository in 1975.

The ceremony was held at Queens Hall at Parliament House on Thursday, 28th of May. The Web Archiving Program run by Records Services (the team included Lucinda Davies – Program Coordinator, Silvia Paparozzi – Team Member, Mahesh Sundar – Team Leader and Catherine Nicholls – Project Manager) was awarded a “Certificate of Commendation” in the large agency category.

The University of Melbourne has been an Archive-It partner since January 2008. Overall they have collected over 5 million URLs and 500 GB of data.

Please take a look at their now award-winning program, including this wonderful video (featuring puppets!!) they put together at the end of last year.

Congratulations Team Melbourne!

WARC File Format Published as an International Standard

June 3, 2009 by

An exciting announcement from the International Internet Preservation Consortium regarding the preservation file format generated using the Heritrix web crawler (used for all Archive-It and Internet Archive crawls for partners):

The International Internet Preservation Consortium is pleased to
announce the publication of the WARC file format as an international
standard: ISO 28500:2009, Information and documentation — WARC file
format.

[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]

For many years, heritage organizations have tried to find the most
appropriate ways to collect and keep track of World Wide Web material
using web-scale tools such as web crawlers. At the same time, these
organizations were concerned with the requirement to archive very large
numbers of born-digital and digitized files. A need was for a container
format that permits one file simply and safely to carry a very large
number of constituent data objects (of unrestricted type, including many
binary types) for the purpose of storage, management, and exchange.
Another requirement was that the container need only minimal knowledge
of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage
and store billions of resources collected from the web and elsewhere. It
is an extension of the ARC format
[http://www.archive.org/web/researcher/ArcFileFormat.php ], which has
been used since 1996 to store files harvested on the web. WARC format
offers new possibilities, notably the recording of HTTP request headers,
the recording of arbitrary metadata, the allocation of an identifier for
every contained file, the management of duplicates and of migrated
records, and the segmentation of the records. WARC files are intended to
store every type of digital content, either retrieved by HTTP or another
protocol.

The motivation to extend the ARC format arose from the discussion and
experiences of the International Internet Preservation Consortium [
http://netpreserve.org/ ], whose core mission is to acquire, preserve
and make accessible knowledge and information from the Internet for
future generations. IIPC Standards Working Group put forward to ISO
TC46/SC4/WG12 a draft presenting the WARC file format. The draft was
accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the
Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener,
collaborated closely with IIPC experts to improve the original draft.
The WG12 will continue to maintain [http://bibnum.bnf.fr/WARC/ ] the
standard and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the
WARC format. It will help web archiving entering into the mainstream
activities of heritage institutions and other branches, by fostering the
development of new tools and ensuring the interoperability of
collections. Several applications are already WARC compliant, such as
the Heritrix [http://crawler.archive.org/ ] crawler for harvesting, the
WARC tools [http://code.google.com/p/warc-tools/ ] for data management
and exchange, the Wayback Machine
[http://archive-access.sourceforge.net/projects/wayback/ ], NutchWAX
[http://archive-access.sourceforge.net/projects/nutch/ ] and other
search tools [http://code.google.com/p/search-tools/ ] for access. The
international recognition of the WARC format and its applicability to
every kind of digital object will provide strong incentives to use it
within and beyond the web archiving community.

A press release is available on the IIPC website:
http://netpreserve.org/press/pr20090601.php

General information about the IIPC can be found at:
http://netpreserve.org

———————–
Abbie Grotke
Library of Congress
IIPC Communications Officer
netpreserve.org
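
To make the container idea in the announcement more concrete, here is a minimal sketch of what a single WARC record looks like on disk. It assumes the WARC/1.0 layout (a version line, named header fields, a blank line, the content block, then two CRLFs) and records an HTTP request, one of the new capabilities mentioned above. The function names `build_warc_record` and `parse_warc_headers` are hypothetical helpers written for this illustration; they are not part of Heritrix or the WARC tools.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(record_type, target_uri, payload, content_type):
    """Serialize one record: version line, named fields, blank line, block, CRLF CRLF."""
    headers = [
        ("WARC-Type", record_type),
        # Every record carries its own identifier, here a random UUID URN.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # The container only needs the block length, not knowledge of its contents.
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    return head + CRLF + payload + CRLF + CRLF


def parse_warc_headers(record):
    """Split a serialized record back into (version, header fields, content block)."""
    head, _, rest = record.partition(CRLF + CRLF)
    lines = head.split(CRLF)
    version = lines[0].decode("utf-8")
    fields = dict(line.decode("utf-8").split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


# Illustrative payload: the raw HTTP request a crawler might have sent.
http_request = b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"
record = build_warc_record(
    "request", "http://example.org/", http_request,
    "application/http; msgtype=request",
)
version, fields, block = parse_warc_headers(record)
print(version, fields["WARC-Type"], len(block))
```

A real WARC file is simply many such records written one after another, usually compressed. Production tools also handle details this sketch ignores, such as header line folding and payload digests.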

Searching the dawn of the 21st Century

October 7, 2008 by

What was the web of the past really like?

Last Tuesday, Google unveiled a unique new web search, 2001 Google, as part of their 10th birthday celebration.

Using an actual archived version of their search engine index from January 2001, the service answers queries more-or-less how Google did back then — same results, same ranking, same summary ‘snippets’.

But of course, many of those result pages have changed or disappeared entirely since then — and that’s where the Internet Archive’s Wayback Machine comes in. For many of the 2001 search results, the best or only view comes from the Wayback Machine, which Google has helpfully provided in lieu of the usual ‘cached version’ links.

The combination of authentic Google search and the Wayback’s giant web archive is more powerful than either alone: finding needles lost in the Wayback haystack, showing actual prior rankings/popularity of pages for real queries, and highlighting material that would have been lost forever without purposeful public-interest archiving.

We thank Google for this chance to work together and highlight our web archive. Google plans to leave the 2001 search up for one month, and we’ll talk more about what we’ve learned from this service in a future blog post.

In the meantime, try the 2001 Google Search!

Seeking Schools for K-12 Web Archiving Program

September 11, 2008 by

Apply to be part of the Internet Archive K-12 project!

Could your school be one of 10 middle or high schools helping to capture and archive today’s primary source materials on the Web?

A small number of individuals and institutions recognize the importance of archiving and preserving the often transitory digital cultural artifacts that are distributed over the Web. But so far, the vast majority of decisions about what Web sites will live into the future have been made by adults, and reflect adults’ sensibilities about what constitutes the important stuff of history.

The Internet Archive, the Library of Congress and California Digital Library are collaborating on a project that explores archiving the Web from the perspective of adolescents.

Find a complete project description and the brief application in the “Featured Resources” section at http://www.loc.gov/teachers/. Apply by September 30 for full consideration.

A pilot of the K-12 web archiving program took place in the spring of 2008. Three high schools from across the country participated, and the resulting collections represent a broad range of interests and points of view. You can learn more about the pilot and view the collections on the Archive-It website.

Archive-It 2.10!

September 10, 2008 by

The Archive-It team is excited to announce the release of Archive-It 2.10!

Our new features include a Spanish interface to both our public site (www.archive-it.org) and the Archive-It private web application. Archive-It is reaching out to our Spanish-speaking colleagues in the United States and around the world in order to widen the scope and breadth of web archiving.

Other changes to our public website will increase the visibility and access to the broad range of collections created by Archive-It partners. We have also added a seed URL registry so the web archiving community can see exactly what Archive-It partners are collecting. This feature is a work in progress and we welcome any and all feedback!

Inside the web application, we have expanded our ability to automatically harvest YouTube videos. This upgrade is the result of working closely with the University of North Carolina on an NDIIPP-funded project to harvest election videos on the web.

We have also developed documentation and guidelines for partners to better review their collections for quality assurance. We will be offering advanced partner trainings in October or November that will help partners with QA and effective crawl scoping.

Stay tuned for our newest feature, Scope It!, to be released in November as part of our 3.0 release. Scope It! is a pre-crawl scoping tool that allows partners to select content for harvest at a more selective level.

If you would like to attend one of our twice-monthly informational webinars, send an email to archive-it at archive.org.

Library Partnership to Preserve End-of-Term Government Websites

August 21, 2008 by

The Library of Congress has formally announced a collaborative partnership with Internet Archive, the California Digital Library, the University of North Texas Libraries, and the U.S. Government Printing Office to preserve .gov websites during the upcoming presidential transition. There is a story covering the announcement in the Washington Post as well.

Internet Archive’s role in the project will be to focus on the harvesting of websites in the .gov domain using Heritrix, the open source web crawler developed at IA. The project will serve to preserve at-risk government websites that are likely to change dramatically from one administration to the next. The resulting collection will be publicly accessible starting in February 2009.

Internet Archive has played a key role in archiving past administrative transitions with the U.S. National Archives both in 2004 and with the congressional change in 2006. These past harvests are freely accessible online.

Internet Archive at OSCON

July 24, 2008 by

Tomorrow, at the O’Reilly Open Source Convention in Portland, I’ll be presenting a session about our open source web archiving tools. Full details:

Build Your Own Web Archive: archive.org’s Open Source Tools to Crawl, Access & Search Web Captures
Gordon Mohr (Internet Archive, Web Group)
11:35am Friday, 07/25/2008
Web Applications
Location: E145

The Internet Archive, with support from other libraries around the world, has helped develop a collection of open source tools in Java to support web archiving. These include the Heritrix archival web crawler, “Wayback” for replaying historic web content, and extensions to Nutch for web archive full-text search. This session will explain the design and capabilities of these tools, and quickly demo their use for the creation of a small personal web archive.

Heritrix has been designed for faithful and complete content archiving but has also found use in other web search contexts. Wayback allows URL-based lookup and follow-up browsing of archived web content. Nutch, as applied to archival web crawls, allows Google-style full-text search of web content, including the same content as it changes over time. Together, they provide everything necessary to archive and access accurate historical records of web-published content.
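
As a rough illustration of what URL-based lookup means in practice, the sketch below builds a replay address of the kind used by the public Wayback Machine: a prefix, a 14-digit capture timestamp (YYYYMMDDhhmmss), and the original URL. The helper name `wayback_url` is hypothetical, and the prefix is an assumption that matches the public service; a locally installed Wayback may be configured with a different one.

```python
from datetime import datetime

# Assumed prefix of the public Wayback Machine; a local install may differ.
WAYBACK_PREFIX = "http://web.archive.org/web/"


def wayback_url(original_url, when):
    """Return a replay address asking for the capture closest to `when`."""
    timestamp = when.strftime("%Y%m%d%H%M%S")  # 14-digit capture timestamp
    return f"{WAYBACK_PREFIX}{timestamp}/{original_url}"


print(wayback_url("http://www.archive.org/", datetime(2001, 1, 15, 12, 0, 0)))
```

Links inside a replayed page are typically rewritten into the same form, which is what makes follow-up browsing within the archive possible.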

Also: last month James Turner of O’Reilly Media spoke to me in advance of OSCON. You can read or hear the interview at: Gordon Mohr Takes Us Inside the Internet Archives.

Archiving the Web with High School students

June 6, 2008 by

Earlier this spring, Internet Archive, in cooperation with the Library of Congress and California Digital Library, coordinated a K-12 web archiving pilot in 3 high schools across the country (1 each in California, Illinois, and Louisiana). The purpose of the project was for the high school students to curate a time capsule of the Web from a student’s perspective and document how the web is being used by high school students in 2008.

The project was a great success thanks to the hard work from the participants and the students’ creativity. The collections are all live and accessible from the program web page.

Please enjoy these collections! We hope to enlarge the pilot in the fall, so please contact Archive-It at archive-it@archive.org if you know of any high school that would like to be involved.