Archive for June, 2009

Check Us Out on Free Government Information

June 16, 2009

The Archive-It team are guest blogging throughout the month of June on the Free Government Information blog.  Please come on over and take a look at our posts or follow our feed

The entire Free Gov Info blog is an excellent resource for news and information, please check it out!

University of Melbourne’s Award Winning Web Archiving Program

June 11, 2009

The University of Melbourne  were recently recognized for their excellent web archiving program at the Sir Rupert Hamer Records Management Awards.

Each year the Public Records Advisory Council (PRAC) of Victoria, Australia offers the Sir Rupert Hamer Records Management Awards, recognizing excellence and innovation in records management in the Victorian public sector.  The Awards are named after Sir Rupert Hamer who was the Victorian Premier when the Public Records Act was passed in 1973 and when Public Record Office Victoria opened its first office and repository in 1975. 

The ceremony was held at Queens Hall at Parliament House on Thursday 28th of May.  The Web Archiving Program run by Records Services (team included Lucinda Davies – Ptrogram Coordinator, Silvia Paparozzi – Team Member, Mahesh Sundar – Team leader and Catherine Nicholls – Project Manager,) was awarded a “Certificate of Commendation” in the large agency category.  

The University of Melbourne has been an Archive-It partner since January 2008.  Overall they have collected over 5 million URLs and 500 gb of data.

Please take a look at their now award winning program including this wonderful video (featuring puppets!!) they put together at the end of last year.

Congratulations Team Melbourne!  

 

 

 

 

 

Each year the Public Records Advisory Council (PRAC) of Victoria,
Australia, offers the Sir Rupert Hamer Records Management Awards,
recognising excellence and innovation in records management in the
Victorian public sector. The Awards are named after Sir Rupert Hamer who
was the Victorian Premier when the Public Records Act was passed in 1973
and when Public Record Office Victoria opened its first office and
repository in 1975.
The ceremony was held at Queens Hall at Parliament House on Thursday
28th May. The Web Archiving Program run by Records Services (team
included Lucinda Davies – Program Coordinator, Silvia Paparozzi – Team
Member, Mahesh Sundar – Team Leader and me – Project Manager,) was
awarded a “Certificate of Commendation” in the large agency category.

WARC File Format Published as an International Standard

June 3, 2009

An exciting announcement from the International Internet Preservation Consortium regarding the preservation file format generated using the Heritrix web crawler (used for all Archive-It and Internet Archive crawls for partners):

The International Internet Preservation Consortium is pleased to
announce the publication of the WARC file format as an international
standard: ISO 28500:2009, Information and documentation — WARC file
format.
[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]
For many years, heritage organizations have tried to find the most
appropriate ways to collect and keep track of World Wide Web material
using web-scale tools such as web crawlers. At the same time, these
organizations were concerned with the requirement to archive very large
numbers of born-digital and digitized files. A need was for a container
format that permits one file simply and safely to carry a very large
number of constituent data objects (of unrestricted type, including many
binary types) for the purpose of storage, management, and exchange.
Another requirement was that the container need only minimal knowledge
of the nature of the objects.
The WARC format is expected to be a standard way to structure, manage
and store billions of resources collected from the web and elsewhere. It
is an extension of the ARC format
[http://www.archive.org/web/researcher/ArcFileFormat.php ], which has
been used since 1996 to store files harvested on the web. WARC format
offers new possibilities, notably the recording of HTTP request headers,
the recording of arbitrary metadata, the allocation of an identifier for
every contained file, the management of duplicates and of migrated
records, and the segmentation of the records. WARC files are intended to
store every type of digital content, either retrieved by HTTP or another
protocol.
The motivation to extend the ARC format arose from the discussion and
experiences of the International Internet Preservation Consortium [
http://netpreserve.org/ ], whose core mission is to acquire, preserve
and make accessible knowledge and information from the Internet for
future generations. IIPC Standards Working Group put forward to ISO
TC46/SC4/WG12 a draft presenting the WARC file format. The draft was
accepted as a new Work Item by ISO in May 2005.
Over a period of four years, the ISO working group, with the
Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener,
collaborated closely with IIPC experts to improve the original draft.
The WG12 will continue to maintain [http://bibnum.bnf.fr/WARC/ ] the
standard and prepare its future revision.
Standardization offers a guarantee of durability and evolution for the
WARC format. It will help web archiving entering into the mainstream
activities of heritage institutions and other branches, by fostering the
development of new tools and ensuring the interoperability of
collections. Several applications are already WARC compliant, such as
the Heritrix [http://crawler.archive.org/ ] crawler for harvesting, the
WARC tools [http://code.google.com/p/warc-tools/ ] for data management
and exchange, the Wayback Machine
[http://archive-access.sourceforge.net/projects/wayback/ ], NutchWAX
[http://archive-access.sourceforge.net/projects/nutch/ ] and other
search tools [http://code.google.com/p/search-tools/ ] for access. The
international recognition of the WARC format and its applicability to
every kind of digital object will provide strong incentives to use it
within and beyond the web archiving community.
A press release is available on the IIPC website:
General information about the IIPC can be found at:
———————–
Abbie Grotke
Library of Congress
IIPC Communications Officer
netpreserve.org

The International Internet Preservation Consortium is pleased to
announce the publication of the WARC file format as an international
standard: ISO 28500:2009, Information and documentation — WARC file
format.

[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]

For many years, heritage organizations have tried to find the most
appropriate ways to collect and keep track of World Wide Web material
using web-scale tools such as web crawlers. At the same time, these
organizations were concerned with the requirement to archive very large
numbers of born-digital and digitized files. A need was for a container
format that permits one file simply and safely to carry a very large
number of constituent data objects (of unrestricted type, including many
binary types) for the purpose of storage, management, and exchange.
Another requirement was that the container need only minimal knowledge
of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage
and store billions of resources collected from the web and elsewhere. It
is an extension of the ARC format
[http://www.archive.org/web/researcher/ArcFileFormat.php ], which has
been used since 1996 to store files harvested on the web. WARC format
offers new possibilities, notably the recording of HTTP request headers,
the recording of arbitrary metadata, the allocation of an identifier for
every contained file, the management of duplicates and of migrated
records, and the segmentation of the records. WARC files are intended to
store every type of digital content, either retrieved by HTTP or another
protocol.

The motivation to extend the ARC format arose from the discussion and
experiences of the International Internet Preservation Consortium [
http://netpreserve.org/ ], whose core mission is to acquire, preserve
and make accessible knowledge and information from the Internet for
future generations. IIPC Standards Working Group put forward to ISO
TC46/SC4/WG12 a draft presenting the WARC file format. The draft was
accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the
Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener,
collaborated closely with IIPC experts to improve the original draft.
The WG12 will continue to maintain [http://bibnum.bnf.fr/WARC/ ] the
standard and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the
WARC format. It will help web archiving entering into the mainstream
activities of heritage institutions and other branches, by fostering the
development of new tools and ensuring the interoperability of
collections. Several applications are already WARC compliant, such as
the Heritrix [http://crawler.archive.org/ ] crawler for harvesting, the
WARC tools [http://code.google.com/p/warc-tools/ ] for data management
and exchange, the Wayback Machine
[http://archive-access.sourceforge.net/projects/wayback/ ], NutchWAX
[http://archive-access.sourceforge.net/projects/nutch/ ] and other
search tools [http://code.google.com/p/search-tools/ ] for access. The
international recognition of the WARC format and its applicability to
every kind of digital object will provide strong incentives to use it
within and beyond the web archiving community.

A press release is available on the IIPC website:
http://netpreserve.org/press/pr20090601.php

General information about the IIPC can be found at:
http://netpreserve.org

———————–
Abbie Grotke
Library of Congress
IIPC Communications Officer
netpreserve.org