Archive for the ‘Digital Stewardship’ Category

Archive-It Partnership with LOCKSS

December 17, 2009

We should have announced this back in July, but we are still just as excited about it 6 months later, so we wanted to be sure we got the word out. We are pleased to announce that data harvested through the Archive-It service was successfully re-harvested into a LOCKSS network for preservation. The transfer was part of an Andrew W. Mellon Foundation project with the University of Rochester.

If you are interested in learning more about the Archive-It/LOCKSS partnership, please contact the LOCKSS team at lockss-support (at) lockss (dot) org or the Archive-It team (http://www.archive-it.org/public/contact-us).

The Archive-It team would like to partner with additional preservation systems and needs to hear from our partners. If your institution is interested in participating in a pilot for the preservation system you use, please contact the Archive-It team and let us know. We have done a pilot with iRODS and are in the middle of a test with CONTENTdm.

Alaska State Library Archiving Governor Palin’s End of Term Website

July 28, 2009

The Alaska State Library’s collection Alaska Governor/Lt. Governor Web Sites was originally conceived to archive these government websites over time. Alaska Governor Sarah Palin’s resignation announcement earlier this month and the transition of power to Lieutenant Governor Sean Parnell this past Sunday, July 26, 2009, gave the Alaska State Library a great chance to use the crawl-on-demand feature of Archive-It to preserve information on the announcement and the end of Governor Palin’s term.

By crawling Governor Palin’s and Lt. Governor Parnell’s websites on the eve of the transition of power, the Alaska State Library was able to capture information that is now offline. Once Sarah Palin left office, the governor’s website changed to reflect Sean Parnell as governor, and the lieutenant governor’s website changed to reflect Craig Campbell as lieutenant governor. The information from former Governor Palin’s website, as well as speeches and press releases from Sean Parnell’s time as lieutenant governor, is no longer available on the live web. The foresight of the staff of the Alaska State Library and on-demand crawling through Archive-It made it possible to preserve the final changes to these websites before they were taken offline.

Join the K-12 Web Archiving Program!

July 22, 2009

Apply to be part of the Internet Archive K-12 program, and your school can help to capture and archive today’s primary source materials on the Web. 

A growing number of individuals and institutions recognize the importance of archiving and preserving the often transitory digital cultural artifacts that are distributed over the Web. But so far, the vast majority of decisions about what Web sites will live into the future have been made by adults, and reflect adults’ sensibilities about what constitutes the important records of history. We want and need to hear from students. 

The Internet Archive, the Library of Congress and California Digital Library collaborated on a pilot in the spring of 2008 and a full-year program for the 2008/2009 school year, working with a total of 10 elementary, middle and high schools. We are looking to expand this program to new schools in the coming year. You can explore the collections created during the 2008/2009 school year on the Archive-It website at: http://www.archive-it.org/k12/

Find a complete project description and the brief application here: http://www.loc.gov/teachers/newsevents/news/ Apply by August 14 for full consideration.

Archive-It and LOCKSS Interoperability!

July 21, 2009

The Archive-It team is excited to announce that Archive-It data was successfully transferred from the Internet Archive data center into the LOCKSS network. The transfer was part of an Andrew W. Mellon Foundation project with the University of Rochester.

We are excited to be able to provide these and other preservation options to Archive-It partners as we increase the interoperability of the Archive-It service. If you are interested in learning more, please contact the Archive-It team. More information about the LOCKSS system can be found at www.lockss.org.

K-12 Web Archiving Program, 2008-2009

July 7, 2009

The website for the first full year of our K-12 Web Archiving Program is now available online.

For the 2008/2009 school year, Internet Archive, the Library of Congress and California Digital Library collaborated on a program that explores archiving the Web from the perspective of students in elementary, middle and high school.

Using the Archive-It service, students from ten different schools selected born-digital content from the Web to create “time capsules” to represent their world. By allowing students to identify sites that will be preserved for the long term, the program gives teens and younger students a chance to identify and document their cultural history and the world that’s important to them. Unlike time capsules of tangible objects, which usually remain hidden for decades or centuries, the resulting Web collections are immediately visible and publicly accessible on the Archive-It website, with full-text search for study and analysis.

For the 2009/2010 school year we hope to broaden the program’s outreach to additional schools around the country. To get involved and/or learn more, please send us your information through this request form. Applications will be available mid to late July.

University of Melbourne’s Award Winning Web Archiving Program

June 11, 2009

The University of Melbourne was recently recognized for its excellent web archiving program at the Sir Rupert Hamer Records Management Awards.

Each year the Public Records Advisory Council (PRAC) of Victoria, Australia offers the Sir Rupert Hamer Records Management Awards, recognizing excellence and innovation in records management in the Victorian public sector.  The Awards are named after Sir Rupert Hamer who was the Victorian Premier when the Public Records Act was passed in 1973 and when Public Record Office Victoria opened its first office and repository in 1975. 

The ceremony was held at Queens Hall at Parliament House on Thursday, 28th of May. The Web Archiving Program run by Records Services (team included Lucinda Davies – Program Coordinator, Silvia Paparozzi – Team Member, Mahesh Sundar – Team Leader and Catherine Nicholls – Project Manager) was awarded a “Certificate of Commendation” in the large agency category.

The University of Melbourne has been an Archive-It partner since January 2008. Overall they have collected over 5 million URLs and 500 GB of data.

Please take a look at their now award-winning program, including this wonderful video (featuring puppets!!) they put together at the end of last year.

Congratulations Team Melbourne!

WARC File Format Published as an International Standard

June 3, 2009

An exciting announcement from the International Internet Preservation Consortium regarding the preservation file format generated using the Heritrix web crawler (used for all Archive-It and Internet Archive crawls for partners):

The International Internet Preservation Consortium is pleased to
announce the publication of the WARC file format as an international
standard: ISO 28500:2009, Information and documentation — WARC file
format.

[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]

For many years, heritage organizations have tried to find the most
appropriate ways to collect and keep track of World Wide Web material
using web-scale tools such as web crawlers. At the same time, these
organizations were concerned with the requirement to archive very large
numbers of born-digital and digitized files. A need was for a container
format that permits one file simply and safely to carry a very large
number of constituent data objects (of unrestricted type, including many
binary types) for the purpose of storage, management, and exchange.
Another requirement was that the container need only minimal knowledge
of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage
and store billions of resources collected from the web and elsewhere. It
is an extension of the ARC format
[http://www.archive.org/web/researcher/ArcFileFormat.php ], which has
been used since 1996 to store files harvested on the web. WARC format
offers new possibilities, notably the recording of HTTP request headers,
the recording of arbitrary metadata, the allocation of an identifier for
every contained file, the management of duplicates and of migrated
records, and the segmentation of the records. WARC files are intended to
store every type of digital content, either retrieved by HTTP or another
protocol.

The motivation to extend the ARC format arose from the discussion and
experiences of the International Internet Preservation Consortium [
http://netpreserve.org/ ], whose core mission is to acquire, preserve
and make accessible knowledge and information from the Internet for
future generations. IIPC Standards Working Group put forward to ISO
TC46/SC4/WG12 a draft presenting the WARC file format. The draft was
accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the
Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener,
collaborated closely with IIPC experts to improve the original draft.
The WG12 will continue to maintain [http://bibnum.bnf.fr/WARC/ ] the
standard and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the
WARC format. It will help web archiving entering into the mainstream
activities of heritage institutions and other branches, by fostering the
development of new tools and ensuring the interoperability of
collections. Several applications are already WARC compliant, such as
the Heritrix [http://crawler.archive.org/ ] crawler for harvesting, the
WARC tools [http://code.google.com/p/warc-tools/ ] for data management
and exchange, the Wayback Machine
[http://archive-access.sourceforge.net/projects/wayback/ ], NutchWAX
[http://archive-access.sourceforge.net/projects/nutch/ ] and other
search tools [http://code.google.com/p/search-tools/ ] for access. The
international recognition of the WARC format and its applicability to
every kind of digital object will provide strong incentives to use it
within and beyond the web archiving community.

A press release is available on the IIPC website:
http://netpreserve.org/press/pr20090601.php

General information about the IIPC can be found at:
http://netpreserve.org

———————–
Abbie Grotke
Library of Congress
IIPC Communications Officer
netpreserve.org
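
To make the container idea in the announcement above a little more concrete, here is a minimal Python sketch of how one record might be serialized and how several records can be concatenated into a single file. It is an illustration only: the helper names `build_warc_record` and `build_container` and the example.org addresses are hypothetical, and the field layout reflects our reading of WARC 1.0 rather than code taken from Heritrix or the WARC tools. Real collections should be written with established tools.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(target_uri, payload, content_type, when):
    """Serialize one 'resource' record: version line, named fields, blank line, block.

    Hypothetical helper for illustration; `when` must be a timezone-aware datetime.
    """
    fields = [
        ("WARC-Type", "resource"),
        # Every record carries its own identifier, here a random UUID URN.
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # The length lets a reader skip the block without understanding its content.
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in fields:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line separates the header from the block; two CRLFs end the record.
    return head + CRLF + payload + CRLF + CRLF


def build_container(items, when):
    """Concatenate one record per (uri, payload, content_type) item into one byte string."""
    return b"".join(
        build_warc_record(uri, body, ctype, when) for uri, body, ctype in items
    )


if __name__ == "__main__":
    stamp = datetime(2009, 6, 3, tzinfo=timezone.utc)
    container = build_container(
        [
            ("http://example.org/a.txt", b"first object", "text/plain"),
            ("http://example.org/b.txt", b"second object", "text/plain"),
        ],
        stamp,
    )
    print(container.decode("utf-8"))
```

Because each record states its own length and type, a reader can walk through the file record by record without knowing anything about the objects inside, which is the "minimal knowledge" requirement the announcement mentions.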

Seeking Schools for K-12 Web Archiving Program

September 11, 2008

Apply to be part of the Internet Archive K-12 project!

Could your school be one of 10 middle or high schools helping to
capture and archive today’s primary source materials on the Web?

A small number of individuals and institutions recognize the importance of archiving and preserving the often transitory digital cultural artifacts that are distributed over the Web. But so far, the vast majority of decisions about what Web sites will live into the future have been made by adults, and reflect adults’ sensibilities about what constitutes the important stuff of history.

The Internet Archive, the Library of Congress and California Digital Library are collaborating on a project that explores archiving the Web from the perspective of adolescents.

Find a complete project description and the brief application in the “Featured Resources” section at http://www.loc.gov/teachers/. Apply by September 30 for full consideration.

A pilot of the K-12 web archiving program took place in the spring of 2008. Three high schools from across the country participated, and the resulting collections represent a broad range of interests and points of view. You can learn more about the pilot and view the collections on the Archive-It website.

Archive-It 2.10!

September 10, 2008

The Archive-It team is excited to announce the release of Archive-It 2.10!

Our new features include a Spanish interface to both our public site (www.archive-it.org) and the Archive-It private web application. Archive-It is reaching out to our Spanish-speaking colleagues in the United States and around the world in order to widen the scope and breadth of web archiving.

Other changes to our public website will increase the visibility and access to the broad range of collections created by Archive-It partners. We have also added a seed URL registry so the web archiving community can see exactly what Archive-It partners are collecting. This feature is a work in progress and we welcome any and all feedback!

Inside the web application, we have expanded our ability to automatically harvest YouTube videos. This upgrade is the result of working closely with the University of North Carolina on an NDIIPP-funded project to harvest election videos on the web.

We have also developed documentation and guidelines for partners to better review their collections for quality assurance. We will be offering advanced partner trainings in October or November that will help partners with QA and effective crawl scoping.

Stay tuned for our newest feature, Scope It!, to be released in November as part of our 3.0 release. Scope It! is a pre-crawl scoping tool that allows partners to select content for harvest at a more selective level.

If you would like to attend one of our twice monthly informational webinars, send an email to archive-it at archive.org.

Library Partnership to Preserve End-of-Term Government Websites

August 21, 2008

The Library of Congress has formally announced a collaborative partnership with Internet Archive, the California Digital Library, the University of North Texas Libraries, and the U.S. Government Printing Office to preserve .gov websites during the upcoming presidential transition. There is a story covering the announcement in the Washington Post as well.

Internet Archive’s role in the project will be to focus on the harvesting of websites in the .gov domain using Heritrix, the open source web crawler developed at IA. The project will serve to preserve at-risk government websites that are likely to change dramatically from one administration to the next. The resulting collection will be publicly accessible starting in February 2009.

Internet Archive has played a key role in archiving past administrative transitions with the U.S. National Archives both in 2004 and with the congressional change in 2006. These past harvests are freely accessible online.