Archive for the ‘Web Archiving Community’ Category

Wayback Machine & Web Archiving Open Thread, April 2011

April 7, 2011

What do you want to know or discuss about the Wayback Machine or the Internet Archive’s web archive? Do you have problems, concerns, or suggestions? This is the place!

If your comment is a question, please check the classic Wayback Machine Frequently Asked Questions (FAQ) or the new Wayback Machine FAQ site before posting to see whether your question has already been addressed.

A few other things to note before posting:

Everything else? Fire away!

Archive-It Partnership with LOCKSS

December 17, 2009

We should have announced this back in July, but we are still just as excited about it six months later, so we wanted to be sure we got the word out. We are pleased to announce that data harvested through the Archive-It service was successfully re-harvested into a LOCKSS network for preservation. The transfer was part of an Andrew W. Mellon Foundation project with the University of Rochester.

If you are interested in learning more about the Archive-It/LOCKSS partnership, please contact the LOCKSS team at lockss-support (at) lockss (dot) org or the Archive-It team (http://www.archive-it.org/public/contact-us).

The Archive-It team would like to partner with additional preservation systems and needs to hear from our partners. If your institution is interested in participating in a pilot for the preservation system you use, please contact the Archive-It team and let us know. We have done a pilot with iRODS and are in the middle of a test with CONTENTdm.

Alaska State Library Archiving Governor Palin’s End of Term Website

July 28, 2009

The Alaska State Library’s collection Alaska Governor/Lt. Governor Web Sites was originally conceived to archive these government websites over time. Alaska Governor Sarah Palin’s resignation announcement earlier this month and the transition of power to Lieutenant Governor Sean Parnell this past Sunday, July 26, 2009, gave the Alaska State Library a great chance to use the crawl-on-demand feature of Archive-It to preserve information on the announcement and the end of Governor Palin’s term.

By crawling Governor Palin and Lt. Governor Parnell’s websites on the eve of the transition of power, the Alaska State Library was able to capture information that is now offline.  Once Sarah Palin left office, the governor’s website changed to reflect Sean Parnell as governor, and the lieutenant governor’s website changed to reflect Craig Campbell as lieutenant governor. The information from former Governor Palin’s website as well as speeches and press releases from Sean Parnell’s time as lieutenant governor are no longer available on the live web. The foresight of the staff of the Alaska State Library and on-demand crawling through Archive-It made it possible to preserve the final changes to these websites before they were taken offline.

Join the K-12 Web Archiving Program!

July 22, 2009

 

Apply to be part of the Internet Archive K-12 program, and your school can help to capture and archive today’s primary source materials on the Web. 

A growing number of individuals and institutions recognize the importance of archiving and preserving the often transitory digital cultural artifacts that are distributed over the Web. But so far, the vast majority of decisions about what Web sites will live into the future have been made by adults, and reflect adults’ sensibilities about what constitutes the important records of history. We want and need to hear from students. 

The Internet Archive, the Library of Congress and California Digital Library collaborated on a pilot in the spring of 2008 and a full-year program for the 2008/2009 school year, working with a total of 10 elementary, middle and high schools. We are looking to expand this program to new schools in the coming year. You can explore the collections created during the 2008/2009 school year on the Archive-It website at: http://www.archive-it.org/k12/

Find a complete project description and the brief application here: http://www.loc.gov/teachers/newsevents/news/ Apply by August 14 for full consideration.

Archive-It and LOCKSS Interoperability!

July 21, 2009

The Archive-It team is excited to announce the successful transfer of Archive-It data from the Internet Archive data center into the LOCKSS network. The transfer was part of an Andrew W. Mellon Foundation project with the University of Rochester.

We are excited to be able to provide these and other preservation options to Archive-It partners as we increase the interoperability of the Archive-It service. If you are interested in learning more, please contact the Archive-It team. More information about the LOCKSS system can be found at www.lockss.org.

K-12 Web Archiving Program, 2008-2009

July 7, 2009

The website for the first full year of our K-12 Web Archiving Program is now available online.

For the 2008/2009 school year, Internet Archive, the Library of Congress and California Digital Library collaborated on a program that explores archiving the Web from the perspective of students in elementary, middle and high school.

Using the Archive-It service, students from ten different schools selected born-digital content from the Web to create “time capsules” to represent their world. By allowing students to identify sites that will be preserved for the long term, the program gives teens and younger students a chance to identify and document their cultural history and the world that’s important to them. Unlike time capsules of tangible objects, which usually remain hidden for decades or centuries, the resulting Web collections are immediately visible and publicly accessible on the Archive-It website, with full-text search for study and analysis.

For the 2009/2010 school year we hope to broaden the program’s outreach to additional schools around the country. To get involved and/or learn more, please send us your information through this request form. Applications will be available mid to late July.

University of Melbourne’s Award Winning Web Archiving Program

June 11, 2009

The University of Melbourne was recently recognized for its excellent web archiving program at the Sir Rupert Hamer Records Management Awards.

Each year the Public Records Advisory Council (PRAC) of Victoria, Australia offers the Sir Rupert Hamer Records Management Awards, recognizing excellence and innovation in records management in the Victorian public sector.  The Awards are named after Sir Rupert Hamer who was the Victorian Premier when the Public Records Act was passed in 1973 and when Public Record Office Victoria opened its first office and repository in 1975. 

The ceremony was held at Queens Hall at Parliament House on Thursday, the 28th of May. The Web Archiving Program run by Records Services (the team included Lucinda Davies – Program Coordinator, Silvia Paparozzi – Team Member, Mahesh Sundar – Team Leader and Catherine Nicholls – Project Manager) was awarded a “Certificate of Commendation” in the large agency category.

The University of Melbourne has been an Archive-It partner since January 2008. Overall they have collected over 5 million URLs and 500 GB of data.

Please take a look at their now award-winning program, including this wonderful video (featuring puppets!!) they put together at the end of last year.

Congratulations Team Melbourne!

WARC File Format Published as an International Standard

June 3, 2009

An exciting announcement from the International Internet Preservation Consortium regarding the preservation file format generated using the Heritrix web crawler (used for all Archive-It and Internet Archive crawls for partners):

The International Internet Preservation Consortium is pleased to
announce the publication of the WARC file format as an international
standard: ISO 28500:2009, Information and documentation — WARC file
format.

[http://www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=44717]

For many years, heritage organizations have tried to find the most
appropriate ways to collect and keep track of World Wide Web material
using web-scale tools such as web crawlers. At the same time, these
organizations were concerned with the requirement to archive very large
numbers of born-digital and digitized files. There was a need for a container
format that permits one file simply and safely to carry a very large
number of constituent data objects (of unrestricted type, including many
binary types) for the purpose of storage, management, and exchange.
Another requirement was that the container need only minimal knowledge
of the nature of the objects.

The WARC format is expected to be a standard way to structure, manage
and store billions of resources collected from the web and elsewhere. It
is an extension of the ARC format
[http://www.archive.org/web/researcher/ArcFileFormat.php ], which has
been used since 1996 to store files harvested on the web. WARC format
offers new possibilities, notably the recording of HTTP request headers,
the recording of arbitrary metadata, the allocation of an identifier for
every contained file, the management of duplicates and of migrated
records, and the segmentation of the records. WARC files are intended to
store every type of digital content, either retrieved by HTTP or another
protocol.

The motivation to extend the ARC format arose from the discussion and
experiences of the International Internet Preservation Consortium [
http://netpreserve.org/ ], whose core mission is to acquire, preserve
and make accessible knowledge and information from the Internet for
future generations. IIPC Standards Working Group put forward to ISO
TC46/SC4/WG12 a draft presenting the WARC file format. The draft was
accepted as a new Work Item by ISO in May 2005.

Over a period of four years, the ISO working group, with the
Bibliothèque nationale de France [http://www.bnf.fr/ ] as convener,
collaborated closely with IIPC experts to improve the original draft.
The WG12 will continue to maintain [http://bibnum.bnf.fr/WARC/ ] the
standard and prepare its future revision.

Standardization offers a guarantee of durability and evolution for the
WARC format. It will help web archiving enter the mainstream
activities of heritage institutions and other branches, by fostering the
development of new tools and ensuring the interoperability of
collections. Several applications are already WARC compliant, such as
the Heritrix [http://crawler.archive.org/ ] crawler for harvesting, the
WARC tools [http://code.google.com/p/warc-tools/ ] for data management
and exchange, the Wayback Machine
[http://archive-access.sourceforge.net/projects/wayback/ ], NutchWAX
[http://archive-access.sourceforge.net/projects/nutch/ ] and other
search tools [http://code.google.com/p/search-tools/ ] for access. The
international recognition of the WARC format and its applicability to
every kind of digital object will provide strong incentives to use it
within and beyond the web archiving community.

A press release is available on the IIPC website:
http://netpreserve.org/press/pr20090601.php

General information about the IIPC can be found at:
http://netpreserve.org

———————–
Abbie Grotke
Library of Congress
IIPC Communications Officer
netpreserve.org
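
For readers curious what the container idea in the announcement above looks like in practice, here is a minimal Python sketch that writes and re-reads a single record. It is an illustration only, not how Heritrix or any other tool mentioned above writes its files. The function names `build_warc_record` and `parse_warc_record` are hypothetical, the field names follow the WARC 1.0 layout as we understand it, and real WARC files carry more record types and fields (and are typically compressed), none of which is shown here.

```python
import uuid
from datetime import datetime, timezone

CRLF = "\r\n"


def build_warc_record(target_uri, payload, content_type, record_date):
    """Serialize one simplified 'resource' record (hypothetical helper).

    payload is bytes; record_date is a timezone-aware datetime.
    """
    named_fields = [
        ("WARC-Type", "resource"),
        # Every record gets its own identifier, here a random UUID URN.
        ("WARC-Record-ID", "<urn:uuid:{}>".format(uuid.uuid4())),
        ("WARC-Date",
         record_date.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        # The length lets a reader skip the payload without understanding it.
        ("Content-Length", str(len(payload))),
    ]
    header = "WARC/1.0" + CRLF
    header += "".join(
        "{}: {}{}".format(name, value, CRLF) for name, value in named_fields
    )
    header += CRLF  # a blank line separates the header from the payload
    # Two CRLFs close the record so that records can be concatenated.
    return header.encode("utf-8") + payload + b"\r\n\r\n"


def parse_warc_record(record):
    """Split a record built above back into (fields dict, payload bytes)."""
    header, _, rest = record.partition(b"\r\n\r\n")
    lines = header.decode("utf-8").split(CRLF)
    if lines[0] != "WARC/1.0":
        raise ValueError("not a WARC/1.0 record")
    fields = dict(line.split(": ", 1) for line in lines[1:])
    return fields, rest[: int(fields["Content-Length"])]


record = build_warc_record(
    "http://example.org/",
    b"hello, archive",
    "text/plain",
    datetime(2009, 6, 3, tzinfo=timezone.utc),
)
fields, payload = parse_warc_record(record)
print(fields["WARC-Type"], fields["Content-Length"], payload)
```

Because the header states the payload length, a reader can step from record to record without knowing anything about the payload type, which is the "minimal knowledge of the nature of the objects" requirement the announcement describes.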

Searching the dawn of the 21st Century

October 7, 2008

What was the web of the past really like?

Last Tuesday, Google unveiled a unique new web search, 2001 Google, as part of their 10th birthday celebration.

Using an actual archived version of their search engine index from January 2001, the service answers queries more-or-less how Google did back then — same results, same ranking, same summary ‘snippets’.

But of course, many of those result pages have changed or disappeared entirely since then — and that’s where the Internet Archive’s Wayback Machine comes in. For many of the 2001 search results, the best or only view comes from the Wayback Machine, which Google has helpfully provided in lieu of the usual ‘cached version’ links.

The combination of authentic Google search and the Wayback’s giant web archive is more powerful than either alone: finding needles lost in the Wayback haystack, showing actual prior rankings/popularity of pages for real queries, and highlighting material that would have been lost forever without purposeful public-interest archiving.

We thank Google for this chance to work together and highlight our web archive. Google plans to leave the 2001 search up for one month, and we’ll talk more about what we’ve learned from this service in a future blog post.

In the meantime, try the 2001 Google Search!

Seeking Schools for K-12 Web Archiving Program

September 11, 2008

Apply to be part of the Internet Archive K-12 project!

Could your school be one of 10 middle or high schools helping to capture and archive today’s primary source materials on the Web?

A small number of individuals and institutions recognize the importance of archiving and preserving the often transitory digital cultural artifacts that are distributed over the Web. But so far, the vast majority of decisions about what Web sites will live into the future have been made by adults, and reflect adults’ sensibilities about what constitutes the important stuff of history.

The Internet Archive, the Library of Congress and California Digital Library are collaborating on a project that explores archiving the Web from the perspective of adolescents.

Find a complete project description and the brief application in the “Featured Resources” section at http://www.loc.gov/teachers/. Apply by September 30 for full consideration.

A pilot of the K-12 web archiving program took place in the spring of 2008. Three high schools from across the country participated, and the resulting collections represent a broad range of interests and points of view. You can learn more about the pilot and view the collections on the Archive-It website.