American University’s Web Harvesting Project: A Work in Progress

We recently completed our first year of web harvesting, so this seems a fitting time for a progress report. The original scope of the project was to document the online presence of student organizations and to collect web-only publications. We presented our proposal in the fall of 2008, just as AU was finalizing plans to launch its new website the following spring. In light of this, we expanded our scope to cover the University’s entire website.

American University selected the Internet Archive’s Archive-It service for this project. Archive-It has a user-friendly web interface through which you can set up and schedule crawls. The Internet Archive stores the websites collected, generates reports, and offers technical support. Because of the evanescent nature of the web, it is important to review the reports generated by Archive-It’s crawler. These reports document the successes and failures of each crawl. By reviewing this data, we can identify crawler traps and write code to prevent future problems.

Over the course of the last year, we have conducted four major crawls and several smaller ones. We began reaping the benefits of this project within several months of starting. We have already received inquiries from students seeking copies of articles they had written for an online publication. The publication’s website was temporarily down, and the harvested version was the only source of their work.

The archived version of AU’s website is available through the Archive-It site. I invite you to browse the archives. Start at http://www.archive-it.org/public/all_collections and select one of AU’s collections. For those of you familiar with the Wayback Machine, note that it only has data for http://www.american.edu/ from 1996-2008.
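
As a rough illustration of the crawl-report review mentioned above, the sketch below shows one way a list of crawled URLs might be screened for likely crawler traps. It is not Archive-It's own tooling or report format: the function names, the thresholds, and the example.edu URLs are all hypothetical, and it assumes only that the crawled URLs have been exported from a report as plain strings.

```python
from collections import Counter
from urllib.parse import urlsplit


def looks_like_trap(url, max_depth=8, max_repeats=3):
    """Heuristically flag a URL that may come from a crawler trap.

    The thresholds are illustrative assumptions, not Archive-It settings.
    """
    segments = [s for s in urlsplit(url).path.split("/") if s]
    # Very deep paths are a common symptom of endlessly generated links.
    if len(segments) > max_depth:
        return True
    # So is the same path segment occurring many times in one URL.
    counts = Counter(segments)
    return any(n >= max_repeats for n in counts.values())


def trap_hosts(urls):
    """Count suspicious URLs per host so the worst offenders stand out."""
    hits = Counter()
    for url in urls:
        if looks_like_trap(url):
            hits[urlsplit(url).netloc] += 1
    return hits


# Hypothetical URLs standing in for lines exported from a crawl report.
sample = [
    "http://www.example.edu/calendar/2009/05/",
    "http://www.example.edu/events/events/events/events/",
    "http://www.example.edu/a/b/c/d/e/f/g/h/i/",
]
print(trap_hosts(sample))
```

Hosts or paths that surface this way are candidates for exclusion rules in later crawls. The heuristic will produce false positives, so flagged URLs still need to be checked by hand.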