Software Recommendations for Web Archiving
From MWCSWiki
Introduction
Our goal is to archive the University of Mary Washington's website. The archive must be searchable and sorted by date, so that web pages for old events, old documents and forms, old class schedules, old syllabuses, old lecture material, and more remain easy to find. Any student or professor who can access the website must be able to search the archive. Any professor or staff member with access to the website must be able to request an archiving.
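To make the "searchable and sorted by date" requirement concrete, here is a minimal sketch of a keyword search limited to a time frame. The `ArchivedPage` record, the `search` function, and the example.edu URLs are hypothetical and only illustrate the requirement; they are not part of Heritrix or of any archive viewer discussed below.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedPage:
    # Hypothetical record type, used only for this illustration.
    url: str
    captured: date
    text: str


def search(pages, keyword, start, end):
    """Return pages containing keyword captured within [start, end], oldest first."""
    needle = keyword.lower()
    hits = [
        p for p in pages
        if needle in p.text.lower() and start <= p.captured <= end
    ]
    return sorted(hits, key=lambda p: p.captured)


pages = [
    ArchivedPage("http://example.edu/schedule", date(2007, 8, 20), "Fall class schedule"),
    ArchivedPage("http://example.edu/schedule", date(2007, 1, 10), "Spring class schedule"),
    ArchivedPage("http://example.edu/forms", date(2006, 5, 1), "Old forms and documents"),
]

# Find every 2007 capture that mentions "schedule", in date order.
results = search(pages, "schedule", date(2007, 1, 1), date(2007, 12, 31))
for page in results:
    print(page.captured.isoformat(), page.url)
```

A real archive would hold far more captures and would rely on an index rather than a linear scan, but the two filters (keyword and date range) and the date ordering are the behavior the requirement asks for.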
With these requirements in mind, we can look for software to accomplish this. The software comes in two parts. First we need a web crawler to make the archives. Second, we need an archive viewer to examine the contents of the archives.
Web Crawler
The web crawler must accomplish certain objectives. It must be able to find every page of a website it is told to crawl. It must record each page in its entirety and store it for later access. The archives it creates must be searchable by keyword and time frame. Fortunately, there is a web crawler available that was created specifically to handle those needs. It is called Heritrix, and it was created by the Internet Archive for use with the Wayback Machine. It is open source, cross-platform, and in active development. No other web crawler has met our needs so precisely, so it is my suggestion that we use Heritrix as our web crawler.
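The "find every page" objective can be pictured as a breadth-first traversal of the links between pages. The sketch below is only an illustration of that idea on a made-up, in-memory site; it does not use Heritrix, and the `crawl` function and the `site` dictionary are hypothetical names.

```python
from collections import deque


def crawl(start_url, get_links):
    """Visit every page reachable from start_url, breadth first.

    get_links is a caller-supplied function that returns the URLs
    linked from a given page.
    """
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            # Skip pages already queued so that link cycles terminate.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# Hypothetical site: each key is a page, each value lists the pages it links to.
site = {
    "/": ["/events", "/forms"],
    "/events": ["/", "/events/2007"],
    "/forms": ["/events"],
    "/events/2007": [],
}

visited = crawl("/", lambda url: site.get(url, []))
print(visited)  # ['/', '/events', '/forms', '/events/2007']
```

A production crawler also has to fetch pages over the network, store what it fetches, and stay within the site it was told to crawl; those parts are omitted here.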
Archive Viewer
While Heritrix is a good web crawler, the Wayback Machine software made to view its archives is not as good an option. It is not open source, so it is not freely available. Comments by people who have access to it state that it was not easy to work with. It would be advisable to look for alternatives. Fortunately, two open source alternatives are available that were specifically made to work with Heritrix. Both are Java-based, server-side applications designed to look similar to the Wayback Machine to their users. In order to choose one, we must look at which is easier for the developers to use.
Open Source Wayback Machine
The first option is an open source remake of the Wayback Machine in Java. It requires a Java 1.5 implementation and a servlet container, and it has been tested with Sun's Java 1.5 implementation and Apache Tomcat. Installation is simple if Tomcat is already installed, and consists of letting Tomcat process a *.war file. Configuring the software consists of telling it where the archives are and defining where to put the web interface people use to view the archives. The software supports dynamically added archives, so a restart is not required after every archiving job. One potential downside is the lack of a user manual, so supporting users who are less computer literate may be an issue.
WERA
An alternative software package is WERA. WERA is a web-based viewer for the NutchWAX archive search engine. The requirements are fairly steep. WERA needs a Java Virtual Machine, an Apache server with PHP 4.3 or 4.4 (but not PHP 5), the Tomcat servlet container, and NutchWAX. Fortunately, it is easy to install once the requirements are met. It has a self-installing package which will prompt you for the information it needs to configure itself. WERA also has a well-developed user manual, complete with pictures, to help walk the user through various processes.
Proxy Support
Both software packages have two modes of operation. The first mode is similar to the Internet Archive's Wayback Machine. It searches the archive like a search engine and then displays the results. Clicking on a result opens the archived web page in your browser. The second mode runs the software as a proxy server. In this mode, archived websites are displayed as a page within a page. This allows for easier searches for websites and also makes links between archived pages work more smoothly. Proxy mode is fully supported by the Java Wayback Machine but is only experimental in WERA.
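One likely reason links work more smoothly in proxy mode is that, without a proxy, a viewer generally has to rewrite the links inside each archived page so that they lead back into the archive instead of to the live site. The following is a minimal sketch of that rewriting idea, not the actual implementation of the Wayback Machine or WERA. The `/archive` prefix, the timestamp format, and the `rewrite_links` function are assumptions made for illustration.

```python
import re

# Hypothetical replay prefix; an actual viewer may lay out its URLs differently.
REPLAY_PREFIX = "/archive"

# Matches href attributes that hold an absolute http or https URL.
ABSOLUTE_HREF = re.compile(r'href="(https?://[^"]+)"')


def rewrite_links(html, timestamp):
    """Point absolute http(s) links at the archived copy for the given timestamp."""
    return ABSOLUTE_HREF.sub(
        lambda m: f'href="{REPLAY_PREFIX}/{timestamp}/{m.group(1)}"',
        html,
    )


page = '<a href="http://example.edu/events">Events</a> <a href="#top">Top</a>'
rewritten = rewrite_links(page, "20070901")
print(rewritten)
```

Only the absolute link is changed; the in-page `#top` link is left alone. Links built by scripts or written in other attributes would slip past a simple pattern like this one, which is the kind of gap a proxy avoids because the browser's requests go through the archive without any rewriting.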
Evaluation
When choosing which software package to endorse, a number of factors come into play. Both applications have all the features we need, so the question is which will be easiest to implement with our current system and user base. Four categories to consider are installation, expandability, administrator friendliness, and user friendliness.
Both applications need Java and Tomcat. Java is already installed on the server, and the server uses Apache, so Tomcat should not be a problem to install (if it is not already installed). However, WERA cannot work with PHP 5. That may become a problem in the future if the university needs to upgrade to support future software purchases. WERA also needs NutchWAX to be installed, and NutchWAX in turn requires an installation of the Hadoop implementation of the MapReduce algorithm. Therefore the Wayback Machine is the clear winner for installation and expandability.
When it comes to configuring the software after installation, both use a configuration file that the administrator edits. However, the Wayback Machine supports dynamically adding archive files, and WERA does not list this as a feature. Therefore the Wayback Machine wins this category too.
However, WERA has a good user guide, which the Wayback Machine completely lacks. So the question is whether this one category makes up for being behind in the other three. I propose that it does not, since a user guide is easier to produce than a simpler installation. Mary Washington can even write its own user guide and submit it to the open source project. And since the software will be used by students and professors of Mary Washington, they have lab aides and a help desk they can call for assistance. The availability of help makes this less of an issue.
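The category-by-category comparison above can be summarized as a simple tally. This is only a sketch of the bookkeeping: the winners are the ones stated in this evaluation, and every category is counted equally, which is a simplification.

```python
from collections import Counter

# Category winners as stated in the evaluation above.
winners = {
    "installation": "Wayback Machine",
    "expandability": "Wayback Machine",
    "administrator friendliness": "Wayback Machine",
    "user friendliness": "WERA",
}

tally = Counter(winners.values())
best, wins = tally.most_common(1)[0]
print(f"{best} leads in {wins} of {len(winners)} categories")
# prints "Wayback Machine leads in 3 of 4 categories"
```

An equal-weight tally understates the argument made here, which is that the one category WERA wins is also the easiest to make up for after the fact.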
Conclusion
Now that the options have been listed, it is this author's suggestion that we use Heritrix as our web crawler and the open source version of the Wayback Machine to view the archives. It will run on our current equipment. It will give access to the entire network. This will give us all the functionality we need with the fewest hassles with installation.
References
- INTERNET ARCHIVE. 2007. Heritrix. SourceForge. http://crawler.archive.org/
- INTERNET ARCHIVE. 2007. Wayback. SourceForge. http://archive-access.sourceforge.net/projects/wayback/
- INTERNATIONAL INTERNET PRESERVATION CONSORTIUM. 2007. WERA. SourceForge. http://archive-access.sourceforge.net/projects/wera/
- THE APACHE SOFTWARE FOUNDATION. 2007. Apache Tomcat. Apache Software Foundation. http://tomcat.apache.org/
- THE APACHE SOFTWARE FOUNDATION. 2006. Hadoop 0.14.2 API. Apache Software Foundation. http://lucene.apache.org/hadoop/api/overview-summary.html

