Archiving the always-changing World Wide Web
Have you ever gone to a Web site with great information only to find that the site has changed or is no longer to be found?
One of the problems of the new "digital age" is that things happen so quickly, and change is so ubiquitous, that there is no time to record what was happening. Historians point out that one of the reasons to study history is to learn from the past and, hopefully, avoid making the same mistakes in the future. But that is hard to do if there is no record of the way things were done, because Web servers and e-mail servers are always adding new files that overwrite the old files. In an attempt to save the historical record for posterity, the Internet Archive (www.archive.org) has been collecting copies of Web sites and saving them since 1996.
The Internet Wayback Machine is the search engine for the Internet Archive collection, and what a collection it is: ten billion individual pages, occupying 100 TB (terabytes) of storage space! By comparison, the entire US Library of Congress is estimated to contain a mere 20 TB of data. These old Web sites have been cached, or copied, on a regular basis since 1996, and the collection is growing at a rate of 10 TB per month.
Using the Wayback Machine, you can see what Amazon.com looked like in 1996, and take a peek at the Web's first search engine, WebCrawler, as it appeared on December 22, 1996. To use the Wayback Machine's search interface, you simply type in the URL of a page you'd like to dig out of the archive, just as you would enter the URL in the address bar of your Web browser. The search engine retrieves all references to that specific page and presents them as a list of cached copies, arranged by date; clicking a date displays the copy captured on that day.
The Wayback Machine is an example of the Invisible Web because it is a database with contents that can't be indexed by general search engines, since they can't be crawled by following links. However, it's good to see that the way the URLs for individual cached pages in the collection are formed makes for easy linking - which is often not the case with invisible Web sites.
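For readers who want to build such links themselves, here is a minimal sketch in Python. It assumes the commonly seen address pattern of a base address, then a 14-digit timestamp, then the original URL. That pattern, the base address web.archive.org/web, and the function name build_snapshot_url are assumptions made for illustration; the article itself does not spell out the format, and the archive is assumed to show the capture nearest the requested date when no exact match exists.

```python
from datetime import datetime

# Assumed base address for archived snapshots; not stated in the article.
WAYBACK_BASE = "https://web.archive.org/web"


def build_snapshot_url(page_url: str, when: datetime) -> str:
    """Return a link for the archived copy of page_url at (or near) `when`.

    Hypothetical helper: it only formats a string and never contacts the archive.
    """
    # Assumed timestamp layout: 14 digits, YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_BASE}/{timestamp}/{page_url}"


if __name__ == "__main__":
    # The WebCrawler snapshot date mentioned in the article.
    print(build_snapshot_url("http://www.webcrawler.com/", datetime(1996, 12, 22)))
```

Because the function only assembles text, the resulting link still has to be opened in a browser to confirm that a copy exists for that date.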
The extent and value of the Internet Archive become apparent as soon as you start searching. For example, if you search for Yahoo!, you'll find 3,396 archived pages beginning October 17, 1996, although many of these are duplicates, which the engine removes from the results listing by default (you can click a "See all" link to get the complete results). Anyone who has used Google's cached page feature to look at a site that has disappeared or changed will appreciate the Wayback Machine's implementation, which offers a whole historical range of cached page copies instead of just the most recent one captured by the search engine.
Searchers looking for information about very specific, narrow subjects are often frustrated to find that promising-sounding pages shown by search engine results are dead links, the pages long since vanished. The Wayback Machine offers a second chance to find obscure material.
The Wayback Machine is also a great tool for researchers doing competitive intelligence work. It's a piece of cake to dig down and find old corporate information, back before the company's business plan changed.
Mr. Peabody and Sherman were cartoon characters on the 1960s TV series The Rocky and Bullwinkle Show. They used a time machine called The Wayback Machine to travel back in time to witness historical events, which is where the Internet Archive's search engine gets its name. In the cartoon, Mr. Peabody and Sherman tended to interfere in events to make sure history turned out the way it should. With the Internet's Wayback Machine you can see history the way it really was, without any interference or modifications.
On the Internet, old history is measured in weeks, not years. Archiving the Web allows you to keep track of the history that we are all witnessing, something that would have made Mr. Peabody proud.