Get access to the world of the 'Invisible Web'

Royal Gazette Staff

Created: Mar 16, 2001 10:00 AM

There's a big problem with most search engines, and it's one many people aren't even aware of. The problem is that vast expanses of the Web are completely invisible to general-purpose search engines like AltaVista, HotBot, and Google. Even worse, this "Invisible Web" is growing significantly faster than the visible Web you're familiar with.

So what is this Invisible Web and why aren't search engines indexing it? To answer this question, it's important to first understand the "visible" Web, and describe how search engines compile their indexes.

The Web was created in 1990 by Tim Berners-Lee, a researcher at the CERN physics laboratory in Switzerland. Berners-Lee designed the Web to be platform-independent, so researchers at CERN could share materials residing on any type of computer system, avoiding cumbersome and costly conversion issues.

To enable this cross-platform capability, Berners-Lee created HTML, or HyperText Markup Language.

HTML documents are simple: they consist of a "head" portion, with a title and perhaps a few words describing the document, and a "body" portion, the actual document itself. Because HTML pages are so simple, it is easy for search engines to retrieve HTML documents, index every word on every page, and store the results in huge databases that can be searched on demand. In essence, search engines are word-matching machines. When you search, they compare your query keywords with documents in the database and try to find the best match.
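The indexing and matching described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular search engine works: the three pages and their example.com addresses are invented, and the ranking rule (count how many of the query's keywords a page contains) is a deliberate simplification.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map every word to the set of page addresses that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Rank pages by how many of the query's distinct keywords they contain."""
    scores = defaultdict(int)
    for word in set(tokenize(query)):
        for url in index.get(word, ()):
            scores[url] += 1
    # Best match first; ties are broken alphabetically by address.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical pages, invented for illustration.
PAGES = {
    "http://example.com/spiders": "Spiders crawl the Web by following links",
    "http://example.com/databases": "Databases hold information spiders cannot reach",
    "http://example.com/cern": "The Web began at the CERN physics laboratory",
}

INDEX = build_index(PAGES)
RESULTS = search(INDEX, "web spiders")
print(RESULTS)
```

The page that contains both keywords comes first. A query for a word that appears on none of the indexed pages returns an empty list, which is exactly what happens when the information you want was never indexed.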

Search engines use automated programs called spiders to "crawl" the Web.

Just as you do when you surf the Web, spiders rely on links to take them from page to page. They roam through public Web servers, recording the addresses and descriptions of the Web pages they discover.

The Invisible Web is made up of the information stored in databases. Unlike pages on the visible Web, information in databases is generally inaccessible to the software spiders that compile indexes for search engines. When a search engine comes across a database, it's as if it runs smack into the entrance of a massive library with securely bolted doors. Spiders can record the library's address, but cannot tell you anything about the books, magazines or other documents inside. Generally, information in a database is not in HTML format and does not have any HTML links pointing to it, even if that database has a Web front-end interface.
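A toy spider makes the bolted-door problem concrete. The sketch below is an illustration only: the three-page site, the `/lookup` form address and the one-record acronym database are all hypothetical, and a real spider would fetch pages over the network rather than from a Python dictionary. The spider follows every link it finds, yet it never sees the database record, because that record appears only when a person types a query into the form.

```python
import re
from collections import deque

# A hypothetical miniature Web, invented for illustration: address -> HTML.
SITE = {
    "/": '<a href="/about">About</a> <a href="/catalogue">Catalogue</a>',
    "/about": '<a href="/">Home</a>',
    "/catalogue": '<form action="/lookup"><input name="q"></form>',
}

# Records that exist only as answers to a query typed into the form.
DATABASE = {"nato": "North Atlantic Treaty Organization"}

def fetch(url):
    """Return the HTML at an address, or None if there is no such page."""
    return SITE.get(url)

def crawl(start):
    """Follow links breadth-first and return every address reached."""
    seen = {start}
    queue = deque([start])
    while queue:
        html = fetch(queue.popleft())
        if html is None:
            continue
        # The spider only knows how to follow ordinary links.
        for link in re.findall(r'<a href="([^"]+)"', html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

def lookup(query):
    """What the search form returns when a person types a query."""
    return DATABASE.get(query.lower())

FOUND = crawl("/")
print(sorted(FOUND))
print(lookup("NATO"))
```

The spider records the address of the catalogue page, the library's front door, but nothing behind it. Only a visitor who fills in the form gets the answer.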

There are thousands of databases containing high-quality information that are accessible via the Web. But to search them, you must visit the Web site that provides an interface to the database. The advantage of this direct approach is that you can use search tools that were specifically designed to retrieve the best results from the database. The disadvantage is that you need to find the database in the first place.

Also, Web sites are increasingly using databases to store and display their information instead of putting Web site content on static HTML pages.

Databases allow Web sites to offer customised content that's often assembled on the fly from many parts of the database. This trend is going to make it even harder for search engines to be comprehensive Web indexes.

How do you find the information in the databases on the Invisible Web? To use the Invisible Web, you need specialised search engines that search databases, not Web pages. But how do you learn that these special search engines exist if a regular search engine cannot index them?

The situation is analogous to a large public library. I know that the Library of Congress most likely has somewhere in its stacks the exact piece of information that I am looking for. The problem is how to locate it. The Internet user is in the same position. If you tell me that there is a database called "Acronym and Abbreviation Finder", I can be pretty sure that if I need to know the meaning of a certain acronym, I can find it there. But if I didn't know that such a database existed in the first place, how would I know to look for it?

Gary Price, a librarian at George Washington University in Washington D.C., is the guru of off-Internet searching, and he coined the term "Invisible Web".

Mr. Price has assembled a list called "Direct Search" (http://gwis2.circ.gwu.edu/~gprice/direct.htm) which has links to the search interfaces of resources that contain data not easily or entirely searchable or accessible from search tools like AltaVista, Google or HotBot.

Direct Search is a massive compilation of specialised Internet search tools, listing annotated links to more than 1,000 searchable, interactive databases -- an index of indexes. The page is designed to provide quick access directly to the search forms of Invisible Web sites.

The size of the Direct Search page makes it appear a bit intimidating at first, but once you become familiar with the layout you will find it's well organised and easy to use. At the top of the page is a search engine just for the Direct Search links, followed by the recently added links. Beneath this are links that take you to major subject categories in Mr. Price's directory of databases, and below these is a section called Internet Resource Compilations.

Mr. Price's Internet Resource Compilations section has links to several useful pages. On Price's Lists of Lists (http://gwis2.circ.gwu.edu/~gprice/listof.htm) you'll find links to sources that compile business rankings and other useful lists from a variety of Web-accessible resources.

Mr. Price's NewsCenter page (http://gwis2.circ.gwu.edu/~gprice/newscenter.htm) lists sources providing up-to-the-minute news stories on any subject imaginable.

If you want to know what politicians, business people, or other notable people have said on the record, Speech and Transcript Center (http://gwis2.circ.gwu.edu/~gprice/speech.htm) provides links to transcripts of speeches, broadcasts, and other spoken records.

Finally, Mr. Price maintains a small but growing set of links to Congressional Research Service Reports (http://gwis2.circ.gwu.edu/~gprice/crs.htm). The Congressional Research Service is a department of the Library of Congress, and works exclusively as a nonpartisan analytical, research, and reference arm for Congress.

Continued next Wednesday in Personal Technology
