The almost invisible web: finding more pages with specialised search engines
No one knows how many pages there are on the Web. Recent estimates put the figure at around four billion worldwide, with more than seven million pages added every day, according to a study by Cyveillance International.
Assuming that all human knowledge, or at least most of it, is somewhere on the Web, the problem is finding it. The obvious answer is to use one of the many search engines on the Internet, but any general search engine has three basic drawbacks.
The first problem is that no search engine will show you all of the pages it has actually indexed. You've probably seen something like this before: 'Your search resulted in 2,057,913 results.' You'll never see more than the first few hundred to one thousand that the search engine has deemed the most 'relevant'. In other words, though they boast about the large number of results found, search engines consider almost all of them irrelevant, and won't show them to you! Is this a bad thing? Not necessarily. The only time you'd need access to hundreds or more results is if you're truly looking for a needle in a haystack, or doing research that requires you to be exhaustively thorough. And you can often find a 'close enough' link in the first set of results. While it may not be exactly what you're looking for, it may offer links to other sites that match your needs more precisely.
Hiding most of the results only becomes a real problem when the ones you do see are not what you are looking for.
The second problem is that search engines cannot index all the information available on the Web, because much of it is stored in databases.
The information in a database is not in HTML format and has no HTML links. A search engine such as Google can find the address of a database that is on the Web and that your Web browser can access, but it cannot see what is inside the database. It is like knowing where the library is on Queen Street but having no idea what is on its shelves.
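To see why, consider how a spider discovers pages in the first place: it can only follow hyperlinks. Here is a minimal sketch in Python (the page and its URLs are invented for illustration) showing that a page reachable only through a database query form never turns up as a link a spider could follow:

    from html.parser import HTMLParser

    class LinkExtractor(HTMLParser):
        """Collects the href of every <a> tag: the only doors a spider can open."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    # An invented page: one ordinary link, plus a query form in front of a database.
    page = """
    <html><body>
      <a href="/about.html">About us</a>
      <form action="/catalogue-search">   <!-- the database lives behind this form -->
        <input name="q"><input type="submit">
      </form>
    </body></html>
    """

    extractor = LinkExtractor()
    extractor.feed(page)
    print(extractor.links)   # ['/about.html'] -- the catalogue never appears

The form's action never enters the spider's queue, so everything behind it stays out of sight.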
All the information in databases around the world makes up the Invisible Web, and all of it is invisible to Web search engines: they cannot see it, so they cannot index it. If you would like more information about the Invisible Web, e-mail me (michelle@ychristers.net) and I will send you the column from March 21, 'The Invisible Web'.
The third problem is that search engines do not even try to catalogue all the Web pages in HTML format that they can 'see' and index. Just because some Web pages aren't included in a search engine's index doesn't automatically mean that they are invisible.
Search engines use automated programs called spiders to crawl through the Web and retrieve Web addresses for their search indexes. When you use a search engine, you're searching a proprietary database that contains a lot of information about some of the Web pages on the Internet. These search engine databases, called indexes, do not come close to covering the entire Web.
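In other words, a search never touches the live Web at all; it runs against the stored index. A toy Python sketch (the pages are made up) of the idea behind such an index:

    index = {}   # word -> set of page addresses: a greatly simplified 'inverted index'

    def add_page(url, text):
        for word in text.lower().split():
            index.setdefault(word, set()).add(url)

    def search(query):
        words = query.lower().split()
        return set.intersection(*(index.get(w, set()) for w in words))

    add_page("http://example.com/a", "the invisible web and search engines")
    add_page("http://example.com/b", "specialised search engines for law")
    print(search("search engines"))   # finds both pages without visiting either

If a page never made it into the index, no query will ever surface it, however well it matches.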
Inktomi and AltaVista claim to have found between 1.2 and 1.5 billion documents on the Web, but have put only 500 million and 350 million respectively into their searchable indexes. Why do so many Web pages go unindexed? There are two reasons.
First, crawling is a resource-intensive operation. It puts demand on the computers hosting the Web sites and can slow them down. For this reason, search engines often limit the number of pages they retrieve and index from any given Web site. Search engines usually look only at a Web site's first page and rarely follow the links to everything else.
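To make the trade-off concrete, here is a minimal Python sketch of a crawler that caps how many pages it takes from any one host. The cap, the delay and the fetch step are invented for illustration; a real engine's policy is far more elaborate:

    import time
    from collections import Counter, deque
    from urllib.parse import urlparse

    MAX_PAGES_PER_HOST = 5   # assumed cap: engines limit retrievals per site
    CRAWL_DELAY = 1.0        # assumed pause (seconds) between requests

    def crawl(seed_urls, fetch):
        """Breadth-first crawl that stops taking pages from a host at the cap.

        `fetch` is a caller-supplied stand-in for the real HTTP-and-parse
        step; it returns (page_text, outgoing_links) for a URL.
        """
        queue = deque(seed_urls)
        seen = set(seed_urls)
        pages_per_host = Counter()
        index = {}
        while queue:
            url = queue.popleft()
            host = urlparse(url).netloc
            if pages_per_host[host] >= MAX_PAGES_PER_HOST:
                continue                  # this site has had its share of attention
            pages_per_host[host] += 1
            text, links = fetch(url)
            index[url] = text             # record the page for the search index
            for link in links:
                if link not in seen:      # queue each address only once
                    seen.add(link)
                    queue.append(link)
            time.sleep(CRAWL_DELAY)       # spread the load on the host computers
        return index

    # Tried against a fake two-page site instead of real HTTP:
    site = {"http://example.com/": ("home", ["http://example.com/news"]),
            "http://example.com/news": ("news", [])}
    print(list(crawl(["http://example.com/"], lambda u: site.get(u, ("", [])))))

Every page skipped by a cap like this is a page missing from the index, which is exactly why so much of the Web goes unindexed.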
Second, search engines are not up to the minute. It can take them weeks of roaming to identify and catalogue new Web pages, so if you are looking for recent information, the chances increase that a search engine will not yet know about it.
It's tempting to think that these unretrieved Web pages are part of the Invisible Web, but they aren't. They are visible and indexable; the search engines have simply made a conscious decision not to index them to save computing power. From your perspective, when you are searching the Web, they are almost invisible.
How can you find Web pages that search engines have not bothered to index or have not had time to index yet? Use a specialised search engine.
Specialised search engines target specific kinds of information, such as news or law. Searching a smaller, targeted body of data increases the likelihood that you will find what you want, because there are fewer irrelevant Web pages to get in the way.
Specialised search engines crawl through Web sites more thoroughly and more often. In many cases humans guide the spider program to specific sites to crawl. New sites also tend to be discovered more quickly, because the engine is only looking for a particular type of Web site.
There is one big problem with using specialised search engines: you have to know they exist before you can use them. And how do you make an index of the indexes? Below is a list of specialised search engines.
News Now (www.newsnow.co.uk) is a British site that scans 1,500 news sources, updating its listings every five minutes. NewsHub (www.newshub.com) integrates and reports headlines from news wires around the world every 15 minutes.
To search for legal information, check out findlaw.com, which contains many links to databases (US federal resources, reference resources, legal forms) and to international law Web sites covering the United Nations as well as dozens of countries. Click on the LawCrawler link to limit your search to particular types of law-related resources.
Psychcrawler.com is sponsored by the American Psychological Association, an organisation that knows the topic, which helps ensure that quality material is indexed.
ERIC (the Educational Resources Information Center, at www.accesseric.org) provides some of the best and most useful material for the education researcher. Searching via the ERIC Digest search interface currently gives you access to the full text of 2,200 digests; at last check, most general search tools were missing several hundred of these documents.
Tip: when you use a search engine dedicated to a particular topic, be very precise with your search words. For example, simply using 'law' as a search term at findlaw.com will bring up a long list of Web sites, since all the material in the index is about law.
There is also a 'collection of special search engines', including engines for topics you've never heard of, much less thought of searching. Check the Dutch site www.leidenuniv.nl/ub/biv/specials.htm, which contains 77 pages of listings compiled by Marten Hofstede.
Basically, to find what you need on the Web, you need to know where the search engines are before you need to use them. That is much easier said than done, but by continuously bookmarking and organising your own collection of search engines you can save a great deal of time and effort. With that said, do not forget to browse. Serendipity can be the searcher's best friend.
Browsing is also a great way to acquire new sites for later exploration.