CS99I Meeting 3 Notes: Search engines

By Gio Wiederhold, 19 Jan 2000.

Topics Covered briefly

Size of the Web

"A new study by Inktomi and NEC Research Institute show that there is at least one billion unique indexable Web pages on the internet. The details are pretty interesting; for example, Apache dominates the server market." (17Jan2000)
The actual statistics are at http://www.inktomi.com/webmap/.

Re 1,000,000,000 documents, from a 1964 book about a near-infinite library that contained all the 500-page books that could be generated by permuting the alphabet:

"The certitude that any book exists on the shelves of the library first led to elation, but soon the realization that it was unlikely to be found converted the feelings to a great depression."

Search Techniques

Gio Wiederhold: Trends for the Information Technology Industry; Stanford University, April 1999

There is a wide variety of search techniques available. They are rarely clearly explained to the customers, perhaps because a better understanding might cause customers to move to other search services. Since the techniques differ, results will differ as well, but comparisons are typically based on recall rather than on precision. Getting more references always improves recall, but assessing precision formally requires an analysis of relevance, and knowing what has been missed, which is an impossible task given the size and dynamics of the web.

Potentially more relevant results can be obtained by intersecting the results from a variety of search techniques, although recall is then likely to suffer, since a reference survives the intersection only if every technique finds it.
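A minimal sketch of how precision and recall are computed, assuming the set of relevant documents is known (which, as noted above, is not the case for the web). The document ids and the two engine result sets are invented for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    """Fraction of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Hypothetical document ids; on the real web the relevant set is unknown.
relevant = {1, 2, 3, 4}
engine_a = {1, 2, 3, 7, 8, 9}
engine_b = {2, 3, 4, 5, 6, 7}

both = engine_a & engine_b     # references found by both engines
either = engine_a | engine_b   # references found by at least one engine

for name, result in [("a", engine_a), ("b", engine_b),
                     ("both", both), ("either", either)]:
    print(name, round(precision(result, relevant), 2),
          round(recall(result, relevant), 2))
```

In this toy case the intersection has higher precision but lower recall than either engine alone, while the union reaches full recall at lower precision.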

We briefly describe below the principal techniques used by some well-known search engines; each can be tried at www.name.com, substituting the engine's name for name. This summary can provide hints for further improvements in the tools.

Yahoo

Alta Vista

automates the process, by surfing the web, creating indexes for terms extracted from the pages, and then using high-powered computers to report matches to the users. Except for limits due to access barriers, the volume of possibly relevant references is impressive. However, the result is typically quite poor in precision. Since the entire web is too large to be scanned frequently, references might be out of date, and when content has changed slightly, redundant references are presented. Context is ignored, so that when seeking, say, a song title incorporating the name of a town, information about the town is returned as well.
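The indexing step can be sketched as an inverted index from terms to pages. This is an illustration of the general technique, not Alta Vista's implementation; the two pages and the function names are invented.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercased term to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every term of the query."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical pages: a song title that incorporates the name of a town.
pages = {
    "song": "Lyrics of the song Galveston",
    "town": "Visitor guide to the town of Galveston, Texas",
}
index = build_index(pages)
print(search(index, "galveston"))
print(search(index, "galveston song"))
```

Because the index holds terms without context, the single-term query returns the town page along with the song page; only an additional term separates them.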

Excite

combines some of the features, and also keeps track of queries. If prior queries exist, those results are given priority. Searches are also broadened by using the ontology service of WordNet [Miller:93]. The underlying notion is that customers can be classified, and that customers in the same class will share interests. However, asking similar queries and relating them to individual users is a limited notion, and only sometimes leads to significantly better results. Collecting personal information raises questions of privacy protection.
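Broadening a search with related terms can be sketched as follows. The small synonym table is a hand-made, hypothetical stand-in for a thesaurus lookup such as WordNet; it is not how Excite is known to be implemented.

```python
# Hypothetical synonym table standing in for a thesaurus service.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "film": {"movie"},
}


def broaden(query_terms, synonyms=SYNONYMS):
    """Return the query terms together with any listed synonyms."""
    expanded = set()
    for term in query_terms:
        expanded.add(term)
        expanded |= synonyms.get(term, set())
    return expanded


print(sorted(broaden(["car", "rental"])))
```

A broadened query matches more pages, which helps recall but tends to admit less relevant references.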

Firefly

gives customers control over their profiles. Individuals submit information that will encourage businesses to provide them with information they want [Maes:94]. However, that information is aggregated to create clusters of similar consumers, protecting individual privacy. Businesses can use the system to forward information and advertisements that are appropriate to that cluster. Matching a person to a single customer role is a simplification, since many persons have multiple roles. At times they may be a professional customer, seeking business information, at other times they may pursue their sports hobby, and subsequently they may plan a vacation for their family. Unless these customer roles can be distinguished, the clustering of individuals is greatly weakened.

Alexa

collects not only references, but also the webpages themselves. This allows Alexa to present information that has been deleted from the source files. Ancillary information about web pages is also provided, such as the author organization, the extent, and the number of links referring to the page. Such information helps the customer judge the quality of information on the page. Presenting web pages that have been deleted provides an archival service, although the content may be invalid. The creators of such webpages can request Alexa to stop showing them, for instance if the page contained serious errors or was libelous. Since the inverted links are made available, one can also go to referencing sites.
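The inverted links mentioned above can be derived from the ordinary outgoing links once the pages have been collected. A minimal sketch with an invented three-page graph:

```python
from collections import defaultdict


def invert_links(outlinks):
    """Given page -> pages it links to, return page -> pages linking to it."""
    inlinks = defaultdict(set)
    for source, targets in outlinks.items():
        for target in targets:
            inlinks[target].add(source)
    return dict(inlinks)


# Hypothetical link graph.
outlinks = {"a": ["b", "c"], "b": ["c"], "c": []}
inlinks = invert_links(outlinks)
print(len(inlinks["c"]))  # number of links referring to page c
```

The size of each inverted-link set gives the count of referring links, and the set itself lists the referencing sites.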

Google

ranks each web page according to the total importance of the web pages that refer to it. This definition is circular, and Google performs the required iterative computation to estimate the scaled rank of all pages relative to each other. The effect is that often highly relevant information is returned first. It also returns only pages that match all query terms, which reduces the volume greatly, but may miss relevant pages [PageB:98].
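The circular definition can be resolved by repeated substitution. The sketch below follows the commonly published form of the computation, in which each page divides its rank among the pages it refers to and a damping factor is applied; the damping value of 0.85, the iteration count, and the four-page graph are assumptions for illustration, not details given in these notes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate a rank for every page in a link graph.

    links maps each page to the list of pages it refers to; every
    referenced page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: a, b and d refer to c; c refers to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page c ranks first because three pages refer to it, and page a ranks above b and d because its single referrer is the important page c.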

Junglee

provides integration over diverse sources. By inspecting sources, their formats are discerned, and the information is placed into tables that then can be very effectively indexed. This technology is suitable for fields where there is sufficient demand, so that the customer needs can be understood and served, such as advertisements for jobs and searches for merchandise. Accessing and parsing multiple sources allows, for instance, price comparisons to be produced. Vendors who wish to differentiate themselves based on the quality of their products (see Section 2.2.3) may dislike such comparisons.

Cookies

is not an independent search engine, but a device used by many engines and applications to track users [...] later time by the same or a related application. For instance, a search for some movie, recorded in a cookie, can trigger an advertisement for a similar movie later. The use of cookies moves the storage of user-specific information to the user [...] of use, the `freshness [...] rejecting of cookies and applications that generate cookies.
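The movie example can be sketched with Python's standard http.cookies module. The cookie name last_search, the function names, and the advertisement table are hypothetical; real applications would typically store an identifier and add attributes such as an expiry date.

```python
from http.cookies import SimpleCookie


def remember_search(term):
    """Build the cookie text a server might send after a search."""
    cookie = SimpleCookie()
    cookie["last_search"] = term  # hypothetical cookie name
    return cookie["last_search"].OutputString()


def pick_advertisement(cookie_header, ads):
    """Read the cookie returned by the browser and choose a related ad."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    if "last_search" not in cookie:
        return None
    return ads.get(cookie["last_search"].value)


# Hypothetical table relating an earlier search to a later advertisement.
ads = {"casablanca": "Also showing: The Maltese Falcon"}

header = remember_search("casablanca")   # stored on the user's machine
print(header)
print(pick_advertisement(header, ads))   # used at a later visit
```

The server keeps no per-user record here; the user-specific information travels with the user and is only read when the cookie is presented again.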

This list of techniques can be arbitrarily extended. New ideas in improving the relevance and precision of searches are still developing [Hearst:97]. There are, however, limits to general tools. Three important additional factors conspire against generality, and will require a new level of processing if searching tools are to become effective.

Problems with Searching

Getting all that is relevant = Recall -- Query formulation with alternate terms, paths
Getting only what is relevant = Precision -- Word meanings differ in context -- intersect multiple terms
Getting duplicates (report, paper, book chapter, .. on distinct paths) -- SCAM project
Integration from multiple contexts -- mismatched terms
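Near-duplicates such as a report and its later paper version can be detected by comparing overlapping word sequences. The sketch below uses word shingles and Jaccard similarity as a generic illustration; it is not a description of how the SCAM project works, and the sample texts are invented.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences occurring in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical texts found on distinct paths.
report = "search engines index the web by extracting terms from pages"
paper = "search engines index the web by extracting terms from documents"
other = "the bureaucrat gains little from a successful venture"

print(round(resemblance(report, paper), 2))
print(resemblance(report, other))
```

Pairs whose resemblance exceeds a chosen threshold could be collapsed into one reference; choosing that threshold is a judgment call.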

Institutions and People

Is the government the proper source for Internet support? We discussed the differences between European/Asian and U.S. approaches. To invest in risky ventures requires a setting where the value of the potential gain outweighs the cost of the potential loss. For an investor who has cash that is not otherwise needed -- or who can assemble a group of such people -- the equation is simply:

gain * p(success) * n(success) > loss * p(failure) * n(failure)

gain

loss

p(success)

p(failure)

n(success)

n(failure)

That also seems to be true for a government, since it represents a very large group of individuals, so large that in total they would not be very sensitive to loss. However, the actions are executed by actual people, bureaucrats. An employee of the government, as an individual, is not very tolerant of loss. If money is gained, the bureaucrat will gain little, perhaps a promotion or a 10% raise a few years earlier than would happen otherwise. However, a failure will cause loss of promotion and raises. So for the bureaucrat the value of gain = loss.
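The inequality can be evaluated literally to contrast the two cases. All numbers below are invented, and because these notes do not define the n terms they are simply set to 1 here; that is an assumption, not the author's definition.

```python
def worth_investing(gain, loss, p_success, p_failure, n_success, n_failure):
    """Evaluate the inequality from the notes exactly as written."""
    return gain * p_success * n_success > loss * p_failure * n_failure


# Hypothetical risky venture: it succeeds one time in ten.
p_success, p_failure = 0.1, 0.9

# Investor: a success returns fifty times the amount at stake.
investor = worth_investing(gain=50, loss=1, p_success=p_success,
                           p_failure=p_failure, n_success=1, n_failure=1)

# Bureaucrat: the personal gain is valued no higher than the personal loss.
bureaucrat = worth_investing(gain=1, loss=1, p_success=p_success,
                             p_failure=p_failure, n_success=1, n_failure=1)

print(investor, bureaucrat)
```

With gain equal to loss the inequality reduces to a comparison of the success and failure terms alone, so a venture that usually fails is never worth it to the bureaucrat, however large the gain to the institution.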

There are some ancillary conclusions from that reasoning:

Don't blame the poor bureaucrat, it is the setting and the reward mechanism.
Relieve bureaucrats of responsibility when dealing with them: Never ask them `Can you do X', but rather `How should I do X'.

Notes

See also the references.