CS99I
Meeting 10 Notes: Browsers and Search Engines
By Gio Wiederhold,
19Jan2000, 14Feb2002.
Topics Covered briefly
Browsers
Enabled the web for broad usage (Mosaic at UIUC, then Netscape, then
various competitors -- often adding new source formats and plug-ins
for generality -- and eventually MS Internet Explorer).
- Translate HTML and linked material into human-readable
presentations (a toy sketch follows below)
- Translate XML, via a specified XSL stylesheet, and linked material
into human-readable presentations
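To make the translation task concrete, here is a toy sketch, written
in Python for these notes, that turns HTML markup into plain readable
text. Real browsers of course add layout, styling, images, and
scripting; nothing here is an actual browser's code.

    # Toy sketch: render HTML markup as human-readable text.
    from html.parser import HTMLParser

    class TextRenderer(HTMLParser):
        def __init__(self):
            super().__init__()
            self.out = []

        def handle_data(self, data):          # keep the visible text
            if data.strip():
                self.out.append(data.strip())

        def handle_endtag(self, tag):
            if tag in ("p", "h1", "li"):      # block elements end a line
                self.out.append("\n")

    renderer = TextRenderer()
    renderer.feed("<h1>News</h1><p>Apache dominates the server market.</p>")
    print("".join(renderer.out))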
Size of the Web
"A year 2000 study by Inktomi and NEC Research
Institute show that there are at least one billion unique indexable
Web pages on the internet. The details are pretty interesting; for
example, Apache dominates the server market." (17Jan2000)
The actual statistics are at
http://www.inktomi.com/webmap/.
Re 1,000,000,000 documents, from a 1964 book about a near-infinite
library that contained all the 500-page books that could be generated
by permuting the alphabet:
"The certitude that any book exists on the shelves of the library
first led to elation, but soon the realization that it was unlikely
to be found converted the feelings to a great depression." [Jorge
Luis Borges: The Library of Babel]
------------------------ Is More Better?
Search Engines
All search engines must in some way carry out the following tasks:
- Collect web pages
- by human submission, evaluation, selection
- by crawling: going from one page to another, following embedded
hyperlinks to their leaves (where to start?) -- see the sketch after
this list
- Not well covered: pages that are dynamically created from databases,
often by filling in forms:
the hidden web
- link them so they can be searched
- create indexes
- categorize them according to some ontology
- rank them
- by frequency of term
- by relative frequency of term with respect to all documents
- by value of references
- by human assessment
- by frequency of use (requires dynamic access, as with DoubleClick)
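As an illustration of the collection task, here is a minimal crawler
sketch in Python (standard library only; the seed URL is just a
placeholder). A production crawler would add politeness delays,
robots.txt handling, and persistent storage -- and would still miss
the hidden web noted above.

    # Minimal breadth-first crawler: collect pages, follow their links.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":                    # embedded hyperlinks
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, max_pages=10):
        seen, queue, pages = set(), deque([seed]), {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue                      # unreachable page: skip it
            pages[url] = html                 # collect the page
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:         # follow links onward
                absolute = urljoin(url, link)
                if absolute.startswith("http"):
                    queue.append(absolute)
        return pages

    collected = crawl("http://example.com/")  # placeholder seed
    print(len(collected), "pages collected")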
These techniques interact with each other; some examples follow.
Search Techniques
[extract from
Gio Wiederhold: Trends for the Information Technology Industry;
Stanford University, April 1999.]
There is a wide variety of search techniques available. They are
rarely clearly explained to the customers, perhaps because a better
understanding might cause customers to move to other searches. Since
the techniques differ, results will differ as well, but comparisons
are typically based on recall rather than on precision. Getting more
references always improves recall, but assessing precision formally
requires an analysis of relevance, and knowing what has been missed,
which is an impossible task given the size and dynamics of the
web.
Potentially more relevant results can be obtained by intersecting
the results from a variety of search techniques, although recall is
then likely to suffer.
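A toy computation makes the two measures concrete. Note that it must
posit the full set of relevant documents -- exactly what is
unknowable on the web; all document names are invented.

    # Recall vs. precision on a toy collection.
    relevant  = {"d1", "d2", "d3", "d4"}      # all truly relevant documents
    retrieved = {"d2", "d3", "d7", "d9"}      # what one engine returned

    precision = len(relevant & retrieved) / len(retrieved)   # 2/4 = 0.50
    recall    = len(relevant & retrieved) / len(relevant)    # 2/4 = 0.50

    # Intersecting with a second engine's results keeps only pages both
    # agree on: precision rises to 1/1 = 1.0, recall drops to 1/4 = 0.25.
    retrieved_b  = {"d3", "d4", "d8"}
    intersection = retrieved & retrieved_b                   # {"d3"}
    print(precision, recall, intersection)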
We briefly describe below the principal techniques used by some
well-known search engines; each can be experienced by visiting
www.<name>.com, substituting the engine's name. This summary can
provide hints for further improvements in the tools.
Yahoo
catalogues useful web sites and organizes them as a hierarchical
list of web addresses. By searching down the hierarchy the field is
narrowed, although at each bottom leaf many entries remain; these can
then be narrowed further by using keywords. Yahoo now employs a staff
of about 200 people, each focusing on some area, who filter web pages
that are submitted for review or located directly, and categorize
those pages into the existing classification. Some of the categories,
such as recent events and entertainment, are dynamic, and aggregate
information when a search is requested.
Note that Yahoo used to use Inktomi as its search engine provider,
but switched in 2001 to Google.
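The hierarchical narrowing can be sketched with a miniature
catalogue; the categories and sites below are invented, not Yahoo's.

    # Narrow by walking down a category hierarchy, then filter by keyword.
    catalogue = {
        "Recreation": {
            "Travel": ["cheapflights.example", "hostelguide.example"],
            "Sports": ["skireport.example", "golfnews.example"],
        },
        "Business": {
            "Jobs": ["joblist.example"],
        },
    }

    def narrow(catalogue, path):
        node = catalogue
        for category in path:                 # each step narrows the field
            node = node[category]
        return node

    leaf = narrow(catalogue, ["Recreation", "Travel"])
    hits = [site for site in leaf if "flight" in site]
    print(hits)                               # ['cheapflights.example']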
Alta Vista
automates the process by surfing the web, creating indexes
for terms extracted from the pages, and then using high-powered
computers to report matches to the users. Except for limits due to
access barriers, the volume of possibly relevant references is
impressive. However, the result is typically quite poor in precision.
Since the entire web is too large to be scanned frequently, references
might be out of date, and when content has changed slightly, redundant
references are presented. Context is ignored, so that when seeking,
say, a song title incorporating the name of a town, information about
the town is returned as well.
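The index-then-match approach can be sketched as an inverted index
from terms to pages. The two page texts are invented; they show how a
bare term match cannot tell a song title from a town.

    # Inverted index: term -> set of pages containing it.
    from collections import defaultdict

    pages = {
        "p1": "twenty four hours from tulsa song lyrics",
        "p2": "tulsa oklahoma city government services",
    }

    index = defaultdict(set)
    for url, text in pages.items():
        for term in text.split():
            index[term].add(url)

    # Context is lost: a query for the song title's town matches both.
    print(index["tulsa"])                     # {'p1', 'p2'}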
Excite
combines some of these features, and also keeps track of
queries. If prior queries exist, those results are given
priority. Searches are also broadened by using the ontology service
of WordNet [Miller:93]. The underlying notion is that customers can
be classified, and that customers in the same class will share
interests. However, asking similar queries and relating them to
individual users is a limited notion, and only sometimes leads to
significantly better results. Collecting personal information also
raises questions of privacy protection.
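The broadening step can be sketched as query expansion. A tiny
hand-made synonym table stands in here for the WordNet service.

    # Broaden a query by adding synonyms for each of its terms.
    synonyms = {
        "car":  ["automobile", "auto"],
        "film": ["movie", "motion picture"],
    }

    def broaden(query_terms):
        broadened = set(query_terms)
        for term in query_terms:
            broadened.update(synonyms.get(term, []))
        return broadened

    print(broaden(["car", "rental"]))
    # {'car', 'automobile', 'auto', 'rental'} (set order may vary)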
Firefly
gives customers control over their profiles. Individuals submit
information that will encourage businesses to provide them with
information they want [Maes:94]. However, that information is
aggregated to create clusters of similar consumers, protecting
individual privacy. Businesses can use the system to forward
information and advertisements that are appropriate to a
cluster. The system simplifies by matching each person to a single
customer role, but many persons have multiple roles. At times they
may be professional customers, seeking business information; at
other times they may pursue a sports hobby; and later they may plan
a vacation for their family. Unless these customer roles can be
distinguished, the clustering of individuals is greatly weakened.
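The clustering notion can be sketched by comparing interest profiles.
The profiles are invented, and real systems use much richer
collaborative filtering.

    # Group customers whose interest vectors point in similar directions.
    import math

    profiles = {
        "ann": {"golf": 3, "travel": 5},
        "bob": {"golf": 4, "travel": 4},
        "eve": {"opera": 5, "travel": 1},
    }

    def cosine(a, b):
        dot = sum(a.get(t, 0) * b.get(t, 0) for t in set(a) | set(b))
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb)

    # ann and bob belong in one cluster; eve does not.
    print(round(cosine(profiles["ann"], profiles["bob"]), 2))   # ~0.97
    print(round(cosine(profiles["ann"], profiles["eve"]), 2))   # ~0.17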
Alexa
collects not only references, but also the web pages
themselves. This allows Alexa to present information that has been
deleted from the source files. Ancillary information about web pages
is also provided, such as the authoring organization, the extent of
the page, and the number of links referring to it. Such information
helps the customer judge the quality of the information on the
page. Presenting web pages that have been deleted provides an
archival service, although the content may be invalid. The creators
of such web pages can request that Alexa stop showing them, for
instance if a page contained serious errors or was libelous. Since
the inverted links are made available, one can also go to
referencing sites.
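Inverting the link graph is itself a simple computation; the links
below are invented.

    # Invert page -> outlinks into page -> referrers, so that for any
    # page the referencing sites can be listed and counted.
    from collections import defaultdict

    links = {
        "a.example": ["c.example"],
        "b.example": ["c.example", "a.example"],
    }

    referrers = defaultdict(list)
    for source, targets in links.items():
        for target in targets:
            referrers[target].append(source)

    print(referrers["c.example"])             # ['a.example', 'b.example']
    print(len(referrers["c.example"]), "pages refer to this page")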
Google
ranks the importance of web pages according to the total importance
of the web pages that refer to them. This definition is circular, and
Google performs the required iterative computation to estimate the
scaled rank of all pages relative to each other. The effect is that
highly relevant information is often returned first. Google also
looks for matches to all terms in a query, which greatly reduces the
volume of results, but may miss relevant pages [PageB:98].
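The iterative computation can be sketched as power iteration over a
tiny invented graph. The damping factor 0.85 is the value commonly
cited in the literature, not necessarily what Google used.

    # A page's rank is a damped sum of the ranks of pages linking to it,
    # each divided by that linking page's number of outgoing links.
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {}
            for p in pages:
                incoming = sum(rank[q] / len(links[q])
                               for q in pages if p in links[q])
                new[p] = (1 - damping) / n + damping * incoming
            rank = new
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    for page, r in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
        print(page, round(r, 3))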
Junglee
provides integration over diverse sources. By inspecting
sources, their formats are discerned, and the information is placed
into tables that then can be very effectively indexed. This technology
is suitable for fields where there is sufficient demand, so that the
customer needs can be understood and served, such as advertisements
for jobs and searches for merchandise. Accessing and parsing multiple
sources allows, for instance, price comparisons to be produced.
differentiate themselves based on the quality of their products (see
Section 2.2.3) may dislike such comparisons.
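The integration step can be sketched as normalizing differently
formatted listings into one table; both source formats and prices
below are invented.

    # Place records from two differently formatted sources into one
    # table, so that a price comparison can be produced.
    source_a = "WidgetPro;19.95"                    # vendor A: name;price
    source_b = {"item": "WidgetPro", "usd": 17.50}  # vendor B: a record

    table = []
    name, price = source_a.split(";")
    table.append({"item": name, "price": float(price), "vendor": "A"})
    table.append({"item": source_b["item"], "price": source_b["usd"],
                  "vendor": "B"})

    cheapest = min(table, key=lambda row: row["price"])
    print(cheapest)  # {'item': 'WidgetPro', 'price': 17.5, 'vendor': 'B'}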
Cookies
are not an independent search engine, but a device used by many
engines and applications to track users. A cookie is stored on the
user's machine, where it can be read at a later time by the same or a
related application. For instance, a search for some movie, recorded
in a cookie, can trigger an advertisement for a similar movie
later. The use of cookies moves the storage of user-specific
information to the user. Users who object to such tracking can
configure their browsers to reject cookies, or can avoid applications
that generate cookies.
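The mechanism can be sketched with Python's standard http.cookies
module; the cookie name and value are invented.

    # Server side: record a search in a cookie sent with the response.
    from http.cookies import SimpleCookie

    outgoing = SimpleCookie()
    outgoing["last_movie_search"] = "casablanca"
    print(outgoing.output())    # Set-Cookie: last_movie_search=casablanca

    # On a later request the browser returns the stored cookie; the
    # application reads it and can, say, pick a related advertisement.
    incoming = SimpleCookie("last_movie_search=casablanca")
    print(incoming["last_movie_search"].value)      # casablanca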
This list of techniques can be arbitrarily extended. New ideas for
improving the relevance and precision of searches are still developing
[Hearst:97]. There are, however, limits to general tools. Three
important additional factors conspire against generality, and will
require a new level of processing if searching tools are to become
effective.
Problems with Searching
- Getting all that is relevant = Recall --
Query formulation with alternate terms, paths
- Getting only what is relevant = Precision --
Word meanings differ in context -- intersect multiple terms
- Getting duplicates (report, paper, book chapter, ... on distinct paths) -- SCAM project (see the sketch after this list)
- Integration from multiple contexts -- mismatched terms
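Re the duplicates problem: the idea behind copy detection can be
sketched with word-shingle overlap. This illustrates the general idea
only, not the actual SCAM algorithm.

    # Near-duplicate detection: Jaccard overlap of k-word shingles.
    def shingles(text, k=3):
        words = text.split()
        return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

    def resemblance(a, b):
        sa, sb = shingles(a), shingles(b)
        return len(sa & sb) / len(sa | sb)

    report  = "search engines rank pages by term frequency and link value"
    chapter = "search engines rank pages by term frequency and citation value"
    print(round(resemblance(report, chapter), 2))   # 0.45: near-duplicate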
Notes
See also the references.