"A new study by Inktomi and NEC Research
Institute show that there is at least one billion unique indexable
Web pages on the internet. The details are pretty interesting; for
example, Apache dominates the server market." (17Jan2000)
The actual statistics are at
http://www.inktomi.com/webmap/.
Re 1,000,000,000 documents, from a 1964 book about a near-infinite
library that contained all the 500-page books that could be generated
by permuting the alphabet:
"The certitude that any book exists on the shelves of the library first
led to elation, but soon the realization that it was unlikely to be
found converted the feelings to a great depression. [Luis Borges: The
Infinite Library]"
Search Techniques
[extract from
Gio Wiederhold: Trends for the Information Technology Industry;
Stanford University, April 1999.]
There is a wide variety of search techniques available. They are
rarely clearly explained to the customers, perhaps because a better
understanding might cause customers to move to other searches. Since
the techniques differ, results will differ as well, but comparisons
are typically based on recall rather than on precision. Getting more
references always improves recall, but assessing precision formally
requires an analysis of relevance, and knowing what has been missed,
which is an impossible task given the size and dynamics of the
web.
Potentially more relevant results can be obtained by intersecting
the results from a variety of search techniques, although recall is
then likely to suffer further, since intersecting can only remove references.
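The effect of combining result lists can be checked on a toy example. The following is a minimal sketch, assuming a hypothetical set of relevant documents and two hypothetical engines; on the real web the relevant set is not known, which is exactly the difficulty described above.

```python
def precision(returned: set, relevant: set) -> float:
    """Fraction of the returned documents that are relevant."""
    return len(returned & relevant) / len(returned) if returned else 0.0


def recall(returned: set, relevant: set) -> float:
    """Fraction of the relevant documents that were returned."""
    return len(returned & relevant) / len(relevant) if relevant else 0.0


# Hypothetical ground truth and result sets; none of this is real engine output.
relevant = {"d1", "d2", "d3", "d4"}
engine_a = {"d1", "d2", "d5", "d6"}
engine_b = {"d1", "d2", "d3", "d7"}

combinations = {
    "A": engine_a,
    "B": engine_b,
    "A intersect B": engine_a & engine_b,
    "A union B": engine_a | engine_b,
}

for name, hits in combinations.items():
    print(f"{name:14s} precision={precision(hits, relevant):.2f} "
          f"recall={recall(hits, relevant):.2f}")
```

In this toy data the intersection keeps only documents both engines agree on, so its recall cannot exceed that of either engine, while the union can only add references.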
We briefly describe below the principal techniques used by some
well-known search engines; they can be experienced by invoking
www.name.com. This summary can provide hints for further improvements
in the tools.
Yahoo
catalogues useful web sites and organizes them as a
hierarchical list of web-addresses. By searching down the hierarchy
the field is narrowed, although at each bottom leaf many entries
remain, which can then be further narrowed by using keywords. Yahoo
now employs a staff of about 200 people, each focusing on some area,
who filter web pages that are submitted for review or located
directly, and categorize those pages into the existing
classification. Some of the categories are dynamic, such as recent events
and entertainment, and aggregate information when a search is
requested.
Note that Yahoo uses Inktomi as its search engine provider.
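The narrowing process described for a hierarchical catalogue can be sketched as follows. This is only an illustration: the category names, titles, and example.com addresses are invented, and nothing here reflects how Yahoo actually stores its directory.

```python
# Hypothetical miniature directory: dicts are categories, lists are leaf entries.
directory = {
    "Recreation": {
        "Travel": [
            ("Paris city guide", "http://example.com/paris"),
            ("Budget travel tips", "http://example.com/budget"),
        ],
        "Sports": [
            ("Local soccer leagues", "http://example.com/soccer"),
        ],
    },
    "Science": {
        "Astronomy": [
            ("Star charts", "http://example.com/stars"),
        ],
    },
}


def browse(tree, path):
    """Follow a list of category names down the hierarchy to a leaf."""
    node = tree
    for category in path:
        node = node[category]
    return node


def filter_entries(entries, keyword):
    """Narrow the entries at a leaf to those whose title contains the keyword."""
    keyword = keyword.lower()
    return [(title, url) for title, url in entries if keyword in title.lower()]


leaf = browse(directory, ["Recreation", "Travel"])
print(filter_entries(leaf, "paris"))
```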
Alta Vista
automates the process, by surfing the web, creating indexes
for terms extracted from the pages, and then using high-powered
computers to report matches to the users. Except for limits due to
access barriers, the volume of possibly relevant references is
impressive. However, the result is typically quite poor in precision.
Since the entire web is too large to be scanned frequently, references
might be out of date, and when content has changed slightly, redundant
references are presented. Context is ignored, so that when seeking,
say, a song title incorporating the name of a town, information about
the town is returned as well.
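A term index of this kind can be sketched in a few lines. The pages and URLs below are hypothetical, and the matching rule (any page containing any query term) is an assumption chosen to show why context-free matching hurts precision; it is not a description of Alta Vista's actual implementation.

```python
import re
from collections import defaultdict

# Hypothetical crawled pages (URL -> extracted text).
pages = {
    "http://example.com/song": "Lyrics of the song I Left My Heart in San Francisco",
    "http://example.com/city": "San Francisco city hall opening hours",
    "http://example.com/jazz": "A history of jazz song writing",
}


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map every term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def match_any(index, query):
    """Return every page that contains at least one query term."""
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return hits


index = build_index(pages)
print(sorted(match_any(index, "San Francisco song")))
```

Because the index knows nothing about context, a query for the song title also returns the page about the town.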
Excite
combines some of the features, and also keeps track of
queries. If prior queries exist, those results are given
priority. Searches are also broadened by using the ontology service of
WordNet [Miller:93]. The underlying notion is that customers can be
classified, and that customers in the same class will share interests.
However, asking similar queries and relating them to individual users
is a limited notion, and leads only sometimes to significantly better
results. Collecting personal information raises questions of privacy
protection.
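Broadening a search with related terms can be illustrated as below. The small synonym table is a hand-made, hypothetical stand-in for a thesaurus such as WordNet; the sketch does not call WordNet itself and does not model Excite's query tracking.

```python
# Hypothetical synonym table standing in for a thesaurus lookup.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "film": {"movie"},
}


def broaden(query):
    """Return the query terms together with any listed synonyms."""
    terms = set()
    for term in query.lower().split():
        terms.add(term)
        terms |= SYNONYMS.get(term, set())
    return terms


def search(documents, query):
    """Return ids of documents sharing at least one term with the broadened query."""
    wanted = broaden(query)
    return [doc_id for doc_id, text in documents.items()
            if wanted & set(text.lower().split())]


# Hypothetical document collection.
documents = {
    1: "used automobile prices",
    2: "car rental offers",
    3: "bicycle repair",
}

print(search(documents, "car"))
```

Without broadening, the query "car" would miss document 1, which only uses the word "automobile".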
Firefly
provides customer control over their profiles. Individuals
submit information that will encourage businesses to provide them with
information they want [Maes:94]. However, that information is
aggregated to create clusters of similar consumers, protecting
individual privacy. Businesses can use the system to forward information
and advertisements that are appropriate to that cluster. Matching a
person to a single customer role is a simplification. Many
persons have multiple roles. At times they may be a professional
customer, seeking business information, and at other times they may
pursue their sports hobby, and subsequently they may plan a vacation
for their family. Unless these customer roles can be distinguished,
the clustering of individuals is greatly weakened.
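The weakening effect of merged roles can be made concrete with a toy calculation. This is a sketch under loud assumptions: the three interest dimensions, the cluster vectors, and the use of cosine similarity are all invented for illustration and are not taken from Firefly.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length interest vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical interest dimensions: (business, sports, family travel).
clusters = {
    "business": (1.0, 0.0, 0.0),
    "sports": (0.0, 1.0, 0.0),
    "travel": (0.0, 0.0, 1.0),
}

# One person observed in three distinct roles.
roles = {
    "at work": (1.0, 0.0, 0.0),
    "hobby": (0.0, 1.0, 0.0),
    "vacation planning": (0.0, 0.0, 1.0),
}

# A single profile that averages all roles together.
merged = tuple(sum(v[i] for v in roles.values()) / len(roles) for i in range(3))

for name, centre in clusters.items():
    print(f"merged profile vs {name}: {cosine(merged, centre):.2f}")
print(f"hobby role vs sports: {cosine(roles['hobby'], clusters['sports']):.2f}")
```

In this toy data each separate role matches its cluster perfectly, while the merged profile matches every cluster only moderately, so it is not placed clearly in any of them.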
Alexa
collects not only references, but also the webpages
themselves. This allows Alexa to present information that has been
deleted from the source files. Ancillary information about web pages
is also provided, such as the author organization, the extent, and the
number of links referring to this page. Such information helps the
customer judge the quality of information on the page. Presenting web
pages that have been deleted provides an archival service, although
the content may be invalid. The creators of such webpages can request
Alexa to stop showing them, for instance if the page contained serious
errors or was libelous. Since the inverted links are made available,
one can also go to the referencing sites.
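Inverting links is a simple transformation of the crawled link graph. The sketch below assumes a hypothetical graph of four pages; it shows only the inversion step, not how Alexa gathers or stores its data.

```python
from collections import defaultdict

# Hypothetical link graph: page -> pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "d.html": ["c.html", "a.html"],
}


def invert_links(links):
    """Build the reverse mapping: page -> pages that refer to it."""
    inbound = defaultdict(list)
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)
    return dict(inbound)


inbound = invert_links(links)
for page, referrers in sorted(inbound.items()):
    print(f"{page}: {len(referrers)} referring page(s): {referrers}")
```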
Google
ranks the importance of a web page according to the total
importance of the web pages that refer to it. This definition is circular,
and Google performs the required iterative computation to estimate the
scaled rank of all pages relative to each other. The effect is that
often highly relevant information is returned first. It also looks for
all matches to all terms, which reduces the volume greatly, but may
miss relevant pages [PageB:98].
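The circular definition can be resolved by repeated substitution until the ranks settle. The following is a minimal sketch of such an iteration on a hypothetical four-page graph; the damping value of 0.85, the fixed iteration count, and the handling of pages without outgoing links are assumptions of this sketch, not a statement of Google's production method. The second function illustrates matching all terms and ordering the hits by rank.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate the rank of each page from the ranks of its referrers."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal shares to the pages it cites.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page without links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


def search_all_terms(texts, query, ranks):
    """Keep pages containing every query term, highest rank first."""
    terms = query.lower().split()
    hits = [page for page, text in texts.items()
            if all(term in text.lower().split() for term in terms)]
    return sorted(hits, key=lambda page: ranks[page], reverse=True)


# Hypothetical link graph and page texts.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
texts = {
    "a": "jazz song history",
    "b": "jazz clubs",
    "c": "jazz song lyrics",
    "d": "song contest",
}

ranks = pagerank(links)
print(search_all_terms(texts, "jazz song", ranks))
```

Requiring all terms drops pages "b" and "d" here, which reduces volume but would also miss a relevant page that happens to use different wording.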
Junglee
provides integration over diverse sources. By inspecting
sources, their formats are discerned, and the information is placed
into tables that then can be very effectively indexed. This technology
is suitable for fields where there is sufficient demand, so that the
customer needs can be understood and served, such as advertisements for
jobs and searches for merchandise. Accessing and parsing multiple
sources allows, for instance, price comparisons to be produced. Vendors
who wish to differentiate themselves based on the quality of their
products (see Section 2.2.3) may dislike such comparisons.
Cookies
are not an independent search engine, but a device used by
many engines and applications to track users [...] later time by the
same or a related application. For instance, a search for some movie,
recorded in a cookie, can trigger an advertisement for a similar movie
later. The use of cookies moves the storage of user-specific
information to the user [...] of use, the `freshness [...]
rejecting of cookies and applications that generate cookies.
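The movie example can be sketched with Python's standard http.cookies module. The cookie name last_search, the 30-day lifetime, and the SIMILAR table are hypothetical choices made for this illustration only.

```python
from http.cookies import SimpleCookie

# Hypothetical table of related titles used to pick an advertisement.
SIMILAR = {"casablanca": "The Maltese Falcon"}


def remember_search(term):
    """Build the value of a Set-Cookie header recording the last search term."""
    cookie = SimpleCookie()
    cookie["last_search"] = term
    cookie["last_search"]["max-age"] = 30 * 24 * 3600  # arbitrary 30-day lifetime
    return cookie["last_search"].OutputString()


def recall_search(cookie_header):
    """Read the search term back from a Cookie request header, if present."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    morsel = cookie.get("last_search")
    return morsel.value if morsel is not None else None


def advertisement_for(cookie_header):
    """Pick a related title based on the search stored in the cookie."""
    return SIMILAR.get(recall_search(cookie_header))


print(remember_search("casablanca"))
print(advertisement_for("last_search=casablanca"))
```

The search term is kept by the user's browser rather than by the server, so a user who rejects the cookie leaves the application with nothing to recall.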
This list of techniques can be arbitrarily extended. New ideas in
improving the relevance and precision of searches are still developing
[Hearst:97]. There are, however, limits to general tools. Three
important additional factors conspire against generality, and will
require a new level of processing if searching tools are to become
effective.
Problems with Searching
- Getting all that is relevant = Recall --
Query formulation with alternate terms, paths
- Getting only what is relevant = Precision --
Word meanings differ in context -- intersect multiple terms
- Getting duplicates (report, paper, book chapter, .. on distinct paths) -- SCAM project
- Integration from multiple contexts -- mismatched terms
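The duplicate problem in the list above can be illustrated with a common overlap measure. This sketch compares word shingles with Jaccard similarity; that technique, the shingle length of 3, and the 0.5 threshold are assumptions chosen for illustration and are not claimed to be the method of the SCAM project.

```python
def shingles(text, k=3):
    """Return the set of k-word sequences occurring in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(docs, threshold=0.5, k=3):
    """Return pairs of document names whose shingle overlap reaches the threshold."""
    names = sorted(docs)
    sets = {name: shingles(docs[name], k) for name in names}
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if jaccard(sets[first], sets[second]) >= threshold:
                pairs.append((first, second))
    return pairs


# Hypothetical documents: a report and a lightly edited paper version of it.
docs = {
    "report": "search engines differ in how they index and rank web pages",
    "paper": "search engines differ in how they index and rank the web pages",
    "other": "cookies let applications store user specific data on the client",
}

print(near_duplicates(docs))
```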
Is the government the proper source for Internet support?
We discussed the differences between European/Asian and U.S. approaches.
To invest in risky ventures requires a setting where the value of the potential
gain outweighs the cost of the potential loss. For an investor who has
spare cash -- or who can assemble a group of people with spare cash -- the equation is simply:
gain * p(success) * n(success) > loss * p(failure) * n(failure)
if the amount of gain is >> (100 * ?) than the amount of loss
(the investment made),
then lower values of p(success) versus p(failure) and
n(success) versus n(failure) can be tolerated.
That also seems to be true for a government, since it represents a very
large group of individuals, so large, that in total they wouldn't be
very sensitive to loss. However, the actions are executed by actual
people, bureaucrats. An employee of the government, as an individual,
is not very tolerant of loss. If money is gained, the
bureaucrat will gain little, perhaps a promotion, a 10% raise, a few
years earlier than would happen otherwise. However, a failure will
cause loss of promotion and raises. So for the bureaucrat the value of
gain = loss.
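The inequality can be tried out with a few numbers. The sketch below assumes only two outcomes, so that p(failure) = 1 - p(success), and treats n(success) and n(failure) as plain weights defaulting to 1; the sample values of gain, loss, and probability are invented.

```python
def worth_investing(gain, loss, p_success, n_success=1, n_failure=1):
    """Evaluate gain * p(success) * n(success) > loss * p(failure) * n(failure)."""
    p_failure = 1.0 - p_success  # assumption: success and failure are the only outcomes
    return gain * p_success * n_success > loss * p_failure * n_failure


def break_even_probability(gain, loss):
    """p(success) at which both sides balance, taking n(success) = n(failure)."""
    return loss / (gain + loss)


# Investor: the potential gain is about 100 times the amount invested.
print(worth_investing(gain=100, loss=1, p_success=0.02))
print(break_even_probability(gain=100, loss=1))

# Bureaucrat: the personal gain is roughly equal to the personal loss.
print(worth_investing(gain=1, loss=1, p_success=0.02))
print(break_even_probability(gain=1, loss=1))
```

With a gain of 100 times the loss, a success probability of about 1 in 100 already balances the equation, whereas with gain = loss the venture must be more likely to succeed than to fail.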
There are some ancillary conclusions from that reasoning:
- Don't blame the poor bureaucrat; it is the setting and the reward mechanism.
- Relieve bureaucrats of responsibility when dealing with them:
Never ask them `Can you do X', but rather `How should I do X'.
In business, dealing with people is of paramount importance, and all institutions
have people in their interfaces.
Notes
See also the references.