WSQ: Web-Supported (Database) Queries

Roy Goldman, Jennifer Widom

Run The Demo

WSQ (pronounced "wisk") is a new approach for combining the strengths of existing Web search engines and RDBMS technology. With WSQ, we can enhance SQL queries over a local relational database with relevant searches over Google, AltaVista or any other search engine.

Our online demo allows users to ask a restricted yet interesting set of WSQ queries. Users can rank tuples in local databases by how often they appear on the Web and, optionally, by how often they appear along with arbitrary search terms. In this demo, all searches are issued to AltaVista.

As a simple example, we can rank the ACM SIGs by how often they appear on the Web near Knuth:

  1. Rank field: Select ACM SIGs.
  2. Near field: Type Knuth.
  3. Click Search the Web.

Now, for each ACM SIG, WSQ issues a search to AltaVista to count how often that SIG appears on the Web near Knuth. In the results, the red number next to each SIG is the number of matching Web pages reported by AltaVista. You can click each SIG in the results to see the actual URLs supplied by AltaVista for that SIG.
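
The per-tuple search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the WSQ implementation: the function names, the `count_hits` callback standing in for a search engine, and the hit counts are all hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_web_count(
    identifiers: Iterable[str],
    near: str,
    count_hits: Callable[[str], int],
) -> List[Tuple[str, int]]:
    """Issue one search per tuple and sort the tuples by descending hit count."""
    results = []
    for ident in identifiers:
        # One search expression per tuple; with an empty Near field we
        # simply count pages that mention the identifier.
        expression = f'"{ident}" NEAR {near}' if near else f'"{ident}"'
        results.append((ident, count_hits(expression)))
    return sorted(results, key=lambda pair: pair[1], reverse=True)


# Stand-in for a real search engine. These counts are invented for the example.
FAKE_COUNTS = {
    '"SIGACT" NEAR Knuth': 30,
    '"SIGPLAN" NEAR Knuth': 20,
    '"SIGMOD" NEAR Knuth': 5,
}


def fake_count_hits(expression: str) -> int:
    """Look up a made-up hit count; unknown expressions count as zero."""
    return FAKE_COUNTS.get(expression, 0)


ranking = rank_by_web_count(["SIGMOD", "SIGACT", "SIGPLAN"], "Knuth", fake_count_hits)
for name, hits in ranking:
    print(name, hits)
```

Passing the search function in as a parameter keeps the ranking logic independent of any particular engine, so the same loop would work with whatever search backend is available.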

You can try the demo now, try out some sample queries, or read ahead for more detailed instructions.

  1. Rank: Select one of several small local database tables in the Rank field. Choices include U.S. states, European countries, and ACM Special Interest Groups (SIGs). Click the Preview Local Database button to examine the contents of each table (without yet consulting the Web). The Identifier column is the primary text string assumed to identify the tuple on the Web; optionally, the Secondary Identifiers column is an additional disjunctive search expression that is useful for identifying the tuple. For example, among the Stanford DBGroup members, "Jeff Ullman" is the primary identifier, and "Jeffrey Ullman" or "Jeffrey D. Ullman" or "Jeff D. Ullman" is the expression that constitutes the Secondary Identifiers.

  2. Near: Optionally specify in the Near field keywords to be searched for along with each tuple in the local database. Suppose you select the ACM SIGs as the local database. If you supply Knuth in the Near field, then you're creating a query to rank the ACM SIGs by how often each SIG appears on the Web near Knuth. If you leave the Near field empty, then you're creating a query to measure the pure popularity of each SIG on the Web, independent of context. If the Near field is not empty, two additional options are available:
    • Correlation: Correlation between each tuple and the Near expression can be tight or loose. Under tight correlation, the Near expression must appear on the Web in close textual proximity to each tuple identifier (implemented by using the AltaVista near operator in the search). If correlation is loose, we only require that the Near expression and the tuple identifier appear anywhere together on the same Web page (implemented by using the AltaVista and operator in the search).
    • Rankings: Rankings can be absolute or normalized. With absolute rankings, tuples are ranked simply by the number of times they appear on the Web together with the Near expression. With normalized rankings, the number of Web hits for the expression is normalized by the number of times the tuple appears on the Web without the Near expression. The motivation for this approach is best understood by considering the U.S. states. Ranking the states by their popularity on the Web (without a Near expression) shows that some states (such as California, Texas, and New York) appear far more often than others. Now suppose we want to rank each state by how often it appears on the Web near the keyword crime. With absolute rankings, the most popular states will rank highly again, simply because there are so many more Web pages for those states. But we may really be interested in the relative importance of crime to each state, that is, how often the word crime appears near a state relative to the total number of times the state is mentioned. Note that our normalization algorithm currently has its own limitations: it tends to quickly "disqualify" the most popular states from any search.

  3. Search the Web: Click Search the Web to issue your WSQ query.
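
The three steps above can be summarized in a short Python sketch that builds a search expression from a tuple's identifiers and the Near field, then ranks tuples in absolute or normalized mode. Everything here is an assumption made for illustration: the function names are hypothetical, joining the primary and secondary identifiers with OR is a guess at how the disjunction is formed, the operator spellings are a guess at the engine's syntax, and normalization is shown as a plain ratio, which may differ from the algorithm the demo actually uses.

```python
from typing import Dict, List, Sequence, Tuple


def identifier_expression(primary: str, secondary: Sequence[str] = ()) -> str:
    """Quote each identifier and join them into one disjunctive expression."""
    quoted = [f'"{phrase}"' for phrase in [primary, *secondary]]
    if len(quoted) == 1:
        return quoted[0]
    return "(" + " OR ".join(quoted) + ")"


def build_search(
    primary: str,
    secondary: Sequence[str] = (),
    near: str = "",
    correlation: str = "tight",
) -> str:
    """Combine the tuple identifiers with the optional Near expression."""
    ident = identifier_expression(primary, secondary)
    if not near:
        return ident  # Empty Near field: pure popularity of the tuple.
    # Tight correlation uses a proximity operator; loose correlation only
    # requires both parts to appear somewhere on the same page.
    operators = {"tight": "NEAR", "loose": "AND"}
    if correlation not in operators:
        raise ValueError(f"unknown correlation: {correlation!r}")
    return f"{ident} {operators[correlation]} {near}"


def rank(
    hits_with_near: Dict[str, int],
    hits_alone: Dict[str, int],
    normalized: bool = False,
) -> List[Tuple[str, float]]:
    """Rank tuples by raw hit count, or by hits relative to overall popularity."""
    scores = {}
    for name, hits in hits_with_near.items():
        if normalized:
            total = hits_alone.get(name, 0)
            # Assumed normalization: a simple ratio, guarded against zero.
            scores[name] = hits / total if total else 0.0
        else:
            scores[name] = float(hits)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented counts for three placeholder tuples of very different popularity.
with_near = {"BigState": 900, "MidState": 600, "SmallState": 50}
alone = {"BigState": 90000, "MidState": 50000, "SmallState": 1000}

print(build_search("Jeff Ullman", ["Jeffrey Ullman"], near="database", correlation="loose"))
print([name for name, _ in rank(with_near, alone)])
print([name for name, _ in rank(with_near, alone, normalized=True)])
```

With these made-up numbers, the absolute ranking follows overall popularity, while the normalized ranking reverses it, because the least popular tuple has the largest share of its pages matching the Near expression.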
WSQ is described in more detail in "WSQ/DSQ: A Practical Approach for Combined Querying of Databases and the Web" (Postscript) (Acrobat). This paper will appear in the Proceedings of the ACM SIGMOD International Conference on Management of Data in May 2000.

Other questions or comments? Please contact Roy Goldman, royg@cs.stanford.edu.
