Intelligent Sampling for Learning Complex Query Concepts
Faculty
Edward Chang

PhD Students
Beitao Li
Kingshy Goh
Yan Meng
Gang Wu
Yi Wu
Huaxin You

Sponsors
IBM Almaden
IBM T.J. Watson
NSF Career
NSF ITR

Collaborators
Simon Tong (Stanford)

Brian Chang
Tim Cheng
Larry Lai 
Yi-Leh Wu
Tony Wu (Morphosoft)

 


For a multimedia search task, a query concept is hard to articulate, and its articulation can be subjective. For instance, in an image search, it is difficult for a user to describe a desired image using low-level features such as color, shape, and texture. In addition, different users may perceive the same image differently. Even when users perceive an image similarly, they may use different vocabulary (i.e., different combinations of low-level features) to depict it. Furthermore, most users are not trained to specify even simple query criteria using, for example, Boolean algebra. To make information access easier and more personal, it is both necessary (for capturing subjective concepts) and desirable (for relieving users of the need to specify complex query concepts) to build intelligent search engines that can quickly learn users' query concepts through relevance feedback.

Traditional learning and relevance feedback techniques, unfortunately, are not suitable for online query-concept learning for at least two reasons.

  • Time and sample constraints. Traditional learning methods such as decision trees and neural networks require a large number of training instances (i.e., samples) and can take a long time (more than a few seconds) to learn a concept. Online users, however, are typically impatient and cannot be expected to wait or to provide a great deal of feedback.
  • Seeding constraint. All traditional relevance feedback methods require users to provide good examples to seed a query. However, finding good seeds is the job of the search engine itself, and this circular requirement leaves the core problem (learning users' query concepts) unsolved.

The goal of the proposed research plan is to make fundamental advances towards intelligent search engines through the development of online query-concept learners. The specific targets are as follows:

  1. To design novel learning algorithms that grasp a user's query concept quickly despite time, sample, and seeding constraints.
  2. To develop techniques that can detect concept drift during a relevance feedback session, and to handle concept drift in the learning algorithms.
  3. To devise multi-resolution image characterization methods for improving both search accuracy and search efficiency.
  4. To ensure that the developed learning algorithms scale with feature dimension, dataset size, and concept complexity.
  5. To validate the developed learning algorithms with experimental data provided by colleagues at IBM Laboratories, Sony, and Benchthalon.
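
To make the first target concrete, the following is a minimal sketch of an online relevance feedback loop that needs no seed examples and asks for only a few labels per round. It is not the project's algorithm: a plain perceptron with uncertainty sampling is swapped in purely for illustration, and every name in it (feedback_session, train_perceptron, user_says_relevant, the two-dimensional feature pool) is a hypothetical stand-in.

```python
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_perceptron(labeled, dim, epochs=20):
    """Fit a linear separator (with a bias term) to the feedback so far."""
    w = [0.0] * (dim + 1)
    for _ in range(epochs):
        for x, y in labeled:
            xb = x + [1.0]  # append a constant 1.0 for the bias
            if y * dot(w, xb) <= 0:  # misclassified: nudge w toward y
                w = [wi + y * xi for wi, xi in zip(w, xb)]
    return w


def feedback_session(pool, user_says_relevant, rounds=5, per_round=4, seed=0):
    """Run a few feedback rounds; return the learned weights and the labels."""
    rng = random.Random(seed)
    dim = len(pool[0])
    unlabeled = list(range(len(pool)))
    labeled = []
    w = [0.0] * (dim + 1)
    for r in range(rounds):
        if r == 0:
            # No seed examples are required: the first screen is random.
            shown = rng.sample(unlabeled, per_round)
        else:
            # Later screens show the items the current model is least sure of
            # (smallest absolute score, i.e. closest to the boundary).
            shown = sorted(
                unlabeled, key=lambda i: abs(dot(w, pool[i] + [1.0]))
            )[:per_round]
        for i in shown:
            unlabeled.remove(i)
            y = 1 if user_says_relevant(pool[i]) else -1
            labeled.append((pool[i], y))
        w = train_perceptron(labeled, dim)
    return w, labeled


# Hypothetical usage: 200 items with two features each, and a simulated user
# whose hidden concept is "the two features sum to more than 1".
gen = random.Random(1)
pool = [[gen.random(), gen.random()] for _ in range(200)]
concept = lambda x: x[0] + x[1] > 1.0

w, labeled = feedback_session(pool, concept)
agreement = sum((dot(w, x + [1.0]) > 0) == concept(x) for x in pool) / len(pool)
print(f"labels requested: {len(labeled)}, agreement with user: {agreement:.2f}")
```

The sketch touches only the time, sample, and seeding constraints. Concept drift, multi-resolution features, and scalability (targets 2 through 4) are outside its scope.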

The project's broader impacts on information retrieval are potentially substantial. First, the rapid proliferation of multimedia content in digital libraries and on the Web underscores the increasing importance of effective multimedia search tools. Second, intelligent query-concept learners will directly or indirectly make traditional text-based information retrieval easier and more personal. Directly, a text collection can employ an intelligent learner to better capture users' query concepts. Indirectly, multimedia data can, for instance, be added to a text collection so that searches can be conducted through interfaces that contain pictures and graphics. Even young students who have not learned Boolean algebra can use images and graphics to search for stories and books. In addition to bringing benefits to education, we believe that this research project will help make information more accessible to underprivileged users who are not yet able to enjoy the full benefits of the information revolution.