NSF IIS 0535085 Project Home Page

Scalable, Multimodal Algorithms for Multimedia Information Retrieval

PI: Edward Y. Chang, University of California, Santa Barbara 

Project Duration: September 2006 to August 2009

 

I. Project Summary

 

The principal design goal of a multimedia retrieval system is to return data (images or video clips) that accurately match users' query concepts (e.g., 'animals,' 'landscape,' or 'architecture'). To achieve this goal, the system must first comprehend a user's query concept thoroughly, and then search large repositories efficiently for images or videos that match the concept. These concept-learning and efficient-retrieval requirements pose new scalability challenges for the learning algorithms and indexing methods that do the job. The three major challenges are 1) scarcity of training data, 2) scalability of learning algorithms, and 3) non-traditional, high-dimensional queries. This project addresses these challenges in three separate but interrelated research thrusts.

 

1.    Multimodal active learning algorithms. Keywords and perceptual features are jointly explored for devising active-learning algorithms that can learn a target query-concept based on the concept's characteristics.

2.    Scalable kernel machines. Kernel machines such as kernel PCA, Spectral Clustering, and Support Vector Machines (SVMs) are essential statistical tools for analyzing and annotating media data, but their computational complexity is O(n^3) in the number of training instances n. This research thrust investigates approximate decomposition and parallel algorithms to reduce that computational cost.

3.    Kernel indexers for hyperplane queries.  A concept learned by an SVM is a hyperplane, not a point. This research thrust develops novel high-dimensional indexing methods that support hyperplane queries, working with kernel methods in general and SVMs in particular.

 

The scalable kernel machine thrust improves the efficiency of analyzing and annotating multimedia data. The multimodal active learning thrust uses the resulting analysis and annotations to perform concept-dependent learning.  The kernel indexer thrust makes the other two thrusts efficient at retrieving data that match given criteria. Together, these three integrated research thrusts provide a solid foundation for building large-scale, next-generation, multimedia information-retrieval systems.

 

II. Participants & Collaborators

 

PhD Students

Navneet Panda (CS 2002 - 2006)

Arun Qamra (CS 2002 - 2007)

Wen-Yen Chen (CS, 2005 - 2009)

 

MS Students

Gengxin Miao (ECE, 2006 - 2007)

Jon Chu (MIT, 2007-2008)

Rong Hu (MIT, 2008-2009)

Alice Li (MIT, 2008-2010)

Brian Wong (MIT, 2008)

 

Industry Collaborators

Intel

Google

  

III. Key Results

 

Six scalable algorithms developed by this project are 1) PSVM, 2) CCF, 3) PFP, 4) Parallel Spectral Clustering, 5) PLDA, and 6) KDX.

 

1) PSVM:

Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computation time.  To improve scalability, we developed PSVM, which reduces memory use by performing a row-based, approximate matrix factorization, and which loads only essential data onto each machine for parallel computation.  Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n^2) to O(np/m), and improves computation time to O(np^2/m). Empirical studies on up to 500 machines show PSVM to be effective.  Details are documented in the attached report.
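The row-based factorization can be illustrated with a single-machine sketch (the distribution of rows across m machines is omitted): a pivoted incomplete Cholesky factorization approximates the n-by-n kernel matrix K by H times its transpose, with H of shape n-by-p, touching only one kernel row per step instead of materializing K. The function name and details below are illustrative, not taken from the PSVM code.

```python
import numpy as np

def incomplete_cholesky(kernel, n, p):
    """Greedy pivoted incomplete Cholesky: approximate the kernel matrix
    K (n x n) by H @ H.T with H of shape (n, p), computing one row of K
    per step rather than storing all O(n^2) entries."""
    H = np.zeros((n, p))
    d = np.array([kernel(i, i) for i in range(n)])  # residual diagonal
    for j in range(p):
        i = int(np.argmax(d))                       # pivot: largest residual
        if d[i] <= 1e-12:                           # residual exhausted early
            break
        H[i, j] = np.sqrt(d[i])
        for r in range(n):                          # one kernel row per step
            if r != i:
                H[r, j] = (kernel(r, i) - H[r, :j] @ H[i, :j]) / H[i, j]
        d -= H[:, j] ** 2
        d[i] = 0.0
    return H
```

With p much smaller than n, the SVM solver then works with H instead of K, which is where the O(np/m) memory figure comes from once rows of H are spread over m machines.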

 

2) CCF:

Rapid growth in the amount of data (text and photos) available on social networking sites has made information retrieval increasingly challenging for users.  In this work, we propose a combinational collaborative filtering (CCF) method that performs personalized community recommendation by considering multiple types of co-occurrences in social data simultaneously. CCF fuses semantic and user information, then applies a hybrid training strategy that combines Gibbs sampling with the Expectation-Maximization (EM) algorithm. To handle large-scale datasets, we use parallel computing to speed up model training. Through an empirical study on an Orkut dataset, we show CCF to be both effective and scalable.

 

3) PFP:

Frequent itemset mining (FIM) is a useful tool for discovering frequently co-occurring items. Existing FIM algorithms such as Apriori and FP-Growth can be resource intensive when the mined dataset is huge. We have parallelized FP-Growth on MapReduce.  Our parallel FP-Growth (PFP) algorithm divides a large-scale mining task into independent computational tasks and maps them onto MapReduce jobs. PFP achieves near-linear speedup and, thanks to MapReduce, can recover from machine failures by restarting failed tasks.
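The task-division step can be sketched as follows. This single-process simulation mimics the map phase of a PFP-style pipeline: items are assigned to groups (round-robin here, purely for illustration), and each transaction emits one group-dependent prefix per group, so every shard can later be mined by an independent FP-Growth job. Names and the grouping heuristic are assumptions, not the released implementation.

```python
from collections import Counter

def pfp_shards(transactions, num_groups):
    """Map phase of a PFP-style pipeline (single-process simulation):
    count item frequencies, assign frequent items to groups, and emit
    one group-dependent prefix of each transaction per group."""
    counts = Counter(item for t in transactions for item in t)
    flist = [item for item, _ in counts.most_common()]      # frequency order
    rank = {item: i for i, item in enumerate(flist)}
    group_of = {item: i % num_groups for i, item in enumerate(flist)}
    shards = {g: [] for g in range(num_groups)}
    for t in transactions:
        ts = sorted(t, key=rank.get)          # FP-Growth frequency ordering
        emitted = set()
        for i in range(len(ts) - 1, -1, -1):  # least-frequent suffix first
            g = group_of[ts[i]]
            if g not in emitted:              # emit one prefix per group
                emitted.add(g)
                shards[g].append(ts[: i + 1])
    return shards
```

Mining shard g with ordinary FP-Growth then yields the frequent itemsets whose least-frequent item belongs to group g, and a final reduce step concatenates the per-group results.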

 

4) Parallel Spectral Clustering:

Clustering is a key subroutine for unsupervised learning tasks.  Among clustering algorithms, Spectral Clustering is considered one of the most effective because of its ability to work with kernels, which can model non-linear cluster boundaries. When a dataset is large, Spectral Clustering encounters a quadratic resource bottleneck in computing and storing the n-by-n similarity matrix for n data instances. In our work, we consider the sparsification strategy of retaining only the nearest neighbors of each data instance, and we investigate its parallel implementation. Our parallel implementation, which we call Parallel Spectral Clustering (PSC), provides a systematic solution to challenges ranging from calculating the similarity matrix to efficiently finding eigenvectors. PSC first distributes the n data instances onto p machine nodes. On each node, PSC computes the similarities between local data and the whole set in a way that minimizes disk I/O. The eigenvector matrix is likewise stored across the distributed nodes to reduce per-node memory use. Together with a parallel eigensolver and parallel k-means, PSC achieves better clustering accuracy than approximation approaches such as Nyström, as well as good speedup.
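A single-machine sketch of the sparsification strategy follows, assuming an RBF similarity with a heuristic width and a plain k-means on the spectral embedding; the parallel distribution of the matrix, eigensolver, and k-means is omitted, and all names are illustrative.

```python
import numpy as np

def sparse_spectral_clustering(X, k, t):
    """Sketch: sparsify the similarity matrix to each row's t nearest
    neighbors, then run standard normalized spectral clustering."""
    n = len(X)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / d2[d2 > 0].mean())         # RBF similarity, heuristic width
    np.fill_diagonal(S, 0.0)
    nearest = np.argsort(-S, axis=1)[:, :t]     # t nearest neighbors per row
    mask = np.zeros_like(S, dtype=bool)
    mask[np.arange(n)[:, None], nearest] = True
    S = np.where(mask | mask.T, S, 0.0)         # keep the union; stays symmetric
    dinv = 1.0 / np.sqrt(np.maximum(S.sum(1), 1e-12))
    L = np.eye(n) - dinv[:, None] * S * dinv[None, :]  # normalized Laplacian
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    U = vecs[:, :k]
    U /= np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # k-means on the embedded rows, farthest-point initialization
    centers = [U[0]]
    for _ in range(k - 1):
        dist = np.min([((U - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(U[dist.argmax()])
    centers = np.stack(centers)
    for _ in range(50):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.stack([U[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels
```

Keeping only t neighbors per row reduces the similarity matrix from O(n^2) to O(nt) nonzeros, which is what makes distributing it across nodes practical.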

 

5) PLDA:

Latent Dirichlet Allocation (LDA) was first proposed by Blei, Ng, and Jordan to model documents. Each document is modeled as a mixture of K latent topics, where each topic k is a multinomial distribution over a V-word vocabulary. Given an input corpus W, the LDA learning process calculates a maximum-likelihood estimate of the model parameters; given the learned model, we can infer topic distributions for arbitrary documents. LDA also runs into scalability issues because of its large word-document matrix.  Our parallel LDA implementation (PLDA) distributes Gibbs sampling across multiple machines, and supports both MPI and MapReduce. Our experiments on a Wikipedia dataset show promising scalability: PLDA achieved a 164-fold speedup on 256 machines.
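A minimal single-machine collapsed Gibbs sampler shows the sweep that PLDA parallelizes: documents are partitioned across machines, each machine runs this sweep over its own documents, and the shared topic-word counts are merged after each iteration. Hyperparameter values and function names here are illustrative assumptions.

```python
import numpy as np

def gibbs_lda(docs, V, K, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Minimal collapsed Gibbs sampler for LDA on one machine; docs is a
    list of word-id lists.  Returns the K x V topic-word count matrix."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # topic assignments
    ndk = np.zeros((len(docs), K))                        # doc-topic counts
    nkw = np.zeros((K, V))                                # topic-word counts
    nk = np.zeros(K)                                      # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                               # remove current token
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # collapsed Gibbs conditional for this token's topic
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    return nkw
```

In PLDA, the document loop is what gets partitioned: each machine keeps its own ndk rows locally, while nkw and nk are summed across machines (via an MPI reduction or a MapReduce round) at the end of every iteration.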

 

6) KDX:

KDX is a kernel indexer that organizes data in infinite-dimensional feature spaces for fast retrieval. We showed that indexing can be performed effectively even in an infinite-dimensional space.  This is accomplished by projecting data onto the surface of a hypersphere, and then dividing the surface into concentric regions for indexing.
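The hypersphere observation can be sketched as follows: with a normalized kernel (K(x, x) = 1), every instance lies on the unit hypersphere in feature space, so its angle to a chosen pivot instance is computable from kernel values alone and can be bucketed into concentric rings. This is an illustrative sketch, not the KDX data structure itself.

```python
import numpy as np

def kdx_rings(pivot_kernel_values, num_rings):
    """With a normalized kernel (K(x, x) = 1), K(x_i, pivot) is the cosine
    of the angle between x_i and the pivot on the unit hypersphere in
    feature space; bucketing that angle yields concentric rings."""
    cosines = np.clip(np.asarray(pivot_kernel_values), -1.0, 1.0)
    angles = np.arccos(cosines)                       # angle to pivot, in [0, pi]
    edges = np.linspace(0.0, np.pi, num_rings + 1)    # equal-width ring boundaries
    rings = np.searchsorted(edges, angles, side='right') - 1
    return np.clip(rings, 0, num_rings - 1)
```

A hyperplane query then only needs to inspect the rings the hyperplane intersects, which is the pruning a kernel indexer exploits; the real indexer chooses ring boundaries more carefully than the equal-width buckets assumed here.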

 

IV. Publications

 

Publications are listed by the three thrusts of this research project.

 

Multimodal active learning algorithms.

1.    Navneet Panda, King-Shy Goh, Edward Y. Chang: Active learning in very large databases. Multimedia Tools Appl. 31(3): 249-267 (2006)

2.    Steven C. H. Hoi, Michael R. Lyu, Edward Y. Chang: Learning the unified kernel machines for classification. ACM KDD 2006: 187-196

 Scalable kernel machines and data mining algorithms.

1.       E. Y. Chang, et al., Parallel SVMs on Distributed Computers, NIPS 2007.

2.       Gang Wu, Edward Y. Chang, Yen-Kuang Chen, Christopher Hughes: Incremental approximate matrix factorization for speeding up support vector machines. ACM KDD 2006: 760-766.

3.       Nonparametric Regression Using Reproducing Kernel Hilbert Spaces, Zhihua Zhang, Gang Wu and Edward Y. Chang, IEEE Transactions on Neural Networks, September 2007.

4.       PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications, AAIM, San Francisco, June 2009.

5.       Discovery of Community User Latent Behavior, Wen-Yen Chen, et al., WWW 2009, Spain, April 2009.

6.       CCF: Combinational Collaborative Filtering for Personalized Community Recommendation, Wen-Yen Chen, Dong Zhang, and E. Y. Chang, ACM KDD, August 2008.

7.       Parallel Spectral Clustering, Yangqiu Song, Wen-Yen Chen, Hongjie Bai, Chih-Jen Lin, Edward Y. Chang, ECML, Belgium, September 2008.

8.       PFP: Parallel FP-Growth for Query Recommendation, H. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, ACM Recommender Systems, Lausanne, October 2008.

 Kernel indexers for hyperplane queries. 

1.    Navneet Panda, Edward Y. Chang: KDX: An Indexer for Support Vector Machines. IEEE Trans. Knowl. Data Eng. 18(6): 748-763 (2006)

2.    RCMap: Efficiently Creating High-Quality Euclidean Embeddings, A. Qamra and E. Y. Chang, SIAM International Conference on Data Mining (SDM), Minneapolis, Minnesota, April 2007.

3.    Navneet Panda, Edward Y. Chang: Efficient top-k hyperplane query processing for multimedia information retrieval. ACM Multimedia 2006: 317-326

4.    Navneet Panda, Edward Y. Chang, Gang Wu: Concept boundary detection for speeding up SVMs. ICML 2006: 681-688

 

V. Invited Presentations (9/06 - 8/09)

 

1.       Confucius and "its" Intelligent Disciples (keynote), the 18th ACM International Conference on Information and Knowledge Management (CIKM), November 2009.

2.       Parallel Algorithms for Mining Large-scale Data (tutorial), ACM CIKM, November 2009.

3.       Information Extraction Meets Relational Databases (panelist), ACM CIKM, November 2009.

4.       Parallel Algorithms for Mining Large-scale Multimedia Data (tutorial), ACM International Conference on Multimedia, October 2009.

5.       Parallel Algorithms for Mining Large-scale Multimedia Data (invited lecture), Sino-German Summer School on Multimedia Computing, Tsinghua U, Beijing, September 2009.

6.       Innovation with Science and Humanity (distinguished lecture), New Computer Science Graduate Student Orientation, Tsinghua University, Beijing, September 2009.

7.       Confucius and "its" Intelligent Disciples (keynote), the 5th International Conference on Advanced Data Mining and Applications, Beijing, August 2009.

8.       New Frontier in Large-scale Genome Data Analysis, Genomics Institute, Hong Kong, August 2009.

9.       Innovation with Science and Humanity (invited talk), IEEE 125th Anniversary Student Congress, Singapore, July 2009.

10.    Parallel Algorithms for Mining Large-scale Data (invited talk), NSF European Workshop on Mining Massive Datasets (EMMDS), Copenhagen, July 2009.

11.    Large-scale Collaborative Filtering (keynote), The 5th AAIM Conference, San Francisco, June 2009.

12.    Large-scale Photo Annotation Using Collective Wisdom of Data and Users (keynote), IEEE International Conference on Multimedia Computing and Systems, Ouarzazate, Morocco, April 2009.

13.    Parallel Algorithms for Mining Large-Scale Data (keynote), Workshop of Modeling, Mining and Managing Evolving Social Networks (in conjunction with ICDE), Shanghai, March 2009.

14.    Large-scale Collaborative Filtering (invited lecture, Andrew Yao's class), Tsinghua University, March 2009.

15.    From Machine Learning to Human Innovation (distinguished lecture), Public Lecture on Mathematics, Hong Kong, Feb. 2009.

16.    Large-scale Machine Learning Algorithm (invited lecture), Symposium of Machine Learning and Bioinformatics, Hong Kong, Feb. 2009.

17.    Large-scale Machine Learning Algorithms (invited talk), National Taiwan University/Academia Sinica, Taipei, December 2008.

18.    Beyond Search---Computational Intelligence for the Web (invited talk), NIPS, Vancouver/Whistler, December 2008.

19.    From Machine Learning to Human Innovation (invited talk), Annual Meeting of Machine Learning and Applications, Nanjing, November 2008.

20.    Large-scale Collaborative Filtering (invited talk), UC Berkeley EECS/Math Seminar, Berkeley, October 2008.

21.    Social Network Open Platform Strategies and Applications (invited talk), CSDN, Shanghai, September 2008.

22.    Organizing Multimedia Data Socially with Scalable Algorithms (keynote), ACM International Conference on Image and Video Retrieval, Niagara Falls, July 2008.

23.    Large-scale Social-Graph Mining, Challenges and Scalable Solutions (tutorial), MMDS, Workshop on Algorithms for Modern Massive Data Sets, Stanford University, June 2008.

24.    Rich Media and Web 2.0 (panel moderator), WWW, Beijing, April 2008.

25.    Massive Mining on Social Graphs (keynote), Social Network Data Mining Workshop, WWW, Beijing, April 2008.

26.    Organizing World's Information, Socially (invited talk), Stanford InfoSeminar, Computer Science Department, Stanford University, March 14th, 2008.

27.    Parallel, Combinational Collaborative Filtering (invited talk), HP Labs, Palo Alto, March 2008.

28.    Organizing World's Information, Socially (invited talk), Chinese University Hong Kong, Hong Kong, December 2007.

29.    Media Sharing on Social Networks (keynote), Pacific-rim Multimedia Conference, Hong Kong, December 2007.

30.    Web-scale Multimedia Data Management: Challenges and Remedies (keynote), Visual and Multimedia Digital Libraries, sponsored by European Commission Networks of Excellence, Modena, Italy, September 2007.

31.    Challenges of and Remedies for Large-scale Multimedia Information Retrieval (keynote), CVDB in conjunction with ACM SIGMOD, Beijing, June 2007.

32.    Internet, Past and Future (panel moderator), Internet+ Conference, Beijing, March 2007.

33.    Web 2.0 and Multimedia (keynote), International Conf. on Multimedia Modeling, Singapore, January 2007.

34.    Parallel Support Vector Machines, Google Tech Talk, Mountain View, December 2006.

35.    Unified & Scalable Learning for Multimedia Information Retrieval (keynote), MIR Workshop at ACM MM Conf., Santa Barbara, October 2006.

36.    Web 2.0, Synergy and Challenge (panelist), ACM MM Conf., Santa Barbara, October 2006.

 

VI. Open Source

 

We released the PSVM code as open source in December 2007 and the PLDA code in April 2009.  The following links lead to the download sites and also provide pointers to relevant papers and datasets.

 

PSVM Open Source

 

PLDA Open Source