Determining Gains Acquired from Word Embedding Quantitatively using Discrete Distribution Clustering
Jianbo Ye (1), Yanran Li (2), Zhaohui Wu (1),
James Z. Wang (1), Wenjie Li (2) and Jia Li (1)
(1)The Pennsylvania State University, USA
(2)The Hong Kong Polytechnic University, Hong Kong
Abstract:
Word embeddings have become widely used in document analysis. While a
large number of models for mapping words to vector spaces have been
developed, it remains undetermined how much net gain can be achieved
over traditional approaches based on bag-of-words. In this paper, we
propose a new document clustering approach by combining any word
embedding with a state-of-the-art algorithm for clustering empirical
distributions. By using the Wasserstein distance between
distributions, the word-to-word semantic relationship is taken into
account in a principled way. The new clustering method is easy to use
and consistently outperforms other methods on a variety of data
sets. More importantly, the method provides an effective framework for
determining when and how much word embeddings contribute to document
analysis. Experimental results with multiple embedding models are
reported.
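
To make the role of the Wasserstein distance concrete, here is a minimal sketch of how two documents could be compared once each is represented as a discrete distribution over word vectors. It is not the paper's clustering algorithm, only the distance between a single pair of documents. The function name `wasserstein_distance`, the toy two-dimensional "embeddings", the uniform word weights, and the choice of Euclidean ground cost are all assumptions made for illustration; the transport problem is solved as a plain linear program with SciPy.

```python
import numpy as np
from scipy.optimize import linprog


def wasserstein_distance(weights_a, vectors_a, weights_b, vectors_b):
    """Optimal-transport cost between two weighted sets of word vectors.

    Illustrative helper (hypothetical name). Each weight vector must sum to 1.
    """
    weights_a = np.asarray(weights_a, dtype=float)
    weights_b = np.asarray(weights_b, dtype=float)
    vectors_a = np.asarray(vectors_a, dtype=float)
    vectors_b = np.asarray(vectors_b, dtype=float)
    m, n = len(weights_a), len(weights_b)

    # Ground cost (assumed): Euclidean distance between every pair of word vectors.
    cost = np.linalg.norm(vectors_a[:, None, :] - vectors_b[None, :, :], axis=2)

    # The transport plan T (m x n) is flattened row-major, so T[i, j] is
    # variable i * n + j. Rows of T must sum to weights_a, columns to weights_b.
    a_eq = np.zeros((m + n, m * n))
    for i in range(m):
        a_eq[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        a_eq[m + j, j::n] = 1.0
    b_eq = np.concatenate([weights_a, weights_b])

    result = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                     bounds=(0, None), method="highs")
    return result.fun


# Toy documents: two words each, with made-up 2-D vectors and uniform weights.
doc_a_vectors = [[0.0, 0.0], [1.0, 0.0]]
doc_b_vectors = [[0.0, 1.0], [1.0, 1.0]]
uniform = [0.5, 0.5]

distance_ab = wasserstein_distance(uniform, doc_a_vectors, uniform, doc_b_vectors)
distance_aa = wasserstein_distance(uniform, doc_a_vectors, uniform, doc_a_vectors)
```

In this toy case every word in the first document sits exactly one unit away from its nearest counterpart in the second, so `distance_ab` should come out close to 1, while `distance_aa` should be close to 0. With real embeddings, words that are semantically close contribute little transport cost even when the two documents share no vocabulary, which is the word-to-word relationship that a bag-of-words comparison cannot capture.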
Full Paper
(PDF, 0.4MB)
Datasets
(ZIP, 15MB)
Source Code for Scripts to Process Data
(link)
Citation:
Jianbo Ye, Yanran Li, Zhaohui Wu, James Z. Wang, Wenjie Li and Jia Li,
``Determining Gains Acquired from Word Embedding Quantitatively using
Discrete Distribution Clustering,'' Proceedings of the Annual Meeting
of the Association for Computational Linguistics, vol. 1,
pp. 1847-1856, Vancouver, Canada, August 2017.
© 2017 Association for Computational Linguistics (ACL).
Personal use of this material is permitted. However,
permission to reprint/republish this material for advertising or
promotional purposes or for creating new collective works for resale
or redistribution to servers or lists, or to reuse any copyrighted
component of this work in other works must be obtained from the ACL.
Last Modified:
April 3, 2017