Published May 2015 | Version Submitted
Journal Article Open

A Spectral Algorithm for Latent Dirichlet Allocation

Abstract

Topic modeling is a generalization of clustering that posits that observations (words in a document) are generated by multiple latent factors (topics), as opposed to just one. The increased representational power comes at the cost of a more challenging unsupervised learning problem: estimating the topic-word distributions when only the words are observed and the topics are hidden. This work provides a simple and efficient learning procedure that is guaranteed to recover the parameters for a wide class of multi-view models and topic models, including latent Dirichlet allocation (LDA). For LDA, the procedure correctly recovers both the topic-word distributions and the parameters of the Dirichlet prior over the topic mixtures, using only trigram statistics (i.e., third-order moments, which may be estimated from documents containing just three words). The method is based on an efficiently computable orthogonal tensor decomposition of low-order moments.
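To make the phrase "orthogonal tensor decomposition" concrete, here is a minimal sketch in Python. It assumes the low-order moments have already been transformed into a symmetric third-order tensor of the form T = sum_i w_i v_i ⊗ v_i ⊗ v_i with orthonormal v_i and positive w_i. The abstract does not say which decomposition routine the paper uses, so the power iteration with deflation below is a swapped-in standard technique chosen for illustration, and the function names (build_orthogonal_tensor, tensor_power_iteration, decompose) are hypothetical.

```python
import numpy as np


def build_orthogonal_tensor(weights, vectors):
    """Return T = sum_i w_i * v_i (x) v_i (x) v_i.

    `vectors` holds the v_i as columns and is assumed to be orthonormal.
    """
    return np.einsum("i,ai,bi,ci->abc", weights, vectors, vectors, vectors)


def tensor_power_iteration(T, n_iter=100, n_restarts=10, seed=0):
    """Find one (weight, vector) pair of an orthogonally decomposable tensor.

    Illustrative routine, not taken from the paper: repeat u <- T(I, u, u)
    followed by normalization, and keep the restart with the largest
    value of T(u, u, u).
    """
    rng = np.random.default_rng(seed)
    dim = T.shape[0]
    best_val, best_vec = -np.inf, None
    for _ in range(n_restarts):
        u = rng.standard_normal(dim)
        u /= np.linalg.norm(u)
        for _ in range(n_iter):
            u = np.einsum("abc,b,c->a", T, u, u)  # contract two modes with u
            u /= np.linalg.norm(u)
        val = float(np.einsum("abc,a,b,c->", T, u, u, u))
        if val > best_val:
            best_val, best_vec = val, u
    return best_val, best_vec


def decompose(T, k):
    """Recover k (weight, vector) pairs by power iteration plus deflation."""
    T = T.copy()
    pairs = []
    for _ in range(k):
        lam, v = tensor_power_iteration(T)
        pairs.append((lam, v))
        # Deflate: subtract the recovered rank-one component.
        T -= lam * np.einsum("a,b,c->abc", v, v, v)
    return pairs
```

Calling decompose(build_orthogonal_tensor(w, Q), k) on a synthetic tensor built from orthonormal columns Q and positive weights w should return the weights and columns up to ordering and numerical error. Estimating the tensor from trigram counts and mapping the result back to topic-word distributions are separate steps that this sketch does not cover.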

Additional Information

© 2014 Springer Science+Business Media New York. Received: 01 October 2013; Accepted: 12 June 2014; First Online: 03 July 2014.

We thank Kamalika Chaudhuri, Adam Kalai, Percy Liang, Chris Meek, David Sontag, and Tong Zhang for valuable insights. We also thank Rong Ge for sharing preliminary results (in [8]) and the anonymous reviewers for their comments, suggestions, and pointers to references. Part of this work was completed while DH was a postdoctoral researcher at Microsoft Research New England, and while DPF, YKL, and AA were visiting the same lab. AA is supported in part by a Microsoft Faculty Fellowship, NSF CAREER award CCF-1254106, NSF Award CCF-1219234, NSF BIGDATA IIS-1251267, and ARO YIP Award W911NF-13-1-0084.

Attached Files

Submitted - 1204.6703.pdf

Files

1204.6703.pdf (309.8 kB)
md5:765af4b587a7a7de89e0e35dfd93fdcb

Additional details

Identifiers

Eprint ID: 81632
DOI: 10.1007/s00453-014-9909-1
Resolver ID: CaltechAUTHORS:20170920-142816744

Funding

Microsoft Research
NSF: CCF-1254106
NSF: CCF-1219234
NSF: IIS-1251267
Army Research Office (ARO): W911NF-13-1-0084

Dates

Created: 2017-09-20 (created from EPrint's datestamp field)
Updated: 2022-12-22 (created from EPrint's last_modified field)