In natural language processing, a topic model is a type of probabilistic, neural, or algebraic model for discovering the abstract topics that occur in a collection of documents. Topic modeling is a frequently used text mining tool for discovering hidden semantic features and structures in a text. The topics produced by topic models are generated through a variety of mathematical frameworks, including probabilistic generative models, matrix factorization methods based on word co-occurrence, and clustering algorithms applied to semantic embeddings.

Topic models are commonly used to organize and discover latent features in large collections of unstructured text and other forms of big data. Beyond text mining, topic models have also been used to uncover latent structures in fields such as genetic information, bioinformatics, computer vision, and social networks.

History

An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998. Another, called probabilistic latent semantic analysis (PLSA), was created by Thomas Hofmann in 1999. Latent Dirichlet allocation (LDA), perhaps the most common topic model currently in use, is a generalization of PLSA. Developed by David Blei, Andrew Ng, and Michael I. Jordan in 2002, LDA introduces sparse Dirichlet prior distributions over document-topic and topic-word distributions, encoding the intuition that each document covers a small number of topics and that each topic uses a small number of words. Other topic models are generally extensions of LDA, such as Pachinko allocation, which improves on LDA by modeling correlations between topics in addition to the word correlations which constitute topics. Hierarchical latent tree analysis (HLTA) is an alternative to LDA: it models word co-occurrence using a tree of latent variables, and the states of the latent variables, which correspond to soft clusters of documents, are interpreted as topics.
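The effect of a sparse Dirichlet prior can be illustrated directly. A minimal sketch in Python follows; the topic count, concentration values, and the 5% threshold are all illustrative choices, not taken from any particular paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# A symmetric Dirichlet prior with concentration alpha < 1 puts most
# probability mass on a few topics per document, while alpha > 1 spreads
# mass evenly across all topics.
n_topics = 10
sparse_theta = rng.dirichlet([0.1] * n_topics)   # LDA-style sparse prior
dense_theta = rng.dirichlet([10.0] * n_topics)   # dense, near-uniform prior

# Under the sparse prior, only a handful of topics carry noticeable mass.
print(sum(p > 0.05 for p in sparse_theta))  # few topics dominate
print(sum(p > 0.05 for p in dense_theta))   # mass spread over many topics
```

The same contrast holds for the topic-word distributions: a sparse prior there encodes the intuition that a topic draws on a small portion of the vocabulary.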

Topic models for context information

Approaches for temporal information include Block and Newman's determination of the temporal dynamics of topics in the Pennsylvania Gazette during 1728–1800. Griffiths & Steyvers used topic modeling on abstracts from the journal PNAS to identify topics that rose or fell in popularity from 1991 to 2001, while Lamba & Madhusudhan used topic modeling on full-text research articles retrieved from the DJLIT journal from 1981 to 2018. In the field of library and information science, Lamba & Madhusudhan applied topic modeling to different Indian resources such as journal articles and electronic theses and dissertations (ETDs). Nelson analyzed change in topics over time in the Richmond Times-Dispatch to understand social and political changes and continuities in Richmond during the American Civil War. Yang, Torget and Mihalcea applied topic modeling methods to newspapers from 1829 to 2008. Mimno used topic modelling with 24 journals on classical philology and archaeology spanning 150 years to examine how topics in the journals changed over time and how the journals became more similar or different over time.

Yin et al. introduced a topic model for geographically distributed documents, where document positions are explained by latent regions which are detected during inference.

Chang and Blei included network information between linked documents in the relational topic model, to model the links between websites.

The author-topic model by Rosen-Zvi et al. models the topics associated with authors of documents to improve the topic detection for documents with authorship information.

HLTA was applied to a collection of recent research papers published at major AI and machine learning venues. The resulting topics are used to index the papers for researchers, conference organizers, and journal editors.

To improve the qualitative aspects and coherence of generated topics, some researchers have explored the efficacy of "coherence scores", which measure how well computer-extracted clusters (i.e. topics) align with a human benchmark. Coherence scores are also used as metrics for choosing the number of topics to extract from a document corpus.
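As an illustration, one widely used coherence measure, UMass coherence, can be computed from document co-occurrence counts alone. The following is a minimal sketch on a toy corpus; the documents and word pairs are invented for the example:

```python
import math
from itertools import combinations

# Toy corpus of documents represented as word sets; a real corpus would be
# far larger. UMass coherence scores a topic's top words by how often they
# co-occur in documents: sum over ordered word pairs of
# log((D(w1, w2) + 1) / D(w1)), where D counts containing documents.
# Scores are <= 0; values closer to 0 indicate a more coherent topic.
docs = [
    {"cell", "gene", "protein"},
    {"gene", "protein", "expression"},
    {"cell", "protein"},
    {"market", "stock", "price"},
    {"stock", "price", "trade"},
]

def umass_coherence(top_words, docs):
    def d(*words):  # number of documents containing all given words
        return sum(all(w in doc for w in words) for doc in docs)
    score = 0.0
    for w1, w2 in combinations(top_words, 2):  # words ordered by frequency
        score += math.log((d(w1, w2) + 1) / d(w1))
    return score

# A topic drawn from one domain scores better than a mixed-domain one.
print(umass_coherence(["protein", "gene"], docs))   # → 0.0
print(umass_coherence(["protein", "stock"], docs))  # ≈ -1.10
```

In practice such a score is averaged over topics and compared across models fitted with different numbers of topics.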

Algorithms

In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for a maximum likelihood fit. A survey by D. Blei describes this suite of algorithms. Several groups of researchers, starting with Papadimitriou et al., have attempted to design algorithms with provable guarantees: assuming that the data were actually generated by the model in question, they try to design algorithms that provably find the model that was used to create the data. Techniques used here include singular value decomposition (SVD) and the method of moments. In 2012 an algorithm based on non-negative matrix factorization (NMF) was introduced that also generalizes to topic models with correlations among topics.
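A minimal sketch of the NMF route (not the 2012 algorithm itself): factor a term-document count matrix with the classic Lee–Seung multiplicative updates and read topics off the factor columns. The vocabulary and counts below are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Factor a term-document count matrix V (terms x documents) as V ≈ W @ H,
# where each column of W is a "topic" (weights over terms) and each column
# of H gives a document's topic weights.
vocab = ["gene", "cell", "protein", "stock", "market", "price"]
V = np.array([
    [3, 2, 0, 0],   # gene
    [2, 3, 0, 0],   # cell
    [3, 2, 1, 0],   # protein
    [0, 0, 3, 2],   # stock
    [0, 1, 2, 3],   # market
    [0, 0, 2, 2],   # price
], dtype=float)

k = 2  # number of topics
W = rng.random((V.shape[0], k))
H = rng.random((k, V.shape[1]))

# Lee-Seung multiplicative updates for the Frobenius objective ||V - WH||^2;
# they preserve non-negativity of W and H at every step.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
    W *= (V @ H.T) / (W @ H @ H.T + 1e-9)

# Each topic is summarized by its highest-weight terms.
for t in range(k):
    top = np.argsort(W[:, t])[::-1][:3]
    print([vocab[i] for i in top])
```

On this block-structured toy matrix the two recovered topics separate the biology terms from the finance terms, which is the behavior the word co-occurrence factorization is designed to exploit.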

Since 2017, neural networks have been leveraged in topic modeling to improve the speed of inference, leading to further advancements such as vONTSS, which allows humans to incorporate domain knowledge via weakly supervised learning.

In 2018, a new approach to topic models was proposed based on the stochastic block model.

Topic modeling has also leveraged large language models (LLMs) through contextual embeddings and fine-tuning.
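The embedding-based approach can be sketched without a real model: given contextual embeddings of documents, a clustering step recovers topic-like groups. Everything below is an illustrative stand-in; the 2-D "embeddings" replace what an actual LLM encoder would produce:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated document embeddings: two well-separated Gaussian clusters
# standing in for documents about two latent topics.
embeddings = np.vstack([
    rng.normal(loc=[0, 0], scale=0.1, size=(5, 2)),   # latent topic A
    rng.normal(loc=[3, 3], scale=0.1, size=(5, 2)),   # latent topic B
])

def kmeans_two(X, iters=20):
    # Two-cluster k-means with a deterministic farthest-point initialization:
    # start from the first point and the point farthest from it.
    far = np.argmax(((X - X[0]) ** 2).sum(axis=1))
    centers = np.stack([X[0], X[far]])
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(2):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans_two(embeddings)
print(labels)  # → [0 0 0 0 0 1 1 1 1 1]
```

A real pipeline would then label each cluster by its most representative documents or terms; this sketch only shows the clustering step that turns embeddings into topic assignments.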

Applications of topic models

To quantitative biomedicine

Topic models are also used in other contexts. For example, uses of topic models in biology and bioinformatics research have emerged. Recently, topic models have been used to extract information from datasets of genomic samples from cancers. In this case, topics are biological latent variables to be inferred.

To analysis of music and creativity

Topic models can be used to analyze continuous signals such as music. For instance, they have been used to quantify how musical styles change over time and to identify the influence of specific artists on later music creation.

Further reading

  • Steyvers, Mark; Griffiths, Tom (2007). In Landauer, T.; McNamara, D.; Dennis, S.; et al. (eds.). Psychology Press. ISBN 978-0-8058-5418-3.
  • Blei, D.M.; Lafferty, J.D. (2009).
  • Blei, D.; Lafferty, J. (2007). "A correlated topic model of Science". Annals of Applied Statistics. 1 (1): 17–35.
  • Mimno, D. (April 2012). Journal on Computing and Cultural Heritage. 5 (1): 1–19.
  • Marwick, Ben (2013). In Yanchang, Zhao; Yonghua, Cen (eds.). Data Mining Applications with R. Elsevier. pp. 63–93. ISBN 978-0-12-411511-8.
  • Jockers, Matthew L. (2010). Posted 19 March 2010.
  • Drouin, J. (2011). Ecclesiastical Proust Archive. Posted 17 March 2011.
  • Templeton, C. (2011). Maryland Institute for Technology in the Humanities Blog. Posted 1 August 2011.
  • Griffiths, T.; Steyvers, M. (2004). Proceedings of the National Academy of Sciences. 101 (Suppl 1): 5228–35.
  • Yang, T.; Torget, A.; Mihalcea, R. (2011). Topic Modeling on Historical Newspapers. The Association for Computational Linguistics, Madison, WI. pp. 96–104.
  • Block, S. (January 2006). Common-place: The Interactive Journal of Early American Life. 6 (2).
  • Newman, D.; Block, S. (March 2006). Journal of the American Society for Information Science and Technology. 57 (5): 753–767.

External links

  • Mimno, David.
  • Brett, Megan R. Journal of Digital Humanities.
  • Video of a Google Tech Talk presentation by Alice Oh on topic modeling with LDA
  • Video of a Google Tech Talk presentation by David M. Blei
  • Video of a presentation by Brandon Stewart, 14 June 2010
  • Shawn Graham, Ian Milligan, and Scott Weingart. The Programming Historian. Archived from the original on 2014-08-28.
  • Blei, David M.
  • Example of using LDA for topic modelling