lda_utils.eval_metrics

Module Contents

Functions

metric_held_out_documents_wallach09(dtm_test, theta_test, phi_train, alpha, n_samples=10000) Estimation of the probability of held-out documents according to Wallach et al. 2009 [1] using a document-topic estimation theta_test.
metric_cao_juan_2009(topic_word_distrib) Cao Juan, Xia Tian, Li Jintao, Zhang Yongdong, and Tang Sheng. 2009. A density-based method for adaptive LDA model selection.
metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths) Rajkumar Arun, V. Suresh, C. E. Veni Madhavan, and M. N. Narasimha Murthy. 2010. On finding the natural number of topics with latent dirichlet allocation: Some observations.
metric_griffiths_2004(logliks) Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics.
metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20, eps=1e-12, normalize=True, return_mean=False) Calculate coherence metric according to Mimno et al. 2011 (a.k.a. “U_Mass” coherence metric).
metric_coherence_gensim(measure, topic_word_distrib=None, gensim_model=None, vocab=None, dtm=None, gensim_corpus=None, texts=None, top_n=20, return_coh_model=False, return_mean=False, **kwargs) Calculate model coherence using Gensim’s CoherenceModel [1,2]. Also supports models from lda and sklearn.
metric_held_out_documents_wallach09(dtm_test, theta_test, phi_train, alpha, n_samples=10000)

Estimation of the probability of held-out documents according to Wallach et al. 2009 [1], using a document-topic estimation theta_test that was estimated via the held-out documents dtm_test on a trained model with topic-word distribution phi_train and document-topic prior alpha. Draws n_samples according to theta_test for each document in dtm_test (memory consumption and run time can be very high for large n_samples and a large number of big documents in dtm_test).

A document-topic estimation theta_test can be obtained from a model trained with the “lda” package or with scikit-learn by using the model’s transform() method.
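
A minimal sketch (assuming dtm_train and dtm_test already exist as document-term matrices, and that the function is importable from lda_utils.eval_metrics as the module is named here; the exact import path may differ in your installation) of obtaining theta_test and phi_train from a scikit-learn model:

```
# hypothetical usage sketch -- dtm_train and dtm_test are assumed to exist
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from lda_utils.eval_metrics import metric_held_out_documents_wallach09

alpha = 0.1   # document-topic prior used for training (assumption)

# fit a model on the training document-term matrix
lda = LatentDirichletAllocation(n_components=20, doc_topic_prior=alpha)
lda.fit(dtm_train)

# topic-word distribution phi_train: normalize scikit-learn's unnormalized components_
phi_train = lda.components_ / lda.components_.sum(axis=1)[:, np.newaxis]

# document-topic estimation theta_test for the held-out documents
theta_test = lda.transform(dtm_test)

ll_heldout = metric_held_out_documents_wallach09(dtm_test, theta_test, phi_train,
                                                 alpha, n_samples=1000)
```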

Adapted from MATLAB code originally by Ian Murray, 2009. See https://people.cs.umass.edu/~wallach/code/etm/ (MATLAB code downloaded from https://people.cs.umass.edu/~wallach/code/etm/lda_eval_matlab_code_20120930.tar.gz).

Note: requires the gmpy2 package for multiple-precision arithmetic to avoid numerical underflow. See https://github.com/aleaxit/gmpy

[1] Wallach, H.M., Murray, I., Salakhutdinov, R. and Mimno, D., 2009. Evaluation methods for topic models.

metric_cao_juan_2009(topic_word_distrib)

Cao Juan, Xia Tian, Li Jintao, Zhang Yongdong, and Tang Sheng. 2009. A density-based method for adaptive LDA model selection. Neurocomputing — 16th European Symposium on Artificial Neural Networks 2008 72, 7–9: 1775–1781. http://doi.org/10.1016/j.neucom.2008.06.011

metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths)

Rajkumar Arun, V. Suresh, C. E. Veni Madhavan, and M. N. Narasimha Murthy. 2010. On finding the natural number of topics with latent dirichlet allocation: Some observations. In Advances in knowledge discovery and data mining, Mohammed J. Zaki, Jeffrey Xu Yu, Balaraman Ravindran and Vikram Pudi (eds.). Springer Berlin Heidelberg, 391–402. http://doi.org/10.1007/978-3-642-13657-3_43
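
A minimal usage sketch (topic_word_distrib, doc_topic_distrib and the document-term matrix dtm are assumed to come from an already fitted model; the import path follows the module name in this document and may differ):

```
import numpy as np
from lda_utils.eval_metrics import metric_arun_2010

# document lengths are the row sums of the document-term matrix
# (works for dense and sparse dtm)
doc_lengths = np.asarray(dtm.sum(axis=1)).flatten()

arun_value = metric_arun_2010(topic_word_distrib, doc_topic_distrib, doc_lengths)
```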

metric_griffiths_2004(logliks)

Thomas L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences 101, suppl 1: 5228–5235. http://doi.org/10.1073/pnas.0307752101

Calculates the harmonic mean of the loglikelihood values logliks as in Griffiths, Steyvers 2004. Burnin values should already be removed from logliks.

Note: requires the gmpy2 package for multiple-precision arithmetic to avoid numerical underflow. See https://github.com/aleaxit/gmpy
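
A minimal usage sketch (assuming logliks comes from a Gibbs sampling model, e.g. the loglikelihoods_ attribute of a model from the “lda” package, and that the import path matches the module name in this document):

```
from lda_utils.eval_metrics import metric_griffiths_2004

burnin = 5                                  # number of initial samples to discard (assumption)
logliks = model.loglikelihoods_[burnin:]    # burn-in values must be removed beforehand

griffiths_value = metric_griffiths_2004(logliks)
```
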
metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20, eps=1e-12, normalize=True, return_mean=False)

Calculate coherence metric according to Mimno et al. 2011 (a.k.a. “U_Mass” coherence metric). There are two modifications to the originally suggested measure:

- it uses a different epsilon by default (set eps=1 for the original measure)
- it uses a normalizing constant by default (set normalize=False for the original measure)

Provide a topic-word distribution phi as topic_word_distrib and a document-term matrix dtm (can be sparse). top_n controls how many of the most probable words per topic are selected.

By default, it will return a NumPy array of coherence values per topic (same ordering as in topic_word_distrib). Set return_mean to True to return the mean of all topics instead.

[1] D. Mimno, H. Wallach, E. Talley, M. Leenders, A. McCallum 2011: Optimizing semantic coherence in topic models
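
A minimal usage sketch (topic_word_distrib and dtm are assumed to come from an already fitted model; the import path follows the module name in this document and may differ):

```
from lda_utils.eval_metrics import metric_coherence_mimno_2011

# one coherence value per topic, using the 20 most probable words of each topic
coh_per_topic = metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20)

# a single mean coherence value across all topics
coh_mean = metric_coherence_mimno_2011(topic_word_distrib, dtm, top_n=20,
                                       return_mean=True)
```
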
metric_coherence_gensim(measure, topic_word_distrib=None, gensim_model=None, vocab=None, dtm=None, gensim_corpus=None, texts=None, top_n=20, return_coh_model=False, return_mean=False, **kwargs)

Calculate model coherence using Gensim’s CoherenceModel [1,2]. Note: this function also supports models from lda and sklearn (by passing topic_word_distrib, dtm and vocab)!

Define which measure to use with the parameter measure:

- u_mass
- c_v
- c_uci
- c_npmi

Provide a topic-word distribution phi as topic_word_distrib OR a Gensim model gensim_model, and the corpus’ vocabulary as vocab OR pass a Gensim corpus as gensim_corpus. top_n controls how many of the most probable words per topic are selected.

If measure is u_mass, a document-term matrix dtm or gensim_corpus must be provided and texts can be None. If any other measure than u_mass is used, tokenized input must be provided as texts in the form of a 2D list:

```
[['some', 'text', ...],         # doc. 1
 ['some', 'more', ...],         # doc. 2
 ['another', 'document', ...]]  # doc. 3
```

If return_coh_model is True, the whole CoherenceModel instance will be returned, otherwise:

- if return_mean is True, the mean coherence value will be returned
- if return_mean is False, a list of coherence values (one for each topic) will be returned

Provided kwargs will be passed to CoherenceModel() or CoherenceModel.get_coherence_per_topic().
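
A minimal usage sketch (topic_word_distrib, vocab, dtm and the tokenized documents tokenized_docs are assumed to exist; the import path follows the module name in this document and may differ):

```
from lda_utils.eval_metrics import metric_coherence_gensim

# "u_mass" needs a document-term matrix (or a Gensim corpus); tokenized texts are not required
coh_umass = metric_coherence_gensim('u_mass',
                                    topic_word_distrib=topic_word_distrib,
                                    vocab=vocab, dtm=dtm, top_n=20)

# other measures such as "c_v" need the tokenized input documents as texts
coh_cv = metric_coherence_gensim('c_v',
                                 topic_word_distrib=topic_word_distrib,
                                 vocab=vocab, texts=tokenized_docs, top_n=20,
                                 return_mean=True)
```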

[1]: https://radimrehurek.com/gensim/models/coherencemodel.html
[2]: https://rare-technologies.com/what-is-topic-coherence/