topicmod.model_stats

Statistics for topic models and bag-of-words (BoW) matrices (document-term matrices).

Markus Konrad <markus.konrad@wzb.eu>

Module Contents

Functions

get_marginal_topic_distrib(doc_topic_distrib,doc_lengths) Return marginal topic distribution p(T) (topic proportions) given the document-topic distribution (theta)
get_marginal_word_distrib(topic_word_distrib,p_t) Return the marginal word distribution p(w) (term proportions derived from topic model) given the
_words_by_score(words,score,least_to_most,n=None) Order a vector of words by a score, either least_to_most or reverse. Optionally return only the top n
get_word_saliency(topic_word_distrib,doc_topic_distrib,doc_lengths) Calculate word saliency according to Chuang et al. 2012.
_words_by_salience_score(vocab,topic_word_distrib,doc_topic_distrib,doc_lengths,n=None,least_to_most=False) Return words in vocab ordered by saliency score.
get_most_salient_words(vocab,topic_word_distrib,doc_topic_distrib,doc_lengths,n=None) Order the words from vocab by “saliency score” (Chuang et al. 2012) from most to least salient. Optionally only
get_least_salient_words(vocab,topic_word_distrib,doc_topic_distrib,doc_lengths,n=None) Order the words from vocab by “saliency score” (Chuang et al. 2012) from least to most salient. Optionally only
get_word_distinctiveness(topic_word_distrib,p_t) Calculate word distinctiveness according to Chuang et al. 2012.
_words_by_distinctiveness_score(vocab,topic_word_distrib,doc_topic_distrib,doc_lengths,n=None,least_to_most=False) Return words in vocab ordered by distinctiveness score.
get_most_distinct_words(vocab,topic_word_distrib,doc_topic_distrib,doc_lengths,n=None) Order the words from vocab by “distinctiveness score” (Chuang et al. 2012) from most to least distinctive.
get_least_distinct_words(vocab,topic_word_distrib,doc_topic_distrib,doc_lengths,n=None) Order the words from vocab by “distinctiveness score” (Chuang et al. 2012) from least to most distinctive.
get_topic_word_relevance(topic_word_distrib,doc_topic_distrib,doc_lengths,lambda_) Calculate the topic-word relevance score with a lambda parameter lambda_ according to Sievert and Shirley 2014.
_check_relevant_words_for_topic_args(vocab,rel_mat,topic)
get_most_relevant_words_for_topic(vocab,rel_mat,topic,n=None) Get words from vocab for topic ordered by most to least relevance (Sievert and Shirley 2014) using the relevance
get_least_relevant_words_for_topic(vocab,rel_mat,topic,n=None) Get words from vocab for topic ordered by least to most relevance (Sievert and Shirley 2014) using the relevance
generate_topic_labels_from_top_words(topic_word_distrib,doc_topic_distrib,doc_lengths,vocab,n_words=None,lambda_=1,labels_glue="_",labels_format="{i1}_{topwords}") Generate topic labels derived from the top words of each topic. The top words are determined from the
top_n_from_distribution(distrib,top_n=10,row_labels=None,col_labels=None,val_labels=None) Get top_n values from LDA model’s distribution distrib as DataFrame. Can be used for topic-word distributions
top_words_for_topics(topic_word_distrib,top_n=None,vocab=None,return_prob=False)
_join_value_and_label_dfs(vals,labels,top_n,val_fmt=None,row_labels=None,col_labels=None,index_name=None)
filter_topics(w,vocab,topic_word_distrib,top_n=None,thresh=None,match="exact",cond="any",glob_method="match",return_words_and_matches=False) Filter topics defined as topic-word distribution topic_word_distrib across vocabulary vocab for a word (pass a
exclude_topics(excl_topic_indices,doc_topic_distrib,topic_word_distrib=None,renormalize=True,return_new_topic_mapping=False) Exclude topics with the indices excl_topic_indices from the document-topic distribution doc_topic_distrib (i.e.
get_marginal_topic_distrib(doc_topic_distrib, doc_lengths)

Return marginal topic distribution p(T) (topic proportions) given the document-topic distribution (theta) doc_topic_distrib and the document lengths doc_lengths. The latter can be calculated with get_doc_lengths().
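As a rough illustration of the description above, the following NumPy sketch weights each document's topic proportions by the document length and normalizes the result. It is an assumption about how such a marginal can be computed, not the library's actual code; the helper name `marginal_topic_distrib_sketch` and the toy data are hypothetical.

```python
import numpy as np

def marginal_topic_distrib_sketch(doc_topic_distrib, doc_lengths):
    """Hypothetical helper: p(T) as the length-weighted average of theta."""
    theta = np.asarray(doc_topic_distrib, dtype=float)  # shape (n_docs, n_topics)
    lengths = np.asarray(doc_lengths, dtype=float)      # shape (n_docs,)
    # Weight each document's topic proportions by its number of tokens.
    unnorm = (theta * lengths[:, np.newaxis]).sum(axis=0)
    # Normalize so that the topic proportions sum to 1.
    return unnorm / unnorm.sum()

# Toy data: two documents, two topics.
theta = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
doc_lengths = np.array([10, 30])

p_t = marginal_topic_distrib_sketch(theta, doc_lengths)
print(p_t)  # [0.375 0.625]
```

The longer second document dominates the marginal, which is why the second topic ends up with the larger proportion.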

get_marginal_word_distrib(topic_word_distrib, p_t)

Return the marginal word distribution p(w) (term proportions derived from topic model) given the topic-word distribution (phi) topic_word_distrib and the marginal topic distribution p(T) p_t. The latter can be calculated with get_marginal_topic_distrib().
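A minimal sketch of this marginalization, assuming a topic-word matrix of shape (topics, words): p(w) is obtained by summing phi over the topics, weighted by p(T). The helper name `marginal_word_distrib_sketch` and the toy data are hypothetical, and the library may compute the result differently.

```python
import numpy as np

def marginal_word_distrib_sketch(topic_word_distrib, p_t):
    """Hypothetical helper: p(w) = sum over topics t of p(t) * phi[t, w]."""
    phi = np.asarray(topic_word_distrib, dtype=float)  # shape (n_topics, n_words)
    p_t = np.asarray(p_t, dtype=float)                 # shape (n_topics,)
    return p_t @ phi

# Toy data: two topics, three words.
phi = np.array([[0.5, 0.5, 0.0],
                [0.0, 0.5, 0.5]])
p_t = np.array([0.25, 0.75])

p_w = marginal_word_distrib_sketch(phi, p_t)
print(p_w)  # [0.125 0.5   0.375]
```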

_words_by_score(words, score, least_to_most, n=None)

Order a vector of words by a score, either least_to_most or reverse. Optionally return only the top n results.
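The ordering described here can be sketched with `numpy.argsort`. This is an illustrative re-implementation under the assumption that `words` and `score` are equally long sequences; `words_by_score_sketch` is a hypothetical name, not the private helper itself.

```python
import numpy as np

def words_by_score_sketch(words, score, least_to_most, n=None):
    """Hypothetical helper: order words by score, optionally keep the first n."""
    words = np.asarray(words)
    score = np.asarray(score, dtype=float)
    order = np.argsort(score, kind="stable")  # ascending: least to most
    if not least_to_most:
        order = order[::-1]                   # descending: most to least
    ordered = words[order]
    return ordered if n is None else ordered[:n]

words = ["apple", "bank", "cat"]
score = [0.2, 0.7, 0.1]

top2 = words_by_score_sketch(words, score, least_to_most=False, n=2)
ascending = words_by_score_sketch(words, score, least_to_most=True)
print(list(top2))       # ['bank', 'apple']
print(list(ascending))  # ['cat', 'apple', 'bank']
```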

get_word_saliency(topic_word_distrib, doc_topic_distrib, doc_lengths)

Calculate word saliency according to Chuang et al. 2012:

saliency(w) = p(w) * distinctiveness(w)

J. Chuang, C. Manning, J. Heer 2012: “Termite: Visualization Techniques for Assessing Textual Topic Models”
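To make the saliency formula concrete, here is a small NumPy sketch that chains the marginal topic distribution, the marginal word distribution and the distinctiveness score. It assumes that all probabilities are strictly positive (so that no logarithm of zero occurs); the name `word_saliency_sketch` and the toy data are hypothetical, and this is not the library's implementation.

```python
import numpy as np

def word_saliency_sketch(topic_word_distrib, doc_topic_distrib, doc_lengths):
    """Hypothetical helper: saliency(w) = p(w) * distinctiveness(w)."""
    phi = np.asarray(topic_word_distrib, dtype=float)  # (n_topics, n_words)
    theta = np.asarray(doc_topic_distrib, dtype=float) # (n_docs, n_topics)
    lengths = np.asarray(doc_lengths, dtype=float)     # (n_docs,)

    # Marginal topic distribution p(T), weighted by document length.
    p_t = (theta * lengths[:, np.newaxis]).sum(axis=0)
    p_t /= p_t.sum()

    joint = phi * p_t[:, np.newaxis]  # p(w, T)
    p_w = joint.sum(axis=0)           # marginal word distribution p(w)
    p_t_given_w = joint / p_w         # P(T|w), each column sums to 1

    # distinctiveness(w) = KL(P(T|w), P(T)); assumes strictly positive entries.
    distinct = (p_t_given_w * np.log(p_t_given_w / p_t[:, np.newaxis])).sum(axis=0)
    return p_w * distinct

phi = np.array([[0.8, 0.1, 0.1],
                [0.1, 0.1, 0.8]])
theta = np.array([[0.5, 0.5],
                  [0.5, 0.5]])
doc_lengths = np.array([10, 10])

saliency = word_saliency_sketch(phi, theta, doc_lengths)
```

In this toy setup the middle word is equally likely under both topics, so its saliency is zero, while the two outer words are equally salient.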
_words_by_salience_score(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None, least_to_most=False)

Return words in vocab ordered by saliency score.

get_most_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by “saliency score” (Chuang et al. 2012) from most to least salient. Optionally only return the n most salient words.

J. Chuang, C. Manning, J. Heer 2012: “Termite: Visualization Techniques for Assessing Textual Topic Models”

get_least_salient_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by “saliency score” (Chuang et al. 2012) from least to most salient. Optionally only return the n least salient words.

J. Chuang, C. Manning, J. Heer 2012: “Termite: Visualization Techniques for Assessing Textual Topic Models”

get_word_distinctiveness(topic_word_distrib, p_t)

Calculate word distinctiveness according to Chuang et al. 2012:

distinctiveness(w) = KL(P(T|w), P(T)) = sum_T(P(T|w) * log(P(T|w) / P(T)))

with

P(T) .. marginal topic distribution
P(T|w) .. probability of a topic given a word

J. Chuang, C. Manning, J. Heer 2012: “Termite: Visualization Techniques for Assessing Textual Topic Models”
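The Kullback-Leibler term above can be sketched in a few lines of NumPy. The sketch assumes strictly positive probabilities (no special handling of zeros); `word_distinctiveness_sketch` and the toy data are hypothetical and only meant to illustrate the formula.

```python
import numpy as np

def word_distinctiveness_sketch(topic_word_distrib, p_t):
    """Hypothetical helper: KL divergence between P(T|w) and P(T) per word."""
    phi = np.asarray(topic_word_distrib, dtype=float)    # (n_topics, n_words)
    p_t = np.asarray(p_t, dtype=float)[:, np.newaxis]    # column vector (n_topics, 1)
    joint = phi * p_t                                    # p(w, T)
    p_t_given_w = joint / joint.sum(axis=0)              # Bayes' rule: P(T|w)
    return (p_t_given_w * np.log(p_t_given_w / p_t)).sum(axis=0)

phi = np.array([[0.6, 0.3, 0.1],
                [0.2, 0.3, 0.5]])
p_t = np.array([0.5, 0.5])

distinct = word_distinctiveness_sketch(phi, p_t)
```

The second word has the same probability in both topics, so P(T|w) equals P(T) and its distinctiveness is zero; the third word is the most concentrated in a single topic and therefore scores highest.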
_words_by_distinctiveness_score(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None, least_to_most=False)

Return words in vocab ordered by distinctiveness score.

get_most_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by “distinctiveness score” (Chuang et al. 2012) from most to least distinctive. Optionally only return the n most distinctive words.

J. Chuang, C. Manning, J. Heer 2012: “Termite: Visualization Techniques for Assessing Textual Topic Models”

get_least_distinct_words(vocab, topic_word_distrib, doc_topic_distrib, doc_lengths, n=None)

Order the words from vocab by “distinctiveness score” (Chuang et al. 2012) from least to most distinctive. Optionally only return the n least distinctive words.

J. Chuang, C. Manning, J. Heer 2012: “Termite: Visualization Techniques for Assessing Textual Topic Models”

get_topic_word_relevance(topic_word_distrib, doc_topic_distrib, doc_lengths, lambda_)

Calculate the topic-word relevance score with a lambda parameter lambda_ according to Sievert and Shirley 2014:

relevance(w, t | lambda) = lambda * log(phi_{w,t}) + (1 - lambda) * log(phi_{w,t} / p(w))

with

phi .. topic-word distribution
p(w) .. marginal word probability
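The relevance formula can be sketched as follows. The sketch assumes a strictly positive topic-word matrix of shape (topics, words) and derives p(w) from the length-weighted marginal topic distribution; `topic_word_relevance_sketch` and the toy data are hypothetical, and the library's own code may differ in detail.

```python
import numpy as np

def topic_word_relevance_sketch(topic_word_distrib, doc_topic_distrib, doc_lengths, lambda_):
    """Hypothetical helper: lambda * log(phi) + (1 - lambda) * log(phi / p(w))."""
    phi = np.asarray(topic_word_distrib, dtype=float)  # (n_topics, n_words)
    theta = np.asarray(doc_topic_distrib, dtype=float) # (n_docs, n_topics)
    lengths = np.asarray(doc_lengths, dtype=float)     # (n_docs,)

    p_t = (theta * lengths[:, np.newaxis]).sum(axis=0)
    p_t /= p_t.sum()   # marginal topic distribution p(T)
    p_w = p_t @ phi    # marginal word distribution p(w)

    # p_w broadcasts across the topic rows of phi.
    return lambda_ * np.log(phi) + (1 - lambda_) * np.log(phi / p_w)

phi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.2, 0.7]])
theta = np.array([[0.9, 0.1],
                  [0.3, 0.7]])
doc_lengths = np.array([20, 20])

rel_mat = topic_word_relevance_sketch(phi, theta, doc_lengths, lambda_=0.6)
```

With lambda_ = 1 the score reduces to the plain log probability log(phi); with lambda_ = 0 it reduces to the log of phi divided by p(w), which favors words that are specific to a topic.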
_check_relevant_words_for_topic_args(vocab, rel_mat, topic)

get_most_relevant_words_for_topic(vocab, rel_mat, topic, n=None)

Get words from vocab for topic ordered by most to least relevance (Sievert and Shirley 2014) using the relevance matrix rel_mat obtained from get_topic_word_relevance(). Optionally only return the n most relevant words.

get_least_relevant_words_for_topic(vocab, rel_mat, topic, n=None)

Get words from vocab for topic ordered by least to most relevance (Sievert and Shirley 2014) using the relevance matrix rel_mat obtained from get_topic_word_relevance(). Optionally only return the n least relevant words.
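Both orderings amount to sorting one row of the relevance matrix. The sketch below assumes that `topic` is a zero-based row index into `rel_mat` and that `rel_mat` has one column per vocabulary entry; these are assumptions for illustration, and `relevant_words_for_topic_sketch` is a hypothetical name.

```python
import numpy as np

def relevant_words_for_topic_sketch(vocab, rel_mat, topic, n=None, least_to_most=False):
    """Hypothetical helper: order vocab by the relevance scores of one topic."""
    vocab = np.asarray(vocab)
    rel_mat = np.asarray(rel_mat, dtype=float)
    if rel_mat.ndim != 2 or rel_mat.shape[1] != len(vocab):
        raise ValueError("rel_mat must have one column per word in vocab")
    if not 0 <= topic < rel_mat.shape[0]:
        raise ValueError("topic must be a valid (zero-based) row index")
    order = np.argsort(rel_mat[topic], kind="stable")  # least to most relevant
    if not least_to_most:
        order = order[::-1]                            # most to least relevant
    return vocab[order][:n]                            # n=None keeps all words

vocab = ["a", "b", "c"]
rel_mat = np.array([[-0.1, -2.0, -1.0],
                    [-3.0, -0.5, -0.2]])

most = relevant_words_for_topic_sketch(vocab, rel_mat, topic=0, n=2)
least = relevant_words_for_topic_sketch(vocab, rel_mat, topic=1, n=1, least_to_most=True)
print(list(most))   # ['a', 'c']
print(list(least))  # ['a']
```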

generate_topic_labels_from_top_words(topic_word_distrib, doc_topic_distrib, doc_lengths, vocab, n_words=None, lambda_=1, labels_glue="_", labels_format="{i1}_{topwords}")

Generate topic labels derived from the top words of each topic. The top words are determined from the relevance score (Sievert and Shirley 2014) depending on lambda_. Specify the number of top words in the label with n_words. If n_words is None, a minimum number of words will be used to create unique labels for each topic. Topic labels are formed by joining the top words with labels_glue and formatting them with labels_format. Placeholders in labels_format are {i0} (zero-based topic index), {i1} (one-based topic index) and {topwords} (top words glued with labels_glue).
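The label formatting can be illustrated with plain Python string formatting. The sketch starts from word lists that are already ranked per topic (the relevance ranking itself is not reproduced here) and assumes that "unique labels" means the glued top words differ between topics; both helper names and the toy data are hypothetical.

```python
def minimal_unique_n_words(ranked_words_per_topic, labels_glue="_"):
    """Hypothetical helper: smallest n for which the glued top-n words are unique."""
    max_n = max(len(words) for words in ranked_words_per_topic)
    for n in range(1, max_n + 1):
        joined = [labels_glue.join(words[:n]) for words in ranked_words_per_topic]
        if len(set(joined)) == len(joined):
            return n
    raise ValueError("top words do not yield unique labels for any n")

def topic_labels_sketch(ranked_words_per_topic, n_words=None,
                        labels_glue="_", labels_format="{i1}_{topwords}"):
    """Hypothetical helper: build one label per topic from its top words."""
    if n_words is None:
        n_words = minimal_unique_n_words(ranked_words_per_topic, labels_glue)
    return [
        labels_format.format(i0=i, i1=i + 1, topwords=labels_glue.join(words[:n_words]))
        for i, words in enumerate(ranked_words_per_topic)
    ]

# Words per topic, assumed to be sorted from most to least relevant.
ranked = [["bank", "money", "loan"],
          ["bank", "river", "water"]]

labels = topic_labels_sketch(ranked)
print(labels)  # ['1_bank_money', '2_bank_river']
```

Because both topics share the same first word, one word per label is not enough here and two words are used.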

top_n_from_distribution(distrib, top_n=10, row_labels=None, col_labels=None, val_labels=None)

Get top_n values from LDA model’s distribution distrib as DataFrame. Can be used for topic-word distributions and document-topic distributions. Set row_labels to a format string or a list. Set col_labels to a format string for the column names. Set val_labels to return value labels instead of pure values (probabilities).
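The core of such a function is a row-wise top-n selection. The following sketch shows only that step with NumPy and returns plain arrays of values and column indices instead of a DataFrame; the labeling options are not reproduced. `top_n_per_row_sketch` and the toy data are hypothetical.

```python
import numpy as np

def top_n_per_row_sketch(distrib, top_n=10):
    """Hypothetical helper: the top_n largest values per row and their column indices."""
    distrib = np.asarray(distrib, dtype=float)
    if not 0 < top_n <= distrib.shape[1]:
        raise ValueError("top_n must be between 1 and the number of columns")
    # Sort each row in descending order and keep the first top_n column indices.
    idx = np.argsort(-distrib, axis=1, kind="stable")[:, :top_n]
    vals = np.take_along_axis(distrib, idx, axis=1)
    return vals, idx

vocab = np.array(["a", "b", "c"])
phi = np.array([[0.1, 0.6, 0.3],
                [0.5, 0.2, 0.3]])

vals, idx = top_n_per_row_sketch(phi, top_n=2)
top_words = vocab[idx]  # map column indices to words
print(vals.tolist())       # [[0.6, 0.3], [0.5, 0.3]]
print(top_words.tolist())  # [['b', 'c'], ['a', 'c']]
```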

top_words_for_topics(topic_word_distrib, top_n=None, vocab=None, return_prob=False)

_join_value_and_label_dfs(vals, labels, top_n, val_fmt=None, row_labels=None, col_labels=None, index_name=None)

filter_topics(w, vocab, topic_word_distrib, top_n=None, thresh=None, match="exact", cond="any", glob_method="match", return_words_and_matches=False)

Filter topics, defined by the topic-word distribution topic_word_distrib across the vocabulary vocab, for a word (pass a string) or for multiple words/patterns w (pass a list of strings). Either match the pattern(s) w against the list of top words per topic (use top_n to set the number of words in that list) or specify a minimum topic-word probability thresh; the latter yields, for each topic, a list of words above this threshold, which is then used for pattern matching. You can also specify both top_n and thresh.

Set the match parameter according to the options provided by filter_tokens.token_match() (exact, RE or glob matching). Use cond to specify whether a single match per topic suffices when a list of patterns w is passed (cond='any') or whether all patterns must match (cond='all').

By default, this function returns a NumPy array containing the indices of the topics that passed the filter criteria. If return_words_and_matches is True, this function additionally returns a NumPy array with the top words for each topic and a NumPy array with the pattern matches for each topic.
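The filtering logic can be sketched in simplified form. The version below covers only the top_n path and only glob matching, and it uses the standard library's `fnmatch` in place of filter_tokens.token_match(); these simplifications, the name `filter_topics_sketch` and the toy data are assumptions for illustration.

```python
import fnmatch
import numpy as np

def filter_topics_sketch(patterns, vocab, topic_word_distrib, top_n, cond="any"):
    """Hypothetical helper: indices of topics whose top words match the glob patterns."""
    if isinstance(patterns, str):
        patterns = [patterns]
    if cond not in ("any", "all"):
        raise ValueError("cond must be 'any' or 'all'")
    vocab = np.asarray(vocab)
    phi = np.asarray(topic_word_distrib, dtype=float)  # (n_topics, n_words)
    # Column indices of the top_n most probable words per topic.
    top_idx = np.argsort(-phi, axis=1, kind="stable")[:, :top_n]
    combine = any if cond == "any" else all
    selected = []
    for t, word_idx in enumerate(top_idx):
        top_words = vocab[word_idx].tolist()
        # One boolean per pattern: does it match at least one top word?
        hits = [any(fnmatch.fnmatchcase(w, p) for w in top_words) for p in patterns]
        if combine(hits):
            selected.append(t)
    return np.array(selected, dtype=int)

vocab = ["bank", "banker", "river", "water"]
phi = np.array([[0.50, 0.30, 0.10, 0.10],
                [0.05, 0.05, 0.50, 0.40],
                [0.40, 0.10, 0.40, 0.10]])

any_bank = filter_topics_sketch("bank*", vocab, phi, top_n=2)
bank_and_river = filter_topics_sketch(["bank*", "riv*"], vocab, phi, top_n=2, cond="all")
print(any_bank.tolist())        # [0, 2]
print(bank_and_river.tolist())  # [2]
```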

exclude_topics(excl_topic_indices, doc_topic_distrib, topic_word_distrib=None, renormalize=True, return_new_topic_mapping=False)

Exclude topics with the indices excl_topic_indices from the document-topic distribution doc_topic_distrib (i.e. delete the respective columns in this matrix) and optionally re-normalize the distribution so that the rows sum up to 1 if renormalize is set to True. Optionally also strip the topics from the topic-word distribution topic_word_distrib (i.e. remove the respective rows).

If topic_word_distrib is given, return a tuple with the updated document-topic and topic-word distributions, else return only the updated document-topic distribution.

WARNING: The topics to be excluded are specified by zero-based indices.
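The column and row removal described above can be sketched with `numpy.delete`. This is an illustrative re-implementation that leaves out the return_new_topic_mapping option; `exclude_topics_sketch` and the toy data are hypothetical.

```python
import numpy as np

def exclude_topics_sketch(excl_topic_indices, doc_topic_distrib,
                          topic_word_distrib=None, renormalize=True):
    """Hypothetical helper: drop topics (zero-based indices) and renormalize rows."""
    theta = np.asarray(doc_topic_distrib, dtype=float)
    # Remove the topic columns from the document-topic distribution.
    theta = np.delete(theta, excl_topic_indices, axis=1)
    if renormalize:
        theta = theta / theta.sum(axis=1, keepdims=True)  # rows sum to 1 again
    if topic_word_distrib is None:
        return theta
    # Remove the matching topic rows from the topic-word distribution.
    phi = np.delete(np.asarray(topic_word_distrib, dtype=float), excl_topic_indices, axis=0)
    return theta, phi

theta = np.array([[0.50, 0.25, 0.25],
                  [0.20, 0.20, 0.60]])
phi = np.array([[0.7, 0.3],
                [0.5, 0.5],
                [0.1, 0.9]])

new_theta, new_phi = exclude_topics_sketch([1], theta, phi)
```

Excluding the topic with index 1 leaves two topics; the first document's remaining proportions 0.5 and 0.25 are rescaled to 2/3 and 1/3.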