bow.dtm

Functions for creating a document-term-matrix (DTM) and some compatibility functions for Gensim.

Module Contents

Functions

get_vocab_and_terms(docs): From a dict docs with document ID -> terms/tokens list mapping, generate the vocabulary array, the document labels array, the per-document terms and the sum of unique terms per document.
create_sparse_dtm(vocab, doc_labels, docs_terms, sum_uniques_per_doc): Create a sparse document-term-matrix (DTM) as a scipy “coo_matrix” from the output of get_vocab_and_terms().
dtm_to_gensim_corpus(dtm)
gensim_corpus_to_dtm(corpus)
dtm_and_vocab_to_gensim_corpus_and_dict(dtm,vocab,as_gensim_dictionary=True)
get_vocab_and_terms(docs)

From a dict docs with document ID -> terms/tokens list mapping, generate an array of vocabulary (i.e. unique terms of the whole corpus docs), an array of document labels (i.e. document IDs), a dict with document ID -> document terms array mapping and a sum of the number of unique terms per document.

The returned value sum_uniques_per_doc is the number of non-zero elements in the DTM that will be created later, and hence the allocation size for the sparse DTM.

This function provides the input for create_sparse_dtm().

Return a tuple with:
- np.array of vocabulary
- np.array of document names
- dict with mapping: document name -> np.array of document terms
- overall sum of unique terms per document (allocation size for the sparse DTM)
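The four return values can be illustrated with a minimal sketch. The function sketch_vocab_and_terms below is a hypothetical stand-in written for this example, not the library's implementation; in particular, sorting the vocabulary is an assumption of the sketch.

```python
import numpy as np


def sketch_vocab_and_terms(docs):
    """Hypothetical stand-in for get_vocab_and_terms(); not the library's code."""
    vocab = set()
    docs_terms = {}
    sum_uniques_per_doc = 0

    for label, terms in docs.items():
        docs_terms[label] = np.array(terms)   # document name -> array of terms
        uniques = set(terms)
        vocab |= uniques                      # collect corpus-wide vocabulary
        sum_uniques_per_doc += len(uniques)   # one non-zero DTM cell per unique term

    # Assumption of this sketch: the vocabulary is returned in sorted order.
    vocab_arr = np.array(sorted(vocab))
    doc_labels = np.array(list(docs.keys()))
    return vocab_arr, doc_labels, docs_terms, sum_uniques_per_doc


docs = {
    'doc1': ['a', 'b', 'a'],
    'doc2': ['b', 'c'],
}
vocab, doc_labels, docs_terms, sum_uniques_per_doc = sketch_vocab_and_terms(docs)
```

Here doc1 has two unique terms and doc2 has two, so sum_uniques_per_doc is 4, which is also the number of non-zero cells a DTM of these documents would have.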

create_sparse_dtm(vocab, doc_labels, docs_terms, sum_uniques_per_doc)

Create a sparse document-term-matrix (DTM) as a scipy “coo_matrix” from the vocabulary array vocab, the document IDs/labels array doc_labels, the dict docs_terms mapping doc_label -> document terms, and the sum of unique terms per document sum_uniques_per_doc. The DTM’s rows correspond to the documents in doc_labels and its columns to the indices in vocab; hence a value DTM[j, k] is the term frequency of term vocab[k] in document doc_labels[j].

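Memory requirement: about 3 * <sum_uniques_per_doc> array elements, since the COO format stores a row index, a column index and a value for each non-zero entry.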
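The following sketch shows one way such a matrix could be assembled, and why the allocation is three arrays of length sum_uniques_per_doc. The function sketch_sparse_dtm is a hypothetical illustration, not the library's code; it assumes that vocab is sorted so that np.searchsorted can map terms to column indices.

```python
import numpy as np
from scipy.sparse import coo_matrix


def sketch_sparse_dtm(vocab, doc_labels, docs_terms, sum_uniques_per_doc):
    """Hypothetical illustration of building a COO DTM; assumes sorted vocab."""
    # Three pre-allocated arrays: this is the "3 * sum_uniques_per_doc" cost.
    rows = np.empty(sum_uniques_per_doc, dtype=np.intc)
    cols = np.empty(sum_uniques_per_doc, dtype=np.intc)
    data = np.empty(sum_uniques_per_doc, dtype=np.intc)

    pos = 0
    for j, label in enumerate(doc_labels):
        # Unique terms of document j and how often each occurs
        terms, counts = np.unique(docs_terms[label], return_counts=True)
        n = len(terms)
        rows[pos:pos + n] = j                             # row = document index
        cols[pos:pos + n] = np.searchsorted(vocab, terms)  # column = index in vocab
        data[pos:pos + n] = counts                        # value = term frequency
        pos += n

    shape = (len(doc_labels), len(vocab))
    return coo_matrix((data, (rows, cols)), shape=shape, dtype=np.intc)


vocab = np.array(['a', 'b', 'c'])
doc_labels = np.array(['doc1', 'doc2'])
docs_terms = {
    'doc1': np.array(['a', 'b', 'a']),
    'doc2': np.array(['b', 'c']),
}
dtm = sketch_sparse_dtm(vocab, doc_labels, docs_terms, sum_uniques_per_doc=4)
dense = dtm.toarray()
```

In this example dense[0, 0] is 2, the frequency of vocab[0] ('a') in doc_labels[0] ('doc1'), matching the DTM[j, k] reading described above.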

dtm_to_gensim_corpus(dtm)
gensim_corpus_to_dtm(corpus)
dtm_and_vocab_to_gensim_corpus_and_dict(dtm, vocab, as_gensim_dictionary=True)