dtm

Module Contents

Functions

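get_vocab_and_terms(docs) From a dict docs mapping document IDs to lists of terms/tokens, generate the vocabulary array, the document labels array, a dict of per-document term arrays and the sum of unique terms per document.
create_sparse_dtm(vocab, doc_labels, docs_terms, sum_uniques_per_doc) Create a sparse document-term matrix (DTM) as a SciPy coo_matrix from the vocabulary array, the document labels array, the per-document terms dict and the sum of unique terms per document.
save_dtm_to_pickle(dtm, vocab, docnames, picklefile) Save a DTM as a pickle file.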
load_dtm_from_pickle(picklefile) Load a DTM from a pickle file.
get_vocab_and_terms(docs)

From a dict docs that maps each document ID to a list of terms/tokens, generate four values: an array of vocabulary (the unique terms of the whole corpus docs), an array of document labels (the document IDs), a dict that maps each document ID to an array of that document's terms, and the sum of the number of unique terms per document.

The returned variable sum_uniques_per_doc tells us how many elements will be non-zero in a DTM which will be created later. Hence this is the allocation size for the sparse DTM.

This function provides the input for create_sparse_dtm().

Return a tuple with:

- np.array of vocabulary
- np.array of document names
- dict with mapping: document name -> np.array of document terms
- overall sum of unique terms per document (allocation size for the sparse DTM)
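To make the return values listed above concrete, here is a minimal sketch that mimics the described behaviour on a toy corpus. The function get_vocab_and_terms_sketch is a hypothetical stand-in written for illustration, not the library's implementation, and returning the vocabulary in sorted order is an assumption.

```python
import numpy as np


def get_vocab_and_terms_sketch(docs):
    """Hypothetical stand-in that follows the description above."""
    vocab = set()
    docs_terms = {}
    sum_uniques_per_doc = 0

    for doc_label, terms in docs.items():
        docs_terms[doc_label] = np.array(terms)
        uniques = set(terms)
        vocab |= uniques                      # grow the corpus-wide vocabulary
        sum_uniques_per_doc += len(uniques)   # non-zero DTM entries for this document

    # Assumption: the vocabulary is returned sorted.
    return (np.array(sorted(vocab)),
            np.array(list(docs.keys())),
            docs_terms,
            sum_uniques_per_doc)


docs = {
    "doc1": ["a", "b", "a"],
    "doc2": ["b", "c"],
}
vocab, doc_labels, docs_terms, sum_uniques_per_doc = get_vocab_and_terms_sketch(docs)
```

In this toy corpus each document has two unique terms, so sum_uniques_per_doc is 4 even though the vocabulary has only three entries: a term shared by two documents occupies one non-zero cell per document.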

create_sparse_dtm(vocab, doc_labels, docs_terms, sum_uniques_per_doc)

Create a sparse document-term matrix (DTM) as a SciPy “coo_matrix” from the vocabulary array vocab, the array of document IDs/labels doc_labels, the dict docs_terms that maps each document label to its terms, and the sum of unique terms per document sum_uniques_per_doc. The DTM’s rows correspond to documents and its columns to indices in vocab, hence a value DTM[j, k] is the term frequency of term vocab[k] in document doc_labels[j].

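Memory requirement: about 3 * sum_uniques_per_doc array elements, presumably because the COO format stores a value, a row index and a column index for each non-zero entry.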
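The following sketch shows one way such a matrix could be assembled with NumPy and SciPy. It is an illustration under stated assumptions, not the library's code: create_sparse_dtm_sketch is a hypothetical name, the vocabulary is assumed to be sorted so that np.searchsorted can map terms to column indices, and rows are assumed to follow the order of doc_labels.

```python
import numpy as np
from scipy.sparse import coo_matrix


def create_sparse_dtm_sketch(vocab, doc_labels, docs_terms, sum_uniques_per_doc):
    """Hypothetical stand-in; assumes vocab is sorted."""
    # Three preallocated arrays, one slot per non-zero DTM entry.
    data = np.empty(sum_uniques_per_doc, dtype=np.intc)   # term frequencies
    rows = np.empty(sum_uniques_per_doc, dtype=np.intc)   # document indices
    cols = np.empty(sum_uniques_per_doc, dtype=np.intc)   # vocabulary indices

    pos = 0
    for j, label in enumerate(doc_labels):
        uniq_terms, counts = np.unique(docs_terms[label], return_counts=True)
        n = len(uniq_terms)
        data[pos:pos + n] = counts
        rows[pos:pos + n] = j
        cols[pos:pos + n] = np.searchsorted(vocab, uniq_terms)
        pos += n

    # Every preallocated slot must have been filled.
    assert pos == sum_uniques_per_doc

    return coo_matrix((data, (rows, cols)),
                      shape=(len(doc_labels), len(vocab)),
                      dtype=np.intc)


vocab = np.array(["a", "b", "c"])
doc_labels = ["doc1", "doc2"]
docs_terms = {
    "doc1": np.array(["a", "b", "a"]),
    "doc2": np.array(["b", "c"]),
}
dtm = create_sparse_dtm_sketch(vocab, doc_labels, docs_terms, 4)
dense = dtm.toarray()   # row j is doc_labels[j], column k is vocab[k]
```

Here dense[0, 0] is the frequency of vocab[0] in doc_labels[0]. The three preallocated arrays are why the footprint scales with sum_uniques_per_doc rather than with the full number of documents times vocabulary size.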

save_dtm_to_pickle(dtm, vocab, docnames, picklefile)

Save a DTM as a pickle file.

load_dtm_from_pickle(picklefile)

Load a DTM from a pickle file.
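As a rough illustration of the save/load round trip, the sketch below stores a small DTM together with its vocabulary and document names using the standard pickle module. The helper names save_dtm_sketch and load_dtm_sketch, the dict layout inside the pickle file and the returned tuple are all assumptions made for this example; the documentation above does not state how the library lays out the file or what load_dtm_from_pickle returns.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import coo_matrix


def save_dtm_sketch(dtm, vocab, docnames, picklefile):
    """Hypothetical helper: store all three objects in one dict (assumed layout)."""
    with open(picklefile, "wb") as f:
        pickle.dump({"dtm": dtm, "vocab": vocab, "docnames": docnames}, f)


def load_dtm_sketch(picklefile):
    """Hypothetical helper: return (dtm, vocab, docnames) from the assumed layout."""
    with open(picklefile, "rb") as f:
        data = pickle.load(f)
    return data["dtm"], data["vocab"], data["docnames"]


dtm = coo_matrix(np.array([[2, 1, 0], [0, 1, 1]]))
vocab = np.array(["a", "b", "c"])
docnames = np.array(["doc1", "doc2"])

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "dtm.pickle")
    save_dtm_sketch(dtm, vocab, docnames, path)
    dtm2, vocab2, docnames2 = load_dtm_sketch(path)
```

Pickle files can execute arbitrary code when loaded, so only load files from sources you trust.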