corpus

Module Contents

Classes

Corpus(self,docs=None)

Functions

read_full_file(fpath, encoding, read_size=None)
path_recursive_split(path, base=None)
paragraphs_from_lines(lines, splitchar="\n", break_on_num_newlines=2) Take string of lines, split into list of lines using splitchar (or don't if splitchar evaluates to False) and then split them into individual paragraphs.
class Corpus(docs=None)
__init__(docs=None)
__str__()
__len__()
__getitem__(doc_label)
__setitem__(doc_label, doc_text)
__delitem__(doc_label)
__iter__()
__contains__(doc_label)
items()
keys()
get(*args)
from_files(*args, **kwargs)
from_folder(*args, **kwargs)
from_pickle(picklefile)
get_doc_labels(sort=True)
add_doc(doc_label, doc_text)
add_files(files, encoding="utf8", doc_label_fmt="{path}-{basename}", doc_label_path_join="_", read_size=None)
add_folder(folder, valid_extensions=tuple, encoding="utf8", strip_folderpath_from_doc_label=True, doc_label_fmt="{path}-{basename}", doc_label_path_join="_", read_size=None)
to_pickle(picklefile)
split_by_paragraphs(break_on_num_newlines=2, join_paragraphs=1, new_doc_label_fmt="{doc}-{parnum}")
sample(n)
filter_by_min_length(nchars)
filter_by_max_length(nchars)
_filter_by_length(nchars, predicate)
read_full_file(fpath, encoding, read_size=None)
path_recursive_split(path, base=None)
paragraphs_from_lines(lines, splitchar="\n", break_on_num_newlines=2)

Take a string of lines and split it into a list of lines using splitchar (skip this step if splitchar evaluates to False), then group those lines into individual paragraphs. A paragraph must be separated from the next paragraph by at least break_on_num_newlines line breaks (empty lines). Returns a list of paragraphs, each paragraph being a string of sentences.
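The behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the module's actual implementation: the function name split_into_paragraphs is hypothetical, and the sketch assumes that N consecutive line breaks correspond to N - 1 empty lines and that the lines of one paragraph are joined with a single space.

```python
def split_into_paragraphs(text, splitchar="\n", break_on_num_newlines=2):
    """Hypothetical sketch of paragraph splitting; not the library's code."""
    # Split the input into lines, or treat it as an already-split sequence
    # when splitchar evaluates to False.
    lines = text.split(splitchar) if splitchar else list(text)

    paragraphs = []
    current = []   # non-empty lines of the paragraph being built
    n_empty = 0    # consecutive empty lines since the last non-empty line

    for line in lines:
        if line.strip():
            # Assumption: N line breaks in a row means N - 1 empty lines.
            if current and n_empty >= break_on_num_newlines - 1:
                paragraphs.append(" ".join(current))
                current = []
            current.append(line.strip())
            n_empty = 0
        else:
            n_empty += 1

    # Flush the last paragraph, if any.
    if current:
        paragraphs.append(" ".join(current))
    return paragraphs


text = "First line.\nStill first.\n\nSecond paragraph.\n\n\nThird."
default_result = split_into_paragraphs(text)
strict_result = split_into_paragraphs(text, break_on_num_newlines=3)
presplit_result = split_into_paragraphs(["a", "", "b"], splitchar=None)
```

With the default of 2, a single empty line ends a paragraph, so the sample text yields three paragraphs. Raising break_on_num_newlines to 3 requires two empty lines, so the first two blocks merge into one paragraph under the assumptions stated above.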