extract

Functions to extract various elements of interest from documents already parsed by spaCy, such as n-grams, named entities, subject-verb-object triples, and acronyms.

Module Contents

Functions

words(doc, filter_stops=True, filter_punct=True, filter_nums=False, include_pos=None, exclude_pos=None, min_freq=1)
    Extract an ordered sequence of words from a document processed by spaCy, optionally filtering words by part-of-speech tag and frequency.

ngrams(doc, n, filter_stops=True, filter_punct=True, filter_nums=False, include_pos=None, exclude_pos=None, min_freq=1)
    Extract an ordered sequence of n-grams (n consecutive words) from a spacy-parsed doc, optionally filtering n-grams by the types and parts-of-speech of the constituent words.

named_entities(doc, include_types=None, exclude_types=None, drop_determiners=True, min_freq=1)
    Extract an ordered sequence of named entities (PERSON, ORG, LOC, etc.) from a spacy-parsed doc, optionally filtering by entity types and frequencies.

noun_chunks(doc, drop_determiners=True, min_freq=1)
    Extract an ordered sequence of noun chunks from a spacy-parsed doc, optionally filtering by frequency and dropping leading determiners.

pos_regex_matches(doc, pattern)
    Extract sequences of consecutive tokens from a spacy-parsed doc whose part-of-speech tags match the specified regex pattern.

subject_verb_object_triples(doc)
    Extract an ordered sequence of subject-verb-object (SVO) triples from a spacy-parsed doc.

acronyms_and_definitions(doc, known_acro_defs=None)
    Extract a collection of acronyms and their most likely definitions, if available, from a spacy-parsed doc.

_get_acronym_definition(acronym, window, threshold=0.8)
    Identify the most likely definition for an acronym given a list of tokens.

semistructured_statements(doc, entity, cue="be", ignore_entity_case=True, min_n_words=1, max_n_words=20)
    Extract "semi-structured statements" from a spacy-parsed doc, each as an (entity, cue, fragment) triple.

direct_quotations(doc)
    Baseline, not-great attempt at direct quotation extraction (no indirect or mixed quotations) using rules and patterns.

words(doc, filter_stops=True, filter_punct=True, filter_nums=False, include_pos=None, exclude_pos=None, min_freq=1)

Extract an ordered sequence of words from a document processed by spaCy, optionally filtering words by part-of-speech tag and frequency.

Args:
    doc (textacy.Doc, spacy.Doc, or spacy.Span)
    filter_stops (bool): if True, remove stop words from word list
    filter_punct (bool): if True, remove punctuation from word list
    filter_nums (bool): if True, remove number-like words (e.g. 10, 'ten') from word list
    include_pos (str or Set[str]): remove words whose part-of-speech tag IS NOT included in this param
    exclude_pos (str or Set[str]): remove words whose part-of-speech tag IS included in this param
    min_freq (int): remove words that occur in doc fewer than min_freq times

Yields:
    spacy.Token: the next token from doc passing all specified filters, in order of appearance in the document

Raises:
    TypeError: if include_pos or exclude_pos is not a str, a set of str, or a falsy value

Note:
    Filtering by part-of-speech tag uses the universal POS tag set; see http://universaldependencies.org/u/pos/.
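
Example: a minimal usage sketch. The sample sentence is invented, "en_core_web_sm" is just one English pipeline that should work (any spaCy English model with a tagger will do), and the printed outputs here and in the examples below are indicative rather than guaranteed. The imports and the nlp object built here are reused in the later examples; the module is assumed importable as textacy.extract.

    >>> import spacy
    >>> from textacy import extract
    >>> nlp = spacy.load("en_core_web_sm")
    >>> doc = nlp("The two species of pandas eat bamboo almost exclusively.")
    >>> [tok.text for tok in extract.words(doc, include_pos={"NOUN", "VERB"})]
    ['species', 'pandas', 'eat', 'bamboo']
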
ngrams(doc, n, filter_stops=True, filter_punct=True, filter_nums=False, include_pos=None, exclude_pos=None, min_freq=1)

Extract an ordered sequence of n-grams (n consecutive words) from a spacy-parsed doc, optionally filtering n-grams by the types and parts-of-speech of the constituent words.

Args:
    doc (textacy.Doc, spacy.Doc, or spacy.Span)
    n (int): number of tokens per n-gram; 2 => bigrams, 3 => trigrams, etc.
    filter_stops (bool): if True, remove ngrams that start or end with a stop word
    filter_punct (bool): if True, remove ngrams that contain any punctuation-only tokens
    filter_nums (bool): if True, remove ngrams that contain any numbers or number-like tokens (e.g. 10, 'ten')
    include_pos (str or Set[str]): remove ngrams if any of their constituent tokens' part-of-speech tags ARE NOT included in this param
    exclude_pos (str or Set[str]): remove ngrams if any of their constituent tokens' part-of-speech tags ARE included in this param
    min_freq (int): remove ngrams that occur in doc fewer than min_freq times

Yields:
    spacy.Span: the next ngram from doc passing all specified filters, in order of appearance in the document

Raises:
    ValueError: if n < 1
    TypeError: if include_pos or exclude_pos is not a str, a set of str, or a falsy value

Note:
    Filtering by part-of-speech tag uses the universal POS tag set; see http://universaldependencies.org/u/pos/.
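
Example: continuing the session above, a sketch of bigram extraction (output indicative):

    >>> doc = nlp("The two species of pandas eat bamboo almost exclusively.")
    >>> [span.text for span in extract.ngrams(doc, 2, filter_stops=True)]
    ['two species', 'pandas eat', 'eat bamboo']
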
named_entities(doc, include_types=None, exclude_types=None, drop_determiners=True, min_freq=1)

Extract an ordered sequence of named entities (PERSON, ORG, LOC, etc.) from a spacy-parsed doc, optionally filtering by entity types and frequencies.

Args:
    doc (textacy.Doc or spacy.Doc)
    include_types (str or Set[str]): remove named entities whose type IS NOT in this param; if "NUMERIC", all numeric entity types ("DATE", "MONEY", "ORDINAL", etc.) are included
    exclude_types (str or Set[str]): remove named entities whose type IS in this param; if "NUMERIC", all numeric entity types ("DATE", "MONEY", "ORDINAL", etc.) are excluded
    drop_determiners (bool): remove leading determiners (e.g. "the") from named entities (e.g. "the United States" => "United States")

        Note: Entities from which a leading determiner has been removed do not keep their entity type annotations. This is irritating but unavoidable, since the only way to re-annotate them is to modify doc directly, and this function is not meant to have any side effects. If you're only using the text of the returned spans, this is no big deal; if you're using NE-like attributes downstream, however, this is something to watch out for.

    min_freq (int): remove named entities that occur in doc fewer than min_freq times

Yields:
    spacy.Span: the next named entity from doc passing all specified filters, in order of appearance in the document

Raises:
    TypeError: if include_types or exclude_types is not a str, a set of str, or a falsy value
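
Example: continuing the session above, filtering to person and organization entities (output indicative, since entity predictions depend on the model):

    >>> doc = nlp("Barack Obama visited Facebook's headquarters in Menlo Park on Tuesday.")
    >>> [(ent.text, ent.label_) for ent in extract.named_entities(doc, include_types={"PERSON", "ORG"})]
    [('Barack Obama', 'PERSON'), ('Facebook', 'ORG')]
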
noun_chunks(doc, drop_determiners=True, min_freq=1)

Extract an ordered sequence of noun chunks from a spacy-parsed doc, optionally filtering by frequency and dropping leading determiners.

Args:
    doc (textacy.Doc or spacy.Doc)
    drop_determiners (bool): remove leading determiners (e.g. "the") from phrases (e.g. "the quick brown fox" => "quick brown fox")
    min_freq (int): remove chunks that occur in doc fewer than min_freq times

Yields:
    spacy.Span: the next noun chunk from doc, in order of appearance in the document
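
Example: continuing the session above (output indicative):

    >>> doc = nlp("The quick brown fox jumps over the lazy dog.")
    >>> [chunk.text for chunk in extract.noun_chunks(doc, drop_determiners=True)]
    ['quick brown fox', 'lazy dog']
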
pos_regex_matches(doc, pattern)

Extract sequences of consecutive tokens from a spacy-parsed doc whose part-of-speech tags match the specified regex pattern.

Args:
    doc (textacy.Doc or spacy.Doc or spacy.Span)
    pattern (str): Pattern of consecutive POS tags whose corresponding words are to be extracted, inspired by the regex patterns used in NLTK's nltk.chunk.regexp. Tags are uppercase, from the universal tag set; they are delimited by < and >, which are basically converted to parentheses with spaces as needed to correctly extract matching word sequences; white space in the input doesn't matter.

        Examples (see constants.POS_REGEX_PATTERNS):

          • noun phrase: r'<DET>? (<NOUN>+ <ADP|CONJ>)* <NOUN>+'
          • compound nouns: r'<NOUN>+'
          • verb phrase: r'<VERB>?<ADV>*<VERB>+'
          • prepositional phrase: r'<PREP> <DET>? (<NOUN>+<ADP>)* <NOUN>+'

Yields:
    spacy.Span: the next span of consecutive tokens from doc whose parts-of-speech match pattern, in order of appearance
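
Example: continuing the session above, using the noun phrase pattern listed above (output indicative, since matches depend on the tagger):

    >>> doc = nlp("She drank a cup of coffee and ate a piece of cake.")
    >>> [span.text for span in extract.pos_regex_matches(doc, r'<DET>? (<NOUN>+ <ADP|CONJ>)* <NOUN>+')]
    ['a cup of coffee', 'a piece of cake']
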
subject_verb_object_triples(doc)

Extract an ordered sequence of subject-verb-object (SVO) triples from a spacy-parsed doc. Note that this only works for SVO languages.

Args:
    doc (textacy.Doc or spacy.Doc or spacy.Span)

Yields:
    Tuple[spacy.Span, spacy.Span, spacy.Span]: the next 3-tuple of spans from doc representing a (subject, verb, object) triple, in order of appearance
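
Example: continuing the session above; the exact spans yielded depend on the parse, so the output shown is indicative:

    >>> doc = nlp("Burton wrote this book. Many people read it.")
    >>> for subj, verb, obj in extract.subject_verb_object_triples(doc):
    ...     print(subj.text, "|", verb.text, "|", obj.text)
    Burton | wrote | book
    people | read | it
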
acronyms_and_definitions(doc, known_acro_defs=None)

Extract a collection of acronyms and their most likely definitions, if available, from a spacy-parsed doc. If multiple definitions are found for a given acronym, only the most frequently occurring definition is returned.

Args:
    doc (textacy.Doc or spacy.Doc or spacy.Span)
    known_acro_defs (dict): if certain acronym/definition pairs are known, pass them in as {acronym (str): definition (str)}; algorithm will not attempt to find new definitions

Returns:
    dict: unique acronyms (keys) with matched definitions (values)

References:
    Taghva, Kazem, and Jeff Gilbreth. "Recognizing acronyms and their definitions." International Journal on Document Analysis and Recognition 1.4 (1999): 191-198.
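
Example: continuing the session above (output indicative):

    >>> doc = nlp("The World Health Organization (WHO) monitors outbreaks. The WHO is based in Geneva.")
    >>> extract.acronyms_and_definitions(doc)
    {'WHO': 'World Health Organization'}
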
_get_acronym_definition(acronym, window, threshold=0.8)

Identify most likely definition for an acronym given a list of tokens.

Args:
    acronym (str): acronym for which definition is sought
    window (spacy.Span): a span of tokens from which definition extraction will be attempted
    threshold (float): minimum "confidence" in definition required for acceptance; valid values in [0.0, 1.0]; higher value => stricter threshold

Returns:
    Tuple[str, float]: most likely definition for given acronym ('' if none found), along with the confidence assigned to it

References:
    Taghva, Kazem, and Jeff Gilbreth. "Recognizing acronyms and their definitions." International Journal on Document Analysis and Recognition 1.4 (1999): 191-198.
semistructured_statements(doc, entity, cue="be", ignore_entity_case=True, min_n_words=1, max_n_words=20)

Extract "semi-structured statements" from a spacy-parsed doc, each as an (entity, cue, fragment) triple. This is similar to subject-verb-object triples.

Args:
    doc (textacy.Doc or spacy.Doc)
    entity (str): a noun or noun phrase of some sort (e.g. "President Obama", "global warming", "Python")
    cue (str): verb lemma with which entity is associated (e.g. "talk about", "have", "write")
    ignore_entity_case (bool): if True, entity matching is case-independent
    min_n_words (int): min number of tokens allowed in a matching fragment
    max_n_words (int): max number of tokens allowed in a matching fragment

Yields:
    (spacy.Span or spacy.Token, spacy.Span or spacy.Token, spacy.Span): where each element is a matching (entity, cue, fragment) triple

Notes:
    Inspired by N. Diakopoulos, A. Zhang, A. Salway. Visual Analytics of Media Frames in Online News and Blogs. IEEE InfoVis Workshop on Text Visualization. October, 2013.

    Which itself was inspired by Salway, A.; Kelly, L.; Skadiņa, I.; and Jones, G. 2010. Portable Extraction of Partially Structured Facts from the Web. In Proc. ICETAL 2010, LNAI 6233, 345-356. Heidelberg, Springer.
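
Example: continuing the session above; the exact spans yielded depend on the parse, so the output shown is indicative:

    >>> doc = nlp("Python is a widely used programming language. Some say Python is slow.")
    >>> for entity, cue, fragment in extract.semistructured_statements(doc, "Python", cue="be"):
    ...     print(entity.text, "|", cue.text, "|", fragment.text)
    Python | is | a widely used programming language
    Python | is | slow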

direct_quotations(doc)

Baseline, not-great attempt at direct quotation extraction (no indirect or mixed quotations) using rules and patterns. English only.

Args:
    doc (textacy.Doc or spacy.Doc)

Yields:
    (spacy.Span, spacy.Token, spacy.Span): the next quotation in doc, represented as a (speaker, reporting verb, quotation) 3-tuple

Notes:
    Loosely inspired by Krestel, Bergler, Witte. "Minding the Source: Automatic Tagging of Reported Speech in Newspaper Articles."

TODO: Better approach would use ML, but needs a training dataset.
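
Example: continuing the session above; quote detection here is rule-based and sensitive to punctuation, so the output shown is indicative:

    >>> doc = nlp('"We will rebuild," the mayor said on Monday.')
    >>> for speaker, verb, quotation in extract.direct_quotations(doc):
    ...     print(speaker.text, "|", verb.text, "|", quotation.text)
    mayor | said | "We will rebuild,"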