spaCy extensions

Doc extensions

Functionality for inspecting, customizing, and transforming spaCy’s core data structure, spacy.tokens.Doc, accessible either as functions that take a Doc as their first argument or as custom attributes and methods on instantiated docs, prefixed by an underscore:

>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("This is a short text.")
>>> print(get_preview(doc))
Doc(6 tokens: "This is a short text.")
>>> print(doc._.preview)
Doc(6 tokens: "This is a short text.")
textacy.spacier.doc_extensions.set_doc_extensions()[source]

Set textacy’s custom property and method doc extensions on the global spacy.tokens.Doc.

textacy.spacier.doc_extensions.get_doc_extensions()[source]

Get textacy’s custom property and method doc extensions that can be set on or removed from the global spacy.tokens.Doc.

textacy.spacier.doc_extensions.remove_doc_extensions()[source]

Remove textacy’s custom property and method doc extensions from the global spacy.tokens.Doc.
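
For illustration, a minimal sketch of registering and unregistering the extensions globally (the import path follows the module name above; no output is shown because the exact set of extensions may vary across textacy versions):

>>> from textacy.spacier import doc_extensions
>>> doc_extensions.set_doc_extensions()         # register textacy's custom extensions on Doc
>>> exts = doc_extensions.get_doc_extensions()  # the available extensions, keyed by name
>>> doc_extensions.remove_doc_extensions()      # unregister them again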

textacy.spacier.doc_extensions.get_lang(doc)[source]

Get the standard, two-letter language code assigned to Doc and its associated spacy.vocab.Vocab.

Parameters

doc (spacy.tokens.Doc) –

Returns

str

textacy.spacier.doc_extensions.get_preview(doc)[source]

Get a short preview of the Doc, including the number of tokens and an initial snippet.

Parameters

doc (spacy.tokens.Doc) –

Returns

str

textacy.spacier.doc_extensions.get_tokens(doc)[source]

Yield the tokens in Doc, one at a time.

Parameters

doc (spacy.tokens.Doc) –

Yields

spacy.tokens.Token

textacy.spacier.doc_extensions.get_n_tokens(doc)[source]

Get the number of tokens (including punctuation) in Doc.

Parameters

doc (spacy.tokens.Doc) –

Returns

int

textacy.spacier.doc_extensions.get_n_sents(doc)[source]

Get the number of sentences in Doc.

Parameters

doc (spacy.tokens.Doc) –

Returns

int
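
A short sketch exercising these accessors together (the two-sentence example text is an assumption, and the counts shown assume the default English model’s tokenization and sentence segmentation):

>>> import textacy
>>> from textacy.spacier.doc_extensions import get_lang, get_tokens, get_n_tokens, get_n_sents
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("This is a short text. It has two sentences.")
>>> get_lang(doc)
'en'
>>> get_n_tokens(doc)
11
>>> get_n_sents(doc)
2
>>> next(get_tokens(doc))
This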

textacy.spacier.doc_extensions.get_meta(doc)[source]

Get custom metadata added to Doc.

Parameters

doc (spacy.tokens.Doc) –

Returns

dict

textacy.spacier.doc_extensions.set_meta(doc, value)[source]

Add custom metadata to Doc.

Parameters
  • doc (spacy.tokens.Doc) –

  • value (dict) –
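
For example, reusing the doc created just above (a sketch; the equivalent underscore attribute is assumed to be doc._.meta, mirroring the naming shown in the section intro):

>>> from textacy.spacier.doc_extensions import get_meta, set_meta
>>> set_meta(doc, {"title": "A Short Text", "pub_year": 2019})
>>> get_meta(doc)
{'title': 'A Short Text', 'pub_year': 2019}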

textacy.spacier.doc_extensions.to_tokenized_text(doc)[source]

Transform Doc into an ordered, nested list of token-texts per sentence.

Parameters

doc (spacy.tokens.Doc) –

Returns

List[List[str]]

Note

If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.
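
For example, reusing the two-sentence doc from above (the exact nesting shown assumes the default English tokenizer and sentence segmenter):

>>> from textacy.spacier.doc_extensions import to_tokenized_text
>>> to_tokenized_text(doc)
[['This', 'is', 'a', 'short', 'text', '.'], ['It', 'has', 'two', 'sentences', '.']]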

textacy.spacier.doc_extensions.to_tagged_text(doc)[source]

Transform Doc into an ordered, nested list of (token-text, part-of-speech tag) pairs per sentence.

Parameters

doc (spacy.tokens.Doc) –

Returns

List[List[Tuple[str, str]]]

Note

If doc hasn’t been segmented into sentences, the entire document is treated as a single sentence.
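
A minimal sketch (no output is shown because the exact tag strings depend on the loaded model and its tag scheme):

>>> from textacy.spacier.doc_extensions import to_tagged_text
>>> tagged_sents = to_tagged_text(doc)
>>> first_pair = tagged_sents[0][0]  # ("This", <part-of-speech tag>) from the first sentence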

textacy.spacier.doc_extensions.to_terms_list(doc, *, ngrams=(1, 2, 3), entities=True, normalize='lemma', as_strings=False, **kwargs)[source]

Transform Doc into a sequence of ngrams and/or entities — not necessarily in order of appearance — where each appears in the sequence as many times as it appears in Doc.

Parameters
  • doc (spacy.tokens.Doc) –

  • ngrams (int or Set[int] or None) – ngrams to include in the terms list. If {1, 2, 3}, unigrams, bigrams, and trigrams are included; if 2, only bigrams are included; if None, ngrams aren’t included, except for those belonging to named entities.

  • entities (bool or None) –

    If True, entities are included in the terms list; if False, they are excluded from the list; if None, entities aren’t included or excluded at all.

    Note

    When both entities and ngrams are non-null, exact duplicates (based on their start and end indexes) are resolved as follows: if entities is True, duplicate entities are kept and the overlapping ngrams are discarded to avoid double-counting; if entities is False, no entities are included and duplicate ngrams are likewise discarded.

  • normalize (str or callable) – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if callable, must accept a Token or Span and return a str, e.g. get_normalized_text().

  • as_strings (bool) – If True, terms are returned as strings; if False, terms are returned as their unique integer ids.

  • kwargs

    • filter_stops (bool)

    • filter_punct (bool)

    • filter_nums (bool)

    • include_pos (str or Set[str])

    • exclude_pos (str or Set[str])

    • min_freq (int)

    • include_types (str or Set[str])

    • exclude_types (str or Set[str])

    • drop_determiners (bool)

    See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.

Yields

int or str – the next term in the terms list, either as a unique integer id or as a string

Raises

ValueError – if neither entities nor ngrams are included, or if entities or normalize have invalid values

Note

Despite the name, this is a generator function; to get an actual list of terms, call list(to_terms_list(doc)).
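
A minimal sketch, materializing the generator with list() as the note advises (the example text is an assumption, and the resulting terms depend on the model’s lemmatization and entity recognition):

>>> import textacy
>>> from textacy.spacier.doc_extensions import to_terms_list
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("Burton DeWilde wrote textacy while working in Chicago.")
>>> terms = list(to_terms_list(doc, ngrams=1, entities=True, normalize="lemma", as_strings=True))
>>> term_ids = list(to_terms_list(doc, ngrams=1, entities=True))  # unique integer ids by default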

textacy.spacier.doc_extensions.to_bag_of_terms(doc, *, ngrams=(1, 2, 3), entities=True, normalize='lemma', weighting='count', as_strings=False, **kwargs)[source]

Transform Doc into a bag-of-terms: the set of unique terms in Doc mapped to their frequency of occurrence, where “terms” includes ngrams and/or entities.

Parameters
  • doc (spacy.tokens.Doc) –

  • ngrams (int or Set[int]) – n of the n-grams to include; (1, 2, 3) (default) includes unigrams (words), bigrams, and trigrams; 2 includes only bigrams; falsy (e.g. False) includes none

  • entities (bool) – If True (default), include named entities; note: if ngrams are also included, any ngrams that exactly overlap with an entity are skipped to prevent double-counting

  • normalize (str or callable) – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span and return a str, e.g. textacy.spacier.utils.get_normalized_text().

  • weighting ({"count", "freq", "binary"}) – Type of weight to assign to terms. If “count” (default), weights are the absolute number of occurrences (count) of term in doc. If “binary”, all counts are set equal to 1. If “freq”, term counts are normalized by the total token count, giving their relative frequency of occurrence.

  • as_strings (bool) – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • kwargs

    • filter_stops (bool)

    • filter_punct (bool)

    • filter_nums (bool)

    • include_pos (str or Set[str])

    • exclude_pos (str or Set[str])

    • min_freq (int)

    • include_types (str or Set[str])

    • exclude_types (str or Set[str])

    • drop_determiners (bool)

    See textacy.extract.words(), textacy.extract.ngrams(), and textacy.extract.entities() for details.

Returns

mapping of a unique term id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).

Return type

dict

See also

to_terms_list(), which is used under the hood.
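
For example, reusing the doc from the to_terms_list() sketch above (the exact keys are assumptions that depend on the model’s lemmatization and entity recognition):

>>> from textacy.spacier.doc_extensions import to_bag_of_terms
>>> bot = to_bag_of_terms(doc, ngrams=(1, 2), entities=True, weighting="count", as_strings=True)
>>> # bot maps each normalized term string (unigrams, bigrams, and entities) to its count in doc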

textacy.spacier.doc_extensions.to_bag_of_words(doc, *, normalize='lemma', weighting='count', as_strings=False, filter_stops=True, filter_punct=True, filter_nums=False)[source]

Transform Doc into a bag-of-words: the set of unique words in Doc mapped to their absolute, relative, or binary frequency of occurrence.

Parameters
  • doc (spacy.tokens.Doc) –

  • normalize (str) – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear in doc

  • weighting ({"count", "freq", "binary"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number of occurrences (count) of word in doc. If “binary”, all counts are set equal to 1. If “freq”, word counts are normalized by the total token count, giving their relative frequency of occurrence. Note: The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.

  • as_strings (bool) – If True, words are returned as strings; if False (default), words are returned as their unique integer ids

  • filter_stops (bool) – If True (default), stop words are removed after counting.

  • filter_punct (bool) – If True (default), punctuation tokens are removed after counting.

  • filter_nums (bool) – If True, tokens consisting of digits are removed after counting.

Returns

mapping of a unique word id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).

Return type

dict
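
For example, a sketch reusing the doc from the examples above (with weighting=“freq”, values are fractions of the total token count, and stop words and punctuation are removed after normalization as described above):

>>> from textacy.spacier.doc_extensions import to_bag_of_words
>>> bow = to_bag_of_words(doc, normalize="lemma", weighting="freq", as_strings=True)
>>> # bow maps each word's lemma to its relative frequency of occurrence in doc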

textacy.spacier.doc_extensions.to_semantic_network(doc, *, nodes='words', normalize='lemma', edge_weighting='default', window_width=10)[source]

Transform Doc into a semantic network, where nodes are either “words” or “sents” and edges between nodes may be weighted in different ways.

Parameters
  • doc (spacy.tokens.Doc) –

  • nodes ({"words", "sents"}) – Type of doc component to use as nodes in the semantic network.

  • normalize (str or callable) – If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in doc; if a callable, must accept a Token or Span (if nodes = “words” or “sents”, respectively) and return a str, e.g. get_normalized_text()

  • edge_weighting (str) – Type of weighting to apply to edges between nodes; if nodes = “words”, options are {“cooc_freq”, “binary”}; if nodes = “sents”, options are {“cosine”, “jaccard”}; if “default”, “cooc_freq” or “cosine” is used automatically, depending on nodes.

  • window_width (int) – Size of sliding window over terms that determines which are said to co-occur; only applicable if nodes = “words”.

Returns

networkx graph whose nodes represent either terms or sentences in doc and whose edges represent the relationships between them.

Return type

networkx.Graph

Raises

ValueError – If nodes is neither “words” nor “sents”.
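
A minimal sketch for word nodes (the window width of 5 is an arbitrary choice for illustration):

>>> from textacy.spacier.doc_extensions import to_semantic_network
>>> graph = to_semantic_network(doc, nodes="words", edge_weighting="cooc_freq", window_width=5)
>>> n_nodes = graph.number_of_nodes()  # networkx.Graph of words linked by co-occurrence counts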

Pipeline Components

Custom components to add to a spaCy language pipeline.

class textacy.spacier.components.TextStatsComponent(attrs=None)[source]

A custom component to be added to a spaCy language pipeline that computes one, some, or all text stats for a parsed doc and sets the values as custom attributes on a spacy.tokens.Doc.

Add the component to a pipeline, after the parser (as well as any subsequent components that modify the tokens/sentences of the doc):

>>> en = spacy.load('en')
>>> text_stats_component = TextStatsComponent()
>>> en.add_pipe(text_stats_component, after='parser')

Process a text with the pipeline and access the custom attributes via spaCy’s underscore syntax:

>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
6
>>> doc._.flesch_reading_ease
73.84500000000001

Specify which attributes of the textacy.text_stats.TextStats() to add to processed documents:

>>> en = spacy.load('en')
>>> text_stats_component = TextStatsComponent(attrs='n_words')
>>> en.add_pipe(text_stats_component, last=True)
>>> doc = en(u"This is a test test someverylongword.")
>>> doc._.n_words
6
>>> doc._.flesch_reading_ease
AttributeError: [E046] Can't retrieve unregistered extension attribute 'flesch_reading_ease'. Did you forget to call the `set_extension` method?
Parameters

attrs (str or Iterable[str] or None) – If str, a single text stat to compute and set on a Doc. If Iterable[str], multiple text stats. If None, all text stats are computed and set as extensions.

name

Default name of this component in a spaCy language pipeline, used to get and modify the component via various spacy.Language methods, e.g. https://spacy.io/api/language#get_pipe.

Type

str

spaCy Utils

Helper functions for working with / extending spaCy’s core functionality.

textacy.spacier.utils.make_doc_from_text_chunks(text, lang, chunk_size=100000)[source]

Make a single spaCy-processed document from 1 or more chunks of text. This is a workaround for processing very long texts, for which spaCy is unable to allocate enough RAM.

Although this function’s performance is pretty good, it’s inherently less performant than just processing the entire text in one shot. Only use it if necessary!

Parameters
  • text (str) – Text document to be chunked and processed by spaCy.

  • lang (str or spacy.Language) – A 2-letter language code (e.g. “en”), the name of a spaCy model for the desired language, or an already-instantiated spaCy language pipeline.

  • chunk_size (int) –

    Number of characters comprising each text chunk (excluding the last chunk, which is probably smaller). For best performance, value should be somewhere between 1e3 and 1e7, depending on how much RAM you have available.

    Note

    Since chunking is done by character, chunk edges probably won’t respect natural-language segmentation, which means that roughly every chunk_size characters, spaCy is likely to get tripped up and make parsing errors.

Returns

A single processed document, initialized from components accumulated chunk by chunk.

Return type

spacy.tokens.Doc
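
A sketch of the intended usage (the repeated sentence just stands in for a genuinely long text; in practice this is only worthwhile for documents too large to process in one call):

>>> import textacy
>>> from textacy.spacier.utils import make_doc_from_text_chunks
>>> long_text = "This sentence is repeated to simulate a very long document. " * 20000
>>> doc = make_doc_from_text_chunks(long_text, lang="en", chunk_size=100000)
>>> n_tokens = len(doc)  # a single spacy.tokens.Doc assembled chunk by chunk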

textacy.spacier.utils.merge_spans(spans, doc)[source]

Merge spans into single tokens in doc, in-place.

Parameters
  • spans (Iterable[spacy.tokens.Span]) –

  • doc (spacy.tokens.Doc) –
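
For example (a sketch; it assumes the loaded English model detects “the World Health Organization” as an entity span to merge):

>>> import textacy
>>> from textacy.spacier.utils import merge_spans
>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang("She works at the World Health Organization in Geneva.")
>>> merge_spans(doc.ents, doc)  # each detected entity span becomes a single token, in-place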

textacy.spacier.utils.preserve_case(token)[source]

Return True if token is a proper noun or acronym; otherwise, False.

Parameters

token (spacy.tokens.Token) –

Returns

bool

Raises

ValueError – If parent document has not been POS-tagged.

textacy.spacier.utils.get_normalized_text(span_or_token)[source]

Get the text of a spaCy span or token, normalized depending on its characteristics. For proper nouns and acronyms, text is returned as-is; for everything else, text is lemmatized.

Parameters

span_or_token (spacy.tokens.Span or spacy.tokens.Token) –

Returns

str
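
A small sketch covering preserve_case() and get_normalized_text() together, reusing the spacy_lang pipeline loaded above (whether “Apple” is kept as-is depends on the model tagging it as a proper noun):

>>> from textacy.spacier.utils import preserve_case, get_normalized_text
>>> doc = spacy_lang("Apple released several new iPhones yesterday.")
>>> keep = preserve_case(doc[0])  # True if "Apple" was tagged as a proper noun
>>> normalized = [get_normalized_text(tok) for tok in doc]  # proper nouns/acronyms as-is, others lemmatized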

textacy.spacier.utils.get_main_verbs_of_sent(sent)[source]

Return the main (non-auxiliary) verbs in a sentence.

textacy.spacier.utils.get_subjects_of_verb(verb)[source]

Return all subjects of a verb according to the dependency parse.

textacy.spacier.utils.get_objects_of_verb(verb)[source]

Return all objects of a verb according to the dependency parse, including open clausal complements.

textacy.spacier.utils.get_span_for_compound_noun(noun)[source]

Return document indexes spanning all (adjacent) tokens in a compound noun.

textacy.spacier.utils.get_span_for_verb_auxiliaries(verb)[source]

Return document indexes spanning all (adjacent) tokens around a verb that are auxiliary verbs or negations.
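
A combined sketch of these dependency-parse helpers (the example sentence and the commented results are assumptions; actual output depends on the parse):

>>> from textacy.spacier.utils import (
...     get_main_verbs_of_sent, get_subjects_of_verb, get_objects_of_verb,
...     get_span_for_verb_auxiliaries)
>>> doc = spacy_lang("The committee approved the budget after a long debate.")
>>> sent = next(doc.sents)
>>> main_verbs = get_main_verbs_of_sent(sent)        # e.g. [approved]
>>> subjects = get_subjects_of_verb(main_verbs[0])   # e.g. [committee]
>>> objects = get_objects_of_verb(main_verbs[0])     # e.g. [budget]
>>> verb_span_idxs = get_span_for_verb_auxiliaries(main_verbs[0])  # document indexes bounding the verb and its auxiliaries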