Lang, Doc, Corpus

Convenient entry points for making spaCy docs and loading spaCy language pipelines.

textacy.spacier.core.load_spacy_lang(name, disable=None, allow_blank=False)[source]

Load a spaCy Language: a shared vocabulary and language-specific data for tokenizing text, and (if available) model data and a processing pipeline containing a sequence of components for annotating a document. An LRU cache saves languages in memory.

Parameters
  • name (str or pathlib.Path) – spaCy language to load. Could be a shortcut link, a full package name, a path to a model directory, or a 2-letter ISO language code for which spaCy has language data.

  • disable (Tuple[str]) –

    Names of pipeline components to disable, if any.

    Note

    Although spaCy’s API specifies this argument as a list, here we require a tuple. Pipelines are stored in the LRU cache with unique identifiers generated from the hash of the function name and args — and lists aren’t hashable.

  • allow_blank (bool) – If True, allow loading of blank spaCy Language s; if False, raise an OSError if a full processing pipeline isn’t available. Note that spaCy Doc s produced by blank languages are missing key functionality, e.g. POS tags, entities, sentences.

Returns

A loaded spaCy Language.

Return type

spacy.language.Language
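
The tuple requirement noted above is easy to demonstrate with the standard library’s functools.lru_cache, which imposes the same hashability constraint on its arguments (a toy stand-in, not textacy’s actual caching code):

```python
from functools import lru_cache

@lru_cache(maxsize=4)
def fake_load(name, disable=()):
    # stand-in for an expensive pipeline load, cached by its args
    return f"{name}-pipeline(disable={disable})"

fake_load("en_core_web_sm", disable=("ner", "parser"))  # tuple: hashable, caches fine

try:
    fake_load("en_core_web_sm", disable=["ner"])  # list: unhashable
except TypeError as err:
    print("cache key failed:", err)
```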

Raises

OSError – If a full processing pipeline isn’t available for the given name and allow_blank is False.

textacy.spacier.core.make_spacy_doc(data, lang=<bound method LangIdentifier.identify_lang of <textacy.lang_utils.LangIdentifier object>>)[source]

Make a spacy.tokens.Doc from valid inputs, and automatically load/validate spacy.language.Language pipelines to process data.

Make a Doc from text:

>>> text = "To be, or not to be, that is the question."
>>> doc = make_spacy_doc(text)
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'

Make a Doc from a (text, metadata) pair, aka a “record”:

>>> record = (text, {"author": "Shakespeare, William"})
>>> doc = make_spacy_doc(record)
>>> doc._.preview
'Doc(13 tokens: "To be, or not to be, that is the question.")'
>>> doc._.meta
{'author': 'Shakespeare, William'}

Specify the language / Language pipeline used to process the text — or don’t:

>>> make_spacy_doc(text)
>>> make_spacy_doc(text, lang="en")
>>> make_spacy_doc(text, lang="en_core_web_sm")
>>> make_spacy_doc(text, lang=textacy.load_spacy_lang("en"))
>>> make_spacy_doc(text, lang=textacy.lang_utils.identify_lang)

Ensure that an already-processed Doc is compatible with lang:

>>> spacy_lang = textacy.load_spacy_lang("en")
>>> doc = spacy_lang(text)
>>> make_spacy_doc(doc, lang="en")
>>> make_spacy_doc(doc, lang="es")
Traceback (most recent call last):
    ...
ValueError: lang of spacy pipeline used to process document ('en') must be the same as `lang` ('es')
Parameters
  • data (str or Tuple[str, dict] or spacy.tokens.Doc) – Make a spacy.tokens.Doc from a text or (text, metadata) pair. If already a Doc, ensure that it’s compatible with lang to avoid surprises downstream, and return it as-is.

  • lang (str or spacy.language.Language or Callable) –

    Language with which spaCy processes (or processed) data.

    If known, pass a standard 2-letter language code (e.g. “en”), the name of a spaCy language pipeline (e.g. “en_core_web_md”), or an already-instantiated spacy.language.Language object. If not known, pass a function that takes unicode text as input and outputs a standard 2-letter language code.

    A given / detected language string is then used to instantiate a corresponding Language with all default components enabled.

Returns

spacy.tokens.Doc

Raises

ValueError – If data is an already-processed Doc whose language doesn’t match lang.

A class for working with a collection of spaCy docs. Includes functionality for easily adding, getting, and removing documents; saving to / loading their data from disk; and tracking basic corpus statistics.

class textacy.corpus.Corpus(lang, data=None)[source]

An ordered collection of spacy.tokens.Doc, all of the same language and sharing the same spacy.language.Language processing pipeline and vocabulary, with data held in-memory.

Initialize from a language / Language and (optionally) one or a stream of texts or (text, metadata) pairs:

>>> ds = textacy.datasets.CapitolWords()
>>> records = ds.records(limit=50)
>>> corpus = textacy.Corpus("en", data=records)
>>> corpus
Corpus(50 docs, 32175 tokens)

Add or remove documents, with automatic updating of corpus statistics:

>>> texts = ds.texts(congress=114, limit=25)
>>> corpus.add(texts)
>>> corpus.add("If Burton were a member of Congress, here's what he'd say.")
>>> corpus
Corpus(76 docs, 55906 tokens)
>>> corpus.remove(lambda doc: doc._.meta.get("speaker_name") == "Rick Santorum")
>>> corpus
Corpus(61 docs, 48567 tokens)

Get subsets of documents matching your particular use case:

>>> match_func = lambda doc: doc._.meta.get("speaker_name") == "Bernie Sanders"
>>> for doc in corpus.get(match_func, limit=3):
...     print(doc._.preview)
Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")
Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")
Doc(177 tokens: "Mr. Speaker, if we want to understand why in th...")

Get or remove documents by indexing, too:

>>> corpus[0]._.preview
'Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")'
>>> [doc._.preview for doc in corpus[:3]]
['Doc(159 tokens: "Mr. Speaker, 480,000 Federal employees are work...")',
 'Doc(219 tokens: "Mr. Speaker, a relationship, to work and surviv...")',
 'Doc(336 tokens: "Mr. Speaker, I thank the gentleman for yielding...")']
>>> del corpus[:5]
>>> corpus
Corpus(56 docs, 41573 tokens)

Compute basic corpus statistics:

>>> corpus.n_docs, corpus.n_sents, corpus.n_tokens
(56, 1771, 41573)
>>> word_counts = corpus.word_counts(as_strings=True)
>>> sorted(word_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 2553), ('people', 215), ('year', 148), ('Mr.', 139), ('$', 137)]
>>> word_doc_counts = corpus.word_doc_counts(weighting="freq", as_strings=True)
>>> sorted(word_doc_counts.items(), key=lambda x: x[1], reverse=True)[:5]
[('-PRON-', 0.9821428571428571),
 ('Mr.', 0.7678571428571429),
 ('President', 0.5),
 ('people', 0.48214285714285715),
 ('need', 0.44642857142857145)]

Save corpus data to and load from disk:

>>> corpus.save("~/Desktop/capitol_words_sample.bin.gz")
>>> corpus = textacy.Corpus.load("en", "~/Desktop/capitol_words_sample.bin.gz")
>>> corpus
Corpus(56 docs, 41573 tokens)
Parameters
  • lang (str or spacy.language.Language) –

    Language with which spaCy processes (or processed) all documents added to the corpus, whether as data now or later.

    Pass a standard 2-letter language code (e.g. “en”), the name of a spaCy language pipeline (e.g. “en_core_web_md”), or an already-instantiated spacy.language.Language object.

    A given / detected language string is then used to instantiate a corresponding Language with all default components enabled.

  • data (obj or Iterable[obj]) –

    One or a stream of texts, records, or spacy.tokens.Doc s to be added to the corpus.

    See also

    Corpus.add()

lang
Type

str

spacy_lang
Type

spacy.language.Language

docs
Type

List[spacy.tokens.Doc]

n_docs
Type

int

n_sents
Type

int

n_tokens
Type

int

add(data, batch_size=1000)[source]

Add one or a stream of texts, records, or spacy.tokens.Doc s to the corpus, ensuring that all processing is or has already been done by the Corpus.spacy_lang pipeline.

Parameters
  • data (obj or Iterable[obj]) – str or Iterable[str]; Tuple[str, dict] or Iterable[Tuple[str, dict]]; spacy.tokens.Doc or Iterable[spacy.tokens.Doc].

  • batch_size (int) –
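
How add() routes these heterogeneous inputs to the typed add_* methods below can be sketched roughly as follows (a hypothetical simplification with a toy class, not textacy’s actual source):

```python
from typing import Iterable, Tuple, Union

Record = Tuple[str, dict]

class SketchCorpus:
    """Toy stand-in illustrating how Corpus.add might dispatch on input type."""
    def __init__(self):
        self.docs = []

    def add_text(self, text: str):
        # the real Corpus would run text through its spacy_lang pipeline here
        self.docs.append(("doc", text))

    def add_record(self, record: Record):
        text, meta = record
        self.docs.append(("doc", text, meta))

    def add(self, data: Union[str, Record, Iterable]):
        if isinstance(data, str):          # one text
            self.add_text(data)
        elif isinstance(data, tuple):      # one (text, metadata) record
            self.add_record(data)
        else:                              # a stream of either
            for item in data:
                self.add(item)

corpus = SketchCorpus()
corpus.add("one text")
corpus.add(("another text", {"author": "unknown"}))
corpus.add(["a", "stream", "of", "texts"])
print(len(corpus.docs))  # -> 6
```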

add_text(text)[source]

Add one text to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

Parameters

text (str) –

add_texts(texts, batch_size=1000)[source]

Add a stream of texts to the corpus, efficiently processing them into spacy.tokens.Doc s using the Corpus.spacy_lang pipeline.

Parameters
  • texts (Iterable[str]) –

  • batch_size (int) –

add_record(record)[source]

Add one record to the corpus, processing it into a spacy.tokens.Doc using the Corpus.spacy_lang pipeline.

Parameters

record (Tuple[str, dict]) –

add_records(records, batch_size=1000)[source]

Add a stream of records to the corpus, efficiently processing them into spacy.tokens.Doc s using the Corpus.spacy_lang pipeline.

Parameters
  • records (Iterable[Tuple[str, dict]]) –

  • batch_size (int) –

add_doc(doc)[source]

Add one spacy.tokens.Doc to the corpus, provided it was processed using the Corpus.spacy_lang pipeline.

Parameters

doc (spacy.tokens.Doc) –

add_docs(docs)[source]

Add a stream of spacy.tokens.Doc s to the corpus, provided they were processed using the Corpus.spacy_lang pipeline.

Parameters

docs (Iterable[spacy.tokens.Doc]) –

get(match_func, limit=None)[source]

Get all (or N <= limit) docs in Corpus for which match_func(doc) is True.

Parameters
  • match_func (Callable) –

    Function that takes a spacy.tokens.Doc as input and returns a boolean value. For example:

    Corpus.get(lambda x: len(x) >= 100)
    

    gets all docs with at least 100 tokens. And:

    Corpus.get(lambda doc: doc._.meta["author"] == "Burton DeWilde")
    

    gets all docs whose author was given as ‘Burton DeWilde’.

  • limit (int) – Maximum number of matched docs to return.

Yields

spacy.tokens.Doc – Next document passing match_func.

Tip

To get doc(s) by index, treat Corpus as a list and use Python’s usual indexing and slicing: Corpus[0] gets the first document in the corpus; Corpus[:5] gets the first 5; etc.
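
The limit behavior of get() can be sketched with itertools.islice over a lazy filter (a plausible implementation sketch, not necessarily textacy’s):

```python
from itertools import islice

def get(docs, match_func, limit=None):
    # lazily yield docs for which match_func(doc) is True,
    # stopping after `limit` matches if a limit is given
    matches = (doc for doc in docs if match_func(doc))
    return islice(matches, limit) if limit is not None else matches

docs = ["short", "a much longer document", "tiny", "another long document here"]
long_docs = list(get(docs, lambda d: len(d) > 10, limit=1))
print(long_docs)  # -> ['a much longer document']
```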

remove(match_func, limit=None)[source]

Remove all (or N <= limit) docs in Corpus for which match_func(doc) is True. Corpus doc/sent/token counts are adjusted accordingly.

Parameters
  • match_func (Callable) –

    Function that takes a spacy.tokens.Doc and returns a boolean value. For example:

    Corpus.remove(lambda x: len(x) >= 100)
    

    removes docs with at least 100 tokens. And:

    Corpus.remove(lambda doc: doc._.meta["author"] == "Burton DeWilde")
    

    removes docs whose author was given as “Burton DeWilde”.

  • limit (int) – Maximum number of matched docs to remove.

Tip

To remove doc(s) by index, treat Corpus as a list and use Python’s usual indexing and slicing: del Corpus[0] removes the first document in the corpus; del Corpus[:5] removes the first 5; etc.

property vectors

Constituent docs’ word vectors stacked in a 2d array.

property vector_norms

Constituent docs’ L2-normalized word vectors stacked in a 2d array.
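
The L2 normalization behind vector_norms, sketched in plain Python (textacy actually operates on spaCy’s word vectors with numpy, so this is illustration only):

```python
import math

def l2_normalize(vec):
    # scale a vector to unit Euclidean (L2) length; leave zero vectors alone
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

v = l2_normalize([3.0, 4.0])
print(v)  # -> [0.6, 0.8]
```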

word_counts(*, normalize='lemma', weighting='count', as_strings=False, filter_stops=True, filter_punct=True, filter_nums=False)[source]

Map the set of unique words in Corpus to their counts as absolute, relative, or binary frequencies of occurrence, similar to Doc._.to_bag_of_words() but aggregated over all docs.

Parameters
  • normalize (str) – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear.

  • weighting ({"count", "freq"}) –

    Type of weight to assign to words. If “count” (default), weights are the absolute number of occurrences (count) of word in corpus. If “freq”, word counts are normalized by the total token count, giving their relative frequencies of occurrence.

    Note

    The resulting set of frequencies won’t (necessarily) sum to 1.0, since punctuation and stop words are filtered out after counts are normalized.

  • as_strings (bool) – If True, words are returned as strings; if False (default), words are returned as their unique integer ids.

  • filter_stops (bool) – If True (default), stop word counts are removed.

  • filter_punct (bool) – If True (default), punctuation counts are removed.

  • filter_nums (bool) – If True, number counts are removed.

Returns

mapping of a unique word id or string (depending on the value of as_strings) to its absolute, relative, or binary frequency of occurrence (depending on the value of weighting).

Return type

dict
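
The note above — that “freq” weights needn’t sum to 1.0 because filtering happens after normalization — can be seen in a toy example (plain Counter arithmetic, not textacy’s code):

```python
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]
stops = {"the", "on"}

counts = Counter(tokens)
n_tokens = sum(counts.values())  # 6, including stop words

# normalize by the *total* token count first...
freqs = {w: c / n_tokens for w, c in counts.items()}
# ...then filter stop words afterwards
freqs = {w: f for w, f in freqs.items() if w not in stops}

print(round(sum(freqs.values()), 3))  # -> 0.5, not 1.0
```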

word_doc_counts(*, normalize='lemma', weighting='count', smooth_idf=True, as_strings=False, filter_stops=True, filter_punct=True, filter_nums=True)[source]

Map the set of unique words in Corpus to their document counts as absolute, relative, inverse, or binary frequencies of occurrence.

Parameters
  • normalize (str) – If “lemma”, lemmatize words before counting; if “lower”, lowercase words before counting; otherwise, words are counted using the form with which they appear.

  • weighting ({"count", "freq", "idf"}) – Type of weight to assign to words. If “count” (default), weights are the absolute number (count) of documents in which word appears. If “freq”, word doc counts are normalized by the total document count, giving their relative frequencies of occurrence. If “idf”, weights are the log of the inverse relative frequencies: log(n_docs / word_doc_count) or (if smooth_idf is True) log(1 + (n_docs / word_doc_count)) .

  • smooth_idf (bool) – If True, add 1 to all word doc counts when calculating “idf” weighting, equivalent to adding a single document to the corpus containing every unique word.

  • as_strings (bool) – If True, words are returned as strings; if False (default), words are returned as their unique integer ids

  • filter_stops (bool) – If True (default), stop word counts are removed.

  • filter_punct (bool) – If True (default), punctuation counts are removed.

  • filter_nums (bool) – If True, number counts are removed.

Returns

mapping of a unique word id or string (depending on the value of as_strings) to the number of documents in which it appears weighted as absolute, relative, or binary frequencies (depending on the value of weighting).

Return type

dict
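
The two “idf” variants described for weighting can be computed directly; the numbers here are purely illustrative:

```python
import math

n_docs = 56
word_doc_count = 43  # e.g. the number of docs containing some word

idf = math.log(n_docs / word_doc_count)           # weighting="idf", smooth_idf=False
smoothed = math.log(1 + n_docs / word_doc_count)  # weighting="idf", smooth_idf=True
print(idf < smoothed)  # -> True; smoothing always increases the weight
```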

save(filepath)[source]

Save Corpus to disk as binary data.

Parameters

filepath (str) – Full path to file on disk where Corpus data will be saved as a binary file.

See also

Corpus.load()

classmethod load(lang, filepath)[source]

Load previously saved Corpus binary data, reproduce the original spacy.tokens.Doc s’ tokens and annotations, and instantiate a new Corpus from them.

Parameters
  • lang (str or spacy.language.Language) –

  • filepath (str) – Full path to file on disk where Corpus data was previously saved as a binary file.

Returns

Corpus

See also

Corpus.save()