Data Augmentation

class textacy.augmentation.augmenter.Augmenter(transforms, *, num=None)[source]

Randomly apply one or many data augmentation transforms to spaCy Docs to produce new docs with additional variety and/or noise in the data.

Initialize an Augmenter with multiple transforms, and customize the randomization of their selection when applying to a document:

>>> import textacy
>>> from textacy.augmentation import transforms
>>> from textacy.augmentation.augmenter import Augmenter
>>> tfs = [transforms.delete_words, transforms.swap_chars, transforms.delete_chars]
>>> Augmenter(tfs, num=None)  # all tfs applied each time
>>> Augmenter(tfs, num=1)  # one randomly-selected tf applied each time
>>> Augmenter(tfs, num=0.5)  # tfs randomly selected with 50% prob each time
>>> augmenter = Augmenter(tfs, num=[0.4, 0.8, 0.6])  # tfs randomly selected with 40%, 80%, 60% probs, respectively, each time

Apply transforms to a given Doc to produce new documents:

>>> text = "The quick brown fox jumps over the lazy dog."
>>> doc = textacy.make_spacy_doc(text, lang="en")
>>> augmenter.apply_transforms(doc)
The quick brown ox jupms over the lazy dog.
>>> augmenter.apply_transforms(doc)
The quikc brown fox over the lazy dog.
>>> augmenter.apply_transforms(doc)
quick brown fox jumps over teh lazy dog.

Parameters for individual transforms may be specified when initializing Augmenter or, if necessary, when applying to individual documents:

>>> from functools import partial
>>> tfs = [partial(transforms.delete_words, num=3), transforms.swap_chars]
>>> augmenter = Augmenter(tfs)
>>> augmenter.apply_transforms(doc)
brown fox jumps over layz dog.
>>> augmenter.apply_transforms(doc, lang=doc.lang)  # (not actually needed for these tfs)
quick brown fox over teh lazy.

Parameters
  • transforms (Sequence[Callable]) –

    Ordered sequence of callables that must take List[AugTok] as their first positional argument and return another List[AugTok].

    Note

    Although the particular transforms applied may vary doc-by-doc, they are applied in order as listed here. Since some transforms may clobber text in a way that makes other transforms less effective, a stable ordering can improve the quality of augmented data.

  • num (int or float or List[float]) – If int, number of transforms to randomly select from transforms each time Augmenter.apply_transforms() is called. If float, probability that any given transform will be selected. If List[float], the probability that the corresponding transform in transforms will be selected (num and transforms must be the same length). If None (default), num is set to len(transforms), which means that every transform is applied each time.
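
The four accepted forms of num can be sketched in plain Python as follows. This is an illustrative stand-in rather than textacy's implementation: select_transforms is a hypothetical helper, the transforms are represented by plain strings, and the rng argument exists only to make the example reproducible.

```python
import random
from typing import List, Optional, Sequence, Union


def select_transforms(
    transforms: Sequence[str],
    num: Optional[Union[int, float, List[float]]] = None,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Hypothetical helper: pick a subset of ``transforms``, preserving their order."""
    rng = rng if rng is not None else random.Random()
    if num is None:
        # default: every transform is selected
        return list(transforms)
    if isinstance(num, int):
        # int: choose `num` transforms at random, then restore the listed order
        idxs = sorted(rng.sample(range(len(transforms)), min(num, len(transforms))))
        return [transforms[i] for i in idxs]
    if isinstance(num, float):
        # float: the same selection probability for every transform
        probs = [num] * len(transforms)
    else:
        # list of floats: one selection probability per transform
        probs = list(num)
        if len(probs) != len(transforms):
            raise ValueError("num and transforms must be the same length")
    return [tf for tf, p in zip(transforms, probs) if rng.random() < p]


tfs = ["delete_words", "swap_chars", "delete_chars"]
rng = random.Random(0)
all_selected = select_transforms(tfs, None, rng)
one_selected = select_transforms(tfs, 1, rng)
per_tf_selected = select_transforms(tfs, [1.0, 0.0, 1.0], rng)
```

Whichever form num takes, the selected transforms keep the order in which they were listed, which matches the ordering note above.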

See also

A collection of good, general-purpose transforms is implemented in textacy.augmentation.transforms.

apply_transforms(doc, **kwargs)[source]

Sequentially apply some subset of data augmentation transforms to doc, then return a new Doc created from the augmented text.

Parameters
  • doc (spacy.tokens.Doc) –

  • **kwargs – If, for whatever reason, you have to pass keyword argument values into transforms that vary or depend on characteristics of doc, specify them here. The transforms’ call signatures will be inspected, and values will be passed along, as needed.
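
One way such signature inspection could work is sketched below with the standard library's inspect module. The helper call_with_accepted_kwargs and both stand-in transforms are hypothetical, and they operate on plain strings instead of AugTok objects; textacy's own mechanism may differ.

```python
import inspect
from typing import Any, Callable, List


def call_with_accepted_kwargs(func: Callable, toks: List[str], **kwargs: Any) -> List[str]:
    """Hypothetical helper: pass along only the keyword arguments ``func`` declares."""
    params = inspect.signature(func).parameters
    accepted = {name: value for name, value in kwargs.items() if name in params}
    return func(toks, **accepted)


def needs_lang(toks: List[str], *, lang: str = "??") -> List[str]:
    # stand-in transform that accepts a ``lang`` keyword argument
    return [f"{tok}/{lang}" for tok in toks]


def no_extras(toks: List[str]) -> List[str]:
    # stand-in transform that accepts no keyword arguments
    return list(toks)


with_lang = call_with_accepted_kwargs(needs_lang, ["fox"], lang="en")
without_lang = call_with_accepted_kwargs(no_extras, ["fox"], lang="en")
```

Filtering by declared parameter names means a keyword such as lang reaches only the transforms that accept it, and the others are called unchanged.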

Returns

spacy.tokens.Doc

textacy.augmentation.transforms.substitute_word_synonyms(aug_toks, *, num=1, pos=None)[source]

Randomly substitute words for which synonyms are available with a randomly selected synonym, up to num times or with a probability of num.

Parameters
  • aug_toks (List[AugTok]) – Sequence of tokens to augment through synonym substitution.

  • num (int or float) – If int, maximum number of words with available synonyms to substitute with a randomly selected synonym; if float, probability that a given word with synonyms will be substituted.

  • pos (str or Set[str]) – Part of speech tag(s) of words to be considered for augmentation. If None, all words with synonyms are considered.

Returns

New, augmented sequence of tokens.

Return type

List[AugTok]

Note

This transform requires textacy.resources.ConceptNet to be downloaded to work properly, since this is the data source for word synonyms to be substituted.
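
The int form of num and the pos filter can be illustrated without ConceptNet. The sketch below is a simplified stand-in, not textacy's code: Tok is a local namedtuple with the same field names as AugTok, the synonyms are supplied by hand, and substitute_synonyms_sketch is a hypothetical name.

```python
import random
from collections import namedtuple
from typing import List, Optional, Set, Union

# stand-in with the same field names as AugTok; not imported from textacy
Tok = namedtuple("Tok", ["text", "ws", "pos", "is_word", "syns"])


def substitute_synonyms_sketch(
    toks: List[Tok],
    *,
    num: int = 1,
    pos: Optional[Union[str, Set[str]]] = None,
    rng: Optional[random.Random] = None,
) -> List[Tok]:
    """Hypothetical sketch of the int-``num`` case: replace up to ``num`` words."""
    rng = rng if rng is not None else random.Random()
    allowed = {pos} if isinstance(pos, str) else pos
    # candidates are words that have synonyms and, if requested, a matching POS tag
    candidates = [
        i for i, tok in enumerate(toks)
        if tok.is_word and tok.syns and (allowed is None or tok.pos in allowed)
    ]
    chosen = set(rng.sample(candidates, min(num, len(candidates))))
    return [
        tok._replace(text=rng.choice(tok.syns)) if i in chosen else tok
        for i, tok in enumerate(toks)
    ]


toks = [
    Tok("quick", " ", "ADJ", True, ["fast"]),
    Tok("fox", " ", "NOUN", True, []),
    Tok("jumps", "", "VERB", True, ["leaps"]),
]
result = substitute_synonyms_sketch(toks, num=2, pos="VERB", rng=random.Random(0))
texts = [tok.text for tok in result]
```

Here only "jumps" qualifies: "quick" has a synonym but the wrong part of speech, and "fox" has no synonyms, so num=2 acts as an upper bound rather than a guarantee.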

textacy.augmentation.transforms.insert_word_synonyms(aug_toks, *, num=1, pos=None)[source]

Randomly insert random synonyms of tokens for which synonyms are available, up to num times or with a probability of num.

Parameters
  • aug_toks (List[AugTok]) – Sequence of tokens to augment through synonym insertion.

  • num (int or float) – If int, maximum number of words with available synonyms from which a random synonym is selected and randomly inserted; if float, probability that a given word with synonyms will provide a synonym to be inserted.

  • pos (str or Set[str]) – Part of speech tag(s) of words to be considered for augmentation. If None, all words with synonyms are considered.

Returns

New, augmented sequence of tokens.

Return type

List[AugTok]

Note

This transform requires textacy.resources.ConceptNet to be downloaded to work properly, since this is the data source for word synonyms to be inserted.

textacy.augmentation.transforms.swap_words(aug_toks, *, num=1, pos=None)[source]

Randomly swap the positions of two adjacent words, up to num times or with a probability of num.

Parameters
  • aug_toks (List[AugTok]) – Sequence of tokens to augment through position swapping.

  • num (int or float) – If int, maximum number of adjacent word pairs to swap; if float, probability that a given word pair will be swapped.

  • pos (str or Set[str]) – Part of speech tag(s) of words to be considered for augmentation. If None, all words are considered.

Returns

New, augmented sequence of tokens.

Return type

List[AugTok]

textacy.augmentation.transforms.delete_words(aug_toks, *, num=1, pos=None)[source]

Randomly delete words, up to num times or with a probability of num.

Parameters
  • aug_toks (List[AugTok]) – Sequence of tokens to augment through word deletion.

  • num (int or float) – If int, maximum number of words to delete; if float, probability that a given word will be deleted.

  • pos (str or Set[str]) – Part of speech tag(s) of words to be considered for augmentation. If None, all words are considered.

Returns

New, augmented sequence of tokens.

Return type

List[AugTok]
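
The two interpretations of num shared by these transforms (an int as a maximum count, a float as a per-word probability) can be sketched on plain strings. This is an assumption-laden simplification: delete_words_sketch is a hypothetical name, and the sketch ignores pos filtering and the word/non-word distinction.

```python
import random
from typing import List, Optional, Union


def delete_words_sketch(
    words: List[str],
    *,
    num: Union[int, float] = 1,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Hypothetical sketch of the two interpretations of ``num``."""
    rng = rng if rng is not None else random.Random()
    if isinstance(num, int):
        # int: delete at most `num` words, chosen at random
        doomed = set(rng.sample(range(len(words)), min(num, len(words))))
    else:
        # float: each word is deleted independently with probability `num`
        doomed = {i for i in range(len(words)) if rng.random() < num}
    return [word for i, word in enumerate(words) if i not in doomed]


words = "The quick brown fox jumps over the lazy dog".split()
rng = random.Random(0)
fewer_by_count = delete_words_sketch(words, num=3, rng=rng)
fewer_by_prob = delete_words_sketch(words, num=0.5, rng=rng)
```

With an int the number of deletions is bounded; with a float it varies from call to call, so longer inputs lose proportionally more words.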

textacy.augmentation.transforms.substitute_chars(aug_toks, *, num=1, lang=None)[source]

Randomly substitute a single character in randomly-selected words with another, up to num times or with a probability of num.

Parameters
  • aug_toks (List[AugTok]) – Sequence of tokens to augment through character substitution.

  • num (int or float) – If int, maximum number of words to modify with a random character substitution; if float, probability that a given word will be modified.

  • lang (str) – Standard, two-letter language code corresponding to aug_toks. Used to load a weighted distribution of language-appropriate characters that are randomly selected for substitution. More common characters are more likely to be chosen as substitutes. If not specified, ASCII letters and digits are randomly selected with equal probability.

Returns

New, augmented sequence of tokens.

Return type

List[AugTok]

Note

This transform requires textacy.datasets.UDHR to be downloaded to work properly, since this is the data source for character weights when deciding which char(s) to substitute.

textacy.augmentation.transforms.insert_chars(aug_toks, *, num=1, lang=None)[source]

Randomly insert a character into randomly-selected words, up to num times or with a probability of num.

Parameters
  • aug_toks (List[AugTok]) – Sequence of tokens to augment through character insertion.

  • num (int or float) – If int, maximum number of words to modify with a random character insertion; if float, probability that a given word will be modified.

  • lang (str) – Standard, two-letter language code corresponding to aug_toks. Used to load a weighted distribution of language-appropriate characters that are randomly selected for insertion. More common characters are more likely to be inserted. If not specified, ASCII letters and digits are randomly selected with equal probability.

Returns

New, augmented sequence of tokens.

Return type

List[AugTok]

Note

This transform requires textacy.datasets.UDHR to be downloaded to work properly, since this is the data source for character weights when deciding which char(s) to insert.

textacy.augmentation.transforms.swap_chars(aug_toks, *, num=1)[source]

Randomly swap two adjacent characters in randomly-selected words, up to num times or with a probability of num.

Parameters
  • aug_toks (List[AugTok]) – Sequence of tokens to augment through character swapping.

  • num (int or float) – If int, maximum number of words to modify with a random character swap; if float, probability that a given word will be modified.

Returns

New, augmented sequence of tokens.

Return type

List[AugTok]
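
The character-level operation itself is small. As a minimal sketch, assuming a word is a plain string and using a hypothetical helper named swap_adjacent_chars, one adjacent pair could be swapped like this:

```python
import random
from typing import Optional


def swap_adjacent_chars(word: str, rng: Optional[random.Random] = None) -> str:
    """Hypothetical helper: swap one randomly chosen pair of adjacent characters."""
    if len(word) < 2:
        return word  # nothing to swap
    rng = rng if rng is not None else random.Random()
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


swapped = swap_adjacent_chars("jumps", random.Random(0))
```

The result always contains the same characters as the input, which is why outputs such as "jupms" or "teh" in the examples above still resemble the original words.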

textacy.augmentation.transforms.delete_chars(aug_toks, *, num=1)[source]

Randomly delete a character in randomly-selected words, up to num times or with a probability of num.

Parameters
  • aug_toks (List[AugTok]) – Sequence of tokens to augment through character deletion.

  • num (int or float) – If int, maximum number of words to modify with a random character deletion; if float, probability that a given word will be modified.

Returns

New, augmented sequence of tokens.

Return type

List[AugTok]

class textacy.augmentation.utils.AugTok(text, ws, pos, is_word, syns)

tuple: Minimal token data required for data augmentation transforms.

property is_word

Alias for field number 3

property pos

Alias for field number 2

property syns

Alias for field number 4

property text

Alias for field number 0

property ws

Alias for field number 1
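
Because AugTok is a tuple with named fields, a custom transform can read fields by name and build modified copies instead of mutating tokens. The sketch below uses a local namedtuple that mirrors the documented field order rather than importing from textacy; uppercase_nouns is a hypothetical transform, and treating ws as a token's trailing whitespace string is an assumption.

```python
from collections import namedtuple
from typing import List

# local stand-in mirroring the documented field order; not imported from textacy
AugTok = namedtuple("AugTok", ["text", "ws", "pos", "is_word", "syns"])


def uppercase_nouns(aug_toks: List[AugTok]) -> List[AugTok]:
    """Hypothetical custom transform: List[AugTok] in, new List[AugTok] out."""
    return [
        tok._replace(text=tok.text.upper()) if tok.pos == "NOUN" else tok
        for tok in aug_toks
    ]


toks = [
    AugTok("lazy", " ", "ADJ", True, []),
    AugTok("dog", "", "NOUN", True, []),
    AugTok(".", "", "PUNCT", False, []),
]
new_toks = uppercase_nouns(toks)
# assumption: ``ws`` holds each token's trailing whitespace as a string
new_text = "".join(tok.text + tok.ws for tok in new_toks)
```

A callable with this shape satisfies the contract described for the transforms parameter of Augmenter: it takes a list of tokens as its first positional argument and returns another list.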

textacy.augmentation.utils.to_aug_toks(spacy_obj)[source]

Transform a spaCy Doc or Span into a list of AugTok objects, suitable for use in data augmentation transform functions.

Parameters

spacy_obj (spacy.tokens.Doc or spacy.tokens.Span) –

Returns

List[AugTok]

textacy.augmentation.utils.get_char_weights(lang)[source]

Get lang-specific character weights for use in certain data augmentation transforms, based on texts in textacy.datasets.UDHR.

Parameters

lang (str) – Standard two-letter language code.

Returns

Collection of (character, weight) pairs, based on the distribution of characters found in the source text.

Return type

List[Tuple[str, int]]
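
A list of (character, weight) pairs in this shape can drive weighted sampling with the standard library. The weights below are made up for illustration and are not taken from UDHR; in practice they would come from get_char_weights().

```python
import random
from typing import List, Tuple

# made-up weights for illustration; real values would come from get_char_weights()
char_weights: List[Tuple[str, int]] = [("e", 12), ("t", 9), ("a", 8), ("z", 1)]

chars, weights = zip(*char_weights)
rng = random.Random(0)
# random.choices draws with replacement, proportionally to the given weights
sample = rng.choices(chars, weights=weights, k=1000)
counts = {char: sample.count(char) for char in chars}
```

Characters with larger weights dominate the sample, which is the behavior the lang parameter of substitute_chars and insert_chars relies on.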