Text (Pre-)Processing

Normalize

Normalize aspects of raw text that may vary in problematic ways.

textacy.preprocessing.normalize.normalize_hyphenated_words(text)[source]

Normalize words in text that have been split across lines by a hyphen for visual consistency (aka hyphenated), by joining the pieces back together, sans hyphen and whitespace.

Parameters

text (str) –

Returns

str
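The rejoining behavior described above can be sketched with a small regex; the pattern and function name here are illustrative, not textacy's actual implementation:

```python
import re

# Hypothetical sketch: join words that were split across a line break by a
# hyphen, e.g. "pre-\nprocessing" -> "preprocessing". The pattern matches a
# word piece, a hyphen, a newline (with optional surrounding spaces/tabs),
# and the following word piece, then glues the two pieces back together.
LINE_BREAK_HYPHEN = re.compile(r"(\w+)-\s*\n\s*(\w+)")

def join_hyphenated_words(text: str) -> str:
    return LINE_BREAK_HYPHEN.sub(r"\1\2", text)
```

Note that ordinary intra-line hyphenation ("state-of-the-art") is untouched, since the pattern requires a newline after the hyphen.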

textacy.preprocessing.normalize.normalize_quotation_marks(text)[source]

Normalize all “fancy” single- and double-quotation marks in text to just the basic ASCII equivalents. Note that this will also normalize fancy apostrophes, which are typically represented as single quotation marks.

Parameters

text (str) –

Returns

str
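The core technique here is a character-to-character mapping; a minimal sketch using str.translate follows (the exact set of quote characters textacy covers may be larger than shown):

```python
# Illustrative sketch of fancy-quote normalization: map common curly
# quotation marks to their basic ASCII equivalents in one pass.
QUOTE_TRANSLATION = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (also the fancy apostrophe)
    "\u201C": '"',  # left double quotation mark
    "\u201D": '"',  # right double quotation mark
})

def normalize_quotes(text: str) -> str:
    return text.translate(QUOTE_TRANSLATION)
```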

textacy.preprocessing.normalize.normalize_unicode(text, *, form='NFC')[source]

Normalize unicode characters in text into canonical forms.

Parameters
  • text (str) –

  • form ({"NFC", "NFD", "NFKC", "NFKD"}) – Form of normalization applied to unicode characters. For example, an “e” with acute accent “´” can be written as “e´” (canonical decomposition, “NFD”) or “é” (canonical composition, “NFC”). Unicode can be normalized to NFC form without any change in meaning, so it’s usually a safe bet. If “NFKC”, additional normalizations are applied that can change characters’ meanings, e.g. ellipsis characters are replaced with three periods.

Returns

str
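The form behaviors described above can be seen directly with the standard library’s unicodedata.normalize (shown here as a stand-alone demonstration; whether textacy delegates to it internally is an assumption):

```python
import unicodedata

# "é" as a single composed code point (NFC form) vs. "e" followed by a
# combining acute accent (NFD form) -- visually identical, different bytes.
composed = "\u00e9"     # é
decomposed = "e\u0301"  # e + COMBINING ACUTE ACCENT

assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# NFKC additionally applies compatibility mappings that can change meaning,
# e.g. the single ellipsis character becomes three periods.
assert unicodedata.normalize("NFKC", "\u2026") == "..."
```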

textacy.preprocessing.normalize.normalize_whitespace(text)[source]

Replace all contiguous line-breaking whitespaces with a single newline and all contiguous non-breaking whitespaces with a single space, then strip any leading/trailing whitespace.

Parameters

text (str) –

Returns

str
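A regex sketch of the behavior described above (the patterns and function name are assumptions for illustration, not textacy’s actual ones):

```python
import re

# Collapse runs of line-breaking whitespace to a single "\n", collapse
# other whitespace runs to a single " ", then strip the ends.
def collapse_whitespace(text: str) -> str:
    text = re.sub(r"[ \t\f\v]*[\r\n]+[ \t\f\v\r\n]*", "\n", text)
    text = re.sub(r"[ \t\f\v]+", " ", text)
    return text.strip()
```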

Remove

Remove aspects of raw text that may be unwanted for certain use cases.

textacy.preprocessing.remove.remove_accents(text, *, fast=False)[source]

Remove accents from any accented unicode characters in text, either by replacing them with ASCII equivalents or removing them entirely.

Parameters
  • text (str) –

  • fast (bool) –

    If False, accents are removed from any unicode symbol with a direct ASCII equivalent; if True, accented characters for all unicode symbols are removed, regardless.

    Note

    fast=True can be significantly faster than fast=False, but its transformation of text is less “safe” and more likely to result in changes of meaning, spelling errors, etc.

Returns

str

Raises

ValueError – If method is not in {“unicode”, “ascii”}.

See also

For a more powerful (but slower) alternative, check out unidecode: https://github.com/avian2/unidecode
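One common accent-stripping technique — decompose characters, then drop the combining marks left behind — can be sketched with the standard library (this is a sketch of the general approach, not necessarily textacy’s internals):

```python
import unicodedata

# Decompose each character with NFKD so accents become separate combining
# code points, then keep only the non-combining characters.
def strip_accents(text: str) -> str:
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```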

textacy.preprocessing.remove.remove_punctuation(text, *, marks=None)[source]

Remove punctuation from text by replacing all instances of marks with whitespace.

Parameters
  • text (str) –

  • marks (str) – Remove only those punctuation marks specified here. For example, “,;:” removes commas, semi-colons, and colons. If None, all unicode punctuation marks are removed.

Returns

str

Note

When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used. The former is about 5-10x faster.
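The fast path mentioned in the note can be sketched as follows — build a translation table once, then apply it per call (shown with ASCII string.punctuation for brevity; textacy covers all unicode punctuation marks):

```python
import string

# Map every ASCII punctuation mark to a space, as the docstring describes
# ("replacing all instances of marks with whitespace").
PUNCT_TABLE = str.maketrans({mark: " " for mark in string.punctuation})

def remove_punct(text: str) -> str:
    return text.translate(PUNCT_TABLE)
```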

Replace

Replace parts of raw text that are semantically important as members of a group but not so much as individual instances.

textacy.preprocessing.replace.replace_currency_symbols(text, replace_with='_CUR_')[source]

Replace all currency symbols in text with replace_with.

Parameters
  • text (str) –

  • replace_with (str) –

Returns

str
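A currency symbol is any character in the unicode “Sc” (Symbol, currency) category, which suggests one simple implementation sketch (textacy’s actual matching may differ):

```python
import unicodedata

# Swap every character in the "Sc" (Symbol, currency) unicode category
# for the placeholder token, leaving all other characters as-is.
def replace_currency(text: str, replace_with: str = "_CUR_") -> str:
    return "".join(
        replace_with if unicodedata.category(ch) == "Sc" else ch
        for ch in text
    )
```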

textacy.preprocessing.replace.replace_emails(text, replace_with='_EMAIL_')[source]

Replace all email addresses in text with replace_with.

Parameters
  • text (str) –

  • replace_with (str) –

Returns

str

textacy.preprocessing.replace.replace_emojis(text, replace_with='_EMOJI_')[source]

Replace all emoji and pictographs in text with replace_with.

Parameters
  • text (str) –

  • replace_with (str) –

Returns

str

Note

If your Python has a narrow unicode build (“UCS-2”), only dingbats and miscellaneous symbols are replaced, because Python isn’t able to represent the unicode data for things like emoticons. Sorry!

textacy.preprocessing.replace.replace_hashtags(text, replace_with='_TAG_')[source]

Replace all hashtags in text with replace_with.

Parameters
  • text (str) –

  • replace_with (str) –

Returns

str

textacy.preprocessing.replace.replace_numbers(text, replace_with='_NUMBER_')[source]

Replace all numbers in text with replace_with.

Parameters
  • text (str) –

  • replace_with (str) –

Returns

str

textacy.preprocessing.replace.replace_phone_numbers(text, replace_with='_PHONE_')[source]

Replace all phone numbers in text with replace_with.

Parameters
  • text (str) –

  • replace_with (str) –

Returns

str

textacy.preprocessing.replace.replace_urls(text, replace_with='_URL_')[source]

Replace all URLs in text with replace_with.

Parameters
  • text (str) –

  • replace_with (str) –

Returns

str

textacy.preprocessing.replace.replace_user_handles(text, replace_with='_USER_')[source]

Replace all user handles in text with replace_with.

Parameters
  • text (str) –

  • replace_with (str) –

Returns

str
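Since all of the replace_* functions above share a (text, replace_with) signature, they compose naturally into a pipeline. A minimal sketch with deliberately simplified, illustrative patterns (far looser than textacy’s actual ones):

```python
import re

# Placeholder tokens mapped to toy patterns. Order matters: emails are
# matched before user handles so "bob@example.com" isn't half-consumed
# by the handle pattern.
SIMPLE_PATTERNS = {
    "_EMAIL_": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "_URL_": re.compile(r"https?://\S+"),
    "_USER_": re.compile(r"@\w+"),
}

def replace_all(text: str) -> str:
    for token, pattern in SIMPLE_PATTERNS.items():
        text = pattern.sub(token, text)
    return text
```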

Text Utils

Set of small utility functions that take text strings as input.

textacy.text_utils.is_acronym(token, exclude=None)[source]

Pass a single token as a string; return True if it is a valid acronym, False otherwise.

Parameters
  • token (str) – Single word to check for acronym-ness

  • exclude (Set[str]) – If certain technically valid but undesirable acronyms are known in advance, pass them in as a set of strings; matching tokens will return False.

Returns

bool
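A deliberately simplified sketch of such a check (textacy’s real rules are more involved; the length bounds and casing rules below are assumptions for illustration only):

```python
# Hypothetical acronym check: 2-10 characters, alphanumeric with all
# letters uppercase (digits allowed, e.g. "MP3"), and not in the
# caller-supplied exclude set.
def is_acronym_sketch(token, exclude=None):
    if exclude and token in exclude:
        return False
    if not 2 <= len(token) <= 10:
        return False
    return token.isupper() and token.isalnum()
```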

textacy.text_utils.keyword_in_context(text, keyword, *, ignore_case=True, window_width=50, print_only=True)[source]

Search for keyword in text via regular expression, return or print strings spanning window_width characters before and after each occurrence of keyword.

Parameters
  • text (str) – Text in which to search for keyword.

  • keyword (str) –

    Technically, any valid regular expression string should work, but usually this is a single word or short phrase: “spam”, “spam and eggs”; to account for variations, use regex: “[Ss]pam (and|&) [Ee]ggs?”

    Note: If keyword contains special characters, be sure to escape them!

  • ignore_case (bool) – If True, ignore letter case in keyword matching.

  • window_width (int) – Number of characters on either side of keyword to include as “context”.

  • print_only (bool) – If True, print out all results with nice formatting; if False, return all (pre, kw, post) matches as a generator of raw strings.

Returns

generator(Tuple[str, str, str]), or None
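The print_only=False behavior described above can be sketched with re.finditer (a stand-alone illustration, not textacy’s implementation):

```python
import re

# Yield (pre, match, post) windows of up to window_width characters on
# either side of each keyword occurrence.
def kwic(text, keyword, *, ignore_case=True, window_width=50):
    flags = re.IGNORECASE if ignore_case else 0
    for match in re.finditer(keyword, text, flags=flags):
        start, end = match.span()
        yield (
            text[max(0, start - window_width):start],
            match.group(),
            text[end:end + window_width],
        )
```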

textacy.text_utils.KWIC(text, keyword, *, ignore_case=True, window_width=50, print_only=True)

Alias of keyword_in_context.

textacy.text_utils.clean_terms(terms)[source]

Clean up a sequence of single- or multi-word strings: strip leading/trailing junk chars, handle dangling parens and odd hyphenation, etc.

Parameters

terms (Iterable[str]) – sequence of terms such as “presidency”, “epic failure”, or “George W. Bush” that may be unclean for whatever reason

Yields

str – next term in terms but with the cruft cleaned up, excluding terms that were entirely cruft

Warning

Terms with (intentionally) unusual punctuation may get “cleaned” into a form that changes or obscures the original meaning of the term.
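A rough sketch of one of the clean-up steps described above — stripping leading/trailing junk and dropping terms that were entirely cruft (textacy’s actual clean_terms does considerably more, e.g. the paren and hyphen handling):

```python
import re

# Strip runs of non-word characters from either edge of a term; a term
# that is reduced to the empty string was entirely cruft and is skipped.
_EDGE_JUNK = re.compile(r"^\W+|\W+$")

def clean_terms_sketch(terms):
    for term in terms:
        cleaned = _EDGE_JUNK.sub("", term)
        if cleaned:
            yield cleaned
```

This also illustrates the warning: a term whose meaningful punctuation sits at its edges would be “cleaned” into something different.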