Miscellany

Text Statistics

Compute a variety of basic counts and readability statistics for documents.

class textacy.text_stats.TextStats(doc)[source]

Compute a variety of basic counts and readability statistics for a given document. For example:

>>> text = next(textacy.datasets.CapitolWords().texts(limit=1))
>>> doc = textacy.make_spacy_doc(text)
>>> ts = TextStats(doc)
>>> ts.n_words
136
>>> ts.flesch_kincaid_grade_level
11.817647058823532
>>> ts.basic_counts
{'n_chars': 685,
 'n_long_words': 43,
 'n_monosyllable_words': 90,
 'n_polysyllable_words': 24,
 'n_sents': 6,
 'n_syllables': 214,
 'n_unique_words': 80,
 'n_words': 136}
>>> ts.readability_stats
{'automated_readability_index': 13.626495098039214,
 'coleman_liau_index': 12.509300816176474,
 'flesch_kincaid_grade_level': 11.817647058823532,
 'flesch_reading_ease': 50.707745098039254,
 'gulpease_index': 51.86764705882353,
 'gunning_fog_index': 16.12549019607843,
 'lix': 54.28431372549019,
 'smog_index': 14.554592549557764,
 'wiener_sachtextformel': 8.266410784313727}
Parameters

doc (spacy.tokens.Doc) – A text document processed by spacy; it need only be tokenized.

n_sents

Number of sentences in doc.

Type

int

n_words

Number of words in doc, including numbers + stop words but excluding punctuation.

Type

int

n_chars

Number of characters for all words in doc.

Type

int

n_syllables

Number of syllables for all words in doc.

Type

int

n_unique_words

Number of unique (lower-cased) words in doc.

Type

int

n_long_words

Number of words in doc with 7 or more characters.

Type

int

n_monosyllable_words

Number of words in doc with exactly one syllable.

Type

int

n_polysyllable_words

Number of words in doc with 3 or more syllables. Note: Since this excludes words with exactly 2 syllables, it’s likely that n_monosyllable_words + n_polysyllable_words != n_words.

Type

int

flesch_kincaid_grade_level

see flesch_kincaid_grade_level()

Type

float

flesch_reading_ease

see flesch_reading_ease()

Type

float

smog_index

see smog_index()

Type

float

gunning_fog_index

see gunning_fog_index()

Type

float

coleman_liau_index

see coleman_liau_index()

Type

float

automated_readability_index

see automated_readability_index()

Type

float

lix

see lix()

Type

float

gulpease_index

see gulpease_index()

Type

float

wiener_sachtextformel

see wiener_sachtextformel(). Note: This always returns variant #1.

Type

float

basic_counts

Mapping of basic count names to values, where basic counts are the attributes listed above between n_sents and n_polysyllable_words.

Type

Dict[str, int]

readability_stats

Mapping of readability statistic names to values, where readability stats are the attributes listed above between flesch_kincaid_grade_level and wiener_sachtextformel.

Type

Dict[str, float]

Raises

ValueError – If doc is not a spacy.tokens.Doc.

textacy.text_stats.flesch_kincaid_grade_level(n_syllables, n_words, n_sents)[source]

Readability score used widely in education, whose value estimates the U.S. grade level / number of years of education required to understand a text. Higher value => more difficult text.

References

https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch.E2.80.93Kincaid_grade_level
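
For reference, the standard formulation from the article above, shown as a sketch with the counts from the TextStats example (not a copy of the library's internal code, but it reproduces the example's value of ≈ 11.818):

>>> n_syllables, n_words, n_sents = 214, 136, 6   # counts from the TextStats example above
>>> fkgl = 11.8 * (n_syllables / n_words) + 0.39 * (n_words / n_sents) - 15.59   # ≈ 11.818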

textacy.text_stats.flesch_reading_ease(n_syllables, n_words, n_sents, *, lang=None)[source]

Readability score usually in the range [0, 100], related (inversely) to flesch_kincaid_grade_level(). Higher value => easier text.

Note

Constant weights in this formula are language-dependent; if lang is None, the English-language formulation is used.

References

English: https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease
German: https://de.wikipedia.org/wiki/Lesbarkeitsindex#Flesch-Reading-Ease
Spanish: ?
French: ?
Italian: https://it.wikipedia.org/wiki/Formula_di_Flesch
Dutch: ?
Portuguese: https://pt.wikipedia.org/wiki/Legibilidade_de_Flesch
Russian: https://ru.wikipedia.org/wiki/%D0%98%D0%BD%D0%B4%D0%B5%D0%BA%D1%81_%D1%83%D0%B4%D0%BE%D0%B1%D0%BE%D1%87%D0%B8%D1%82%D0%B0%D0%B5%D0%BC%D0%BE%D1%81%D1%82%D0%B8
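
A sketch of the English-language formulation from the first reference (other languages use different constant weights); with the TextStats example counts it reproduces the example's value of ≈ 50.708:

>>> n_syllables, n_words, n_sents = 214, 136, 6
>>> fre = 206.835 - 1.015 * (n_words / n_sents) - 84.6 * (n_syllables / n_words)   # ≈ 50.708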

textacy.text_stats.smog_index(n_polysyllable_words, n_sents)[source]

Readability score commonly used in healthcare, whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and intended as a substitute for gunning_fog_index(). Higher value => more difficult text.

References

https://en.wikipedia.org/wiki/SMOG
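
The usual SMOG formulation from the reference above, sketched with this function's arguments and the TextStats example counts:

>>> import math
>>> n_polysyllable_words, n_sents = 24, 6
>>> smog = 1.0430 * math.sqrt(30 * n_polysyllable_words / n_sents) + 3.1291   # ≈ 14.555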

textacy.text_stats.gunning_fog_index(n_words, n_polysyllable_words, n_sents)[source]

Readability score whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(). Higher value => more difficult text.

References

https://en.wikipedia.org/wiki/Gunning_fog_index
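
The standard Gunning fog formulation from the reference above, sketched with the TextStats example counts:

>>> n_words, n_polysyllable_words, n_sents = 136, 24, 6
>>> fog = 0.4 * ((n_words / n_sents) + 100 * (n_polysyllable_words / n_words))   # ≈ 16.125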

textacy.text_stats.coleman_liau_index(n_chars, n_words, n_sents)[source]

Readability score whose value estimates the number of years of education required to understand a text, similar to flesch_kincaid_grade_level() and smog_index(), but using characters instead of syllables. Higher value => more difficult text.

References

https://en.wikipedia.org/wiki/Coleman%E2%80%93Liau_index
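
The commonly cited per-100-words formulation from the reference above, as a sketch; the TextStats example value (≈ 12.509) suggests the implementation uses higher-precision constants, so this approximation differs slightly in the third decimal place:

>>> n_chars, n_words, n_sents = 685, 136, 6
>>> L = 100 * n_chars / n_words   # average characters per 100 words
>>> S = 100 * n_sents / n_words   # average sentences per 100 words
>>> cli = 0.0588 * L - 0.296 * S - 15.8   # ≈ 12.510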

textacy.text_stats.automated_readability_index(n_chars, n_words, n_sents)[source]

Readability score whose value estimates the U.S. grade level required to understand a text, most similarly to flesch_kincaid_grade_level(), but using characters instead of syllables like coleman_liau_index(). Higher value => more difficult text.

References

https://en.wikipedia.org/wiki/Automated_readability_index
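
The standard ARI formulation from the reference above, sketched with the TextStats example counts:

>>> n_chars, n_words, n_sents = 685, 136, 6
>>> ari = 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43   # ≈ 13.626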

textacy.text_stats.lix(n_words, n_long_words, n_sents)[source]

Readability score commonly used in Sweden, whose value estimates the difficulty of reading a foreign text. Higher value => more difficult text.

References

https://en.wikipedia.org/wiki/LIX
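
The LIX formulation from the reference above, sketched with the TextStats example counts:

>>> n_words, n_long_words, n_sents = 136, 43, 6
>>> lix = (n_words / n_sents) + 100 * (n_long_words / n_words)   # ≈ 54.284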

textacy.text_stats.wiener_sachtextformel(n_words, n_polysyllable_words, n_monosyllable_words, n_long_words, n_sents, *, variant=1)[source]

Readability score for German-language texts, whose value estimates the grade level required to understand a text. Higher value => more difficult text.

References

https://de.wikipedia.org/wiki/Lesbarkeitsindex#Wiener_Sachtextformel
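
A sketch of the first variant as given in the reference above, expressed with this function's count arguments; with the TextStats example counts it reproduces the example's value of ≈ 8.266:

>>> n_words, n_polysyllable_words, n_monosyllable_words, n_long_words, n_sents = 136, 24, 90, 43, 6
>>> MS = 100 * n_polysyllable_words / n_words   # % of words with 3+ syllables
>>> SL = n_words / n_sents                      # mean sentence length in words
>>> IW = 100 * n_long_words / n_words           # % of long words
>>> ES = 100 * n_monosyllable_words / n_words   # % of monosyllabic words
>>> wstf1 = 0.1935 * MS + 0.1672 * SL + 0.1297 * IW - 0.0327 * ES - 0.875   # ≈ 8.266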

textacy.text_stats.gulpease_index(n_chars, n_words, n_sents)[source]

Readability score for Italian-language texts, whose value is in the range [0, 100] similar to flesch_reading_ease(). Higher value => easier text.

References

https://it.wikipedia.org/wiki/Indice_Gulpease
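
The Gulpease formulation from the reference above, sketched with the TextStats example counts:

>>> n_chars, n_words, n_sents = 685, 136, 6
>>> gulpease = 89 + (300 * n_sents - 10 * n_chars) / n_words   # ≈ 51.868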

textacy.text_stats.load_hyphenator(lang)[source]

Load an object that hyphenates words at valid points, as used in LaTeX typesetting.

Parameters

lang (str) –

Standard 2-letter language abbreviation. To get a list of valid values:

>>> import pyphen; pyphen.LANGUAGES

Returns

pyphen.Pyphen()

Note

While hyphenation points always fall on syllable divisions, not all syllable divisions are valid hyphenation points. But it’s decent.
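
A brief usage sketch; since the return value is a pyphen.Pyphen object, its inserted() and positions() methods are available (the exact hyphenation points depend on the loaded dictionary):

>>> hyphenator = textacy.text_stats.load_hyphenator("en")
>>> hyphenator.inserted("hyphenation")    # e.g. 'hy-phen-ation'
>>> hyphenator.positions("hyphenation")   # indexes of valid hyphenation points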

Semantic Networks

Represent documents as semantic networks, where nodes are individual terms or whole sentences and edges are weighted by the strength of their co-occurrence or similarity, respectively.

textacy.network.terms_to_semantic_network(terms, *, normalize='lemma', window_width=10, edge_weighting='cooc_freq')[source]

Transform an ordered list of non-overlapping terms into a semantic network, where each term is represented by a node with weighted edges linking it to other terms that co-occur within window_width terms of itself.

Parameters
  • terms (List[str] or List[spacy.tokens.Token]) –

  • normalize (str or Callable) –

    If “lemma”, lemmatize terms; if “lower”, lowercase terms; if falsy, use the form of terms as they appear in terms; if a callable, must accept a Token and return a str, e.g. textacy.spacier.utils.get_normalized_text().

    Note

    This is applied to the elements of terms only if it’s a list of Token.

  • window_width (int) – Size of sliding window over terms that determines which are said to co-occur. If 2, only immediately adjacent terms have edges in the returned network.

  • edge_weighting ({'cooc_freq', 'binary'}) – If ‘cooc_freq’, the nodes for all co-occurring terms are connected by edges with weight equal to the number of times they co-occurred within a sliding window; if ‘binary’, all such edges have weight = 1.

Returns

Nodes in this network correspond to individual terms; those that co-occur are connected by edges with weights determined by edge_weighting.

Return type

networkx.Graph

Note

  • Be sure to filter out stopwords, punctuation, certain parts of speech, etc. from the terms list before passing it to this function

  • Multi-word terms, such as named entities and compound nouns, must be merged into single strings or Token objects beforehand

  • If terms are already strings, be sure to have normalized them so that like terms are counted together; for example, by applying textacy.spacier.utils.get_normalized_text()
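
A usage sketch along the lines of the notes above, reusing the doc from the TextStats example; the token filtering shown is illustrative, not required:

>>> doc = textacy.make_spacy_doc(text)
>>> terms = [tok for tok in doc if not (tok.is_stop or tok.is_punct or tok.is_space)]
>>> graph = textacy.network.terms_to_semantic_network(
...     terms, normalize="lemma", window_width=10, edge_weighting="cooc_freq")
>>> graph.number_of_nodes(), graph.number_of_edges()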

textacy.network.sents_to_semantic_network(sents, *, normalize='lemma', edge_weighting='cosine')[source]

Transform a list of sentences into a semantic network, where each sentence is represented by a node with edges linking it to other sentences weighted by the (cosine or jaccard) similarity of their constituent words.

Parameters
  • sents (List[str] or List[spacy.tokens.Span]) –

  • normalize (str or Callable) –

    If ‘lemma’, lemmatize words in sents; if ‘lower’, lowercase words in sents; if falsy, use the form of words as they appear in sents; if a callable, must accept a spacy.tokens.Token and return a str, e.g. textacy.spacier.utils.get_normalized_text().

    Note

    This is applied to the elements of sents only if it's a list of Span objects.

  • edge_weighting ({'cosine', 'jaccard'}) – Similarity metric to use for weighting edges between sentences. If ‘cosine’, use the cosine similarity between sentences represented as tf-idf word vectors; if ‘jaccard’, use the set intersection divided by the set union of all words in a given sentence pair.

Returns

Nodes are the integer indexes of the sentences in sents, not the actual text of the sentences! Edges connect every pair of nodes, with weights determined by edge_weighting.

Return type

networkx.Graph

Note

  • If passing sentences as strings, be sure to filter out stopwords, punctuation, certain parts of speech, etc. beforehand

  • Consider normalizing the strings so that like terms are counted together (see textacy.spacier.utils.get_normalized_text())
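
A usage sketch, assuming doc has sentence boundaries set (e.g. by a spaCy parser):

>>> sents = list(doc.sents)
>>> graph = textacy.network.sents_to_semantic_network(
...     sents, normalize="lemma", edge_weighting="cosine")
>>> sorted(graph.nodes())[:3]   # integer sentence indexes, e.g. [0, 1, 2]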

Similarity

Collection of semantic + lexical similarity metrics between tokens, strings, and sequences thereof, returning values between 0.0 (totally dissimilar) and 1.0 (totally similar).

textacy.similarity.word_movers(doc1, doc2, metric='cosine')[source]

Measure the semantic similarity between two documents using Word Mover's Distance.

Parameters
  • doc1 (spacy.tokens.Doc) –

  • doc2 (spacy.tokens.Doc) –

  • metric ({"cosine", "euclidean", "l1", "l2", "manhattan"}) –

Returns

Similarity between doc1 and doc2 in the interval [0.0, 1.0], where larger values correspond to more similar documents.

Return type

float

References

  • Ofir Pele and Michael Werman, “A linear time histogram metric for improved SIFT matching,” in Computer Vision - ECCV 2008, Marseille, France, 2008.

  • Ofir Pele and Michael Werman, “Fast and robust earth mover’s distances,” in Proc. 2009 IEEE 12th Int. Conf. on Computer Vision, Kyoto, Japan, 2009.

  • Kusner, Matt J., et al. “From word embeddings to document distances.” Proceedings of the 32nd International Conference on Machine Learning (ICML 2015). 2015. http://jmlr.org/proceedings/papers/v37/kusnerb15.pdf
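
A minimal usage sketch, assuming doc1 and doc2 were made with a spaCy model that provides word vectors:

>>> textacy.similarity.word_movers(doc1, doc2, metric="cosine")   # float in [0.0, 1.0]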

textacy.similarity.word2vec(obj1, obj2)[source]

Measure the semantic similarity between one spacy Doc, Span, Token, or Lexeme and another like object using the cosine distance between the objects’ (average) word2vec vectors.

Parameters
  • obj1 (spacy.tokens.Doc, spacy.tokens.Span, spacy.tokens.Token, or spacy.Lexeme) –

  • obj2 (spacy.tokens.Doc, spacy.tokens.Span, spacy.tokens.Token, or spacy.Lexeme) –

Returns

Similarity between obj1 and obj2 in the interval [0.0, 1.0], where larger values correspond to more similar objects.

Return type

float
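
The underlying idea, sketched with spaCy's .vector attribute and scipy rather than the library's actual code (edge cases such as zero vectors, and any clipping into [0.0, 1.0], are not handled here); obj1 and obj2 are the objects being compared:

>>> from scipy.spatial.distance import cosine
>>> sim = 1.0 - cosine(obj1.vector, obj2.vector)   # cosine similarity of the (average) vectors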

textacy.similarity.jaccard(obj1, obj2, fuzzy_match=False, match_threshold=0.8)[source]

Measure the similarity between two strings or sequences of strings using Jaccard distance, with optional fuzzy matching of non-identical pairs when obj1 and obj2 are sequences of strings.

Parameters
  • obj1 (str or Sequence[str]) –

  • obj2 (str or Sequence[str]) – if str, both inputs are treated as sequences of characters, in which case fuzzy matching is not permitted

  • fuzzy_match (bool) – if True, allow for fuzzy matching in addition to the usual identical matching of pairs between input vectors

  • match_threshold (float) – value in the interval [0.0, 1.0]; fuzzy comparisons with a score >= this value will be considered matches

Returns

Similarity between obj1 and obj2 in the interval [0.0, 1.0], where larger values correspond to more similar strings or sequences of strings

Return type

float

Raises
  • ValueError – if fuzzy_match is True but obj1 and obj2 are strings, or if match_threshold is not a valid float
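
Jaccard similarity is the size of the intersection divided by the size of the union of the two inputs; a hand-worked illustration (the library call shown is a sketch, and its exact handling of duplicates or fuzzy matches may differ):

>>> obj1 = ["black", "cat", "sleeps"]
>>> obj2 = ["white", "cat", "sleeps"]
>>> len(set(obj1) & set(obj2)) / len(set(obj1) | set(obj2))   # 2 / 4 = 0.5
>>> textacy.similarity.jaccard(obj1, obj2)                    # expected ≈ 0.5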

textacy.similarity.levenshtein(str1, str2)[source]

Measure the similarity between two strings using Levenshtein distance, which gives the minimum number of character insertions, deletions, and substitutions needed to change one string into the other.

Parameters
  • str1 (str) –

  • str2 (str) –

Returns

similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings

Return type

float
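
The raw Levenshtein distance is an edit count, so mapping it onto [0.0, 1.0] requires a normalization; one common choice is shown below as a sketch (the library's exact normalization may differ):

>>> str1, str2 = "kitten", "sitting"      # Levenshtein distance between these is 3
>>> 1.0 - 3 / max(len(str1), len(str2))   # ≈ 0.571 under max-length normalization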

textacy.similarity.token_sort_ratio(str1, str2)[source]

Measure the similarity between two strings based on levenshtein(), but with non-alphanumeric characters removed and the words in each string sorted before comparison.

Parameters
  • str1 (str) –

  • str2 (str) –

Returns

Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings.

Return type

float
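
Because words are sorted before comparison, reorderings of the same words should score as (near-)identical; a sketch:

>>> textacy.similarity.token_sort_ratio("new york mets", "mets new york")   # expected ≈ 1.0
>>> textacy.similarity.levenshtein("new york mets", "mets new york")        # noticeably lower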

textacy.similarity.character_ngrams(str1, str2)[source]

Measure the similarity between two strings using a character ngrams similarity metric, in which strings are transformed into trigrams of alnum-only characters, vectorized and weighted by tf-idf, then compared by cosine similarity.

Parameters
  • str1 (str) –

  • str2 (str) –

Returns

Similarity between str1 and str2 in the interval [0.0, 1.0], where larger values correspond to more similar strings

Return type

float

Note

This method has been used in cross-lingual plagiarism detection and authorship attribution, and seems to work better on longer texts. At the very least, it is slow on shorter texts relative to the other similarity measures.
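
A sketch of the described approach using scikit-learn (an assumption about tooling, not the library's actual code); it omits the alnum-only preprocessing step:

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3))
>>> tfidf = vectorizer.fit_transform(["first example string", "second example string"])
>>> cosine_similarity(tfidf[0], tfidf[1])[0, 0]   # value in [0.0, 1.0]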