Utilities

Language identification

Pipeline for identifying the language of a text, using a model inspired by Google’s Compact Language Detector v3 (https://github.com/google/cld3) and implemented with scikit-learn>=0.20.

Model

Character unigrams, bigrams, and trigrams are extracted from input text, and their frequencies of occurrence within the text are counted. The full set of ngrams is then hashed into a 4096-dimensional feature vector with values given by the L2-normalized counts. These features are passed into a Multi-layer Perceptron with a single hidden layer of 512 rectified linear units and a softmax output layer giving probabilities for ~140 different languages as ISO 639-1 language codes.

Technically, the model was implemented as a sklearn.pipeline.Pipeline with two steps: a sklearn.feature_extraction.text.HashingVectorizer for vectorizing input texts and a sklearn.neural_network.MLPClassifier for multi-class language classification.
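
For concreteness, here is a rough sketch of how such a pipeline could be assembled. The analyzer setting, alternate_sign, and other hyperparameters shown are illustrative assumptions, not the trained model's actual configuration:

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# hash char 1-, 2-, and 3-grams into a fixed 4096-dim, L2-normalized vector
vectorizer = HashingVectorizer(
    analyzer="char_wb", ngram_range=(1, 3),
    n_features=4096, norm="l2", alternate_sign=False,
)
# single hidden layer of 512 rectified linear units; MLPClassifier applies
# a softmax over the classes (language codes) for multi-class output
classifier = MLPClassifier(hidden_layer_sizes=(512,), activation="relu")
pipeline = Pipeline([("vectorizer", vectorizer), ("classifier", classifier)])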

Dataset

The pipeline was trained on a randomized, stratified subset of ~750k texts drawn from several sources:

  • Tatoeba: A crowd-sourced collection of sentences and their translations into many languages. Style is relatively informal; subject matter is a variety of everyday things and goings-on. Source: https://tatoeba.org/eng/downloads.

  • Leipzig Corpora: A collection of corpora for many languages pulling from comparable sources – specifically, 10k Wikipedia articles from official database dumps and 10k news articles from either RSS feeds or web scrapes, when available. Style is relatively formal; subject matter is a variety of notable things and goings-on. Source: http://wortschatz.uni-leipzig.de/en/download

  • UDHR: The UN’s Universal Declaration of Human Rights document, translated into hundreds of languages and split into paragraphs. Style is formal; subject matter is fundamental human rights to be universally protected. Source: https://unicode.org/udhr/index.html

  • Twitter: A collection of tweets in each of ~70 languages, posted in July 2014, with languages assigned through a combination of models and human annotators. Style is informal; subject matter is whatever Twitter was going on about back then. Source: https://blog.twitter.com/engineering/en_us/a/2015/evaluating-language-identification-performance.html

  • DSLCC: Two collections of short excerpts of journalistic texts in a handful of language groups that are highly similar to each other. Style is relatively formal; subject matter is current events. Source: http://ttg.uni-saarland.de/resources/DSLCC/

Performance

The trained model achieved F1 = 0.96 when (macro and micro) averaged over all languages. A few languages have worse performance; for example, the two Norwegian varieties (“nb” and “no”), Bosnian (“bs”) and Serbian (“sr”), and Bashkir (“ba”) and Tatar (“tt”) are often confused with each other. See the textacy-data releases for more details: https://github.com/bdewilde/textacy-data/releases/tag/lang_identifier_v1.1_sklearn_v21

class textacy.lang_utils.LangIdentifier(data_dir=PosixPath('.../textacy/data/lang_identifier'), max_text_len=1000)[source]
Parameters
  • data_dir (str) – Directory on disk under which the language identification pipeline data is saved and loaded.

  • max_text_len (int) – Maximum number of characters of a text to consider; longer texts are truncated to this length before identification.

pipeline
Type

sklearn.pipeline.Pipeline

download(force=False)[source]

Download the pipeline data as a Python version-specific compressed pickle file and save it to disk under the LangIdentifier.data_dir directory.

Parameters

force (bool) – If True, download the dataset, even if it already exists on disk under data_dir.
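
A minimal usage sketch, assuming the default data_dir:

>>> import textacy.lang_utils
>>> lid = textacy.lang_utils.LangIdentifier()
>>> lid.download()  # skipped if the data already exists under data_dir
>>> lid.download(force=True)  # re-download even if it's already on disk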

identify_lang(text)[source]

Identify the most probable language of text.

Parameters

text (str) – Text whose language is to be identified.

Returns

2-letter language code of the most probable language.

Return type

str
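
For example (the returned code shown is what one would expect for English input, not a captured output):

>>> import textacy.lang_utils
>>> lid = textacy.lang_utils.LangIdentifier()
>>> lid.identify_lang("This is a sentence written in English.")  # expected: 'en'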

identify_topn_langs(text, topn=3)[source]

Identify the topn most probable languages of text.

Parameters
  • text (str) – Text whose languages are to be identified.

  • topn (int) – Number of most probable languages to return.

Returns

2-letter language codes and their probabilities for the topn most probable languages, in order of decreasing probability.

Return type

List[Tuple[str, float]]
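
For example (the probabilities here are illustrative, not actual model outputs):

>>> import textacy.lang_utils
>>> lid = textacy.lang_utils.LangIdentifier()
>>> lid.identify_topn_langs("Ceci n'est pas une pipe.", topn=2)  # e.g. [('fr', 0.99), ('ca', 0.005)]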

init_pipeline()[source]

Initialize a new language identification pipeline, overwriting any pre-trained pipeline loaded from disk under LangIdentifier.data_dir. Must be trained on (text, lang) examples before use.
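
A rough sketch of retraining on one's own examples, assuming the standard scikit-learn fit API on the pipeline attribute; texts and langs here are hypothetical toy lists of input strings and their ISO 639-1 codes, and real training would need far more data:

>>> import textacy.lang_utils
>>> lid = textacy.lang_utils.LangIdentifier()
>>> lid.init_pipeline()
>>> texts = ["Hello there.", "Bonjour à tous."]
>>> langs = ["en", "fr"]
>>> lid.pipeline.fit(texts, langs)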

textacy.lang_utils.identify_lang(text)

Identify the most probable language of text.

Parameters

text (str) – Text whose language is to be identified.

Returns

2-letter language code of the most probable language.

Return type

str
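
This module-level function appears to be a convenience wrapper around LangIdentifier.identify_lang; a usage sketch (the result shown is expected, not captured):

>>> import textacy.lang_utils
>>> textacy.lang_utils.identify_lang("¿Dónde está la biblioteca?")  # expected: 'es'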

Cache

Functionality for caching language data and other NLP resources. Loading data from disk can be slow; let’s just do it once and forget about it. :)

textacy.cache.LRU_CACHE = LRUCache([], maxsize=2147483648, currsize=0)

Least Recently Used (LRU) cache for loaded data.

The max cache size may be set by the TEXTACY_MAX_CACHE_SIZE environment variable, where the value must be an integer (in bytes). Otherwise, the max size is 2GB.

Type

cachetools.LRUCache
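
For example, to cap the cache at 1GB, the variable would presumably need to be set before textacy is first imported, since LRU_CACHE is created at import time:

>>> import os
>>> os.environ["TEXTACY_MAX_CACHE_SIZE"] = str(1024**3)  # 1GB, in bytes
>>> import textacy.cache  # cache is created with the custom max size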

textacy.cache.clear()[source]

Clear textacy’s cache of loaded data.