textacy: NLP, before and after spaCy

textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. With the fundamentals — tokenization, part-of-speech tagging, dependency parsing, etc. — delegated to another library, textacy focuses primarily on the tasks that come before and follow after.


Features

  • Convenient entry points for working with one or many documents processed by spaCy, with functionality added via custom extensions

  • Variety of downloadable datasets with both text content and metadata, from Congressional speeches to historical literature to Reddit comments

  • Easy file I/O for streaming data to and from disk

  • Cleaning, normalization, and exploration of raw text — before processing

  • Flexible extraction of words, ngrams, noun chunks, entities, acronyms, key terms, and other elements of interest

  • Tokenization and vectorization of documents, with functionality for training, interpreting, and visualizing topic models

  • String, set, and document similarity comparison by a variety of metrics

  • Calculations for common text statistics, including Flesch-Kincaid Grade Level and multilingual Flesch Reading Ease

and more!
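
To make the text-statistics item above concrete, here is a minimal sketch of the two readability scores it names, written in plain Python from the standard published English formulas. This is not textacy's own implementation or API: the function names and the count-based signatures are assumptions made for illustration, and the word, sentence, and syllable counts are expected to come from an earlier processing step. The multilingual variants of Flesch Reading Ease use different, language-specific coefficients that are not shown here.

```python
def flesch_kincaid_grade_level(n_words: int, n_sents: int, n_syllables: int) -> float:
    """Approximate U.S. school grade needed to read the text (English formula).

    Hypothetical helper for illustration; takes precomputed counts.
    """
    if n_words <= 0 or n_sents <= 0:
        raise ValueError("n_words and n_sents must be positive")
    words_per_sent = n_words / n_sents
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sent + 11.8 * syllables_per_word - 15.59


def flesch_reading_ease(n_words: int, n_sents: int, n_syllables: int) -> float:
    """Readability score where higher means easier to read (English formula).

    Hypothetical helper for illustration; takes precomputed counts.
    """
    if n_words <= 0 or n_sents <= 0:
        raise ValueError("n_words and n_sents must be positive")
    words_per_sent = n_words / n_sents
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sent - 84.6 * syllables_per_word


if __name__ == "__main__":
    # Example counts: 100 words in 5 sentences, 150 syllables in total.
    print(round(flesch_kincaid_grade_level(100, 5, 150), 2))
    print(round(flesch_reading_ease(100, 5, 150), 2))
```

Both scores depend only on average sentence length and average syllables per word, so in practice the quality of the result rests on how well the upstream tokenizer and syllable counter do their job.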