Datasets

Capitol Words Congressional speeches

A collection of ~11k (almost all) speeches given by the main protagonists of the 2016 U.S. Presidential election who had previously served in the U.S. Congress – including Hillary Clinton, Bernie Sanders, Barack Obama, Ted Cruz, and John Kasich – from January 1996 through June 2016.

Records include the following data:

  • text: Full text of the Congressperson’s remarks.

  • title: Title of the speech, in all caps.

  • date: Date on which the speech was given, as an ISO-standard string.

  • speaker_name: First and last name of the speaker.

  • speaker_party: Political party of the speaker: “R” for Republican, “D” for Democrat, “I” for Independent.

  • congress: Number of the Congress in which the speech was given: ranges continuously between 104 and 114.

  • chamber: Chamber of Congress in which the speech was given: almost all are either “House” or “Senate”, with a small number of “Extensions”.

This dataset was derived from data provided by the (now defunct) Sunlight Foundation’s Capitol Words API.
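Put together, a single record is a plain mapping of the fields above to values. The sketch below is illustrative only – the field names come from this documentation, but the values are hypothetical, not taken from the dataset:

```python
# Illustrative shape of a single Capitol Words record
# (field names from the docs above; the values here are hypothetical).
record = {
    "text": "Mr. Speaker, I rise today to discuss...",
    "title": "HEALTH CARE REFORM",
    "date": "2009-09-09",
    "speaker_name": "Bernie Sanders",
    "speaker_party": "I",
    "congress": 111,
    "chamber": "Senate",
}

# Simple validity checks over the documented categorical fields.
assert record["speaker_party"] in {"R", "D", "I"}
assert 104 <= record["congress"] <= 114
assert record["chamber"] in {"House", "Senate", "Extensions"}
```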

class textacy.datasets.capitol_words.CapitolWords(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/capitol_words'))[source]

Stream a collection of Congressional speeches from a compressed json file on disk, either as texts or text + metadata pairs.

Download the data (one time only!) from the textacy-data repo (https://github.com/bdewilde/textacy-data), and save its contents to disk:

>>> ds = CapitolWords()
>>> ds.download()
>>> ds.info
{'name': 'capitol_words',
 'site_url': 'http://sunlightlabs.github.io/Capitol-Words/',
 'description': 'Collection of ~11k speeches in the Congressional Record given by notable U.S. politicians between Jan 1996 and Jun 2016.'}

Iterate over speeches as texts or records with both text and metadata:

>>> for text in ds.texts(limit=3):
...     print(text, end="\n\n")
>>> for text, meta in ds.records(limit=3):
...     print("\n{} ({})\n{}".format(meta["title"], meta["speaker_name"], text))

Filter speeches by a variety of metadata fields and text length:

>>> for text, meta in ds.records(speaker_name="Bernie Sanders", limit=3):
...     print("\n{}, {}\n{}".format(meta["title"], meta["date"], text))
>>> for text, meta in ds.records(speaker_party="D", congress={110, 111, 112},
...                          chamber="Senate", limit=3):
...     print(meta["title"], meta["speaker_name"], meta["date"])
>>> for text, meta in ds.records(speaker_name={"Barack Obama", "Hillary Clinton"},
...                              date_range=("2005-01-01", "2005-12-31")):
...     print(meta["title"], meta["speaker_name"], meta["date"])
>>> for text in ds.texts(min_len=50000):
...     print(len(text))

Stream speeches into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 70496 tokens)
Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which dataset is stored, i.e. /path/to/data_dir/capitol_words.

full_date_range

First and last dates for which speeches are available, each as an ISO-formatted string (YYYY-MM-DD).

Type

Tuple[str]

speaker_names

Full names of all speakers included in corpus, e.g. “Bernie Sanders”.

Type

Set[str]

speaker_parties

All distinct political parties of speakers, e.g. “R”.

Type

Set[str]

chambers

All distinct chambers in which speeches were given, e.g. “House”.

Type

Set[str]

congresses

All distinct numbers of the congresses in which speeches were given, e.g. 114.

Type

Set[int]

property filepath

Full path on disk for CapitolWords data as compressed json file. None if file is not found, e.g. has not yet been downloaded.

Type

str

download(*, force=False)[source]

Download the data as a Python version-specific compressed json file and save it to disk under the data_dir directory.

Parameters

force (bool) – If True, download the dataset, even if it already exists on disk under data_dir.

texts(*, speaker_name=None, speaker_party=None, chamber=None, congress=None, date_range=None, min_len=None, limit=None)[source]

Iterate over speeches in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in chronological order.

Parameters
  • speaker_name (str or Set[str]) – Filter speeches by the speakers’ name; see CapitolWords.speaker_names.

  • speaker_party (str or Set[str]) – Filter speeches by the speakers’ party; see CapitolWords.speaker_parties.

  • chamber (str or Set[str]) – Filter speeches by the chamber in which they were given; see CapitolWords.chambers.

  • congress (int or Set[int]) – Filter speeches by the congress in which they were given; see CapitolWords.congresses.

  • date_range (List[str] or Tuple[str]) – Filter speeches by the date on which they were given. Both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.

  • min_len (int) – Filter speeches by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit speeches that match all specified filters.

Yields

str – Full text of the next speech (in chronological order) in the dataset passing all filter params.

Raises

ValueError – If any filtering options are invalid.

records(*, speaker_name=None, speaker_party=None, chamber=None, congress=None, date_range=None, min_len=None, limit=None)[source]

Iterate over speeches in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in chronological order.

Parameters
  • speaker_name (str or Set[str]) – Filter speeches by the speakers’ name; see CapitolWords.speaker_names.

  • speaker_party (str or Set[str]) – Filter speeches by the speakers’ party; see CapitolWords.speaker_parties.

  • chamber (str or Set[str]) – Filter speeches by the chamber in which they were given; see CapitolWords.chambers.

  • congress (int or Set[int]) – Filter speeches by the congress in which they were given; see CapitolWords.congresses.

  • date_range (List[str] or Tuple[str]) – Filter speeches by the date on which they were given. Both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.

  • min_len (int) – Filter speeches by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit speeches that match all specified filters.

Yields

str – Text of the next speech in dataset passing all filters.
dict – Metadata of the next speech in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

Supreme Court decisions

A collection of ~8.4k (almost all) decisions issued by the U.S. Supreme Court from November 1946 through June 2016 – the “modern” era.

Records include the following data:

  • text: Full text of the Court’s decision.

  • case_name: Name of the court case, in all caps.

  • argument_date: Date on which the case was argued before the Court, as an ISO-formatted string (“YYYY-MM-DD”).

  • decision_date: Date on which the Court’s decision was announced, as an ISO-formatted string (“YYYY-MM-DD”).

  • decision_direction: Ideological direction of the majority’s decision: one of “conservative”, “liberal”, or “unspecifiable”.

  • maj_opinion_author: Name of the majority opinion’s author, if available and identifiable, as an integer code whose mapping is given in SupremeCourt.opinion_author_codes.

  • n_maj_votes: Number of justices voting in the majority.

  • n_min_votes: Number of justices voting in the minority.

  • issue: Subject matter of the case’s core disagreement (e.g. “affirmative action”) rather than its legal basis (e.g. “the equal protection clause”), as a string code whose mapping is given in SupremeCourt.issue_codes.

  • issue_area: Higher-level categorization of the issue (e.g. “Civil Rights”), as an integer code whose mapping is given in SupremeCourt.issue_area_codes.

  • us_cite_id: Citation identifier for each case according to the official United States Reports. Note: There are ~300 cases with duplicate ids, and it’s not clear if that’s “correct” or a data quality problem.
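One way to surface those duplicate citation ids is a simple count over streamed records. This is an illustrative check, not part of the library; it assumes an iterable of (text, meta) pairs such as SupremeCourt.records() yields:

```python
from collections import Counter

def find_duplicate_cite_ids(records):
    """Return citation ids that appear on more than one case.

    `records` is any iterable of (text, meta) pairs, e.g. from
    SupremeCourt.records(); illustrative only, not a library function.
    """
    counts = Counter(
        meta["us_cite_id"]
        for _, meta in records
        if meta.get("us_cite_id") is not None
    )
    return {cite_id for cite_id, n in counts.items() if n > 1}

# With a toy iterable of (text, meta) pairs:
sample = [
    ("...", {"us_cite_id": "329 U.S. 1"}),
    ("...", {"us_cite_id": "329 U.S. 1"}),
    ("...", {"us_cite_id": "330 U.S. 75"}),
]
assert find_duplicate_cite_ids(sample) == {"329 U.S. 1"}
```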

The text in this dataset was derived from FindLaw’s searchable database of court cases: http://caselaw.findlaw.com/court/us-supreme-court.

The metadata was extracted without modification from the Supreme Court Database: Harold J. Spaeth, Lee Epstein, et al. 2016 Supreme Court Database, Version 2016 Release 1. http://supremecourtdatabase.org. Its license is CC BY-NC 3.0 US: https://creativecommons.org/licenses/by-nc/3.0/us/.

This dataset’s creation was inspired by a blog post by Emily Barry: http://www.emilyinamillion.me/blog/2016/7/13/visualizing-supreme-court-topics-over-time.

The two datasets were merged through much munging and a carefully trained model using the dedupe package. The model’s duplicate threshold was set so as to maximize the F-score where precision had twice as much weight as recall. Still, given occasionally baffling inconsistencies in case naming, citation ids, and decision dates, a very small percentage of texts may be incorrectly matched to metadata. (Sorry.)
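For reference, giving precision twice the weight of recall corresponds to the F-beta score with beta = 0.5. A minimal sketch of that computation (the threshold-tuning itself is handled inside dedupe):

```python
def fbeta_score(precision: float, recall: float, beta: float = 0.5) -> float:
    """F-beta score; beta < 1 weights precision more heavily than recall.

    beta = 0.5 gives precision twice the weight of recall, matching
    the tuning described above.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With balanced precision and recall, F-beta equals both:
assert abs(fbeta_score(0.8, 0.8) - 0.8) < 1e-9
# Higher precision is rewarded more than equally higher recall:
assert fbeta_score(0.9, 0.7) > fbeta_score(0.7, 0.9)
```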

class textacy.datasets.supreme_court.SupremeCourt(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/supreme_court'))[source]

Stream a collection of U.S. Supreme Court decisions from a compressed json file on disk, either as texts or text + metadata pairs.

Download the data (one time only!) from the textacy-data repo (https://github.com/bdewilde/textacy-data), and save its contents to disk:

>>> ds = SupremeCourt()
>>> ds.download()
>>> ds.info
{'name': 'supreme_court',
 'site_url': 'http://caselaw.findlaw.com/court/us-supreme-court',
 'description': 'Collection of ~8.4k decisions issued by the U.S. Supreme Court between November 1946 and June 2016.'}

Iterate over decisions as texts or records with both text and metadata:

>>> for text in ds.texts(limit=3):
...     print(text[:500], end="\n\n")
>>> for text, meta in ds.records(limit=3):
...     print("\n{} ({})\n{}".format(meta["case_name"], meta["decision_date"], text[:500]))

Filter decisions by a variety of metadata fields and text length:

>>> for text, meta in ds.records(opinion_author=109, limit=3):  # Notorious RBG!
...     print(meta["case_name"], meta["decision_direction"], meta["n_maj_votes"])
>>> for text, meta in ds.records(decision_direction="liberal",
...                              issue_area={1, 9, 10}, limit=3):
...     print(meta["case_name"], meta["maj_opinion_author"], meta["n_maj_votes"])
>>> for text, meta in ds.records(opinion_author=102, date_range=('1985-02-11', '1986-02-11')):
...     print("\n{} ({})".format(meta["case_name"], meta["decision_date"]))
...     print(ds.issue_codes[meta["issue"]], "=>", meta["decision_direction"])
>>> for text in ds.texts(min_len=250000):
...     print(len(text))

Stream decisions into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=25))
Corpus(25 docs; 136696 tokens)
Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/supreme_court.

full_date_range

First and last dates for which decisions are available, each as an ISO-formatted string (YYYY-MM-DD).

Type

Tuple[str]

decision_directions

All distinct decision directions, e.g. “liberal”.

Type

Set[str]

opinion_author_codes

Mapping of majority opinion authors, from id code to full name.

Type

Dict[int, str]

issue_area_codes

Mapping of the high-level issue area of the case’s core disagreement, from id code to description.

Type

Dict[int, str]

issue_codes

Mapping of the specific issue of the case’s core disagreement, from id code to description.

Type

Dict[int, str]

property filepath

Full path on disk for SupremeCourt data as compressed json file. None if file is not found, e.g. has not yet been downloaded.

Type

str

download(*, force=False)[source]

Download the data as a Python version-specific compressed json file and save it to disk under the data_dir directory.

Parameters

force (bool) – If True, download the dataset, even if it already exists on disk under data_dir.

texts(*, opinion_author=None, decision_direction=None, issue_area=None, date_range=None, min_len=None, limit=None)[source]

Iterate over decisions in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in chronological order by decision date.

Parameters
  • opinion_author (int or Set[int]) – Filter decisions by the name(s) of the majority opinion’s author, coded as an integer whose mapping is given in SupremeCourt.opinion_author_codes.

  • decision_direction (str or Set[str]) – Filter decisions by the ideological direction of the majority’s decision; see SupremeCourt.decision_directions.

  • issue_area (int or Set[int]) – Filter decisions by the issue area of the case’s subject matter, coded as an integer whose mapping is given in SupremeCourt.issue_area_codes.

  • date_range (List[str] or Tuple[str]) – Filter decisions by the date on which they were decided; both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.

  • min_len (int) – Filter decisions by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit decisions that match all specified filters.

Yields

str – Text of the next decision in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

records(*, opinion_author=None, decision_direction=None, issue_area=None, date_range=None, min_len=None, limit=None)[source]

Iterate over decisions in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in chronological order by decision date.

Parameters
  • opinion_author (int or Set[int]) – Filter decisions by the name(s) of the majority opinion’s author, coded as an integer whose mapping is given in SupremeCourt.opinion_author_codes.

  • decision_direction (str or Set[str]) – Filter decisions by the ideological direction of the majority’s decision; see SupremeCourt.decision_directions.

  • issue_area (int or Set[int]) – Filter decisions by the issue area of the case’s subject matter, coded as an integer whose mapping is given in SupremeCourt.issue_area_codes.

  • date_range (List[str] or Tuple[str]) – Filter decisions by the date on which they were decided; both start and end date must be specified, but a null value for either will be replaced by the min/max date available for the dataset.

  • min_len (int) – Filter decisions by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit decisions that match all specified filters.

Yields

str – Text of the next decision in dataset passing all filters.
dict – Metadata of the next decision in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

Wikimedia articles

All articles for a given Wikimedia project, specified by language and version.

Records include the following key fields (plus a few others):

  • text: Plain text content of the wiki page – no wiki markup!

  • title: Title of the wiki page.

  • wiki_links: A list of other wiki pages linked to from this page.

  • ext_links: A list of external URLs linked to from this page.

  • categories: A list of categories to which this wiki page belongs.

  • dt_created: Date on which the wiki page was first created.

  • page_id: Unique identifier of the wiki page, usable in Wikimedia APIs.

Datasets are generated by the Wikimedia Foundation for a variety of projects, such as Wikipedia and Wikinews. The source files are meant for search indexes, so they’re dumped in Elasticsearch bulk insert format – basically, a compressed JSON file with one record per line. For more information, refer to https://meta.wikimedia.org/wiki/Data_dumps.

class textacy.datasets.wikimedia.Wikimedia(name, meta, project, data_dir, lang='en', version='current', namespace=0)[source]

Base class for project-specific Wikimedia datasets. See:

property filepath

Full path on disk for the Wikimedia CirrusSearch db dump corresponding to the project, lang, and version.

Type

str

download(*, force=False)[source]

Download the Wikimedia CirrusSearch db dump corresponding to the given project, lang, and version as a compressed JSON file, and save it to disk under the data_dir directory.

Parameters

force (bool) – If True, download the dataset, even if it already exists on disk under data_dir.

Note

Some datasets are quite large (e.g. English Wikipedia is ~28GB) and can take hours to fully download.

texts(*, category=None, wiki_link=None, min_len=None, limit=None)[source]

Iterate over wiki pages in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only, in order of appearance in the db dump file.

Parameters
  • category (str or Set[str]) – Filter wiki pages by the categories to which they’ve been assigned. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s categories.

  • wiki_link (str or Set[str]) – Filter wiki pages by the other wiki pages to which they’ve been linked. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s wiki links.

  • min_len (int) – Filter wiki pages by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit wiki pages that match all specified filters.

Yields

str – Text of the next wiki page in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

records(*, category=None, wiki_link=None, min_len=None, limit=None)[source]

Iterate over wiki pages in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs, in order of appearance in the db dump file.

Parameters
  • category (str or Set[str]) – Filter wiki pages by the categories to which they’ve been assigned. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s categories.

  • wiki_link (str or Set[str]) – Filter wiki pages by the other wiki pages to which they’ve been linked. For multiple values (Set[str]), ANY rather than ALL of the values must be found among a given page’s wiki links.

  • min_len (int) – Filter wiki pages by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit wiki pages that match all specified filters.

Yields

str – Text of the next wiki page in dataset passing all filters.
dict – Metadata of the next wiki page in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

class textacy.datasets.wikimedia.Wikipedia(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/wikipedia'), lang='en', version='current', namespace=0)[source]

Stream a collection of Wikipedia pages from a version- and language-specific database dump, either as texts or text + metadata pairs.

Download a database dump (one time only!) and save its contents to disk:

>>> ds = Wikipedia(lang="en", version="current")
>>> ds.download()
>>> ds.info
{'name': 'wikipedia',
 'site_url': 'https://en.wikipedia.org/wiki/Main_Page',
 'description': 'All pages for a given language- and version-specific Wikipedia site snapshot.'}

Iterate over wiki pages as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print(meta["page_id"], meta["title"])

Filter wiki pages by a variety of metadata fields and text length:

>>> for text, meta in ds.records(category="Living people", limit=5):
...     print(meta["title"], meta["categories"])
>>> for text, meta in ds.records(wiki_link="United_States", limit=5):
...     print(meta["title"], meta["wiki_links"])
>>> for text in ds.texts(min_len=10000, limit=5):
...     print(len(text))

Stream wiki pages into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(min_len=2000, limit=50))
Corpus(50 docs; 72368 tokens)
Parameters
  • data_dir (str or pathlib.Path) – Path to directory on disk under which database dump files are stored. Each file is expected as {lang}{project}/{version}/{lang}{project}-{version}-cirrussearch-content.json.gz immediately under this directory.

  • lang (str) – Standard two-letter language code, e.g. “en” => “English”, “de” => “German”. https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

  • version (str) – Database dump version to use. Either “current” for the most recently available version or a date formatted as “YYYYMMDD”. Dumps are produced weekly; check for available versions at https://dumps.wikimedia.org/other/cirrussearch/.

  • namespace (int) – Namespace of the wiki pages to include. Typical, public-facing content is in the 0 (default) namespace.

class textacy.datasets.wikimedia.Wikinews(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/wikinews'), lang='en', version='current', namespace=0)[source]

Stream a collection of Wikinews pages from a version- and language-specific database dump, either as texts or text + metadata pairs.

Download a database dump (one time only!) and save its contents to disk:

>>> ds = Wikinews(lang="en", version="current")
>>> ds.download()
>>> ds.info
{'name': 'wikinews',
 'site_url': 'https://en.wikinews.org/wiki/Main_Page',
 'description': 'All pages for a given language- and version-specific Wikinews site snapshot.'}

Iterate over wiki pages as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print(meta["page_id"], meta["title"])

Filter wiki pages by a variety of metadata fields and text length:

>>> for text, meta in ds.records(category="Politics and conflicts", limit=5):
...     print(meta["title"], meta["categories"])
>>> for text, meta in ds.records(wiki_link="Reuters", limit=5):
...     print(meta["title"], meta["wiki_links"])
>>> for text in ds.texts(min_len=5000, limit=5):
...     print(len(text))

Stream wiki pages into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 33092 tokens)
Parameters
  • data_dir (str or pathlib.Path) – Path to directory on disk under which database dump files are stored. Each file is expected as {lang}{project}/{version}/{lang}{project}-{version}-cirrussearch-content.json.gz immediately under this directory.

  • lang (str) – Standard two-letter language code, e.g. “en” => “English”, “de” => “German”. https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

  • version (str) – Database dump version to use. Either “current” for the most recently available version or a date formatted as “YYYYMMDD”. Dumps are produced weekly; check for available versions at https://dumps.wikimedia.org/other/cirrussearch/.

  • namespace (int) – Namespace of the wiki pages to include. Typical, public-facing content is in the 0 (default) namespace.

Reddit comments

A collection of up to ~1.5 billion Reddit comments posted from October 2007 through May 2015.

Records include the following key fields (plus a few others):

  • body: Full text of the comment.

  • created_utc: Date on which the comment was posted.

  • subreddit: Sub-reddit in which the comment was posted, excluding the familiar “/r/” prefix.

  • score: Net score (upvotes - downvotes) on the comment.

  • gilded: Number of times this comment received reddit gold.

The raw data was originally collected by /u/Stuck_In_the_Matrix via Reddit’s API, and stored for posterity by the Internet Archive. For more details, refer to https://archive.org/details/2015_reddit_comments_corpus.

class textacy.datasets.reddit_comments.RedditComments(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/reddit_comments'))[source]

Stream a collection of Reddit comments from 1 or more compressed files on disk, either as texts or text + metadata pairs.

Download the data (one time only!) or subsets thereof by specifying a date range:

>>> ds = RedditComments()
>>> ds.download(date_range=("2007-10", "2008-03"))
>>> ds.info
{'name': 'reddit_comments',
 'site_url': 'https://archive.org/details/2015_reddit_comments_corpus',
 'description': 'Collection of ~1.5 billion publicly available Reddit comments from October 2007 through May 2015.'}

Iterate over comments as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text)
>>> for text, meta in ds.records(limit=5):
...     print("\n{} {}\n{}".format(meta["author"], meta["created_utc"], text))

Filter comments by a variety of metadata fields and text length:

>>> for text, meta in ds.records(subreddit="politics", limit=5):
...     print(meta["score"], ":", text)
>>> for text, meta in ds.records(date_range=("2008-01", "2008-03"), limit=5):
...     print(meta["created_utc"])
>>> for text, meta in ds.records(score_range=(10, None), limit=5):
...     print(meta["score"], ":", text)
>>> for text in ds.texts(min_len=2000, limit=5):
...     print(len(text))

Stream comments into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=1000))
Corpus(1000 docs; 27582 tokens)
Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/reddit_comments. Each file covers a given month, as indicated in the filename like “YYYY/RC_YYYY-MM.bz2”.

full_date_range

First and last dates for which comments are available, each as an ISO-formatted string (YYYY-MM-DD).

Type

Tuple[str]

filepaths

Full paths on disk for all Reddit comments files found under the RedditComments.data_dir directory, sorted in chronological order.

Type

Tuple[str]

property filepaths

Full paths on disk for all Reddit comments files found under the data_dir directory, sorted chronologically.

Type

Tuple[str]

download(*, date_range=(None, None), force=False)[source]

Download 1 or more monthly Reddit comments files from archive.org and save them to disk under the data_dir directory.

Parameters
  • date_range (Tuple[str]) – Interval specifying the [start, end) dates for which comments files will be downloaded. Each item must be a str formatted as YYYY-MM or YYYY-MM-DD (the latter is converted to the corresponding YYYY-MM value). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • force (bool) – If True, download the dataset, even if it already exists on disk under data_dir.
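The date handling described above can be sketched as a small normalization helper. This is hypothetical, not the library's actual implementation; it uses the dataset's documented October 2007 – May 2015 coverage as the fallback bounds:

```python
def normalize_date_range(date_range, full_range=("2007-10", "2015-05")):
    """Normalize a (start, end) pair of YYYY-MM / YYYY-MM-DD strings.

    Hypothetical helper mirroring the behavior described above:
    YYYY-MM-DD values are truncated to YYYY-MM, and a None endpoint
    falls back to the dataset's min/max month.
    """
    start, end = date_range
    start = (start or full_range[0])[:7]
    end = (end or full_range[1])[:7]
    return (start, end)

assert normalize_date_range(("2008-01-15", None)) == ("2008-01", "2015-05")
assert normalize_date_range((None, "2009-06")) == ("2007-10", "2009-06")
```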

texts(*, subreddit=None, date_range=None, score_range=None, min_len=None, limit=None)[source]

Iterate over comments (text-only) in 1 or more files of this dataset, optionally filtering by a variety of metadata and/or text length, in chronological order.

Parameters
  • subreddit (str or Set[str]) – Filter comments for those which were posted in the specified subreddit(s).

  • date_range (Tuple[str]) – Filter comments for those which were posted within the interval [start, end). Each item must be a str in ISO-standard format, i.e. some amount of YYYY-MM-DDTHH:mm:ss. Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • score_range (Tuple[int]) – Filter comments for those whose score (# upvotes minus # downvotes) is within the interval [low, high). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • min_len (int) – Filter comments for those whose body length in chars is at least this long.

  • limit (int) – Maximum number of comments passing all filters to yield. If None, all comments are iterated over.

Yields

str – Text of the next comment in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

records(*, subreddit=None, date_range=None, score_range=None, min_len=None, limit=None)[source]

Iterate over comments (including text and metadata) in 1 or more files of this dataset, optionally filtering by a variety of metadata and/or text length, in chronological order.

Parameters
  • subreddit (str or Set[str]) – Filter comments for those which were posted in the specified subreddit(s).

  • date_range (Tuple[str]) – Filter comments for those which were posted within the interval [start, end). Each item must be a str in ISO-standard format, i.e. some amount of YYYY-MM-DDTHH:mm:ss. Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • score_range (Tuple[int]) – Filter comments for those whose score (# upvotes minus # downvotes) is within the interval [low, high). Both start and end values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • min_len (int) – Filter comments for those whose body length in chars is at least this long.

  • limit (int) – Maximum number of comments passing all filters to yield. If None, all comments are iterated over.

Yields

str – Text of the next comment in dataset passing all filters.
dict – Metadata of the next comment in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.

Oxford Text Archive literary works

A collection of ~2.7k Creative Commons literary works from the Oxford Text Archive, containing primarily English-language 16th-20th century literature and history.

Records include the following data:

  • text: Full text of the literary work.

  • title: Title of the literary work.

  • author: Author(s) of the literary work.

  • year: Year that the literary work was published.

  • url: URL at which literary work can be found online via the OTA.

  • id: Unique identifier of the literary work within the OTA.

This dataset was compiled by David Mimno from the Oxford Text Archive and stored in his GitHub repo to avoid unnecessary scraping of the OTA site. It is downloaded from that repo and, aside from some light cleaning of its metadata, reproduced exactly here.

class textacy.datasets.oxford_text_archive.OxfordTextArchive(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/oxford_text_archive'))[source]

Stream a collection of English-language literary works from text files on disk, either as texts or text + metadata pairs.

Download the data (one time only!), saving and extracting its contents to disk:

>>> ds = OxfordTextArchive()
>>> ds.download()
>>> ds.info
{'name': 'oxford_text_archive',
 'site_url': 'https://ota.ox.ac.uk/',
 'description': 'Collection of ~2.7k Creative Commons texts from the Oxford Text Archive, containing primarily English-language 16th-20th century literature and history.'}

Iterate over literary works as texts or records with both text and metadata:

>>> for text in ds.texts(limit=3):
...     print(text[:200])
>>> for text, meta in ds.records(limit=3):
...     print("\n{}, {}".format(meta["title"], meta["year"]))
...     print(text[:300])

Filter literary works by a variety of metadata fields and text length:

>>> for text, meta in ds.records(author="Shakespeare, William", limit=1):
...     print("{}\n{}".format(meta["title"], text[:500]))
>>> for text, meta in ds.records(date_range=("1900-01-01", "1990-01-01"), limit=5):
...     print(meta["year"], meta["author"])
>>> for text in ds.texts(min_len=4000000):
...     print(len(text))

Stream literary works into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=5))
Corpus(5 docs; 182289 tokens)
Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which dataset is stored, i.e. /path/to/data_dir/oxford_text_archive.

full_date_range

First and last dates for which works are available, each as an ISO-formatted string (YYYY-MM-DD).

Type

Tuple[str]

authors

Full names of all distinct authors included in this dataset, e.g. “Shakespeare, William”.

Type

Set[str]

download(*, force=False)[source]

Download the data as a zip archive file, then save it to disk and extract its contents under the OxfordTextArchive.data_dir directory.

Parameters

force (bool) – If True, always download the dataset even if it already exists.

property metadata

Dict[str, dict]

texts(*, author=None, date_range=None, min_len=None, limit=None)[source]

Iterate over works in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only.

Parameters
  • author (str or Set[str]) – Filter texts by author name. For multiple values (Set[str]), ANY rather than ALL of the authors must be found among a given work’s authors.

  • date_range (List[str] or Tuple[str]) – Filter texts by the date on which they were published; both start and end date must be specified, but a null value for either will be replaced by the min/max date available in the dataset.

  • min_len (int) – Filter texts by the length (number of characters) of their text content.

  • limit (int) – Return no more than limit texts.

Yields

str – Text of the next work in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.
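The date_range behavior described above (a null endpoint falls back to the dataset-wide min/max date) can be sketched in a few lines. Note that normalize_date_range and the sample full_date_range values below are illustrative stand-ins, not part of textacy’s API:

```python
# Hypothetical sketch of the date_range normalization described above:
# a None endpoint is replaced by the dataset-wide min or max date.
def normalize_date_range(date_range, full_date_range):
    """Fill null start/end dates with the dataset's first/last dates."""
    start, end = date_range
    full_start, full_end = full_date_range
    return (start or full_start, end or full_end)

# example-only values standing in for OxfordTextArchive.full_date_range
full = ("1515-01-01", "1994-01-01")
print(normalize_date_range((None, "1900-01-01"), full))  # start filled in
print(normalize_date_range(("1800-01-01", None), full))  # end filled in
```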

records(*, author=None, date_range=None, min_len=None, limit=None)[source]

Iterate over works in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs.

Parameters
  • author (str or Set[str]) – Filter records by the authors’ name; see OxfordTextArchive.authors.

  • date_range (List[str] or Tuple[str]) – Filter records by the date on which they were published; both start and end date must be specified, but a null value for either will be replaced by the min/max date available in the dataset.

  • min_len (int) – Filter records by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit records.

Yields

str – Text of the next work in dataset passing all filters.

dict – Metadata of the next work in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.
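The ANY-rather-than-ALL author matching noted above amounts to a set-intersection check. A standalone sketch using made-up record metadata rather than textacy internals:

```python
# Sketch of ANY-match author filtering: a record passes if at least one
# of its authors appears in the requested set.
def matches_author(record_authors, wanted):
    # accept either a single name or a set of names
    wanted = {wanted} if isinstance(wanted, str) else set(wanted)
    return bool(wanted & set(record_authors))

record = {"author": ["Shakespeare, William", "Fletcher, John"]}
print(matches_author(record["author"], "Shakespeare, William"))  # True
print(matches_author(record["author"], {"Milton, John"}))        # False
```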

IMDB movie reviews

A collection of 50k highly polar movie reviews posted to IMDB, split evenly into training and testing sets, with 25k positive and 25k negative sentiment labels, as well as some unlabeled reviews.

Records include the following key fields (plus a few others):

  • text: Full text of the review.

  • subset: Subset of the dataset (“train” or “test”) into which the review has been split.

  • label: Sentiment label (“pos” or “neg”) assigned to the review.

  • rating: Numeric rating assigned by the original reviewer, ranging from 1 to 10. Reviews with a rating <= 4 are labeled “neg” and those with a rating >= 7 are labeled “pos”; reviews in between are left unlabeled.

  • movie_id: Unique identifier for the movie under review within IMDB, useful for grouping reviews or joining with an external movie dataset.

Reference: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

class textacy.datasets.imdb.IMDB(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/imdb'))[source]

Stream a collection of IMDB movie reviews from text files on disk, either as texts or text + metadata pairs.

Download the data (one time only!), saving and extracting its contents to disk:

>>> ds = IMDB()
>>> ds.download()
>>> ds.info
{'name': 'imdb',
 'site_url': 'http://ai.stanford.edu/~amaas/data/sentiment',
 'description': 'Collection of 50k highly polar movie reviews split evenly into train and test sets, with 25k positive and 25k negative labels. Also includes some unlabeled reviews.'}

Iterate over movie reviews as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text)
>>> for text, meta in ds.records(limit=5):
...     print("\n{} {}\n{}".format(meta["label"], meta["rating"], text))

Filter movie reviews by a variety of metadata fields and text length:

>>> for text, meta in ds.records(label="pos", limit=5):
...     print(meta["rating"], ":", text)
>>> for text, meta in ds.records(rating_range=(9, 11), limit=5):
...     print(meta["rating"], text)
>>> for text in ds.texts(min_len=1000, limit=5):
...     print(len(text))

Stream movie reviews into a textacy.Corpus:

>>> textacy.Corpus("en", data=ds.records(limit=100))
Corpus(100 docs; 24340 tokens)
Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/imdb.

full_rating_range

Lowest and highest ratings for which movie reviews are available.

Type

Tuple[int]

download(*, force=False)[source]

Download the data as a compressed tar archive file, then save it to disk and extract its contents under the data_dir directory.

Parameters

force (bool) – If True, always download the dataset even if it already exists.

texts(*, subset=None, label=None, rating_range=None, min_len=None, limit=None)[source]

Iterate over movie reviews in this dataset, optionally filtering by a variety of metadata and/or text length, and yield texts only.

Parameters
  • subset (str, {"train", "test"}) – Filter movie reviews by the dataset subset into which they’ve already been split.

  • label (str, {"pos", "neg", "unsup"}) – Filter movie reviews by the assigned sentiment label (or lack thereof, for “unsup”).

  • rating_range (Tuple[int, int]) – Filter movie reviews by the rating assigned by the reviewer. Only those with ratings in the interval [low, high) are included. Both low and high values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • min_len (int) – Filter movie reviews by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit movie reviews that match all specified filters.

Yields

str – Text of the next movie review in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.
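The half-open [low, high) interval used by rating_range can be illustrated with a short sketch; in_rating_range and FULL_RATING_RANGE here are hypothetical helpers, not textacy functions:

```python
# Sketch of the [low, high) rating filter: low is inclusive, high exclusive,
# and a None endpoint falls back to the dataset's min/max valid values.
FULL_RATING_RANGE = (1, 11)  # assumed bounds; high bound is exclusive

def in_rating_range(rating, rating_range):
    low, high = rating_range
    low = FULL_RATING_RANGE[0] if low is None else low
    high = FULL_RATING_RANGE[1] if high is None else high
    return low <= rating < high

# rating_range=(9, 11), as in the example above, keeps only 9s and 10s
print([r for r in range(1, 11) if in_rating_range(r, (9, 11))])  # [9, 10]
```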

records(*, subset=None, label=None, rating_range=None, min_len=None, limit=None)[source]

Iterate over movie reviews in this dataset, optionally filtering by a variety of metadata and/or text length, and yield text + metadata pairs.

Parameters
  • subset (str, {"train", "test"}) – Filter movie reviews by the dataset subset into which they’ve already been split.

  • label (str, {"pos", "neg", "unsup"}) – Filter movie reviews by the assigned sentiment label (or lack thereof, for “unsup”).

  • rating_range (Tuple[int, int]) – Filter movie reviews by the rating assigned by the reviewer. Only those with ratings in the interval [low, high) are included. Both low and high values must be specified, but a null value for either is automatically replaced by the minimum or maximum valid values, respectively.

  • min_len (int) – Filter movie reviews by the length (number of characters) of their text content.

  • limit (int) – Yield no more than limit movie reviews that match all specified filters.

Yields

str – Text of the next movie review in dataset passing all filters.

dict – Metadata of the next movie review in dataset passing all filters.

Raises

ValueError – If any filtering options are invalid.
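Since records() yields (text, metadata) pairs, assembling labeled train/test data is just a matter of splitting on the subset and label fields. A minimal sketch using stand-in records in place of ds.records() output:

```python
# Sketch: split (text, meta) record pairs into labeled train/test lists.
# The sample records below are stand-ins for IMDB.records() output.
records = [
    ("great film", {"subset": "train", "label": "pos"}),
    ("terrible film", {"subset": "train", "label": "neg"}),
    ("loved it", {"subset": "test", "label": "pos"}),
]

train = [(text, meta["label"]) for text, meta in records if meta["subset"] == "train"]
test = [(text, meta["label"]) for text, meta in records if meta["subset"] == "test"]
print(len(train), len(test))  # 2 1
```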

UDHR translations

A collection of translations of the Universal Declaration of Human Rights (UDHR), a milestone document in the history of human rights that first formally established fundamental human rights to be universally protected.

Records include the following fields:

  • text: Full text of the translated UDHR document.

  • lang: ISO-639-1 language code of the text.

  • lang_name: Ethnologue entry for the language (see https://www.ethnologue.com).

The source dataset was compiled and is updated by the Unicode Consortium as a way to demonstrate the use of unicode in representing a wide variety of languages. In fact, the UDHR was chosen because it’s been translated into more languages than any other document! However, this dataset only provides access to records translated into ISO-639-1 languages — that is, major living languages only, rather than every language, major or minor, that has ever existed. If you need access to texts in those other languages, you can find them at UDHR._texts_dirpath.

For more details, go to https://unicode.org/udhr.

class textacy.datasets.udhr.UDHR(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/udhr'))[source]

Stream a collection of UDHR translations from disk, either as texts or text + metadata pairs.

Download the data (one time only!), saving and extracting its contents to disk:

>>> ds = UDHR()
>>> ds.download()
>>> ds.info
{'name': 'udhr',
 'site_url': 'http://www.ohchr.org/EN/UDHR',
 'description': 'A collection of translations of the Universal Declaration of Human Rights (UDHR), a milestone document in the history of human rights that first, formally established fundamental human rights to be universally protected.'}

Iterate over translations as texts or records with both text and metadata:

>>> for text in ds.texts(limit=5):
...     print(text[:500])
>>> for text, meta in ds.records(limit=5):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))

Filter translations by language, and note that some languages have multiple translations:

>>> for text, meta in ds.records(lang="en"):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))
>>> for text, meta in ds.records(lang="zh"):
...     print("\n{} ({})\n{}".format(meta["lang_name"], meta["lang"], text[:500]))

Note: Streaming translations into a textacy.Corpus doesn’t work as it does for the other datasets, since a Corpus assumes a single language while this dataset is multilingual.
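One workaround is to bucket records by their lang code first, then build a separate single-language corpus per bucket. A sketch of the grouping step, using stand-in records rather than the real dataset:

```python
from collections import defaultdict

# Sketch: bucket (text, meta) pairs by language code; each bucket could
# then feed its own single-language corpus.
records = [
    ("All human beings are born free...", {"lang": "en"}),
    ("Tous les êtres humains naissent libres...", {"lang": "fr"}),
    ("All human beings...", {"lang": "en"}),  # some langs have multiple translations
]

by_lang = defaultdict(list)
for text, meta in records:
    by_lang[meta["lang"]].append(text)

print(sorted(by_lang))     # ['en', 'fr']
print(len(by_lang["en"]))  # 2
```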

Parameters

data_dir (str or pathlib.Path) – Path to directory on disk under which the data is stored, i.e. /path/to/data_dir/udhr.

langs

All distinct language codes with texts in this dataset, e.g. “en” for English.

Type

Set[str]

download(*, force=False)[source]

Download the data as a zipped archive of language-specific text files, then save it to disk and extract its contents under the data_dir directory.

Parameters

force (bool) – If True, always download the dataset even if it already exists.

property index

List[Dict[str, obj]]

texts(*, lang=None, limit=None)[source]

Iterate over records in this dataset, optionally filtering by language, and yield texts only.

Parameters
  • lang (str or Set[str]) – Filter records by the language in which they’re written; see UDHR.langs.

  • limit (int) – Return no more than limit texts.

Yields

str – Text of the next record in dataset passing filters.

Raises

ValueError – If any filtering options are invalid.

records(*, lang=None, limit=None)[source]

Iterate over records in this dataset, optionally filtering by language, and yield text + metadata pairs.

Parameters
  • lang (str or Set[str]) – Filter records by the language in which they’re written; see UDHR.langs.

  • limit (int) – Yield no more than limit records.

Yields

str – Text of the next record in dataset passing filters.

dict – Metadata of the next record in dataset passing filters.

Raises

ValueError – If any filtering options are invalid.

Dataset Utils

Shared functionality for downloading, naming, and extracting the contents of datasets, as well as filtering for particular subsets.