IO

Text

Functions for reading from and writing to disk records in plain text format, either as one text per file or one text per line in a file.

textacy.io.text.read_text(filepath, *, mode='rt', encoding=None, lines=False)

Read the contents of a text file at filepath, either all at once or streaming line-by-line.

Parameters
  • filepath (str or pathlib.Path) – Path to file on disk from which data will be read.

  • mode (str) – Mode with which filepath is opened.

  • encoding (str) – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • lines (bool) – If False, all data is read in at once; otherwise, data is read in one line at a time.

Yields

str – Next line of text to read in.

If lines is False, wrap this output in next() to conveniently access the full text.
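The generator behavior described above can be sketched with the standard library alone. The helper name read_text_sketch and the file name are hypothetical; this only illustrates the documented semantics and is not textacy's implementation.

```python
import os
import tempfile


def read_text_sketch(filepath, *, encoding=None, lines=False):
    """Hypothetical stand-in that mimics the documented behavior of read_text()."""
    with open(filepath, mode="rt", encoding=encoding) as f:
        if lines:
            # Stream one line at a time.
            for line in f:
                yield line
        else:
            # Yield the whole file as a single item.
            yield f.read()


# Demo on a throwaway file.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")
with open(path, mode="wt", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# lines=False: the generator yields exactly one item, so next() returns the full text.
full_text = next(read_text_sketch(path, encoding="utf-8", lines=False))

# lines=True: iterate to receive one line per step (this sketch keeps the newlines).
all_lines = list(read_text_sketch(path, encoding="utf-8", lines=True))
```

Because the function is a generator in both modes, nothing is read until the first item is requested.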

textacy.io.text.write_text(data, filepath, *, mode='wt', encoding=None, make_dirs=False, lines=False)

Write text data to disk at filepath, either all at once or streaming line-by-line.

Parameters
  • data (str or Iterable[str]) –

    If lines is False, a single string to write to disk; for example:

    "isnt rick and morty that thing you get when you die and your body gets all stiff"
    

    If lines is True, an iterable of strings to write to disk, one item per line; for example:

    ["isnt rick and morty that thing you get when you die and your body gets all stiff",
     "You're thinking of rigor mortis. Rick and morty is when you get trolled into watching 'never gonna give you up'",
     "That's rickrolling. Rick and morty is a type of pasta"]
    

  • filepath (str or pathlib.Path) – Path to file on disk to which data will be written.

  • mode (str) – Mode with which filepath is opened.

  • encoding (str) – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • make_dirs (bool) – If True, automatically create (sub)directories if not already present in order to write filepath.

  • lines (bool) – If False, all data is written at once; otherwise, data is written to disk one line at a time.
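As a rough illustration of how make_dirs and lines interact, here is a standard-library sketch. The helper write_text_sketch is hypothetical, and appending a newline after each item is an assumption of the sketch rather than a statement about textacy's internals.

```python
import os
import tempfile


def write_text_sketch(data, filepath, *, encoding=None, make_dirs=False, lines=False):
    """Hypothetical stand-in that mimics the documented behavior of write_text()."""
    parent = os.path.dirname(filepath)
    if make_dirs and parent:
        # Create missing parent directories before opening the file.
        os.makedirs(parent, exist_ok=True)
    with open(filepath, mode="wt", encoding=encoding) as f:
        if lines:
            # One item per line; adding the newline is an assumption of this sketch.
            for line in data:
                f.write(line + "\n")
        else:
            # A single string written in one call.
            f.write(data)


# The "nested/dir" directories do not exist yet, so make_dirs=True is required.
out_path = os.path.join(tempfile.mkdtemp(), "nested", "dir", "lines.txt")
write_text_sketch(["alpha", "beta"], out_path, encoding="utf-8", make_dirs=True, lines=True)

with open(out_path, mode="rt", encoding="utf-8") as f:
    written = f.read()
```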

JSON

Functions for reading from and writing to disk records in JSON format, as one record per file or one record per line in a file.

textacy.io.json.read_json(filepath, *, mode='rt', encoding=None, lines=False)

Read the contents of a JSON file at filepath, either all at once or streaming item-by-item.

Parameters
  • filepath (str or pathlib.Path) – Path to file on disk from which data will be read.

  • mode (str) – Mode with which filepath is opened.

  • encoding (str) – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • lines (bool) – If False, all data is read in at once; otherwise, data is read in one line at a time.

Yields

object – Next JSON item; could be a dict, list, int, float, str, depending on the value of lines.
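The difference between the two values of lines can be sketched with the standard json module. The helper read_json_sketch and the file names are hypothetical; skipping blank lines is an assumption of the sketch.

```python
import json
import os
import tempfile


def read_json_sketch(filepath, *, encoding=None, lines=False):
    """Hypothetical stand-in that mimics the documented behavior of read_json()."""
    with open(filepath, mode="rt", encoding=encoding) as f:
        if lines:
            # One JSON record per line: decode each non-blank line separately.
            for line in f:
                if line.strip():
                    yield json.loads(line)
        else:
            # The whole file is a single JSON value.
            yield json.load(f)


tmp_dir = tempfile.mkdtemp()

# One record (here, a list) per file.
single_path = os.path.join(tmp_dir, "single.json")
with open(single_path, mode="wt", encoding="utf-8") as f:
    json.dump([{"title": "2BR02B"}, {"title": "Harrison Bergeron"}], f)

# One record per line.
lines_path = os.path.join(tmp_dir, "records.jsonl")
with open(lines_path, mode="wt", encoding="utf-8") as f:
    f.write('{"title": "2BR02B"}\n{"title": "Harrison Bergeron"}\n')

whole = next(read_json_sketch(single_path, encoding="utf-8"))
records = list(read_json_sketch(lines_path, encoding="utf-8", lines=True))
```

With lines=False the single yielded item is whatever the file contains (a list in this demo); with lines=True each yielded item is one decoded line.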

textacy.io.json.read_json_mash(filepath, *, mode='rt', encoding=None, buffer_size=2048)

Read the contents of a JSON file at filepath one item at a time, where all of the items have been mashed together, end-to-end, on a single line.

Parameters
  • filepath (str or pathlib.Path) – Path to file on disk from which data will be read.

  • mode (str) – Mode with which filepath is opened.

  • encoding (str) – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • buffer_size (int) – Number of bytes to read in as a chunk.

Yields

object – Next valid JSON object, converted to native Python equivalent.

Note

Storing JSON data in this format is Not Good. Reading it is doable, so this function is included for users’ convenience, but note that there is no analogous write_json_mash() function. Don’t do it.
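One way to read such "mashed" data is to decode incrementally from a growing buffer with json.JSONDecoder.raw_decode. The sketch below is an illustration under stated assumptions, not textacy's implementation: the helper name is hypothetical, it reads from an already-open text stream rather than a path, buffer_size counts characters here, and every top-level item is assumed to be an object or array (a bare number split across two chunks could be decoded too early).

```python
import io
import json


def read_json_mash_sketch(stream, *, buffer_size=2048):
    """Hypothetical helper: yield JSON items that were concatenated end-to-end."""
    decoder = json.JSONDecoder()
    buffer = ""
    while True:
        chunk = stream.read(buffer_size)
        if not chunk:
            break
        # raw_decode() does not skip leading whitespace, so strip it first.
        buffer = (buffer + chunk).lstrip()
        while buffer:
            try:
                item, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                # The buffer ends mid-item; read another chunk and retry.
                break
            yield item
            buffer = buffer[end:].lstrip()


# A small buffer forces items to straddle chunk boundaries.
mashed = io.StringIO('{"a": 1}{"b": [2, 3]}{"c": "x}y"}')
items = list(read_json_mash_sketch(mashed, buffer_size=8))
```

Using the decoder rather than counting braces means a closing brace inside a string value, as in the third item, does not end the item prematurely.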

textacy.io.json.write_json(data, filepath, *, mode='wt', encoding=None, make_dirs=False, lines=False, ensure_ascii=False, separators=(',', ':'), sort_keys=False, indent=None)

Write JSON data to disk at filepath, either all at once or streaming item-by-item.

Parameters
  • data (JSON) –

    JSON data to write to disk, including any Python objects encodable by default in json, as well as dates and datetimes. For example:

    [
        {"title": "Harrison Bergeron", "text": "The year was 2081, and everybody was finally equal."},
        {"title": "2BR02B", "text": "Everything was perfectly swell."},
        {"title": "Slaughterhouse-Five", "text": "All this happened, more or less."},
    ]
    

    If lines is False, all of data is written as a single object; if True, each item is written to a separate line in filepath.

  • filepath (str or pathlib.Path) – Path to file on disk to which data will be written.

  • mode (str) – Mode with which filepath is opened.

  • encoding (str) – Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

  • make_dirs (bool) – If True, automatically create (sub)directories if not already present in order to write filepath.

  • lines (bool) – If False, all data is written at once; otherwise, data is written to disk one item at a time.

  • ensure_ascii (bool) – If True, all non-ASCII characters are escaped; otherwise, non-ASCII characters are output as-is.

  • separators (Tuple[str, str]) – An (item_separator, key_separator) pair specifying how items and keys are separated in output.

  • sort_keys (bool) – If True, each output dictionary is sorted by key; otherwise, dictionary ordering is taken as-is.

  • indent (int or str) – If a positive integer or non-empty string, items are pretty-printed with that indent per nesting level; if 0, negative, or “”, only newlines are inserted between items; if None, the most compact representation is used when storing data.
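The formatting options above share their names with arguments of the standard library's json.dumps. Assuming write_json() forwards them to the json module (the parameter names suggest this, but it is an assumption here), their effect can be previewed without touching disk. The sample record is made up for illustration.

```python
import json

# Made-up record; "caf\u00e9" contains one non-ASCII character.
record = {"title": "2BR02B", "text": "caf\u00e9", "year": 1962}

# Compact separators, non-ASCII characters written as-is.
compact = json.dumps(record, separators=(",", ":"), ensure_ascii=False)

# Same separators, but non-ASCII characters escaped.
escaped = json.dumps(record, separators=(",", ":"), ensure_ascii=True)

# Keys sorted and items pretty-printed with a two-space indent.
sorted_pretty = json.dumps(record, sort_keys=True, indent=2, ensure_ascii=False)
```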

class textacy.io.json.ExtendedJSONEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Sub-class of json.JSONEncoder, used to write JSON data to disk in write_json() while handling a broader range of Python objects.

default(obj)

Implement this method in a subclass such that it returns a serializable object for obj, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, obj):
    try:
        iterable = iter(obj)
    except TypeError:
        # obj is not iterable; fall through to the base class
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return super().default(obj)
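Since write_json() is documented as also accepting dates and datetimes, a date-aware encoder is a natural instance of this pattern. The class below is a hypothetical sketch, not the source of ExtendedJSONEncoder; in particular, serializing to ISO 8601 strings is an assumption.

```python
import datetime
import json


class DateAwareEncoder(json.JSONEncoder):
    """Hypothetical encoder; not textacy's ExtendedJSONEncoder."""

    def default(self, obj):
        # datetime.datetime is a subclass of datetime.date, so one check covers both.
        if isinstance(obj, datetime.date):
            # ISO 8601 output is an assumption of this sketch.
            return obj.isoformat()
        # Anything else is left to the base class, which raises TypeError.
        return super().default(obj)


encoded = json.dumps({"published": datetime.date(1961, 10, 1)}, cls=DateAwareEncoder)
```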

CSV

Functions for reading from and writing to disk records in CSV format, where CSVs may be delimited not only by commas (the default) but tabs, pipes, and other valid one-char delimiters.

textacy.io.csv.read_csv(filepath, *, encoding=None, fieldnames=None, dialect='excel', delimiter=',', quoting=2)

Read the contents of a CSV file at filepath, streaming line-by-line, where each line is a list of strings and/or floats whose values are separated by delimiter.

Parameters
  • filepath (str or pathlib.Path) – Path to file on disk from which data will be read.

  • encoding (str) – Name of the encoding used to decode or encode the data in filepath.

  • fieldnames (List[str] or 'infer') – If specified, gives names for columns of values, which are used as keys in an ordered dictionary representation of each line’s data. If ‘infer’, the first kB of data is analyzed to make a guess about whether the first row is a header of column names, and if so, those names are used as keys. If None, no column names are used, and each line is returned as a list of strings/floats.

  • dialect (str) – Grouping of formatting parameters that determine how the data is parsed when reading/writing. If ‘infer’, the first kB of data is analyzed to get a best guess for the correct dialect.

  • delimiter (str) – 1-character string used to separate fields in a row.

  • quoting (int) – Type of quoting to apply to field values. See: https://docs.python.org/3/library/csv.html#csv.QUOTE_NONNUMERIC

Yields

List[obj] – Next row, whose elements are strings and/or floats, if fieldnames is None or ‘infer’ doesn’t detect a header row.

or

Dict[str, obj] – Next row, as an ordered dictionary of (key, value) pairs, where keys are column names and values are the corresponding strings and/or floats, if fieldnames is a list of column names or ‘infer’ detects a header row.
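The default quoting=2 corresponds to csv.QUOTE_NONNUMERIC in the standard library, which is why rows can contain floats as well as strings. The sketch below uses the stdlib csv module directly on an in-memory sample (made up for illustration) to show both row shapes; it does not call textacy, and using csv.DictReader to stand in for the header-row case is an assumption.

```python
import csv
import io

# Made-up sample: quoted text fields, unquoted numeric fields.
raw = '"text","score"\n"That was a great movie!",0.9\n"Worst. Movie. Ever.",-1.0\n'

# With QUOTE_NONNUMERIC, every unquoted field is converted to float on read.
list_rows = list(
    csv.reader(io.StringIO(raw), dialect="excel", delimiter=",", quoting=csv.QUOTE_NONNUMERIC)
)

# DictReader takes column names from the first row and yields one mapping per data row.
dict_rows = list(csv.DictReader(io.StringIO(raw), quoting=csv.QUOTE_NONNUMERIC))
```

Note that an unquoted field that is not a valid number raises ValueError under this quoting mode, so input files need their text fields quoted.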

textacy.io.csv.write_csv(data, filepath, *, encoding=None, make_dirs=False, fieldnames=None, dialect='excel', delimiter=',', quoting=2)

Write rows of data to disk at filepath, where each row is an iterable or a dictionary of strings and/or numbers, written to one line with values separated by delimiter.

Parameters
  • data (Iterable[Iterable] or Iterable[dict]) –

    If fieldnames is None, an iterable of iterables of strings and/or numbers to write to disk; for example:

    [['That was a great movie!', 0.9],
     ['The movie was okay, I guess.', 0.2],
     ['Worst. Movie. Ever.', -1.0]]
    

    If fieldnames is specified, an iterable of dictionaries with string and/or number values to write to disk; for example:

    [{'text': 'That was a great movie!', 'score': 0.9},
     {'text': 'The movie was okay, I guess.', 'score': 0.2},
     {'text': 'Worst. Movie. Ever.', 'score': -1.0}]
    

  • filepath (str or pathlib.Path) – Path to file on disk to which data will be written.

  • encoding (str) – Name of the encoding used to decode or encode the data in filepath.

  • make_dirs (bool) – If True, automatically create (sub)directories if not already present in order to write filepath.

  • fieldnames (List[str]) –

    Sequence of keys that identifies the order in which the values in each row’s dictionary are written to filepath. These are included in filepath as a header row of column names.

    Note

    Only specify this if data is an iterable of dictionaries.

  • dialect (str) – Grouping of formatting parameters that determine how the data is parsed when reading/writing.

  • delimiter (str) – 1-character string used to separate fields in a row.

  • quoting (int) – Type of quoting to apply to field values. See: https://docs.python.org/3/library/csv.html#csv.QUOTE_NONNUMERIC
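To see what the dictionary form with fieldnames produces under the default dialect and quoting, here is a sketch using the standard csv module and an in-memory buffer. It does not call textacy; using csv.DictWriter to stand in for the fieldnames case is an assumption.

```python
import csv
import io

rows = [
    {"text": "That was a great movie!", "score": 0.9},
    {"text": "Worst. Movie. Ever.", "score": -1.0},
]
fieldnames = ["text", "score"]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=fieldnames,
    dialect="excel",
    delimiter=",",
    quoting=csv.QUOTE_NONNUMERIC,  # the integer 2: quote every non-numeric field
)
writer.writeheader()    # header row of column names, in fieldnames order
writer.writerows(rows)  # one line per dictionary

output_lines = buffer.getvalue().splitlines()
```

Strings, including the header names, come out quoted while numbers are left bare, which is what lets a reader using the same quoting mode recover the floats.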

Matrix

Functions for reading from and writing to disk CSC and CSR sparse matrices in numpy binary format.

textacy.io.matrix.read_sparse_matrix(filepath, *, kind='csc')

Read the data, indices, indptr, and shape arrays from a .npz file on disk at filepath, and return an instantiated sparse matrix.

Parameters
  • filepath (str or pathlib.Path) – Path to file on disk from which data will be read.

  • kind ({'csc', 'csr'}) – Kind of sparse matrix to instantiate.

Returns

An instantiated sparse matrix, depending on the value of kind.

Return type

scipy.sparse.csc_matrix or scipy.sparse.csr_matrix

textacy.io.matrix.write_sparse_matrix(data, filepath, *, compressed=True, make_dirs=False)

Write sparse matrix data to disk at filepath, optionally compressed, into a single .npz file.

Parameters
  • data (scipy.sparse.csc_matrix or scipy.sparse.csr_matrix) – Sparse matrix to write to disk.

  • filepath (str or pathlib.Path) – Path to file on disk to which data will be written. If filepath does not end in .npz, that extension is automatically appended to the name.

  • compressed (bool) – If True, save arrays into a single file in compressed numpy binary format.

  • make_dirs (bool) – If True, automatically create (sub)directories if not already present in order to write filepath.

spaCy

Functions for reading from and writing to disk spaCy documents in either pickle or binary format. Be warned: Both formats have pros and cons.

textacy.io.spacy.read_spacy_docs(filepath, *, format='pickle', lang=None)

Read the contents of a file at filepath, written either in pickle or binary format.

Parameters
  • filepath (str or pathlib.Path) – Path to file on disk from which data will be read.

  • format ({"pickle", "binary"}) –

    Format of the data that was written to disk. If ‘pickle’, use pickle in Python’s stdlib; if ‘binary’, use the 3rd-party msgpack library.

    Warning

    Docs written in pickle format were saved all together as a list, which means they’re all loaded into memory at once before streaming one by one. Mind your RAM usage, especially when reading many docs!

    Warning

    When writing docs in binary format, spaCy’s built-in spacy.Doc.to_bytes() method is used, but when reading the data back in read_spacy_docs(), experimental and unofficial work-arounds are used to allow for all the docs in data to be read from the same file. If spaCy changes, this code could break, so use this functionality at your own risk!

  • lang (str or spacy.Language) – Already-instantiated spacy.Language object, or the string name by which it can be loaded, used to process the docs written to disk at filepath. Note that this is only applicable when format="binary".

Yields

spacy.tokens.Doc – Next deserialized document.

Raises

ValueError – if format is not “pickle” or “binary”, or if lang is not provided when format="binary"

textacy.io.spacy.write_spacy_docs(data, filepath, *, make_dirs=False, format='pickle', exclude=('tensor', ), include_tensor=None)

Write one or more Docs to disk at filepath in either pickle or binary format.

Parameters
  • data (spacy.tokens.Doc or Iterable[spacy.tokens.Doc]) – A single Doc or a sequence of Docs to write to disk.

  • filepath (str or pathlib.Path) – Path to file on disk to which data will be written.

  • make_dirs (bool) – If True, automatically create (sub)directories if not already present in order to write filepath.

  • format ({"pickle", "binary"}) –

    Format of the data written to disk. If “pickle”, use Python’s stdlib pickle; if “binary”, use the 3rd-party msgpack library.

    Warning

    When writing docs in pickle format, all the docs in data must be saved as a list, which means they’re all loaded into memory. Mind your RAM usage, especially when writing many docs!

    Warning

    When writing docs in binary format, spaCy’s built-in spacy.Doc.to_bytes() method is used, but when reading the data back in read_spacy_docs(), experimental and unofficial work-arounds are used to allow for all the docs in data to be read from the same file. If spaCy changes, this code could break, so use this functionality at your own risk!

  • exclude (List[str]) – String names of serialization fields to exclude; see https://spacy.io/api/doc#serialization-fields for options. By default, excludes tensors in order to reproduce existing behavior of include_tensor=False.

  • include_tensor (bool) – DEPRECATED! Use exclude instead. If False, Doc tensors are not written to disk; otherwise, they are. Note that this is only applicable when format="binary". Also note that including tensors significantly increases the file size of serialized docs.

Raises

ValueError – if format is not “pickle” or “binary”

HTTP

Functions for reading data from URLs via streaming HTTP requests and either reading it into memory or writing it directly to disk.

textacy.io.http.read_http_stream(url, *, lines=False, decode_unicode=False, chunk_size=1024, auth=None)

Read data from url in a stream, either all at once or line-by-line.

Parameters
  • url (str) – URL to which a GET request is made for data.

  • lines (bool) – If False, yield all of the data at once; otherwise, yield data line-by-line.

  • decode_unicode (bool) – If True, yield data as unicode, where the encoding is taken from the HTTP response headers; otherwise, yield bytes.

  • chunk_size (int) – Number of bytes read into memory per chunk. Because decoding may occur, this is not necessarily the length of each chunk.

  • auth (Tuple[str, str]) –

    (username, password) pair for simple HTTP authentication required (if at all) to access the data at url.

Yields

str or bytes – If lines is True, the next line in the response data, which is bytes if decode_unicode is False or unicode otherwise. If lines is False, yields the full response content, either as bytes or unicode.

textacy.io.http.write_http_stream(url, filepath, *, mode='wt', encoding=None, make_dirs=False, chunk_size=1024, auth=None)

Download data from url in a stream, and write successive chunks to disk at filepath.

Parameters
  • url (str) – URL to which a GET request is made for data.

  • filepath (str or pathlib.Path) – Path to file on disk to which data will be written.

  • mode (str) – Mode with which filepath is opened.

  • encoding (str) –

    Name of the encoding used to decode or encode the data in filepath. Only applicable in text mode.

    Note

    The encoding on the HTTP response is inferred from its headers, or set to ‘utf-8’ as a fall-back in the case that no encoding is detected. It is not set by encoding.

  • make_dirs (bool) – If True, automatically create (sub)directories if not already present in order to write filepath.

  • chunk_size (int) – Number of bytes read into memory per chunk. Because decoding may occur, this is not necessarily the length of each chunk.

  • auth (Tuple[str, str]) –

    (username, password) pair for simple HTTP authentication required (if at all) to access the data at url.

IO Utils

Functions to help read and write data to disk in a variety of formats.

textacy.io.utils.open_sesame(filepath, *, mode='rt', encoding=None, errors=None, newline=None, compression='infer', make_dirs=False)

Open file filepath. Automatically handle file compression, relative paths and symlinks, and missing intermediate directory creation, as needed.

open_sesame may be used as a drop-in replacement for io.open().

Parameters
  • filepath (str or pathlib.Path) – Path on disk (absolute or relative) of the file to open.

  • mode (str) – The mode in which filepath is opened.

  • encoding (str) – Name of the encoding used to decode or encode filepath. Only applicable in text mode.

  • errors (str) – String specifying how encoding/decoding errors are handled. Only applicable in text mode.

  • newline (str) – String specifying how universal newlines mode works. Only applicable in text mode.

  • compression (str) – Type of compression, if any, with which filepath is read from or written to disk. If None, no compression is used; if ‘infer’, compression is inferred from the extension on filepath.

  • make_dirs (bool) – If True, automatically create (sub)directories if not already present in order to write filepath.

Returns

file object

Raises
  • TypeError – if filepath is not a string

  • ValueError – if encoding is specified but mode is binary

  • OSError – if filepath doesn’t exist but mode is read
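The idea behind compression='infer' can be sketched as a lookup from file extension to an opener function. Everything below is an assumption made for illustration: the helper name, the extension table, and the choice of stdlib modules; the set of formats textacy actually supports may differ.

```python
import bz2
import gzip
import lzma
import os
import tempfile

# Hypothetical extension-to-opener table; textacy's supported formats may differ.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_sketch(filepath, *, mode="rt", encoding=None, make_dirs=False):
    """Hypothetical stand-in showing extension-based compression inference."""
    parent = os.path.dirname(filepath)
    if make_dirs and parent and "r" not in mode:
        # Only create directories when the file is being written.
        os.makedirs(parent, exist_ok=True)
    extension = os.path.splitext(filepath)[1]
    # Fall back to the built-in open() for unrecognized extensions.
    opener = _OPENERS.get(extension, open)
    return opener(filepath, mode=mode, encoding=encoding)


path = os.path.join(tempfile.mkdtemp(), "sub", "notes.txt.gz")

with open_sketch(path, mode="wt", encoding="utf-8", make_dirs=True) as f:
    f.write("hello")

with open_sketch(path, mode="rt", encoding="utf-8") as f:
    roundtrip = f.read()

# Reading the raw bytes confirms the file on disk is really gzip-compressed.
with open(path, mode="rb") as f:
    magic = f.read(2)
```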

textacy.io.utils.coerce_content_type(content, file_mode)

If the content to be written to file and the file_mode used to open it are incompatible (either bytes with text mode or unicode with bytes mode), try to coerce the content type so it can be written.
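A minimal sketch of this kind of coercion follows. The helper name and the encoding parameter are hypothetical (the documented function takes only content and file_mode), and UTF-8 is an assumed default.

```python
def coerce_sketch(content, file_mode, encoding="utf-8"):
    """Hypothetical helper: make content compatible with the file mode."""
    is_binary_mode = "b" in file_mode
    if is_binary_mode and isinstance(content, str):
        # Text headed for a binary file: encode it.
        return content.encode(encoding)
    if not is_binary_mode and isinstance(content, bytes):
        # Bytes headed for a text file: decode them.
        return content.decode(encoding)
    # Already compatible.
    return content


as_bytes = coerce_sketch("caf\u00e9", "wb")
as_text = coerce_sketch(b"caf\xc3\xa9", "wt")
unchanged = coerce_sketch("plain", "wt")
```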

textacy.io.utils.split_records(items, content_field, itemwise=False)

Split records’ content (text) from associated metadata, but keep them paired together.

Parameters
  • items (Iterable[dict] or Iterable[list]) – An iterable of dicts, e.g. as read from disk by read_json(lines=True), or an iterable of lists, e.g. as read from disk by read_csv().

  • content_field (str or int) – If str, key in each dict item whose value is the item’s content (text); if int, index of the value in each list item corresponding to the item’s content (text).

  • itemwise (bool) – If True, content + metadata are paired item-wise as an iterable of (content, metadata) 2-tuples; if False, content + metadata are paired by position in two parallel iterables in the form of a (iterable(content), iterable(metadata)) 2-tuple.

Returns

Generator(Tuple[str, dict]): If itemwise is True and items is Iterable[dict]; the first element in each tuple is the item’s content, the second element is its metadata as a dictionary.

Generator(Tuple[str, list]): If itemwise is True and items is Iterable[list]; the first element in each tuple is the item’s content, the second element is its metadata as a list.

Tuple[Iterable[str], Iterable[dict]]: If itemwise is False and items is Iterable[dict]; the first element of the tuple is an iterable of items’ contents, the second is an iterable of their metadata dicts.

Tuple[Iterable[str], Iterable[list]]: If itemwise is False and items is Iterable[list]; the first element of the tuple is an iterable of items’ contents, the second is an iterable of their metadata lists.
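The two pairing modes can be illustrated with a small stand-in. The helper below is hypothetical and simplified: it copies each item before removing the content field, and in the itemwise=False case it builds plain lists eagerly, which may differ from how textacy returns its iterables.

```python
def split_records_sketch(items, content_field, itemwise=False):
    """Hypothetical stand-in for the documented split_records() behavior."""

    def pairs():
        for item in items:
            # Copy so the caller's records are not modified.
            metadata = dict(item) if isinstance(item, dict) else list(item)
            # pop() works for both cases: by key for dicts, by index for lists.
            content = metadata.pop(content_field)
            yield content, metadata

    if itemwise:
        # A generator of (content, metadata) 2-tuples.
        return pairs()
    # Two parallel sequences, paired by position (eager in this sketch).
    paired = list(pairs())
    contents = [content for content, _ in paired]
    metadatas = [metadata for _, metadata in paired]
    return contents, metadatas


records = [{"text": "Hello", "id": 1}, {"text": "World", "id": 2}]
itemwise_pairs = list(split_records_sketch(records, "text", itemwise=True))
contents, metadatas = split_records_sketch(records, "text", itemwise=False)

# List-style rows use an integer index as content_field.
csv_rows = [["a", "Hello"], ["b", "World"]]
row_pairs = list(split_records_sketch(csv_rows, 1, itemwise=True))
```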

textacy.io.utils.unzip(seq)

Borrowed from toolz.sandbox.core.unzip, but using cytoolz instead of toolz to avoid the additional dependency.

textacy.io.utils.get_filepaths(dirpath, *, match_regex=None, ignore_regex=None, extension=None, ignore_invisible=True, recursive=False)

Yield full paths of files on disk under directory dirpath, optionally filtering for or against particular patterns or file extensions and crawling all subdirectories.

Parameters
  • dirpath (str or pathlib.Path) – Path to directory on disk where files are stored.

  • match_regex (str) – Regular expression pattern. Only files whose names match this pattern are included.

  • ignore_regex (str) – Regular expression pattern. Only files whose names do not match this pattern are included.

  • extension (str) – File extension, e.g. “.txt” or “.json”. Only files whose extensions match are included.

  • ignore_invisible (bool) – If True, ignore invisible files, i.e. those that begin with a period; otherwise, include them.

  • recursive (bool) – If True, iterate recursively through subdirectories in search of files to include; otherwise, only return files located directly under dirpath.

Yields

str – Next file’s name, including the full path on disk.

Raises

OSError – if dirpath is not found on disk
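The filtering rules above can be approximated with os.walk. This is a sketch under assumptions, not textacy's code: the helper name is hypothetical, patterns are applied with re.search against the bare file name, and files are visited in sorted order within each directory.

```python
import os
import re
import tempfile


def get_filepaths_sketch(dirpath, *, match_regex=None, ignore_regex=None,
                         extension=None, ignore_invisible=True, recursive=False):
    """Hypothetical stand-in for the documented get_filepaths() filters."""
    if not os.path.isdir(dirpath):
        raise OSError("directory '{}' not found".format(dirpath))
    for root, _dirnames, filenames in os.walk(dirpath):
        for filename in sorted(filenames):
            if ignore_invisible and filename.startswith("."):
                continue
            if match_regex and not re.search(match_regex, filename):
                continue
            if ignore_regex and re.search(ignore_regex, filename):
                continue
            if extension and os.path.splitext(filename)[1] != extension:
                continue
            yield os.path.join(root, filename)
        if not recursive:
            # os.walk visits dirpath itself first; stop before any subdirectory.
            break


# Build a small throwaway tree to search.
root_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(root_dir, "sub"))
for relpath in ["a.txt", "b.json", ".hidden.txt", os.path.join("sub", "c.txt")]:
    with open(os.path.join(root_dir, relpath), mode="wt") as f:
        f.write("x")

top_level = [os.path.basename(p) for p in get_filepaths_sketch(root_dir, extension=".txt")]
everything = sorted(
    os.path.basename(p)
    for p in get_filepaths_sketch(root_dir, extension=".txt", recursive=True)
)
```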

textacy.io.utils.download_file(url, *, filename=None, dirpath=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data'), force=False)

Download a file from url and save it to disk.

Parameters
  • url (str) – Web address from which to download data.

  • filename (str) – Name of the file to which downloaded data is saved. If None, a filename will be inferred from the url.

  • dirpath (str or pathlib.Path) – Full path to the directory on disk under which downloaded data will be saved as filename.

  • force (bool) – If True, download the data even if it already exists at dirpath/filename; otherwise, only download if the data doesn’t already exist on disk.

Returns

Full path of file saved to disk.

Return type

str

textacy.io.utils.get_filename_from_url(url)

Derive a filename from a URL’s path.

Parameters

url (str) – URL from which to extract a filename.

Returns

Filename in URL.

Return type

str
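One plausible way to derive a filename from a URL's path uses urllib.parse. The helper below is a hypothetical sketch; whether textacy also percent-decodes the name, as done here, is an assumption. The example URLs are made up.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url_sketch(url):
    """Hypothetical helper: take the last segment of the URL's path."""
    # urlparse() separates the path from the query string and fragment.
    path = urlparse(url).path
    # URL paths always use forward slashes, hence posixpath.
    return posixpath.basename(unquote(path))


name = filename_from_url_sketch("https://example.com/data/corpus.tar.gz?download=1")
spaced = filename_from_url_sketch("https://example.com/files/my%20notes.txt#top")
```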

textacy.io.utils.unpack_archive(filepath, *, extract_dir=None)

Extract data from a zip or tar archive file into a directory (or do nothing if the file isn’t an archive).

Parameters
  • filepath (str or pathlib.Path) – Full path to file on disk from which archived contents will be extracted.

  • extract_dir (str or pathlib.Path) – Full path of the directory into which contents will be extracted. If not provided, the same directory as filepath is used.

Returns

Path to directory of extracted contents.

Return type

str