Resources

ConceptNet

ConceptNet is a multilingual knowledge base, representing common words and phrases and the common-sense relationships between them. This information is collected from a variety of sources, including crowd-sourced resources (e.g. Wiktionary, Open Mind Common Sense), games with a purpose (e.g. Verbosity, nadya.jp), and expert-created resources (e.g. WordNet, JMDict).

The interface in textacy gives access to several key relationships between terms that are useful in a variety of NLP tasks:

  • antonyms: terms that are opposites of each other in some relevant way

  • hyponyms: terms that are subtypes or specific instances of other terms

  • meronyms: terms that are parts of other terms

  • synonyms: terms that are sufficiently similar that they may be used interchangeably

class textacy.resources.concept_net.ConceptNet(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/concept_net'), version='5.7.0')

Interface to ConceptNet, a multilingual knowledge base representing common words and phrases and the common-sense relationships between them.

Download the data (one time only!), and save its contents to disk:

>>> import textacy.resources
>>> rs = textacy.resources.ConceptNet()
>>> rs.download()
>>> rs.info
{'name': 'concept_net',
 'site_url': 'http://conceptnet.io',
 'publication_url': 'https://arxiv.org/abs/1612.03975',
 'description': 'An open, multilingual semantic network of general knowledge, designed to help computers understand the meanings of words.'}

Access other same-language terms related to a given term in a variety of ways:

>>> rs.get_synonyms("spouse", lang="en", sense="n")
['mate', 'married person', 'better half', 'partner']
>>> rs.get_antonyms("love", lang="en", sense="v")
['detest', 'hate', 'loathe']
>>> rs.get_hyponyms("marriage", lang="en", sense="n")
['cohabitation situation', 'union', 'legal agreement', 'ritual', 'family', 'marital status']

Note: The very first time a given relationship is accessed, the full ConceptNet db must be parsed and split for fast future access. This can take a couple of minutes; be patient.

When passing a spaCy Token or Span, the corresponding lang and sense are inferred automatically from the object:

>>> text = "The quick brown fox jumps over the lazy dog."
>>> doc = textacy.make_spacy_doc(text, lang="en")
>>> rs.get_synonyms(doc[1])  # quick
['flying', 'fast', 'rapid', 'ready', 'straightaway', 'nimble', 'speedy', 'warm']
>>> rs.get_synonyms(doc[4:5])  # jumps
['leap', 'startle', 'hump', 'flinch', 'jump off', 'skydive', 'jumpstart', ...]

Many terms won’t have entries, for actual linguistic reasons or because the db’s coverage of a given language’s vocabulary isn’t comprehensive:

>>> rs.get_meronyms(doc[3])  # fox
[]
>>> rs.get_antonyms(doc[7])  # lazy
[]

Parameters
  • data_dir (str or pathlib.Path) – Path to directory on disk under which resource data is stored, i.e. /path/to/data_dir/concept_net.

  • version ({"5.7.0", "5.6.0", "5.5.5"}) – Version string of the ConceptNet db to use. Since newer versions typically represent improvements over earlier versions, you’ll probably want “5.7.0” (the default value).

download(*, force=False)

Download resource data as a gzipped csv file, then save it to disk under the ConceptNet.data_dir directory.

Parameters

force (bool) – If True, download resource data, even if it already exists on disk; otherwise, don’t re-download the data.
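
The effect of force can be sketched independently of textacy. This is a minimal illustration, not the library's implementation: download_if_needed and fake_fetch are hypothetical names, and the fetch callable stands in for a real network request.

```python
import tempfile
from pathlib import Path
from typing import Callable


def download_if_needed(filepath: Path, fetch: Callable[[], bytes], force: bool = False) -> bool:
    """Hypothetical helper: write fetch() to filepath unless it exists and force is False.

    Returns True if data was (re)written, False if the download was skipped.
    """
    if filepath.is_file() and not force:
        return False
    filepath.parent.mkdir(parents=True, exist_ok=True)
    filepath.write_bytes(fetch())
    return True


def fake_fetch() -> bytes:
    # Stands in for a real HTTP request (assumption for illustration only).
    return b"placeholder bytes"


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "concept_net" / "data.csv.gz"
    print(download_if_needed(target, fake_fetch))              # True: nothing on disk yet
    print(download_if_needed(target, fake_fetch))              # False: already present
    print(download_if_needed(target, fake_fetch, force=True))  # True: forced re-download
```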

property filepath

Full path on disk for the ConceptNet gzipped csv file corresponding to the given ConceptNet.data_dir.

Type

str

property antonyms

Mapping of language code to term to sense to set of term’s antonyms – opposites of the term in some relevant way, like being at opposite ends of a scale or fundamentally similar but with a key difference between them – such as black <=> white or hot <=> cold. Note that this relationship is symmetric.

Based on the “/r/Antonym” relation in ConceptNet.

Type

Dict[str, Dict[str, Dict[str, List[str]]]]
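
To make the nested type concrete, here is a small sketch of a mapping in that shape and a lookup that falls back to an empty list when any level is missing. The data and the lookup helper are hypothetical and only illustrate the structure; they are not taken from the ConceptNet db or from textacy's code.

```python
from typing import Dict, List

# Documented shape: lang -> term -> sense -> related terms.
RelationMap = Dict[str, Dict[str, Dict[str, List[str]]]]

# Toy entries, made up for illustration.
toy_antonyms: RelationMap = {
    "en": {
        "hot": {"a": ["cold"]},
        "love": {"v": ["hate", "loathe"], "n": ["hatred"]},
    },
}


def lookup(mapping: RelationMap, term: str, lang: str, sense: str) -> List[str]:
    """Hypothetical helper: return related terms, or [] if any level is missing."""
    return mapping.get(lang, {}).get(term, {}).get(sense, [])


print(lookup(toy_antonyms, "love", "en", "v"))  # ['hate', 'loathe']
print(lookup(toy_antonyms, "lazy", "en", "a"))  # []
```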

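get_antonyms(term, *, lang=None, sense=None)

Parameters
  • term (str or spacy.tokens.Token or spacy.tokens.Span) – Term whose antonyms are to be retrieved. If a Token or Span is passed, lang and sense are inferred from it.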

  • lang (str) – Standard code for the language of term.

  • sense (str) – Sense in which term is used in context, which in practice is just its part of speech. Valid values: “n” or “NOUN”, “v” or “VERB”, “a” or “ADJ”, “r” or “ADV”.

Returns

List[str]
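
The sense parameter accepts either a single-letter code or the longer part-of-speech name. One way to treat both spellings uniformly is sketched below; normalize_sense is a hypothetical helper, and returning None for unsupported values is an assumption rather than documented textacy behavior.

```python
from typing import Optional

# Long-form POS names mapped to the single-letter senses listed above.
_POS_TO_SENSE = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}
_VALID_SENSES = set(_POS_TO_SENSE.values())


def normalize_sense(sense: str) -> Optional[str]:
    """Hypothetical helper: map 'NOUN' or 'n' to 'n', and so on; None if unsupported."""
    if sense in _VALID_SENSES:
        return sense
    return _POS_TO_SENSE.get(sense)


print(normalize_sense("NOUN"))  # n
print(normalize_sense("r"))     # r
print(normalize_sense("PRON"))  # None
```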

property hyponyms

Mapping of language code to term to sense to set of term’s hyponyms – subtypes or specific instances of the term – such as car => vehicle or Chicago => city. In each A => B pair, every A is a B.

Based on the “/r/IsA” relation in ConceptNet.

Type

Dict[str, Dict[str, Dict[str, List[str]]]]
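
ConceptNet identifies relations and concepts with URIs such as /r/IsA and /c/en/car/n, where the concept URI carries the language, the term, and optionally the sense. The sketch below splits such a URI into those parts. It is an illustration under that assumption about the URI layout, with hypothetical names (Concept, parse_concept_uri), and it does not reproduce how textacy parses the db.

```python
from typing import NamedTuple, Optional


class Concept(NamedTuple):
    lang: str
    term: str
    sense: Optional[str]


def parse_concept_uri(uri: str) -> Concept:
    """Split a concept URI such as '/c/en/marriage/n' into lang, term and sense."""
    parts = uri.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "c":
        raise ValueError(f"not a concept URI: {uri!r}")
    lang = parts[1]
    term = parts[2].replace("_", " ")  # multi-word terms are joined with underscores
    sense = parts[3] if len(parts) > 3 else None
    return Concept(lang, term, sense)


# One made-up edge in (relation, start, end) form.
relation, start, end = "/r/IsA", "/c/en/car/n", "/c/en/vehicle/n"
print(relation, parse_concept_uri(start), parse_concept_uri(end))
```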

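get_hyponyms(term, *, lang=None, sense=None)

Parameters
  • term (str or spacy.tokens.Token or spacy.tokens.Span) – Term whose hyponyms are to be retrieved. If a Token or Span is passed, lang and sense are inferred from it.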

  • lang (str) – Standard code for the language of term.

  • sense (str) – Sense in which term is used in context, which in practice is just its part of speech. Valid values: “n” or “NOUN”, “v” or “VERB”, “a” or “ADJ”, “r” or “ADV”.

Returns

List[str]

property meronyms

Mapping of language code to term to sense to set of term’s meronyms – parts of the term – such as gearshift => car.

Based on the “/r/PartOf” relation in ConceptNet.

Type

Dict[str, Dict[str, Dict[str, List[str]]]]

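get_meronyms(term, *, lang=None, sense=None)

Parameters
  • term (str or spacy.tokens.Token or spacy.tokens.Span) – Term whose meronyms are to be retrieved. If a Token or Span is passed, lang and sense are inferred from it.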

  • lang (str) – Standard code for the language of term.

  • sense (str) – Sense in which term is used in context, which in practice is just its part of speech. Valid values: “n” or “NOUN”, “v” or “VERB”, “a” or “ADJ”, “r” or “ADV”.

Returns

List[str]

property synonyms

Mapping of language code to term to sense to set of term’s synonyms – sufficiently similar concepts that they may be used interchangeably – such as sunlight <=> sunshine. Note that this relationship is symmetric.

Based on the “/r/Synonym” relation in ConceptNet.

Type

Dict[str, Dict[str, Dict[str, List[str]]]]
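
Because the relationship is symmetric, each pair contributes an entry in both directions. The following sketch shows that idea on a flat term-to-terms mapping; build_symmetric and the sample pairs are hypothetical, and the language and sense levels are left out for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_symmetric(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Hypothetical helper: index each pair under both of its members."""
    related = defaultdict(set)
    for a, b in pairs:
        related[a].add(b)
        related[b].add(a)
    # Sort so that the output order is deterministic.
    return {term: sorted(others) for term, others in related.items()}


toy_synonyms = build_symmetric(
    [("sunlight", "sunshine"), ("fast", "quick"), ("fast", "rapid")]
)
print(toy_synonyms["sunshine"])  # ['sunlight']
print(toy_synonyms["fast"])      # ['quick', 'rapid']
```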

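get_synonyms(term, *, lang=None, sense=None)

Parameters
  • term (str or spacy.tokens.Token or spacy.tokens.Span) – Term whose synonyms are to be retrieved. If a Token or Span is passed, lang and sense are inferred from it.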

  • lang (str) – Standard code for the language of term.

  • sense (str) – Sense in which term is used in context, which in practice is just its part of speech. Valid values: “n” or “NOUN”, “v” or “VERB”, “a” or “ADJ”, “r” or “ADV”.

Returns

List[str]

DepecheMood

DepecheMood is a high-quality and high-coverage emotion lexicon for English and Italian text, mapping individual terms to their emotional valences. These word-emotion weights are inferred from crowd-sourced datasets of emotionally tagged news articles (rappler.com for English, corriere.it for Italian).

English terms are assigned weights to eight emotions:

  • AFRAID

  • AMUSED

  • ANGRY

  • ANNOYED

  • DONT_CARE

  • HAPPY

  • INSPIRED

  • SAD

Italian terms are assigned weights to five emotions:

  • DIVERTITO (~amused)

  • INDIGNATO (~annoyed)

  • PREOCCUPATO (~afraid)

  • SODDISFATTO (~happy)

  • TRISTE (~sad)

class textacy.resources.depeche_mood.DepecheMood(data_dir=PosixPath('/home/travis/virtualenv/python3.7.1/lib/python3.7/site-packages/textacy/data/depeche_mood'), lang='en', word_rep='lemmapos', min_freq=3)

Interface to DepecheMood, an emotion lexicon for English and Italian text.

Download the data (one time only!), and save its contents to disk:

>>> rs = textacy.resources.DepecheMood(lang="en", word_rep="lemmapos")
>>> rs.download()
>>> rs.info
{'name': 'depeche_mood',
 'site_url': 'http://www.depechemood.eu',
 'publication_url': 'https://arxiv.org/abs/1810.03660',
 'description': 'A simple tool to analyze the emotions evoked by a text.'}

Access emotional valences for individual terms:

>>> rs.get_emotional_valence("disease#n")
{'AFRAID': 0.37093526222120465,
 'AMUSED': 0.06953745082761113,
 'ANGRY': 0.06979683067736414,
 'ANNOYED': 0.06465401081252636,
 'DONT_CARE': 0.07080580707440012,
 'HAPPY': 0.07537324330608403,
 'INSPIRED': 0.13394731320662606,
 'SAD': 0.14495008187418348}
>>> rs.get_emotional_valence("heal#v")
{'AFRAID': 0.060450319886187334,
 'AMUSED': 0.09284046387491741,
 'ANGRY': 0.06207816933776029,
 'ANNOYED': 0.10027622719958346,
 'DONT_CARE': 0.11259594401785,
 'HAPPY': 0.09946106491457314,
 'INSPIRED': 0.37794768332634626,
 'SAD': 0.09435012744278205}

When passing multiple terms in the form of a List[str] or Span or Doc, emotion weights are averaged over all terms for which weights are available:

>>> rs.get_emotional_valence(["disease#n", "heal#v"])
{'AFRAID': 0.215692791053696,
 'AMUSED': 0.08118895735126427,
 'ANGRY': 0.06593750000756221,
 'ANNOYED': 0.08246511900605491,
 'DONT_CARE': 0.09170087554612506,
 'HAPPY': 0.08741715411032858,
 'INSPIRED': 0.25594749826648616,
 'SAD': 0.11965010465848278}
>>> text = "The acting was sweet and amazing, but the plot was dumb and terrible."
>>> doc = textacy.make_spacy_doc(text, lang="en")
>>> rs.get_emotional_valence(doc)
{'AFRAID': 0.05272350876803627,
 'AMUSED': 0.13725054992595098,
 'ANGRY': 0.15787016147081184,
 'ANNOYED': 0.1398733360688608,
 'DONT_CARE': 0.14356943460620503,
 'HAPPY': 0.11923217912716871,
 'INSPIRED': 0.17880214720077342,
 'SAD': 0.07067868283219296}
>>> rs.get_emotional_valence(doc[0:6])  # the acting was sweet and amazing
{'AFRAID': 0.039790959333750785,
 'AMUSED': 0.1346884072825313,
 'ANGRY': 0.1373596223131593,
 'ANNOYED': 0.11391999698695347,
 'DONT_CARE': 0.1574819173485831,
 'HAPPY': 0.1552521762333925,
 'INSPIRED': 0.21232264216449326,
 'SAD': 0.049184278337136296}
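
The averaging over available terms can be reproduced by hand. The sketch below uses the AFRAID and INSPIRED weights shown above for "disease#n" and "heal#v" and skips a term that has no entry. average_valence is a hypothetical helper, and returning an empty dict when no term is found is an assumption, not documented behavior.

```python
from typing import Dict, Iterable

# Weights copied from the outputs above, limited to two emotions for brevity.
toy_weights: Dict[str, Dict[str, float]] = {
    "disease#n": {"AFRAID": 0.37093526222120465, "INSPIRED": 0.13394731320662606},
    "heal#v": {"AFRAID": 0.060450319886187334, "INSPIRED": 0.37794768332634626},
}


def average_valence(
    terms: Iterable[str], weights: Dict[str, Dict[str, float]]
) -> Dict[str, float]:
    """Hypothetical helper: mean weight per emotion over the terms found in weights."""
    found = [weights[term] for term in terms if term in weights]
    if not found:
        return {}  # assumption: nothing to average
    return {emotion: sum(w[emotion] for w in found) / len(found) for emotion in found[0]}


# "xyzzy#n" has no entry, so only the other two terms are averaged.
result = average_valence(["disease#n", "xyzzy#n", "heal#v"], toy_weights)
print(result)  # AFRAID is close to 0.2157 and INSPIRED close to 0.2559, as above
```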

For good measure, here’s how Italian looks when terms are not POS-tagged:

>>> rs = textacy.resources.DepecheMood(lang="it", word_rep="lemma")
>>> rs.get_emotional_valence("amore")
{'INDIGNATO': 0.11451408951814121,
 'PREOCCUPATO': 0.1323655108545536,
 'TRISTE': 0.18249663560400609,
 'DIVERTITO': 0.33558928569110086,
 'SODDISFATTO': 0.23503447833219815}

Parameters
  • data_dir (str or pathlib.Path) – Path to directory on disk under which resource data is stored, i.e. /path/to/data_dir/depeche_mood.

  • lang ({"en", "it"}) – Standard two-letter code for the language of terms for which emotional valences are to be retrieved.

  • word_rep ({"token", "lemma", "lemmapos"}) – Level of text processing used in computing terms’ emotion weights. “token” => tokenization only; “lemma” => tokenization and lemmatization; “lemmapos” => tokenization, lemmatization, and part-of-speech tagging.

  • min_freq (int) – Minimum number of times that a given term must have appeared in the source dataset for it to be included in the emotion weights dict. This can be used to remove noisy terms at the expense of reducing coverage. Researchers observed peak performance at 10, but anywhere between 1 and 20 is reasonable.
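
The trade-off controlled by min_freq can be illustrated with a toy filter. The terms and frequencies below are made up, and the assumption that each term comes with a frequency count in the source data is only for illustration; filter_by_min_freq is a hypothetical helper.

```python
from typing import List, Tuple

# (term, number of occurrences in the source dataset); values are made up.
toy_rows: List[Tuple[str, int]] = [
    ("disease#n", 120),
    ("heal#v", 45),
    ("quokka#n", 3),
    ("zyzzyva#n", 1),
]


def filter_by_min_freq(rows: List[Tuple[str, int]], min_freq: int) -> List[str]:
    """Hypothetical helper: keep terms that occur at least min_freq times."""
    return [term for term, freq in rows if freq >= min_freq]


print(filter_by_min_freq(toy_rows, 3))   # ['disease#n', 'heal#v', 'quokka#n']
print(filter_by_min_freq(toy_rows, 10))  # ['disease#n', 'heal#v']
```

A higher threshold drops rare, potentially noisy terms, at the cost of covering fewer words.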

property filepath

Full path on disk for the DepecheMood tsv file corresponding to the lang and word_rep.

Type

str

property weights

Mapping of term string (or term#POS, if DepecheMood.word_rep is “lemmapos”) to the terms’ normalized weights on a fixed set of affective dimensions (aka “emotions”).

Type

Dict[str, Dict[str, float]]

download(*, force=False)

Download resource data as a zip archive file, then save it to disk and extract its contents under the data_dir directory.

Parameters

force (bool) – If True, download the resource, even if it already exists on disk under data_dir.

get_emotional_valence(terms)

Get the average emotional valence over all of the given terms for which emotion weights are available.

Parameters

terms (str or Sequence[str], Token or Sequence[Token]) –

One or more terms over which to average emotional valences. Note that only nouns, adjectives, adverbs, and verbs are included.

Note

If the resource was initialized with word_rep="lemmapos", then string terms must have matching parts-of-speech appended to them like TERM#POS. Only “n” => noun, “v” => verb, “a” => adjective, and “r” => adverb are included in the data.

Returns

Mapping of emotion to average weight.

Return type

Dict[str, float]
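
With word_rep="lemmapos", string terms need the TERM#POS form described in the note above. A minimal sketch for building such keys from (lemma, part-of-speech) pairs follows; to_lemmapos is a hypothetical helper, the input pairs are made up, and the use of spaCy-style coarse POS names as input is an assumption.

```python
from typing import Iterable, List, Tuple

# Only these four parts of speech are included in the data, per the note above.
_POS_TO_TAG = {"NOUN": "n", "VERB": "v", "ADJ": "a", "ADV": "r"}


def to_lemmapos(pairs: Iterable[Tuple[str, str]]) -> List[str]:
    """Hypothetical helper: build TERM#POS keys, skipping unsupported parts of speech."""
    return [f"{lemma}#{_POS_TO_TAG[pos]}" for lemma, pos in pairs if pos in _POS_TO_TAG]


pairs = [("the", "DET"), ("disease", "NOUN"), ("heal", "VERB"), ("quickly", "ADV")]
print(to_lemmapos(pairs))  # ['disease#n', 'heal#v', 'quickly#r']
```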