SentiWordNet#
A comprehensive dictionary that assigns graded sentiment scores to synsets in WordNet.#
SentiWordNet Summary
Composition:
Approximately 117k n-gram synsets
Synset-level polarity, with positive and negative scores each ranging over [0,1] (note that the sentiment scores appear to be ordinal representations rather than truly continuous values).
Note that the processed dictionary provides approximately 23k terms with continuous scores ranging over [-1,1].
Creation Methodology:
SentiWordNet 1.0 (Esuli and Sebastiani, 2006): Employed a committee of eight ternary classifiers, each trained on different subsets derived from positive and negative “seed terms”. Ratings were assigned to WordNet synsets based on the classifiers’ decisions.
SentiWordNet 3.0 (Baccianella, Esuli and Sebastiani, 2010): Departed from the committee approach, adopting a “bag-of-synsets” representation and introducing a graph-based random walk procedure for sentiment scoring.
Evaluation: Esuli and Sebastiani (2006) provided an initial validation of SentiWordNet 1.0 by comparing it against the General Inquirer lexicon, demonstrating its potential utility but also noting challenges in directly evaluating accuracy due to the absence of a benchmark with manual word-level sentiment annotations.
Baccianella, Esuli and Sebastiani (2010) conducted a more rigorous evaluation of SentiWordNet 3.0 using Micro-WN(Op)-3.0, an automatically mapped version of the Micro-WN(Op) dataset originally compiled by Cerini et al. (2007). Micro-WN(Op) contains 1,105 WordNet synsets which were manually annotated for degrees of positivity, negativity and neutrality by five human coders. To evaluate SentiWordNet 3.0, Baccianella, Esuli and Sebastiani (2010) tested how well it could predict the polarity ratings (positivity and negativity values) of synsets in Micro-WN(Op)-3.0. They computed the ranking correlation between the gold standard Micro-WN(Op)-3.0 rankings and the SentiWordNet 3.0 predicted rankings using p-normalised Kendall’s tau. In comparison to SentiWordNet 1.0, version 3.0 demonstrated substantial improvements in correlation (19.48% relative gain for positivity and 21.96% for negativity).
Usage Guidance: A comprehensive dictionary offering synset-level sentiment scores. Ideal as a semantic foundation for contextual sentiment analysis, acknowledging the multifaceted nature of sentiment. Access the processed dictionaries via sentibank.archive.load().dict("SentiWordNet_v2010_simple") for a dictionary that only includes strictly positive and negative terms, or sentibank.archive.load().dict("SentiWordNet_v2010_logtransform") for a dictionary that also contains ambiguous terms.
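As a minimal sketch of how the paired [0,1] synset scores described above might collapse into the single [-1,1] term scores of the processed dictionaries, consider the difference of the two scores averaged across a term's synsets. The `polarity` and `aggregate_term` helpers are hypothetical illustrations; the actual sentibank processing (including its log transform) may differ.

```python
# Hypothetical sketch: collapsing SentiWordNet's paired positive/negative
# synset scores (each in [0, 1]) into one continuous term score in [-1, 1].

def polarity(pos_score: float, neg_score: float) -> float:
    """Collapse a (PosScore, NegScore) pair into one value in [-1, 1]."""
    return pos_score - neg_score

def aggregate_term(synset_scores: list[tuple[float, float]]) -> float:
    """Average the polarity across all synsets of a term."""
    return sum(polarity(p, n) for p, n in synset_scores) / len(synset_scores)

# A term whose two synsets were scored (0.75, 0.0) and (0.5, 0.25):
print(aggregate_term([(0.75, 0.0), (0.5, 0.25)]))  # 0.5
```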
Introduction#
SentiWordNet (Esuli and Sebastiani, 2006; Baccianella, Esuli and Sebastiani, 2010) is a lexicon that annotates English words from WordNet with “graded” sentiment scores indicating how objective, positive, and negative they are. Note that the term “graded” aligns with “valence” in modern sentiment analysis research. The lexicon, evolving from the original 2006 version (SentiWordNet 1.0) to the improved SentiWordNet 3.0, recognises that terms can possess both positive and negative polarities to varying degrees. In this overview, we trace the evolution of SentiWordNet, emphasising the key methodological differences in scoring word senses between the two versions.
Original Dictionary#
ver.2006#
SentiWordNet (ver.1.0) assigned three sentiment scores ranging over [0,1] to each WordNet (ver.2.0) synset: (i) an Objective score (Obj) describing how objective the terms in the synset are; (ii) a Positive score (Pos) describing how positive the terms are; and (iii) a Negative score (Neg) describing how negative the terms are. These scores were derived by combining the results produced by a “committee” of 8 “ternary classifiers” (Esuli and Sebastiani, 2006, p.418): in cases where all classifiers unanimously assigned the same label to a synset, that label received the maximum score; otherwise, each label’s score was proportional to the number of classifiers that assigned it.
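The committee's scoring rule can be sketched as a vote count; `committee_scores` is a hypothetical helper written for illustration, not code from the original work.

```python
from collections import Counter

def committee_scores(labels: list[str]) -> dict[str, float]:
    """Combine the decisions of a classifier committee into graded scores.

    Each label's score is the fraction of classifiers that assigned it,
    so a unanimous label receives the maximum score of 1.0, and the
    three scores always sum to 1.
    """
    counts = Counter(labels)
    n = len(labels)
    return {lab: counts.get(lab, 0) / n
            for lab in ("Positive", "Negative", "Objective")}

# 6 of the 8 classifiers say Positive, 2 say Objective:
votes = ["Positive"] * 6 + ["Objective"] * 2
print(committee_scores(votes))
# {'Positive': 0.75, 'Negative': 0.0, 'Objective': 0.25}
```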
Classifiers, distinguished by the training set used (k = 0, 2, 4, 6) and by the machine learning algorithm (SVM versus Rocchio), followed these steps: 1. identification of Positive (LPos) and Negative (LNeg) seeds; 2. expansion of the seeds with WordNet relations to create training datasets (TrObj, TrPos, TrNeg); and 3. training of machine-learning models.
Identification of Positive and Negative Seeds
Two subsets, LPos and LNeg, were first obtained from the seed terms proposed in Turney and Littman (2003). 47 Positive and 58 Negative synsets remained after removing irrelevant WordNet synsets (e.g., for the term “nice”, the authors removed the synset referring to the French city of Nice).
Expanding Seed to produce Training datasets
LPos and LNeg were iteratively expanded for k iterations, generating four training datasets (Trk=0, Trk=2, Trk=4, and Trk=6). Each Trk (for k = 0, 2, 4, 6) comprised Positive (TrkPos), Negative (TrkNeg), and Objective (TrkObj) subsets. At each iteration, the seed sets were expanded using WordNet lexical relations that preserved affective meaning, mirroring the approach taken by WordNet-Affect in expanding their affective core dictionary (Strapparava and Valitutti, 2004; Valitutti, Strapparava and Stock, 2004). For instance, all the synsets of LPos with WordNet relations such as “also-see” were added to TrkPos and those with WordNet relations such as “direct-antonymy” were added to TrkNeg.
Note that TrkObj was consistent across all four datasets; it was heuristically defined as synsets not belonging to TrkPos or TrkNeg and containing terms not marked as Positive or Negative in the Harvard General Inquirer lexicon (p.419). The resulting TrkObj comprised 17,530 synsets.
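The TrkObj heuristic amounts to a set difference followed by a lexicon filter. A toy sketch, assuming synsets are keyed by WordNet-style identifiers (all names here are illustrative):

```python
def objective_training_set(synsets: dict[str, set[str]],
                           tr_pos: set[str],
                           tr_neg: set[str],
                           gi_marked: set[str]) -> set[str]:
    """Heuristic sketch of TrObj: synsets outside TrPos and TrNeg
    whose member terms carry no Positive/Negative tag in the
    General Inquirer lexicon (gi_marked)."""
    return {sid for sid, terms in synsets.items()
            if sid not in tr_pos and sid not in tr_neg
            and not (terms & gi_marked)}

# Toy data: identifiers and lexicon tags are made up for illustration.
toy = {
    "table.n.01": {"table"},
    "good.a.01": {"good"},
    "bad.a.01": {"bad"},
}
print(objective_training_set(toy, tr_pos={"good.a.01"}, tr_neg=set(),
                             gi_marked={"good", "bad"}))  # {'table.n.01'}
```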
Train machine-learning algorithms
Each term was given a vector representation based on its “glosses”, the textual definitions in WordNet. A textual representation was generated by collating all the glosses of a term in WordNet, so a term with multiple senses (associated with multiple synsets) contributes each sense to the representation. The collation was then converted into vectorial form by cosine-normalised TF-IDF.
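The gloss-to-vector step can be sketched in plain Python as follows; this is a simplified, self-contained illustration, and the paper's actual tokenisation and weighting details may differ.

```python
import math
from collections import Counter

def tfidf_vectors(gloss_texts: dict[str, str]) -> dict[str, dict[str, float]]:
    """Cosine-normalised TF-IDF vectors over collated gloss texts.

    gloss_texts maps each term to the concatenation of all its WordNet
    glosses (one string per term, covering every sense of that term)."""
    docs = {term: text.lower().split() for term, text in gloss_texts.items()}
    n_docs = len(docs)
    df = Counter()                       # document frequency per token
    for tokens in docs.values():
        df.update(set(tokens))
    vectors = {}
    for term, tokens in docs.items():
        tf = Counter(tokens)
        vec = {w: tf[w] * math.log(n_docs / df[w]) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors[term] = {w: v / norm for w, v in vec.items()}  # unit length
    return vectors

# Made-up gloss collations for two terms:
glosses = {
    "happy": "feeling or showing pleasure; marked by good fortune",
    "sad":   "experiencing sorrow or unhappiness; bad fortune",
}
vecs = tfidf_vectors(glosses)
```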
The “ternary classifier” assigned terms to Positive, Negative, or Objective based on two binary classifiers. The first classifier discerned Positive from not Positive: it was trained with the dataset TrkPos for Positive instances and the combination of TrkNeg and TrkObj for instances labelled not Positive. The second classifier discriminated between Negative and not Negative: it was trained with TrkNeg for Negative instances and the combination of TrkPos and TrkObj for instances labelled not Negative. The final classification was determined from the outcomes of both classifiers, as summarised in the table below.

Classifier 1 | Classifier 2 | Final Classification |
---|---|---|
Positive | not Negative | Positive |
not Positive | Negative | Negative |
not Positive | not Negative | Objective |
Positive | Negative | Objective |
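The combination of the two binary decisions can be sketched as a small function, assuming (as one reading of the procedure) that conflicting or doubly-negative decisions fall back to Objective:

```python
def ternary_label(is_positive: bool, is_negative: bool) -> str:
    """Combine the two binary classifiers into one ternary decision.

    A term is Positive only if the Positive/not-Positive classifier
    fires while the Negative/not-Negative one does not (and vice versa
    for Negative); all other cases fall back to Objective."""
    if is_positive and not is_negative:
        return "Positive"
    if is_negative and not is_positive:
        return "Negative"
    return "Objective"

print(ternary_label(True, False))   # Positive
print(ternary_label(False, False))  # Objective
```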
Each of the three scores Pos, Neg, and Obj for a term ranges over [0,1] based on the results of the 8 ternary classifiers.
ver.2010#
SentiWordNet 3.0 takes a departure from its predecessor with two notable methods: 1. Representing a term using a “bag-of-synsets” instead of a “bag-of-words”; and 2. Calculating the sentiment score using graph-based random walk models on WordNet, departing from the classifier committee used previously.
In SentiWordNet 3.0, Baccianella, Esuli and Sebastiani (2010) leverage manually disambiguated glosses obtained from the Princeton WordNet Gloss Corpus. Unlike the previous version (1.0) that utilised a “bag-of-words” model, SentiWordNet 3.0 represents glosses as a sequence of WordNet synsets. The term “manually disambiguated” signifies the effort to resolve ambiguity in gloss interpretation, particularly when a term has multiple senses.
For clarity, consider the transformation of a term representation: instead of a bag-of-words like ["word 1", "word 2", …, "word N"], SentiWordNet 3.0 adopts a bag-of-synsets like ["synset 1", "synset 2", …, "synset N"]. This shift to a sequence of WordNet synsets allows SentiWordNet 3.0 to capture nuanced meanings associated with different senses of a word in its gloss, providing a more sophisticated and contextually rich representation compared to the simpler bag-of-words model used in SentiWordNet 1.0.
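To make the contrast concrete, here is a short gloss in both representations; the gloss and the sense-tagged synset identifiers are made up for illustration.

```python
# A hypothetical gloss, first as a bag-of-words:
gloss = "having a quick mind"
bag_of_words = gloss.split()
# ['having', 'a', 'quick', 'mind'] — no indication of WHICH sense of
# 'quick' or 'mind' is meant.

# After manual disambiguation, each content word maps to a specific
# WordNet sense, so the gloss becomes a bag of synset identifiers:
bag_of_synsets = ["have.v.01", "quick.a.02", "mind.n.01"]
```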
The âbag-of-synsetsâ representation facilitated modelling WordNet as a graph, enabling a new sentiment scoring approach. SentiWordNet 3.0 introduced a graph-based random walk procedure, by revising the PageRank algorithm (for detailed discussion, see Esuli and Sebastiani, 2007). It views WordNet as a directed graph, with synsets serving as nodes and edges connecting synsets based on their appearance in the textual definitions (glosses) of each other. A graph-based random walk procedure is then employed, allowing sentiment to dynamically âflowâ through the WordNet graph.
The random walk iteratively propagates scores through the WordNet graph until convergence, leveraging the graph's inherent structure. This contrasts with the earlier iterative seed-expansion method: propagating scores in a context-aware manner improved accuracy over the previous approach.
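A toy, simplified sketch of PageRank-style propagation on a synset graph follows; the actual algorithm of Esuli and Sebastiani (2007) uses a different flow formulation and weighting, so this only illustrates the idea of iterating scores to convergence.

```python
def propagate(graph: dict, seeds: dict, damping: float = 0.85,
              max_iter: int = 100, tol: float = 1e-8) -> dict:
    """Propagate seed sentiment mass through a synset graph until the
    scores stabilise. graph maps each synset to the synsets it links to
    (e.g. synsets appearing in its gloss); seeds holds the initial mass."""
    nodes = list(graph)
    current = {n: seeds.get(n, 0.0) for n in nodes}
    for _ in range(max_iter):
        nxt = {}
        for node in nodes:
            # Mass flowing in from every synset that links to this one.
            inflow = sum(current[src] / len(graph[src])
                         for src in nodes if node in graph[src])
            nxt[node] = (1 - damping) * seeds.get(node, 0.0) + damping * inflow
        converged = all(abs(nxt[n] - current[n]) < tol for n in nodes)
        current = nxt
        if converged:
            break
    return current

# Toy graph (identifiers made up): "good" and "great" mention each other
# in their glosses; "table" is isolated, so no sentiment reaches it.
graph = {
    "good.a.01": {"great.a.01"},
    "great.a.01": {"good.a.01"},
    "table.n.01": set(),
}
scores = propagate(graph, {"good.a.01": 1.0})
```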
The dataset includes 67,176 nouns, 14,004 adjectives, 7,440 verbs, and 3,050 adverbs, each with Pos and Neg scores in the range [0,1]. Notably, 3,047 nouns, 1,947 adjectives, 1,381 verbs, and 225 adverbs have duplicates.
from sentibank import archive
load = archive.load()
SentiWordNet = load.origin("SentiWordNet_v2010")
POS | ID | PosScore | NegScore | SynsetTerms | Gloss |
---|---|---|---|---|---|