VADER#

A gold-standard lexicon optimised for social media 📱 sentiment analysis.#

Summary

Composition:

  • Approximately 7.5k unigram lexicon entries (4.2k negative and 3.3k positive)

  • Includes acronyms, emoticons, slang, initialisms

  • Continuous valence scores ranging from -4 to +4

Creation Methodology:

  • Constructed a list by examining existing well-established sentiment dictionaries (LIWC, ANEW, and General Inquirer).

  • Manual ratings of over 9,000 candidate terms by 10 independent annotators (90,000+ ratings in total), on a scale from -4 to +4, via Amazon Mechanical Turk

  • If a worker's rating of a pre-validated item fell more than one standard deviation from its known mean, all of that worker's ratings in the batch were discarded.

Evaluation: Hutto and Gilbert (2014) employed two validation approaches:

  1. Connotative Validity: Using a pool of over 9,000 candidate terms sourced from established sentiment dictionaries (LIWC, ANEW, and General Inquirer) and microblog-specific features, the authors collected manual ratings from 10 independent annotators via Amazon Mechanical Turk (90,000+ ratings in total), with scores assigned on a scale from -4 to +4. If a worker's rating of a pre-validated item fell more than one standard deviation from its known mean, all of that worker's ratings in that batch were discarded.

  2. Benchmark Validity: VADER was benchmarked against seven established sentiment dictionaries across four datasets (Social Media, Product Review, Movie Review, and News Editorial). On social media, VADER (F1=0.96) even outperformed individual human raters (F1=0.84), and it consistently outperformed all seven dictionaries on the three other benchmark datasets as well (Hutto and Gilbert, 2014, pp. 223-224).

Usage Guidance: An excellent general-purpose lexicon for social media, and useful for informal language more broadly. Pair it with domain-specific lexica for improved coverage. Access the processed dictionary via sentibank.archive.load().dict("VADER_v2014").
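A minimal sketch of that access path (assuming the processed dictionary behaves like a plain token-to-score mapping):

from sentibank import archive

# Load the processed (cleaned) VADER dictionary from the sentibank archive
vader = archive.load().dict("VADER_v2014")

# Hypothetical lookup: mean valence score of a slang entry
print(vader.get("lol"))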

📋 Introduction#

VADER stands for Valence Aware Dictionary and sEntiment Reasoner (Hutto and Gilbert 2014). It is a rule-based sentiment analysis algorithm, aimed particularly at social media texts, that uses a lexicon sensitive to both sentiment polarity and intensity (valence).
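For a quick feel of the algorithm, the reference implementation is distributed on PyPI as vaderSentiment; the sketch below uses that package's API:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# polarity_scores() returns 'neg', 'neu' and 'pos' proportions plus a
# normalised 'compound' valence score in [-1, 1]
print(analyzer.polarity_scores("The movie was SO good!! :-)"))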

Although it was designed for scoring sentiment in social media texts, VADER achieved the highest F1 scores in all four genres – namely Social Media Text (Tweets), Movie Reviews, Amazon Product Reviews, and NY Times Editorials – compared with seven other dictionary-based methods. Its performance was particularly strong on the social media genre, where it even outperformed individual human raters (see Table 4, Hutto and Gilbert 2014, p.223).

📚 Original Dictionary#

The construction and validation of the VADER dictionary (v2014) involved the following steps.

  1. Construct a sentiment word-bank inspired by established sources (LIWC, ANEW, and Harvard_GI).

  2. Incorporate 9,000 lexical feature candidates common in microblog sentiment expression, including: (i) Western-style emoticons (e.g. ‘D=’) from Wikipedia; (ii) sentiment-related acronyms and initialisms (e.g. ‘LOL’) from Wikipedia; and (iii) commonly used slang from Internet Slang.

  3. Assess the general applicability of each candidate with the Wisdom-of-the-Crowd (WotC) approach on Amazon Mechanical Turk[1]. Ten independent raters (or ‘turkers’) rated each entry on a scale from -4 (‘Extremely Negative’) to +4 (‘Extremely Positive’).

To produce reliable Wisdom-of-the-Crowd sentiment intensity ratings, Hutto and Gilbert (2014) implemented several quality control processes. To name a few: (i) raters were prescreened for English comprehension; (ii) raters had to complete sentiment rating training and match known ratings with 90% accuracy; (iii) batches contained “golden items” to check rating consistency; and (iv) bonus incentives were given for matching group means (for further details, see Hutto and Gilbert 2014, pp.219-221).
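A minimal sketch of that batch-screening rule (the helper and data shapes are our own assumptions, not the authors' code):

from statistics import mean, stdev

def keep_batch(worker_gold, gold_dist):
    """Keep a worker's batch only if each of their 'golden item' ratings
    falls within one standard deviation of that item's known mean."""
    for term, rating in worker_gold.items():
        known = gold_dist[term]                       # pre-validated ratings
        if abs(rating - mean(known)) > stdev(known):
            return False                              # deviant rating: discard the batch
    return True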

Once all candidates were rated on the [-4, 4] scale, Hutto and Gilbert (2014) calculated summary statistics from the aggregate ratings of the ten independent raters. The final lexicon consisted of 7,517 entries with a non-zero mean valence score and a standard deviation below 2.5.
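That retention rule can be paraphrased as follows (an illustrative sketch, assuming ten raw ratings per term):

from statistics import mean, stdev

def retain(candidate_ratings):
    """Keep entries with a non-zero mean valence and a standard
    deviation below 2.5, as described above."""
    lexicon = {}
    for term, ratings in candidate_ratings.items():
        m, sd = mean(ratings), stdev(ratings)
        if m != 0 and sd < 2.5:
            lexicon[term] = round(m, 1)
    return lexicon

print(retain({"lol": [2, 3, 2, 2, 3, 2, 2, 3, 2, 2]}))  # {'lol': 2.3}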

from sentibank import archive

# Load the archive index, then retrieve the original (unprocessed) lexicon
load = archive.load()
vader = load.origin("VADER_v2014")
(Interactive preview of the lexicon omitted; its columns are SentimentExpression, mean, std, and sample.)

🧹 Processed Dictionary#

First-Pass Processing#

The original VADER lexicon contained 15 duplicate entries. For most duplicates, such as “fav” and “lmao”, we averaged the duplicate scores. However, the entries “d:” and “d=” had contradictory valence ratings:

  • “d:” had scores of 1.2 (positive) and -2.9 (negative)

  • “d=” had scores of 1.5 (positive) and -3.0 (negative)

Since both terms lean toward negative sentiment polarity, we removed the positive scores for “d:” (“D:”) and “d=” (“D=”) from the final lexicon. In general, averaging the duplicate ratings created a unified score. But with clear polarity conflicts like “d:” and “d=”, we deferred to the predominant negative rating based on the term’s apparent sentiment leaning.
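A small sketch of this first-pass rule (our own illustration, not sentibank's internal code):

from collections import defaultdict

def resolve_duplicates(entries):
    """Average duplicate scores; on a polarity conflict, defer to the
    predominant negative rating."""
    grouped = defaultdict(list)
    for token, score in entries:
        grouped[token].append(score)
    resolved = {}
    for token, scores in grouped.items():
        if min(scores) < 0 < max(scores):   # polarity conflict, e.g. 'd:'
            resolved[token] = min(scores)   # defer to the negative rating
        else:
            resolved[token] = sum(scores) / len(scores)
    return resolved

print(resolve_duplicates([("d:", 1.2), ("d:", -2.9), ("fav", 2.0), ("fav", 2.4)]))
# {'d:': -2.9, 'fav': 2.2}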

We also removed redundant uppercase emoticons and substituted two uncommon variants with their lowercase forms. Since VADER lowercases each token before comparing it against the lexicon, this keeps the dictionary consistent with how lookups actually happen:

  • We removed 7 uppercase emoticons (“-:O”, “(:O”, “:-D”, “:D”, “;D”, “=-D”, “=D”) since their lowercase versions (“:-o”, “(:o”, “:-d”, “:d”, “;d”, “=-d”, “=d”) already existed.

  • We substituted “:-Þ” and “:Þ” with their lowercase forms “:-þ” and “:þ”.

As a result, the first-pass processed VADER dictionary has 7,495 entries instead of 7,517.
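A minimal sketch of the case-normalisation step (the function name and sample data are our own illustration):

def normalise_case(lexicon):
    """Lowercase every key; when an uppercase emoticon's lowercase form
    already exists as a separate entry, drop the uppercase duplicate."""
    out = {}
    for token, score in lexicon.items():
        key = token.lower()
        if key != token and key in lexicon:
            continue            # lowercase entry already exists: drop duplicate
        out[key] = score        # store under the lowercase key
    return out

print(normalise_case({":-D": 2.3, ":-d": 2.3, ":-Þ": 1.5}))
# {':-d': 2.3, ':-þ': 1.5}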

Note

We are performing an ongoing quality check focused on abbreviations, slang, acronyms, and initialisms in the lexicon. For example, in future versions we may remove terms like “aug-00” that do not qualify as accepted forms of shorthand.

Contributions to this preliminary lexicon refinement are welcome. If you would like to provide suggestions, please open an issue here.

Note#

[1] Amazon Mechanical Turk (AMT) is a micro-labor website enabling researchers to outsource human intelligence tasks (HITs), such as image annotation, surveys, and data validation. To maintain data quality, Hutto and Gilbert (2014) implemented quality control processes (see 3.1.1 Screening, Training, Selecting, and Data Quality Checking Crowd-Sourced Evaluations and Validations, Hutto and Gilbert 2014, p.220).