MASTER
Sentiment expressions common in financial regulatory filings, such as 10-K.
Summary
Composition:

- Approximately 3.8k unigram lexicon entries
- Financial domain-specific business phrases and jargon
- Multi-Class Multi-Label (Negative, Positive, Uncertainty, Litigious, Strong_Modal, Weak_Modal, and Constraining)

Creation Methodology:

- Lexicon candidates were based on tokens that appeared in at least 5% of sampled 10-K filings
- The tokens were then manually labelled by the authors
Evaluation: Loughran and McDonald (2011) tested the significance of the MASTER dictionary by examining the market's reaction at the time of a 10-K filing: "If tone matters, firms filing 10-Ks with a high measure of negative words should, on average, experience negative excess returns around the filing date" (p.18). Whereas the Harvard GI's TagNeg dictionary (H4N) showed no significant association with filing-date excess returns (t = -1.35, below the 95% confidence level), MASTER's Negative dictionary showed a statistically significant negative correlation (t = -2.64, above the 99% confidence level).
Bodnaruk, Loughran and McDonald (2015) investigated how well "different measures are able to predict future developments associated with the deterioration or improvement of external financing conditions" (p.16). Compared to other "traditional measures of financial constraints", MASTER's Constraining dictionary was the only measure that predicted all four directions of ex-post liquidity events at the 1% significance level (p.17, 40).
Usage Guidance: Useful for analysing finance publications, earnings calls, and annual reports. Access the processed dictionary via `sentibank.archive.load().dict("MASTER_v2022")`.
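The multi-label design noted above means a single token can carry several labels at once. The following minimal sketch tallies label counts over a toy subset of MASTER-style entries; the word-to-label mapping mirrors examples quoted later on this page, but the tokenizer and tallying logic are purely illustrative and are not sentibank's implementation:

```python
# Toy subset of MASTER-style entries. "impairment" carries two labels
# (it appears among both the top Negative and top Constraining words).
TOY_MASTER = {
    "loss": {"Negative"},
    "impairment": {"Negative", "Constraining"},
    "could": {"Weak_Modal"},
    "uncertain": {"Uncertainty"},
    "will": {"Strong_Modal"},
    "profitable": {"Positive"},
}

def label_counts(text):
    """Count how often each sentiment label fires in a text."""
    counts = {}
    for token in text.lower().split():
        token = token.strip(".,;:()\"'")
        for label in TOY_MASTER.get(token, ()):
            counts[label] = counts.get(label, 0) + 1
    return counts

filing = "The impairment could result in a material loss."
print(label_counts(filing))  # Negative: 2, Constraining: 1, Weak_Modal: 1
```

Because labels are not mutually exclusive, downstream analyses typically work with per-label counts or proportions rather than a single sentiment score.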
Introduction
While a sentiment dictionary designed for the general domain may be useful, "English words have many meanings, and a word categorisation scheme derived for one discipline might not translate effectively into a discipline with its own dialect" (Loughran and McDonald, 2011, p.1). Such generality can lead to Type-I errors due to sentiment misclassification in a different context. For example, Loughran and McDonald (2011) found that 73.8% of the negative word counts in the Harvard Psychosociological TagNeg dictionary (H4N) come from words that are not typically negative in a financial context.
To address this limitation, Loughran and McDonald (2011) created a finance-specific dictionary called MASTER. The dictionary is regularly updated by the authors and was last updated in 2022.
Original Dictionary
The master dictionary originally started with six labels (Negative, Positive, Uncertainty, Litigious, Strong_Modal, and Weak_Modal) and was later extended by Bodnaruk, Loughran and McDonald (2015) with the addition of a Constraining label. Note that the Litigious and Constraining lexicons capture a similar "tone": according to Bodnaruk, Loughran and McDonald (2015), the frequency of Constraining lexicons was positively correlated with the frequency of Litigious lexicons in the 10-K samples.
ver.2011
The sentiment lexicons of the MASTER dictionary (ver.2011) come from a "relatively exhaustive list" of tokens that occurred in at least 5% of 10-K documents between 1994-2008 (p.12). From the SEC's EDGAR, Loughran and McDonald (2011) collected 121,217 firm-year 10-K/10-K405 samples, which were later filtered down to 50,115 samples covering 8,341 unique firms.
| Filter | Sample Size | Dropped |
|---|---|---|
| EDGAR 10-K / 10-K405 1994-2008 complete sample (excluding duplicates) | 121,217 | |
| Include only first filing in a given year | 120,290 | 927 |
| At least 180 days between a given firm's 10-K filings | 120,074 | 216 |
| CRSP PERMNO match | 75,252 | 44,822 |
| Reported on CRSP as an ordinary common equity firm | 70,061 | 5,191 |
| CRSP market capitalization data available | 64,227 | 5,834 |
| Price on filing date day minus one > $3 | 55,946 | 8,281 |
| Returns and volume for day 0-3 event period | 55,630 | 316 |
| NYSE, AMEX, or Nasdaq exchange listing | 55,612 | 18 |
| At least 60 days of returns and volume in year prior to and following file date | 55,038 | 574 |
| Book-to-market COMPUSTAT data available and book value > 0 | 50,268 | 4,770 |
| Number of words in 10-K > 2,000 | 50,115 | 153 |
The collection of lexicons was further extended by accounting for inflections (e.g. for the token "accident", its inflections "accidental", "accidentally" and "accidents" were added) using ver.4.0 of the 2of12inf dictionary developed as part of SCOWL [1],[2]. Note that Loughran and McDonald (2011) used explicit inflections instead of stemming because "if the focus is on tone, using explicit inflections is less error prone than extending a word using stemming (root morpheme + derivational morphemes)" (Software Repository of Accounting and Finance, n.d.).
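The expansion step can be sketched as a lookup against a 2of12inf-style word list. This is a deliberate simplification: the tiny word list and the suffix-matching rule below are illustrative stand-ins, and real inflection handling in the authors' pipeline is more involved than suffix matching:

```python
# Toy stand-in for the 2of12inf word list (the real list has ~81,000 words).
WORD_LIST = {"accident", "accidental", "accidentally", "accidents",
             "accidence", "accord", "accords"}

# Simplified rule: accept words that extend the base token by a
# common inflectional/derivational suffix.
SUFFIXES = ("s", "es", "ed", "ing", "al", "ally", "ly")

def expand_inflections(base, word_list=WORD_LIST):
    """Return word-list entries that look like inflections of `base`."""
    return sorted(w for w in word_list
                  if w != base and w.startswith(base)
                  and w[len(base):] in SUFFIXES)

print(expand_inflections("accident"))
# ['accidental', 'accidentally', 'accidents']
```

Note how "accidence" is excluded even though it shares the prefix: the leftover "ce" is not an inflectional suffix, which is the kind of distinction that makes an explicit word list less error-prone than blind stemming.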
In total, 3,752 lexicons were collected, and Loughran and McDonald (2011) labelled them under 6 categories: Negative, Positive, Uncertainty, Litigious, Strong_Modal and Weak_Modal. There were 2,337 Negative lexicons (e.g. "felony", "litigation", "restated", "misstatement", "unanticipated"), but substantially fewer, 353, Positive lexicons (e.g. "achieve", "attain", "efficient", "improve", "profitable"). The top 5 most frequently occurring Negative lexicons in the 10-K samples were "loss", "losses", "impairment", "against" and "adverse".
There were 285 lexicons categorised as Uncertainty (e.g. "approximate", "contingency", "depend", "fluctuate", "indefinite", "uncertain", and "variability"), emphasising the general notion of imprecision rather than exclusively focusing on risk. Additionally, there were 731 lexicons categorised as Litigious (e.g. "claimant", "deposition", "interlocutory", "testimony", and "tort"), reflecting a propensity for legal contest. Words like "legislation" and "regulation" were also included; they do not necessarily imply a legal contest but may indicate a more litigious environment. It is important to note that many lexicons overlap between the Negative, Uncertainty, and Litigious categories.
Loughran and McDonald (2011) expanded Jordan's (1999, cited in Loughran and McDonald, 2011, p.14) categories of strong and weak modal words to include other terms expressing levels of confidence. There were 19 lexicons categorised as Strong_Modal (e.g. "always", "highest", "must", and "will"), and 27 lexicons categorised as Weak_Modal (e.g. "could", "depending", "might", and "possibly").
ver.2015
On top of ver.2011 of the master dictionary, Bodnaruk, Loughran and McDonald (2015) collected 184 lexicons that capture whether or not a firm is financially constrained (Constraining). Similar to Loughran and McDonald (2011), the collection was drawn from tokens that appeared in at least 5% of 10-K filings between 1996-2011. The original collection had 183,214 firm-year samples, which was filtered down to 51,533.
| Filter | Sample Size | Dropped |
|---|---|---|
| SEC 10-K files 1996-2011 | 183,214 | |
| Drop financial firms and utilities | 133,992 | 49,222 |
| Eliminate duplicates within year/CIK | 130,450 | 3,542 |
| Drop if file date < 180 days from prior | 129,986 | 464 |
| CRSP PERMNO match and ordinary common equity | 59,177 | 70,809 |
| Drop if number of 10-K words is < 2,000 | 59,137 | 40 |
| Drop if required Compustat data is missing | 55,530 | 3,607 |
| Market capitalization data available on CRSP | 51,533 | 3,997 |
Only seven lexicons ("required", "obligations", "requirements", "require", "impairment", "obligation", and "requires") account for more than half of all the counts of constraining words appearing in 10-Ks. Appendix C (p.32) contains the entire list of Constraining lexicons, and Table 3 (p.39) reports the 50 most frequently occurring Constraining lexicons.
ver.2022
The Master dictionary in SentiBank is the most up-to-date version (ver.2022) maintained by Loughran and McDonald. While pre-2018 versions generally did not include abbreviations, post-2018 versions include a limited number of them.
```python
from sentibank import archive

load = archive.load()
master = load.origin("MASTER_v2022")
```
The original CSV contains the following columns: Word, Seq_num, Word Count, Word Proportion, Average Proportion, Std Dev, Doc Count, Negative, Positive, Uncertainty, Litigious, Strong_Modal, Weak_Modal, Constraining, Syllables, and Source.
Processed Dictionary
From the original CSV, 8 columns irrelevant to sentiment were programmatically removed[3]. Rows lacking any sentiment label were then dropped. As a result, the corpus of 86,531 entries was distilled into a lexicon of 3,876 domain-specific affect terms. No further notable modifications or removals were made to the lexicons.
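The processing described above can be sketched on a toy CSV that uses a subset of the original column layout. The sentibank processing code itself may differ; this sketch only shows the two operations named here, dropping non-sentiment columns and removing rows with no label (in the released Loughran-McDonald file, a nonzero label-column entry marks membership, recording the year the word was added):

```python
import csv
import io

# Toy CSV with a subset of the original columns.
RAW = """Word,Seq_num,Word Count,Negative,Positive,Uncertainty,Syllables
LOSS,100,5000,2009,0,0,1
AND,5,99999,0,0,0,1
PROFITABLE,200,800,0,2009,0,3
"""

# Subset of the eight non-sentiment columns removed during processing.
DROP = {"Seq_num", "Word Count", "Syllables"}

processed = []
for row in csv.DictReader(io.StringIO(RAW)):
    slim = {k: v for k, v in row.items() if k not in DROP}
    # Nonzero label columns mark membership in that sentiment category.
    labels = [k for k, v in slim.items() if k != "Word" and int(v) > 0]
    if labels:  # drop rows lacking any sentiment label
        processed.append({"word": slim["Word"], "labels": labels})

print(processed)
# [{'word': 'LOSS', 'labels': ['Negative']},
#  {'word': 'PROFITABLE', 'labels': ['Positive']}]
```

The stopword-like row "AND" survives the column filter but carries no label, so the row filter removes it, which is how 86,531 rows shrink to a few thousand affect terms.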
Note
[1] To make sure all possible inflections were considered, Loughran and McDonald (2011) extended the core 2of12inf word list with the following procedure: (i) all tokens in a variety of 10-K filings (10-K, 10-K/A, 10-K405, 10-K405/A, 10KSB, 10KSB/A, 10-KSB, 10-KSB/A, 10KSB40, 10KSB40/A) that did not appear in the 2of12inf word list were identified; (ii) this collection was then sorted by frequency of occurrence; and (iii) if a token had a frequency count of 50 or more, or was an inflection of a more common word, it was evaluated for inclusion in the master dictionary.
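The three-step procedure in [1] amounts to a frequency-thresholded candidate filter. A minimal sketch follows; the threshold of 50 comes from the note, while the toy token stream and the crude prefix-based inflection check are illustrative assumptions:

```python
from collections import Counter

# Toy stand-in for the 2of12inf core list.
CORE = {"accident", "accidents"}

# Toy token stream from 10-K variants.
tokens_in_filings = ["impairment"] * 60 + ["accidentals"] * 10 + ["xbrl"] * 3

# (i) tokens absent from the core word list
candidates = Counter(t for t in tokens_in_filings if t not in CORE)

# (ii) sort by frequency of occurrence
ranked = candidates.most_common()

def is_inflection_of_common(tok):
    # Crude stand-in for "is an inflection of a more common word".
    return any(tok.startswith(w) for w in CORE)

# (iii) keep tokens with count >= 50, or inflections of common words
shortlist = [t for t, n in ranked if n >= 50 or is_inflection_of_common(t)]
print(shortlist)  # ['impairment', 'accidentals']
```

In this sketch "impairment" passes on raw frequency, "accidentals" passes as an inflection despite its low count, and the rare token "xbrl" is excluded, matching the note's two admission routes.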
[2] The 2of12inf dictionary originated from the 12dicts project, which explored different methods of extracting core vocabulary lists from 12 source dictionaries. In the name 2of12inf, "2of12" refers to the core vocabulary list of over 40,000 words that appeared in at least 2 of the 12 source dictionaries; this list excluded capitalised words, phrases, abbreviations, affixes, and non-American/secondary spellings. The "inf" refers to the added inflections of those words, which expand the total size to around 81,000 words. However, 2of12inf diverged from using only the 12 source dictionaries: its starting point was a subset of Kevin Atkinson's AGID list, which incorporates public domain sources like Moby Words and WordNet, and the list does not exclude secondary spellings or non-American usages.
In summary, 2of12inf ultimately optimised coverage at the cost of authority by inflecting 2of12 and adding public domain words.
[3] The removed columns were Seq_num, Word Count, Word Proportion, Average Proportion, Std Dev, Doc Count, Syllables, and Source.