MASTER
Sentiment expressions common in financial regulatory filings, such as 10-K.
Summary
Composition:

- Approximately 3.8k unigram lexicon entries
- Financial domain-specific business phrases and jargon
- Multi-Class Multi-Label (Negative, Positive, Uncertainty, Litigious, Strong_Modal, Weak_Modal, and Constraining)

Creation Methodology:

- Lexicon candidates were based on tokens that appeared in at least 5% of sampled 10-K filings
- The tokens were then manually labelled by the authors
Evaluation: Loughran and McDonald (2011) tested the significance of the MASTER dictionary by examining the market's reaction at the time of a 10-K filing: "If tone matters, firms filing 10-Ks with a high measure of negative words should, on average, experience negative excess returns around the filing date" (p.18). Whereas the Harvard GI's TagNeg dictionary (H4N) showed no significant association with filing-date excess returns (t = -1.35, below the 95% confidence level), MASTER's Negative dictionary showed a statistically significant negative correlation (t = -2.64, above the 99% confidence level).
Bodnaruk, Loughran and McDonald (2015) investigated how well "different measures are able to predict future developments associated with the deterioration or improvement of external financing conditions" (p.16). Compared to other "traditional measures of financial constraints", MASTER's Constraining dictionary was the only measure that predicted all four directions of ex-post liquidity events at the 1% significance level (p.17, 40).
Usage Guidance: Useful for analysing finance publications, earnings calls, and annual reports. Access the processed dictionary via `sentibank.archive.load().dict("MASTER_v2022")`.
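The multi-label design noted above means a single token can carry several labels at once. The following minimal sketch tallies label counts over a toy subset of MASTER-style entries; the word-to-label mapping mirrors examples quoted later on this page, but the tokenizer and tallying logic are purely illustrative and are not sentibank's implementation:

```python
# Toy subset of MASTER-style entries. "impairment" carries two labels
# (it appears among both the top Negative and top Constraining words).
TOY_MASTER = {
    "loss": {"Negative"},
    "impairment": {"Negative", "Constraining"},
    "could": {"Weak_Modal"},
    "uncertain": {"Uncertainty"},
    "will": {"Strong_Modal"},
    "profitable": {"Positive"},
}

def label_counts(text):
    """Count how often each sentiment label fires in a text."""
    counts = {}
    for token in text.lower().split():
        token = token.strip(".,;:()\"'")
        for label in TOY_MASTER.get(token, ()):
            counts[label] = counts.get(label, 0) + 1
    return counts

filing = "The impairment could result in a material loss."
print(label_counts(filing))  # Negative: 2, Constraining: 1, Weak_Modal: 1
```

Because labels are not mutually exclusive, downstream analyses typically work with per-label counts or proportions rather than a single sentiment score.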
Introduction
While a sentiment dictionary designed for the general domain may be useful, "English words have many meanings, and a word categorisation scheme derived for one discipline might not translate effectively into a discipline with its own dialect" (Loughran and McDonald, 2011, p.1). Such generality can lead to Type-I errors due to sentiment misclassification in a different context. For example, Loughran and McDonald (2011) found that 73.8% of the negative word counts in the Harvard Psychosociological TagNeg dictionary (H4N) come from words that are not typically negative in a financial context.
To address this limitation, Loughran and McDonald (2011) created a finance-specific dictionary called MASTER. The dictionary is regularly updated by the authors and was last updated in 2022.
Original Dictionary
The master dictionary originally started with six labels (Negative, Positive, Uncertainty, Litigious, Strong_Modal, and Weak_Modal) and was later extended by Bodnaruk, Loughran and McDonald (2015) with the addition of a Constraining label. Note that the Litigious and Constraining lexicons capture a similar "tone": according to Bodnaruk, Loughran and McDonald (2015), the frequency of Constraining lexicons was positively correlated with the frequency of Litigious lexicons in the 10-K samples.
ver.2011
The sentiment lexicons of the MASTER dictionary (ver.2011) come from a "relatively exhaustive list" of tokens that occurred in at least 5% of 10-K documents between 1994-2008 (p.12). From the SEC's EDGAR, Loughran and McDonald (2011) collected 121,217 firm-year 10-K/10-K405 samples, which were later filtered down to 50,115 samples covering 8,341 unique firms.
| Filter | Sample Size | Dropped |
|---|---|---|
| EDGAR 10-K / 10-K405 1994-2008 complete sample (excluding duplicates) | 121,217 | |
| Include only first filing in a given year | 120,290 | 927 |
| At least 180 days between a given firm's 10-K filings | 120,074 | 216 |
| CRSP PERMNO match | 75,252 | 44,822 |
| Reported on CRSP as an ordinary common equity firm | 70,061 | 5,191 |
| CRSP market capitalization data available | 64,227 | 5,834 |
| Price on filing date day minus one > $3 | 55,946 | 8,281 |
| Returns and volume for day 0-3 event period | 55,630 | 316 |
| NYSE, AMEX, or Nasdaq exchange listing | 55,612 | 18 |
| At least 60 days of returns and volume in year prior to and following file date | 55,038 | 574 |
| Book-to-market COMPUSTAT data available and book value > 0 | 50,268 | 4,770 |
| Number of words in 10-K > 2,000 | 50,115 | 153 |
The collection of lexicons was further extended by accounting for inflections (e.g. for the token "accident", its inflections "accidental", "accidentally" and "accidents" were added) using ver.4.0 of the 2of12inf dictionary developed as part of SCOWL [1],[2]. Note that Loughran and McDonald (2011) used explicit inflections instead of stemming because "if the focus is on tone, using explicit inflections is less error prone than extending a word using stemming (root morpheme + derivational morphemes)" (Software Repository of Accounting and Finance, n.d.).
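The expansion step can be sketched as a lookup against a 2of12inf-style word list. This is a deliberate simplification: the tiny word list and the suffix-matching rule below are illustrative stand-ins, and real inflection handling in the authors' pipeline is more involved than suffix matching:

```python
# Toy stand-in for the 2of12inf word list (the real list has ~81,000 words).
WORD_LIST = {"accident", "accidental", "accidentally", "accidents",
             "accidence", "accord", "accords"}

# Simplified rule: accept words that extend the base token by a
# common inflectional/derivational suffix.
SUFFIXES = ("s", "es", "ed", "ing", "al", "ally", "ly")

def expand_inflections(base, word_list=WORD_LIST):
    """Return word-list entries that look like inflections of `base`."""
    return sorted(w for w in word_list
                  if w != base and w.startswith(base)
                  and w[len(base):] in SUFFIXES)

print(expand_inflections("accident"))
# ['accidental', 'accidentally', 'accidents']
```

Note how "accidence" is excluded even though it shares the prefix: the leftover "ce" is not an inflectional suffix, which is the kind of distinction that makes an explicit word list less error-prone than blind stemming.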
In total, 3,752 lexicons were collected, and Loughran and McDonald (2011) labelled them under 6 categories: Negative, Positive, Uncertainty, Litigious, Strong_Modal and Weak_Modal. There were 2,337 Negative lexicons (e.g. "felony", "litigation", "restated", "misstatement", "unanticipated"), but substantially fewer, 353, Positive lexicons (e.g. "achieve", "attain", "efficient", "improve", "profitable"). The top 5 most frequently occurring Negative lexicons in the 10-K samples were "loss", "losses", "impairment", "against" and "adverse".
There were 285 lexicons categorised as Uncertainty (e.g. "approximate", "contingency", "depend", "fluctuate", "indefinite", "uncertain", and "variability"), emphasising the general notion of imprecision rather than exclusively focusing on risk. Additionally, there were 731 lexicons categorised as Litigious (e.g. "claimant", "deposition", "interlocutory", "testimony", and "tort"), reflecting a propensity for legal contest. Words like "legislation" and "regulation" were also included; they do not necessarily imply a legal contest but may indicate a more litigious environment. It is important to note that many lexicons overlap between the Negative, Uncertainty, and Litigious categories.
Loughran and McDonald (2011) expanded Jordan's (1999, cited in Loughran and McDonald, 2011, p.14) categories of strong and weak modal words to include other terms expressing levels of confidence. There were 19 lexicons categorised as Strong_Modal (e.g. "always", "highest", "must", and "will"), and 27 lexicons categorised as Weak_Modal (e.g. "could", "depending", "might", and "possibly").
ver.2015
On top of ver.2011 of the master dictionary, Bodnaruk, Loughran and McDonald (2015) collected 184 lexicons that capture whether or not a firm is financially constrained (Constraining). Similar to Loughran and McDonald (2011), the collection was drawn from tokens that appeared in at least 5% of 10-K filings between 1996-2011. The original collection had 183,214 firm-year samples, which was filtered down to 51,533.
| Filter | Sample Size | Dropped |
|---|---|---|
| SEC 10-K files 1996-2011 | 183,214 | |
| Drop financial firms and utilities | 133,992 | 49,222 |
| Eliminate duplicates within year/CIK | 130,450 | 3,542 |
| Drop if file date < 180 days from prior | 129,986 | 464 |
| CRSP PERMNO match and ordinary common equity | 59,177 | 70,809 |
| Drop if number of 10-K words is < 2,000 | 59,137 | 40 |
| Drop if required Compustat data is missing | 55,530 | 3,607 |
| Market capitalization data available on CRSP | 51,533 | 3,997 |
Only seven lexicons ("required", "obligations", "requirements", "require", "impairment", "obligation", and "requires") account for more than half of all the counts of constraining words appearing in 10-Ks. Appendix C (p.32) contains the entire list of Constraining lexicons, and Table 3 (p.39) reports the 50 most frequently occurring Constraining lexicons.
ver.2022
The Master dictionary in SentiBank is the most up-to-date version (ver.2022) maintained by Loughran and McDonald. While pre-2018 versions generally did not include abbreviations, post-2018 versions include a limited number of them.
```python
from sentibank import archive

load = archive.load()
master = load.origin("MASTER_v2022")
```
The original CSV contains the following columns: Word, Seq_num, Word Count, Word Proportion, Average Proportion, Std Dev, Doc Count, Negative, Positive, Uncertainty, Litigious, Strong_Modal, Weak_Modal, Constraining, Syllables, and Source.
Processed Dictionary
From the original CSV, 8 columns irrelevant to sentiment were programmatically removed[3]. Rows lacking any sentiment label were then dropped. As a result, the corpus of 86,531 entries was distilled into a lexicon of 3,876 domain-specific affect terms. No further notable modifications or removals were made to the lexicons.
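The processing described above can be sketched on a toy CSV that uses a subset of the original column layout. The sentibank processing code itself may differ; this sketch only shows the two operations named here, dropping non-sentiment columns and removing rows with no label (in the released Loughran-McDonald file, a nonzero label-column entry marks membership, recording the year the word was added):

```python
import csv
import io

# Toy CSV with a subset of the original columns.
RAW = """Word,Seq_num,Word Count,Negative,Positive,Uncertainty,Syllables
LOSS,100,5000,2009,0,0,1
AND,5,99999,0,0,0,1
PROFITABLE,200,800,0,2009,0,3
"""

# Subset of the eight non-sentiment columns removed during processing.
DROP = {"Seq_num", "Word Count", "Syllables"}

processed = []
for row in csv.DictReader(io.StringIO(RAW)):
    slim = {k: v for k, v in row.items() if k not in DROP}
    # Nonzero label columns mark membership in that sentiment category.
    labels = [k for k, v in slim.items() if k != "Word" and int(v) > 0]
    if labels:  # drop rows lacking any sentiment label
        processed.append({"word": slim["Word"], "labels": labels})

print(processed)
# [{'word': 'LOSS', 'labels': ['Negative']},
#  {'word': 'PROFITABLE', 'labels': ['Positive']}]
```

The stopword-like row "AND" survives the column filter but carries no label, so the row filter removes it, which is how 86,531 rows shrink to a few thousand affect terms.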
Note
[1] To make sure all possible inflections were considered, Loughran and McDonald (2011) extended the core 2of12inf word list with the following procedure: (i) all tokens in a variety of 10-K filings (10-K, 10-K/A, 10-K405, 10-K405/A, 10KSB, 10KSB/A, 10-KSB, 10-KSB/A, 10KSB40, 10KSB40/A) that did not appear in the 2of12inf word list were identified; (ii) this collection was then sorted by frequency of occurrence; and (iii) if a token had a frequency count of 50 or more, or was an inflection of a more common word, it was evaluated for inclusion in the master dictionary.
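The three-step procedure in [1] amounts to a frequency-thresholded candidate filter. A minimal sketch follows; the threshold of 50 comes from the note, while the toy token stream and the crude prefix-based inflection check are illustrative assumptions:

```python
from collections import Counter

# Toy stand-in for the 2of12inf core list.
CORE = {"accident", "accidents"}

# Toy token stream from 10-K variants.
tokens_in_filings = ["impairment"] * 60 + ["accidentals"] * 10 + ["xbrl"] * 3

# (i) tokens absent from the core word list
candidates = Counter(t for t in tokens_in_filings if t not in CORE)

# (ii) sort by frequency of occurrence
ranked = candidates.most_common()

def is_inflection_of_common(tok):
    # Crude stand-in for "is an inflection of a more common word".
    return any(tok.startswith(w) for w in CORE)

# (iii) keep tokens with count >= 50, or inflections of common words
shortlist = [t for t, n in ranked if n >= 50 or is_inflection_of_common(t)]
print(shortlist)  # ['impairment', 'accidentals']
```

In this sketch "impairment" passes on raw frequency, "accidentals" passes as an inflection despite its low count, and the rare token "xbrl" is excluded, matching the note's two admission routes.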
[2] The 2of12inf dictionary originated from the 12dicts project, which explored different methods of extracting core vocabulary lists from 12 source dictionaries. In the name 2of12inf, "2of12" refers to the core vocabulary list of over 40,000 words that appeared in at least 2 of the 12 source dictionaries; this list excluded capitalised words, phrases, abbreviations, affixes, and non-American/secondary spellings. The "inf" refers to the added inflections of those words, which expand the total size to around 81,000 words. However, 2of12inf diverged from using only the 12 source dictionaries: its starting point was a subset of Kevin Atkinson's AGID list, which incorporates public domain sources like Moby Words and WordNet, and the list does not exclude secondary spellings or non-American usages.
In summary, 2of12inf ultimately optimised coverage at the cost of authority by inflecting 2of12 and adding public domain words.
[3] The removed columns were Seq_num, Word Count, Word Proportion, Average Proportion, Std Dev, Doc Count, Syllables, and Source.