MASTER

Sentiment expressions common in financial regulatory filings 💼, such as 10-K.

Summary

Composition:

  • Approximately 3.8k unigram lexicon entries

  • Financial domain-specific business phrases and jargon

  • Multi-Class Multi-Label (Negative, Positive, Uncertainty, Litigious, Strong_Modal, Weak_Modal, and Constraining)

Creation Methodology:

  • Lexicon candidates were based on tokens that appeared in at least 5% of sampled 10-K filings

  • The tokens were then manually labelled by the authors

Evaluation: Loughran and McDonald (2011) tested the significance of the MASTER dictionary by examining the market’s reaction at the time of a 10-K filing: ‘If tone matters, firms filing 10-Ks with a high measure of negative words should, on average, experience negative excess returns around the filing date’ (p.18). Whereas the Harvard_GI TagNeg Dictionary (H4N) showed no significant association with filing-date excess returns (t = -1.35, below the 95% confidence level), MASTER’s ‘Negative’ dictionary showed a statistically significant negative correlation (t = -2.64, above the 99% confidence level).

Bodnaruk, Loughran and McDonald (2015) investigated how well ‘different measures are able to predict future developments associated with the deterioration or improvement of external financing conditions’ (p.16). Compared to other ‘traditional measures of financial constraints’, MASTER’s ‘Constraining’ dictionary was the only measure that predicted all four directions of ex-post liquidity events at the 1% significance level (p.17, 40).

Usage Guidance: Useful for analysis of finance publications, earnings calls, and annual reports. Access the processed dictionary via sentibank.archive.load().dict("MASTER_v2022").
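As a rough illustration of how a multi-label lexicon of this kind is typically applied, the sketch below counts label hits in a short text. The `toy_lexicon` mapping, its token-to-labels layout, its label assignments, and the `label_counts` helper are all assumptions made for illustration; the structure returned by sentibank may differ.

```python
import re
from collections import Counter

# Toy stand-in for a processed lexicon: token -> list of labels.
# The layout and the label assignments are illustrative assumptions.
toy_lexicon = {
    "loss": ["Negative"],
    "impairment": ["Negative", "Constraining"],
    "improve": ["Positive"],
    "uncertain": ["Uncertainty"],
    "could": ["Weak_Modal"],
    "must": ["Strong_Modal"],
}

def label_counts(text, lexicon):
    """Count how many tokens in `text` carry each label."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for token in tokens:
        # A token may carry several labels, so each one is counted.
        for label in lexicon.get(token, []):
            counts[label] += 1
    return counts

sample = "The company must record an impairment loss, and results could remain uncertain."
counts = label_counts(sample, toy_lexicon)
print(dict(counts))
```

Because the dictionary is multi-label, a single token such as ‘impairment’ contributes to more than one category in this sketch.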

📋 Introduction

While a sentiment dictionary designed for the general domain may be useful, ‘English words have many meanings, and a word categorisation scheme derived for one discipline might not translate effectively into a discipline with its own dialect’ (Loughran and McDonald, 2011, p.1). Such generality could lead to Type-I errors, because a word’s sentiment may be misclassified when it is used in a different context. For example, Loughran and McDonald (2011) found that 73.8% of the negative word counts in the Harvard Psychosociological TagNeg Dictionary (H4N) come from words that are not typically negative in a financial context.

To address such constraints, Loughran and McDonald (2011) created a finance-specific dictionary called MASTER. The dictionary is regularly updated by the authors, and was last updated in 2022.

📚 Original Dictionary

The master dictionary originally started with 6 labels – Negative, Positive, Uncertainty, Litigious, Strong_Modal, and Weak_Modal – but was later enhanced by Bodnaruk, Loughran and McDonald (2015) with the addition of a Constraining label. Note that the Litigious and Constraining lexicons capture a similar “tone”: according to Bodnaruk, Loughran and McDonald (2015), the frequency of Constraining lexicons was positively correlated with the frequency of Litigious lexicons in the 10-K samples.

ver.2011

The sentiment lexicons of the MASTER dictionary (ver.2011) are drawn from a ‘relatively exhaustive list’ of tokens that occurred in at least 5% of 10-K documents between 1994 and 2008 (p.12). From the SEC’s EDGAR, Loughran and McDonald (2011) collected 121,217 firm-year 10-K/10-K405 samples, which were then filtered down to 50,115 samples covering 8,341 unique firms.

| Filter | Sample Size | Dropped |
|---|---|---|
| EDGAR 10-K / 10-K405 1994-2008 complete sample (excluding duplicates) | 121,217 | |
| Include only first filing in a given year | 120,290 | 927 |
| At least 180 days between a given firm’s 10-K filings | 120,074 | 216 |
| CRSP PERMNO match | 75,252 | 44,822 |
| Reported on CRSP as an ordinary common equity firm | 70,061 | 5,191 |
| CRSP market capitalization data available | 64,227 | 5,834 |
| Price on filing date day minus one > $3 | 55,946 | 8,281 |
| Returns and volume for day 0-3 event period | 55,630 | 316 |
| NYSE, AMEX, or Nasdaq exchange listing | 55,612 | 18 |
| At least 60 days of returns and volume in year prior to and following file date | 55,038 | 574 |
| Book-to-market COMPUSTAT data available and book value > 0 | 50,268 | 4,770 |
| Number of words in 10-K > 2,000 | 50,115 | 153 |
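Once the sample is fixed, the 5% document-frequency screen used to select lexicon candidates can be sketched as follows. This is a minimal illustration, not the authors’ parsing code: the `candidate_tokens` helper, the whitespace tokenisation, and the toy corpus are assumptions.

```python
from collections import Counter

def candidate_tokens(documents, min_share=0.05):
    """Return tokens that occur in at least `min_share` of the documents."""
    doc_freq = Counter()
    for doc in documents:
        # Count each token once per document (document frequency,
        # not the total number of occurrences).
        doc_freq.update(set(doc.lower().split()))
    threshold = min_share * len(documents)
    return {tok for tok, n in doc_freq.items() if n >= threshold}

# Toy corpus of 40 "filings" (invented text): 'loss' appears in 10 of them,
# while 'yacht' appears in only 1, which is below the 5% threshold of 2.
docs = ["revenue loss"] * 10 + ["revenue growth"] * 29 + ["yacht charge"]
candidates = candidate_tokens(docs)
print(sorted(candidates))
```

The screen depends on how many filings contain a token, not on how often the token occurs within any one filing.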

The collection of lexicons was further extended to account for inflections (e.g. for the token ‘accident’, its inflections ‘accidental’, ‘accidentally’ and ‘accidents’ were added) using ver.4.0 of the 2of12inf dictionary developed by SCOWL [1],[2]. Note that Loughran and McDonald (2011) used inflections instead of stemming because ‘if the focus is on tone, using explicit inflections is less error prone than extending a word using stemming (root morpheme + derivational morphemes)’ (Software Repository of Accounting and Finance, n.d.).
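The inflection step can be pictured as a lookup against an explicit word list rather than a rule-based transformation. In the sketch below, `toy_inflections` is a hypothetical excerpt that mirrors the ‘accident’ example in the text (it is not the actual 2of12inf file), and `expand_with_inflections` is an illustrative helper.

```python
# Hypothetical excerpt of an inflection table (base word -> listed forms).
# In the source, this role is played by the 2of12inf word list.
toy_inflections = {
    "accident": ["accidental", "accidentally", "accidents"],
}

def expand_with_inflections(seed_words, inflections):
    """Add every explicitly listed inflection of each seed word."""
    expanded = set(seed_words)
    for word in seed_words:
        # Words without an entry in the table are kept unchanged.
        expanded.update(inflections.get(word, []))
    return expanded

lexicon = expand_with_inflections(["accident", "felony"], toy_inflections)
print(sorted(lexicon))
```

Only forms that are explicitly listed enter the lexicon, so a word that merely shares a prefix with a seed word is not picked up, which is the kind of error that stemming can introduce.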

In total, 3,752 lexicons were collected, and Loughran and McDonald (2011) labelled them under 6 categories: Negative, Positive, Uncertainty, Litigious, Strong_Modal and Weak_Modal. There were 2,337 Negative lexicons (e.g. ‘felony’, ‘litigation’, ‘restated’, ‘misstatement’, ‘unanticipated’) but only 353 Positive lexicons (e.g. ‘achieve’, ‘attain’, ‘efficient’, ‘improve’, ‘profitable’). The 5 most frequently occurring Negative lexicons in the 10-K sample were ‘loss’, ‘losses’, ‘impairment’, ‘against’ and ‘adverse’.

There were 285 lexicons categorised as Uncertainty (e.g. ‘approximate’, ‘contingency’, ‘depend’, ‘fluctuate’, ‘indefinite’, ‘uncertain’, and ‘variability’), emphasising the general notion of imprecision rather than focusing exclusively on risk. Additionally, there were 731 lexicons categorised as Litigious (e.g. ‘claimant’, ‘deposition’, ‘interlocutory’, ‘testimony’, and ‘tort’), reflecting a propensity for legal contest. Words like ‘legislation’ and ‘regulation’ were also included; they do not necessarily imply a legal contest but may indicate a more litigious environment. Note that many lexicons overlap between the Negative, Uncertainty, and Litigious categories.

Loughran and McDonald (2011) expanded Jordan’s (1999, cited in Loughran and McDonald, 2011, p.14) categories of strong and weak modal words to include other terms expressing levels of confidence. There were 19 lexicons categorised as Strong_Modal (e.g. ‘always’, ‘highest’, ‘must’, and ‘will’), and 27 lexicons categorised as Weak_Modal (e.g. ‘could’, ‘depending’, ‘might’, and ‘possibly’).

ver.2015

Building on ver.2011 of the master dictionary, Bodnaruk, Loughran and McDonald (2015) collected 184 lexicons that capture whether or not a firm is financially constrained (Constraining). As in Loughran and McDonald (2011), the collection was drawn from tokens that appeared in at least 5% of 10-K filings, here between 1996 and 2011. The original collection had 183,214 firm-year samples, which were filtered down to 51,533.

| Filter | Sample Size | Dropped |
|---|---|---|
| SEC 10-K files 1996–2011 | 183,214 | |
| Drop financial firms and utilities | 133,992 | 49,222 |
| Eliminate duplicates within year/CIK | 130,450 | 3,542 |
| Drop if file date < 180 days from prior | 129,986 | 464 |
| CRSP PERMNO match and ordinary common equity | 59,177 | 70,809 |
| Drop if number of 10-K words is < 2,000 | 59,137 | 40 |
| Drop if required Compustat data is missing | 55,530 | 3,607 |
| Market capitalization data available on CRSP | 51,533 | 3,997 |

Only seven lexicons, ‘required’, ‘obligations’, ‘requirements’, ‘require’, ‘impairment’, ‘obligation’, and ‘requires’, account for more than half of all the counts for the constraining words which appeared in 10-Ks. Appendix C (p.32) contains the entire list of Constraining lexicons, and Table 3 (p.39) reports the 50 most frequently occurring Constraining lexicons.

ver.2022

The Master dictionary in SentiBank is the most up-to-date version (ver.2022) maintained by Loughran and McDonald. While pre-2018 versions generally did not include abbreviations, post-2018 versions include a limited number of them.

from sentibank import archive

load = archive.load()
master = load.origin("MASTER_v2022") 
MASTER (Loughran and McDonald, 2011; Bodnaruk, Loughran and McDonald, 2015)
Columns: Word, Seq_num, Word Count, Word Proportion, Average Proportion, Std Dev, Doc Count, Negative, Positive, Uncertainty, Litigious, Strong_Modal, Weak_Modal, Constraining, Syllables, Source

🧹 Processed Dictionary

From the original csv, 8 columns irrelevant to sentiment were programmatically removed so that only the word and its sentiment labels remain [3]. Rows carrying no sentiment label were then removed as well. As a result, the 86,531 entries of the original dictionary were reduced to a lexicon of 3,876 domain-specific sentiment terms. No further modifications or removals were made to the lexicons.
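The two processing steps can be sketched as below. This is not SentiBank’s actual preprocessing code: the `process` helper, the rule that a positive value in a label column marks membership, and the two toy rows (whose values are invented) are assumptions made for illustration. Only the column names come from the original dictionary.

```python
# Columns named in note [3] as irrelevant to sentiment.
DROPPED = {"Seq_num", "Word Count", "Word Proportion", "Average Proportion",
           "Std Dev", "Doc Count", "Syllables", "Source"}
LABELS = ["Negative", "Positive", "Uncertainty", "Litigious",
          "Strong_Modal", "Weak_Modal", "Constraining"]

def process(rows):
    """Drop non-sentiment columns, then drop rows with no label set."""
    kept = []
    for row in rows:
        slim = {k: v for k, v in row.items() if k not in DROPPED}
        # Assumption: a positive value in a label column marks membership.
        if any(slim.get(label, 0) > 0 for label in LABELS):
            kept.append(slim)
    return kept

# Two toy rows with invented values and only a subset of the columns.
rows = [
    {"Word": "LOSS", "Seq_num": 1, "Syllables": 1, "Negative": 2009, "Positive": 0},
    {"Word": "TABLE", "Seq_num": 2, "Syllables": 2, "Negative": 0, "Positive": 0},
]
processed = process(rows)
print(processed)
```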

Note

[1] To make sure all possible inflections were considered, Loughran and McDonald (2011) extended the core 2of12inf word list using the following procedure: (i) All tokens in a variety of 10-K filing types (10-K, 10-K/A, 10-K405, 10-K405/A, 10KSB, 10KSB/A, 10-KSB, 10-KSB/A, 10KSB40, 10KSB40/A) that did not appear in the 2of12inf word list were identified; (ii) This collection was then sorted by frequency of occurrence; and (iii) If a token had a frequency count of 50 or more OR was an inflection of a more common word, it was evaluated for inclusion in the master dictionary.

[2] The 2of12inf dictionary originated from the 12dicts project, which explored different methods to extract core vocabulary lists from the 12 source dictionaries. From the name 2of12inf, the ‘2of12’ represents the core vocabulary list containing over 40,000 words that appeared in at least 2 of the 12 source dictionaries. This excluded capitalised words, phrases, abbreviations, affixes, and non-American/secondary spellings.

The ‘inf’ represents the added inflections of those words, expanding the total size to around 81,000 words. However, 2of12inf diverged from using only the 12 source dictionaries. The starting point was a subset of the AGID list by Kevin Atkinson, incorporating public domain sources like Moby Words and WordNet. The list does not exclude secondary spellings or non-American usages.

In summary, 2of12inf ultimately optimised coverage at the cost of authority by inflecting 2of12 and adding public domain words.

[3] The removed columns were Seq_num, Word Count, Word Proportion, Average Proportion, Std Dev, Doc Count, Syllables, Source