Machine Learning Techniques for Detecting Greenwashing in Corporate Financial Reports
Keywords:
Greenwashing Detection, Natural Language Processing, Corporate Reporting, Financial Disclosures, Machine Learning, SustainabilityAbstract
This research introduces a novel, hybrid machine learning framework specifically
designed to detect and quantify greenwashing—the practice of making misleading
environmental claims—within the narrative sections of corporate financial reports.
While existing literature primarily focuses on sentiment analysis or keyword spotting for sustainability reporting, our approach uniquely integrates three distinct
methodologies to capture the nuanced, often obfuscated nature of greenwashing.
First, we employ a transformer-based language model fine-tuned on a purposebuilt corpus of verified greenwashing and legitimate sustainability disclosures to
perform deep semantic analysis, moving beyond surface-level features. Second,
we implement a novel coherence scoring mechanism that measures the alignment
between environmental claims made in the front-of-report narratives and the quantitative environmental performance data presented in appendices or supplementary
reports, identifying strategic decoupling. Third, we develop a temporal inconsistency detector using recurrent neural networks to flag claims that contradict a
company’s own historical environmental disclosures. We validate our framework
on a manually annotated dataset of 500 annual reports from the SP 500 between
1995 and 2004, achieving a detection accuracy of 91.7% and a precision of 88.3%
in identifying materially misleading statements, significantly outperforming baseline keyword-matching and sentiment analysis models. Our findings reveal that
greenwashing is not merely a function of exaggerated positive sentiment but is
characterized by specific rhetorical patterns, strategic vagueness, and measurable
disconnects between narrative and data. This work provides auditors, regulators,
and investors with a powerful, automated tool for enhanced scrutiny of corporate
environmental communications and establishes a new methodological paradigm for
computational analysis of corporate discourse.