Methodology

How Risk Diff extracts, matches, and classifies SEC 10-K risk factor changes.

Data Source

All filings are fetched directly from SEC EDGAR via the EDGAR full-text search API. Every request identifies Risk Diff with the User-Agent header that the SEC's fair-access policy requires. No third-party data providers are used, so every claim on this site is traceable to a specific filing on EDGAR.
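
As a rough illustration of the User-Agent requirement, the sketch below builds (but does not send) a request carrying a declared identity. The function name build_edgar_request and the contact string are hypothetical, and the URL is left to the caller; this is not Risk Diff's actual fetch code.

```python
import urllib.request


def build_edgar_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a request that declares who is calling, as SEC fair access expects.

    `user_agent` is a caller-supplied identity string, for example an
    organization name plus a contact e-mail address (illustrative format).
    """
    if not user_agent.strip():
        raise ValueError("a non-empty User-Agent is required for EDGAR requests")
    # The request is only constructed here; sending it is left to the caller.
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_edgar_request(
    "https://www.sec.gov/", "Example Research admin@example.com"
)
```

Keeping construction separate from sending makes it easy to add rate limiting or retries around the actual network call.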

Extraction

Each 10-K filing contains a section called Item 1A — Risk Factors. Locating it is non-trivial: filings vary significantly in structure across companies and years, from clean HTML to deeply nested iXBRL documents.

Risk Diff tries four strategies in order, stopping at the first that succeeds:

  1. Named anchor matching — looks for id or name attributes matching patterns like item1a, item-1a, or riskfactor.
  2. TOC href following — finds table-of-contents links with text matching "Item 1A" or "Risk Factors" and follows the href to the target anchor. Handles iXBRL EDGAR filings and non-standard layouts.
  3. Block element scan — walks block-level elements looking for Risk Factors headings, skipping elements inside <table> tags to avoid false matches on TOC rows.
  4. Flat-div iXBRL — covers filings (e.g. Intel, GE Aerospace) that use standalone body-level divs with Risk Factors headings rather than anchored sections.
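
To make the table-skipping idea in strategy 3 concrete, here is a minimal sketch using only Python's standard html.parser. It is a simplification: it inspects text nodes rather than walking block elements, and the heading pattern, class name, and function name are assumptions made for illustration, not the patterns Risk Diff uses.

```python
import re
from html.parser import HTMLParser
from typing import List

# Assumed heading pattern; real filings vary far more than this.
HEADING_RE = re.compile(r"^(?:item\s*1a\.?\s*)?risk\s+factors\.?$", re.IGNORECASE)


class RiskHeadingScanner(HTMLParser):
    """Collects Risk Factors headings that appear outside any <table>."""

    def __init__(self) -> None:
        super().__init__()
        self.table_depth = 0  # > 0 while the parser is inside a table
        self.headings: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        # Ignore matches inside tables, which are usually TOC rows.
        if self.table_depth == 0 and HEADING_RE.match(text):
            self.headings.append(text)


def find_risk_headings(html: str) -> List[str]:
    scanner = RiskHeadingScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.headings
```

A filing whose table of contents links to "Item 1A. Risk Factors" and whose body later repeats that heading in a paragraph would yield only the body occurrence.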

If none of the four strategies succeeds, the filing is omitted rather than guessed at. Risk Diff degrades gracefully: uncertain extractions are excluded, not approximated.
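
The "first strategy that succeeds, otherwise omit" behavior can be sketched as a simple fallback chain. Only two toy strategies are shown, both based on assumed regular expressions; the function names and the convention of returning a character offset are hypothetical, and the real extractors are more involved.

```python
import re
from typing import Callable, List, Optional

# Hypothetical strategy signature: take raw filing HTML and return the
# character offset where Item 1A starts, or None if it cannot be located.
Strategy = Callable[[str], Optional[int]]

ANCHOR_RE = re.compile(
    r"""<[a-z0-9]+\b[^>]*\b(?:id|name)\s*=\s*["']?(?:item[-_]?1a|risk[-_]?factors?)""",
    re.IGNORECASE,
)
HEADING_RE = re.compile(
    r"<(?:p|div|h[1-6])\b[^>]*>\s*(?:item\s*1a\.?\s*)?risk\s+factors\s*<",
    re.IGNORECASE,
)


def by_named_anchor(html: str) -> Optional[int]:
    match = ANCHOR_RE.search(html)
    return match.start() if match else None


def by_heading(html: str) -> Optional[int]:
    match = HEADING_RE.search(html)
    return match.start() if match else None


def locate_item_1a(html: str, strategies: List[Strategy]) -> Optional[int]:
    """Try each strategy in order and stop at the first that succeeds."""
    for strategy in strategies:
        offset = strategy(html)
        if offset is not None:  # offset 0 is a valid result
            return offset
    return None  # caller omits the filing instead of guessing


STRATEGIES: List[Strategy] = [by_named_anchor, by_heading]
```

Returning None rather than a best guess lets the caller drop the filing, which mirrors the omit-rather-than-approximate rule described above.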

Diff Algorithm

Once extracted, risk factor sections from two consecutive filings are matched using TF-IDF cosine similarity (via scikit-learn). This is a deterministic, reproducible algorithm with no LLM dependency for the core matching step.

Each section from the prior year is compared against every section from the current year, and the highest-scoring pair that clears the 0.45 match threshold is selected. Prior-year sections left without a match become REMOVED; current-year sections left without a match become ADDED.
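
A minimal sketch of this matching step with scikit-learn follows. The default TfidfVectorizer settings, fitting on both years together, and the greedy one-to-one pairing are assumptions made for illustration; the description above does not specify how ties or competing matches are resolved. The function name match_sections is hypothetical.

```python
from typing import List, Tuple

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

MATCH_THRESHOLD = 0.45  # match threshold from the classification table


def match_sections(
    prior: List[str], current: List[str], threshold: float = MATCH_THRESHOLD
) -> Tuple[List[Tuple[int, int, float]], List[int], List[int]]:
    """Return (pairs, removed, added) as indices into `prior` and `current`."""
    if not prior or not current:
        return [], list(range(len(prior))), list(range(len(current)))

    # Assumption: one vocabulary fitted over both years, default tokenization.
    matrix = TfidfVectorizer().fit_transform(prior + current)
    sims = cosine_similarity(matrix[: len(prior)], matrix[len(prior):])

    # Assumption: greedy pairing, best-scoring candidates first.
    candidates = sorted(
        ((sims[i, j], i, j) for i in range(len(prior)) for j in range(len(current))),
        reverse=True,
    )
    pairs, used_prior, used_current = [], set(), set()
    for score, i, j in candidates:
        if score < threshold:
            break  # every remaining candidate scores lower still
        if i in used_prior or j in used_current:
            continue
        pairs.append((i, j, float(score)))
        used_prior.add(i)
        used_current.add(j)

    removed = [i for i in range(len(prior)) if i not in used_prior]
    added = [j for j in range(len(current)) if j not in used_current]
    return pairs, removed, added
```

Because TF-IDF and cosine similarity involve no randomness, the same two inputs always produce the same pairs, which is the reproducibility property noted above.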

Classification Thresholds

Status      Condition                              Meaning
ADDED       No match above 0.45 in prior year      New risk section with no close predecessor
REMOVED     No match above 0.45 in current year    Prior section with no close successor
MODIFIED    Similarity 0.45 – 0.92                 Matched pair with meaningful text changes
UNCHANGED   Similarity ≥ 0.92                      Matched pair with essentially identical text

MODIFIED entries also carry a confidence level based on their similarity score:

Confidence   Similarity score
High         ≥ 0.75
Medium       0.60 – 0.75
Low          0.45 – 0.60
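
The thresholds above translate into a small lookup. In this sketch the function name classify_section and the year argument are hypothetical, and the handling of exact boundary values (0.45, 0.60, and 0.75 counting toward the higher band) is an assumption where the tables leave it open.

```python
from typing import Optional, Tuple

MATCH_THRESHOLD = 0.45
UNCHANGED_THRESHOLD = 0.92


def classify_section(best_similarity: float, year: str) -> Tuple[str, Optional[str]]:
    """Map a section's best similarity score to (status, confidence).

    `year` is "prior" or "current" and only matters for unmatched sections:
    an unmatched prior-year section is REMOVED, an unmatched current-year
    section is ADDED. Confidence is reported for MODIFIED entries only.
    """
    if year not in ("prior", "current"):
        raise ValueError("year must be 'prior' or 'current'")
    if best_similarity >= UNCHANGED_THRESHOLD:
        return "UNCHANGED", None
    if best_similarity >= MATCH_THRESHOLD:  # assumption: 0.45 itself is a match
        if best_similarity >= 0.75:
            confidence = "High"
        elif best_similarity >= 0.60:
            confidence = "Medium"
        else:
            confidence = "Low"
        return "MODIFIED", confidence
    return ("REMOVED" if year == "prior" else "ADDED"), None
```

For example, a matched pair scoring 0.65 is MODIFIED with Medium confidence, while a current-year section whose best score is 0.30 is ADDED.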

AI-Generated Summaries

Each company page includes a short narrative summary generated by Claude Haiku (Anthropic). These summaries are clearly labeled and should be treated as interpretive aids, not factual claims. All other content — titles, section text, classifications, counts — is deterministically extracted from the source filing.

Machine-Readable Formats

Every company page is available in three formats:

All formats are listed in llms.txt and the sitemap. The HTML pages advertise alternate formats via <link rel="alternate"> tags.

Limitations

SEC disclosures are lawyer-mediated. Companies often add risk language defensively after competitors do, use deliberately vague wording to reduce litigation exposure, or disclose risks primarily to satisfy legal requirements rather than to inform investors. Changes in wording are signals — not definitive evidence of actual operational risk.
Classification is based on text similarity, not semantics. A section that is substantially rewritten may score as MODIFIED even if the underlying risk is entirely new, or vice versa. Low-confidence MODIFIED entries (similarity 0.45–0.60) should be interpreted with particular care.
Coverage is limited to the companies in our dataset. Risk Diff currently covers 135 S&P 500 companies with two consecutive annual filings each. Not all S&P 500 companies are included and coverage expands over time.