---
title: Methodology
description: How riskdiff.com extracts, matches, and classifies SEC 10-K risk factor changes.
url: https://riskdiff.com/methodology/
markdown_url: https://riskdiff.com/methodology/index.md
---

# Methodology

How Risk Diff extracts, matches, and classifies SEC 10-K risk factor changes.

---

## Data Source

All filings are fetched directly from **SEC EDGAR** via the EDGAR full-text search API. Every request sends a descriptive `User-Agent` header, as required by the SEC fair-access policy. No third-party data providers are used, so every claim on this site is traceable to a specific filing on EDGAR.
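
As a minimal sketch of what a fair-access request might look like, the snippet below builds (but does not send) a request with a declared `User-Agent`. The contact string and the helper name `build_edgar_request` are hypothetical, and the URL is EDGAR's public submissions endpoint used as a stand-in, since the exact endpoint Risk Diff calls is not specified here.

```python
from urllib.request import Request

# Hypothetical contact string; SEC asks automated clients to identify themselves.
USER_AGENT = "ExampleCo Research admin@example.com"


def build_edgar_request(cik: int) -> Request:
    """Build a GET request for a company's filing index, without sending it."""
    # Stand-in endpoint: EDGAR's public submissions API, keyed by 10-digit CIK.
    url = f"https://data.sec.gov/submissions/CIK{cik:010d}.json"
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_edgar_request(320193)  # Apple's CIK
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual fetch. A real client would also need to throttle its requests to stay within SEC rate limits.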

---

## Extraction

Each 10-K filing contains a section called **Item 1A — Risk Factors**. Locating it is non-trivial: filings vary significantly in structure across companies and years, from clean HTML to deeply nested iXBRL documents.

Risk Diff tries four strategies in order, stopping at the first that succeeds:

1. **Named anchor matching** — looks for `id` or `name` attributes matching patterns like `item1a`, `item-1a`, or `riskfactor`.
2. **TOC href following** — finds table-of-contents links with text matching "Item 1A" or "Risk Factors" and follows the `href` to the target anchor. Handles iXBRL EDGAR filings and non-standard layouts.
3. **Block element scan** — walks block-level elements looking for Risk Factors headings, skipping elements inside `<table>` tags to avoid false matches on TOC rows.
4. **Flat-div iXBRL** — covers filings (e.g. Intel, GE Aerospace) that use standalone body-level divs with Risk Factors headings rather than anchored sections.

If none of the four strategies succeeds, the filing is omitted rather than guessed: uncertain extractions are excluded, not approximated.
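
The fallback chain can be sketched as a list of strategy functions tried in order. This is an illustration under stated assumptions, not the site's implementation: only two simplified strategies are shown, the regular expressions are guesses at the kind of pattern involved, and all function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import Callable, Optional, Sequence, Tuple

# Illustrative pattern for anchors such as "item1a", "item-1a", or "riskfactor".
ANCHOR_PATTERN = re.compile(r"item[-_]?1a|risk[-_]?factor", re.IGNORECASE)
# Illustrative pattern for a block element that opens with an Item 1A heading.
HEADING_PATTERN = re.compile(
    r"<(?:h[1-6]|p|div)[^>]*>\s*(Item\s+1A\.?\s*Risk\s+Factors)\s*<", re.IGNORECASE
)


class AnchorFinder(HTMLParser):
    """Remembers the first id/name attribute value that matches ANCHOR_PATTERN."""

    def __init__(self) -> None:
        super().__init__()
        self.match: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if self.match is None and key in ("id", "name") and value:
                if ANCHOR_PATTERN.search(value):
                    self.match = value


def find_named_anchor(html: str) -> Optional[str]:
    finder = AnchorFinder()
    finder.feed(html)
    return finder.match


def find_heading(html: str) -> Optional[str]:
    # Simplified stand-in for the block element scan; it does not skip tables.
    found = HEADING_PATTERN.search(html)
    return found.group(1) if found else None


STRATEGIES: Sequence[Callable[[str], Optional[str]]] = (find_named_anchor, find_heading)


def locate_risk_factors(html: str) -> Optional[Tuple[str, str]]:
    """Return (strategy name, match) from the first strategy that succeeds."""
    for strategy in STRATEGIES:
        result = strategy(html)
        if result is not None:
            return strategy.__name__, result
    return None  # Caller omits the filing rather than guessing.
```

Returning `None` instead of a best guess keeps the "omit when uncertain" rule explicit at the call site.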

---

## Diff Algorithm

Once extracted, risk factor sections from two consecutive filings are matched using **TF-IDF cosine similarity** (via scikit-learn). This is a deterministic, reproducible algorithm with no LLM dependency for the core matching step.

Each section from the prior year is compared against every section from the current year, and the highest-scoring pair that clears the match threshold (0.45) is selected. Prior-year sections left unmatched are classified REMOVED; current-year sections left unmatched are classified ADDED.
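
The matching step can be illustrated with a dependency-free sketch. Risk Diff uses scikit-learn; the code below is a pure-Python stand-in that applies a smoothed IDF of the form `ln((1 + n) / (1 + df)) + 1` with simplified tokenization, so its scores will not be numerically identical to the site's. The greedy one-to-one assignment and every function name are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Weight each term count by a smoothed inverse document frequency."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_sections(prior: List[str], current: List[str], threshold: float = 0.45):
    """Greedily pair sections, best score first; each section is used at most once."""
    vectors = tfidf_vectors(prior + current)
    prior_vecs, current_vecs = vectors[: len(prior)], vectors[len(prior):]
    scored: List[Tuple[float, int, int]] = sorted(
        (
            (cosine(p, c), i, j)
            for i, p in enumerate(prior_vecs)
            for j, c in enumerate(current_vecs)
        ),
        reverse=True,
    )
    matches: Dict[int, Tuple[int, float]] = {}
    used_current = set()
    for score, i, j in scored:
        if score < threshold:
            break  # Remaining pairs score even lower.
        if i in matches or j in used_current:
            continue
        matches[i] = (j, score)
        used_current.add(j)
    removed = [i for i in range(len(prior)) if i not in matches]
    added = [j for j in range(len(current)) if j not in used_current]
    return matches, removed, added


prior = [
    "We depend on a limited number of suppliers for key components.",
    "Changes in tax law could increase our effective tax rate.",
]
current = [
    "We depend on a limited number of suppliers for key components and materials.",
    "Cybersecurity incidents could disrupt our operations and harm our reputation.",
]
matches, removed, added = match_sections(prior, current)
```

In this toy input the supplier sections pair up, while the tax section and the cybersecurity section share too little vocabulary to match and fall out as removed and added respectively.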

---

## Classification Thresholds

| Status | Condition | Meaning |
|--------|-----------|---------|
| `ADDED` | No match above 0.45 in prior year | New risk section with no close predecessor |
| `REMOVED` | No match above 0.45 in current year | Prior section with no close successor |
| `MODIFIED` | Similarity 0.45 to < 0.92 | Matched pair with meaningful text changes |
| `UNCHANGED` | Similarity ≥ 0.92 | Matched pair with essentially identical text |

MODIFIED entries also carry a confidence level based on their similarity score:

| Confidence | Similarity score |
|------------|-----------------|
| High | 0.75 to < 0.92 |
| Medium | 0.60 to < 0.75 |
| Low | 0.45 to < 0.60 |
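
The two tables above can be expressed as a single lookup. This is a sketch under assumptions: the function name is hypothetical, each lower bound is treated as inclusive, and a score of exactly 0.45 is treated as a match, which the tables leave ambiguous.

```python
from typing import Optional, Tuple

MATCH_THRESHOLD = 0.45
UNCHANGED_THRESHOLD = 0.92


def classify(similarity: Optional[float], year: str) -> Tuple[str, Optional[str]]:
    """Return (status, confidence) for one section.

    similarity is the section's best score against the other filing, or None
    if there was nothing to compare against. year is "current" or "prior",
    naming the filing the section comes from.
    """
    if similarity is None or similarity < MATCH_THRESHOLD:
        # Unmatched: new if it is in the current filing, dropped if in the prior one.
        return ("ADDED" if year == "current" else "REMOVED"), None
    if similarity >= UNCHANGED_THRESHOLD:
        return "UNCHANGED", None
    if similarity >= 0.75:
        confidence = "High"
    elif similarity >= 0.60:
        confidence = "Medium"
    else:
        confidence = "Low"
    return "MODIFIED", confidence
```

Confidence is only reported for `MODIFIED`, so the other statuses return `None` in that slot.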

---

## AI-Generated Summaries

Each company page includes a short narrative summary generated by **Claude Haiku** (Anthropic). These summaries are clearly labeled and should be treated as interpretive aids, not factual claims. All other content (titles, section text, classifications, counts) is derived deterministically from the source filings.

---

## Machine-Readable Formats

Every company page is available in three formats:

- **HTML** — `/aapl/2025-vs-2024/`
- **Markdown** — `/aapl/2025-vs-2024/index.md` (includes YAML frontmatter)
- **JSON** — `/aapl/2025-vs-2024/index.json` (structured risk entries with scores)

All formats are listed in `llms.txt` and the sitemap. HTML pages advertise alternate formats via `<link rel="alternate">` tags.
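
For scripted access, the three URLs for a comparison page can be derived from the ticker and the two filing years. The helper below is a hypothetical sketch that follows the path pattern shown in the list above; it assumes tickers are lowercased in the path, as in the `aapl` example.

```python
from typing import Dict

BASE_URL = "https://riskdiff.com"


def format_urls(ticker: str, current_year: int, prior_year: int) -> Dict[str, str]:
    """Build the HTML, Markdown, and JSON URLs for one comparison page."""
    page = f"{BASE_URL}/{ticker.lower()}/{current_year}-vs-{prior_year}/"
    return {
        "html": page,
        "markdown": page + "index.md",
        "json": page + "index.json",
    }


urls = format_urls("AAPL", 2025, 2024)
print(urls["json"])
```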

---

## Limitations

**SEC disclosures are lawyer-mediated.** Companies often add risk language defensively after competitors do, use deliberately vague wording to reduce litigation exposure, or disclose risks primarily to satisfy legal requirements rather than to inform investors. Changes in wording are signals — not definitive evidence of actual operational risk.

**Classification is based on text similarity, not semantics.** A substantially rewritten section may score as MODIFIED even if the underlying risk is entirely new. Conversely, a risk that is unchanged in substance but heavily reworded may fall below the match threshold and appear as a REMOVED and ADDED pair. Low-confidence MODIFIED entries (similarity 0.45–0.60) should be interpreted with particular care.

**Coverage is limited to the companies in our dataset.** Risk Diff currently covers 135 S&P 500 companies with two consecutive annual filings each. Coverage expands over time.
