How Risk Diff extracts, matches, and classifies SEC 10-K risk factor changes.
All filings are fetched directly from SEC EDGAR via the EDGAR full-text search API. Every request carries the identifying User-Agent header that SEC fair-access policy requires. No third-party data providers are used. Every claim on this site is traceable to a specific filing on EDGAR.
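As a rough illustration of the fetch step, the sketch below builds (but does not send) a search request with an identifying User-Agent header. The endpoint URL, the `q` and `forms` query parameters, and the function name are assumptions made for this example, not details confirmed by Risk Diff; check them against current SEC documentation before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint for EDGAR full-text search (not confirmed by the text above).
EDGAR_FTS_URL = "https://efts.sec.gov/LATEST/search-index"


def build_edgar_request(query: str, user_agent: str, base_url: str = EDGAR_FTS_URL) -> Request:
    """Build, without sending, a GET request for 10-K filings matching `query`."""
    if not user_agent.strip():
        # SEC fair-access policy expects callers to identify themselves.
        raise ValueError("an identifying User-Agent is required")
    params = urlencode({"q": query, "forms": "10-K"})  # parameter names are assumptions
    return Request(f"{base_url}?{params}", headers={"User-Agent": user_agent})


# Hypothetical contact string; a real one should name the operator and an email address.
req = build_edgar_request('"risk factors"', "Example Org admin@example.com")
```

Passing `req` to `urllib.request.urlopen` would perform the actual call; that step is left out here so the sketch stays free of network access.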
Each 10-K filing contains a section called Item 1A — Risk Factors. Locating it is non-trivial: filings vary significantly in structure across companies and years, from clean HTML to deeply nested iXBRL documents.
Risk Diff tries four strategies in order, stopping at the first that succeeds:
- […] `id` or `name` attributes matching patterns like item1a, item-1a, or riskfactor.
- […] `href` to the target anchor. Handles iXBRL EDGAR filings and non-standard layouts.
- […] `<table>` tags to avoid false matches on TOC rows.

If none of the four strategies succeed, the filing is omitted rather than guessed. Risk Diff degrades gracefully: uncertain extractions are excluded, not approximated.
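The "try each strategy in order, stop at the first success, otherwise omit" behaviour can be sketched as a simple chain. Only the anchor-attribute idea comes from the text above; `find_by_heading` is a hypothetical stand-in for the other strategies, and all function names, the regular expressions, and the offset-based return value are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

# Patterns taken from the examples in the text; the real list may be longer.
ANCHOR_RE = re.compile(
    r'<[^>]*\b(?:id|name)\s*=\s*["\']([^"\']*(?:item-?1a|riskfactor)[^"\']*)["\']',
    re.IGNORECASE,
)


def find_by_anchor(html: str) -> Optional[int]:
    """Offset of the first tag whose id/name attribute looks like Item 1A."""
    m = ANCHOR_RE.search(html)
    return m.start() if m else None


def find_by_heading(html: str) -> Optional[int]:
    """Hypothetical fallback: offset of a literal 'Item 1A' in the text."""
    m = re.search(r"item\s+1a\b", html, re.IGNORECASE)
    return m.start() if m else None


Strategy = Callable[[str], Optional[int]]
STRATEGIES: list[tuple[str, Strategy]] = [
    ("anchor", find_by_anchor),
    ("heading", find_by_heading),
]


def locate_item_1a(html: str) -> Optional[tuple[str, int]]:
    """Return (strategy name, offset) from the first strategy that succeeds."""
    for name, strategy in STRATEGIES:
        offset = strategy(html)
        if offset is not None:
            return name, offset
    return None  # no strategy succeeded: omit the filing rather than guess
```

Returning `None` instead of a best-effort guess keeps the "excluded, not approximated" rule explicit at the call site.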
Once extracted, risk factor sections from two consecutive filings are matched using TF-IDF cosine similarity (via scikit-learn). This is a deterministic, reproducible algorithm with no LLM dependency for the core matching step.
Each section from the prior year is compared against all sections from the current year. The highest-scoring pair above the match threshold (0.45) is selected. Sections left unmatched become REMOVED (prior year) or ADDED (current year).
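To make the matching step concrete, here is a dependency-free sketch of TF-IDF cosine matching. It is not Risk Diff's code: the site uses scikit-learn, whereas this version reimplements the idea with the standard library, borrowing scikit-learn's default smoothed IDF formula and two-character token rule. The function names, the shared vocabulary across both filings, the inclusive `>=` threshold, and allowing several prior sections to pick the same current section are all assumptions.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"\b\w\w+\b")  # words of two or more characters


def tfidf_vectors(docs: list[str]) -> list[dict[str, float]]:
    """L2-normalised TF-IDF vectors with smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    counts = [Counter(TOKEN_RE.findall(d.lower())) for d in docs]
    n = len(docs)
    df = Counter(term for c in counts for term in c)  # document frequency per term
    vectors = []
    for c in counts:
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def match_sections(prior: list[str], current: list[str], threshold: float = 0.45):
    """Return (matches, removed prior indices, added current indices)."""
    vecs = tfidf_vectors(prior + current)  # one vocabulary for both filings
    p_vecs, c_vecs = vecs[: len(prior)], vecs[len(prior):]
    matches, used = [], set()
    for i, pv in enumerate(p_vecs):
        scores = [cosine(pv, cv) for cv in c_vecs]
        if not scores:
            continue
        j = max(range(len(scores)), key=scores.__getitem__)  # best current section
        if scores[j] >= threshold:
            matches.append((i, j, scores[j]))
            used.add(j)
    matched_prior = {i for i, _, _ in matches}
    removed = [i for i in range(len(prior)) if i not in matched_prior]
    added = [j for j in range(len(current)) if j not in used]
    return matches, removed, added


prior = [
    "Our supply chain depends on a limited number of suppliers in Asia.",
    "Changes in foreign currency exchange rates may reduce our revenue.",
]
current = [
    "Our supply chain depends on a limited number of suppliers in Asia.",
    "New regulations on artificial intelligence could increase compliance costs.",
]
matches, removed, added = match_sections(prior, current)
```

In this toy input the first prior section pairs with the identical current section, while the currency section has no successor and the AI-regulation section has no predecessor.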
| Status | Condition | Meaning |
|---|---|---|
| ADDED | No match above 0.45 in prior year | New risk section with no close predecessor |
| REMOVED | No match above 0.45 in current year | Prior section with no close successor |
| MODIFIED | Similarity 0.45 – 0.92 | Matched pair with meaningful text changes |
| UNCHANGED | Similarity ≥ 0.92 | Matched pair with essentially identical text |
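The status table can be read as a small decision function over a section's best similarity score. The sketch below is one possible reading: the function and constant names are hypothetical, and treating a score of exactly 0.45 as a match is an assumption, since the table does not say which side of that boundary is inclusive.

```python
MATCH_THRESHOLD = 0.45      # below this, the section counts as unmatched
UNCHANGED_THRESHOLD = 0.92  # at or above this, the pair counts as identical


def classify(best_similarity: float, filing: str) -> str:
    """Status for a section given its best match score.

    `filing` is "prior" or "current" and only matters for unmatched sections.
    """
    if best_similarity >= UNCHANGED_THRESHOLD:
        return "UNCHANGED"
    if best_similarity >= MATCH_THRESHOLD:  # inclusive lower bound is an assumption
        return "MODIFIED"
    return "ADDED" if filing == "current" else "REMOVED"
```

For example, `classify(0.30, "prior")` yields `"REMOVED"` because the prior-year section has no successor above the threshold.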
MODIFIED entries also carry a confidence level based on their similarity score:
| Confidence | Similarity score |
|---|---|
| High | ≥ 0.75 |
| Medium | 0.60 – 0.75 |
| Low | 0.45 – 0.60 |
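The confidence bands translate directly into a lookup on the similarity score. In this sketch the function name is hypothetical, and assigning the shared boundary values 0.60 and 0.75 to the higher band is an assumption, because the table lists both bands as touching those values.

```python
from typing import Optional


def modified_confidence(similarity: float) -> Optional[str]:
    """Confidence label for a MODIFIED pair, or None outside the MODIFIED range."""
    if not 0.45 <= similarity < 0.92:
        return None  # unmatched or UNCHANGED: no confidence level applies
    if similarity >= 0.75:
        return "High"
    if similarity >= 0.60:  # boundary values go to the higher band (assumption)
        return "Medium"
    return "Low"
```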
Each company page includes a short narrative summary generated by Claude Haiku (Anthropic). These summaries are clearly labeled and should be treated as interpretive aids, not factual claims. All other content — titles, section text, classifications, counts — is deterministically extracted from the source filing.
Every company page is available in three formats:
- HTML: /aapl/2025-vs-2024/
- Markdown: /aapl/2025-vs-2024/index.md (includes YAML frontmatter)
- JSON: /aapl/2025-vs-2024/index.json (structured risk entries with scores)

All formats are listed in llms.txt and the sitemap. The HTML pages advertise alternate formats via `<link rel="alternate">` tags.
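For scripted access, the three paths for a given comparison can be derived from one base path. The helper below generalises from the single AAPL example above, so the lowercase-ticker and `{current}-vs-{prior}` pattern is an assumption, and the function name is hypothetical.

```python
def page_formats(ticker: str, current_year: int, prior_year: int) -> dict[str, str]:
    """Site-relative paths for the HTML, Markdown, and JSON versions of a page."""
    base = f"/{ticker.lower()}/{current_year}-vs-{prior_year}/"  # pattern inferred from the example
    return {
        "html": base,
        "markdown": base + "index.md",
        "json": base + "index.json",
    }


paths = page_formats("AAPL", 2025, 2024)
```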