How we extract risk factors, compute diffs, and interpret them. Where the deterministic pipeline ends and the AI layer begins.
AI never invents facts. It only interprets the deterministic diff.
RiskDiff has five stages. The first four are 100% deterministic — given the same source filing, they produce the same output every time, with no language model involvement. The fifth stage is an AI interpretation layer that runs strictly on top of the deterministic output, and every AI-derived value is clearly labeled on the site.
10-K filings are fetched directly from www.sec.gov using the company's CIK. Every fetch carries a User-Agent header per SEC requirements. Each filing's accession number is preserved so any claim on the site can be traced back to the exact source document.
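As an illustration of the fetch step, here is a minimal sketch that builds (but does not send) a request for one filing document. It assumes the commonly seen EDGAR archive layout under www.sec.gov; the function name `build_filing_request`, the `primary_doc` parameter, and all sample values are hypothetical and are not taken from the RiskDiff code.

```python
from urllib.request import Request

# Assumed base path for filing documents on www.sec.gov.
SEC_ARCHIVES = "https://www.sec.gov/Archives/edgar/data"


def build_filing_request(cik: str, accession: str, primary_doc: str,
                         user_agent: str) -> Request:
    """Build a request for one filing document without sending it."""
    if not user_agent.strip():
        raise ValueError("a descriptive User-Agent header is required")
    cik_path = str(int(cik))                      # drop leading zeros from the CIK
    accession_path = accession.replace("-", "")   # folder name omits the dashes
    url = f"{SEC_ARCHIVES}/{cik_path}/{accession_path}/{primary_doc}"
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder identifiers, for illustration only.
req = build_filing_request(
    cik="0000012345",
    accession="0000012345-24-000001",
    primary_doc="example-10k.htm",
    user_agent="ExampleBot admin@example.com",
)
```

Keeping the accession number as an explicit argument is what lets a downstream record point back to the exact source document.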
Risk factors live in Item 1A of the 10-K. The extractor tries four strategies in order:
item1a or riskfactor.
<a href="#id">Item 1A</a> or Risk Factors links.
<table> parents to avoid TOC false matches.
Extraction uses BeautifulSoup. No large language model is involved. If all four strategies fail, the filing is skipped; we never guess.
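The "try each strategy in order, skip on total failure" control flow can be sketched as follows. This shows only the fallback chain, not the extraction logic itself: `extract_first_match` and the two toy strategies are hypothetical stand-ins, and the real strategies would operate on a parsed BeautifulSoup tree rather than a raw string.

```python
from typing import Callable, Optional, Sequence, Tuple

# A strategy takes the filing HTML and returns extracted text, or None on failure.
Strategy = Tuple[str, Callable[[str], Optional[str]]]


def extract_first_match(html: str,
                        strategies: Sequence[Strategy]) -> Optional[Tuple[str, str]]:
    """Return (strategy_name, text) from the first strategy that succeeds."""
    for name, strategy in strategies:
        text = strategy(html)
        if text:  # None or an empty string both count as failure
            return name, text
    return None  # every strategy failed: the caller skips the filing


# Toy strategies, for illustration only.
def always_fail(html: str) -> Optional[str]:
    return None


def find_marker(html: str) -> Optional[str]:
    start = html.find("ITEM 1A")
    return html[start:] if start != -1 else None


chain = [("always_fail", always_fail), ("find_marker", find_marker)]
hit = extract_first_match("cover page ITEM 1A risks follow", chain)
miss = extract_first_match("no risk section here", chain)
```

Returning the name of the winning strategy alongside the text makes it possible to record which path produced each extraction.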
Each risk factor from the prior year is matched against the current year using TF-IDF vectorisation (unigrams + bigrams) and cosine similarity. The greedy best-first match assigns pairs above a similarity threshold; unmatched entries become ADDED or REMOVED. Matched pairs are classified UNCHANGED above similarity 0.92 and MODIFIED below it.
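The greedy best-first assignment and the 0.92 classification can be sketched over a precomputed similarity matrix. The TF-IDF and cosine step is assumed to have already produced `sims`; `MATCH_THRESHOLD = 0.30` is a hypothetical value, since the actual threshold is not stated here, and treating a similarity of exactly 0.92 as MODIFIED is likewise an assumption.

```python
from typing import Dict, List

MATCH_THRESHOLD = 0.30      # hypothetical: the real pairing threshold is not stated
UNCHANGED_THRESHOLD = 0.92  # from the description above


def greedy_match(sims: List[List[float]], n_current: int) -> Dict[str, list]:
    """sims[i][j] is the similarity of prior factor i to current factor j."""
    candidates = sorted(
        ((score, i, j)
         for i, row in enumerate(sims)
         for j, score in enumerate(row)
         if score > MATCH_THRESHOLD),
        key=lambda c: (-c[0], c[1], c[2]),  # best score first, ties broken by index
    )
    used_prior, used_current = set(), set()
    matched = []
    for score, i, j in candidates:
        if i in used_prior or j in used_current:
            continue  # one side is already paired with a better match
        used_prior.add(i)
        used_current.add(j)
        label = "UNCHANGED" if score > UNCHANGED_THRESHOLD else "MODIFIED"
        matched.append((i, j, label))
    removed = [i for i in range(len(sims)) if i not in used_prior]
    added = [j for j in range(n_current) if j not in used_current]
    return {"matched": matched, "removed": removed, "added": added}


example = greedy_match(
    [[0.97, 0.10, 0.05],
     [0.20, 0.60, 0.10],
     [0.05, 0.15, 0.12]],
    n_current=3,
)
```

Sorting with an explicit tie-break keeps the result deterministic even when two pairs have identical similarity.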
For MODIFIED entries we run a sentence-level diff using difflib.SequenceMatcher to produce a short list of reworded, added, or removed sentences.
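A minimal sketch of a sentence-level diff built on difflib.SequenceMatcher is shown below. The sentence splitter is a naive regex and an assumption of this example; the function names and the output tuple shape are also hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_diff(old: str, new: str) -> List[Tuple[str, str, str]]:
    """Return (kind, old_text, new_text) for each changed run of sentences."""
    a, b = split_sentences(old), split_sentences(new)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old_part, new_part = " ".join(a[i1:i2]), " ".join(b[j1:j2])
        if tag == "replace":
            changes.append(("reworded", old_part, new_part))
        elif tag == "delete":
            changes.append(("removed", old_part, ""))
        elif tag == "insert":
            changes.append(("added", "", new_part))
        # "equal" runs are unchanged sentences and are skipped
    return changes


diff = sentence_diff(
    "Supply may tighten. Demand may fall. Systems may fail.",
    "Supply may tighten. Systems may fail. Regulation may change.",
)
```

Because SequenceMatcher compares whole sentences for equality here, any edit inside a sentence surfaces as a "reworded" entry rather than a character-level change.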
Every change gets a deterministic severity score. The formula:
ADDED with no near-match prior: base 6 (brand-new risk factor).
ADDED with weak prior match: base 4.
REMOVED: base 4.
MODIFIED: base 2 + round((1 - similarity) × 4); bigger rewrites score higher.
Tiers: 8-10 = high, 4-7 = medium, 1-3 = low. The same input always yields the same score. No large language model involved.
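The scoring rules above translate almost directly into code. This is a sketch under stated assumptions: the `weak_prior_match` flag is a hypothetical way to separate the two ADDED cases, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for whatever rounding rule the pipeline actually uses.

```python
def severity_score(change: str, similarity: float = 0.0,
                   weak_prior_match: bool = False) -> int:
    """Deterministic base score for one diff entry."""
    if change == "ADDED":
        return 4 if weak_prior_match else 6
    if change == "REMOVED":
        return 4
    if change == "MODIFIED":
        # Lower similarity means a bigger rewrite and a higher score.
        return 2 + round((1 - similarity) * 4)
    raise ValueError(f"no severity rule for change type {change!r}")


def tier(score: int) -> str:
    """Map a score to its tier: 8-10 high, 4-7 medium, 1-3 low."""
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


# A MODIFIED entry at similarity 0.5 scores 2 + round(2.0) = 4.
half_rewritten = severity_score("MODIFIED", similarity=0.5)
```

The function has no hidden state, so the same inputs always give the same score.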
A single call to Claude Haiku produces a strict JSON object with four fields:
more_concerning, less_concerning, mixed, neutral.
The AI only ever sees filing text that has already been extracted and diffed deterministically. The AI cannot add, remove, or rephrase a risk factor; it can only summarise, label direction, tag themes, and grade business impact. If the structured-output call fails, the AI layer is omitted; the deterministic pipeline still renders the full page.
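The "omit the AI layer on failure" behaviour can be sketched as a strict validator that returns None for anything malformed. The four field names used here (`summary`, `direction`, `themes`, `business_impact`) are hypothetical, inferred only from the four tasks described above; the allowed direction values are the ones listed.

```python
import json
from typing import Optional

ALLOWED_DIRECTIONS = {"more_concerning", "less_concerning", "mixed", "neutral"}
# Hypothetical field names: the actual schema is not shown here.
REQUIRED_FIELDS = {"summary", "direction", "themes", "business_impact"}


def parse_ai_brief(raw: str) -> Optional[dict]:
    """Return the validated brief, or None so the caller can omit the AI layer."""
    try:
        data = json.loads(raw)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(data, dict) or set(data) != REQUIRED_FIELDS:
        return None
    direction = data["direction"]
    if not isinstance(direction, str) or direction not in ALLOWED_DIRECTIONS:
        return None
    return data


good = parse_ai_brief(
    '{"summary": "s", "direction": "mixed", "themes": [], "business_impact": "low"}'
)
bad = parse_ai_brief('{"direction": "alarming"}')
```

Rejecting unknown or missing fields outright, rather than patching them, keeps a failed AI call from leaking partial output onto the page.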
On every company page, every AI-derived value carries an ⚠ AI badge. Deterministic values (extracted text, sentence diffs, confidence scores, the deterministic severity baseline) do not.
Stages 1-4 are reproducible from source filings alone: given the same SEC documents, anyone running the pipeline will get the same diff and the same deterministic severity scores. Stage 5 is reproducible only up to the model's own determinism; we persist the AI output to JSON so the site renders the exact same brief on every regeneration until we choose to refresh it.
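Persisting the AI output so that regeneration reuses it can be sketched as a load-or-generate helper. The function name, the `refresh` flag, and the file layout are assumptions made for illustration, not the actual RiskDiff storage code.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_generate_brief(cache_path: Path, generate: Callable[[], dict],
                           refresh: bool = False) -> dict:
    """Reuse the persisted brief unless it is missing or a refresh is requested."""
    if cache_path.exists() and not refresh:
        return json.loads(cache_path.read_text(encoding="utf-8"))
    brief = generate()  # the only place the model would be called
    cache_path.write_text(
        json.dumps(brief, indent=2, sort_keys=True), encoding="utf-8"
    )
    return brief
```

With this shape, the model's non-determinism only matters at the moment a brief is first generated or deliberately refreshed; every later render reads the stored JSON.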
Each company page has an index.md (Markdown) and an index.json alternate. The JSON includes every diff entry with its full deterministic severity object and any AI severity bump. The MCP server at https://riskdiff.com/api/mcp exposes the same data programmatically for AI agents and pipelines.