Methodology

How we extract risk factors, compute diffs, and interpret them. Where the deterministic pipeline ends and the AI layer begins.

Last updated 2026-07-05

AI never invents facts. It only interprets the deterministic diff.

Pipeline overview

RiskDiff has five stages. The first four are fully deterministic: given the same source filing, they produce the same output every time, with no language model involvement. The fifth stage is an AI interpretation layer that runs strictly on top of the deterministic output, and every AI-derived value is clearly labelled on the site.

Stage 1 — EDGAR fetch

Deterministic

SEC EDGAR HTTP fetch

10-K filings are fetched directly from www.sec.gov using the company's CIK. Every fetch carries a User-Agent header per SEC requirements. Each filing's accession number is preserved so any claim on the site can be traced back to the exact source document.
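
As a rough illustration of the fetch step, the sketch below builds an archive URL and attaches an identifying User-Agent header using only the standard library. The helper names (`filing_url`, `build_request`), the placeholder CIK, accession number, and document name, and the contact string are all hypothetical; the URL layout follows the public EDGAR archive convention and is not taken from the RiskDiff code.

```python
from urllib.request import Request

SEC_ARCHIVES = "https://www.sec.gov/Archives/edgar/data"


def filing_url(cik: str, accession: str, primary_doc: str) -> str:
    """Build the archive URL for one document in a filing."""
    cik_path = str(int(cik))               # archive paths use the CIK without leading zeros
    acc_path = accession.replace("-", "")  # folder name is the accession number minus dashes
    return f"{SEC_ARCHIVES}/{cik_path}/{acc_path}/{primary_doc}"


def build_request(url: str, user_agent: str) -> Request:
    """Attach the identifying User-Agent header that the SEC asks for."""
    if not user_agent.strip():
        raise ValueError("a non-empty User-Agent is required")
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder values for illustration only (not a real filing).
url = filing_url("0001234567", "0001234567-24-000001", "example-10k.htm")
req = build_request(url, "ExampleCo research admin@example.com")
# urllib.request.urlopen(req) would perform the actual fetch.
```

Keeping the dashed accession number alongside the fetched document is what lets a later stage cite the exact source filing.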

Stage 2 — Item 1A extraction

Deterministic

4-strategy locator

Risk factors live in Item 1A of the 10-K. The extractor tries four strategies in order:

Anchor matching — named IDs like item1a or riskfactor.
TOC href — <a href="#id">Item 1A</a> or Risk Factors links.
Block element scan — skip <table> parents to avoid TOC false matches.
Flat-div iXBRL — standalone tables or body-level divs with a "Risk Factors" heading (covers Intel, GE Aerospace, and similar non-standard layouts).

Extraction uses BeautifulSoup. No large language model is involved. If all four strategies fail, the filing is skipped — we never guess.
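
The "try in order, skip on total failure" behaviour can be sketched as a simple fallback chain. To stay dependency-free, this sketch uses regular expressions as stand-ins for the first two strategies; the actual extractor uses BeautifulSoup, and the function names, patterns, and offset-based return value here are assumptions made for illustration.

```python
import re
from typing import Callable, Optional

Strategy = Callable[[str], Optional[int]]


def by_anchor_id(html: str) -> Optional[int]:
    """Strategy 1 stand-in: an element whose id looks like item1a or riskfactor."""
    m = re.search(r'id="[^"]*(item_?1a|risk_?factor)[^"]*"', html, re.IGNORECASE)
    return m.start() if m else None


def by_toc_href(html: str) -> Optional[int]:
    """Strategy 2 stand-in: follow a TOC link labelled Item 1A or Risk Factors."""
    link = re.search(
        r'<a\s+href="#([^"]+)"[^>]*>\s*(?:Item\s+1A|Risk\s+Factors)',
        html,
        re.IGNORECASE,
    )
    if not link:
        return None
    target = re.search(r'id="' + re.escape(link.group(1)) + '"', html)
    return target.start() if target else None


def locate_item_1a(html: str, strategies: list[Strategy]) -> Optional[int]:
    """Return the offset from the first strategy that succeeds, else None."""
    for strategy in strategies:
        offset = strategy(html)
        if offset is not None:
            return offset
    return None  # every strategy failed: the caller skips the filing


STRATEGIES = [by_anchor_id, by_toc_href]  # the chain described above has four entries
```

Returning `None` rather than a best guess mirrors the rule that a filing is skipped when no strategy finds Item 1A.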

Stage 3 — TF-IDF similarity diff

Deterministic

scikit-learn cosine matching

Each risk factor from the prior year is matched against the current year using TF-IDF vectorisation (unigrams + bigrams) and cosine similarity. A greedy best-first pass pairs the most similar entries first and accepts only pairs above a similarity threshold; unmatched current-year entries become ADDED and unmatched prior-year entries become REMOVED. Matched pairs are classified UNCHANGED above similarity 0.92 and MODIFIED below it.
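
A minimal sketch of the greedy best-first assignment, assuming the cosine similarities have already been computed into a matrix (in the pipeline these would come from the TF-IDF vectors). The function name, the `MATCH_THRESHOLD` value of 0.30, and the treatment of a score of exactly 0.92 as MODIFIED are assumptions; only the 0.92 cut-off comes from the text above.

```python
UNCHANGED_ABOVE = 0.92  # stated in the text above
MATCH_THRESHOLD = 0.30  # hypothetical: the actual threshold is not stated


def greedy_match(sim, threshold=MATCH_THRESHOLD):
    """Pair prior rows with current columns, highest similarity first.

    sim[i][j] is the similarity between prior entry i and current entry j.
    """
    candidates = sorted(
        ((score, i, j)
         for i, row in enumerate(sim)
         for j, score in enumerate(row)
         if score > threshold),
        key=lambda c: -c[0],
    )
    used_prior, used_current, pairs = set(), set(), []
    for score, i, j in candidates:
        if i in used_prior or j in used_current:
            continue  # one side is already taken by a better match
        used_prior.add(i)
        used_current.add(j)
        status = "UNCHANGED" if score > UNCHANGED_ABOVE else "MODIFIED"
        pairs.append((i, j, score, status))
    n_current = len(sim[0]) if sim else 0
    removed = [i for i in range(len(sim)) if i not in used_prior]
    added = [j for j in range(n_current) if j not in used_current]
    return pairs, removed, added


sim = [
    [0.97, 0.10, 0.05],  # prior 0 is nearly identical to current 0
    [0.20, 0.64, 0.08],  # prior 1 was reworded into current 1
    [0.12, 0.15, 0.02],  # prior 2 has no counterpart
]
pairs, removed, added = greedy_match(sim)
```

With this toy matrix, prior entries 0 and 1 are paired, prior entry 2 is reported as removed, and current entry 2 as added.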

For MODIFIED entries we run a sentence-level diff using difflib.SequenceMatcher to produce a short list of reworded, added, or removed sentences.
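
One way such a sentence-level diff could look, using `difflib.SequenceMatcher` over lists of sentences. The naive regex sentence splitter and the "reworded / added / removed" labelling rule (pairing sentences positionally inside a replaced block) are assumptions for illustration, not the pipeline's actual logic.

```python
import re
from difflib import SequenceMatcher
from itertools import zip_longest


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentence_diff(old: str, new: str) -> list[tuple[str, str, str]]:
    """Return (kind, old_sentence, new_sentence) for every changed sentence."""
    a, b = split_sentences(old), split_sentences(new)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Pair sentences positionally; leftovers on either side are pure
        # removals or additions. This covers replace, delete, and insert.
        for o, n in zip_longest(a[i1:i2], b[j1:j2]):
            if o is None:
                changes.append(("added", "", n))
            elif n is None:
                changes.append(("removed", o, ""))
            else:
                changes.append(("reworded", o, n))
    return changes


old = "We rely on a single supplier. Demand may fall. Our systems could be breached."
new = ("We rely on a small number of suppliers. Demand may fall. "
       "Our systems could be breached. New tariffs may raise costs.")
changes = sentence_diff(old, new)
```

Here the first sentence is reported as reworded and the final sentence as added, while the two unchanged sentences are omitted.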

Stage 4 — Deterministic severity baseline

Deterministic

Rule-based 1-10 score

Every change gets a deterministic severity score. The formula:

ADDED with no near-match prior: base 6 — brand-new risk factor.
ADDED with weak prior match: base 4.
REMOVED: base 4.
MODIFIED: base 2 + round((1 - similarity) × 4) — bigger rewrites score higher.
+1 per signals.py keyword hit (AI, tariffs, China, cyber, etc.), capped at +3.
+1 if length delta exceeds 500 characters.
Clamped to [1, 10]. UNCHANGED is excluded from the severity-sorted view.

Tiers: 8-10 = high, 4-7 = medium, 1-3 = low. The same input always yields the same score. No large language model involved.
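
The scoring rules above can be written out as a short function. This is a sketch under stated assumptions: the keyword-hit count and the "weak prior match" flag are taken as inputs rather than computed, the length delta is treated as an absolute character difference, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for whatever rounding the pipeline uses.

```python
def severity(kind, similarity=None, has_weak_prior=False,
             keyword_hits=0, length_delta=0):
    """Deterministic 1-10 severity for one diff entry."""
    if kind == "ADDED":
        base = 4 if has_weak_prior else 6
    elif kind == "REMOVED":
        base = 4
    elif kind == "MODIFIED":
        # Assumption: built-in round(); exact halves round to even.
        base = 2 + round((1 - similarity) * 4)
    else:
        raise ValueError(f"no severity for kind {kind!r}")
    score = base + min(keyword_hits, 3)  # +1 per hit, capped at +3
    if abs(length_delta) > 500:          # assumption: absolute delta in characters
        score += 1
    return max(1, min(10, score))        # clamp to [1, 10]


def tier(score):
    """Map a score to the high / medium / low tiers."""
    if score >= 8:
        return "high"
    if score >= 4:
        return "medium"
    return "low"


# Worked example: a MODIFIED entry at similarity 0.60 with two keyword
# hits and a 650-character length change.
example = severity("MODIFIED", similarity=0.60, keyword_hits=2, length_delta=650)
```

For the worked example, the base is 2 + round(0.4 × 4) = 4, the two keyword hits add 2, and the length delta adds 1, giving 7, which falls in the medium tier.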

Stage 5 — AI interpretation layer

⚠ AI-derived — clearly labelled on every page

Claude Haiku, structured output

A single call to Claude Haiku produces a strict JSON object with four fields:

executive_summary — one sentence (≤25 words) in plain business language.
direction — one of more_concerning, less_concerning, mixed, neutral.
top_themes — up to 3 themes, drawn from the deterministic signals taxonomy when applicable.
ai_severity_bumps — for up to 10 of the highest deterministic-scored changes, an optional 0-3 bump reflecting AI's judgement of business impact, plus a one-sentence rationale.

The AI sees filing text only after it has been extracted and diffed deterministically. It cannot add, remove, or rephrase a risk factor; it can only summarise, label direction, tag themes, and grade business impact. If the structured-output call fails, the AI layer is omitted and the deterministic pipeline still renders the full page.
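
A sketch of how the four-field object might be validated before it is accepted, with any failure resulting in the AI layer being dropped. The four top-level field names and their limits come from the list above; the inner keys `bump` and `rationale`, and the function name, are hypothetical because the exact shape of each `ai_severity_bumps` item is not specified here.

```python
import json

DIRECTIONS = {"more_concerning", "less_concerning", "mixed", "neutral"}


def parse_ai_brief(raw: str):
    """Return the validated brief, or None so the caller can omit the AI layer."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    summary = data.get("executive_summary")
    direction = data.get("direction")
    themes = data.get("top_themes")
    bumps = data.get("ai_severity_bumps")
    ok = (
        isinstance(summary, str) and 0 < len(summary.split()) <= 25
        and isinstance(direction, str) and direction in DIRECTIONS
        and isinstance(themes, list) and len(themes) <= 3
        and isinstance(bumps, list) and len(bumps) <= 10
        and all(
            isinstance(b, dict)
            and type(b.get("bump")) is int and 0 <= b["bump"] <= 3  # hypothetical key
            and isinstance(b.get("rationale"), str)                 # hypothetical key
            for b in bumps
        )
    )
    return data if ok else None
```

Returning `None` instead of raising keeps the failure path simple: the page renderer can check for a missing brief and fall back to deterministic output alone.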

On every company page, every AI-derived value carries an ⚠ AI badge. Deterministic values (extracted text, sentence diffs, confidence scores, the deterministic severity baseline) do not.

Reproducibility

Stages 1-4 are reproducible from source filings alone: given the same SEC documents, anyone running the pipeline will get the same diff and the same deterministic severity scores. Stage 5 is reproducible only up to the model's own determinism; we persist the AI output to JSON so the site renders the exact same brief on every regeneration until we choose to refresh it.
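
The persistence idea can be sketched as a small load-or-generate helper: the model is called only when no stored brief exists or a refresh is explicitly requested. The function name, the `refresh` flag, and the file layout are assumptions for illustration.

```python
import json
from pathlib import Path


def load_or_generate(path: Path, generate, refresh: bool = False) -> dict:
    """Reuse the stored brief unless a refresh is requested."""
    if path.exists() and not refresh:
        return json.loads(path.read_text(encoding="utf-8"))
    brief = generate()  # the only place the model would be called
    path.write_text(json.dumps(brief, indent=2, sort_keys=True), encoding="utf-8")
    return brief
```

Because regeneration reads the stored file, the rendered brief stays byte-for-byte stable between site builds even though the model itself may not be deterministic.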

Open data

Each company page has an index.md (Markdown) alternate and an index.json alternate. The JSON includes every diff entry with its full deterministic severity object and any AI severity bump. The MCP server at https://riskdiff.com/api/mcp exposes the same data programmatically for AI agents and pipelines.