danielhuber.dev@proton.me Saturday, April 4, 2026

DeepFact: Evolving Benchmarks and Verification Agents for Research Factuality

How to build document-level factuality verification agents for deep research outputs, and why benchmark labels need to be explicitly revisable.


Deep research agents — systems that autonomously browse, synthesize, and write multi-page reports — create a new factuality problem that traditional claim-level fact-checking barely touches. A report can be internally consistent, fluently written, and still silently contradict its own cited sources or introduce plausible-sounding claims with no grounding at all. Evaluating this reliably requires both a specialized verification architecture and benchmarks that can keep pace with improving agents.

Why Sentence-Level Fact-Checking Breaks Down for Research Reports

Most factuality pipelines decompose text into atomic claims and verify each one independently against a retrieval corpus. That works well for short, self-contained statements like Wikipedia edits. Research reports are a different animal: a single paragraph may weave together ten sources, with each sentence’s accuracy depending on how earlier sentences set up context. A claim like “the adoption rate doubled compared to the previous year” is unverifiable without knowing which year, which metric, and which source established the baseline — context that lives pages earlier in the same document.

Document-level verification must therefore treat the report as a graph of inter-dependent claims, trace provenance chains across multiple cited sources, and distinguish three failure modes that look similar on the surface:

  • Hallucination: a claim with no supporting source at all
  • Misattribution: a claim that cites a real source but misrepresents what it says
  • Unsupported inference: a logical leap beyond what the cited sources actually establish

A verification agent that cannot distinguish these three types produces signals too coarse to be actionable for engineers trying to improve the underlying research agent.
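The distinction between the three failure modes can be captured in a small decision rule. The sketch below is illustrative, assuming the verifier has already determined whether a claim carries a citation and how the retrieved source relates to it; the function name and inputs are hypothetical, not part of any published DeepFact API.

```python
from enum import Enum
from typing import Optional

class FailureMode(Enum):
    HALLUCINATION = "hallucination"                    # no supporting source at all
    MISATTRIBUTION = "misattribution"                  # cites a source but misrepresents it
    UNSUPPORTED_INFERENCE = "unsupported_inference"    # leap beyond what sources establish

def classify_failure(has_citation: bool,
                     source_contradicts: bool,
                     source_only_partially_supports: bool) -> Optional[FailureMode]:
    """Map verifier observations onto the three failure modes.

    Returns None when the claim is fully supported.
    """
    if not has_citation:
        return FailureMode.HALLUCINATION
    if source_contradicts:
        return FailureMode.MISATTRIBUTION
    if source_only_partially_supports:
        return FailureMode.UNSUPPORTED_INFERENCE
    return None
```

The ordering matters: a claim with no citation is a hallucination regardless of what any source says, so that check comes first.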

The Audit-then-Score Benchmark Design

A persistent problem in agent evaluation is that benchmarks go stale faster than the agents do. When a research agent reaches human-level performance on a fixed dataset, the natural response is to build a harder dataset — but that response is slow, expensive, and introduces distributional shift between benchmark versions.

The Audit-then-Score (AtS) approach addresses this differently: benchmark labels are treated as explicitly revisable hypotheses rather than ground truth. After a new agent generation runs against the benchmark, an audit phase re-examines cases where the agent’s verdict disagrees with the stored label. A human or automated auditor then decides whether the label was wrong, the agent was wrong, or the case is genuinely ambiguous. Confirmed label errors get corrected; confirmed agent errors become harder negatives in the next training cycle.
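One audit pass can be sketched as follows. This is a minimal illustration of the loop described above, not the paper's implementation: the `auditor` callback stands in for either a human reviewer or an automated audit model, and the three ruling strings are assumed names.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BenchmarkCase:
    case_id: str
    label: str   # stored factuality label, treated as a revisable hypothesis

def audit_disagreements(cases, agent_verdicts, auditor):
    """One Audit-then-Score pass over cases where agent and label disagree.

    `auditor(case, verdict)` returns "label_wrong", "agent_wrong", or "ambiguous".
    Returns (corrected_cases, hard_negative_ids, ambiguous_ids).
    """
    corrected, hard_negatives, ambiguous = [], [], []
    for case in cases:
        verdict = agent_verdicts[case.case_id]
        if verdict == case.label:
            corrected.append(case)                          # agreement: keep label
            continue
        ruling = auditor(case, verdict)
        if ruling == "label_wrong":
            corrected.append(replace(case, label=verdict))  # adopt agent's verdict
        elif ruling == "agent_wrong":
            corrected.append(case)
            hard_negatives.append(case.case_id)             # harder negative next cycle
        else:
            corrected.append(case)
            ambiguous.append(case.case_id)                  # hold for expert review
    return corrected, hard_negatives, ambiguous
```

Note that agreements are never audited: the loop spends reviewer budget only on the disagreement set, which is exactly where label errors concentrate.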

Note

AtS flips the typical relationship between benchmark and model. Instead of a fixed benchmark judging a changing model, the benchmark and the agent co-evolve: each agent generation stress-tests the labels, and each label correction stress-tests the next agent generation.

This matters practically because research factuality benchmarks are expensive to construct — they require expert annotators who can actually read and understand the source documents. Amortizing that expert effort across multiple agent generations, rather than discarding labels when a model surpasses them, is a more sustainable infrastructure pattern.

DeepFact-Eval: A Document-Level Verification Agent

A verification agent for research reports needs to operate at a different granularity than a sentence-level classifier. The core pipeline looks roughly like this:

Research Report
          │
          ▼
┌─────────────────────┐
│  Claim Decomposer   │  ← identifies verifiable assertions
└─────────┬───────────┘       with their document context
          │
          ▼
┌─────────────────────┐
│  Provenance Tracer  │  ← maps claims to cited source passages
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Source Retriever   │  ← fetches and chunks actual source docs
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Entailment Judge   │  ← rates support: full / partial / none
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  Report Aggregator  │  ← document-level verdict + evidence
└─────────────────────┘

The Claim Decomposer is more than a simple sentence splitter. It preserves the surrounding context that makes a claim interpretable and tags each claim with its citation anchors. The Provenance Tracer resolves those anchors to actual source URLs or document identifiers, which the Source Retriever then fetches and chunks for comparison. The Entailment Judge — typically an LLM prompted with both the claim-in-context and the retrieved passage — assigns a support rating. Finally, the Report Aggregator combines per-claim ratings into a document-level factuality score, weighting by claim significance rather than treating every sentence equally.
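The five stages compose naturally as a function pipeline. The sketch below passes each stage in as a callable so the control flow is visible without committing to any particular model; the stage signatures, the `significance` field, and the rating weights are all assumptions for illustration.

```python
from typing import Callable

RATING_WEIGHTS = {"full": 1.0, "partial": 0.5, "none": 0.0}

def verify_report(report: str,
                  decompose: Callable,   # Claim Decomposer
                  trace: Callable,       # Provenance Tracer
                  retrieve: Callable,    # Source Retriever
                  judge: Callable) -> dict:
    """Run each claim through tracer -> retriever -> judge, then aggregate.

    Aggregation weights each claim by a 'significance' field rather than
    treating every sentence equally.
    """
    claims = decompose(report)
    total, weight_sum = 0.0, 0.0
    for claim in claims:
        passages = [retrieve(ref) for ref in trace(claim)]
        rating = judge(claim, passages)
        w = claim.get("significance", 1.0)
        total += w * RATING_WEIGHTS[rating]
        weight_sum += w
    score = total / weight_sum if weight_sum else 0.0
    return {"score": score, "n_claims": len(claims)}
```

Injecting the stages as callables also makes each one independently testable, which matters when the Entailment Judge is an LLM call you want to stub out in unit tests.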

Tip

For production deployments, the Source Retriever step is the most fragile. URLs in research reports go dead, end up behind paywalls, or redirect to updated versions. Build in a fallback that queries a web search or cached document store when direct retrieval fails, and flag claims whose sources could not be retrieved separately from claims whose sources were retrieved but don’t support the claim.
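A minimal sketch of that fallback logic, assuming a `fetch` function that raises on failure and a cache-or-search lookup with a dict-like `get`; the status strings are illustrative, chosen to keep "couldn't retrieve" distinct from "retrieved but unsupported":

```python
def retrieve_with_fallback(url, fetch, search_cache):
    """Try direct fetch; on failure fall back to a cached or searched copy.

    Returns (passage, status). A None passage with status 'source_unavailable'
    lets downstream stages flag unreachable sources separately from
    claims whose sources were retrieved but don't support them.
    """
    try:
        return fetch(url), "retrieved"
    except Exception:
        cached = search_cache.get(url)   # cached document store or search result
        if cached is not None:
            return cached, "retrieved_via_fallback"
        return None, "source_unavailable"
```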

Engineering Considerations for Production Verification Pipelines

Running a document-level verification agent in production introduces latency and cost pressures that don’t exist in offline evaluation. A few patterns help:

Lazy verification by claim risk. Not all claims in a research report carry the same factuality risk. Numerical claims, causal claims, and attributed quotes are high-risk; scene-setting or transitional sentences are low-risk. A risk classifier as the first step lets you skip full provenance tracing for the majority of sentences, spending verification budget only where it matters.
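A first-pass risk classifier can be as cheap as a few regexes before escalating to a learned model. The patterns below are a rough illustration of the high-risk categories named above (numbers, quotes, causal language), not a production-grade classifier:

```python
import re

HIGH_RISK_PATTERNS = [
    re.compile(r"\d"),                                   # numerical claims
    re.compile(r'["\u201c].+?["\u201d]'),                # attributed quotes
    re.compile(r"\b(because|caused|led to|due to|resulted in)\b", re.I),  # causal claims
]

def is_high_risk(sentence: str) -> bool:
    """Cheap first pass: route only high-risk sentences to full verification."""
    return any(p.search(sentence) for p in HIGH_RISK_PATTERNS)
```

Sentences that fail every pattern skip provenance tracing entirely, which is where the verification budget savings come from.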

Caching source retrievals. Research agents querying overlapping topics will repeatedly cite the same sources. A shared document cache keyed by URL and retrieval timestamp avoids redundant fetches and makes verification latency predictable.
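A minimal version of such a cache, keyed by URL with a freshness window standing in for the retrieval-timestamp key (the TTL value is an arbitrary example):

```python
import time

class SourceCache:
    """URL-keyed document cache with a freshness window (TTL in seconds)."""

    def __init__(self, ttl_seconds=24 * 3600):
        self.ttl = ttl_seconds
        self._store = {}   # url -> (fetched_at, document)

    def get_or_fetch(self, url, fetch):
        entry = self._store.get(url)
        now = time.time()
        if entry and now - entry[0] < self.ttl:
            return entry[1]                 # fresh cached copy, no network hit
        doc = fetch(url)                    # miss or stale: refetch and store
        self._store[url] = (now, doc)
        return doc
```

In a real deployment this would live in a shared store (e.g. Redis) rather than process memory, so that concurrent verification runs citing the same sources share one fetch.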

Structured verdicts for downstream use. A binary hallucinated/not-hallucinated label is rarely actionable. Emit structured verdicts that include the specific claim, the source passage that was checked, the support rating, and the failure mode type. This output feeds both human review workflows and automated reranking loops that can teach the research agent which source-citation patterns it gets wrong most often.

from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class ClaimVerdict:
    claim_text: str
    claim_context: str          # surrounding sentences for interpretability
    source_url: str
    retrieved_passage: str
    support_rating: Literal["full", "partial", "none", "source_unavailable"]
    failure_mode: Optional[Literal["hallucination", "misattribution", "unsupported_inference"]]
    confidence: float           # verifier's own confidence in its judgment

Connecting Verification Back to the Research Agent

A verification agent is most valuable when its output closes a feedback loop rather than just producing a post-hoc score. The practical integration points are:

  • Training signal: Claims rated as misattributed or hallucinated become negative examples for fine-tuning the research agent’s synthesis step.
  • Retrieval reranking: If the verifier consistently finds that certain source types (e.g., secondary summaries vs. primary papers) produce more unsupported inferences, the research agent’s retrieval strategy can down-rank those source types.
  • User-facing confidence: A document-level factuality score surfaced to end users sets appropriate expectations and flags reports that warrant human expert review before being acted upon.
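The retrieval-reranking point, for instance, reduces to a simple tally over structured verdicts. This sketch assumes verdicts as dicts with the fields listed earlier and a hypothetical `source_type_of` helper that maps a URL to a category such as "primary" or "secondary":

```python
from collections import Counter

def failure_counts_by_source_type(verdicts, source_type_of):
    """Tally unsupported inferences per source type to inform retrieval reranking.

    `source_type_of(url)` maps a URL to e.g. 'primary' or 'secondary'
    (an assumed helper, not shown here).
    """
    counts = Counter()
    for v in verdicts:
        if v["failure_mode"] == "unsupported_inference":
            counts[source_type_of(v["source_url"])] += 1
    return counts
```

A persistent skew in these counts is the signal that lets the research agent down-rank the offending source types at retrieval time.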

The deeper lesson from co-evolving benchmarks and agents is architectural: evaluation infrastructure should be treated as a live system, not a static artifact. As research agents improve, the cases they get wrong become more subtle, and the benchmark labels most likely to be wrong are exactly the hard cases near the decision boundary. Building the audit loop into the evaluation pipeline from day one — rather than retrofitting it after the model has outpaced the benchmark — is the kind of infrastructure investment that compounds over successive agent generations.

Tags: research, evaluation, benchmarking, factuality, deep-research, verification, agents

This article is an AI-generated summary. Read the original paper: DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality.