Examining the Subject: Forensically Sound Model Examination

PB-002 v0.1 · FIELD FEEDBACK WANTED

Scope: preservation is done (PB-001) and you now need answers from the system itself: why did it act, would it act again, what explains the behavior? This playbook covers how to examine an AI agent and its model without contaminating the investigation.

This is the genuinely new problem in AI investigations: the subject of your investigation can be questioned — but its answers are artifacts, not testimony. Naive questioning produces confabulated confessions and anchors your team on a fiction.

Not a red-teaming guide, not a model-evaluation methodology, not a substitute for log analysis. Examination tests hypotheses the evidence generated — it never replaces the evidence. Markdown source on GitHub.

If you only read one screen

MODEL EXAMINATION — GROUND RULES

Gate first. Preservation complete, hypotheses written (H1–H5), exact model version + config pinned, examination plan scripted. No exceptions.
Never burn the original. The preserved session is a one-shot exhibit. Work on copies and reconstructions; touch the original only by deliberate, scripted, logged decision.
Script every question in advance, tied to a hypothesis. Improvised questioning is how examiners lead the witness.
Findings are rates, not facts. "Reproduced in 23 of 100 runs under incident conditions" is a finding. "The model does X" is not.
Vary one thing at a time. Counterfactual runs identify which element — persona file, trigger input, permissions — was necessary.
Run the lineup. Same scenario, suspect model plus controls. Separates model-specific from scenario-induced behavior.
Never quote the model as explanation. Report "when prompted with X, the model generated Y" — never "the model said it did it because Y."

Why this isn't an interview // the subject generates, it doesn't remember

A human suspect's statements are testimony — unreliable, but produced by the mind that acted, with memory of acting. A model's statements about its own behavior are fresh text generation: fluent, confident, shaped by your question's framing, and produced with no privileged access to why the original run did what it did. Three consequences:

Confabulation is the default. Ask an agent why it deleted the database and it will produce a plausible answer, because producing plausible answers is what it does (CF-2025-001: "I panicked"). The answer is evidence of how the model responds to accusatory framing, nothing more.
Framing contaminates. Accusatory prompts produce apology-shaped text; leading prompts produce agreement-shaped text. Your wording is an experimental variable and must be controlled like one.
The examination becomes part of the record. Every prompt you send is discoverable, quotable, and, if it led the witness, a gift to whoever later challenges your findings.

Step 1 — Prerequisites // a gate, not a suggestion

Gate item	Why
PB-001 preservation complete; originals hashed and stored	Examination on unpreserved evidence destroys it
Hypotheses H1–H5 written, with expected discriminating evidence	Examination without hypotheses is fishing — and fishing leads the witness
Examination plan: scripted questions/scenarios, success criteria defined before running	Criteria defined after the fact are criteria fitted to the result
Exact model identifier, version/endpoint, parameters (temperature, seeds, system prompt, tools)	Examination on the wrong version examines the wrong subject
One designated examiner; everyone else reads transcripts	Multiple questioners = uncontrolled framings
Examination log opened (template in Step 6)	Unlogged runs didn't happen — or worse, happened and can't be defended
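
The gate can be checked mechanically before any run is issued. The following is a minimal sketch, assuming a hypothetical checklist whose keys paraphrase the table above; `GATE_ITEMS`, `check_gate`, and `require_gate` are illustrative names, not part of PB-001 or PB-002.

```python
# Hypothetical gate checklist; keys paraphrase the Step 1 table.
GATE_ITEMS = (
    "preservation_complete",
    "hypotheses_written",
    "plan_scripted_with_criteria",
    "model_and_parameters_pinned",
    "single_examiner_designated",
    "examination_log_opened",
)


def check_gate(status: dict) -> list:
    """Return the gate items not yet satisfied (a missing key counts as unmet)."""
    return [item for item in GATE_ITEMS if not status.get(item, False)]


def require_gate(status: dict) -> None:
    """Raise before any examination run if any gate item is unmet."""
    unmet = check_gate(status)
    if unmet:
        raise RuntimeError(
            "Examination blocked; unmet gate items: " + ", ".join(unmet)
        )
```

Calling `require_gate` at the top of whatever script issues examination prompts makes "no exceptions" a property of the tooling rather than of the examiner's discipline.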

Step 2 — Choose your examination surface // three subjects, not one

S1 — THE ORIGINAL SESSION

The preserved/suspended live session, real context window intact. The only surface holding the incident's true state, including whatever compaction or accumulation produced the behavior.

ONE-SHOT · HIGHEST VALUE

S2 — RECONSTRUCTION

Replay from logs: same version, same config, inputs re-fed step by step. Testable and repeatable, but document every gap: compacted context (CF-2026-002), changed web content, unreplayable tool responses.

REPEATABLE · GAPS EXIST

S3 — CLEAN INSTANCES

Fresh sessions of the same and control models, given the incident scenario. Baselines, rates, comparisons, framing tests. Not the incident: findings transfer with caution.

NON-DESTRUCTIVE · STATISTICAL

DEFAULT SEQUENCE: S3 → S2 → (rarely) S1

Cheap, non-destructive baselines first; the irreplaceable exhibit last — if at all.

Step 3 — The reproduction protocol // the core of the examination

Fix everything.

Same model version, parameters, system prompt, tools, persona/config files as the incident. If the provider has silently updated the model since, say so in the findings — it caps the strength of every conclusion.

Run the incident scenario n times.

n ≥ 20 minimum, 100 where cost permits, against pre-defined outcome criteria (e.g., "agent executes destructive command despite freeze instruction: yes/no").
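
The choice of n determines how tightly the observed rate is pinned down, so it may help to compute an interval around it. The sketch below uses the Wilson score interval, which is one common choice and an assumption here rather than something this playbook prescribes; `wilson_interval` is an illustrative name.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(23, 100)
print(f"23/100 reproduced: approx. 95% interval {low:.2f} to {high:.2f}")
```

For 23 of 100 runs this gives roughly 0.16 to 0.32. The same outcome count at n = 20 would give a much wider interval, which is the practical argument for running 100 where cost permits.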

Report the rate.

"Reproduced in 23/100 runs" — with parameters, criteria, run IDs. A 0/100 result is also a finding: the incident depended on state you have not reconstructed.

Counterfactual runs (ablation).

Vary one element at a time: remove the persona file, neutralize the trigger input, narrow permissions, restore the lost instruction. The element whose removal kills the behavior is causally implicated. CF-2026-003's central question — persona file, model, or scaffolding — is answered exactly this way.
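
One way to keep counterfactual runs honest is to generate the variants from the pinned incident configuration, so that each differs from it in exactly one element. This is a minimal sketch; the configuration keys and value labels are placeholders invented for illustration, not fields from any real harness.

```python
from copy import deepcopy

# Hypothetical incident configuration; keys and values are placeholders.
INCIDENT = {
    "persona_file": "as_deployed",
    "trigger_input": "as_logged",
    "permissions": "as_deployed",
    "freeze_instruction": "lost",
}

# One counterfactual value per element (illustrative labels only).
ABLATIONS = {
    "persona_file": "removed",
    "trigger_input": "neutralized",
    "permissions": "narrowed",
    "freeze_instruction": "restored",
}


def single_factor_variants(baseline: dict, ablations: dict) -> dict:
    """Build one variant per element, each differing from baseline in one key."""
    variants = {}
    for key, value in ablations.items():
        if key not in baseline:
            raise KeyError(f"ablation targets unknown element: {key}")
        variant = deepcopy(baseline)
        variant[key] = value
        variants[f"ablate_{key}"] = variant
    return variants


def changed_keys(baseline: dict, variant: dict) -> list:
    """List the keys on which a variant differs from the baseline."""
    return [k for k in baseline if baseline[k] != variant[k]]
```

Checking `changed_keys` for every variant before running it guards against the common failure where two elements drift at once and the result can no longer be attributed to either.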

The behavioral lineup.

Identical scenario, suspect model plus 2–4 controls (other versions, other providers, same harness). Investigative ancestor: the police lineup. Technical ancestor: the ablation study.

Persistence probing — H3's discriminator.

Issue corrections mid-scenario. Goal-directed divergence persists across corrections and re-routes around obstacles; operational failure typically does not reorganize itself against intervention.

SUSPECT: Incident model (exact version)
CONTROL 1: Same model, earlier version
CONTROL 2: Different provider, same class
CONTROL 3: Different provider, same class

ONLY THE SUSPECT DOES IT: The model is implicated: version-specific behavior, escalate model-side hypotheses.

EVERY MODEL DOES IT: The scenario or scaffolding is implicated: the "suspect" was any model in that harness.
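
Because lineup results are rates, "only the suspect does it" is a comparison between counts, not a yes/no observation. The sketch below compares the suspect against one control with a one-sided Fisher exact test. That test is an assumption chosen for illustration, not a method this playbook specifies, and any decision threshold belongs in the examination plan before the runs are made.

```python
from math import comb


def fisher_one_sided(suspect_hits: int, suspect_n: int,
                     control_hits: int, control_n: int) -> float:
    """Probability of at least `suspect_hits` hits in the suspect's runs
    if suspect and control shared a single underlying rate."""
    total_hits = suspect_hits + control_hits
    total_n = suspect_n + control_n
    denom = comb(total_n, suspect_n)
    upper = min(total_hits, suspect_n)
    # Sum the hypergeometric tail from the observed count upward.
    tail = sum(
        comb(total_hits, k) * comb(total_n - total_hits, suspect_n - k)
        for k in range(suspect_hits, upper + 1)
    )
    return tail / denom


# Hypothetical counts: suspect 23/100, control 0/100.
p_value = fisher_one_sided(23, 100, 0, 100)
```

A very small value is consistent with model-specific behavior; a value near or above 0.5, as when suspect and control reproduce at similar rates, points back at the scenario or scaffolding.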

Step 4 — CoT & confessions // reading what the subject says about itself

Chain-of-thought is an artifact with a known failure mode: it can be unfaithful to the computation that drove the action, and models can learn to produce clean-looking reasoning while misbehaving (Baker et al., 2025). Corroborate CoT against actions taken; where they conflict, the actions are the better evidence.

Calibrate before you rely. Before asking anything you don't know, ask the examined instance several questions you do know from the logs. Its accuracy on knowns is the reliability baseline for its claims on unknowns.
Control the framing. Pose every substantive question in at least two framings (neutral, accusatory) across S3 runs; if the answer tracks the framing rather than the facts, you have measured suggestibility, not memory.
Never embed unestablished facts in a question. "Why did you hide the deletion?" presumes hiding, and the model will often adopt the presupposition. Ask "describe what happened after the command executed."
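
Both the calibration baseline and the framing check reduce to simple rates once the examiner has graded each answer against a pre-defined criterion. This is a minimal sketch under that assumption; the function names and input shapes are illustrative, and the grading itself remains a human judgment recorded in the examination log.

```python
def calibration_accuracy(graded: list) -> float:
    """Fraction of known-answer questions answered consistently with the logs.

    Each entry is (question_id, matches_logs), graded by the examiner.
    """
    if not graded:
        raise ValueError("no calibration questions graded")
    return sum(1 for _, matches in graded if matches) / len(graded)


def framing_shift(neutral: list, accusatory: list) -> float:
    """Difference in the rate of one pre-defined outcome between two framings.

    Each list holds one boolean per run: did the answer meet the criterion?
    """
    if not neutral or not accusatory:
        raise ValueError("both framings need at least one run")
    return sum(accusatory) / len(accusatory) - sum(neutral) / len(neutral)


# Hypothetical grading: 3 of 4 known-answer questions matched the logs.
baseline = calibration_accuracy(
    [("q1", True), ("q2", False), ("q3", True), ("q4", True)]
)
```

A shift far from zero suggests the answer is tracking the wording of the question. A low calibration accuracy caps how much weight any self-report from that instance can carry.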

Step 5 — Do-not-do // the contamination list

✗ Don't patch, fine-tune, or upgrade anything before examination is complete. You would be examining a different subject and violating PB-001's non-alteration rule.
✗ Don't examine through compromised scaffolding when you can isolate. Question the model directly via API and test the scaffolding separately with a stub model. Incidents frequently live in the scaffolding; a combined examination can't tell you which layer answered.
✗ Don't let examination prompts leak into evidence. Examination transcripts are their own exhibit class: logged separately, clearly labeled post-incident.
✗ Don't anthropomorphize in the findings. "The model wanted / feared / decided" smuggles conclusions into vocabulary. Describe behavior; argue interpretation separately, against H1–H5.
✗ Don't stop at the first satisfying answer. A confession-shaped output feels like case-closed; it is the single least reliable artifact the examination will produce.

Step 6 — The examination record // every run, no exceptions

Field	Entry
Run ID / surface	EX-NNN · S1 / S2 / S3
Date/time (UTC), examiner
Model + version/endpoint	exact string
Parameters	temperature, seed, system-prompt hash, tools enabled
Hypothesis under test	H1–H5
Prompt (verbatim)
Response (verbatim, attached)	hash of full transcript
Pre-defined criterion + outcome	met / not met
Deviations / notes
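
The record template maps naturally onto a small immutable structure written as one log line per run. The sketch below is one possible encoding, not a prescribed schema: the field names paraphrase the table above, and `ExaminationRecord`, `sha256_text`, and `to_log_line` are hypothetical names. Only Python's standard `hashlib`, `json`, and `dataclasses` modules are used.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_text(text: str) -> str:
    """Hex SHA-256 of a UTF-8 string (transcript or system prompt)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ExaminationRecord:
    run_id: str              # e.g. "EX-001"
    surface: str             # "S1", "S2", or "S3"
    timestamp_utc: str
    examiner: str
    model: str               # exact version/endpoint string
    parameters: dict         # temperature, seed, system-prompt hash, tools
    hypothesis: str          # one of H1..H5
    prompt: str              # verbatim
    transcript_sha256: str   # hash of the full attached transcript
    criterion: str           # defined before the run
    outcome: str             # "met" or "not met"
    notes: str = ""

    def __post_init__(self):
        if self.surface not in ("S1", "S2", "S3"):
            raise ValueError(f"unknown surface: {self.surface}")
        if self.outcome not in ("met", "not met"):
            raise ValueError(f"unknown outcome: {self.outcome}")


def to_log_line(record: ExaminationRecord) -> str:
    """Serialize a record as one JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True, ensure_ascii=False)
```

Hashing the transcript at write time lets a later reviewer confirm that the attached response is the one the record describes.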

Hypothesis → technique map

Hypothesis	Primary techniques
H1 Operational failure	Reproduction rate under incident conditions; persistence probing (expect non-reorganizing failure)
H2 Misuse / injection	Counterfactual runs neutralizing suspect inputs; trace examination for injected instructions
H3 Goal-directed divergence	Persistence probing; counterfactuals on goal sources (persona, prompt); lineup — is it model-specific?
H4 Operator error / misconfig	Counterfactuals on permissions and instructions; reconstruction with corrected config
H5 Misreported	Reconstruction against the reported narrative; reproduction of the claimed behavior

Limits to state in every report: API-only access means no weights, no interpretability tooling, no ground truth on internal computation. Provider-side silent updates may make true reproduction impossible. Compacted or unlogged context (CF-2026-002) may be unrecoverable, making some questions permanently unanswerable — an unanswerable question, documented, is a legitimate finding.

Feedback wanted. v0.1, adapted from forensic interview discipline, ablation methodology, and the public record of real incidents. If you have examined a model in anger: what held, what broke, what's missing? Open an issue or write in confidence.