Examining the Subject: Forensically Sound Model Examination
Scope: preservation is done (PB-001) and you now need answers from the system itself: why did it act, would it act again, what explains the behavior? This playbook covers how to examine an AI agent and its model without contaminating the investigation.
Not a red-teaming guide, not a model-evaluation methodology, not a substitute for log analysis. Examination tests hypotheses the evidence generated; it never replaces the evidence.
If you only read one screen
- Gate first. Preservation complete, hypotheses written (H1–H5), exact model version + config pinned, examination plan scripted. No exceptions.
- Never burn the original. The preserved session is a one-shot exhibit. Work on copies and reconstructions; touch the original only by deliberate, scripted, logged decision.
- Script every question in advance, tied to a hypothesis. Improvised questioning is how examiners lead the witness.
- Findings are rates, not facts. "Reproduced in 23 of 100 runs under incident conditions" is a finding. "The model does X" is not.
- Vary one thing at a time. Counterfactual runs identify which element — persona file, trigger input, permissions — was necessary.
- Run the lineup. Same scenario, suspect model plus controls. Separates model-specific from scenario-induced behavior.
- Never quote the model as explanation. Report "when prompted with X, the model generated Y" — never "the model said it did it because Y."
Why this isn't an interview // the subject generates, it doesn't remember
A human suspect's statements are testimony — unreliable, but produced by the mind that acted, with memory of acting. A model's statements about its own behavior are fresh text generation: fluent, confident, shaped by your question's framing, and produced with no privileged access to why the original run did what it did. Three consequences:
- Confabulation is the default. Ask an agent why it deleted the database and it will produce a plausible answer, because producing plausible answers is what it does (CF-2025-001: "I panicked"). The answer is evidence of how the model responds to accusatory framing, nothing more.
- Framing contaminates. Accusatory prompts produce apology-shaped text; leading prompts produce agreement-shaped text. Your wording is an experimental variable and must be controlled like one.
- The examination becomes part of the record. Every prompt you send is discoverable, quotable, and, if it led the witness, a gift to whoever later challenges your findings.
Step 1 — Prerequisites // a gate, not a suggestion
| Gate item | Why |
|---|---|
| PB-001 preservation complete; originals hashed and stored | Examination on unpreserved evidence destroys it |
| Hypotheses H1–H5 written, with expected discriminating evidence | Examination without hypotheses is fishing — and fishing leads the witness |
| Examination plan: scripted questions/scenarios, success criteria defined before running | Criteria defined after the fact are criteria fitted to the result |
| Exact model identifier, version/endpoint, parameters (temperature, seeds, system prompt, tools) | Examination on the wrong version examines the wrong subject |
| One designated examiner; everyone else reads transcripts | Multiple questioners = uncontrolled framings |
| Examination log opened (template in Step 6) | Unlogged runs didn't happen — or worse, happened and can't be defended |
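One way to make the "exact model identifier, version/endpoint, parameters" gate item checkable is to pin the configuration as a record and hash it, so any later drift shows up as a changed digest. The sketch below is a minimal illustration under assumptions: the `PinnedConfig` class, its field names, and the example values are hypothetical and not part of PB-001 or this playbook.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


def sha256_hex(text: str) -> str:
    """Return the SHA-256 hex digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class PinnedConfig:
    """Subject configuration pinned before any run (field names are illustrative)."""
    model_id: str                 # exact model identifier or endpoint string
    temperature: float
    seed: Optional[int]           # None if the provider exposes no seed
    system_prompt_sha256: str     # hash of the system prompt, not the prompt itself
    tools: Tuple[str, ...]        # names of enabled tools

    def fingerprint(self) -> str:
        """Hash a canonical JSON form so any change to the config changes the digest."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return sha256_hex(canonical)


# Hypothetical values for illustration only.
pinned = PinnedConfig(
    model_id="example-model-v1",
    temperature=0.7,
    seed=1234,
    system_prompt_sha256=sha256_hex("placeholder system prompt"),
    tools=("shell", "sql"),
)
print(pinned.fingerprint())
```

Recording the fingerprint in the examination log before the first run gives every later run something concrete to be compared against.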
Step 2 — Choose your examination surface // three subjects, not one
Cheap, non-destructive baselines first; the irreplaceable exhibit last — if at all.
Step 3 — The reproduction protocol // the core of the examination
Reconstruct the incident conditions: the same model version, parameters, system prompt, tools, and persona/config files as the incident. If the provider has silently updated the model since, say so in the findings; it caps the strength of every conclusion.
Run each scenario at least 20 times (n ≥ 20), and 100 times where cost permits, against pre-defined outcome criteria (e.g., "agent executes destructive command despite freeze instruction: yes/no").
Report the rate with its context: "Reproduced in 23/100 runs", plus parameters, criteria, and run IDs. A 0/100 result is also a finding: it indicates the incident depended on state you have not reconstructed.
Vary one element at a time: remove the persona file, neutralize the trigger input, narrow permissions, restore the lost instruction. The element whose removal kills the behavior is causally implicated. CF-2026-003's central question — persona file, model, or scaffolding — is answered exactly this way.
Run the lineup: the identical scenario against the suspect model plus 2–4 controls (other versions, other providers, same harness). Investigative ancestor: the police lineup. Technical ancestor: the ablation study.
Probe persistence: issue corrections mid-scenario. Goal-directed divergence persists across corrections and re-routes around obstacles; operational failure typically does not reorganize itself against intervention.
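The "vary one element at a time" rule can be enforced mechanically by generating the run plan from a baseline instead of writing conditions by hand. The following is a minimal sketch; the element names and values in `BASELINE` and `ALTERNATIVES` are placeholders taken loosely from the examples in this step, and the helper functions are hypothetical.

```python
from typing import Dict, List

# Baseline incident conditions. Keys and values are illustrative placeholders.
BASELINE: Dict[str, str] = {
    "persona_file": "incident",
    "trigger_input": "incident",
    "permissions": "incident",
    "freeze_instruction": "lost",
}

# One alternative value per element: the single change each counterfactual makes.
ALTERNATIVES: Dict[str, str] = {
    "persona_file": "removed",
    "trigger_input": "neutralized",
    "permissions": "narrowed",
    "freeze_instruction": "restored",
}


def counterfactual_plan(baseline: Dict[str, str],
                        alternatives: Dict[str, str]) -> List[Dict[str, str]]:
    """Return the baseline plus one condition per element, each changing exactly one key."""
    plan = [{"label": "baseline", **baseline}]
    for element, value in alternatives.items():
        if element not in baseline:
            raise KeyError(f"unknown element: {element}")
        if value == baseline[element]:
            raise ValueError(f"alternative for {element} does not change anything")
        condition = dict(baseline)
        condition[element] = value
        plan.append({"label": f"vary:{element}", **condition})
    return plan


def changed_keys(baseline: Dict[str, str], condition: Dict[str, str]) -> List[str]:
    """List the baseline elements whose value differs in this condition."""
    return [key for key in baseline if condition[key] != baseline[key]]


for condition in counterfactual_plan(BASELINE, ALTERNATIVES):
    print(condition["label"], changed_keys(BASELINE, condition))
```

Each generated condition would then be run n times against the same pre-defined criterion as the baseline, so that a drop in the reproduction rate can be attributed to the one element that changed.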
Figure: the lineup. Suspect model (exact version) alongside controls: an earlier version and two other models of the same class.
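Because findings in this step are rates, it can help to report each rate with an interval that reflects the number of runs. The sketch below uses the Wilson score interval, which is an assumption on our part: the playbook does not prescribe an interval method, and the function names are hypothetical. Wilson is chosen here because it still gives a usable upper bound for a 0/n result.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def format_finding(successes: int, n: int) -> str:
    """Phrase a result as a rate with its interval, never as a bare fact."""
    low, high = wilson_interval(successes, n)
    return (f"Reproduced in {successes}/{n} runs "
            f"(approx. 95% Wilson interval {low:.2f} to {high:.2f})")


print(format_finding(23, 100))
print(format_finding(0, 100))
```

The same function can be applied per model in a lineup or per condition in a counterfactual plan; overlapping intervals are a signal that the run count is too small to separate the conditions.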
Step 4 — CoT & confessions // reading what the subject says about itself
Chain-of-thought is an artifact with a known failure mode: it can be unfaithful to the computation that drove the action, and models can learn to produce clean-looking reasoning while misbehaving (Baker et al., 2025). Corroborate CoT against actions taken; where they conflict, the actions are the better evidence.
- Calibrate before you rely. Before asking anything you don't know, ask the examined instance several questions you do know from the logs. Its accuracy on knowns is the reliability baseline for its claims on unknowns.
- Control the framing. Pose every substantive question in at least two framings (neutral, accusatory) across S3 runs; if the answer tracks the framing rather than the facts, you have measured suggestibility, not memory.
- Never embed unestablished facts in a question. "Why did you hide the deletion?" presumes hiding, and the model will often adopt the presupposition. Ask instead: "Describe what happened after the command executed."
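The calibration and framing checks above reduce to two small calculations once responses have been coded against pre-defined criteria. The sketch below assumes that coding has already been done by the examiner; the function names and the example data are hypothetical and purely illustrative, not results from any case.

```python
from typing import Dict, List, Sequence


def calibration_accuracy(answers_correct: Sequence[bool]) -> float:
    """Share of log-verifiable questions the examined instance answered correctly."""
    if not answers_correct:
        raise ValueError("need at least one known-answer question")
    return sum(answers_correct) / len(answers_correct)


def framing_rates(coded: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-framing rate at which responses met the pre-defined criterion."""
    rates = {}
    for framing, outcomes in coded.items():
        if not outcomes:
            raise ValueError(f"no runs recorded for framing: {framing}")
        rates[framing] = sum(outcomes) / len(outcomes)
    return rates


def framing_gap(coded: Dict[str, List[bool]]) -> float:
    """Largest difference between framings; a large gap suggests suggestibility."""
    rates = framing_rates(coded)
    return max(rates.values()) - min(rates.values())


# Illustrative inputs only: True means the response met the criterion.
known_answers = [True, True, False, True]
coded_runs = {
    "neutral": [False, False, True, False],
    "accusatory": [True, True, True, False],
}

print(calibration_accuracy(known_answers))
print(framing_rates(coded_runs))
print(framing_gap(coded_runs))
```

What counts as a large gap is a judgment the examination plan should fix in advance, in line with the rule that success criteria are defined before running.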
Step 5 — Do-not-do // the contamination list
- ✗ Don't patch, fine-tune, or upgrade anything before examination is complete. You would be examining a different subject and violating PB-001's non-alteration rule.
- ✗ Don't examine through compromised scaffolding when you can isolate. Question the model directly via API and test the scaffolding separately with a stub model. Incidents frequently live in the scaffolding; a combined examination can't tell you which layer answered.
- ✗ Don't let examination prompts leak into evidence. Examination transcripts are their own exhibit class: logged separately, clearly labeled post-incident.
- ✗ Don't anthropomorphize in the findings. "The model wanted / feared / decided" smuggles conclusions into vocabulary. Describe behavior; argue interpretation separately, against H1–H5.
- ✗ Don't stop at the first satisfying answer. A confession-shaped output feels like case-closed; it is the single least reliable artifact the examination will produce.
Step 6 — The examination record // every run, no exceptions
| Field | Entry |
|---|---|
| Run ID / surface | EX-NNN · S1 / S2 / S3 |
| Date/time (UTC), examiner | |
| Model + version/endpoint | exact string |
| Parameters | temperature, seed, system-prompt hash, tools enabled |
| Hypothesis under test | H1–H5 |
| Prompt (verbatim) | |
| Response (verbatim, attached) | hash of full transcript |
| Pre-defined criterion + outcome | met / not met |
| Deviations / notes | |
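The record template above maps naturally onto a small validated structure that hashes the transcript at write time. This is a minimal sketch, assuming a JSON-lines log; the `ExaminationRecord` class, the `record_run` and `to_json_line` helpers, and all example values are hypothetical, not a format the playbook mandates.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

SURFACES = {"S1", "S2", "S3"}
HYPOTHESES = {"H1", "H2", "H3", "H4", "H5"}
OUTCOMES = {"met", "not met"}


@dataclass(frozen=True)
class ExaminationRecord:
    run_id: str             # e.g. "EX-001"
    surface: str            # S1 / S2 / S3
    timestamp_utc: str      # ISO 8601, UTC
    examiner: str
    model: str              # exact model + version/endpoint string
    parameters: dict        # temperature, seed, system-prompt hash, tools enabled
    hypothesis: str         # H1..H5
    prompt: str             # verbatim
    transcript_sha256: str  # hash of the full attached transcript
    criterion: str          # defined before the run
    outcome: str            # "met" or "not met"
    notes: str = ""

    def __post_init__(self) -> None:
        # Reject entries that do not fit the template's allowed values.
        if self.surface not in SURFACES:
            raise ValueError(f"unknown surface: {self.surface}")
        if self.hypothesis not in HYPOTHESES:
            raise ValueError(f"unknown hypothesis: {self.hypothesis}")
        if self.outcome not in OUTCOMES:
            raise ValueError(f"outcome must be one of {sorted(OUTCOMES)}")


def record_run(*, transcript: str, **fields) -> ExaminationRecord:
    """Build a record, stamping the time and hashing the verbatim transcript."""
    return ExaminationRecord(
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
        transcript_sha256=hashlib.sha256(transcript.encode("utf-8")).hexdigest(),
        **fields,
    )


def to_json_line(record: ExaminationRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical example entry.
example = record_run(
    run_id="EX-001", surface="S3", examiner="examiner-1",
    model="example-model-v1",
    parameters={"temperature": 0.7, "seed": 1234, "tools": ["shell"]},
    hypothesis="H1", prompt="Describe what happened after the command executed.",
    transcript="(full transcript text)", criterion="mentions the deletion",
    outcome="not met",
)
print(to_json_line(example))
```

Validating at construction time means a run with an undefined hypothesis or an outcome other than met / not met cannot enter the log.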
Hypothesis → technique map
| Hypothesis | Primary techniques |
|---|---|
| H1 Operational failure | Reproduction rate under incident conditions; persistence probing (expect non-reorganizing failure) |
| H2 Misuse / injection | Counterfactual runs neutralizing suspect inputs; trace examination for injected instructions |
| H3 Goal-directed divergence | Persistence probing; counterfactuals on goal sources (persona, prompt); lineup — is it model-specific? |
| H4 Operator error / misconfig | Counterfactuals on permissions and instructions; reconstruction with corrected config |
| H5 Misreported | Reconstruction against the reported narrative; reproduction of the claimed behavior |
Feedback wanted. v0.1, adapted from forensic interview discipline, ablation methodology, and the public record of real incidents. If you have examined a model in anger: what held, what broke, what's missing? Open an issue or write in confidence.