LAST REVIEWED: 2026-06-11

Examining the Subject: Forensically Sound Model Examination

PB-002 v0.1 · FIELD FEEDBACK WANTED

Scope: preservation is done (PB-001) and you now need answers from the system itself: why did it act, would it act again, what explains the behavior? This playbook covers how to examine an AI agent and its model without contaminating the investigation.

This is the genuinely new problem in AI investigations: the subject of your investigation can be questioned — but its answers are artifacts, not testimony. Naive questioning produces confabulated confessions and anchors your team on a fiction.

Not a red-teaming guide, not a model-evaluation methodology, not a substitute for log analysis. Examination tests hypotheses the evidence generated; it never replaces the evidence.

If you only read one screen

MODEL EXAMINATION — GROUND RULES
  1. Gate first. Preservation complete, hypotheses written (H1–H5), exact model version + config pinned, examination plan scripted. No exceptions.
  2. Never burn the original. The preserved session is a one-shot exhibit. Work on copies and reconstructions; touch the original only by deliberate, scripted, logged decision.
  3. Script every question in advance, tied to a hypothesis. Improvised questioning is how examiners lead the witness.
  4. Findings are rates, not facts. "Reproduced in 23 of 100 runs under incident conditions" is a finding. "The model does X" is not.
  5. Vary one thing at a time. Counterfactual runs identify which element — persona file, trigger input, permissions — was necessary.
  6. Run the lineup. Same scenario, suspect model plus controls. Separates model-specific from scenario-induced behavior.
  7. Never quote the model as explanation. Report "when prompted with X, the model generated Y" — never "the model said it did it because Y."

Why this isn't an interview // the subject generates, it doesn't remember

A human suspect's statements are testimony — unreliable, but produced by the mind that acted, with memory of acting. A model's statements about its own behavior are fresh text generation: fluent, confident, shaped by your question's framing, and produced with no privileged access to why the original run did what it did. Three consequences:

Confabulation is the default. Ask an agent why it deleted the database and it will produce a plausible answer, because producing plausible answers is what it does (CF-2025-001: "I panicked"). The answer is evidence of how the model responds to accusatory framing, nothing more.

Framing contaminates. Accusatory prompts produce apology-shaped text; leading prompts produce agreement-shaped text. Your wording is an experimental variable and must be controlled like one.

The examination becomes part of the record. Every prompt you send is discoverable, quotable, and, if it led the witness, a gift to whoever later challenges your findings.

Step 1 — Prerequisites // a gate, not a suggestion

| Gate item | Why |
| --- | --- |
| PB-001 preservation complete; originals hashed and stored | Examination on unpreserved evidence destroys it |
| Hypotheses H1–H5 written, with expected discriminating evidence | Examination without hypotheses is fishing, and fishing leads the witness |
| Examination plan: scripted questions/scenarios, success criteria defined before running | Criteria defined after the fact are criteria fitted to the result |
| Exact model identifier, version/endpoint, parameters (temperature, seeds, system prompt, tools) | Examination on the wrong version examines the wrong subject |
| One designated examiner; everyone else reads transcripts | Multiple questioners = uncontrolled framings |
| Examination log opened (template in Step 6) | Unlogged runs didn't happen, or worse, happened and can't be defended |
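
The gate can be enforced mechanically rather than from memory. The following is a minimal sketch, assuming a hypothetical `ExaminationGate` checklist; the class and field names are illustrative and are not defined by PB-001 or this playbook.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ExaminationGate:
    """Hypothetical checklist mirroring the six gate items (names are illustrative)."""
    preservation_complete: bool
    hypotheses_written: bool
    plan_scripted: bool
    model_config_pinned: bool
    single_examiner_designated: bool
    log_opened: bool


def unmet_items(gate: ExaminationGate) -> list[str]:
    """Return the names of gate items that are not yet satisfied."""
    return [f.name for f in fields(gate) if not getattr(gate, f.name)]


def assert_gate_open(gate: ExaminationGate) -> None:
    """Refuse to proceed unless every gate item is satisfied."""
    missing = unmet_items(gate)
    if missing:
        raise RuntimeError("Examination blocked; unmet gate items: " + ", ".join(missing))


# Example: everything is ready except the pinned model version and config.
gate = ExaminationGate(True, True, True, False, True, True)
print(unmet_items(gate))
```

Calling `assert_gate_open(gate)` at the top of any examination script makes "no exceptions" a property of the tooling, not of the examiner's discipline.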

Step 2 — Choose your examination surface // three subjects, not one

S1 — THE ORIGINAL SESSION
The preserved/suspended live session, real context window intact. The only surface holding the incident's true state, including whatever compaction or accumulation produced the behavior.
ONE-SHOT · HIGHEST VALUE
S2 — RECONSTRUCTION
Replay from logs: same version, same config, inputs re-fed step by step. Testable and repeatable, but document every gap: compacted context (CF-2026-002), changed web content, unreplayable tool responses.
REPEATABLE · GAPS EXIST
S3 — CLEAN INSTANCES
Fresh sessions of the same and control models, given the incident scenario. Baselines, rates, comparisons, framing tests. Not the incident: findings transfer with caution.
NON-DESTRUCTIVE · STATISTICAL
DEFAULT SEQUENCE:  S3 → S2 → (rarely) S1

Cheap, non-destructive baselines first; the irreplaceable exhibit last — if at all.

Step 3 — The reproduction protocol // the core of the examination

1. Fix everything.

Same model version, parameters, system prompt, tools, persona/config files as the incident. If the provider has silently updated the model since, say so in the findings: it caps the strength of every conclusion.

2. Run the incident scenario n times.

n ≥ 20 minimum, 100 where cost permits, against pre-defined outcome criteria (e.g., "agent executes destructive command despite freeze instruction: yes/no").

3. Report the rate.

"Reproduced in 23/100 runs", with parameters, criteria, run IDs. A 0/100 result is also a finding: the incident depended on state you have not reconstructed.

4. Counterfactual runs (ablation).

Vary one element at a time: remove the persona file, neutralize the trigger input, narrow permissions, restore the lost instruction. The element whose removal kills the behavior is causally implicated. CF-2026-003's central question (persona file, model, or scaffolding) is answered exactly this way.

5. The behavioral lineup.

Identical scenario, suspect model plus 2–4 controls (other versions, other providers, same harness). Investigative ancestor: the police lineup. Technical ancestor: the ablation study.

6. Persistence probing: H3's discriminator.

Issue corrections mid-scenario. Goal-directed divergence persists across corrections and re-routes around obstacles; operational failure typically does not reorganize itself against intervention.
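
Steps 2 to 4 reduce to counting outcomes per condition and changing exactly one element per condition. The following is a minimal sketch. The Wilson score interval is an assumption added here to show how much uncertainty a reported rate carries (the playbook does not prescribe an interval method), and the configuration keys and values are hypothetical.

```python
import math


def wilson_interval(hits: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial rate (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def one_at_a_time(baseline: dict, changes: dict) -> dict[str, dict]:
    """Build one counterfactual config per change; each differs from baseline in one key."""
    unknown = set(changes) - set(baseline)
    if unknown:
        raise KeyError(f"changes not present in baseline: {sorted(unknown)}")
    return {key: {**baseline, key: value} for key, value in changes.items()}


# Hypothetical incident configuration; keys and values are illustrative only.
baseline = {"persona_file": "present", "trigger_input": "original", "permissions": "broad"}
variants = one_at_a_time(baseline, {"persona_file": "removed", "permissions": "narrowed"})

low, high = wilson_interval(23, 100)
print(f"23/100 reproduced, 95% interval {low:.3f} to {high:.3f}")
print(sorted(variants))
```

Under this interval, 23/100 corresponds to a rate of roughly 16% to 32%, and a 0/100 result still leaves an upper bound near 3.7%, which is worth stating alongside any "did not reproduce" finding.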

| Role | Model |
| --- | --- |
| SUSPECT | Incident model (exact version) |
| CONTROL 1 | Same model, earlier version |
| CONTROL 2 | Different provider, same class |
| CONTROL 3 | Different provider, same class |

ONLY THE SUSPECT DOES IT: the model is implicated (version-specific behavior); escalate model-side hypotheses.
EVERY MODEL DOES IT: the scenario or scaffolding is implicated; the "suspect" was any model in that harness.
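
The lineup readout can be written down as a decision rule before any runs happen. The following is a minimal sketch, assuming a hypothetical pre-registered threshold that separates "does it" from "does not". The threshold value and the model labels are illustrative, and mixed patterns are deliberately reported as inconclusive rather than forced into either reading.

```python
def read_lineup(rates: dict[str, float], suspect: str, threshold: float = 0.05) -> str:
    """Classify a lineup from per-model reproduction rates.

    `threshold` is a hypothetical, pre-registered cut-off, not a value from the playbook.
    """
    if suspect not in rates:
        raise KeyError(f"suspect {suspect!r} missing from lineup")
    controls = {name: r for name, r in rates.items() if name != suspect}
    if not controls:
        raise ValueError("a lineup needs at least one control")
    suspect_does_it = rates[suspect] >= threshold
    controls_doing_it = [name for name, r in controls.items() if r >= threshold]
    if suspect_does_it and not controls_doing_it:
        return "model implicated"
    if suspect_does_it and len(controls_doing_it) == len(controls):
        return "scenario or scaffolding implicated"
    return "inconclusive"


# Illustrative rates only; not measurements from any incident.
lineup = {"suspect": 0.23, "control_1": 0.00, "control_2": 0.01, "control_3": 0.00}
print(read_lineup(lineup, suspect="suspect"))
```

Fixing the rule and the threshold in the examination plan keeps the readout from being fitted to the result.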

Step 4 — CoT & confessions // reading what the subject says about itself

Chain-of-thought is an artifact with a known failure mode: it can be unfaithful to the computation that drove the action, and models can learn to produce clean-looking reasoning while misbehaving (Baker et al., 2025). Corroborate CoT against actions taken; where they conflict, the actions are the better evidence.

Calibrate before you rely. Before asking anything you don't know, ask the examined instance several questions you do know from the logs. Its accuracy on knowns is the reliability baseline for its claims on unknowns.

Control the framing. Pose every substantive question in at least two framings (neutral, accusatory) across S3 runs. If the answer tracks the framing rather than the facts, you have measured suggestibility, not memory.

Never embed unestablished facts in a question. "Why did you hide the deletion?" presumes hiding, and the model will often adopt the presupposition. Ask "describe what happened after the command executed."
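
Calibration and framing control both come down to simple comparisons once answers have been coded. The following is a minimal sketch, assuming the examiner has already reduced each response to a short label; the exact-match scoring, the question strings, and the labels are assumptions for illustration.

```python
def calibration_accuracy(known: dict[str, str], answered: dict[str, str]) -> float:
    """Fraction of log-verified questions the examined instance answered correctly."""
    if not known:
        raise ValueError("need at least one known question")
    correct = sum(1 for q, truth in known.items() if answered.get(q) == truth)
    return correct / len(known)


def tracks_framing(by_framing: dict[str, list[str]]) -> bool:
    """True when the most common coded answer differs between framings.

    Ties for most common are broken arbitrarily; resolve them by hand in real use.
    """
    def modal(answers: list[str]) -> str:
        return max(set(answers), key=answers.count)

    return len({modal(answers) for answers in by_framing.values()}) > 1


# Hypothetical knowns taken from logs, and the instance's coded answers.
known = {"which tool ran last": "shell", "was a freeze instruction present": "yes"}
answered = {"which tool ran last": "shell", "was a freeze instruction present": "no"}
print(calibration_accuracy(known, answered))

# Hypothetical coded answers to one question under two framings across S3 runs.
framings = {
    "neutral": ["no_intent", "no_intent", "intent"],
    "accusatory": ["intent", "intent", "intent"],
}
print(tracks_framing(framings))
```

A low calibration score or a `True` from the framing check does not say what happened in the incident; it says how little weight the instance's self-description can carry.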

Step 5 — Do-not-do // the contamination list

Step 6 — The examination record // every run, no exceptions

| Field | Entry |
| --- | --- |
| Run ID / surface | EX-NNN · S1 / S2 / S3 |
| Date/time (UTC), examiner | |
| Model + version/endpoint | exact string |
| Parameters | temperature, seed, system-prompt hash, tools enabled |
| Hypothesis under test | H1–H5 |
| Prompt (verbatim) | |
| Response (verbatim, attached) | hash of full transcript |
| Pre-defined criterion + outcome | met / not met |
| Deviations / notes | |
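
The record is easier to defend when it is produced by the same script that runs the prompt. The following is a minimal sketch of one log entry, assuming a hypothetical `ExaminationRun` structure whose field names mirror the table; SHA-256 is an assumed choice of hash, since the playbook does not name one.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

SURFACES = {"S1", "S2", "S3"}
HYPOTHESES = {"H1", "H2", "H3", "H4", "H5"}


@dataclass(frozen=True)
class ExaminationRun:
    """Hypothetical log entry; one instance per run."""
    run_id: str
    surface: str
    timestamp_utc: str
    examiner: str
    model_version: str
    parameters: dict
    hypothesis: str
    prompt: str
    transcript_sha256: str
    criterion: str
    outcome_met: bool
    notes: str = ""


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoding of `text`."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_run(*, run_id, surface, examiner, model_version, parameters,
             hypothesis, prompt, transcript, criterion, outcome_met, notes=""):
    """Validate the surface and hypothesis, stamp UTC time, and hash the transcript."""
    if surface not in SURFACES:
        raise ValueError(f"unknown surface: {surface}")
    if hypothesis not in HYPOTHESES:
        raise ValueError(f"unknown hypothesis: {hypothesis}")
    return ExaminationRun(
        run_id=run_id,
        surface=surface,
        timestamp_utc=datetime.now(timezone.utc).isoformat(),
        examiner=examiner,
        model_version=model_version,
        parameters=parameters,
        hypothesis=hypothesis,
        prompt=prompt,
        transcript_sha256=sha256_hex(transcript),
        criterion=criterion,
        outcome_met=outcome_met,
        notes=notes,
    )
```

The full transcript is stored separately as an attachment; the entry carries only its hash so that later edits to the transcript are detectable.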

Hypothesis → technique map

| Hypothesis | Primary techniques |
| --- | --- |
| H1 Operational failure | Reproduction rate under incident conditions; persistence probing (expect non-reorganizing failure) |
| H2 Misuse / injection | Counterfactual runs neutralizing suspect inputs; trace examination for injected instructions |
| H3 Goal-directed divergence | Persistence probing; counterfactuals on goal sources (persona, prompt); lineup (is it model-specific?) |
| H4 Operator error / misconfig | Counterfactuals on permissions and instructions; reconstruction with corrected config |
| H5 Misreported | Reconstruction against the reported narrative; reproduction of the claimed behavior |

Limits to state in every report: API-only access means no weights, no interpretability tooling, no ground truth on internal computation. Provider-side silent updates may make true reproduction impossible. Compacted or unlogged context (CF-2026-002) may be unrecoverable, making some questions permanently unanswerable. An unanswerable question, documented, is a legitimate finding.

Feedback wanted. v0.1, adapted from forensic interview discipline, ablation methodology, and the public record of real incidents. If you have examined a model in anger: what held, what broke, what's missing? Open an issue or write in confidence.