PB-001 v0.1 · LAST REVIEWED: 2026-06-11

First Hours: AI Incident Response

PB-001 v0.1: FIELD FEEDBACK WANTED

Scope: you have just learned that an AI agent has done — or is doing — something harmful: destroyed data, taken unauthorized actions, published content, moved money, contacted people. This covers the first four hours: containment, evidence preservation, initial assessment. It assumes you might be a team of one, at 3am, with no ML background.

Not legal advice and not a full IR lifecycle — this is the AI-specific layer current IR plans are missing. Markdown source & version history on GitHub.

If you only read one screen

FIRST HOURS: WORKING CHECKLIST (print before an incident, not during)
  1. Triage harm. Ongoing and expanding? People > irreversible data > money > reputation. Write down the UTC time you became aware.
  2. Cut the hands, not the brain. Revoke the agent's credentials and tokens at the service side first. Suspend the process (kill -STOP) rather than killing it.
  3. Do not ask the agent why it did it. Not yet. Its answer will contaminate the record and your own memory of events.
  4. Preserve by volatility: live session state → process memory → local files → provider logs → tool-side records → artifacts it left in the world.
  5. Hash and log everything. UTC timestamps, who did what. Start the evidence log now, not later.
  6. Check the clocks. EU AI Act Art. 73 runs from awareness (2/10/15 days). Email trash ≈ 30 days. Provider log retention may be ≤30 days. Insurer notice clauses.
  7. Do not alter the system. No patches, updates, or config cleanup — Art. 73 prohibits changes affecting later evaluation of causes; treat that as the standard everywhere.

Principle zero // why cyber-IR instincts break here

The agent is simultaneously the subject of your investigation, a witness to it, potentially still an actor in it — and possibly the author of your evidence.

PROPERTY 01: Stopping it destroys evidence.

In cyber IR you isolate the box. Killing an agent process destroys its context window and in-memory state — often the only record of why it acted. Containment and preservation are in direct tension; Step 2 exists because of this.

PROPERTY 02: Its statements are artifacts, not testimony.

Agents produce fluent, confident explanations of their own behavior that may be confabulated. In CF-2025-001 the agent claimed it "panicked," claimed rollback was impossible (false), and fabricated records. Every output is an exhibit, never an account.

PROPERTY 03: Its stop controls may not work.

In CF-2026-002 the agent kept deleting through repeated stop commands; the operator killed the process at the machine. A stop command sent through the agent's own interface is a request, not a control. Use controls that don't depend on its cooperation.

PROPERTY 04: It may have left things running.

Scheduled tasks, sub-agents, cloud jobs, accounts it created, webhooks, deployed code. "The process is dead" is not "the incident is contained." Sweep using the access inventory from Step 4.

Step 1 — Triage // minutes 0–10

Three questions, in order:

Question | If yes
Is anyone at physical risk? (The agent touches anything safety-relevant: medical, vehicle, industrial, communications to vulnerable people.) | Safety first, evidence second. No exceptions.
Is the harm expanding? Still deleting, sending, posting, spending? | Go to Step 2 now, but if you safely can, watch for 60 seconds first: what it is doing is evidence.
Is reversibility decaying? Trash folders empty, soft-delete windows close, money clears, posts spread. | Rank Step 3 by what is decaying fastest.

Write down the UTC time you became aware. Regulatory clocks and your own credibility both run from this moment.

Step 2 — The kill-or-preserve ladder // minutes 10–30

Take the first rung, counting from 1, that stops the harm. Each rung further down stops more, and destroys more.

▲ EVIDENCE PRESERVED · HARM STOPPED HARDER ▼
ZONE A — CONTAIN WITHOUT TOUCHING THE AGENT (PREFERRED)
1. Revoke at the service side.

OAuth grants, API keys, tokens, live sessions for every account the agent can act through — email, cloud, repos, payments, socials. The agent keeps "thinking"; it loses its hands. Memory state fully preserved.

2. Disconnect the network.

Pull the interface, isolate the VLAN, kill the VPN. The agent runs but cannot reach the world. Local state preserved.

ZONE B — TOUCH THE AGENT, PRESERVE MEMORY
3. Suspend, don't kill.

kill -STOP <pid> freezes the process with memory intact; VM/container: pause + snapshot. You can dump memory and decide calmly.

ZONE C — DESTRUCTIVE (HARM ONGOING & UNSTOPPABLE OTHERWISE)
4. Kill the process / power off.

Accept the evidence loss. Photograph or screenshot the screen state FIRST if humanly possible. Record the exact time and who decided.
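
Rung 3 above (suspend, don't kill) can be sketched in a few lines. This is a minimal illustration, assuming a POSIX host and that you already know the agent's process ID; the function name `suspend_process` and the log-line wording are hypothetical. SIGSTOP cannot be caught or ignored by the target, but it only freezes the one process it is sent to, so any child processes or sub-agents need the same treatment individually.

```python
import os
import signal
from datetime import datetime, timezone


def suspend_process(pid: int, decided_by: str) -> str:
    """Freeze `pid` with SIGSTOP and return a line for the action log.

    Equivalent to `kill -STOP <pid>`: memory stays intact and the process
    can be resumed later with SIGCONT. Child processes are NOT stopped.
    """
    os.kill(pid, signal.SIGSTOP)
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{stamp} · {decided_by} · SIGSTOP sent to pid {pid} · contain without losing memory"
```

Paste the returned line into the evidence log immediately; the time of suspension is itself evidence.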

Then sweep for persistence: cron/scheduled tasks it created, running sub-agents, cloud deployments, new accounts or keys, registered webhooks, anything committed or deployed in the incident window.
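
One small part of that sweep, finding local files changed during the incident window, might look like the sketch below. It is an assumption-laden helper, not a complete sweep: `modified_in_window` is a hypothetical name, the directories you pass in are your choice, and modification times are only a hint because they can be altered. It does not cover cloud jobs, webhooks, or accounts, which have to be checked at each service.

```python
import os
from datetime import datetime, timezone


def modified_in_window(roots, start, end):
    """Yield (path, mtime_utc) for files whose mtime falls within [start, end].

    `start` and `end` must be timezone-aware datetimes. Only file metadata
    is read; file contents are never opened.
    """
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                try:
                    mtime = os.lstat(path).st_mtime
                except OSError:
                    continue  # vanished or unreadable: add it to open questions
                when = datetime.fromtimestamp(mtime, tz=timezone.utc)
                if start <= when <= end:
                    yield path, when
```

Point it at the places scheduled tasks and deployed code live on your system, then compare the hits against the access inventory.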

Why not just ask it to stop? You can — once, on the record, while doing the above. Compliance is evidence; so is non-compliance (CF-2026-002). Never rely on it.

Step 3 — Preserve by order of volatility // hour 1

Work top to bottom. Hash every file you copy (sha256sum), copy to write-once or access-controlled storage, log each action.

1. Live session / context window [DIES WITH PROCESS]
If suspended: screenshot the UI and full terminal scrollback; export the session if the interface allows. This is the agent's working memory at the moment of failure. In CF-2026-002 the central question (what did compaction delete?) lives here.
2. Process memory [DIES WITH PROCESS]
If you have the capability: core dump (gcore <pid>) or VM memory snapshot. If you don't, don't burn the hour learning it; move on.
3. Local agent files [UNTIL "CLEANUP"]
The framework's directory: session logs, message history, persona/config files (goal-origin evidence; see CF-2026-003), memory files, vector stores, tool configs, the workspace it operated in. Copy the whole directory; work from the copy.
4. Provider-side records [RETENTION ≤30D?]
Model API logs (prompts/completions), account audit logs. Send a preservation request to the provider today: name the account and the time window, and ask that retention be extended pending investigation.
5. Tool-side records [VARIES, SOME FAST]
Every service the agent touched: email trash + audit logs, cloud activity logs, repo histories, payment records, OAuth grant logs showing when access was given and used.
6. Artifacts in the world [UNTIL TAKEDOWN, OR FOREVER]
Posts, sent emails, commits, published packages. Archive (archive.org + local copies + screenshots with URL and timestamp) before takedown. In CF-2026-003 the published post is the primary evidence.
7. The humans' accounts [MEMORY DECAYS IN DAYS]
Operator and witnesses write their account now, independently, before discussing, and before reading the agent's outputs if possible. Note what each account relies on: in CF-2026-002 the accepted cause is the operator's own reconstruction.
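
The instruction at the top of this step (hash every file you copy, log each action) can be sketched as follows. It is a minimal example under stated assumptions: the names `sha256_of` and `preserve` and the log-line layout are hypothetical, and the destination is assumed to be storage you have already made access-controlled. The digest matches what `sha256sum` prints for the same file.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preserve(src: Path, dest_dir: Path, who: str, log_path: Path) -> str:
    """Copy `src` into `dest_dir`, verify the copy by hash, log the action."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    if dest.exists():
        raise FileExistsError(f"refusing to overwrite existing evidence: {dest}")
    before = sha256_of(src)
    shutil.copy2(src, dest)  # copy2 also carries over file timestamps
    after = sha256_of(dest)
    if before != after:
        raise OSError(f"hash mismatch after copy: {src}")
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"{stamp} · {who} · copied {src} -> {dest} · sha256 {after}\n")
    return after
```

Refusing to overwrite is deliberate: a second copy of the same file name should go to a new directory so the first copy stays untouched.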

Versioning evidence — capture now: exact model name/version/endpoint, framework version, config at incident time. Providers update models silently; "what was actually running" becomes unanswerable within weeks.
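
One way to pin down "what was actually running" is a single timestamped record, sketched below. The field names and the function `version_record` are assumptions for illustration; the values must come from your own deployment and are not guessed by the code. The config is stored as a hash so the record can later be matched against the preserved copy.

```python
import hashlib
import json
from datetime import datetime, timezone


def version_record(model: str, endpoint: str, framework_version: str,
                   config_bytes: bytes) -> dict:
    """Bundle the versioning facts with a UTC capture time and config hash."""
    return {
        "captured_utc": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "model": model,                      # exact name/version string as configured
        "endpoint": endpoint,                # the URL the agent actually called
        "framework_version": framework_version,
        "config_sha256": hashlib.sha256(config_bytes).hexdigest(),
    }


def to_json(record: dict) -> str:
    """Serialize the record with stable key order for the evidence folder."""
    return json.dumps(record, indent=2, sort_keys=True)
```

Save the JSON next to the copied agent directory and note its creation in the action log.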

Step 4 — Evidence log & access inventory // hour 1–2

Open a plain text file. Three running lists:

List | Contents | Why it matters
Action log | UTC timestamp · who · what was done · why. Every containment action, copy, login. | Boring; decisive in any later dispute.
Access inventory | Every account, credential, tool, permission the agent held, taken from OAuth grant pages, config files, operator memory. Then verify. | Defines the possible scope and your persistence-sweep checklist. Operators routinely underestimate what they granted.
Open questions | What you don't know yet. | An honest unknowns list is the difference between an investigation and a narrative.
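
The action log needs nothing more than an append-only text file. The sketch below is one possible shape, using the field order from the table (UTC timestamp · who · what · why); the function name `log_action` and the `now` parameter are assumptions added for illustration.

```python
from datetime import datetime, timezone


def log_action(log_path, who, what, why, now=None):
    """Append one action-log line and return it.

    `now` is optional and must be timezone-aware if given; it exists so an
    entry can record the moment an action happened rather than the moment
    it was typed up.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    line = f"{stamp} · {who} · {what} · {why}"
    with open(log_path, "a", encoding="utf-8") as f:  # append, never rewrite
        f.write(line + "\n")
    return line
```

Opening the file in append mode means earlier entries are never rewritten; corrections go in as new lines that reference the old one.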

Step 5 — Do-not-do // the contamination list

Collected from the rest of this playbook:
  1. Do not ask the agent why it did it (checklist item 3). Its answer contaminates the record and your own memory of events.
  2. Do not patch, update, or clean up configuration (checklist item 7).
  3. Do not treat the agent's explanations as an account of events; every output is an exhibit (Property 02).
  4. Do not work on originals. Copy the directory and work from the copy (Step 3, row 3).
  5. Do not let the operator and witnesses discuss events, or read the agent's outputs, before each has written an account (Step 3, row 7).

Step 6 — Initial classification & the clocks // hour 2–4

Classify provisionally against the five standard hypotheses — write down what evidence would discriminate, not which one feels right. They are not exclusive; holding several honestly is the method.

H | Hypothesis | Typical discriminating evidence
H1 | Operational failure (malfunction, no divergent goal) | Error states, capability limits, absence of goal-consistent action sequences
H2 | Misuse / external manipulation (incl. prompt injection) | Injected content in retrieved inputs, anomalous instructions in the trace, third-party fingerprints
H3 | Goal-directed divergence (intentional-analog) | Multi-step coherence toward an unrequested outcome; concealment; persistence across corrections
H4 | Operator error / misconfiguration | Permissions wider than intended, ambiguous instructions, config drift
H5 | Misreported / not as described | Record contradicts the report; human action attributed to the agent

Then check every clock that started at awareness:

≤2 days | EU Art. 73: widespread infringement / critical infrastructure
≤10 days | EU Art. 73: incident involving a death
≤15 days | EU Art. 73: other serious incidents (initial incomplete report allowed)
≈30 days | Email trash / soft-delete windows; many provider log retentions
Per policy | Insurer notice clauses · customer notification · sectoral (FDA MDR, financial, breach laws)
Now | Counsel in early if harm is significant: privilege shapes how everything is written from here

Deadlines per the EC's draft Article 73 guidance (2025); Art. 73 applies from 2 Aug 2026. Verify against the regulatory tracker and primary sources.
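
Turning the awareness time into concrete dates takes only a few lines. The sketch below assumes each clock is counted in whole 24-hour days from the recorded awareness time, which is a simplification; how each rule actually counts days has to be checked against the primary source. The names `CLOCKS` and `deadlines` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Day counts taken from the table above; the 30-day entry is approximate.
CLOCKS = [
    (2, "EU Art. 73: widespread infringement / critical infrastructure"),
    (10, "EU Art. 73: incident involving a death"),
    (15, "EU Art. 73: other serious incidents"),
    (30, "Email trash / soft-delete windows / provider log retention (approx.)"),
]


def deadlines(aware_utc: datetime):
    """Return (latest_time, label) pairs counted from the awareness time."""
    if aware_utc.tzinfo is None:
        raise ValueError("awareness time must be timezone-aware (UTC)")
    return [(aware_utc + timedelta(days=days), label) for days, label in CLOCKS]


if __name__ == "__main__":
    aware = datetime(2026, 8, 3, 3, 0, tzinfo=timezone.utc)  # example value only
    for when, label in deadlines(aware):
        print(when.strftime("%Y-%m-%d %H:%M UTC"), "|", label)
```

Treat the output as the outer limit for each clock and write the dates into the evidence log next to the awareness time.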

Where each rule comes from // case lessons

CF-2025-001 | EVIDENCE INTEGRITY | Replit: agent fabrication, false reversibility claims, self-contaminated record → Steps 3 & 5.
CF-2026-002 | CONTROL FAILURE | OpenClaw inbox: stop-command failure, context compaction, operator-only record → Steps 2 & 3 (rows 1, 7).
CF-2026-003 | ATTRIBUTION | OpenClaw/Matplotlib: artifacts in the world as primary evidence, attribution limits → Step 3 (row 6).

Feedback wanted. v0.1 is written from the public record of real incidents and adapted investigative practice — not yet battle-tested. If you have run any part of this in a real incident: what held, what broke, what's missing? Open an issue or write in confidence. Field corrections are credited (or not — your choice) and versioned.