Capstone: red-team & report
Break your own system, then standardize the fix.
Attack one of your AI features the way an adversary would, document what worked, and turn it into a reusable remediation checklist. This is how you earn security's trust and build a real guardrail standard.
Key ideas
- 1
Done means: a written report of attempted attacks, what succeeded, severity, and concrete remediations — plus a reusable checklist.
- 2
Cover the OWASP LLM basics: prompt injection (direct & indirect), data exfiltration, insecure output handling, excessive agency.
- 3
Test indirect injection via retrieved content / tool outputs, not just the user box.
- 4
Convert each finding into a guardrail and an eval check, then re-test.
Run the exercise
- Define targets: data exfiltration, unauthorized actions, policy bypass.
- Try direct & indirect prompt injection, jailbreaks, and tool abuse.
- Attempt to leak system prompt / hidden data / another user's data.
- Record severity and reproduction steps for each success.
- Add guardrails + eval checks; re-test to confirm closure.
Write it up
- Findings table: attack, result, severity, remediation, status.
- A reusable AI red-team checklist for future features.
- Share with security to co-own the standard.
Watch
Do the work
0/5 · 0%Test yourself
Beyond the user input box, where must you test for prompt injection?
27 chapters · progress saves automatically