AI Security & LLM Red Teaming
A set of experiments and tooling around LLM security: prompt injection, jailbreak discovery, dangerous content detection, and output hardening. This connects directly with HaloX / Black Halo security workflows.
Techniques Explored
- Prompt injection and instruction-hijack tests.
- Automated jailbreak search against policy prompts.
- Adversarial examples and perturbation-based attacks.
- Safety classifier integration and ensemble-style checks.
Integration With HaloX
Many of the ideas here inform how HaloX and Black Halo use AI in their pipelines: from screening generated remediation to scoring findings for risk, to identifying when an LLM is drifting into unsafe or low-confidence territory.