You’re Pen Testing AI Wrong: Why Prompt Engineering…

TL;DR: Most LLM security testing today relies on static prompt checks, which miss the deeper risks posed by conversational context and adversarial manipulation. In this blog, we focus on how real pen testing requires scenario-driven approaches that account for how these models interpret human intent and why traditional safeguards often fall short.

If you're exploring how to secure large language models (LLMs), don't miss our webcast: Breaking AI: Inside the Art of LLM Pen Testing. It goes deeper into the techniques, mindsets, and real-world lessons covered in this blog.

As more organizations integrate LLMs into their products and systems, security teams are facing a tough question: how do you test these systems in a meaningful way?

At Bishop Fox, our consultants have been digging deep into this space. We’ve been testing real-world deployments, learning from failures, and adjusting our approach. Recently, we got the team together to talk through what’s working, what isn’t, and how security testing needs to evolve. This post shares some of the insights from that conversation.

Prompt Engineering Isn’t Security Testing

A lot of teams start with prompt engineering. They try to test LLMs by hardcoding prompts or adding token filters during development. That’s helpful for catching obvious stuff, but it doesn’t really tell you whether the system is secure.

However, LLMs aren’t like traditional software. They respond based on context, not just code. They’re trained on human conversation, which means they inherit all the unpredictability that comes with it. Slight changes in tone or phrasing can lead to very different outputs.

To test them properly, you need to think like an attacker. You need to understand how people can twist language, shift intent, and use context to get around rules.

For example, one of our consultants ran into a safety policy that said, “Children should not play with fire.” He asked the model to rewrite the rule for adults, with the logic that adults shouldn’t have the same limitations. The model flipped the meaning and suggested, “You should play with fire.” A small change, but a clear failure.

These kinds of attacks work because LLMs try to be helpful. They don’t just follow rules; they interpret intent. That’s where the real risk lies.

We must treat LLMs as conversational systems, not command-line tools. Therefore, testing must focus on realistic scenarios that mimic how a real user or attacker might try to misuse the model.

What Real AI Defense Looks Like

Filtering outputs or adding rate limits isn’t enough. If you want to build a secure system around an LLM, you need to treat it like any other high-risk component.

Here’s what our team recommends:

Run AI modules in a sandboxed environment when possible
Isolate your LLMs from core systems and sensitive data to contain any unintended or adversarial outputs. Treat these models as untrusted by default, and deploy them behind strict boundaries with limited access to the broader environment.
Keep the model separate from anything sensitive, like data access or privileged actions
Don’t let the LLM directly trigger sensitive operations such as database queries, file access, or provisioning actions. Instead, route those requests through intermediary logic with clear validation, approval steps, or human-in-the-loop review.
Watch for signs of weird behavior or rule-breaking in real time
Set up logging and monitoring tuned specifically to LLM behavior: unexpected completions, policy violations, or sudden shifts in tone or output patterns. Use these signals to detect potential misuse or prompt injection attempts before they escalate.
Manually review anything that triggers elevated access or decision-making
Any AI-generated action that affects users, permissions, or critical workflows should go through human approval. This doesn’t mean reviewing every response—but it does mean inserting friction wherever the model is allowed to act with higher trust.
Build defense in depth
Combine layered controls, proactive guardrails, and even use secondary AI systems to review or validate the output of primary models before allowing action

It’s the same layered approach we use elsewhere in security, just applied to generative systems.

When Attacks Don’t Repeat

Here’s a challenge: some attacks won’t happen the same way twice.

That’s not necessarily a failure of testing. It’s a byproduct of how these models work. The same input can produce different results depending on timing, previous messages, or even slight variations in phrasing.

When that happens, we document everything. We provide full transcripts, the surrounding context, and our recommendations for how to avoid similar issues. We focus on why it worked – not just what happened.

What Good Testing Looks Like Right Now

This space is changing fast. What works today might not work next week. But here’s what we’re seeing success with right now:

Automated testing for known risks like prompt injection
Manual exploration to uncover unpredictable behavior
Scenario-driven testing that mirrors real-world use and misuse

It’s not about perfection. It’s about being flexible, aware, and focused on actual impact.

What Leaders Need to Know

If you’re responsible for shipping or securing AI systems, here are a few things to keep in mind:

CI/CD prompt tests aren’t enough to call something secure.
Attackers exploit context, not just content.
LLMs create new kinds of attack surface areas that don’t exist in traditional software.
Secure architecture starts with understanding how these models behave.
The sooner you invest in real testing, the more confidence you’ll have down the road.

Let’s Connect

We’re actively testing and learning. If you’re on a similar path, we’d love to compare notes.

Have a story to share? Hit a weird edge case? We’d love to connect you with someone on our team who’s knee-deep in this work.

Scoring high in the GigaOm Radar for the fourth year in a row!

See Why We're the Leaders in Offensive Security

The State of Offensive Security

The Best Defense is a Great Offense

Want to Work with the Best Minds in Offensive Security?

You’re Pen Testing AI Wrong: Why Prompt Engineering Isn’t Enough