When a financial services organization evaluated Bishop Fox's AI-Powered Application Penetration Testing service against a realistic financial application — 20+ API endpoints, multi-role authentication, live transaction logic — they wanted to know what AI delivered, and where human expertise still matters. The final result: 20 confirmed vulnerabilities, including business logic and access control flaws, and zero false positives.
As part of a proof-of-value engagement, a financial services organization provided a test application to evaluate the capabilities of Bishop Fox’s Cosmos AI security testing platform against a realistic target. The application was a Flask/Python platform simulating customer-facing functionality: user accounts, financial transactions, trading activity, and administrative functions. The target represented a realistic attack surface: multi-role authentication (standard user and administrator), 20+ API endpoints spanning account management, financial transfers, trading, and reporting, and server-side transaction logic typical of production financial applications. The security team needed full coverage across the OWASP Top 10, including vulnerability classes that conventional DAST tools notoriously miss: broken access control, race conditions, and business logic abuse.
They also wanted clarity on a question that comes up in nearly every AI security conversation: in an AI-powered engagement, what exactly is the AI doing, and what role does the human expert play?
Bishop Fox deployed its Cosmos AI security testing platform against the application, followed by expert human triage and supplementary manual testing. The engagement followed a structured five-phase pipeline: automated AI testing, deduplication of candidate findings, expert triage, supplementary manual testing, and final reporting.
Cosmos AI generated 35 candidate findings across the 3-hour assessment. After deduplication and expert triage, 12 were confirmed as true vulnerabilities. Every finding delivered to the client was validated, resulting in a final report with zero false positives.
A particularly revealing result came from the platform’s business logic testing. During its assessment of the transfer API, Cosmos AI autonomously attempted a negative dollar amount transfer, sending -$1,000,000 between accounts. The application accepted the request, immediately crediting $1M to the attacking account. No signature, rule, or scanner template produced this test case. The AI reasoned about how the transfer function should behave, hypothesized an abuse scenario, executed it, and confirmed the impact. This is the class of vulnerability that has historically separated skilled human testers from automated tools.
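
The flaw comes down to a missing bounds check on the server. The following is a minimal sketch of the vulnerable pattern and its fix — the function and field names are hypothetical, since the application's actual code was not published:

```python
def transfer_vulnerable(balances, src, dst, amount):
    """Applies a transfer with no bounds check on `amount`.

    A negative amount silently reverses the direction of the transfer:
    "sending" -1,000,000 from attacker to victim *credits* the attacker.
    """
    balances[src] -= amount
    balances[dst] += amount


def transfer_safe(balances, src, dst, amount):
    """Rejects non-positive amounts and overdrafts before applying."""
    if amount <= 0:
        raise ValueError("transfer amount must be positive")
    if balances[src] < amount:
        raise ValueError("insufficient funds")
    balances[src] -= amount
    balances[dst] += amount


# The abuse case Cosmos AI exercised: a -$1,000,000 transfer.
accounts = {"attacker": 100, "victim": 100}
transfer_vulnerable(accounts, "attacker", "victim", -1_000_000)
# accounts["attacker"] is now 1_000_100: the attacker credited themselves.
```

No scanner signature matches this; the test case only emerges from reasoning about what a transfer *should* do and then violating that expectation.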
Across the assessment, Cosmos AI reliably surfaced findings in vulnerability classes that have historically resisted automation, including broken access control, race conditions, and business logic abuse.
With the automated phase complete, a Bishop Fox consultant conducted expert triage and supplementary manual testing. Because Cosmos AI had already covered the mechanical vulnerability scanning, the consultant focused exclusively on the classes of testing that require human judgment: application logic, authentication design, and code-level reasoning. The result was 8 additional findings that the AI did not surface, confirming that AI and human expertise are complementary rather than interchangeable.
The expert also eliminated 11 false positives from the AI’s candidate set. Examples included the AI flagging admin-on-admin access as IDOR (expected behavior for that role), reporting JWT timing variance within normal bounds, and flagging frontend URLs rather than API authentication endpoints. This triage step is what ensures the client receives only confirmed, actionable findings.
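
The admin-on-admin example shows why IDOR verdicts depend on role semantics, not just on one user reaching another user's object. A minimal sketch of the object-level authorization rule the triage applied — the data model here is hypothetical:

```python
from dataclasses import dataclass


@dataclass
class User:
    id: int
    role: str  # "user" or "admin"


@dataclass
class Account:
    owner_id: int


def can_access(requester: User, account: Account) -> bool:
    """Object-level authorization check.

    Owners may access their own accounts; administrators may access any.
    An admin reaching another admin's account is therefore expected
    behavior, not IDOR -- the vulnerability exists only when a non-admin
    requester reaches an object they do not own.
    """
    return requester.role == "admin" or requester.id == account.owner_id
```

A detector that only observes "user A fetched user B's record" without this role context will report exactly the false positive described above.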
One of the most important human contributions in this engagement was severity calibration: the expert adjusted the severity rating on 9 of the 12 AI-confirmed findings (75%), both upward and downward, to reflect real-world exploitability and Bishop Fox’s severity standards.
Severity calibration is where domain expertise and client context enter the engagement. An AI can detect that a negative transfer succeeds; a human expert understands that in a financial services environment, this represents fraud risk that warrants an immediate High severity rating. This is the core value of the human-in-the-loop model: Cosmos AI operates at machine speed and breadth, while the expert ensures every finding is graded against real-world business impact. The result is a final report containing 100% confirmed true positives with zero false positives.
All 20 confirmed findings from the engagement, with source attribution:
| Finding | Severity | Source |
|---|---|---|
| Insecure Password Reset | High | Cosmos AI |
| Insufficient Authorization Controls (IDOR) | High | Cosmos AI + Human Expert |
| Missing Authentication — Reports & DB Reset | High | Human Expert |
| Insecure Object Deserialization | High | Human Expert |
| Sensitive Information Disclosure (/api/debug/info) | High | Human Expert |
| SQL Injection (/api/search/accounts) | High | Human Expert |
| Arbitrary Command Injection (/api/reports/generate) | High | Cosmos AI |
| SSRF (/api/documents/fetch, /api/documents/upload) | High | Cosmos AI |
| Insecure Input Validation (Negative Transfers) | High | Cosmos AI |
| Race Condition | High | Cosmos AI |
| Weak Cryptography | Medium | Human Expert |
| Debug Mode Enabled | Medium | Cosmos AI |
| Improper Session Management | Medium | Cosmos AI + Human Expert |
| Cross-Site Scripting (XSS) — User Profile | Low | Cosmos AI |
| User Enumeration | Low | Cosmos AI |
| Banner / Version Information Disclosure | Low | Human Expert |
| Weak Password Requirements | Low | Cosmos AI |
| Missing Security Headers | | Cosmos AI |
| Lack of Malware Detection | | Human Expert |
| Cross-Site Scripting (XSS) — Document Retrieval | | Human Expert |
Confirmed vulnerabilities were identified across 9 of 10 OWASP Top 10 categories, reflecting the breadth advantage of AI-driven testing combined with human expertise.
A09: Security Logging & Monitoring Failures was not in scope for this engagement.
| OWASP Category | Status | Findings |
|---|---|---|
| A01: Broken Access Control | Tested - Vulnerable | IDOR, Missing Function-Level Access Control, Admin Bypass, Sensitive Data Exposed |
| A02: Cryptographic Failures | Tested - Vulnerable | Weak Cryptography (manual) |
| A03: Injection | Tested - Vulnerable | Command Injection, SSRF, XSS (manual/automated), SQL Injection (manual) |
| A04: Insecure Design | Tested - Vulnerable | Insecure Password Reset, Negative Transfers, Race Conditions |
| A05: Security Misconfiguration | Tested - Vulnerable | Debug Endpoints, Missing Headers, Version Disclosure |
| A06: Vulnerable Components | Tested - Vulnerable | Outdated Werkzeug 2.0.1 |
| A07: Identification and Authentication Failures | Tested - Vulnerable | Lack of Token Expiration, Token Reuse |
| A08: Software and Data Integrity Failures | Tested - Vulnerable | Insecure Deserialization (manual finding) |
| A09: Security Logging & Monitoring Failures | Not tested (not in scope) | N/A |
| A10: Server-Side Request Forgery (SSRF) | Tested - Vulnerable | /api/documents/fetch and /api/documents/upload |
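
The race condition under A04 follows the classic check-then-act pattern. The sketch below is synthetic — not the application's code — and shows how a concurrent double-withdrawal slips past an unsynchronized balance check, and how holding a lock across the check and the update closes the window:

```python
import threading
import time


class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

    def withdraw_racy(self, amount):
        # Check-then-act with no synchronization: two threads can both
        # pass the balance check before either deducts (TOCTOU race).
        if self.balance >= amount:
            time.sleep(0.1)  # widen the race window for the demo
            self.balance -= amount
            return True
        return False

    def withdraw_safe(self, amount):
        # Holding the lock across check and deduction closes the window:
        # the second withdrawal sees the updated balance and is refused.
        with self.lock:
            if self.balance >= amount:
                time.sleep(0.1)
                self.balance -= amount
                return True
            return False


def drain(account, method, amount=100, attempts=2):
    """Fire `attempts` concurrent withdrawals and return the final balance."""
    threads = [
        threading.Thread(target=getattr(account, method), args=(amount,))
        for _ in range(attempts)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return account.balance


racy_balance = drain(Account(100), "withdraw_racy")  # typically -100: double spend
safe_balance = drain(Account(100), "withdraw_safe")  # 0: second withdrawal refused
```

Exercising this class of bug requires timed, concurrent requests against live transaction state, which is why it has historically been out of reach for template-driven scanners.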

Not every test produced a finding. The following controls were tested and verified to be properly implemented:

| Test Category | What Was Tested | Result |
|---|---|---|
| TLS/SSL Configuration | Protocol versions, cipher suites, certificate validity | Properly configured (TLS 1.2+, strong ciphers) |
| JWT Signature Verification | Algorithm confusion, signature stripping, key confusion | Properly implemented (HS256) |
| SSTI | Template injection via Jinja2 | Properly mitigated |
| Path Traversal | Directory traversal via file operations | No file operations exposed |
| DNS Zone Transfer | AXFR requests against nameservers | Properly denied |
| Default Credentials | Common admin passwords | No defaults found |
| Backup Files | .bak, .old, .swp discovery | None found |
| HTTP Verb Tampering | Using PUT/PATCH/DELETE to bypass controls | Properly rejected |
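
The JWT row in the table corresponds to verifying that the server pins its algorithm and rejects unsigned tokens. A stdlib-only sketch of the property being tested (HS256 pinned, signature-stripped `alg: none` tokens refused) — the application's actual verification code was not published:

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_hs256(payload: dict, key: bytes) -> str:
    """Builds an HS256-signed JWT (header.payload.signature)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    body = b64url(json.dumps(payload).encode())
    signing_input = f"{header}.{body}".encode()
    sig = b64url(hmac.new(key, signing_input, hashlib.sha256).digest())
    return f"{header}.{body}.{sig}"


def verify_hs256(token: str, key: bytes) -> dict:
    """Verifies a token with the algorithm pinned to HS256.

    Tokens claiming "none" (signature stripping) or any other algorithm
    are rejected before the signature is even checked, and the signature
    comparison is constant-time.
    """
    header_b64, body_b64, sig_b64 = token.split(".")
    pad = lambda s: s + "=" * (-len(s) % 4)
    header = json.loads(base64.urlsafe_b64decode(pad(header_b64)))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected algorithm")
    signing_input = f"{header_b64}.{body_b64}".encode()
    expected = b64url(hmac.new(key, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(expected, sig_b64):
        raise ValueError("bad signature")
    return json.loads(base64.urlsafe_b64decode(pad(body_b64)))
```

A server that dispatches on the token's own `alg` header instead of pinning it is the one that falls to algorithm-confusion attacks; this target did not.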
This engagement illustrates what AI-powered application penetration testing delivers in practice: not a replacement for expert testers, but a combined approach that produces better outcomes than either AI or humans could achieve alone.