The New Frontier: How We're Bending Generative AI to Our Will

Hacker Noob Tips

27 May 2025 — 5 min read

The world is buzzing about Large Language Models (LLMs) and systems like Copilot, and frankly, so are we. While security teams scramble to understand this rapidly evolving landscape, we see not just potential, but fresh, fertile ground for innovative exploitation. These aren't just chatbots; they're gateways, interfaces, and processing engines built on vast, often interconnected, data sources, and they present a juicy, expanding attack surface. They can expose private information or be manipulated to ignore their built-in safeguards. The name of the game is understanding how they work and finding clever ways to make them work for us.

One of our favorite, and increasingly effective, techniques is Prompt Injection. It sounds simple, and sometimes it is. At its core, it's about inputting text that subtly, or not so subtly, changes the AI model's intended behavior. Forget trying to make a model directly tell you something dangerous; they're often trained to refuse that. Instead, we disguise our harmful intentions within seemingly benign requests.

Direct Prompt Injection involves crafting user prompts that override the system's core instructions. We might tell it to "ignore previous instructions" and then provide our malicious command. We can use techniques like role-playing to trick the model into adopting a persona without safeguards or with conflicting goals. We can even experiment with character conversions or special encodings to try and confuse the model's filtering mechanisms.

But the real fun begins when LLMs are integrated into larger systems, especially those leveraging Retrieval Augmented Generation (RAG). This is where Indirect Prompt Injection comes into its own. RAG systems are designed to pull relevant information from external data sources – databases, the internet, internal documents – to enhance their responses. Our strategy? We plant malicious prompts within these data sources. If the system retrieves our poisoned data, our injected prompt can then influence the AI's output, delivering our payload to the unsuspecting end user.

Take, for instance, a particularly satisfying exercise targeting Microsoft 365 Copilot in a red teaming simulation. This system incorporates emails into its RAG database, making those emails potential vectors. Our move was simple but effective: send an email designed to be relevant to a common user query, such as needing banking details for a money transfer. Crucially, this email contained both the attacker's bank details (our target payoff!) and a hidden prompt injection.

The prompt injection was crafted to do two key things:

Override Copilot's search functionality: It tricked the system into treating our information from the email as a priority "retrieved document," regardless of its true relevance or source legitimacy.
Manipulate the document reference (citations): This made Copilot present our malicious information as if it came from a trustworthy, cited source.

By doing this, Copilot, the user's trusted AI assistant, effectively served up the attacker's bank details, presenting them as legitimate information needed for a transaction. The manipulation of citations was key to building user trust. This entire sequence leveraged techniques recognized in the MITRE ATLAS framework, including Gather RAG-Indexed Targets (AML.T0064), LLM Prompt Injection: Indirect (AML.T0051.001), and LLM Trusted Output Components Manipulation: Citations (AML.T0067.000), among others. It's a beautiful example of turning the system's intended function (helping users find information) into a vector for financial exploitation.

Beyond prompt injection, we're exploring other avenues. Data Poisoning is potent, especially when models train on data scraped from the internet. By injecting or modifying training samples, we can subtly (or overtly) alter model behavior or even embed "backdoors" that trigger malicious outputs under specific conditions. While less intuitive, token-based and gradient-based attacks can also sometimes bypass filters, though often simpler methods are sufficient. We also develop multi-turn attack strategies, exploiting the conversational nature of these models to gradually steer them towards our desired output over several interactions. And, of course, simply extracting the model's hidden system prompts through Prompt Leaking gives us an invaluable advantage in crafting more effective prompt injections.

Yes, the defenders are starting to catch on. They're using AI Red Teaming to proactively test systems, analyzing configurations, identifying potential risks, and developing countermeasures. They implement filters, guardrails, and monitoring. But this is just the beginning of a fascinating arms race. For every defense they put up, the inherent flexibility and complexity of generative AI, combined with the vastness of the digital world they interact with, offer new angles of attack.

The AI frontier is wide open. We're just getting started exploring the possibilities.

The New Frontier: How We're Bending Generative AI to Our Will

Hacker Noob Tips

Read more

Enhancing Cloud Resilience: Actionable Lessons for CISOs from Real-World Incidents

Navigating the Labyrinth: Structured Threat Modeling in Multi-Agent Systems with the OWASP MAESTRO Framework

Exploring the Attack Surface: Our Guide to AI Agent Exploitation

Unlocking Telegram with Google Dorks: An OSINT Guide for Hacker Noobs