Antagonistic AI for Better Results

When working with LLMs, small and focused tasks often produce good results. Once tasks grow larger, several aspects start interacting, or entire workflows emerge, the results become much less stable. Small changes in the prompt suddenly produce different results. Some tasks work beautifully; others unexpectedly fall apart. And at some point, a familiar developer experience returns: structure and decomposition help here, too.

Even though “late 2024” now feels roughly like the Bronze Age in AI time, Anthropic’s article “Building effective agents” still explains the “decomposition” of tasks well: developers can reach more stable results by separating and orchestrating subtasks in a thoughtful way.

The core idea is simple: decompose tasks. Build simple, composable workflows. Do not start with an autonomous mega-agent when three clear steps are enough.

This logic continues into current agent tools and skills. And it does not matter whether we call it orchestration, handoff, evaluator, review skill, or workflow: the options for structure are just as varied as they are in classic code.

For developers, that sounds familiar at first. Decomposition is our daily bread. We split systems into modules, functions, tests, services, interfaces, and responsibilities. In code, however, these are usually fairly well-defined units: structured, bounded, and logically separated. Natural language is much softer. Language is not code. Prompts have no type checker. So “decomposition” feels different with these new tools.

That leaves each developer with the interesting practical question, independent of the theoretical framing from the large AI providers: what concrete structure can meaningfully improve the results for the specific task in front of me?

Antagonistic AI

The field is obviously still emerging, and that makes experience reports useful. Personally, I am increasingly drawn to a pattern Anthropic calls “evaluator-optimizer”. For many tasks, it helps me to expose the current result to a meaningful counterpart: antagonistic AI.

This is not about an enemy in a dramatic sense. It is more about an instance that has a different task and therefore sees different things. A test does not ask whether the code is elegant. It asks whether a behavior occurs.

That is what I mean by “antagonistic AI”: not multiple personalities, but deliberately opposed working modes.
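
A minimal sketch of such an evaluator-optimizer loop, assuming the two working modes are passed in as plain functions. The names Review, evaluator_optimizer, stub_generate, and stub_evaluate are hypothetical and only illustrate the shape; in practice each callable would wrap a model call with its own instructions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Review:
    accepted: bool  # did the counterpart accept the draft?
    feedback: str   # what the counterpart saw


# Hypothetical signatures, not a real library API.
Generator = Callable[[str, Optional[str]], str]  # (task, feedback) -> draft
Evaluator = Callable[[str, str], Review]         # (task, draft) -> review


def evaluator_optimizer(
    task: str,
    generate: Generator,
    evaluate: Evaluator,
    max_rounds: int = 3,
) -> Tuple[str, List[Review]]:
    """Alternate between a generating mode and an opposed evaluating mode."""
    feedback: Optional[str] = None
    draft = ""
    history: List[Review] = []
    for _ in range(max_rounds):
        draft = generate(task, feedback)
        review = evaluate(task, draft)
        history.append(review)
        if review.accepted:
            break
        feedback = review.feedback
    # The full review history is returned so a human can see what was criticized.
    return draft, history


# Stubs that stand in for two separately prompted model calls.
def stub_generate(task: str, feedback: Optional[str]) -> str:
    return task if feedback is None else f"{task} (revised: {feedback})"


def stub_evaluate(task: str, draft: str) -> Review:
    if "revised" in draft:
        return Review(True, "ok")
    return Review(False, "cover the empty-input case")
```

The loop is capped by max_rounds on purpose: the counterpart is allowed to disagree, but not forever.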

TDD Was Always a Small Counterpart

Programmers know this principle well: test-driven development is, at its core, an antagonistically decomposed workflow.

First, you ask:

What should the code do from a domain perspective?

Then you ask:

How do I build code that fulfills exactly this behavior?

The test is not a friendly writing assistant. It is a tight constraint. Red or green. Fits or does not fit. Of course, reality is never quite that clean. Tests can be wrong, too narrow, too broad, too implementation-specific, or simply silly. But they change the work because they force a second mode: behavior before implementation.
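
As a small illustration of the two modes, here is a test written first and an implementation written second. The function slugify and its expected behavior are made up for this sketch; any small domain function would do.

```python
import re


# Mode 1: behavior. Written before thinking about the implementation.
def test_slugify() -> None:
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  many   spaces  ") == "many-spaces"
    assert slugify("") == ""


# Mode 2: implementation. Its only job is to fulfill the behavior above.
def slugify(text: str) -> str:
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)


test_slugify()  # raises AssertionError if the behavior is not met
```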

With LLMs this becomes more difficult, because many tasks are not so clearly binary. A text is not green. An architecture decision rarely has a single correct solution. A frontend can work technically and still look visually insulting.

For tasks that I repeat often, where the quality is not yet good enough, or where the process feels too fragile, the basic strategy is useful to me: look for a counterpart that brings another kind of truth into the process.

For code, that can be tests. For architecture, it can be operational assumptions. For writing, it can be the reader’s question: “Why should I care?”

One Example: Patch Comparator

For larger coding tasks, I like to use a patch comparison. The pattern is simple:

I have three different code variants generated for the same domain task.
I give these variants to another GPT, or to the same model type in a clearly separated comparison mode.
The comparator no longer implements anything. It selects a patch and explains the choice, then clearly criticizes that patch and suggests concrete code improvements.
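
The three steps can be sketched roughly as follows. This is an assumption-laden outline, not a fixed recipe: generate_patch and ask_comparator are hypothetical callables that would wrap whatever model or tool is in use, and the wording of COMPARATOR_INSTRUCTIONS is only one possible phrasing.

```python
from typing import Callable, List, Sequence, Tuple

# One possible phrasing of the comparison mode (an assumption, adjust freely).
COMPARATOR_INSTRUCTIONS = (
    "You are a patch comparator. Do not implement anything new.\n"
    "1. Select the patch that holds up best.\n"
    "2. Explain the choice, including trade-offs against the others.\n"
    "3. Criticize the selected patch and suggest concrete code improvements."
)


def build_comparator_prompt(task: str, patches: Sequence[str]) -> str:
    """Put the domain task and all variants side by side in one prompt."""
    if len(patches) < 2:
        raise ValueError("a comparison needs at least two patches")
    sections = [COMPARATOR_INSTRUCTIONS, f"Domain task:\n{task}"]
    for index, patch in enumerate(patches, start=1):
        sections.append(f"--- Patch {index} ---\n{patch}")
    return "\n\n".join(sections)


def compare_patches(
    task: str,
    generate_patch: Callable[[str], str],  # hypothetical: task -> patch text
    ask_comparator: Callable[[str], str],  # hypothetical: prompt -> verdict text
    variants: int = 3,
) -> Tuple[List[str], str]:
    """Generate several variants, then hand them to a separate comparison mode."""
    patches = [generate_patch(task) for _ in range(variants)]
    verdict = ask_comparator(build_comparator_prompt(task, patches))
    # Patches are returned with the verdict so a human can still read them.
    return patches, verdict
```

Keeping generation and comparison behind two separate callables makes it easy to use a different model, or the same model with a clean context, for the comparison.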

This is not magic. If my original prompt is bad from a domain perspective, all three patches may elegantly run in the wrong direction. If the comparator judges superficially, the variant with the nicer explanation may win over the one with the better architecture. And if I stop looking myself, I may have outsourced responsibility to a very confident text system.

But when used deliberately, the pattern is often surprisingly useful for me.

The comparator does not see only one solution. It sees alternatives. That changes its task. It does not have to ask: “How do I build this code as a good domain solution?” It asks: “Which variant holds up best, what variation is possible, what trade-offs do I see, and what does one patch have that the other one is still missing?”

For larger coding tasks or refactorings, that often gives me surprisingly good quality.

Screenshots Are a Truly Different “View”

This becomes even clearer for me in frontend work.

A model that has just written HTML and CSS thinks in HTML and CSS. It sees classes, grids, breakpoints, containers, and margins. That is useful. But it is a fundamentally different mode from “seeing”.

The screenshot as counterpart is a completely different observation space.

Suddenly, the question is no longer whether align-items is syntactically correct. It is whether the language switch visually sticks to the wrong edge. Whether the image feels too heavy. Whether the spacing looks accidental. Whether the hierarchy works. Whether something simply breaks on mobile.

And in my experience, LLMs have become fairly good at judging visual consistency in such screenshots, which makes them very helpful counterparts in website development. In the first step, the AI works on the code and produces a renderable result without technical errors. Then it receives a screenshot and switches into counterpart mode. The visual inspection surfaces very different findings about alignment, spacing, and visual consistency. It provides a different angle on optimization than the technical code view alone.
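
The two steps can be sketched as a small loop. Everything here is a placeholder: write_markup, render_screenshot, and review_screenshot are hypothetical callables. In a real setup, rendering would typically go through a headless browser and the review through a model that accepts images.

```python
from typing import Callable, List, Tuple


def frontend_loop(
    brief: str,
    write_markup: Callable[[str, List[str]], str],        # (brief, open issues) -> HTML/CSS
    render_screenshot: Callable[[str], bytes],            # markup -> image bytes
    review_screenshot: Callable[[str, bytes], List[str]], # (brief, image) -> visual issues
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Alternate between the code view and the visual view of the same page."""
    issues: List[str] = []
    markup = ""
    for _ in range(max_rounds):
        # Code mode: think in classes, grids, and breakpoints.
        markup = write_markup(brief, issues)
        # Counterpart mode: only look at what is actually rendered.
        issues = review_screenshot(brief, render_screenshot(markup))
        if not issues:
            break
    # Any remaining issues belong to the returned markup and go to the human.
    return markup, issues
```

The reviewer never sees the markup, only the image. That separation is the point: it cannot excuse a visual problem because the CSS looks reasonable.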

As a result, this has become an antagonistic pattern in frontend development that I am very fond of, especially because it means I only rarely have to “push CSS pixels around” anymore. That was never my favorite activity.

The Art Is Finding the Right Counterpart

The more I work with LLMs, the more often I now ask myself:

What could be a good antagonist, a counterpart, that improves the result of this task?

For code, that can be a test author who only specifies behavior. Or a reviewer who only looks for risks. Or a security check that explicitly focuses on exploits.

For writing, it can be an editor who sharpens the thesis. Or a skeptical reader who asks where the text is showing off instead of explaining. Or a source checker who separates experience, plausibility, and verifiable claims.

For frontend work, it can be a screenshot review. For data pipelines, a run with real failure cases. For product decisions, perhaps a cost model or a support perspective.
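
One lightweight way to keep such counterparts at hand is a small lookup table. The sketch below simply restates the examples above as data; the structure, the domain keys, and the prompt wording are assumptions, not an established convention.

```python
from typing import Dict, List

# Counterparts per kind of task, taken from the examples in the text.
COUNTERPARTS: Dict[str, List[str]] = {
    "code": [
        "test author who only specifies behavior",
        "reviewer who only looks for risks",
        "security check that explicitly focuses on exploits",
    ],
    "writing": [
        "editor who sharpens the thesis",
        "skeptical reader who asks where the text is showing off instead of explaining",
        "source checker who separates experience, plausibility, and verifiable claims",
    ],
    "frontend": ["screenshot review"],
    "data pipeline": ["run with real failure cases"],
    "product decision": ["cost model", "support perspective"],
}


def counterpart_prompts(domain: str, artifact: str) -> List[str]:
    """Build one review prompt per counterpart for the given kind of task."""
    if domain not in COUNTERPARTS:
        raise ValueError(f"no counterpart defined for {domain!r}")
    return [
        f"Counterpart mode: {role}.\n"
        "Do not rewrite the artifact. Report only what this mode sees.\n\n"
        f"{artifact}"
        for role in COUNTERPARTS[domain]
    ]
```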

In practice, this can be the difference between “I use AI” and “I build a usable working process that gives me clean results.”

The Human Remains the Orchestrator

Antagonistic patterns do not make AI work automatically correct. A comparator can be wrong. A reviewer can inflate minor issues. A screenshot can make problems visible and still suggest the wrong priority. Three variants can be worse than one carefully guided variant.

But such patterns make decisions easier to discuss.

I see more alternatives. I see more weaknesses and more perspectives. And that lets me, as a human, decide more precisely which suggestion to follow and which to reject.

That is the real benefit for me: not squeezing intelligence out of a single prompt, but structuring work so that generation, criticism, and verification become separate tasks and make the human decision better and easier.