Travellers in a Foreign Land - Ruben Flam-Shepherd

Though it did not grab the zeitgeist like "Vibe Coding", Andrej Karpathy tried to coin the term "Jagged Intelligence" to describe how LLMs can "both perform extremely impressive tasks...while simultaneously struggle with some very dumb problems." He'd been beaten to this concept by Dell'Acqua and colleagues who'd used the term "Jagged Technological Frontier" in their 2023 paper. And while both of these parties have beaten me to this particular punch, I will ape their ideas to offer my own incremental contribution. Because they are both describing a line demarking capability. A boundary. A border on a map, of a foreign land that we are all travelers in: the land of agentic capabilities.

The Map and its Parts

I've found this to be a useful metaphor for thinking about agentic capabilities. We illuminate parts of the landscape by issuing tasks to agents and then evaluating their success. This iterative cycle — give the agent a task, evaluate the output, update our map — is the basic building block through which our understanding of agent capabilities evolves and develops. Our map consists of three different parts:

Charted Territory: Tasks we expect the agent to either fail or accomplish because they resemble tasks that the agent has failed or accomplished for us in the past.
The Borderlands: Tasks that either 1) agents only intermittently succeed at or 2) are similar to past things the agent has tried but different enough that we are unsure of the outcome.
The Wilds: Tasks that do not resemble anything we've asked the agent to do. They are far enough outside our understanding of agentic capabilities that we don't have a good sense of the likelihood of success or failure.

Tasks in Charted Territory are our workhorses. They are how we leverage agents to increase our output. This is landscape we've navigated before so we know what to expect.

The Borderlands is the (jagged) frontier that separates what we know about agent capabilities from what we don't know. This is where we push our knowledge of agent capabilities forward. The more time we spend here, the more territory we chart, the better our understanding becomes. Spending time here is important but past a certain point has diminishing returns — we’re not really doing work in this space so much as we are establishing the boundary demarking which kinds of work can be reliably accomplished. The release of new models redraws this boundary; the Borderlands become fuzzy. What is not possible today may become possible tomorrow.

The Wilds are unknown. Sometimes we might ask the agent to do something out of pocket because the cost of trying is cheap. But the expectation is failure. However, this is where we discover entirely new frontiers of agent capabilities.

Here are some things I've discovered as I've built out my map:

Send Multiple Scouts

One easy way to improve the throughput of the iterative cycle described above is to ask the agent to implement multiple versions of a given task in a single prompt. This allows us to explore an area instead of a single point. Some examples:

Agents struggle with tasks that are of the "take this data and generate a useful chart" variety. Sometimes they succeed but often the chart is broken in some way. Given the intermittent nature of this task, it lives in The Borderland. When useful charts are generated it's worth exploring to understand how we can more consistently get useful results. In Agentic Data Analysis with Claude Code we generate three SQL queries per table and four charts per query. This allows us to explore twelve different versions of the table -> query -> chart flow. With this expanded search space we can notice what works and update our prompts accordingly.
When designing UI elements agents will often present a variety of approaches for the user to choose from. Instead of choosing a single option, ask the agents to implement all of them behind different temporary endpoints. This lets us make a decision based on the actual end product (instead of a text description) and allows us to further evaluate if the agent can actually implement the given UI it's proposed.

Sending Multiple Scouts lets us move faster, make better decisions, and more rapidly increase the size of our Charted Territory. By sending out scouts alongside established workflows we can straddle the area between The Borderland and Charted Territory. This lets us expand our map while leaning on established workflows to remain productive.

Sending multiple scouts is also great because implicit within is our next point, parallelization.

Parallel-ness is Next to Godliness

Parallelization is one of the best ways to reduce run-times of agentic workflows.

Per this StackOverflow post here:

You know code is a good candidate for parallelization when it can be broken into a set of "discrete" (i.e. independent) tasks.

This definition also extends to agentic systems. In Agentic Data Analysis with Claude Code we initially parallelize our workflows per table: We send off parallel subagents, one per table we want to analyze. Each of these analyses is independent and does not require information from the others. Subsequent steps integrate these siloed table analyses and generate a list of questions. Subagents are sent off in parallel again, one per question that we've generated.

Sometimes agents are smart enough to invoke parallel subagents themselves without explicit prompting. However these invocations are flaky and subsequent runs may opt to do things in series, making run times vary. It's best to be explicit in our prompts about which parts of a workflow should invoke parallel subagents and the explicit scope that should be handed to each. See an example of this kind of prompt here.

Having agents do things in parallel is all well and good but parallelization is bottle-necked by the longest running process in the set. This is why we should:

Keep trips short

It feels very cool to hand a task to an agent or multi-agent system and have it go off and take one-to-two hours to run to completion; bootstrapping these so that they can run for prolonged periods without human intervention is decently difficult. So once you have one in place, it's easy to become enamoured with run-time and complexity and overlook whether the task is being effectively accomplished.

Furthermore, long-running tasks take longer to get feedback from and thus take longer to improve. This is bad. But the converse is good and true and should instead be leveraged:

The shorter the iterative cycle, the more quickly feedback can be collected, and the more quickly the cycle can be improved.

Once workflows are capable of performing more complex tasks over longer periods of time, the goal should immediately be to drastically decrease their runtimes. And the best way to do that is to... attention attention

Increase the resolution of your map (AKA Watch your agents)

As we develop Charted Territory there will be areas of the map we continue to return to — tasks we routinely ask agents to accomplish. In these areas it's worth increasing the resolution of our map. We know the agents can accomplish these tasks, but it's worth going a step further to understand how they accomplish these tasks. We do that by watching the agents; by paying attention to how the agents iterate and problem-solve. This can be extremely insightful:

If they mess up an initial invocation of a command and recover, chances are they are making and recovering from that same mistake every time they approach this task. Once you notice these mistakes, it's trivial to prompt the agent to update its own prompt to avoid this same mistake on subsequent runs.
If they're doing things in serial that don't need outside context, these are excellent candidates for parallelization.
Agents can be surprisingly innovative. If that innovation isn't explicitly encoded in the prompt or workflow then there's no guarantee that the agent is leveraging this innovation every run.

If you have a multi-agent system it's valuable to invoke part of the system directly in your main agent and watch how it operates. This is another way to Keep Trips Short as issues in the multi-agent system are often due to a specific sub-agent process, and why wait for the whole system to run when you can just kick off the problematic part? Surfacing the input/output token streams of the subagents this way is also often the only way to access them (I'm looking at you Anthropic).

Don't re-invent the wheel

As a byproduct of watching your agents, you've likely noticed that they effortlessly churn out boilerplate code. This can be useful for prototypes and ad hoc work but quickly becomes a bottleneck for repeated workflows as boilerplate code:

still takes up precious context and
can overwhelm the context if the end product is involved enough

Moreover, the end product will differ run-to-run which, in addition to appearing amateurish, will hamper information retrieval as there is no expected place for things to appear. If we don't provide the model with a wheel but we ask it to roll forward, it will have to re-invent the wheel every time. And frankly, some of these CSS-wheels are an affront to common decency and good taste.

The answer is to build templates for whatever that particular workflow's output demands. In Agentic Data Analysis with Claude Code we built TypeScript components that could be re-used for each type of chart (bar, lines, etc.) that needed to be visualized along with the overall report scaffold and styling. This was done in a separate, dedicated agent session and relevant workflows were given instructions on how to use the different template pieces. The end result is a generated report that is uniform across workflows and doesn't make you play CSS-styling roulette every time you submit a prompt.

Conclusion

The items I've chatted about here (and this blog more generally) are the major features of my map of this foreign land that I've been building for the last year or so. If you have cool things you've noticed on your map, I'd love to hear about them!

I have noticed that my map is changing more and more slowly. Each model release feels more incremental and less revolutionary. There seems to be a plateau that, if we are not already upon, looms in the distance.