Boosting Claude Code performance

with prompt learning

Simple, Effective, Data-Driven Prompt Optimization

Arize Builders Meetup NYC - 2026-03-12

A quick shout-out

What are we doing today?

Prompt Learning on Claude Code
SWE-Bench Lite: up to 11% improvement
No fine-tuning. No new tools. No architecture changes.

A tweet

The Memento problem

System prompts are bigger than you think

CLAUDE.md is yours

Most people's CLAUDE.md

is terrible

A benchmark to measure against

SWE-Bench Lite

300 real GitHub issues
Popular open-source Python repositories
Ground-truth patches + test suites

Why SWE-Bench?

Why SWE-Bench is hard

Our starting point

What is Prompt Learning?

Reinforcement Learning

Standard RL:

effective but expensive

Sample inefficient
Slow and expensive
Opaque: what do weight changes mean?
Overkill when LLMs are already great

Prompt Learning: the same loop, different algorithm

Why English feedback beats a score

What the meta-prompt does

We did this with Cline first

Why GPT-4.1 for Cline?

The Cline optimization loop

Cline results

The GPT-4.1 story

Now: Claude Code

Part 1: Rollouts

Part 2: Generate English feedback

Evals make all the difference

Part 3: Meta-prompting
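The three parts above (rollouts, English feedback, meta-prompting) can be sketched as one loop. This is a minimal illustration under stated assumptions, not the code from the Arize repository: `Rollout`, `prompt_learning_loop`, and the three callables `run_agent`, `evaluate`, and `meta_prompt` are hypothetical names, and the toy stand-ins at the bottom replace the real agent, the real evals, and the real LLM meta-prompt call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    """One attempt by the agent at one task (hypothetical structure)."""
    task: str
    patch: str
    passed: bool
    feedback: str = ""


def prompt_learning_loop(
    tasks: List[str],
    rules: str,
    run_agent: Callable[[str, str], Rollout],
    evaluate: Callable[[Rollout], str],
    meta_prompt: Callable[[str, List[Rollout]], str],
    epochs: int = 1,
) -> str:
    """Return an updated rules text after `epochs` optimization rounds."""
    for _ in range(epochs):
        # Part 1: rollouts. Run the agent on every task with the current rules.
        rollouts = [run_agent(task, rules) for task in tasks]
        # Part 2: attach English feedback to each rollout.
        for rollout in rollouts:
            rollout.feedback = evaluate(rollout)
        # Part 3: meta-prompting. Rewrite the rules from the feedback.
        rules = meta_prompt(rules, rollouts)
    return rules


# Toy stand-ins so the sketch runs without an agent or an LLM (assumptions).
def fake_agent(task: str, rules: str) -> Rollout:
    # "Passes" only when the current rules already mention the task keyword.
    return Rollout(task=task, patch="<diff>", passed=task in rules)


def fake_eval(rollout: Rollout) -> str:
    return "ok" if rollout.passed else f"missed: {rollout.task}"


def fake_meta(rules: str, rollouts: List[Rollout]) -> str:
    # Append one rule per failed rollout; a real meta-prompt would ask an LLM.
    new_rules = [f"- Handle {r.task}" for r in rollouts if not r.passed]
    return "\n".join([rules, *new_rules]).strip()


learned = prompt_learning_loop(
    ["deprecation", "ordering"], "", fake_agent, fake_eval, fake_meta, epochs=2
)
print(learned)
```

Starting from an empty rules text mirrors the empty starting file; the second epoch adds nothing because the toy agent already passes both tasks.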

The before: an empty file

The after: twenty rules

Rule one

Fix code at the correct hierarchy level
so all code paths benefit, not just downstream consumers.

Rule two

Maintain backward compatibility and consistency
with test expectations in error/warning behavior.

Rule three

Warn before raising errors
when deprecating usage
to allow user code transitions.

Rule four

Ensure correct dependency and execution order
in combined or chained operations.

The pattern in these rules

Cross-repo results

The Django result

20% better!

The honest framing

What this is really doing

But that's overfit!

Your git history as a training set

Sample efficiency

What you can do right now

The manual version

Even without closed issues

What makes a good rule

Applies to all coding agents

Cursor: .cursorrules
Cline: .clinerules
Windsurf: .windsurfrules
Claude Code: CLAUDE.md
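Because each agent reads its rules from a different file, the same learned rules can be written out per agent. The sketch below is an illustration only: `RULES_FILES` and `write_rules` are hypothetical names, the file names come from the list above, and formatting the rules as a plain bullet list is an assumption rather than a format any of these tools requires.

```python
from pathlib import Path
from typing import Iterable

# File names taken from the list above; the dictionary keys are hypothetical labels.
RULES_FILES = {
    "cursor": ".cursorrules",
    "cline": ".clinerules",
    "windsurf": ".windsurfrules",
    "claude-code": "CLAUDE.md",
}


def write_rules(repo_root: str, agent: str, rules: Iterable[str]) -> Path:
    """Write one rule per bullet into the agent's rules file (overwrites it)."""
    path = Path(repo_root) / RULES_FILES[agent]
    body = "\n".join(f"- {rule}" for rule in rules) + "\n"
    path.write_text(body, encoding="utf-8")
    return path
```

For example, `write_rules(".", "claude-code", ["Warn before raising errors when deprecating usage."])` would overwrite `CLAUDE.md` in the current directory, so back up an existing file first.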

Claude Code already does this for itself

The open source

github.com/Arize-ai/prompt-learning

Six rules to take away

1. Your CLAUDE.md is underutilized
2. Let your failures tell you what to write
3. Repo-specific beats generic
4. Your git history is a training set
5. The automation is optional
6. This works on any coding agent

A bigger picture:

Self-improving software

Thank you
