Commit graph

45 commits

Danilo Leal
8366cd0b52
agent: Render diffs for the edit file tool (#29234)
This PR implements the `ToolCard` for the edit file tool, which allows us
to display an editor with a diff in the thread view showing the changes
performed by the model.

- [x] Fix buffer sometimes displaying empty
- [x] Stop buffer from scrolling together with the thread
- [x] Fix multibuffer header sometimes appearing
- [x] Fix buffer height issue
- [x] Implement "full height" expand button
- [x] Add "Jump To File" functionality
- [x] Polish and refine styles

Release Notes:

- agent: Added diff preview cards in the thread view for edits performed
by the agent.

---------

Co-authored-by: João Marcos <marcospb19@hotmail.com>
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Conrad Irwin <conrad.irwin@gmail.com>
2025-04-23 15:43:33 -03:00
Oleksiy Syvokon
f69aeb6311
Do not log unfinished tools use that are in the middle of streaming (#29275)
Release Notes:

- N/A
2025-04-23 13:19:01 +00:00
Oleksiy Syvokon
76a78b550b
eval: Write JSON-serialized thread (#29271)
This adds a `last.message.json` file that contains the full request plus
the response (serialized as a message from the assistant for consistency
with the other messages).

Motivation: to capture more info and to make analysis of finished runs
easier.
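
For illustration, writing such a file might look like the sketch below, assuming `serde`/`serde_json` and `anyhow`, and using hypothetical message types (the real eval's types are richer):

```rust
use serde::Serialize;
use std::fs;
use std::path::Path;

// Hypothetical message shape, for illustration only.
#[derive(Serialize, Clone)]
struct SerializedMessage {
    role: String,
    content: String,
}

// Write the full request plus the final response into one JSON file, with
// the response appended as an assistant message for consistency.
fn write_last_message_json(
    run_dir: &Path,
    request_messages: &[SerializedMessage],
    response_text: &str,
) -> anyhow::Result<()> {
    let mut messages = request_messages.to_vec();
    messages.push(SerializedMessage {
        role: "assistant".into(),
        content: response_text.into(),
    });
    let json = serde_json::to_string_pretty(&messages)?;
    fs::write(run_dir.join("last.message.json"), json)?;
    Ok(())
}
```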

Release Notes:

- N/A
2025-04-23 15:22:19 +03:00
Agus Zubiaga
ce1a674eba
eval: Fine-grained assertions (#29246)
- Support programmatic examples
([example](17feb260a0/crates/eval/src/examples/file_search.rs))
- Combine data-driven example declarations into a single `.toml` file
([example](17feb260a0/crates/eval/src/examples/find_and_replace_diff_card.toml))
- Run judge on individual assertions (previously called "criteria")
- Report judge and programmatic assertions in one combined table

Note: We still need to work on concept naming 

<img width=400
src="https://github.com/user-attachments/assets/fc719c93-467f-412b-8d47-68821bd8a5f5">

Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
2025-04-22 23:58:58 -03:00
Max Brunsfeld
36d02de784
Rework eval to support interpretable scores and efficient repetitions (#29197)
### Problem

We want to start continuously tracking our progress on agent evals over
time. As part of this, we'd like the *score* to have a clear,
interpretable meaning. Right now, it's a number from 0 to 5, but it's
not clear what any particular number means. In addition, scores vary
widely from run to run, because the agent's output is non-deterministic. We
try to stabilize the score using a panel of judges, but the behavior of
the agent itself varies much more widely than the judges' scores for a
given run.

### Solution

* **explicit meanings of scores** - In this PR, we're prescribing that the
diff and thread criteria files *must* be unordered lists of assertions.
For both the thread and the diff, rather than providing an abstract score,
the judge's task is simply to count how many of these assertions are
satisfied. A percentage score can be derived by dividing this number by
the total number of assertions (see the sketch after this list).
* **repetitions** - Rather than running each example once, and judging
it N times, we'll **run** the example N times. Right now, I'm just
judging the output once per run, because I believe that with these clearer
scoring criteria, the main source of non-determinism will be the
*agent's* behavior, not the judge's.
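
As a minimal sketch of the derivation (hypothetical helper, not the actual eval code):

```rust
/// Derive a percentage score from the number of satisfied assertions.
/// Hypothetical helper; the real eval may compute this differently.
fn percentage_score(satisfied: usize, total: usize) -> f64 {
    if total == 0 {
        return 0.0;
    }
    100.0 * satisfied as f64 / total as f64
}

// e.g. 7 of 9 assertions satisfied => ~77.8%
```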

### Questions

* **accounting for diagnostic errors** - Previously, the judge was asked
to incorporate diagnostics into their abstract scores. Now that the
"score" is determined directly from the criteria, the diagnostic will
not be captured in the score. How should the diagnostics be accounted
for in the eval? One thought is - let's simply count and report the
number of errors remaining after the agent finishes, as a separate field
of the run (along with diff score and thread score). We could consider
normalizing it using the total lines of added code (like errors per 100
lines of code added) in order to give it some semblance of stability
between examples (see the sketch after this list).

* **repetitions** - How many repetitions should we run on CI? Each
repetition takes significant time, but I think running more than one
repetition will make the scores significantly less volatile.
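
If we adopt the errors-per-100-lines idea, the normalization could be sketched like this (hypothetical helper):

```rust
/// Diagnostic errors remaining per 100 lines of code added.
/// Hypothetical helper, only to illustrate the proposed normalization.
fn errors_per_100_lines(remaining_errors: usize, lines_added: usize) -> f64 {
    if lines_added == 0 {
        // No added lines; report the raw error count to avoid dividing by zero.
        return remaining_errors as f64;
    }
    100.0 * remaining_errors as f64 / lines_added as f64
}
```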

### Todo

* [x] Fix `--concurrency` implementation so that only N tasks are
spawned
* [x] Support `--repetitions` efficiently (re-using the same worktree)
* [x] Restructure judge prompts to count passing criteria, not compute
abstract score
* [x] Report total number of diagnostics in some way
* [x] Format output nicely

Release Notes:

- N/A

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
2025-04-22 14:00:09 +00:00
Nathan Sobo
458ffaa134
Add new action to run agent eval (#29158)
The old one wasn't linking, and
https://github.com/zed-industries/zed/pull/29081 has a bunch of merge
conflicts. Wanted to start simple/small.

## Todo

* [x] Remove low-signal examples
* [x] Make the eval run on a cron, on main, and on any PR with the
`run-eval` label
* [x] Noise in logs about failure to write settings
    ```
    [2025-04-21T20:45:04Z ERROR settings] Failed to write settings to file "/home/runner/.config/zed/settings.json"

    Caused by:
        No such file or directory (os error 2) at path "/home/runner/.config/zed/.tmpLewFEs"
    ```
* [x] `Agentic loop stalled`
(https://github.com/zed-industries/zed/actions/runs/14581044243/job/40897622894)
* [x] Make sure that events are recorded in snowflake
* [ ] Change judge criteria to be more explicit about meanings of scores

Release Notes:

- N/A

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
2025-04-21 21:30:21 -07:00
Michael Sloan
70c51b513b
agent eval: Default to also running typescript examples (#29185)
Release Notes:

- N/A
2025-04-21 23:59:35 +00:00
Michael Sloan
9249919b7a
Write {result_count}.diff and last.diff eval run outputs (#29181)
These are only written when the diff has changed. `patch.diff` has been
removed as it's redundant with `last.diff`.

It can be convenient to open `last.diff` and use undo/redo to navigate
its history.
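
A rough sketch of the write-on-change behavior (hypothetical function; the actual eval code may differ):

```rust
use std::fs;
use std::path::Path;

// Write `{result_count}.diff` and `last.diff` only when the diff actually
// changed since the previous write. Sketch only.
fn write_diff_outputs(run_dir: &Path, result_count: usize, diff: &str) -> std::io::Result<()> {
    let last_path = run_dir.join("last.diff");
    let unchanged = fs::read_to_string(&last_path)
        .map(|previous| previous == diff)
        .unwrap_or(false);
    if unchanged {
        return Ok(());
    }
    fs::write(run_dir.join(format!("{result_count}.diff")), diff)?;
    fs::write(last_path, diff)
}
```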

Release Notes:

- N/A
2025-04-21 23:19:07 +00:00
Richard Feldman
4f2f9ff762
Streaming tool calls (#29179)
https://github.com/user-attachments/assets/7854a737-ef83-414c-b397-45122e4f32e8



Release Notes:

- Create file and edit file tools now stream their tool descriptions, so
you can see what they're doing sooner.

---------

Co-authored-by: Marshall Bowers <git@maxdeviant.com>
2025-04-21 22:28:32 +00:00
Thomas Mickley-Doyle
733cd6b68c
agent: Remove non-rust examples from evals (#29139)
Release Notes:

- N/A
2025-04-21 12:55:24 -07:00
Michael Sloan
0f3ac38332
Agent eval: Copy .rules file into eval worktree for examples based on Zed (#29116)
Also reverts #29108, which cherry-picked the rules file for an eval
example.

Release Notes:

- N/A
2025-04-21 12:02:44 -06:00
Conrad Irwin
9d35f0389d
debugger: More tidy up for SSH (#28993)
Split `locator` out of DebugTaskDefinition to make it clearer when
location needs to happen.

Release Notes:

- N/A

---------

Co-authored-by: Anthony Eid <hello@anthonyeid.me>
Co-authored-by: Anthony <anthony@zed.dev>
Co-authored-by: Cole Miller <m@cole-miller.net>
2025-04-21 16:00:03 +00:00
Antonio Scandurra
97ab0980d1
Start tracking tool failure rates in eval (#29122)
This pull request prints all the tools that were used and their failure
rates. The goal is to minimize those failure rates.

@tmickleydoyle: this also changes the telemetry event to report
`tool_metrics` as opposed to `tool_use_counts`. Ideally I'd love to be
able to plot failure rates by tool and hopefully see that percentage go
down. Can we do that with the data we're tracking with this pull
request?
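
For illustration, per-tool failure rates could be tracked roughly like this (hypothetical types, not necessarily this PR's implementation):

```rust
use std::collections::HashMap;

#[derive(Default)]
struct ToolMetrics {
    uses: usize,
    failures: usize,
}

impl ToolMetrics {
    fn failure_rate(&self) -> f64 {
        if self.uses == 0 {
            0.0
        } else {
            self.failures as f64 / self.uses as f64
        }
    }
}

// Record one tool invocation and whether it failed.
fn record(metrics: &mut HashMap<String, ToolMetrics>, tool_name: &str, failed: bool) {
    let entry = metrics.entry(tool_name.to_string()).or_default();
    entry.uses += 1;
    if failed {
        entry.failures += 1;
    }
}
```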

Release Notes:

- N/A
2025-04-21 16:16:43 +02:00
Agus Zubiaga
ceeae790b7
eval: Improve lang server idle detection (#29135)
Brings back #29013 after it was accidentally reverted by
https://github.com/zed-industries/zed/pull/28961/commits/e9bb15b9063615762c866c30aaf646acb12af1f3.

Release Notes:

- N/A
2025-04-21 00:17:28 +00:00
Nathan Sobo
107d8ca483
Rename regex search tool to grep and accept an include glob pattern (#29100)
This PR renames the `regex_search` tool to `grep` because I think it
conveys more meaning to the model: the idea of searching the filesystem
with a regular expression. It's also one word, and the model seems to be
using it effectively after some additional prompt tuning.

It also takes an include pattern to filter which files we search. I'd like
to encourage the model to scope its searches more aggressively, since in
my testing I'm only seeing it filter by file extension.
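
The tool input described above might look roughly like this (field names are illustrative, not the exact schema):

```rust
use serde::Deserialize;

/// Illustrative input for the renamed `grep` tool: a regex to search the
/// project with, plus an optional glob restricting which files are searched.
#[derive(Deserialize)]
struct GrepToolInput {
    /// Regular expression to search for.
    regex: String,
    /// Optional include glob, e.g. "crates/eval/**/*.rs".
    include_pattern: Option<String>,
}
```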

Release Notes:

- N/A
2025-04-20 00:53:30 +00:00
Michael Sloan
a91948aeb4
agent: Reorder some linux keybindings to match mac keybindings (#29107)
Release Notes:

- Made keybindings for agent panel closer to the precedence order used
on Mac. This fixes use of `enter` to add context from the menu triggered
by `@` referencing.
2025-04-20 00:01:43 +00:00
Michael Sloan
fbf7caf93e
Default to fast model for thread summaries and titles + don't include system prompt / context / thinking segments (#29102)
* Adds a fast / cheaper model to providers and defaults thread
summarization to this model (a rough sketch follows below). Initial
motivation for this was that
https://github.com/zed-industries/zed/pull/29099 would cause these
requests to fail when used with a thinking model. It doesn't seem
correct to use a thinking model for summarization.

* Skips system prompt, context, and thinking segments.

* If tool use is happening, allows 2 tool uses + one more agent response
before summarizing.

The downside is that there was previously potential for some prefix cache
reuse, especially for title summarization (thread summarization omitted
tool results and so would not share a prefix for those). This seems fine,
as these requests should typically be fairly small. Even for full thread
summarization, skipping all tool use / context should greatly reduce the
token use.
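
Conceptually, the fast-model default could look like this hedged sketch (hypothetical trait and method names, not the real provider API):

```rust
use std::sync::Arc;

// Hypothetical provider shape, for illustration only.
trait Provider {
    fn default_model(&self) -> Arc<str>;
    fn fast_model(&self) -> Option<Arc<str>>;
}

// Prefer the provider's fast/cheaper model for thread summaries and titles,
// falling back to its default model when no fast model is configured.
fn summarization_model(provider: &dyn Provider) -> Arc<str> {
    provider
        .fast_model()
        .unwrap_or_else(|| provider.default_model())
}
```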

Release Notes:

- N/A
2025-04-19 23:26:29 +00:00
Bennet Bo Fenner
bafc086d27
agent: Preserve thinking blocks between requests (#29055)
Looks like the required backend component of this was deployed.

https://github.com/zed-industries/monorepo/actions/runs/14541199197

Release Notes:

- N/A

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Nathan Sobo <nathan@zed.dev>
2025-04-19 20:12:03 +00:00
Michael Sloan
d88b06a5dc
Simplify language model registry + only emit change events on change (#29086)
* Now only does default fallback logic in the registry

* Only emits change events when there is actually a change
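
The "only emit on actual change" behavior could be sketched like this (hypothetical, simplified types; the real registry differs):

```rust
// Compare the newly selected model against the current one and only notify
// observers when something actually changed.
#[derive(PartialEq, Clone)]
struct ActiveModel {
    provider_id: String,
    model_id: String,
}

struct Registry {
    active: Option<ActiveModel>,
    observers: Vec<Box<dyn Fn(&Option<ActiveModel>)>>,
}

impl Registry {
    fn set_active(&mut self, new_active: Option<ActiveModel>) {
        if self.active == new_active {
            return; // no change, so no event
        }
        self.active = new_active;
        for observer in &self.observers {
            observer(&self.active);
        }
    }
}
```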

Release Notes:

- N/A
2025-04-19 08:26:42 +00:00
Michael Sloan
98ceffe026
Pretty tool inputs in eval output markdown + numbered assistant messages (#29082)
Release Notes:

- N/A
2025-04-19 06:59:22 +00:00
Nathan Sobo
bab28560ef
Systematically optimize agentic editing performance (#28961)
Now that we've established a proper in-tree eval, this PR reboots our
agent loop back to a set of minimal tools and simpler prompts. We should
aim to get this branch feeling subjectively competitive with what's on
main, then merge it and build from there.

Let's invest in our eval and use it to drive better performance of the
agent loop. How you can help: Pick an example, and then make the outcome
faster or better. It's fine to even use your own subjective judgment, as
our evaluation criteria likely need tuning as well at this point. Focus
on making the agent work better in your own subjective experience first.
Let's focus on simple/practical improvements to make this thing work
better, then determine how we can craft our judgment criteria to lock
those improvements in.

Release Notes:

- N/A

---------

Co-authored-by: Max <max@zed.dev>
Co-authored-by: Antonio <antonio@zed.dev>
Co-authored-by: Agus <agus@zed.dev>
Co-authored-by: Richard <richard@zed.dev>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>
2025-04-19 02:47:59 +00:00
Marshall Bowers
7abe2c9c31
agent: Attach thread ID and prompt ID to telemetry events (#29069)
This PR attaches the thread ID and the new prompt ID to telemetry events
for completions in the Agent panel.

Release Notes:

- N/A

---------

Co-authored-by: Mikayla Maki <mikayla.c.maki@gmail.com>
2025-04-18 20:41:02 +00:00
Michael Sloan
327fee4d22
Init prompt store in agent eval (#29068)
Needed after #28915

Release Notes:

- N/A
2025-04-18 20:06:34 +00:00
Michael Sloan
502a0f6535
agent: Use default prompts from prompt library in system prompt (#28915)
Related to #28490.

- Default prompts from the prompt library are now included as "user
rules" in the system prompt (see the sketch after this list).
- Presence of these user rules is shown at the beginning of the thread
in the UI.
- Now uses an `Entity<PromptStore>` instead of an `Arc<PromptStore>`.
Motivation for this is emitting a `PromptsUpdatedEvent`.
- Now disallows concurrent reloading of the system prompt. Before this
change it was possible for reloads to race.
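
A hedged sketch of including default prompts as user rules when assembling the system prompt (hypothetical names; the real implementation will differ):

```rust
// Hypothetical sketch: append each default prompt from the library to the
// base system prompt under a "User Rules" section.
struct UserRule {
    title: String,
    body: String,
}

fn build_system_prompt(base_prompt: &str, default_prompts: &[UserRule]) -> String {
    let mut prompt = base_prompt.to_string();
    if !default_prompts.is_empty() {
        prompt.push_str("\n\n# User Rules\n");
        for rule in default_prompts {
            prompt.push_str(&format!("\n## {}\n{}\n", rule.title, rule.body));
        }
    }
    prompt
}
```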

Release Notes:

- agent: Added support for including default prompts from the Prompt
Library as "user rules" in the system prompt.

---------

Co-authored-by: Danilo Leal <daniloleal09@gmail.com>
2025-04-18 09:32:35 -06:00
Marshall Bowers
c2cd4fd7a1
agent: Show request usage in the panel (#29006)
This PR adds a banner showing request usage in the Agent panel:

<img width="640" alt="Screenshot 2025-04-17 at 5 51 46 PM"
src="https://github.com/user-attachments/assets/e0eb036c-57c1-441c-bbab-7dab1c6e56d9"
/>

Only visible to users on the new billing.

Note to Joseph: Doesn't need to be cherry-picked to Preview.

Release Notes:

- N/A

---------

Co-authored-by: Nate <nate@zed.dev>
2025-04-17 22:16:57 +00:00
Oleksiy Syvokon
6dd622d6c3
eval: Fix git revision existence check (#28959)
This change fixes a bug in the worktree initialization.

Details: `git rev-parse --verify $HASH` just checks that $HASH is a
well-formed hash and will return successfully even if $HASH doesn't
exist.
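
One way to check for actual existence, as a hedged sketch in Rust (not necessarily the fix used here), is to ask git to peel the hash to a commit, which fails when the object is missing:

```rust
use std::process::Command;

// Returns true only if `revision` resolves to a commit that actually exists
// in the repository at `repo_path`. Sketch only; may differ from this PR's fix.
fn revision_exists(repo_path: &std::path::Path, revision: &str) -> std::io::Result<bool> {
    let status = Command::new("git")
        .current_dir(repo_path)
        .args(["rev-parse", "--verify", "--quiet"])
        .arg(format!("{revision}^{{commit}}"))
        .status()?;
    Ok(status.success())
}
```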

Release Notes:

- N/A
2025-04-17 19:57:37 +03:00
Thomas Mickley-Doyle
8de53bd89f
agent: Add git commit ID to the eval telemetry data (#28895)
Release Notes:

- N/A
2025-04-16 14:13:43 -05:00
Michael Sloan
320abe9b22
Agent Eval: Check if SHA already fetched (#28846)
Release Notes:

- N/A
2025-04-16 06:54:22 +00:00
Michael Sloan
9a9f2e71ca
Agent Eval: Initial support for running examples repeatedly (#28844)
Not ideal, as it creates a separate worktree for each repetition.

Release Notes:

- N/A
2025-04-16 06:35:55 +00:00
Michael Sloan
609895d95f
Agent Eval: bounded concurrency (#28843)
Release Notes:

- N/A
2025-04-16 00:05:46 -06:00
Michael Sloan
da2d8bd845
Agent Eval: Distinguish tool successes and failures in log (#28839)
Release Notes:

- N/A
2025-04-15 22:51:33 -06:00
Thomas Mickley-Doyle
222d4a2546
agent: Add telemetry for eval runs (#28816)
Release Notes:

- N/A

---------

Co-authored-by: Joseph <joseph@zed.dev>
2025-04-16 02:54:26 +00:00
Michael Sloan
102ea6ac79
Add support for judge repetitions in eval (#28811)
Release Notes:

- N/A

---------

Co-authored-by: Thomas <thomas@zed.dev>
2025-04-15 23:18:02 +00:00
Agus Zubiaga
0182e09e33
eval: Do not create run files for skipped examples (#28800)
Release Notes:

- N/A
2025-04-15 18:00:04 +00:00
Agus Zubiaga
ff4334efc7
eval: Fix stalling on tool confirmation (#28786)
The `always_allow_tool_actions` setting would get overridden with the
default when we loaded each example project, leading to examples
stalling when they ran a tool that needed confirmation. There's now a
separate `runner_settings.json` file where we can configure the
environment for the eval.
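
For illustration only (the file's actual contents aren't shown here), the runner settings could be written along these lines; the setting name comes from this PR, but the nesting and any other fields are assumptions:

```rust
use std::fs;
use std::path::Path;

// Hypothetical sketch: write a dedicated settings file for the eval runner so
// `always_allow_tool_actions` stays enabled and tools never block on
// confirmation. The real file's structure may differ.
fn write_runner_settings(eval_dir: &Path) -> std::io::Result<()> {
    let settings = r#"{
  "always_allow_tool_actions": true
}"#;
    fs::write(eval_dir.join("runner_settings.json"), settings)
}
```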

Release Notes:

- N/A

---------

Co-authored-by: Oleksiy <oleksiy@zed.dev>
2025-04-15 16:53:45 +00:00
Thomas Mickley-Doyle
b1e4e6048a
agent: Add more Rust code examples, update TODO check (#28737)
Release Notes:

- N/A
2025-04-15 16:52:08 +00:00
Agus Zubiaga
e4cf7fe8f5
eval: Improve readability with colors and alignment (#28761)
![CleanShot 2025-04-15 at 10 35
39@2x](https://github.com/user-attachments/assets/495d96fb-fe2f-478b-a9d6-678c1184db9a)


Release Notes:

- N/A
2025-04-15 13:50:01 +00:00
Bennet Bo Fenner
e26f0a331f
agent: Make ToolWorkingSet an Entity (#28757)
The motivation is to emit events when enabled tools change; we want to use
this in #28755.

Release Notes:

- N/A
2025-04-15 14:42:31 +02:00
Michael Sloan
0d6e455bf6
Agent eval: output paths to log files at the end (#28724)
Release Notes:

- N/A
2025-04-14 23:04:07 +00:00
Michael Sloan
5f897b0e00
Agent Eval: Fail example when there are no events in 2 minutes (#28725)
Release Notes:

- N/A
2025-04-14 23:01:21 +00:00
Thomas Mickley-Doyle
d74f0735c2
Add more eval examples + filtering examples by language + fix git concurrent usage (#28719)
Release Notes:

- N/A

---------

Co-authored-by: michael <michael@zed.dev>
Co-authored-by: agus <agus@zed.dev>
2025-04-14 22:05:46 +00:00
Michael Sloan
c8ccc472b5
Track tool use counts (#28722)
Release Notes:

- N/A
2025-04-14 21:45:36 +00:00
Michael Sloan
6b80eb556c
Add judge to new eval + provide LSP diagnostics (#28713)
Release Notes:

- N/A

---------

Co-authored-by: Antonio Scandurra <antonio@zed.dev>
Co-authored-by: agus <agus@zed.dev>
2025-04-14 20:18:47 +00:00
Antonio Scandurra
2440faf4b2
Actually run the eval and fix a hang when retrieving outline (#28547)
Release Notes:

- Fixed a regression that caused the agent to hang sometimes.

---------

Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>
2025-04-11 00:01:33 +00:00
Antonio Scandurra
8ac378b86e
Lay the groundwork for a Rust-based eval (#28488)
Also, we moved the logic for driving the agentic loop into `Thread` so
that we don't have to re-implement it.

Release Notes:

- N/A

---------

Co-authored-by: Nathan Sobo <nathan@zed.dev>
2025-04-10 04:45:27 +00:00