18 commits

Author SHA1 Message Date
Michael Sloan
70c51b513b
agent eval: Default to also running typescript examples (#29185)
Release Notes:

- N/A
2025-04-21 23:59:35 +00:00
Antonio Scandurra
97ab0980d1
Start tracking tool failure rates in eval (#29122)
This pull request prints every tool that was used along with its failure rate.
The goal is to minimize those failure rates.

@tmickleydoyle: this also changes the telemetry event to report
`tool_metrics` as opposed to `tool_use_counts`. Ideally I'd love to be
able to plot failure rates by tool and hopefully see that percentage go
down. Can we do that with the data we're tracking with this pull
request?
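
To illustrate the kind of per-tool bookkeeping described above, here is a minimal sketch. It is not the code from this pull request: the `ToolMetrics` struct, its fields, and its methods are assumptions made for illustration, and only the `tool_metrics` event name comes from the message above.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-tool counters. The actual shape of the `tool_metrics`
/// telemetry payload is not shown in this commit message.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct ToolMetrics {
    /// Total invocations per tool name.
    pub use_counts: BTreeMap<String, u32>,
    /// Failed invocations per tool name.
    pub failure_counts: BTreeMap<String, u32>,
}

impl ToolMetrics {
    /// Record one tool invocation and whether it succeeded.
    pub fn record(&mut self, tool: &str, succeeded: bool) {
        *self.use_counts.entry(tool.to_string()).or_insert(0) += 1;
        if !succeeded {
            *self.failure_counts.entry(tool.to_string()).or_insert(0) += 1;
        }
    }

    /// Failure rate in the range 0.0..=1.0, or `None` if the tool was never used.
    pub fn failure_rate(&self, tool: &str) -> Option<f64> {
        let uses = *self.use_counts.get(tool)?;
        let failures = self.failure_counts.get(tool).copied().unwrap_or(0);
        Some(f64::from(failures) / f64::from(uses))
    }

    /// One human-readable line per tool, sorted by tool name.
    pub fn report(&self) -> Vec<String> {
        self.use_counts
            .iter()
            .map(|(tool, uses)| {
                let failures = self.failure_counts.get(tool).copied().unwrap_or(0);
                // `uses` is at least 1 for any recorded tool, so no division by zero.
                let rate = 100.0 * f64::from(failures) / f64::from(*uses);
                format!("{tool}: {failures}/{uses} failed ({rate:.1}%)")
            })
            .collect()
    }
}
```

Keeping raw use and failure counts, rather than a precomputed percentage, would let a dashboard aggregate across runs before dividing, which is one way to plot failure rate by tool over time.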

Release Notes:

- N/A
2025-04-21 16:16:43 +02:00
Michael Sloan
d88b06a5dc
Simplify language model registry + only emit change events on change (#29086)
* Now only does default fallback logic in the registry

* Only emits change events when there is actually a change

Release Notes:

- N/A
2025-04-19 08:26:42 +00:00
Nathan Sobo
bab28560ef
Systematically optimize agentic editing performance (#28961)
Now that we've established a proper eval in tree, this PR reboots our
agent loop back to a set of minimal tools and simpler prompts. We
should aim to get this branch feeling subjectively competitive with
what's on main, then merge it and build from there.

Let's invest in our eval and use it to drive better performance of the
agent loop. How you can help: Pick an example, and then make the outcome
faster or better. It's fine to even use your own subjective judgment, as
our evaluation criteria likely need tuning as well at this point. Focus
on making the agent work better in your own subjective experience first.
Let's focus on simple/practical improvements to make this thing work
better, then determine how we can craft our judgment criteria to lock
those improvements in.

Release Notes:

- N/A

---------

Co-authored-by: Max <max@zed.dev>
Co-authored-by: Antonio <antonio@zed.dev>
Co-authored-by: Agus <agus@zed.dev>
Co-authored-by: Richard <richard@zed.dev>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>
2025-04-19 02:47:59 +00:00
Michael Sloan
327fee4d22
Init prompt store in agent eval (#29068)
Needed after #28915

Release Notes:

- N/A
2025-04-18 20:06:34 +00:00
Thomas Mickley-Doyle
8de53bd89f
agent: Add git commit ID to the eval telemetry data (#28895)
Release Notes:

- N/A
2025-04-16 14:13:43 -05:00
Michael Sloan
9a9f2e71ca
Agent Eval: Initial support for running examples repeatedly (#28844)
Not ideal as it creates a separate worktree for each repetition

Release Notes:

- N/A
2025-04-16 06:35:55 +00:00
Michael Sloan
609895d95f
Agent Eval: bounded concurrency (#28843)
Release Notes:

- N/A
2025-04-16 00:05:46 -06:00
Thomas Mickley-Doyle
222d4a2546
agent: Add telemetry for eval runs (#28816)
Release Notes:

- N/A

---------

Co-authored-by: Joseph <joseph@zed.dev>
2025-04-16 02:54:26 +00:00
Michael Sloan
102ea6ac79
Add support for judge repetitions in eval (#28811)
Release Notes:

- N/A

---------

Co-authored-by: Thomas <thomas@zed.dev>
2025-04-15 23:18:02 +00:00
Agus Zubiaga
0182e09e33
eval: Do not create run files for skipped examples (#28800)
Release Notes:

- N/A
2025-04-15 18:00:04 +00:00
Agus Zubiaga
ff4334efc7
eval: Fix stalling on tool confirmation (#28786)
The `always_allow_tool_actions` setting was overridden with the
default whenever we loaded an example project, so examples stalled
when they ran a tool that needed confirmation. There's now a
separate `runner_settings.json` file where we can configure the
environment for the eval.

Release Notes:

- N/A

---------

Co-authored-by: Oleksiy <oleksiy@zed.dev>
2025-04-15 16:53:45 +00:00
Agus Zubiaga
e4cf7fe8f5
eval: Improve readability with colors and alignment (#28761)
![CleanShot 2025-04-15 at 10 35 39@2x](https://github.com/user-attachments/assets/495d96fb-fe2f-478b-a9d6-678c1184db9a)


Release Notes:

- N/A
2025-04-15 13:50:01 +00:00
Michael Sloan
0d6e455bf6
Agent eval: output paths to log files at the end (#28724)
Release Notes:

- N/A
2025-04-14 23:04:07 +00:00
Thomas Mickley-Doyle
d74f0735c2
Add more eval examples + filtering examples by language + fix git concurrent usage (#28719)
Release Notes:

- N/A

---------

Co-authored-by: michael <michael@zed.dev>
Co-authored-by: agus <agus@zed.dev>
2025-04-14 22:05:46 +00:00
Michael Sloan
6b80eb556c
Add judge to new eval + provide LSP diagnostics (#28713)
Release Notes:

- N/A

---------

Co-authored-by: Antonio Scandurra <antonio@zed.dev>
Co-authored-by: agus <agus@zed.dev>
2025-04-14 20:18:47 +00:00
Antonio Scandurra
2440faf4b2
Actually run the eval and fix a hang when retrieving outline (#28547)
Release Notes:

- Fixed a regression that caused the agent to hang sometimes.

---------

Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>
2025-04-11 00:01:33 +00:00
Antonio Scandurra
8ac378b86e
Lay the groundwork for a Rust-based eval (#28488)
Also, we moved the logic for driving the agentic loop into `Thread` so
that we don't have to re-implement it.

Release Notes:

- N/A

---------

Co-authored-by: Nathan Sobo <nathan@zed.dev>
2025-04-10 04:45:27 +00:00