This pull request should be a no-op behavior-wise, but lays the groundwork for
avoiding having to connect to collab in order to interact with AI features
provided by Zed.
Release Notes:
- N/A
---------
Co-authored-by: Marshall Bowers <git@maxdeviant.com>
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
This PR updates the Agent panel to work with the `CloudUserStore`
instead of the `UserStore`, reducing its reliance on being connected to
Collab to function.
Release Notes:
- N/A
---------
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Fixes an issue that caused extension directory removal to fail on Windows,
because Zed never stopped the related processes.
Now:
* During shutdown, Zed waits until the language servers have fully shut
down
* Adds `impl Drop for WasmExtension`, which calls
`self.tx.close_channel();` to stop a receiver loop that holds the "lock"
on the extension's work dir (see the sketch below).
The extension was dropped, but the channel was not closed for some
reason.
* Does more unregistration to ensure the `Arc<WasmExtension>` holding the
`tx` does not leak further
* Tidies up the related errors, which never reported the problematic
path before
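A rough sketch of the `Drop` change (the channel and message types are simplified here for illustration):

```rust
use futures::channel::mpsc;

// Placeholder for the messages driving the extension's receiver loop.
struct ExtensionCall;

struct WasmExtension {
    // Sender side of the channel that feeds the receiver loop.
    tx: mpsc::UnboundedSender<ExtensionCall>,
}

impl Drop for WasmExtension {
    fn drop(&mut self) {
        // Closing the channel terminates the receiver loop, releasing its
        // hold on the extension's work dir so Windows can delete it.
        self.tx.close_channel();
    }
}
```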
Release Notes:
- N/A
---------
Co-authored-by: Smit Barmase <heysmitbarmase@gmail.com>
Co-authored-by: Smit <smit@zed.dev>
This cleans up our settings to not include any `version` fields, as we
have an actual settings migrator now.
This PR removes `language_models > anthropic > version`,
`language_models > openai > version` and `agent > version`.
We had migration paths in the code for a long time, so in practice
almost everyone should be using the latest version of these settings.
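For reference, these are the settings being removed; a rough sketch of the old shape (the version values are illustrative):

```json
{
  "agent": {
    "version": "2"
  },
  "language_models": {
    "anthropic": { "version": "1" },
    "openai": { "version": "1" }
  }
}
```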
Release Notes:
- Removed the `version` fields in settings for `agent`, `language_models >
anthropic`, and `language_models > openai`. Your settings will automatically
be migrated. If you're running into issues with this, open an issue
[here](https://github.com/zed-industries/zed/issues)
This PR moves the UI-dependent logic in the `agent` crate into its own
crate, `agent_ui`. The remaining `agent` crate no longer depends on
`editor`, `picker`, `ui`, `workspace`, etc.
This has compile-time benefits, but the main motivation is to isolate
our core agentic logic, so that we can make agents more
pluggable/configurable.
Release Notes:
- N/A
The `async-watch` crate doesn't seem to be maintained and we noticed
several panics coming from it, such as:
```
[bug] failed to observe change after notificaton.
zed::reliability::init_panic_hook::{{closure}}::hea8cdcb6299fad6b+154543526
std::panicking::rust_panic_with_hook::h33b18b24045abff4+127578547
std::panicking::begin_panic_handler::{{closure}}::hf8313cc2fd0126bc+127577770
std::sys::backtrace::__rust_end_short_backtrace::h57fe07c8aea5c98a+127571385
__rustc[95feac21a9532783]::rust_begin_unwind+127576909
core::panicking::panic_fmt::hd54fb667be51beea+9433328
core::option::expect_failed::h8456634a3dada3e4+9433291
assistant_tools::edit_agent::EditAgent::apply_edit_chunks::{{closure}}::habe2e1a32b267fd4+26921553
gpui::app::async_context::AsyncApp::spawn::{{closure}}::h12f5f25757f572ea+25923441
async_task::raw::RawTask<F,T,S,M>::run::h3cca0d402690ccba+25186815
<gpui::platform::linux::x11::client::X11Client as gpui::platform::linux::platform::LinuxClient>::run::h26264aefbcfbc14b+73961666
gpui::platform::linux::platform::<impl gpui::platform::Platform for P>::run::hb12dcd4abad715b5+73562509
gpui::app::Application::run::h0f936a5f855a3f9f+150676820
zed::main::ha17f9a25fe257d35+154788471
std::sys::backtrace::__rust_begin_short_backtrace::h1edd02429370b2bd+154624579
std::rt::lang_start::{{closure}}::h3d2e300f10059b0a+154264777
std::rt::lang_start_internal::h418648f91f5be3a1+127502049
main+154806636
__libc_start_main+46051972301573
_start+12358494
```
I didn't find an executor-agnostic watch crate that was well maintained
(we already tried postage and async-watch), so we decided to implement
our own version.
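For context, the replacement is along these lines: a minimal, executor-agnostic watch channel in which the sender overwrites the latest value and wakes pending receivers (a sketch under those assumptions, not the code that landed):

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Waker};

struct Shared<T> {
    value: T,
    version: u64,
    wakers: Vec<Waker>,
}

pub struct Sender<T>(Arc<Mutex<Shared<T>>>);
pub struct Receiver<T> {
    shared: Arc<Mutex<Shared<T>>>,
    seen: u64,
}

pub fn channel<T>(initial: T) -> (Sender<T>, Receiver<T>) {
    let shared = Arc::new(Mutex::new(Shared {
        value: initial,
        version: 0,
        wakers: Vec::new(),
    }));
    (Sender(shared.clone()), Receiver { shared, seen: 0 })
}

impl<T> Sender<T> {
    pub fn send(&self, value: T) {
        let mut shared = self.0.lock().unwrap();
        shared.value = value;
        shared.version += 1;
        // Wake every receiver currently waiting in `changed()`.
        for waker in shared.wakers.drain(..) {
            waker.wake();
        }
    }
}

impl<T: Clone> Receiver<T> {
    /// Resolves with the latest value once it differs from the last one seen.
    pub fn changed(&mut self) -> Changed<'_, T> {
        Changed(self)
    }
}

pub struct Changed<'a, T>(&'a mut Receiver<T>);

impl<T: Clone> Future for Changed<'_, T> {
    type Output = T;

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<T> {
        let this = self.get_mut();
        let mut shared = this.0.shared.lock().unwrap();
        if shared.version != this.0.seen {
            // A new value was sent since we last looked; return it.
            this.0.seen = shared.version;
            Poll::Ready(shared.value.clone())
        } else {
            // Register our waker under the same lock that guards the
            // version, so a concurrent `send` can't slip past unnoticed.
            shared.wakers.push(cx.waker().clone());
            Poll::Pending
        }
    }
}
```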
Release Notes:
- Fixed a panic that could sometimes occur when the agent performed
edits.
This is needed for apples-to-apples comparison of different agent
models.
Another change is that `cargo run -p eval` now accepts model names as
`provider_id/model_id` instead of separate `--provider` and `--model`
params.
Release Notes:
- N/A
This PR fixes an issue where the eval was incorrectly pulling the
provider/model from the user settings, which could cause problems when
running certain evals.
This was introduced in #30168 due to the restructuring after the removal
of the `assistant` crate.
Release Notes:
- N/A
https://github.com/zed-industries/zed/issues/30972 brought up another
case where our context is not enough to track the actual source of the
issue: we get a general top-level error without an inner error.
The reason for this was `.ok_or_else(|| anyhow!("failed to read HEAD SHA"))?;`
at the top level.
This PR finally reworks the way we use anyhow to reduce such issues (or
at least make it simpler to bubble them up later in a fix).
On top of that, it uses a few more anyhow methods for better readability:
* `.ok_or_else(|| anyhow!("..."))`, `map_err`, and similar error
conversion/option handling cases are replaced with `context` and
`with_context` calls
* in addition, various `anyhow!("failed to do ...")` messages are
replaced with `.context("Doing ...")` messages instead, to remove the
parasitic `failed to` text
* `anyhow::ensure!` is used instead of `if ... { return Err(...); }`
calls
* `anyhow::bail!` is used instead of `return Err(anyhow!(...));` (see
the sketch after this list)
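An illustrative before/after of these patterns (function names are made up for the example):

```rust
use anyhow::{anyhow, bail, ensure, Context as _, Result};

// Before: a bare top-level error with no inner cause or context chain.
fn head_sha_before(head: Option<String>) -> Result<String> {
    head.ok_or_else(|| anyhow!("failed to read HEAD SHA"))
}

// After: `context` attaches a message while preserving the error chain,
// and drops the parasitic "failed to" phrasing.
fn head_sha_after(head: Option<String>) -> Result<String> {
    head.context("reading HEAD SHA")
}

fn validate(input: &[u8]) -> Result<()> {
    // `ensure!` replaces `if ... { return Err(...); }`.
    ensure!(!input.is_empty(), "empty input");
    if input.len() > 1024 {
        // `bail!` replaces `return Err(anyhow!(...));`.
        bail!("input too large: {} bytes", input.len());
    }
    Ok(())
}
```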
Release Notes:
- N/A
- Evals returning an error (e.g., LLM API format mismatch) were silently
skipped in the aggregated results. Now we count them as a failure (0%
success score).
- Setting the `VERBOSE` environment variable to something non-empty
disables string truncation (see the example below)
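For example, to run the eval without truncated output:
```
VERBOSE=1 cargo run -p eval
```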
Release Notes:
- N/A
1. The `edit_file` tool tended to use `create_or_overwrite` a bit too
often, leading to corruption of long files. This change replaces the
boolean flag with an `EditFileMode` enum, which helps the agent make a
more deliberate choice when overwriting files (see the sketch after
this list).
With this change, the pass rate of the new eval increased from 10% to
100%.
2. eval: Added the ability to run an eval on top of an existing thread.
Threads can now be loaded from JSON files in the `SerializedThread`
format, which makes it easy to use real threads as starting points for
tests/evals.
3. Don't try to restore tool cards when running in headless or eval mode
-- we don't have a window to properly do this.
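The replacement parameter looks roughly like this (variant names are illustrative, not the exact shipped API):

```rust
// Before: a boolean the model often set carelessly.
//   create_or_overwrite: bool
// After: an explicit mode the model must choose deliberately.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EditFileMode {
    /// Apply targeted edits to an existing file.
    Edit,
    /// Create a new file (fails if one already exists).
    Create,
    /// Deliberately replace the entire contents of a file.
    Overwrite,
}
```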
Release Notes:
- N/A
Release Notes:
- Fixed a race condition that sometimes prevented a system-installed
`node` binary from being detected.
- Fixed a bug where the `node.path` setting was not respected when
invoking npm.
Because we instantiated `ContextServerManager` both in `agent` and
`assistant-context-editor`, and these two entities track the running MCP
servers separately, we were effectively running every MCP server twice.
This PR moves the `ContextServerManager` into the `project` crate,
renaming it to `ContextServerStore`. The store can be accessed via a
project instance. This ensures that we only instantiate one
`ContextServerStore` per project.
Also, this PR adds a bunch of tests to ensure that the
`ContextServerStore` behaves correctly (previously there were none).
Closes #28714
Closes #29530
Release Notes:
- N/A
This pull request introduces a new tool for streaming edits. The
short-term goal is for this tool to replace the existing `EditFileTool`,
but we want to get this out the door as soon as possible so that we can
start testing it.
`StreamingEditFileTool` is mutually exclusive with `EditFileTool`. It
will be enabled by default for anyone who has the `agent-stream-edits`
feature flag, as well as anyone who sets `assistant.stream_edits` to
`true` in their settings.
### Implementation
Streaming is achieved by requesting a completion when the `edit_file`
tool gets called. We invoke the model by taking the existing
conversation with the agent and appending a prompt specifically tailored
for editing. In that prompt, we ask the model to produce a stream of
`<old_text>`/`<new_text>` tags. As the model streams text in, we
incrementally parse it and start editing as soon as we can.
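Conceptually, the incremental parsing works along these lines (a simplified sketch; the real parser also has to handle tags split across chunk boundaries, and can start applying `new_text` before its closing tag arrives):

```rust
#[derive(Default)]
struct EditStreamParser {
    buffer: String,
}

struct Edit {
    old_text: String,
    new_text: String,
}

impl EditStreamParser {
    /// Feed in a streamed chunk; returns any edits that just completed.
    fn push(&mut self, chunk: &str) -> Vec<Edit> {
        self.buffer.push_str(chunk);
        let mut edits = Vec::new();
        while let Some(edit) = self.try_take_edit() {
            edits.push(edit);
        }
        edits
    }

    /// Extracts one complete <old_text>/<new_text> pair, if present.
    fn try_take_edit(&mut self) -> Option<Edit> {
        let old_text = extract(&self.buffer, "old_text")?;
        let new_text = extract(&self.buffer, "new_text")?;
        // Drop everything up to and including the consumed </new_text>.
        let end = self.buffer.find("</new_text>")? + "</new_text>".len();
        self.buffer.drain(..end);
        Some(Edit { old_text, new_text })
    }
}

/// Returns the text between the first <tag> and </tag>, if both are present.
fn extract(buffer: &str, tag: &str) -> Option<String> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = buffer.find(&open)? + open.len();
    let end = buffer[start..].find(&close)? + start;
    Some(buffer[start..end].to_string())
}
```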
### Evals
Note that, as part of this pull request, I also defined some new evals
that I used to drive the behavior of the recursive LLM call. To run
them, use this command:
```bash
cargo test --package=assistant_tools --features eval -- eval_extract_handle_command_output
```
Or comment out the `#[cfg_attr(not(feature = "eval"), ignore)]` attribute.
I recommend running them one at a time, because right now we don't
really have a way of orchestrating all these evals. I think we should
invest in that effort once the new agent panel goes live.
Release Notes:
- N/A
---------
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Bennet Bo Fenner <bennetbo@gmx.de>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>
This is based on having observed that there is a lot of variation
between runs at `n=1` and `n=3`.
* With `n=8`, two runs on the same branch give answers that seem close
enough to be reasonably consistent.
* With higher concurrency, trying to run this many repetitions seems to
lead language servers to time out a lot, causing evals to fail.
Release Notes:
- N/A
Closes #27641
This PR fixes invalid proxy URIs being registered as proxies even though
they are not valid proxy URIs.
Whilst investigating #27641, I noticed that currently any proxy URI
passed to `ReqwestClient::proxy_and_user_agent` will be assigned to the
created client, even if the URI is not a valid proxy URI. Take a test
as an example:
We create a URI here and pass it as a proxy to
`ReqwestClient::proxy_and_user_agent`:
https://github.com/zed-industries/zed/blob/main/crates/reqwest_client/src/reqwest_client.rs#L272-L273
In `ReqwestClient::proxy_and_user_agent` we take the proxy parameter here
9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L46)
and set it unconditionally here:
9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L62)
without considering at all whether the proxy was successfully created
above. In conclusion, we currently do not actually check whether a proxy
was successfully created, but rather whether a URI is equal to itself,
which trivially holds. The existing test for a malformed proxy URI
9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L293-L297)
does not check whether invalid proxies cause an error, but rather checks
whether `http::Uri::from_static` panics on an invalid URI, [which it
does, as
documented](https://docs.rs/http/latest/http/uri/struct.Uri.html#panics).
Thus, the tests currently do not really check anything proxy-related,
and invalid proxies are assigned as valid proxies.
---
This PR fixes the behaviour by checking whether the proxy was actually
parsed successfully and only assigning it in that case.
Furthermore, it improves logging in case of errors, so issues like the
linked one are easier to debug (for the linked issue, the log will now
note that the proxy scheme is not supported).
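The fix boils down to only registering the proxy when it actually parses, roughly like this (simplified from the actual change):

```rust
use reqwest::Proxy;

fn build_client(proxy_uri: Option<http::Uri>) -> reqwest::Client {
    let mut builder = reqwest::Client::builder();
    if let Some(uri) = proxy_uri {
        match Proxy::all(uri.to_string()) {
            // Only a successfully parsed proxy is assigned to the client.
            Ok(proxy) => builder = builder.proxy(proxy),
            // Previously this error was discarded and the URI was treated
            // as a valid proxy anyway; now we surface it in the logs.
            Err(error) => log::error!("failed to parse proxy URI: {error}"),
        }
    }
    builder.build().expect("building reqwest client")
}
```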
Lastly, it also updates the test for a malformed proxy URI. The test now
actually checks that malformed proxy URIs are not registered for the
client rather than testing the `http` crate.
The update also initially caused the [test for a `socks4a`
proxy](9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L280C1-L282C50))
to fail. This happened because the reqwest library only introduced
support for `socks4a` proxies in [version
0.12.13](https://github.com/seanmonstar/reqwest/blob/master/CHANGELOG.md#v01213).
Thus, this PR includes a bump of the reqwest library to add proper
support for socks4a proxies.
Release Notes:
- Added support for socks4a proxies.
---------
Co-authored-by: Peter Tripp <peter@zed.dev>
`App::http_client` and `Client::http_client` both return an owned `Arc`,
which they clone internally. This means we can remove unnecessary clones
when calling these methods.
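A minimal illustration of the call-site cleanup (types elided down to the essentials):

```rust
use std::sync::Arc;

struct HttpClient;

struct Client {
    http_client: Arc<HttpClient>,
}

impl Client {
    // Already returns an owned Arc, cloning internally.
    fn http_client(&self) -> Arc<HttpClient> {
        self.http_client.clone()
    }
}

fn use_client(client: &Client) {
    // Before: let http = client.http_client().clone();
    // The extra clone is redundant, since we already own an Arc.
    let http = client.http_client();
    let _ = http;
}
```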
Release Notes:
- N/A
This PR adds support for the eval to read environment variables from a
`.env` file located in the `crates/eval` directory.
For instance, you can use it to set your Anthropic API key:
```
ANTHROPIC_API_KEY=<secret>
```
Release Notes:
- N/A
This update generates a single self-contained .html file that shows an
overview of evaluation threads in the browser. It's useful for:
- Quickly reviewing results
- Sharing evaluation runs
- Debugging
- Comparing models (TBD)
Features:
- Export thread JSON from the UI
- Keyboard navigation (j/k or Ctrl + ←/→)
- Toggle between compact and full views
Generating the overview:
- `cargo run -p eval` will write this file in the run dir's root.
- Or you can call `cargo run -p eval --bin explorer` to generate it
without running evals.
Release Notes:
- N/A
### Problem
We want to start continuously tracking our progress on agent evals over
time. As part of this, we'd like the *score* to have a clear,
interpretable meaning. Right now, it's a number from 0 to 5, but it's
not clear what any particular number means. In addition, scores vary
widely from run to run, because the agent's output is non-deterministic. We
try to stabilize the score using a panel of judges, but the behavior of
the agent itself varies much more widely than the judges' scores for a
given run.
### Solution
* **explicit meanings of scores** - In this PR, we're prescribing the
diff and thread criteria files so that they *must* be unordered lists of
assertions. For both the thread and the diff, rather than providing an
abstract score, the judge's task is simply to count how many of these
assertions are satisfied. A percentage score can be derived from this
number divided by the total number of assertions (see the sketch after
this list).
* **repetitions** - Rather than running each example once and judging
it N times, we'll **run** the example N times. Right now, I'm just
judging the output once per run, because I believe that with these
clearer scoring criteria, the main source of non-determinism will be the
*agent's* behavior, not the judge's.
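A trivial sketch of the derived percentage score:

```rust
// Percentage score = satisfied assertions / total assertions.
fn percentage_score(satisfied: usize, total: usize) -> f32 {
    100.0 * satisfied as f32 / total as f32
}

// e.g. percentage_score(3, 4) == 75.0
```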
### Questions
* **accounting for diagnostic errors** - Previously, the judge was asked
to incorporate diagnostics into their abstract scores. Now that the
"score" is determined directly from the criteria, the diagnostic will
not be captured in the score. How should the diagnostics be accounted
for in the eval? One thought is - let's simply count and report the
number of errors remaining after the agent finishes, as a separate field
of the run (along with diff score and thread score). We could consider
normalizing it using the total lines of added code (like errors per 100
lines of code added) in order to give it some semblance of stability
between examples.
* **repetitions** - How many repetitions should we run on CI? Each
repetition takes significant time, but I think running more than one
repetition will make the scores significantly less volatile.
### Todo
* [x] Fix `--concurrency` implementation so that only N tasks are
spawned
* [x] Support `--repetitions` efficiently (re-using the same worktree)
* [x] Restructure judge prompts to count passing criteria, not compute
abstract score
* [x] Report total number of diagnostics in some way
* [x] Format output nicely
Release Notes:
- N/A
---------
Co-authored-by: Antonio Scandurra <me@as-cii.com>
The old one wasn't linking, and
https://github.com/zed-industries/zed/pull/29081 has a bunch of merge
conflicts. Wanted to start simple/small.
## Todo
* [x] Remove low-signal examples
* [x] Make the eval run on a cron, on main, and on any PR with the
`run-eval` label
* [x] Noise in logs about failure to write settings
```
[2025-04-21T20:45:04Z ERROR settings] Failed to write settings to file
"/home/runner/.config/zed/settings.json"
Caused by:
No such file or directory (os error 2) at path
"/home/runner/.config/zed/.tmpLewFEs"
```
* [x] `Agentic loop stalled`
(https://github.com/zed-industries/zed/actions/runs/14581044243/job/40897622894)
* [x] Make sure that events are recorded in snowflake
* [ ] Change judge criteria to be more explicit about meanings of scores
Release Notes:
- N/A
---------
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
This pull request prints all the used tools and their failure rates.
The objective should be to minimize that failure rate.
@tmickleydoyle: this also changes the telemetry event to report
`tool_metrics` as opposed to `tool_use_counts`. Ideally I'd love to be
able to plot failure rates by tool and hopefully see that percentage go
down. Can we do that with the data we're tracking with this pull
request?
Release Notes:
- N/A
Now that we've established a proper eval in tree, this PR reboots our
agent loop back to a set of minimal tools and simpler prompts. We
should aim to get this branch feeling subjectively competitive with
what's on main, then merge it and build from there.
Let's invest in our eval and use it to drive better performance of the
agent loop. How you can help: Pick an example, and then make the outcome
faster or better. It's fine to even use your own subjective judgment, as
our evaluation criteria likely need tuning as well at this point. Focus
on making the agent work better in your own subjective experience first.
Let's focus on simple/practical improvements to make this thing work
better, then determine how we can craft our judgment criteria to lock
those improvements in.
Release Notes:
- N/A
---------
Co-authored-by: Max <max@zed.dev>
Co-authored-by: Antonio <antonio@zed.dev>
Co-authored-by: Agus <agus@zed.dev>
Co-authored-by: Richard <richard@zed.dev>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>
The `always_allow_tool_actions` setting would get overridden with the
default when we loaded each example project, leading to examples
stalling when they ran a tool that needed confirmation. There's now a
separate `runner_settings.json` file where we can configure the
environment for the eval.
Release Notes:
- N/A
---------
Co-authored-by: Oleksiy <oleksiy@zed.dev>
Release Notes:
- Fixed a regression that caused the agent to hang sometimes.
---------
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>