This problem seems to be specific to Opus 4. Eval shows improvement from
89% to 97%.
Closes: https://github.com/zed-industries/zed/issues/32060
Release Notes:
- N/A
Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
When reaching the consecutive tool call limit, the agent gets blocked
and without a notification, you wouldn't know that. This PR adds the
ability to be notified when that happens, and you can use either sound
_and_ toast, or just one of them.
Release Notes:
- agent: Added support for getting notified (via toast and/or sound)
when reaching the consecutive tool call limit.
This PR adds a new `intent` field to completion requests to assist in
categorizing them correctly.
Release Notes:
- N/A
---------
Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
This expands our deserialization of JSON from models to be more tolerant
of different variations that the model may send, including
capitalization, wrapping things in objects vs. being plain strings, etc.
Also when deserialization fails, it reports the entire error in the JSON
so we can see what failed to deserialize. (Previously these errors were
very unhelpful at diagnosing the problem.)
Finally, also removes the `WrappedText` variant since the custom
deserializer just turns that style of JSON into a normal `Text` variant.
Release Notes:
- N/A
This is needed for apples-to-apples comparison of different agent
models.
Another change is that now `cargo -p eval` accepts model names as
`provider_id/model_id` instead of separate `--provider` and `--model`
params.
Release Notes:
- N/A
This PR fixes an issue where the eval was incorrectly pulling the
provider/model from the user settings, which could cause problems when
running certain evals.
Was introduced in #30168 due to the restructuring after the removal of
the `assistant` crate.
Release Notes:
- N/A
https://github.com/zed-industries/zed/issues/30972 brought up another
case where our context is not enough to track the actual source of the
issue: we get a general top-level error without inner error.
The reason for this was `.ok_or_else(|| anyhow!("failed to read HEAD
SHA"))?; ` on the top level.
The PR finally reworks the way we use anyhow to reduce such issues (or
at least make it simpler to bubble them up later in a fix).
On top of that, uses a few more anyhow methods for better readability.
* `.ok_or_else(|| anyhow!("..."))`, `map_err` and other similar error
conversion/option reporting cases are replaced with `context` and
`with_context` calls
* in addition to that, various `anyhow!("failed to do ...")` are
stripped with `.context("Doing ...")` messages instead to remove the
parasitic `failed to` text
* `anyhow::ensure!` is used instead of `if ... { return Err(...); }`
calls
* `anyhow::bail!` is used instead of `return Err(anyhow!(...));`
Release Notes:
- N/A
Some providers sometimes send `{ "type": "text", "text": ... }` instead
of just the text as a string. Now we accept those instead of erroring.
Release Notes:
- N/A
- Evals returning an error (e.g., LLM API format mismatch) were silently
skipped in the aggregated results. Now we count them as a failure (0%
success score).
- Setting the `VERBOSE` environment variable to something non-empty
disables string truncation
Release Notes:
- N/A
1. The `edit_file` tool tended to use `create_or_overwrite` a bit too
often, leading to corruption of long files. This change replaces the
boolean flag with an `EditFileMode` enum, which helps Agent make a more
deliberate choice when overwriting files.
With this change, the pass rate of the new eval increased from 10% to
100%.
2. eval: Added ability to run eval on top of an existing thread. Threads
can now be loaded from JSON files in the `SerializedThread` format,
which makes it easy to use real threads as starting points for
tests/evals.
3. Don't try to restore tool cards when running in headless or eval mode
-- we don't have a window to properly do this.
Release Notes:
- N/A
This is very basic support for them. There are a number of other TODOs
before this is really a first-class supported feature, so not adding any
release notes for it; for now, this PR just makes it so that if
read_file tries to read a PNG (which has come up in practice), it at
least correctly sends it to Anthropic instead of messing up.
This also lays the groundwork for future PRs for more first-class
support for images in tool calls across more image file formats and LLM
providers.
Release Notes:
- N/A
---------
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Agus Zubiaga <agus@zed.dev>
This is our first eval of the Minimal tool profile. Right now they're
all passing; the value of having it is to catch regressions in the
system prompt (which has special logic in it for the case where no tools
are enabled).
Release Notes:
- N/A
Release Notes:
- Fixed a race condition that sometimes prevented a system-installed
`node` binary from being detected.
- Fixed a bug where the `node.path` setting was not respected when
invoking npm.
This allows us to debug the raw edits that were generated when people
report feedback, when running evals and when opening the thread as
Markdown.
Release Notes:
- Improved debug output for agent threads.
Release Notes:
- Fixed a bug that would prevent the agent from working over SSH.
---------
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Cole Miller <m@cole-miller.net>
WIP
- On macOS/Linux, run the command in bash instead of the user's shell
- Try to prevent the agent from running commands that expect interaction
Release Notes:
- Agent Beta: Switched to using `bash` (if available) instead of the
user's shell when calling the terminal tool.
- Agent Beta: Prevented the agent from hanging when trying to run
interactive commands.
---------
Co-authored-by: WeetHet <stas.ale66@gmail.com>
Because we instantiated `ContextServerManager` both in `agent` and
`assistant-context-editor`, and these two entities track the running MCP
servers separately, we were effectively running every MCP server twice.
This PR moves the `ContextServerManager` into the project crate (now
called `ContextServerStore`). The store can be accessed via a project
instance. This ensures that we only instantiate one `ContextServerStore`
per project.
Also, this PR adds a bunch of tests to ensure that the
`ContextServerStore` behaves correctly (Previously there were none).
Closes#28714Closes#29530
Release Notes:
- N/A
This change:
1. Catches attempts to use missing tools. If this happens, we now send
Agent a message listing available tools, after which Agent can
gracefully recover. Prior behavior: thread would stop in a broken state.
Example of a hallucinated call and a message we send back:

2. Adds evals for hallucinated tool use and imagined edits
3. Adds ability to configure a profile name in evals.
Release Notes:
- N/A
I also introduced a new eval to prove the encouragement actually makes a
difference.
Release Notes:
- Improved agent behavior when streaming edits, encouraging it to
editing files as opposed to creating them from scratch
Enables reviewing agent edits from single-file editors in addition to
the multibuffer experience we already had.
https://github.com/user-attachments/assets/a2c287f0-51d6-43a1-8537-821498b91983
This feature can be turned off by setting `assistant.single_file_review:
false`.
Release Notes:
- agent: Review edits in single-file editors
This sets us up to display queue position information to the user, once
our language model backend is updated to support request queuing.
The JSON returned by the LLM backend will need to look like this:
```json
{"queue": {"status": "queued", "position": 1}}
{"queue": {"status": "started"}}
{"event": {"THE_UPSTREAM_MODEL_PROVIDER_EVENT": "..."}}
```
Release Notes:
- N/A
---------
Co-authored-by: Marshall Bowers <git@maxdeviant.com>
Closes #ISSUE
Co-authored-by: Bennet <bennet@zed.dev>
Release Notes:
- Added support for context `@mentions` in the inline prompt editor and
when editing past messages in the agent panel.
---------
Co-authored-by: Bennet Bo Fenner <bennet@zed.dev>
Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
This pull request introduces a new tool for streaming edits. The
short-term goal is for this tool to replace the existing `EditFileTool`,
but we want to get this out the door as soon as possible so that we can
start testing it.
`StreamingEditFileTool` is mutually exclusive with `EditFileTool`. It
will be enabled by default for anyone who has the `agent-stream-edits`
feature flag, as well as people that set `assistant.stream_edits` to
`true` in their settings.
### Implementation
Streaming is achieved by requesting a completion while the `edit_file`
tool gets called. We invoke the model by taking the existing
conversation with the agent and appending a prompt specifically tailored
for editing. In that prompt, we ask the model to produce a stream of
`<old_text>`/`<new_text>` tags. As the model streams text in, we
incrementally parse it and start editing as soon as we can.
### Evals
Note that, as part of this pull request, I also defined some new evals
that I used to drive the behavior of the recursive LLM call. To run
them, use this command:
```bash
cargo test --package=assistant_tools --features eval -- eval_extract_handle_command_output
```
Or comment out the `#[cfg_attr(not(feature = "eval"), ignore)]` macro.
I recommend running them one at a time, because right now we don't
really have a way of orchestrating of all these evals. I think we should
invest into that effort once the new agent panel goes live.
Release Notes:
- N/A
---------
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Bennet Bo Fenner <bennetbo@gmx.de>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>
I saw a slice panic (for begin > end) in a debug build of the eval. This
should just be a failed assertion, not a panic that takes out the whole
eval run!
Release Notes:
- N/A
This is based on having observed that there is a lot of variation
between runs on `n=1` and `n=3`.
* With `n=8` two runs on the same branch give answers that seem close
enough to be reasonably consistent.
* With higher concurrency, trying to run this many repetitions seems to
lead language servers to time out a lot, causing evals to fail.
Release Notes:
- N/A
Fixed issue where eval thread judges were not considering the last
response in the thread.
The problem was that they were getting the full list of messages from
`last_request`, which (being a request!) did not have the response yet.
Release Notes:
- N/A
The `grep` tool used to include 4 lines of context around the match, but
the lines included would often be unhelpful. This PR improves this
behavior by using the range of the parent syntax node that contains the
full line(s) matched.
The match headers will also now include symbol breadcrumbs so that the
model can already gather code structure before/without reading files.
````md
### impl GitRepository for RealGitRepository › fn compare_checkpoints › L1278-1284
```rust
let result = git
.run(&[
"diff-tree",
"--quiet",
&left.commit_sha.to_string(),
&right.commit_sha.to_string(),
])
```
````
This positively impacts the `add_arg_to_trait_method` eval example with
better diff output, fewer tool failures, and reduced total turns.
Note: We have some plans to use a an "elision" approach where we would
combine all matches for a given file, skipping lines between them while
keeping symbol declaration lines. The theory is that this would be map
more closely to the expected input for edits. For now, this PR is a
significant improvement.
Release Notes:
- Agent: Enrich `grep` tool output with syntax information
Previously, if you clicked on a user message to edit it, and then, while
the user message has the editor pending, sent a new message via the
textarea, the whole thread would be grayed out because we hadn't
dismissed the to-be-edited pending user message. That's now fixed.
Release Notes:
- agent: Fixed a bug that would make the whole thread be grayed out upon
sending a new message while a user message had a pending edit.
This PR adds a "max mode" toggle to the Agent panel, for models that
support it.
Only visible to folks in the `new-billing` feature flag.
Icon is just a placeholder.
Release Notes:
- N/A
Closes#27641
This PR fixes invalid proxy URIs being registered despite the URI not
being a valid proxy URI.
Whilst investigating #27641 , I noticed that currently any proxy URI
passed to `RequestClient::proxy_and_user_agent` will be assigned to the
created client, even if the URI is not a valid proxy URI. Given a test
as an example:
We create an URI here and pass it as a proxy to
`ReqwestClient::proxy_and_user_agent`:
https://github.com/zed-industries/zed/blob/main/crates/reqwest_client/src/reqwest_client.rs#L272-L273
In `ReqwestClient::proxy_and_user_agent`we take the proxy parameter here
9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L46)
and set it unconditionally here:
9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L62)
, not considering at all whether the proxy was successfully created
above. Concluding, we currently do not actually check whether a proxy
was successfully created, but rather whether an URI is equal to itself,
which trivially holds. The existing test for a malformed proxy URI
9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L293-L297)
does not check whether invalid proxies cause an error, but rather checks
whether `http::Uri::from_static` panics on an invalid URI, [which it
does as
documented](https://docs.rs/http/latest/http/uri/struct.Uri.html#panics).
Thus, the tests currently do not really check anything proxy-related and
invalid proxies are assigned as valid proxies.
---
This PR fixes the behaviour by considering whether the proxy was
actually properly parsed and only assigning it if that is the case.
Furthermore, it improves logging in case of errors so issues like the
linked one are easier to debug (for the linked issue, the log will now
include that the proxy schema is not supported in the logs).
Lastly, it also updates the test for a malformed proxy URI. The test now
actually checks that malformed proxy URIs are not registered for the
client rather than testing the `http` crate.
The update also initially caused the [test for a `socks4a`
proxy](9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L280C1-L282C50))
to fail. This happened because the reqwest-library introduced supports
for `socks4a` proxies in [version
0.12.13](https://github.com/seanmonstar/reqwest/blob/master/CHANGELOG.md#v01213).
Thus, this PR includes a bump of the reqwest library to add proper
support for socks4a proxies.
Release Notes:
- Added support for socks4a proxies.
---------
Co-authored-by: Peter Tripp <peter@zed.dev>
`App::http_client` and `Client::http_client` both return an owned `Arc`
which it clones internally. This means we can remove unnecessary clones
when calling these methods.
Release Notes:
- N/A
This PR adds support for the eval to read environment variables from a
`.env` file located in the `crates/eval` directory.
For instance, you can use it to set your Anthropic API key:
```
ANTHROPIC_API_KEY=<secret>
```
Release Notes:
- N/A