Looks like I accidentally touched a line of code in an eval fixture in
#31479, despite intentionally trying to avoid that code. Thanks
@cole-miller!
Release Notes:
- N/A
This change instructs models to wrap new file content in Markdown fences
and introduces a parser for this format. The reasons are:
1. This is the format we put a lot of effort into explaining in the
system prompt.
2. Gemini really prefers to do it.
3. It adds an option for the model to think before writing the content.
The `eval_zode` pass rate for Gemini models goes from 0% to 100%. Other
models were already at 100%; this hasn't changed.
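For illustration, here is a minimal sketch of the kind of parser this implies, assuming the model may emit free-form text (e.g. its thinking) before the opening fence. The actual parser in this change also handles streaming input:

```rust
/// Extracts file content wrapped in a Markdown code fence, skipping any
/// text the model emits before the opening fence.
fn extract_fenced_content(response: &str) -> Option<String> {
    let mut lines = response.lines();
    // Skip ahead to the opening fence; any info string (e.g. a language
    // tag after the backticks) is ignored.
    lines.find(|line| line.trim_start().starts_with("```"))?;
    let mut content = String::new();
    for line in lines {
        if line.trim_start().starts_with("```") {
            // Closing fence: the file content is complete.
            return Some(content);
        }
        content.push_str(line);
        content.push('\n');
    }
    // No closing fence (e.g. a truncated stream): keep what we have.
    Some(content)
}
```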
Release Notes:
- N/A
Open the inspector with `dev: toggle inspector` from the command palette,
or press `cmd-alt-i` on macOS or `ctrl-alt-i` on Linux.
https://github.com/user-attachments/assets/54c43034-d40b-414e-ba9b-190bed2e6d2f
* Picking of elements via the mouse, with scroll wheel to inspect
occluded elements.
* Temporary manipulation of the selected element.
* Layout info and JSON-based style manipulation for `Div`.
* Navigation to code that constructed the element.
Big thanks to @as-cii and @maxdeviant for sorting out how to implement
the core of an inspector.
Release Notes:
- N/A
---------
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Marshall Bowers <git@maxdeviant.com>
Co-authored-by: Federico Dionisi <code@fdionisi.me>
This change improves `eval_extract_handle_command_output` results for
all models:
Model             | Pass rate before | Pass rate after
------------------|------------------|----------------
claude-3.7-sonnet | 0.96             | 0.98
gemini-2.5-pro    | 0.35             | 0.86
gpt-4.1           | 0.81             | 1.00
Part of this improvement comes from more robust evaluation, which now
accepts multiple possible outcomes. Another part comes from prompt
adaptation: addressing common Gemini failure modes, adding a few-shot
example, and, in the final commit, auto-rewriting the instructions for
clarity and conciseness.
This change still needs validation from larger end-to-end evals.
Release Notes:
- N/A
1. Add a system prompt: this is how it's called from threads. Previously,
we were sending requests without one.
2. Fix an issue with writing agent thought into a newly created empty
file.
Release Notes:
- N/A
---------
Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
Co-authored-by: Antonio Scandurra <me@as-cii.com>
https://github.com/zed-industries/zed/issues/30972 brought up another
case where our context is not enough to track the actual source of the
issue: we get a general top-level error without an inner error.
The reason for this was `.ok_or_else(|| anyhow!("failed to read HEAD SHA"))?;`
at the top level.
This PR finally reworks the way we use anyhow to reduce such issues (or
at least make it simpler to bubble them up later in a fix).
On top of that, it uses a few more anyhow methods for better readability
(see the sketch after the list):
* `.ok_or_else(|| anyhow!("..."))`, `map_err`, and similar error-conversion
and option-reporting calls are replaced with `context` and
`with_context` calls
* in addition, various `anyhow!("failed to do ...")` messages are
replaced with `.context("Doing ...")` messages to remove the parasitic
`failed to` text
* `anyhow::ensure!` is used instead of `if ... { return Err(...); }`
calls
* `anyhow::bail!` is used instead of `return Err(anyhow!(...));`
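Here is that sketch, a minimal before/after of the patterns above (the `Repo` type and its `read_head` method are made up so the example is self-contained):

```rust
use anyhow::{ensure, Context as _, Result};

// Hypothetical repository handle, stubbed for illustration.
struct Repo;

impl Repo {
    fn read_head(&self) -> Option<String> {
        Some("abc123".into())
    }
}

fn head_sha(repo: &Repo) -> Result<String> {
    // Before: repo.read_head().ok_or_else(|| anyhow!("failed to read HEAD SHA"))?
    // After: `context` converts the `None` into an error with this message
    // and avoids the parasitic "failed to" prefix.
    let sha = repo.read_head().context("reading HEAD SHA")?;

    // Before: if sha.is_empty() { return Err(anyhow!("HEAD SHA is empty")); }
    ensure!(!sha.is_empty(), "HEAD SHA is empty");

    Ok(sha)
}
```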
Release Notes:
- N/A
This eval checks that the Edit Agent can create an empty file without
writing its thoughts into it. The issue is not specific to empty files,
but it's easier to reproduce with them.
For some mysterious reason, I could easily reproduce this issue roughly
90% of the time in actual Zed. However, once I extracted the exact LLM
request before the failure point and generated from that, the
reproduction rate dropped to 2%!
Things I've tried to make sure it's not a fluke: disabling prompt
caching, capturing the LLM request via a proxy server, and running the
prompt on Claude separately from evals. Every time, the model mostly gave
good outcomes, which doesn't match my actual experience in Zed.
At some point I discovered that simply adding one insignificant space or
newline to the prompt suddenly reproduces the failure almost perfectly.
This weirdness happens even outside the Zed code base and even when
using a different subscription. The result is the same: an extra newline
or space changes the model's behavior significantly enough that the
pass rate drops from 99% to 0-3%.
I have no explanation for this.
Release Notes:
- N/A
1. The `edit_file` tool tended to use `create_or_overwrite` a bit too
often, leading to corruption of long files. This change replaces the
boolean flag with an `EditFileMode` enum, which helps the agent make a
more deliberate choice when overwriting files (see the sketch after this
list). With this change, the pass rate of the new eval increased from 10%
to 100%.
2. Evals: added the ability to run an eval on top of an existing thread.
Threads can now be loaded from JSON files in the `SerializedThread`
format, which makes it easy to use real threads as starting points for
tests/evals.
3. Don't try to restore tool cards when running in headless or eval mode
-- we don't have a window to properly do this.
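Returning to point 1, here is a minimal sketch of what the enum-based mode might look like; the variant names are assumptions, not necessarily the tool's exact ones:

```rust
use serde::Deserialize;

/// Replaces the old `create_or_overwrite: bool` flag. Deserialized from
/// the model's tool call, so the model must name a mode explicitly.
#[derive(Deserialize)]
#[serde(rename_all = "snake_case")]
enum EditFileMode {
    /// Apply targeted edits to an existing file.
    Edit,
    /// Create a new file at the given path.
    Create,
    /// Replace the entire contents of an existing file.
    Overwrite,
}
```

Making the model pick a named variant is a more deliberate act than flipping a boolean, which plausibly explains the jump in pass rate.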
Release Notes:
- N/A
This is very basic support for images. There are a number of other TODOs
before this is really a first-class supported feature, so I'm not adding
any release notes for it; for now, this PR just makes it so that if
`read_file` tries to read a PNG (which has come up in practice), it at
least sends it to Anthropic correctly instead of messing it up.
This also lays the groundwork for future PRs adding more first-class
support for images in tool calls across more image file formats and LLM
providers.
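For context, Anthropic accepts images as base64-encoded content blocks. A minimal sketch of building one from raw PNG bytes (the function name and surrounding plumbing are assumptions, not the code in this PR):

```rust
use base64::{engine::general_purpose::STANDARD, Engine as _};
use serde_json::json;

/// Builds an Anthropic image content block from raw PNG bytes.
fn png_content_block(bytes: &[u8]) -> serde_json::Value {
    json!({
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": STANDARD.encode(bytes),
        }
    })
}
```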
Release Notes:
- N/A
---------
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Agus Zubiaga <agus@zed.dev>
This allows us to debug the raw edits that were generated: when people
report feedback, when running evals, and when opening the thread as
Markdown.
Release Notes:
- Improved debug output for agent threads.
Release Notes:
- Fixed a bug that would prevent the agent from working over SSH.
---------
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Cole Miller <m@cole-miller.net>
This improves the pass rate on the new eval scenario by ~80% (`0.29`
before vs `0.525` after) without decreasing performance in the other evals.
Release Notes:
- Improved the performance of the `edit_file` tool.
Release Notes:
- Fixed a bug that would cause rejecting a hunk from the agent to delete
the file if the agent had decided to rewrite that file from scratch.
Nathan here: I also tacked on a bunch of UI refinements.
Release Notes:
- Introduced the ability to follow the agent around as it reads and
edits files.
---------
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Refs #29733
This pull request introduces a new field on `StreamingEditFileTool`
that lets the model create or overwrite a file in a streaming way. When
either the `assistant.stream_edits` setting or the `agent-stream-edits`
feature flag is enabled, we are going to disable the `CreateFileTool` so
that the agent model can only use `StreamingEditFileTool` for file
creation.
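As a sketch of the shape of the change (the struct name is inferred from this log and the field name from the later PR that replaced it with `EditFileMode`, so neither is necessarily the exact code):

```rust
use serde::Deserialize;
use std::path::PathBuf;

#[derive(Deserialize)]
struct StreamingEditFileToolInput {
    /// Path of the file to edit or create.
    path: PathBuf,
    /// The new field: when true, stream freshly generated content into
    /// the file (creating or overwriting it) instead of applying
    /// targeted edits.
    create_or_overwrite: bool,
}
```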
Release Notes:
- N/A
---------
Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>
This pull request introduces a new tool for streaming edits. The
short-term goal is for this tool to replace the existing `EditFileTool`,
but we want to get this out the door as soon as possible so that we can
start testing it.
`StreamingEditFileTool` is mutually exclusive with `EditFileTool`. It
will be enabled by default for anyone who has the `agent-stream-edits`
feature flag, as well as for people who set `assistant.stream_edits` to
`true` in their settings.
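For example, opting in via the setting named above looks like this in settings:

```json
{
  "assistant": {
    "stream_edits": true
  }
}
```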
### Implementation
Streaming is achieved by requesting a separate completion when the
`edit_file` tool is called. We invoke the model by taking the existing
conversation with the agent and appending a prompt specifically tailored
for editing. In that prompt, we ask the model to produce a stream of
`<old_text>`/`<new_text>` tags. As the model streams text in, we
incrementally parse it and start editing as soon as we can.
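To make the tag format concrete, here is a simplified, non-streaming sketch of the parsing; the real parser works incrementally on partial input rather than on a complete buffered response:

```rust
/// Parses `<old_text>`/`<new_text>` pairs out of a complete response.
fn parse_edit_pairs(response: &str) -> Vec<(String, String)> {
    let mut pairs = Vec::new();
    let mut rest = response;
    loop {
        let Some((old_text, after_old)) = extract_tag(rest, "old_text") else {
            break;
        };
        let Some((new_text, after_new)) = extract_tag(after_old, "new_text") else {
            break;
        };
        pairs.push((old_text.to_string(), new_text.to_string()));
        rest = after_new;
    }
    pairs
}

/// Returns the body of the first `<tag>...</tag>` pair in `input`,
/// along with the text that follows it.
fn extract_tag<'a>(input: &'a str, tag: &str) -> Option<(&'a str, &'a str)> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = input.find(&open)? + open.len();
    let end = start + input[start..].find(&close)?;
    Some((&input[start..end], &input[end + close.len()..]))
}
```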
### Evals
Note that, as part of this pull request, I also defined some new evals
that I used to drive the behavior of the recursive LLM call. To run
them, use this command:
```bash
cargo test --package=assistant_tools --features eval -- eval_extract_handle_command_output
```
Or comment out the `#[cfg_attr(not(feature = "eval"), ignore)]` attribute.
I recommend running the evals one at a time, because right now we don't
really have a way of orchestrating all of them. I think we should invest
in that effort once the new agent panel goes live.
Release Notes:
- N/A
---------
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Bennet Bo Fenner <bennetbo@gmx.de>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>