This problem seems to be specific to Opus 4. Eval shows improvement from
89% to 97%.
Closes: https://github.com/zed-industries/zed/issues/32060
Release Notes:
- N/A
Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
1. The `edit_file` tool tended to use `create_or_overwrite` a bit too
often, leading to corruption of long files. This change replaces the
boolean flag with an `EditFileMode` enum, which helps Agent make a more
deliberate choice when overwriting files.
With this change, the pass rate of the new eval increased from 10% to
100%.
2. eval: Added ability to run eval on top of an existing thread. Threads
can now be loaded from JSON files in the `SerializedThread` format,
which makes it easy to use real threads as starting points for
tests/evals.
3. Don't try to restore tool cards when running in headless or eval mode
-- we don't have a window to properly do this.
Release Notes:
- N/A
- Adds a new smoke test for the use of the read_file tool by the agent
in an SSH project
- Fixes the SSH shutdown sequence to use a timer from the app's executor
instead of always using a real timer
- Changes the main executor loop for tests to advance the clock
automatically instead of panicking with `parked with nothing left to
run` when there is a delayed task
Release Notes:
- N/A
Release Notes:
- Fixed a bug that would prevent the agent from working over SSH.
---------
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Cole Miller <m@cole-miller.net>
This PR removes all of the feature flag checks related to the Agent.
Tried to do this in the least invasive way possible; we can follow up
with a full removal.
Release Notes:
- N/A
WIP
- On macOS/Linux, run the command in bash instead of the user's shell
- Try to prevent the agent from running commands that expect interaction
Release Notes:
- Agent Beta: Switched to using `bash` (if available) instead of the
user's shell when calling the terminal tool.
- Agent Beta: Prevented the agent from hanging when trying to run
interactive commands.
---------
Co-authored-by: WeetHet <stas.ale66@gmail.com>
I also introduced a new eval to prove the encouragement actually makes a
difference.
Release Notes:
- Improved agent behavior when streaming edits, encouraging it to
editing files as opposed to creating them from scratch
Some changes in the LanguageModelRegistry caused the web search tool not
to show up, because the `DefaultModelChanged` event is not emitted at
startup anymore.
Release Notes:
- agent: Fixed an issue where the web search tool would not be available
after starting Zed (only when using zed.dev as a provider).
Refs #29733
This pull request introduces a new field to the `StreamingEditFileTool`
that lets the model create or overwrite a file in a streaming way. When
one of the `assistant.stream_edits` setting / `agent-stream-edits`
feature flag is enabled, we are going to disable the `CreateFileTool` so
that the agent model can only use `StreamingEditFileTool` for file
creation.
Release Notes:
- N/A
---------
Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>
This pull request introduces a new tool for streaming edits. The
short-term goal is for this tool to replace the existing `EditFileTool`,
but we want to get this out the door as soon as possible so that we can
start testing it.
`StreamingEditFileTool` is mutually exclusive with `EditFileTool`. It
will be enabled by default for anyone who has the `agent-stream-edits`
feature flag, as well as people that set `assistant.stream_edits` to
`true` in their settings.
### Implementation
Streaming is achieved by requesting a completion while the `edit_file`
tool gets called. We invoke the model by taking the existing
conversation with the agent and appending a prompt specifically tailored
for editing. In that prompt, we ask the model to produce a stream of
`<old_text>`/`<new_text>` tags. As the model streams text in, we
incrementally parse it and start editing as soon as we can.
### Evals
Note that, as part of this pull request, I also defined some new evals
that I used to drive the behavior of the recursive LLM call. To run
them, use this command:
```bash
cargo test --package=assistant_tools --features eval -- eval_extract_handle_command_output
```
Or comment out the `#[cfg_attr(not(feature = "eval"), ignore)]` macro.
I recommend running them one at a time, because right now we don't
really have a way of orchestrating of all these evals. I think we should
invest into that effort once the new agent panel goes live.
Release Notes:
- N/A
---------
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Bennet Bo Fenner <bennetbo@gmx.de>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>
This PR removes two fields from JSON schemas (`$schema` and `title`),
which are not expected by any model provider, but were spuriously
included by our JSON schema library, `schemars`.
These added noise to requests and cost wasted input tokens.
### Old
```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "FetchToolInput",
"type": "object",
"required": [
"url"
],
"properties": {
"url": {
"description": "The URL to fetch.",
"type": "string"
}
}
}
```
### New:
```json
{
"properties": {
"url": {
"description": "The URL to fetch.",
"type": "string"
}
},
"required": [
"url"
],
"type": "object"
}
```
- N/A
This PR significantly improves the quality of the initial file search
that occurs when the model doesn't yet know the full path to a file it
needs to read/edit.
Previously, the assertions in file_search often failed on main as the
model attempted to guess full file paths. On this branch, it reliably
calls `find_path` (previously `path_search`) before reading files.
After getting the model to find paths first, I noticed it would try
using `grep` instead of `path_search`. This motivated renaming
`path_search` to `find_path` (continuing the analogy to unix commands)
and adding system prompt instructions about proper tool selection.
Note: I know the command is just called `find`, but that seemed too
general.
In my eval runs, the `file_search` example improved from 40% ± 10% to
98% ± 2%. The only assertion I'm seeing occasionally fail is "glob
starts with `**` or project". We can probably add some instructions in
that regard.
Release Notes:
- N/A
This PR refines a bit the web search tool UI by introducing a component
(`ToolCallCardHeader`) that aims to standardize the heading element of
tool calls in the thread.
In terms of next steps, I plan to evolve this component further soon
(e.g., building a full-blown "tool call card" component), and even move
it to a place where I can re-use it in the active_thread as well without
making the `assistant_tools` a dependency of it.
Release Notes:
- N/A
This PR renames the `regex_search` tool to `grep` because I think it
conveys more meaning to the model, the idea of searching the filesystem
with a regular expression. It's also one word and the model seems to be
using it effectively after some additional prompt tuning.
It also takes an include pattern to filter on the specific files we try
to search. I'd like to encourage the model to scope its searches more
aggressively, as in my testing, I'm only seeing it filter on file
extension.
Release Notes:
- N/A
Now that we've established a proper eval in tree, this PR is reboots of
our agent loop back to a set of minimal tools and simpler prompts. We
should aim to get this branch feeling subjectively competitive with
what's on main and then merge it, and build from there.
Let's invest in our eval and use it to drive better performance of the
agent loop. How you can help: Pick an example, and then make the outcome
faster or better. It's fine to even use your own subjective judgment, as
our evaluation criteria likely need tuning as well at this point. Focus
on making the agent work better in your own subjective experience first.
Let's focus on simple/practical improvements to make this thing work
better, then determine how we can craft our judgment criteria to lock
those improvements in.
Release Notes:
- N/A
---------
Co-authored-by: Max <max@zed.dev>
Co-authored-by: Antonio <antonio@zed.dev>
Co-authored-by: Agus <agus@zed.dev>
Co-authored-by: Richard <richard@zed.dev>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>
Staff only for now. We'll work on making this usable for non zed.dev
users later
Release Notes:
- N/A
---------
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Danilo Leal <daniloleal09@gmail.com>
Co-authored-by: Marshall Bowers <git@maxdeviant.com>
This is a combination of the "read file" and "list directory contents"
tools as part of a push to reduce our quantity of builtin tools by
combining some of them.
The functionality is all there for this tool, although there's room for
improvement on the visuals side: it currently always shows the same icon
and always says "Read" - so you can't tell at a glance when it's reading
a directory vs an individual file. Changing this will require a change
to the `Tool` trait, which can be in a separate PR. (FYI @danilo-leal!)
<img width="606" alt="Screenshot 2025-04-14 at 11 56 27 PM"
src="https://github.com/user-attachments/assets/bded72af-6476-4469-97c6-2f344629b0e4"
/>
Release Notes:
- Added `contents` tool
This ensures that we respect the `LanguageModelToolSchemaFormat` value
when we call `tool.input_schema`. This prevents us from breaking Gemini
compatibility when adding/changing built-in tools. See #28634.
The test suite will now fail with an error message like this, when
providing an incompatible input_schema:
```
thread 'tests::test_tool_schema_compatibility' panicked at crates/assistant_tools/src/assistant_tools.rs:108:17:
Tool schema for `code_actions` is not compatible with `language_model::LanguageModelToolSchemaFormat::JsonSchemaSubset` (Gemini Models).
Are you using `schema::json_schema_for<T>(format)` to generate the schema?
```
Release Notes:
- N/A
Release Notes:
- agent: Replace `bash` tool with `terminal` tool which uses the current
shell
---------
Co-authored-by: Bennet <bennet@zed.dev>
Co-authored-by: Antonio <antonio@zed.dev>
Having a separate rename tool seems to make the agent more likely to use
it compared to having it be part of the code actions tool.
Release Notes:
- Added code action tool and rename tool.
This also increases the threshold for when we return an outline during
`read_file`.
Release Notes:
- Fixed an issue that caused the agent to fail reading large files if
the LSP hadn't started yet.
Lets you get all the code symbols in the project (like the Code Symbols
panel) or in a particular file (like the Outline panel), optionally
paginated and filtering results by regex. The tool gives the files,
lines, and numbers of all of these, which means they can be used in
conjunction with the read file tool to read subsets of large files
without having to open the entire large file and poke around in it.
<img width="621" alt="Screenshot 2025-03-29 at 12 00 21 PM"
src="https://github.com/user-attachments/assets/d78259d7-2746-44c0-ac18-2e21f2505c0a"
/>
Release Notes:
- N/A
<img width="620" alt="Screenshot 2025-03-27 at 2 29 13 PM"
src="https://github.com/user-attachments/assets/dd023507-61bc-4722-a095-f65f4b6c746a"
/>
We'll iterate on the UI, but first the goal is to just get it to work at
all so we can see if it's useful in terms of getting correct output
faster.
Release Notes:
- N/A
---------
Co-authored-by: Agus Zubiaga <hi@aguz.me>
@agu-z and paired on trying out a "one tool call per edit" approach for
editing files. (The previous approach is still available, it's just
unchecked by default for now.)
Release Notes:
- N/A
---------
Co-authored-by: Agus <agus@zed.dev>
Adds an `debug: edit tool` action that opens a new view which will help
us debug the edit tool internals. As the edit tool runs, the log
displays:
- Instructions provided by the main model
- Response stream from the editor model
- Parsed edit blocks
- Tool output provided back to main model
The log automatically records all edit tool interactions for staff, so
if you notice something weird, you can debug it retroactively without
having to open the debug tool first. We may want to limit the number of
recorded requests later.
I have a few more ideas for it, but this seems like a good starting
point.
https://github.com/user-attachments/assets/c61f5ce8-08b1-4500-accb-db2a480eb3ab
Release Notes:
- N/A