Commit graph

128 commits

Author SHA1 Message Date
Oleksiy Syvokon
ac007139ab
evals: Enable Python LSP (#29987)
We now have one eval that uses a Python repo


Release Notes:

- N/A
2025-05-06 10:28:59 +00:00
Cole Miller
c12e6376b8
Terminal tool improvements (#29924)
WIP

- On macOS/Linux, run the command in bash instead of the user's shell
- Try to prevent the agent from running commands that expect interaction

Release Notes:

- Agent Beta: Switched to using `bash` (if available) instead of the
user's shell when calling the terminal tool.
- Agent Beta: Prevented the agent from hanging when trying to run
interactive commands.

---------

Co-authored-by: WeetHet <stas.ale66@gmail.com>
2025-05-05 15:57:03 -04:00
Bennet Bo Fenner
9cb5ffac25
context_store: Refactor state management (#29910)
Because we instantiated `ContextServerManager` both in `agent` and
`assistant-context-editor`, and these two entities track the running MCP
servers separately, we were effectively running every MCP server twice.

This PR moves the `ContextServerManager` into the project crate (now
called `ContextServerStore`). The store can be accessed via a project
instance. This ensures that we only instantiate one `ContextServerStore`
per project.

Also, this PR adds a bunch of tests to ensure that the
`ContextServerStore` behaves correctly (Previously there were none).

Closes #28714
Closes #29530

Release Notes:

- N/A
2025-05-05 21:36:12 +02:00
Oleksiy Syvokon
8199664a5a
agent: Handle attempts to use hallucinated tools (#29946)
This change:

1. Catches attempts to use missing tools. If this happens, we now send
Agent a message listing available tools, after which Agent can
gracefully recover. Prior behavior: thread would stop in a broken state.

Example of a hallucinated call and a message we send back: 

![image](https://github.com/user-attachments/assets/92a8f700-b192-4038-8c7e-0a74ca2e0146)

2. Adds evals for hallucinated tool use and imagined edits
3. Adds ability to configure a profile name in evals.



Release Notes:

- N/A
2025-05-05 19:31:11 +00:00
Marshall Bowers
3db4744e18
agent: Remove unneeded tracking of request usage (#29894)
This PR removes some unneeded tracking of the model request usage in the
`ActiveThread` and `ThreadEvent::UsageUpdated` events.

Release Notes:

- N/A
2025-05-05 01:16:53 +00:00
Michael Sloan
bb82d9ca82
agent eval: Fix --model arg and add --provider (#29883)
Release Notes:

- N/A
2025-05-04 13:43:57 -06:00
Max Brunsfeld
c3d9cdecab
Change cloud language model provider JSON protocol to surface errors and usage information (#29830)
Release Notes:

- N/A

---------

Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Marshall Bowers <git@maxdeviant.com>
2025-05-04 17:37:42 +00:00
Antonio Scandurra
4d51602e7b
Encourage editing over re-creating a file from scratch (#29870)
I also introduced a new eval to prove the encouragement actually makes a
difference.

Release Notes:

- Improved agent behavior when streaming edits, encouraging it to
editing files as opposed to creating them from scratch
2025-05-04 13:18:28 +00:00
Agus Zubiaga
64316309aa
agent: Review edits in single-file editors (#29820)
Enables reviewing agent edits from single-file editors in addition to
the multibuffer experience we already had.


https://github.com/user-attachments/assets/a2c287f0-51d6-43a1-8537-821498b91983


This feature can be turned off by setting `assistant.single_file_review:
false`.

Release Notes:

- agent: Review edits in single-file editors
2025-05-02 17:57:16 -03:00
Max Brunsfeld
04772bf17d
Add support for queuing status updates in cloud language model provider (#29818)
This sets us up to display queue position information to the user, once
our language model backend is updated to support request queuing.

The JSON returned by the LLM backend will need to look like this:

```json
{"queue": {"status": "queued", "position": 1}}
{"queue": {"status": "started"}}
{"event": {"THE_UPSTREAM_MODEL_PROVIDER_EVENT": "..."}} 
```

Release Notes:

- N/A

---------

Co-authored-by: Marshall Bowers <git@maxdeviant.com>
2025-05-02 20:36:39 +00:00
Cole Miller
9547d42b15
Support @-mentions in inline assists and when editing old agent panel messages (#29734)
Closes #ISSUE

Co-authored-by: Bennet <bennet@zed.dev>

Release Notes:

- Added support for context `@mentions` in the inline prompt editor and
when editing past messages in the agent panel.

---------

Co-authored-by: Bennet Bo Fenner <bennet@zed.dev>
Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
2025-05-02 20:08:53 +00:00
Richard Feldman
9efc09c5a6
Add eval for open_tool (#29801)
Also have its description say it should only be used on request

Release Notes:

- N/A
2025-05-02 15:56:07 +00:00
Bennet Bo Fenner
24eb039752
context servers: Show configuration modal when extension is installed (#29309)
WIP

Release Notes:

- N/A

---------

Co-authored-by: Danilo Leal <67129314+danilo-leal@users.noreply.github.com>
Co-authored-by: Danilo Leal <daniloleal09@gmail.com>
Co-authored-by: Marshall Bowers <git@maxdeviant.com>
Co-authored-by: Cole Miller <m@cole-miller.net>
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>
2025-05-01 20:02:14 +02:00
Antonio Scandurra
f891dfb358
Introduce a new StreamingEditFileTool (#29733)
This pull request introduces a new tool for streaming edits. The
short-term goal is for this tool to replace the existing `EditFileTool`,
but we want to get this out the door as soon as possible so that we can
start testing it.

`StreamingEditFileTool` is mutually exclusive with `EditFileTool`. It
will be enabled by default for anyone who has the `agent-stream-edits`
feature flag, as well as people that set `assistant.stream_edits` to
`true` in their settings.

### Implementation

Streaming is achieved by requesting a completion while the `edit_file`
tool gets called. We invoke the model by taking the existing
conversation with the agent and appending a prompt specifically tailored
for editing. In that prompt, we ask the model to produce a stream of
`<old_text>`/`<new_text>` tags. As the model streams text in, we
incrementally parse it and start editing as soon as we can.

### Evals

Note that, as part of this pull request, I also defined some new evals
that I used to drive the behavior of the recursive LLM call. To run
them, use this command:

```bash
cargo test --package=assistant_tools --features eval -- eval_extract_handle_command_output
```

Or comment out the `#[cfg_attr(not(feature = "eval"), ignore)]` macro.

I recommend running them one at a time, because right now we don't
really have a way of orchestrating of all these evals. I think we should
invest into that effort once the new agent panel goes live.

Release Notes:

- N/A

---------

Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Bennet Bo Fenner <bennetbo@gmx.de>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>
2025-05-01 17:37:43 +02:00
Richard Feldman
afeb3d4fd9
Make eval more resilient to bad input from LLM (#29703)
I saw a slice panic (for begin > end) in a debug build of the eval. This
should just be a failed assertion, not a panic that takes out the whole
eval run!

Release Notes:

- N/A
2025-04-30 18:13:45 -04:00
Richard Feldman
04c68dc0cf
Make the default repetitions be 8, and concurrency 4 (#29576)
This is based on having observed that there is a lot of variation
between runs on `n=1` and `n=3`.

* With `n=8` two runs on the same branch give answers that seem close
enough to be reasonably consistent.
* With higher concurrency, trying to run this many repetitions seems to
lead language servers to time out a lot, causing evals to fail.

Release Notes:

- N/A
2025-04-30 15:21:19 -04:00
Richard Feldman
c8685dc90f
Fix eval judging missing final response (#29638)
Fixed issue where eval thread judges were not considering the last
response in the thread.

The problem was that they were getting the full list of messages from
`last_request`, which (being a request!) did not have the response yet.

Release Notes:

- N/A
2025-04-29 23:02:46 -04:00
Richard Feldman
d566864891
Make code block eval resilient to indentation (#29633)
This reduces spurious failures in the eval.

Release Notes:

- N/A
2025-04-30 02:13:13 +00:00
Richard Feldman
d7004030b3
Code block evals (#29619)
Add a targeted eval for code block formatting, and revise the system
prompt accordingly.

### Eval before, n=8

<img width="728" alt="eval before"
src="https://github.com/user-attachments/assets/552b6146-3d26-4eaa-86f9-9fc36c0cadf2"
/>

### Eval after prompt change, n=8 (excluding the new evals, so just
testing the prompt change)

<img width="717" alt="eval after"
src="https://github.com/user-attachments/assets/c78c7a54-4c65-470c-b135-8691584cd73e"
/>

Release Notes:

- N/A
2025-04-29 18:52:09 -04:00
Agus Zubiaga
fd17f2d8ae
agent: Enrich grep tool output with syntax information (#29601)
The `grep` tool used to include 4 lines of context around the match, but
the lines included would often be unhelpful. This PR improves this
behavior by using the range of the parent syntax node that contains the
full line(s) matched.

The match headers will also now include symbol breadcrumbs so that the
model can already gather code structure before/without reading files.

````md
### impl GitRepository for RealGitRepository › fn compare_checkpoints › L1278-1284
```rust
                let result = git
                    .run(&[
                        "diff-tree",
                        "--quiet",
                        &left.commit_sha.to_string(),
                        &right.commit_sha.to_string(),
                    ])
```
````

This positively impacts the `add_arg_to_trait_method` eval example with
better diff output, fewer tool failures, and reduced total turns.

Note: We have some plans to use a an "elision" approach where we would
combine all matches for a given file, skipping lines between them while
keeping symbol declaration lines. The theory is that this would be map
more closely to the expected input for edits. For now, this PR is a
significant improvement.

Release Notes:

- Agent: Enrich `grep` tool output with syntax information
2025-04-29 17:03:02 +00:00
Danilo Leal
bbe8d6a654
agent: Cancel pending in-edit user message upon new message submit (#29565)
Previously, if you clicked on a user message to edit it, and then, while
the user message has the editor pending, sent a new message via the
textarea, the whole thread would be grayed out because we hadn't
dismissed the to-be-edited pending user message. That's now fixed.

Release Notes:

- agent: Fixed a bug that would make the whole thread be grayed out upon
sending a new message while a user message had a pending edit.
2025-04-28 18:51:41 -03:00
Marshall Bowers
ce93961fe0
agent: Add "max mode" toggle (#29549)
This PR adds a "max mode" toggle to the Agent panel, for models that
support it.

Only visible to folks in the `new-billing` feature flag.

Icon is just a placeholder.

Release Notes:

- N/A
2025-04-28 16:50:47 +00:00
Finn Evers
3a1bd38503
reqwest_client: Only register proxies with valid proxy URIs (#27773)
Closes #27641

This PR fixes invalid proxy URIs being registered despite the URI not
being a valid proxy URI.

Whilst investigating #27641 , I noticed that currently any proxy URI
passed to `RequestClient::proxy_and_user_agent` will be assigned to the
created client, even if the URI is not a valid proxy URI. Given a test
as an example:

We create an URI here and pass it as a proxy to
`ReqwestClient::proxy_and_user_agent`:

https://github.com/zed-industries/zed/blob/main/crates/reqwest_client/src/reqwest_client.rs#L272-L273

In `ReqwestClient::proxy_and_user_agent`we take the proxy parameter here

9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L46)

and set it unconditionally here:

9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L62)

, not considering at all whether the proxy was successfully created
above. Concluding, we currently do not actually check whether a proxy
was successfully created, but rather whether an URI is equal to itself,
which trivially holds. The existing test for a malformed proxy URI


9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L293-L297)

does not check whether invalid proxies cause an error, but rather checks
whether `http::Uri::from_static` panics on an invalid URI, [which it
does as
documented](https://docs.rs/http/latest/http/uri/struct.Uri.html#panics).
Thus, the tests currently do not really check anything proxy-related and
invalid proxies are assigned as valid proxies.

---

This PR fixes the behaviour by considering whether the proxy was
actually properly parsed and only assigning it if that is the case.
Furthermore, it improves logging in case of errors so issues like the
linked one are easier to debug (for the linked issue, the log will now
include that the proxy schema is not supported in the logs).
Lastly, it also updates the test for a malformed proxy URI. The test now
actually checks that malformed proxy URIs are not registered for the
client rather than testing the `http` crate.

The update also initially caused the [test for a `socks4a`
proxy](9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L280C1-L282C50))
to fail. This happened because the reqwest-library introduced supports
for `socks4a` proxies in [version
0.12.13](https://github.com/seanmonstar/reqwest/blob/master/CHANGELOG.md#v01213).
Thus, this PR includes a bump of the reqwest library to add proper
support for socks4a proxies.

Release Notes:

- Added support for socks4a proxies.

---------

Co-authored-by: Peter Tripp <peter@zed.dev>
2025-04-28 11:12:16 -04:00
tidely
f060918b57
zed: Remove unnecessary clones (#29513)
`App::http_client` and `Client::http_client` both return an owned `Arc`
which it clones internally. This means we can remove unnecessary clones
when calling these methods.

Release Notes:

- N/A
2025-04-27 19:23:37 -07:00
Michael Sloan
609c528ceb
Refactor markdown formatting utilities to avoid building intermediate strings (#29511)
These were nearly always used when using `format!` / `write!` etc, so it
makes sense to not have an intermediate `String`.

Release Notes:

- N/A
2025-04-27 19:04:51 +00:00
Marshall Bowers
b28756ae3f
eval: Use workspace dependencies (#29430)
This PR updates the `eval` crate to use workspace dependencies.

Also did a bit of cleanup of the `Cargo.toml`.

Release Notes:

- N/A
2025-04-25 16:11:26 +00:00
Marshall Bowers
a5405fcbd7
eval: Add support for reading from a .env file (#29426)
This PR adds support for the eval to read environment variables from a
`.env` file located in the `crates/eval` directory.

For instance, you can use it to set your Anthropic API key:

```
ANTHROPIC_API_KEY=<secret>
```

Release Notes:

- N/A
2025-04-25 15:53:02 +00:00
Oleksiy Syvokon
3389327df5
eval: Add HTML overview for evaluation runs (#29413)
This update generates a single self-contained .html file that shows an
overview of evaluation threads in the browser. It's useful for:

- Quickly reviewing results
- Sharing evaluation runs
- Debugging
- Comparing models (TBD)

Features:

- Export thread JSON from the UI
- Keyboard navigation (j/k or Ctrl + ←/→)
- Toggle between compact and full views

Generating the overview:

- `cargo run -p eval` will write this file in the run dir's root.
- Or you can call `cargo run -p eval --bin explorer` to generate it
without running evals.


Screenshot:

![image](https://github.com/user-attachments/assets/4ead71f6-da08-48ea-8fcb-2148d2e4b4db)


Release Notes:

- N/A
2025-04-25 17:49:05 +03:00
Michael Sloan
17ecf94f6f
Restructure agent context (#29233)
Simplifies the data structures involved in agent context by removing
caching and limiting the use of ContextId:

* `AssistantContext` enum is now like an ID / handle to context that
does not need to be updated. `ContextId` still exists but is only used
for generating unique `ElementId`.
* `ContextStore` has a `IndexMap<ContextSetEntry>`. Only need to keep a
`HashSet<ThreadId>` consistent with it. `ContextSetEntry` is a newtype
wrapper around `AssistantContext` which implements eq / hash on a subset
of fields.
* Thread `Message` directly stores its context.

Fixes the following bugs:

* If a context entry is removed from the strip and added again, it was
reincluded in the next message.
* Clicking file context in the thread that has been removed from the
context strip didn't jump to the file.
* Refresh of directory context didn't reflect added / removed files.
* Deleted directories would remain in the message editor context strip.
* Token counting requests didn't include image context.
* File, directory, and symbol context deduplication relied on
`ProjectPath` for identity, and so didn't handle renames.
* Symbol context line numbers didn't update when shifted

Known bugs (not fixed):

* Deleting a directory causes it to disappear from messages in threads.
Fixing this in a nice way is tricky. One easy fix is to store the
original path and show that on deletion. It's weird that deletion would
cause the name to "revert", though. Another possibility would be to
snapshot context metadata on add (ala `AddedContext`), and keep that
around despite deletion.

Release Notes:

- N/A
2025-04-24 21:29:33 +00:00
Richard Feldman
720dfee803
Treat invalid JSON in tool calls as failed tool calls (#29375)
Release Notes:

- N/A

---------

Co-authored-by: Max <max@zed.dev>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
2025-04-24 16:54:27 -04:00
Max Brunsfeld
f125353b6f
Add tree-sitter example to the eval (#29321)
Interesting things about this example:
* It's a useful, non-trivial change I made with the agent in Tree-sitter
* It runs fast
* It frequently showcases edit file errors
* It occasionally completely errors out due to errors parsing tool call
input JSON

Release Notes:

- N/A
2025-04-23 18:46:38 -07:00
Agus Zubiaga
8b5835de17
agent: Improve initial file search quality (#29317)
This PR significantly improves the quality of the initial file search
that occurs when the model doesn't yet know the full path to a file it
needs to read/edit.

Previously, the assertions in file_search often failed on main as the
model attempted to guess full file paths. On this branch, it reliably
calls `find_path` (previously `path_search`) before reading files.

After getting the model to find paths first, I noticed it would try
using `grep` instead of `path_search`. This motivated renaming
`path_search` to `find_path` (continuing the analogy to unix commands)
and adding system prompt instructions about proper tool selection.

Note: I know the command is just called `find`, but that seemed too
general.

In my eval runs, the `file_search` example improved from 40% ± 10% to
98% ± 2%. The only assertion I'm seeing occasionally fail is "glob
starts with `**` or project". We can probably add some instructions in
that regard.

Release Notes:

- N/A
2025-04-23 21:24:41 -03:00
Agus Zubiaga
45d3f5168a
eval: New add_arg_to_trait_method example (#29297)
Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
2025-04-23 18:46:39 +00:00
Danilo Leal
8366cd0b52
agent: Render diffs for the edit file tool (#29234)
This PR implements the `ToolCard` for the edit file tool, which allow us
to display an editor with a diff in the thread view with the changes
performed by the model.

- [x] Fix buffer sometimes displaying empty
- [x] Stop buffer from scrolling together with the thread
- [x] Fix multibuffer header sometimes appearing
- [x] Fix buffer height issue
- [x] Implement "full height" expand button
- [x] Add "Jump To File" functionality
- [x] Polish and refine styles

Release Notes:

- agent: Added diff preview cards in the thread view for edits performed
by the agent.

---------

Co-authored-by: João Marcos <marcospb19@hotmail.com>
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Conrad Irwin <conrad.irwin@gmail.com>
2025-04-23 15:43:33 -03:00
Oleksiy Syvokon
f69aeb6311
Do not log unfinished tools use that are in the middle of streaming (#29275)
Release Notes:

- N/A
2025-04-23 13:19:01 +00:00
Oleksiy Syvokon
76a78b550b
eval: Write JSON-serialized thread (#29271)
This adds `last.message.json` file that contains the full request plus
response (serialized as a message from assistant for consistency with
other messages).

Motivation: to capture more info and to make analysis of finished runs
easier.

Release Notes:

- N/A
2025-04-23 15:22:19 +03:00
Agus Zubiaga
ce1a674eba
eval: Fine-grained assertions (#29246)
- Support programmatic examples
([example](17feb260a0/crates/eval/src/examples/file_search.rs))
- Combine data-driven example declarations into a single `.toml` file
([example](17feb260a0/crates/eval/src/examples/find_and_replace_diff_card.toml))
- Run judge on individual assertions (previously called "criteria")
- Report judge and programmatic assertions in one combined table

Note: We still need to work on concept naming 

<img width=400
src="https://github.com/user-attachments/assets/fc719c93-467f-412b-8d47-68821bd8a5f5">

Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
2025-04-22 23:58:58 -03:00
Max Brunsfeld
36d02de784
Rework eval to support interpretable scores and efficient repetitions (#29197)
### Problem

We want to start continuously tracking our progress on agent evals over
time. As part of this, we'd like the *score* to have a clear,
interpretable meaning. Right now, it's a number from 0 to 5, but it's
not clear what any particular number works. In addition, scores vary
widely from run to run, because the agent's output is deterministic. We
try to stabilize the score using a panel of judges, but the behavior of
the agent itself varies much more widely than the judges' scores for a
given run.

### Solution

* **explicit meanings of scores** - In this PR, we're prescribing the
diff and thread criteria files so that they *must* be unordered lists of
assertions. For both the thread and the diff, rather than providing an
abstract score, the judge's task is simply to count how many of these
assertions are satisfied. A percentage score can be derived from this
number, divided by the total number of assertions.
* **repetitions** - Rather than running each example once, and judging
it N times, we'll **run** the example N times. Right now, I'm just
judging the output once per run, because I believe that with these more
clear scoring criteria, the main source of non-determinism will be the
*agent's* behavior, not the judge's

### Questions

* **accounting for diagnostic errors** - Previously, the judge was asked
to incorporate diagnostics into their abstract scores. Now that the
"score" is determined directly from the criteria, the diagnostic will
not be captured in the score. How should the diagnostics be accounted
for in the eval? One thought is - let's simply count and report the
number of errors remaining after the agent finishes, as a separate field
of the run (along with diff score and thread score). We could consider
normalizing it using the total lines of added code (like errors per 100
lines of code added) in order to give it some semblance of stability
between examples.

* **repetitions** - How many repetitions should we run on CI? Each
repetition takes significant time, but I think running more than one
repetition will make the scores significantly less volatile.

### Todo

* [x] Fix `--concurrency` implementation so that only N tasks are
spawned
* [x] Support `--repetitions` efficiently (re-using the same worktree)
* [x] Restructure judge prompts to count passing criteria, not compute
abstract score
* [x] Report total number of diagnostics in some way
* [x] Format output nicely

Release Notes:

- N/A *or* Added/Fixed/Improved ...

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
2025-04-22 14:00:09 +00:00
Nathan Sobo
458ffaa134
Add new action to run agent eval (#29158)
The old one wasn't linking, and
https://github.com/zed-industries/zed/pull/29081 has a bunch of merge
conflicts. Wanted to start simple/small.

## Todo

* [x] Remove low-signal examples
* [x] Make the eval run on a cron, on main, and on any PR with the
`run-eval` label
* [x] Noise in logs about failure to write settings
    ```
[2025-04-21T20:45:04Z ERROR settings] Failed to write settings to file
"/home/runner/.config/zed/settings.json"
    
       Caused by:
No such file or directory (os error 2) at path
"/home/runner/.config/zed/.tmpLewFEs"
    ```
* [x] `Agentic loop stalled`
(https://github.com/zed-industries/zed/actions/runs/14581044243/job/40897622894)
* [x] Make sure that events are recorded in snowflake
* [ ] Change judge criteria to be more explicit about meanings of scores

Release Notes:

- N/A

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
2025-04-21 21:30:21 -07:00
Michael Sloan
70c51b513b
agent eval: Default to also running typescript examples (#29185)
Release Notes:

- N/A
2025-04-21 23:59:35 +00:00
Michael Sloan
9249919b7a
Write {result_count}.diff and last.diff eval run outputs (#29181)
These are only written when the diff has changed. `patch.diff` has been
removed as its redundant with `last.diff`.

It can be convenient to open `last.diff` and use undo/redo to navigate
its history.

Release Notes:

- N/A
2025-04-21 23:19:07 +00:00
Richard Feldman
4f2f9ff762
Streaming tool calls (#29179)
https://github.com/user-attachments/assets/7854a737-ef83-414c-b397-45122e4f32e8



Release Notes:

- Create file and edit file tools now stream their tool descriptions, so
you can see what they're doing sooner.

---------

Co-authored-by: Marshall Bowers <git@maxdeviant.com>
2025-04-21 22:28:32 +00:00
Thomas Mickley-Doyle
733cd6b68c
agent: Remove non-rust examples from evals (#29139)
Release Notes:

- N/A
2025-04-21 12:55:24 -07:00
Michael Sloan
0f3ac38332
Agent eval: Copy .rules file into eval worktree for examples based on Zed (#29116)
Also reverts #29108, which cherry-picked the rules file for an eval
example.

Release Notes:

- N/A
2025-04-21 12:02:44 -06:00
Conrad Irwin
9d35f0389d
debugger: More tidy up for SSH (#28993)
Split `locator` out of DebugTaskDefinition to make it clearer when
location needs to happen.

Release Notes:

- N/A

---------

Co-authored-by: Anthony Eid <hello@anthonyeid.me>
Co-authored-by: Anthony <anthony@zed.dev>
Co-authored-by: Cole Miller <m@cole-miller.net>
2025-04-21 16:00:03 +00:00
Antonio Scandurra
97ab0980d1
Start tracking tool failure rates in eval (#29122)
This pull request will print all the used tools and their failure rates.
The objective goal should be to minimize that failure rate.

@tmickleydoyle: this also changes the telemetry event to report
`tool_metrics` as opposed to `tool_use_counts`. Ideally I'd love to be
able to plot failure rates by tool and hopefully see that percentage go
down. Can we do that with the data we're tracking with this pull
request?

Release Notes:

- N/A
2025-04-21 16:16:43 +02:00
Agus Zubiaga
ceeae790b7
eval: Improve lang server idle detection (#29135)
Brings back #29013 after it was accidentally reverted by
https://github.com/zed-industries/zed/pull/28961/commits/e9bb15b9063615762c866c30aaf646acb12af1f3.

Release Notes:

- N/A
2025-04-21 00:17:28 +00:00
Nathan Sobo
107d8ca483
Rename regex search tool to grep and accept an include glob pattern (#29100)
This PR renames the `regex_search` tool to `grep` because I think it
conveys more meaning to the model, the idea of searching the filesystem
with a regular expression. It's also one word and the model seems to be
using it effectively after some additional prompt tuning.

It also takes an include pattern to filter on the specific files we try
to search. I'd like to encourage the model to scope its searches more
aggressively, as in my testing, I'm only seeing it filter on file
extension.

Release Notes:

- N/A
2025-04-20 00:53:30 +00:00
Michael Sloan
a91948aeb4
agent: Reorder some linux keybindings to match mac keybindings (#29107)
Release Notes:

- Made keybindings for agent panel closer to the precedence order used
on Mac. This fixes use of `enter` to add context from the menu triggered
by `@` referencing.
2025-04-20 00:01:43 +00:00
Michael Sloan
fbf7caf93e
Default to fast model for thread summaries and titles + don't include system prompt / context / thinking segments (#29102)
* Adds a fast / cheaper model to providers and defaults thread
summarization to this model. Initial motivation for this was that
https://github.com/zed-industries/zed/pull/29099 would cause these
requests to fail when used with a thinking model. It doesn't seem
correct to use a thinking model for summarization.

* Skips system prompt, context, and thinking segments.

* If tool use is happening, allows 2 tool uses + one more agent response
before summarizing.

Downside of this is that there was potential for some prefix cache reuse
before, especially for title summarization (thread summarization omitted
tool results and so would not share a prefix for those). This seems fine
as these requests should typically be fairly small. Even for full thread
summarization, skipping all tool use / context should greatly reduce the
token use.

Release Notes:

- N/A
2025-04-19 23:26:29 +00:00