Commit graph

28 commits

Author SHA1 Message Date
Marshall Bowers
7be1f2418d
Replace zed_llm_client with cloud_llm_client (#35309)
This PR replaces the usage of the `zed_llm_client` with the
`cloud_llm_client`.

It was ported into this repo in #35307.

Release Notes:

- N/A
2025-07-30 00:09:14 +00:00
Julia Ryan
0068de0386
debugger: Handle the envFile setting for Go (#33666)
Fixes #32984

Release Notes:

- The Go debugger now respects the `envFile` setting.
2025-07-01 09:14:59 -07:00
Max Brunsfeld
2283ec5de2
Extract an agent_ui crate from agent (#33284)
This PR moves the UI-dependent logic in the `agent` crate into its own
crate, `agent_ui`. The remaining `agent` crate no longer depends on
`editor`, `picker`, `ui`, `workspace`, etc.

This has compile time benefits, but the main motivation is to isolate
our core agentic logic, so that we can make agents more
pluggable/configurable.

Release Notes:

- N/A
2025-06-23 18:00:28 -07:00
Antonio Scandurra
019a14bcde
Replace async-watch with a custom watch (#32245)
The `async-watch` crate doesn't seem to be maintained and we noticed
several panics coming from it, such as:

```
[bug] failed to observe change after notificaton.
zed::reliability::init_panic_hook::{{closure}}::hea8cdcb6299fad6b+154543526
std::panicking::rust_panic_with_hook::h33b18b24045abff4+127578547
std::panicking::begin_panic_handler::{{closure}}::hf8313cc2fd0126bc+127577770
std::sys::backtrace::__rust_end_short_backtrace::h57fe07c8aea5c98a+127571385
__rustc[95feac21a9532783]::rust_begin_unwind+127576909
core::panicking::panic_fmt::hd54fb667be51beea+9433328
core::option::expect_failed::h8456634a3dada3e4+9433291
assistant_tools::edit_agent::EditAgent::apply_edit_chunks::{{closure}}::habe2e1a32b267fd4+26921553
gpui::app::async_context::AsyncApp::spawn::{{closure}}::h12f5f25757f572ea+25923441
async_task::raw::RawTask<F,T,S,M>::run::h3cca0d402690ccba+25186815
<gpui::platform::linux::x11::client::X11Client as gpui::platform::linux::platform::LinuxClient>::run::h26264aefbcfbc14b+73961666
gpui::platform::linux::platform::<impl gpui::platform::Platform for P>::run::hb12dcd4abad715b5+73562509
gpui::app::Application::run::h0f936a5f855a3f9f+150676820
zed::main::ha17f9a25fe257d35+154788471
std::sys::backtrace::__rust_begin_short_backtrace::h1edd02429370b2bd+154624579
std::rt::lang_start::{{closure}}::h3d2e300f10059b0a+154264777
std::rt::lang_start_internal::h418648f91f5be3a1+127502049
main+154806636
__libc_start_main+46051972301573
_start+12358494
```

I didn't find an executor-agnostic watch crate that was well maintained
(we already tried postage and async-watch), so decided to implement it
our own version.

Release Notes:

- Fixed a panic that could sometimes occur when the agent performed
edits.
2025-06-06 16:00:09 +00:00
Marshall Bowers
a23ee61a4b
Pass up intent with completion requests (#31710)
This PR adds a new `intent` field to completion requests to assist in
categorizing them correctly.

Release Notes:

- N/A

---------

Co-authored-by: Ben Brandt <benjamin.j.brandt@gmail.com>
2025-05-29 20:43:12 +00:00
Marshall Bowers
8faeb34367
Rename assistant_settings to agent_settings (#31513)
This PR renames the `assistant_settings` crate to `agent_settings`, as
well a number of constructs within it.

Release Notes:

- N/A
2025-05-27 15:16:55 +00:00
Piotr Osiewicz
77dadfedfe
chore: Make terminal_view own the TerminalSlashCommand (#31070)
This reduces 'touch crates/editor/src/editor.rs && cargo +nightly build'
from 8.9s to 8.5s. That same scenario used to take 8s less than a week
ago. :)
I'm measuring with nightly rustc, because it's compile times are better
than those of stable thanks to
https://github.com/rust-lang/rust/pull/138522

main (8.2s total):

![image](https://github.com/user-attachments/assets/767a2ac4-7bba-4147-bd16-9b09eed5b433)

[cargo-timing.html.zip](https://github.com/user-attachments/files/20364175/cargo-timing.html.zip)

#22be776 (7.5s total):

[cargo-timing-20250521T085303.892834Z.html.zip](https://github.com/user-attachments/files/20364391/cargo-timing-20250521T085303.892834Z.html.zip)

![image](https://github.com/user-attachments/assets/c4476df9-cb6e-4403-b0db-de00521f1fd0)


Release Notes:

- N/A
2025-05-21 09:27:54 +00:00
Piotr Osiewicz
a092e2dc03
extension: Add debug_adapters to extension manifest (#30676)
Also pass worktree to the get_dap_binary.

Release Notes:

- N/A
2025-05-20 11:01:33 +02:00
Bennet Bo Fenner
9cb5ffac25
context_store: Refactor state management (#29910)
Because we instantiated `ContextServerManager` both in `agent` and
`assistant-context-editor`, and these two entities track the running MCP
servers separately, we were effectively running every MCP server twice.

This PR moves the `ContextServerManager` into the project crate (now
called `ContextServerStore`). The store can be accessed via a project
instance. This ensures that we only instantiate one `ContextServerStore`
per project.

Also, this PR adds a bunch of tests to ensure that the
`ContextServerStore` behaves correctly (Previously there were none).

Closes #28714
Closes #29530

Release Notes:

- N/A
2025-05-05 21:36:12 +02:00
Oleksiy Syvokon
8199664a5a
agent: Handle attempts to use hallucinated tools (#29946)
This change:

1. Catches attempts to use missing tools. If this happens, we now send
Agent a message listing available tools, after which Agent can
gracefully recover. Prior behavior: thread would stop in a broken state.

Example of a hallucinated call and a message we send back: 

![image](https://github.com/user-attachments/assets/92a8f700-b192-4038-8c7e-0a74ca2e0146)

2. Adds evals for hallucinated tool use and imagined edits
3. Adds ability to configure a profile name in evals.



Release Notes:

- N/A
2025-05-05 19:31:11 +00:00
Antonio Scandurra
f891dfb358
Introduce a new StreamingEditFileTool (#29733)
This pull request introduces a new tool for streaming edits. The
short-term goal is for this tool to replace the existing `EditFileTool`,
but we want to get this out the door as soon as possible so that we can
start testing it.

`StreamingEditFileTool` is mutually exclusive with `EditFileTool`. It
will be enabled by default for anyone who has the `agent-stream-edits`
feature flag, as well as people that set `assistant.stream_edits` to
`true` in their settings.

### Implementation

Streaming is achieved by requesting a completion while the `edit_file`
tool gets called. We invoke the model by taking the existing
conversation with the agent and appending a prompt specifically tailored
for editing. In that prompt, we ask the model to produce a stream of
`<old_text>`/`<new_text>` tags. As the model streams text in, we
incrementally parse it and start editing as soon as we can.

### Evals

Note that, as part of this pull request, I also defined some new evals
that I used to drive the behavior of the recursive LLM call. To run
them, use this command:

```bash
cargo test --package=assistant_tools --features eval -- eval_extract_handle_command_output
```

Or comment out the `#[cfg_attr(not(feature = "eval"), ignore)]` macro.

I recommend running them one at a time, because right now we don't
really have a way of orchestrating of all these evals. I think we should
invest into that effort once the new agent panel goes live.

Release Notes:

- N/A

---------

Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Bennet Bo Fenner <bennetbo@gmx.de>
Co-authored-by: Oleksiy Syvokon <oleksiy.syvokon@gmail.com>
2025-05-01 17:37:43 +02:00
Richard Feldman
d7004030b3
Code block evals (#29619)
Add a targeted eval for code block formatting, and revise the system
prompt accordingly.

### Eval before, n=8

<img width="728" alt="eval before"
src="https://github.com/user-attachments/assets/552b6146-3d26-4eaa-86f9-9fc36c0cadf2"
/>

### Eval after prompt change, n=8 (excluding the new evals, so just
testing the prompt change)

<img width="717" alt="eval after"
src="https://github.com/user-attachments/assets/c78c7a54-4c65-470c-b135-8691584cd73e"
/>

Release Notes:

- N/A
2025-04-29 18:52:09 -04:00
Marshall Bowers
b28756ae3f
eval: Use workspace dependencies (#29430)
This PR updates the `eval` crate to use workspace dependencies.

Also did a bit of cleanup of the `Cargo.toml`.

Release Notes:

- N/A
2025-04-25 16:11:26 +00:00
Marshall Bowers
a5405fcbd7
eval: Add support for reading from a .env file (#29426)
This PR adds support for the eval to read environment variables from a
`.env` file located in the `crates/eval` directory.

For instance, you can use it to set your Anthropic API key:

```
ANTHROPIC_API_KEY=<secret>
```

Release Notes:

- N/A
2025-04-25 15:53:02 +00:00
Oleksiy Syvokon
3389327df5
eval: Add HTML overview for evaluation runs (#29413)
This update generates a single self-contained .html file that shows an
overview of evaluation threads in the browser. It's useful for:

- Quickly reviewing results
- Sharing evaluation runs
- Debugging
- Comparing models (TBD)

Features:

- Export thread JSON from the UI
- Keyboard navigation (j/k or Ctrl + ←/→)
- Toggle between compact and full views

Generating the overview:

- `cargo run -p eval` will write this file in the run dir's root.
- Or you can call `cargo run -p eval --bin explorer` to generate it
without running evals.


Screenshot:

![image](https://github.com/user-attachments/assets/4ead71f6-da08-48ea-8fcb-2148d2e4b4db)


Release Notes:

- N/A
2025-04-25 17:49:05 +03:00
Agus Zubiaga
45d3f5168a
eval: New add_arg_to_trait_method example (#29297)
Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
2025-04-23 18:46:39 +00:00
Agus Zubiaga
ce1a674eba
eval: Fine-grained assertions (#29246)
- Support programmatic examples
([example](17feb260a0/crates/eval/src/examples/file_search.rs))
- Combine data-driven example declarations into a single `.toml` file
([example](17feb260a0/crates/eval/src/examples/find_and_replace_diff_card.toml))
- Run judge on individual assertions (previously called "criteria")
- Report judge and programmatic assertions in one combined table

Note: We still need to work on concept naming 

<img width=400
src="https://github.com/user-attachments/assets/fc719c93-467f-412b-8d47-68821bd8a5f5">

Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
2025-04-22 23:58:58 -03:00
Max Brunsfeld
36d02de784
Rework eval to support interpretable scores and efficient repetitions (#29197)
### Problem

We want to start continuously tracking our progress on agent evals over
time. As part of this, we'd like the *score* to have a clear,
interpretable meaning. Right now, it's a number from 0 to 5, but it's
not clear what any particular number works. In addition, scores vary
widely from run to run, because the agent's output is deterministic. We
try to stabilize the score using a panel of judges, but the behavior of
the agent itself varies much more widely than the judges' scores for a
given run.

### Solution

* **explicit meanings of scores** - In this PR, we're prescribing the
diff and thread criteria files so that they *must* be unordered lists of
assertions. For both the thread and the diff, rather than providing an
abstract score, the judge's task is simply to count how many of these
assertions are satisfied. A percentage score can be derived from this
number, divided by the total number of assertions.
* **repetitions** - Rather than running each example once, and judging
it N times, we'll **run** the example N times. Right now, I'm just
judging the output once per run, because I believe that with these more
clear scoring criteria, the main source of non-determinism will be the
*agent's* behavior, not the judge's

### Questions

* **accounting for diagnostic errors** - Previously, the judge was asked
to incorporate diagnostics into their abstract scores. Now that the
"score" is determined directly from the criteria, the diagnostic will
not be captured in the score. How should the diagnostics be accounted
for in the eval? One thought is - let's simply count and report the
number of errors remaining after the agent finishes, as a separate field
of the run (along with diff score and thread score). We could consider
normalizing it using the total lines of added code (like errors per 100
lines of code added) in order to give it some semblance of stability
between examples.

* **repetitions** - How many repetitions should we run on CI? Each
repetition takes significant time, but I think running more than one
repetition will make the scores significantly less volatile.

### Todo

* [x] Fix `--concurrency` implementation so that only N tasks are
spawned
* [x] Support `--repetitions` efficiently (re-using the same worktree)
* [x] Restructure judge prompts to count passing criteria, not compute
abstract score
* [x] Report total number of diagnostics in some way
* [x] Format output nicely

Release Notes:

- N/A *or* Added/Fixed/Improved ...

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
2025-04-22 14:00:09 +00:00
Michael Sloan
9249919b7a
Write {result_count}.diff and last.diff eval run outputs (#29181)
These are only written when the diff has changed. `patch.diff` has been
removed as its redundant with `last.diff`.

It can be convenient to open `last.diff` and use undo/redo to navigate
its history.

Release Notes:

- N/A
2025-04-21 23:19:07 +00:00
Conrad Irwin
9d35f0389d
debugger: More tidy up for SSH (#28993)
Split `locator` out of DebugTaskDefinition to make it clearer when
location needs to happen.

Release Notes:

- N/A

---------

Co-authored-by: Anthony Eid <hello@anthonyeid.me>
Co-authored-by: Anthony <anthony@zed.dev>
Co-authored-by: Cole Miller <m@cole-miller.net>
2025-04-21 16:00:03 +00:00
Michael Sloan
98ceffe026
Pretty tool inputs in eval output markdown + numbered assistant messages (#29082)
Release Notes:

- N/A
2025-04-19 06:59:22 +00:00
Nathan Sobo
bab28560ef
Systematically optimize agentic editing performance (#28961)
Now that we've established a proper eval in tree, this PR is reboots of
our agent loop back to a set of minimal tools and simpler prompts. We
should aim to get this branch feeling subjectively competitive with
what's on main and then merge it, and build from there.

Let's invest in our eval and use it to drive better performance of the
agent loop. How you can help: Pick an example, and then make the outcome
faster or better. It's fine to even use your own subjective judgment, as
our evaluation criteria likely need tuning as well at this point. Focus
on making the agent work better in your own subjective experience first.
Let's focus on simple/practical improvements to make this thing work
better, then determine how we can craft our judgment criteria to lock
those improvements in.

Release Notes:

- N/A

---------

Co-authored-by: Max <max@zed.dev>
Co-authored-by: Antonio <antonio@zed.dev>
Co-authored-by: Agus <agus@zed.dev>
Co-authored-by: Richard <richard@zed.dev>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>
2025-04-19 02:47:59 +00:00
Thomas Mickley-Doyle
222d4a2546
agent: Add telemetry for eval runs (#28816)
Release Notes:

- N/A

---------

Co-authored-by: Joseph <joseph@zed.dev>
2025-04-16 02:54:26 +00:00
Agus Zubiaga
ff4334efc7
eval: Fix stalling on tool confirmation (#28786)
The `always_allow_tool_actions` setting would get overridden with the
default when we loaded each example project, leading to examples
stalling when they run a tool that needed confirmation. There's now a
separate `runner_settings.json` file where we can configure the
environment for the eval.

Release Notes:

- N/A

---------

Co-authored-by: Oleksiy <oleksiy@zed.dev>
2025-04-15 16:53:45 +00:00
Michael Sloan
c8ccc472b5
Track tool use counts (#28722)
Release Notes:

- N/A
2025-04-14 21:45:36 +00:00
Michael Sloan
6b80eb556c
Add judge to new eval + provide LSP diagnostics (#28713)
Release Notes:

- N/A

---------

Co-authored-by: Antonio Scandurra <antonio@zed.dev>
Co-authored-by: agus <agus@zed.dev>
2025-04-14 20:18:47 +00:00
Antonio Scandurra
2440faf4b2
Actually run the eval and fix a hang when retrieving outline (#28547)
Release Notes:

- Fixed a regression that caused the agent to hang sometimes.

---------

Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Michael Sloan <mgsloan@gmail.com>
2025-04-11 00:01:33 +00:00
Antonio Scandurra
8ac378b86e
Lay the groundwork for a Rust-based eval (#28488)
Also, we moved the logic for driving the agentic loop into `Thread` so
that we don't have to re-implement it.

Release Notes:

- N/A

---------

Co-authored-by: Nathan Sobo <nathan@zed.dev>
2025-04-10 04:45:27 +00:00