Commit graph

15 commits

Author SHA1 Message Date
Oleksiy Syvokon
255d8f7cf8
agent: Overwrite files more cautiously (#30649)
1. The `edit_file` tool tended to use `create_or_overwrite` a bit too
often, leading to corruption of long files. This change replaces the
boolean flag with an `EditFileMode` enum, which helps Agent make a more
deliberate choice when overwriting files.

With this change, the pass rate of the new eval increased from 10% to
100%.

2. eval: Added ability to run eval on top of an existing thread. Threads
can now be loaded from JSON files in the `SerializedThread` format,
which makes it easy to use real threads as starting points for
tests/evals.

3. Don't try to restore tool cards when running in headless or eval mode
-- we don't have a window to properly do this.

Release Notes:

- N/A
2025-05-14 10:40:44 +03:00
Richard Feldman
49887d6934
Add no_tools_enabled eval (#30537)
This is our first eval of the Minimal tool profile. Right now they're
all passing; the value of having it is to catch regressions in the
system prompt (which has special logic in it for the case where no tools
are enabled).

Release Notes:

- N/A
2025-05-12 08:52:03 +00:00
Antonio Scandurra
89430a019c
Fix agent reading and editing files over SSH (#30144)
Release Notes:

- Fixed a bug that would prevent the agent from working over SSH.

---------

Co-authored-by: Nathan Sobo <nathan@zed.dev>
Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Cole Miller <m@cole-miller.net>
2025-05-07 17:07:01 +00:00
Cole Miller
c12e6376b8
Terminal tool improvements (#29924)
WIP

- On macOS/Linux, run the command in bash instead of the user's shell
- Try to prevent the agent from running commands that expect interaction

Release Notes:

- Agent Beta: Switched to using `bash` (if available) instead of the
user's shell when calling the terminal tool.
- Agent Beta: Prevented the agent from hanging when trying to run
interactive commands.

---------

Co-authored-by: WeetHet <stas.ale66@gmail.com>
2025-05-05 15:57:03 -04:00
Oleksiy Syvokon
8199664a5a
agent: Handle attempts to use hallucinated tools (#29946)
This change:

1. Catches attempts to use missing tools. If this happens, we now send
Agent a message listing available tools, after which Agent can
gracefully recover. Prior behavior: thread would stop in a broken state.

Example of a hallucinated call and a message we send back: 

![image](https://github.com/user-attachments/assets/92a8f700-b192-4038-8c7e-0a74ca2e0146)

2. Adds evals for hallucinated tool use and imagined edits
3. Adds ability to configure a profile name in evals.



Release Notes:

- N/A
2025-05-05 19:31:11 +00:00
Antonio Scandurra
4d51602e7b
Encourage editing over re-creating a file from scratch (#29870)
I also introduced a new eval to prove the encouragement actually makes a
difference.

Release Notes:

- Improved agent behavior when streaming edits, encouraging it to
editing files as opposed to creating them from scratch
2025-05-04 13:18:28 +00:00
Richard Feldman
9efc09c5a6
Add eval for open_tool (#29801)
Also have its description say it should only be used on request

Release Notes:

- N/A
2025-05-02 15:56:07 +00:00
Richard Feldman
afeb3d4fd9
Make eval more resilient to bad input from LLM (#29703)
I saw a slice panic (for begin > end) in a debug build of the eval. This
should just be a failed assertion, not a panic that takes out the whole
eval run!

Release Notes:

- N/A
2025-04-30 18:13:45 -04:00
Richard Feldman
d566864891
Make code block eval resilient to indentation (#29633)
This reduces spurious failures in the eval.

Release Notes:

- N/A
2025-04-30 02:13:13 +00:00
Richard Feldman
d7004030b3
Code block evals (#29619)
Add a targeted eval for code block formatting, and revise the system
prompt accordingly.

### Eval before, n=8

<img width="728" alt="eval before"
src="https://github.com/user-attachments/assets/552b6146-3d26-4eaa-86f9-9fc36c0cadf2"
/>

### Eval after prompt change, n=8 (excluding the new evals, so just
testing the prompt change)

<img width="717" alt="eval after"
src="https://github.com/user-attachments/assets/c78c7a54-4c65-470c-b135-8691584cd73e"
/>

Release Notes:

- N/A
2025-04-29 18:52:09 -04:00
Agus Zubiaga
fd17f2d8ae
agent: Enrich grep tool output with syntax information (#29601)
The `grep` tool used to include 4 lines of context around the match, but
the lines included would often be unhelpful. This PR improves this
behavior by using the range of the parent syntax node that contains the
full line(s) matched.

The match headers will also now include symbol breadcrumbs so that the
model can already gather code structure before/without reading files.

````md
### impl GitRepository for RealGitRepository › fn compare_checkpoints › L1278-1284
```rust
                let result = git
                    .run(&[
                        "diff-tree",
                        "--quiet",
                        &left.commit_sha.to_string(),
                        &right.commit_sha.to_string(),
                    ])
```
````

This positively impacts the `add_arg_to_trait_method` eval example with
better diff output, fewer tool failures, and reduced total turns.

Note: We have some plans to use a an "elision" approach where we would
combine all matches for a given file, skipping lines between them while
keeping symbol declaration lines. The theory is that this would be map
more closely to the expected input for edits. For now, this PR is a
significant improvement.

Release Notes:

- Agent: Enrich `grep` tool output with syntax information
2025-04-29 17:03:02 +00:00
Max Brunsfeld
f125353b6f
Add tree-sitter example to the eval (#29321)
Interesting things about this example:
* It's a useful, non-trivial change I made with the agent in Tree-sitter
* It runs fast
* It frequently showcases edit file errors
* It occasionally completely errors out due to errors parsing tool call
input JSON

Release Notes:

- N/A
2025-04-23 18:46:38 -07:00
Agus Zubiaga
8b5835de17
agent: Improve initial file search quality (#29317)
This PR significantly improves the quality of the initial file search
that occurs when the model doesn't yet know the full path to a file it
needs to read/edit.

Previously, the assertions in file_search often failed on main as the
model attempted to guess full file paths. On this branch, it reliably
calls `find_path` (previously `path_search`) before reading files.

After getting the model to find paths first, I noticed it would try
using `grep` instead of `path_search`. This motivated renaming
`path_search` to `find_path` (continuing the analogy to unix commands)
and adding system prompt instructions about proper tool selection.

Note: I know the command is just called `find`, but that seemed too
general.

In my eval runs, the `file_search` example improved from 40% ± 10% to
98% ± 2%. The only assertion I'm seeing occasionally fail is "glob
starts with `**` or project". We can probably add some instructions in
that regard.

Release Notes:

- N/A
2025-04-23 21:24:41 -03:00
Agus Zubiaga
45d3f5168a
eval: New add_arg_to_trait_method example (#29297)
Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
2025-04-23 18:46:39 +00:00
Agus Zubiaga
ce1a674eba
eval: Fine-grained assertions (#29246)
- Support programmatic examples
([example](17feb260a0/crates/eval/src/examples/file_search.rs))
- Combine data-driven example declarations into a single `.toml` file
([example](17feb260a0/crates/eval/src/examples/find_and_replace_diff_card.toml))
- Run judge on individual assertions (previously called "criteria")
- Report judge and programmatic assertions in one combined table

Note: We still need to work on concept naming 

<img width=400
src="https://github.com/user-attachments/assets/fc719c93-467f-412b-8d47-68821bd8a5f5">

Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
2025-04-22 23:58:58 -03:00