evals: Eval for creating an empty file (#31034)

This eval checks that Edit Agent can create an empty file without
writing its thoughts into it. This issue is not specific to empty files,
but it's easier to reproduce with them.

For some mysterious reason, I could easily reproduce this issue roughly
90% of the time in actual Zed. However, once I extract the exact LLM
request before the failure point and generate from that, the
reproduction rate drops to 2%!

Things I've tried to make sure it's not a fluke: disabling prompt
caching, capturing the LLM request via a proxy server, running the
prompt on Claude separately from evals. Every time it was mostly giving
good outcomes, which doesn't match my actual experience in Zed.

At some point I discovered that simply adding one insignificant space or
a newline to the prompt suddenly results in an outcome I tried to
reproduce almost perfectly.

This weirdness happens even outside the Zed code base and even when
using a different subscription. The result is the same: an extra newline
or space changes the model behavior significantly enough, so that the
pass rate drops from 99% to 0-3%

I have no explanation to this.


Release Notes:

- N/A
This commit is contained in:
Oleksiy Syvokon 2025-05-20 20:03:08 +03:00 committed by GitHub
parent 65e751ca33
commit 5e5a124ae1
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
3 changed files with 206 additions and 68 deletions

View file

@ -38,7 +38,7 @@ use workspace::Workspace;
pub struct EditFileTool;
#[derive(Debug, Serialize, Deserialize, JsonSchema)]
#[derive(Clone, Debug, Serialize, Deserialize, JsonSchema)]
pub struct EditFileToolInput {
/// A one-line, user-friendly markdown description of the edit. This will be
/// shown in the UI and also passed to another model to perform the edit.