agent: Overwrite files more cautiously (#30649)

1. The `edit_file` tool tended to use `create_or_overwrite` a bit too often, leading to corruption of long files. This change replaces the boolean flag with an `EditFileMode` enum, which helps Agent make a more deliberate choice when overwriting files. With this change, the pass rate of the new eval increased from 10% to 100%.

2. eval: Added ability to run eval on top of an existing thread. Threads can now be loaded from JSON files in the `SerializedThread` format, which makes it easy to use real threads as starting points for tests/evals.

3. Don't try to restore tool cards when running in headless or eval mode -- we don't have a window to properly do this.

Release Notes:

- N/A
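To illustrate item 1, here is a minimal sketch of how an explicit mode enum can make overwriting a deliberate choice rather than the side effect of a boolean. Only the name `EditFileMode` comes from the commit message. The variant names, the `EditError` type, the `apply_edit` helper, and the in-memory file map are assumptions made for illustration and are not taken from the actual change.

```rust
use std::collections::HashMap;

/// Hypothetical variants; the commit message only names the enum itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EditFileMode {
    /// Modify a file that must already exist.
    Edit,
    /// Create a file that must not exist yet.
    Create,
    /// Replace the full contents of an existing file.
    Overwrite,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum EditError {
    MissingFile,
    AlreadyExists,
}

/// Applies an edit to an in-memory "file system" (path -> contents).
/// The rules enforced per mode are assumptions chosen for the example.
fn apply_edit(
    files: &mut HashMap<String, String>,
    path: &str,
    mode: EditFileMode,
    content: &str,
) -> Result<(), EditError> {
    match mode {
        EditFileMode::Edit => {
            // Simplification: an "edit" appends to the existing contents.
            let existing = files.get_mut(path).ok_or(EditError::MissingFile)?;
            existing.push_str(content);
            Ok(())
        }
        EditFileMode::Create => {
            // Refuse to clobber a file that is already there.
            if files.contains_key(path) {
                return Err(EditError::AlreadyExists);
            }
            files.insert(path.to_string(), content.to_string());
            Ok(())
        }
        EditFileMode::Overwrite => {
            // Overwriting has to be requested explicitly and needs a target.
            if !files.contains_key(path) {
                return Err(EditError::MissingFile);
            }
            files.insert(path.to_string(), content.to_string());
            Ok(())
        }
    }
}

fn main() {
    let mut files = HashMap::new();
    apply_edit(&mut files, "notes.txt", EditFileMode::Create, "first\n").unwrap();
    apply_edit(&mut files, "notes.txt", EditFileMode::Edit, "second\n").unwrap();
    // A second `Create` on the same path is rejected instead of silently overwriting.
    let rejected = apply_edit(&mut files, "notes.txt", EditFileMode::Create, "oops\n");
    println!("{:?}", rejected);
    apply_edit(&mut files, "notes.txt", EditFileMode::Overwrite, "replaced\n").unwrap();
    println!("{:?}", files["notes.txt"]);
}
```

With a single boolean, "create" and "overwrite" share one code path, so a caller cannot ask for one without permitting the other. Separate variants let each case be validated on its own.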
2025-05-14 10:40:44 +03:00 · 255d8f7cf8
commit 255d8f7cf8
parent 22f76ac1a7
18 changed files with 425 additions and 37 deletions
--- a/crates/eval/src/example.rs
+++ b/crates/eval/src/example.rs
@@ -48,6 +48,7 @@ pub struct ExampleMetadata {
    pub language_server: Option<LanguageServer>,
    pub max_assertions: Option<usize>,
    pub profile_id: AgentProfileId,
+    pub existing_thread_json: Option<String>,
 }

 #[derive(Clone, Debug)]
@@ -477,12 +478,16 @@ impl Response {
        tool_name: &'static str,
        cx: &mut ExampleContext,
    ) -> Result<&ToolUse> {
-        let result = self.messages.iter().find_map(|msg| {
+        let result = self.find_tool_call(tool_name);
+        cx.assert_some(result, format!("called `{}`", tool_name))
+    }
+
+    pub fn find_tool_call(&self, tool_name: &str) -> Option<&ToolUse> {
+        self.messages.iter().rev().find_map(|msg| {
            msg.tool_use
                .iter()
                .find(|tool_use| tool_use.name == tool_name)
-        });
-        cx.assert_some(result, format!("called `{}`", tool_name))
+        })
    }

    #[allow(dead_code)]