ZIm/crates/eval/src
Oleksiy Syvokon 255d8f7cf8
agent: Overwrite files more cautiously (#30649)
1. The `edit_file` tool tended to use `create_or_overwrite` a bit too
often, leading to corruption of long files. This change replaces the
boolean flag with an `EditFileMode` enum, which helps the agent make a more
deliberate choice when overwriting files.

With this change, the pass rate of the new eval increased from 10% to
100%.

2. eval: Added the ability to run an eval on top of an existing thread. Threads
can now be loaded from JSON files in the `SerializedThread` format,
which makes it easy to use real threads as starting points for
tests/evals.

3. Don't try to restore tool cards when running in headless or eval mode
-- we don't have a window to properly do this.
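
As a rough illustration of point 1, the sketch below shows how an explicit mode enum can make overwriting a deliberate choice rather than a side effect of a boolean. Only the name `EditFileMode` comes from the commit message; the variant names, the `EditError` type, the in-memory `Files` map, and the `apply` function are assumptions made for this example and are not taken from the actual change.

```rust
use std::collections::HashMap;

/// Hypothetical variants; the commit message names only the enum itself.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EditFileMode {
    /// Change part of an existing file and leave the rest untouched.
    Edit,
    /// Create a file that must not exist yet.
    Create,
    /// Replace the whole file; the caller has to ask for this explicitly.
    Overwrite,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum EditError {
    NotFound,
    AlreadyExists,
}

/// In-memory stand-in for a file system: path -> contents.
pub type Files = HashMap<String, String>;

/// Applies `text` to `path` according to `mode`.
/// For brevity, `Edit` is modeled as appending to the existing contents.
pub fn apply(
    files: &mut Files,
    path: &str,
    mode: EditFileMode,
    text: &str,
) -> Result<(), EditError> {
    match mode {
        EditFileMode::Edit => {
            // A targeted edit needs an existing file to work on.
            let contents = files.get_mut(path).ok_or(EditError::NotFound)?;
            contents.push_str(text);
            Ok(())
        }
        EditFileMode::Create => {
            // Refuse to clobber an existing file by accident.
            if files.contains_key(path) {
                return Err(EditError::AlreadyExists);
            }
            files.insert(path.to_string(), text.to_string());
            Ok(())
        }
        EditFileMode::Overwrite => {
            // Full replacement happens only when requested by name.
            files.insert(path.to_string(), text.to_string());
            Ok(())
        }
    }
}
```

With a single boolean, "create" and "overwrite" share one code path, so a long existing file can be replaced without the caller ever stating that intent. With three named modes, `Create` can fail on an existing path and `Overwrite` has to be selected explicitly.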

Release Notes:

- N/A
2025-05-14 10:40:44 +03:00
| Name | Last commit | Date |
| --- | --- | --- |
| examples | agent: Overwrite files more cautiously (#30649) | 2025-05-14 10:40:44 +03:00 |
| assertions.rs | eval: Fine-grained assertions (#29246) | 2025-04-22 23:58:58 -03:00 |
| eval.rs | agent: Overwrite files more cautiously (#30649) | 2025-05-14 10:40:44 +03:00 |
| example.rs | agent: Overwrite files more cautiously (#30649) | 2025-05-14 10:40:44 +03:00 |
| explorer.html | eval: Add HTML overview for evaluation runs (#29413) | 2025-04-25 17:49:05 +03:00 |
| explorer.rs | eval: Add HTML overview for evaluation runs (#29413) | 2025-04-25 17:49:05 +03:00 |
| ids.rs | Add new action to run agent eval (#29158) | 2025-04-21 21:30:21 -07:00 |
| instance.rs | agent: Overwrite files more cautiously (#30649) | 2025-05-14 10:40:44 +03:00 |
| judge_diff_prompt.hbs | eval: Fine-grained assertions (#29246) | 2025-04-22 23:58:58 -03:00 |
| judge_thread_prompt.hbs | eval: Fine-grained assertions (#29246) | 2025-04-22 23:58:58 -03:00 |
| tool_metrics.rs | eval: Fine-grained assertions (#29246) | 2025-04-22 23:58:58 -03:00 |