![]() 1. The `edit_file` tool tended to use `create_or_overwrite` a bit too often, leading to corruption of long files. This change replaces the boolean flag with an `EditFileMode` enum, which helps Agent make a more deliberate choice when overwriting files. With this change, the pass rate of the new eval increased from 10% to 100%. 2. eval: Added ability to run eval on top of an existing thread. Threads can now be loaded from JSON files in the `SerializedThread` format, which makes it easy to use real threads as starting points for tests/evals. 3. Don't try to restore tool cards when running in headless or eval mode -- we don't have a window to properly do this. Release Notes: - N/A |
||
---|---|---|
.. | ||
docs | ||
src | ||
.gitignore | ||
Cargo.toml | ||
LICENSE-GPL | ||
README.md | ||
runner_settings.json |
Eval
This eval assumes the working directory is the root of the repository. Run it with:
cargo run -p eval
The eval will optionally read a .env
file in crates/eval
if you need it to set environment variables, such as API keys.
Explorer Tool
The explorer tool generates a self-contained HTML view from one or more thread JSON file. It provides a visual interface to explore the agent thread, including tool calls and results. See ./docs/explorer.md for more details.
Usage
cargo run -p eval --bin explorer -- --input <path-to-json-files> --output <output-html-path>
Example:
cargo run -p eval --bin explorer -- --input ./runs/2025-04-23_15-53-30/fastmcp_bugifx/*/last.messages.json --output /tmp/explorer.html