eval: Add HTML overview for evaluation runs (#29413)
This update generates a single self-contained .html file that shows an overview of evaluation threads in the browser. It's useful for:

- Quickly reviewing results
- Sharing evaluation runs
- Debugging
- Comparing models (TBD)

Features:

- Export thread JSON from the UI
- Keyboard navigation (j/k or Ctrl + ←/→)
- Toggle between compact and full views

Generating the overview:

- `cargo run -p eval` writes this file to the root of the run directory.
- Alternatively, `cargo run -p eval --bin explorer` generates it without running evals.

Screenshot: (image not preserved)

Release Notes:

- N/A
parent f106dfca42
commit 3389327df5
7 changed files with 1351 additions and 149 deletions
27
crates/eval/docs/explorer.md
Normal file
@@ -0,0 +1,27 @@
# Explorer

Threads Explorer is a single self-contained HTML file that gives an overview of
evaluation runs, while allowing for some interactivity.

When you open the file, it shows a _thread overview_, which looks like this:

| Turn | Text                                 | Tool                                         | Result                                        |
| ---- | ------------------------------------ | -------------------------------------------- | --------------------------------------------- |
| 1    | [User]:                              |                                              |                                               |
|      | Fix the bug: kwargs not passed...    |                                              |                                               |
| 2    | I'll help you fix that bug.          | **list_directory**(path="fastmcp")           | `fastmcp/src [...]`                           |
|      |                                      |                                              |                                               |
| 3    | Let's examine the code.              | **read_file**(path="fastmcp/main.py", [...]) | `def run_application(app, **kwargs): [...]`   |
| 4    | I found the issue.                   | **edit_file**(path="fastmcp/core.py", [...]) | `Made edit to fastmcp/core.py`                |
| 5    | Let's check if there are any errors. | **diagnostics**()                            | `No errors found`                             |

### Implementation details

`src/explorer.html` contains the template. You can open this template in a
browser as is, and it will show some dummy values. Its main use, however, is to
set the `threadsData` variable to real data, which is then rendered instead of
the dummy values.
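
As a rough illustration of setting `threadsData` from real data, the sketch below fills a template by plain string replacement. The placeholder marker `/*THREADS_DATA*/` and the function name `embed_threads_data` are hypothetical, chosen for this example; the actual template and `src/explorer.rs` may inject the data differently.

```rust
/// Hypothetical marker for where the data goes; the real template may
/// mark this spot in a different way.
const PLACEHOLDER: &str = "/*THREADS_DATA*/";

/// Returns the template with `threads_json` substituted for the first
/// occurrence of the placeholder, or `None` if the placeholder is missing.
///
/// `threads_json` is assumed to be valid JSON. Every `<` is rewritten as the
/// JSON escape `\u003c`, so a string such as "</script>" inside the data
/// cannot terminate the surrounding script element early.
fn embed_threads_data(template: &str, threads_json: &str) -> Option<String> {
    if !template.contains(PLACEHOLDER) {
        return None;
    }
    // In valid JSON, `<` can only occur inside string values, where
    // `\u003c` decodes to the same character.
    let safe_json = threads_json.replace('<', "\\u003c");
    Some(template.replacen(PLACEHOLDER, &safe_json, 1))
}
```

Under these assumptions, a template line such as `const threadsData = /*THREADS_DATA*/;` becomes an ordinary JavaScript assignment once the JSON is substituted in, which keeps the output a single self-contained file.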

`src/explorer.rs` takes one or more JSON files as generated by `cargo run -p
eval`, and outputs an HTML file for rendering these threads. Refer to the dummy
data in `explorer.html` for a sample format.
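
The "one or more JSON files" step can be pictured as reading each file and concatenating the documents into one JSON array. The following is a minimal sketch under that assumption, not the code in `src/explorer.rs`: the helper names `join_as_json_array` and `load_threads` are made up for illustration, and the real tool may parse and validate the files rather than joining raw text.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Joins already-serialized JSON documents into a single JSON array literal.
/// The documents are not parsed or validated here; each one is assumed to be
/// a complete, non-empty JSON value.
fn join_as_json_array(documents: &[String]) -> String {
    let trimmed: Vec<&str> = documents.iter().map(|doc| doc.trim()).collect();
    format!("[{}]", trimmed.join(","))
}

/// Reads every file in `paths` and returns their contents as one JSON array.
/// Stops at the first file that cannot be read and returns that error.
fn load_threads<P: AsRef<Path>>(paths: &[P]) -> io::Result<String> {
    let mut documents = Vec::with_capacity(paths.len());
    for path in paths {
        documents.push(fs::read_to_string(path)?);
    }
    Ok(join_as_json_array(&documents))
}
```

The resulting string would be the value handed to the template as `threadsData`. Joining raw text avoids a JSON dependency in the sketch, at the cost of passing malformed input straight through to the browser.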