ZIm/crates/eval/docs/explorer.md
Oleksiy Syvokon 3389327df5
eval: Add HTML overview for evaluation runs (#29413)
This update generates a single self-contained .html file that shows an
overview of evaluation threads in the browser. It's useful for:

- Quickly reviewing results
- Sharing evaluation runs
- Debugging
- Comparing models (TBD)

Features:

- Export thread JSON from the UI
- Keyboard navigation (j/k or Ctrl + ←/→)
- Toggle between compact and full views

Generating the overview:

- `cargo run -p eval` writes this file to the root of the run directory.
- Alternatively, run `cargo run -p eval --bin explorer` to generate it
  without running evals.


Screenshot:

![image](https://github.com/user-attachments/assets/4ead71f6-da08-48ea-8fcb-2148d2e4b4db)


Release Notes:

- N/A
2025-04-25 17:49:05 +03:00


# Explorer

Threads Explorer is a single self-contained HTML file that gives an overview of
evaluation runs, while allowing for some interactivity.

When you open the file, it shows a _thread overview_, which looks like this:

| Turn | Text | Tool | Result |
| ---- | ------------------------------------ | -------------------------------------------- | --------------------------------------------- |
| 1 | [User]: | | |
| | Fix the bug: kwargs not passed... | | |
| 2 | I'll help you fix that bug. | **list_directory**(path="fastmcp") | `fastmcp/src [...]` |
| | | | |
| 3 | Let's examine the code. | **read_file**(path="fastmcp/main.py", [...]) | `def run_application(app, **kwargs): [...]` |
| 4 | I found the issue. | **edit_file**(path="fastmcp/core.py", [...]) | `Made edit to fastmcp/core.py` |
| 5 | Let's check if there are any errors. | **diagnostics**() | `No errors found` |
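
The `[...]` markers in the Result column suggest that long tool output is shortened in the overview. The following is only a sketch of how such shortening could work, not the explorer's actual logic: the `compact` function and its `max_chars` parameter are hypothetical names introduced for illustration.

```rust
/// Hypothetical helper: keeps at most `max_chars` characters of the first
/// line of `text` and appends " [...]" when anything was cut off.
fn compact(text: &str, max_chars: usize) -> String {
    let text = text.trim_end();
    let first_line = text.lines().next().unwrap_or("");
    // Collect by `char` so multi-byte characters are never split.
    let mut out: String = first_line.chars().take(max_chars).collect();
    // `out` is always a prefix of `text`, so a shorter byte length
    // means something was dropped.
    if out.len() < text.len() {
        out.push_str(" [...]");
    }
    out
}
```

With this sketch, a one-line result such as `No errors found` is shown unchanged, while a multi-line directory listing is reduced to its first line followed by `[...]`.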

### Implementation details

`src/explorer.html` contains the template. You can open it in a browser as is,
and it will show dummy values. Its main use, however, is to have the
`threadsData` variable set to real data, which then replaces the dummy values.

`src/explorer.rs` takes one or more JSON files generated by
`cargo run -p eval` and outputs an HTML file that renders these threads. Refer
to the dummy data in `explorer.html` for a sample format.
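
To make the template-plus-data idea concrete, here is a minimal sketch of how thread JSON could be embedded into an HTML template. It is not taken from `src/explorer.rs`. The placeholder comment, the `inject_threads` function, and the assumption that `threadsData` is an array with one entry per thread are all hypothetical.

```rust
/// Hypothetical marker; the real template may use a different mechanism.
const PLACEHOLDER: &str = "<!--THREADS_DATA-->";

/// Joins raw JSON documents into a JavaScript array literal and embeds it
/// in the template as a `threadsData` assignment. Returns `None` when the
/// template has no placeholder.
fn inject_threads(template: &str, thread_json: &[&str]) -> Option<String> {
    if !template.contains(PLACEHOLDER) {
        return None;
    }
    let array = format!("[{}]", thread_json.join(","));
    // Keep a literal `</script>` inside a JSON string from closing the
    // script tag early; `<\/` is an equivalent escape inside JSON strings.
    let safe = array.replace("</", "<\\/");
    let script = format!("<script>var threadsData = {safe};</script>");
    Some(template.replacen(PLACEHOLDER, &script, 1))
}
```

Because the data is inlined into the page, the resulting file stays self-contained and can be shared or opened without a server.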