Oleksiy Syvokon 3389327df5

eval: Add HTML overview for evaluation runs (#29413)

This update generates a single self-contained .html file that shows an
overview of evaluation threads in the browser. It's useful for:

- Quickly reviewing results
- Sharing evaluation runs
- Debugging
- Comparing models (TBD)

Features:

- Export thread JSON from the UI
- Keyboard navigation (j/k or Ctrl + ←/→)
- Toggle between compact and full views

Generating the overview:

- `cargo run -p eval` writes this file to the root of the run directory.
- Alternatively, run `cargo run -p eval --bin explorer` to generate it
without running evals.


Screenshot:

![image](https://github.com/user-attachments/assets/4ead71f6-da08-48ea-8fcb-2148d2e4b4db)


Release Notes:

- N/A

2025-04-25 17:49:05 +03:00

1.9 KiB


Explorer

Threads Explorer is a single self-contained HTML file that gives an overview of evaluation runs, while allowing for some interactivity.

When you open the file, it shows a thread overview that looks like this:

| Turn | Text | Tool | Result |
| ---- | ---- | ---- | ------ |
| 1 | [User]: | | |
| | Fix the bug: kwargs not passed... | | |
| 2 | I'll help you fix that bug. | list_directory(path="fastmcp") | `fastmcp/src [...]` |
| 3 | Let's examine the code. | read_file(path="fastmcp/main.py", [...]) | `def run_application(app, **kwargs): [...]` |
| 4 | I found the issue. | edit_file(path="fastmcp/core.py", [...]) | `Made edit to fastmcp/core.py` |
| 5 | Let's check if there are any errors. | diagnostics() | `No errors found` |

Implementation details

`src/explorer.html` contains the template. You can open the template in a browser as is, and it will show some dummy values. Its main use, however, is to have the `threadsData` variable set to real data, which is then rendered instead of the dummy values.
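The substitution step can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the marker `/*THREADS_DATA*/` and the function name `fill_template` are hypothetical, and the real template may mark its dummy values differently.

```rust
/// Hypothetical marker that the template is assumed to contain at the place
/// where real data should go. The actual `explorer.html` may differ.
const MARKER: &str = "/*THREADS_DATA*/";

/// Returns the template with an assignment to `threadsData` in place of the
/// marker, or `None` if the marker is not present.
fn fill_template(template: &str, threads_json: &str) -> Option<String> {
    if !template.contains(MARKER) {
        return None;
    }
    // Escape "</" so the embedded JSON cannot close the surrounding
    // <script> element early. "\/" is a valid escape inside JSON strings.
    let safe_json = threads_json.replace("</", "<\\/");
    let assignment = format!("threadsData = {safe_json};");
    // Replace only the first occurrence of the marker.
    Some(template.replacen(MARKER, &assignment, 1))
}
```

Returning `None` when the marker is missing makes a template mismatch visible to the caller instead of silently producing a page that still shows the dummy values.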

`src/explorer.rs` takes one or more JSON files generated by `cargo run -p eval` and outputs an HTML file that renders these threads. Refer to the dummy data in `explorer.html` for a sample of the format.
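Combining several thread files into one payload might look like the sketch below. It assumes each input file holds a single JSON value and joins the files into a JSON array without parsing them. The function names `join_as_json_array` and `load_threads` are hypothetical, and the real `explorer.rs` may work differently.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Joins already-serialized JSON documents into one JSON array literal.
/// The documents are neither parsed nor validated here.
fn join_as_json_array(docs: &[String]) -> String {
    let trimmed: Vec<&str> = docs.iter().map(|d| d.trim()).collect();
    format!("[{}]", trimmed.join(","))
}

/// Reads every path in order and returns the combined JSON array.
/// Fails with the first I/O error encountered.
fn load_threads<P: AsRef<Path>>(paths: &[P]) -> io::Result<String> {
    let mut docs = Vec::with_capacity(paths.len());
    for path in paths {
        docs.push(fs::read_to_string(path)?);
    }
    Ok(join_as_json_array(&docs))
}
```

Skipping the parse keeps the sketch free of dependencies. A real tool would likely deserialize the files so that malformed input is reported before the HTML is written.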
