ZIm/crates/eval/src/judge_thread_prompt.hbs
Agus Zubiaga ce1a674eba
eval: Fine-grained assertions (#29246)
- Support programmatic examples
([example](17feb260a0/crates/eval/src/examples/file_search.rs))
- Combine data-driven example declarations into a single `.toml` file
([example](17feb260a0/crates/eval/src/examples/find_and_replace_diff_card.toml))
- Run judge on individual assertions (previously called "criteria")
- Report judge and programmatic assertions in one combined table

Note: We still need to work on concept naming.

<img width=400
src="https://github.com/user-attachments/assets/fc719c93-467f-412b-8d47-68821bd8a5f5">

Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
2025-04-22 23:58:58 -03:00

You are an expert software developer.
Your task is to evaluate an AI agent's messages and tool calls in this conversation:
<messages>
{{{messages}}}
</messages>
Evaluate whether or not the sequence of messages passes the following assertion:
<assertion>
{{{assertion}}}
</assertion>
Analyze the messages one by one, and structure your answer in the following XML format:
```
<analysis>{YOUR ANALYSIS HERE}</analysis>
<passed>{PASSED_ASSERTION}</passed>
```
Here, `PASSED_ASSERTION` must be exactly `true` or `false`.
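
The reply format requested by the template above is simple enough to parse with plain string searches. The following is a minimal sketch of how a caller might do that. It is not the eval crate's actual code: `JudgeVerdict`, `extract_tag`, and `parse_judge_response` are hypothetical names introduced only for illustration, and the sketch assumes each tag appears at most once in the reply.

```rust
/// Parsed form of the judge's reply (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
struct JudgeVerdict {
    analysis: String,
    passed: bool,
}

/// Returns the trimmed text between the first `<tag>` and the next `</tag>`,
/// or `None` if either delimiter is missing.
fn extract_tag<'a>(response: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = response.find(&open)? + open.len();
    let end = start + response[start..].find(&close)?;
    Some(response[start..end].trim())
}

/// Parses a reply of the form `<analysis>...</analysis><passed>true|false</passed>`.
/// Anything other than the literal `true` or `false` is treated as an error
/// rather than silently counted as a failed assertion.
fn parse_judge_response(response: &str) -> Result<JudgeVerdict, String> {
    let analysis = extract_tag(response, "analysis")
        .ok_or("missing <analysis> tag")?
        .to_string();
    let passed = match extract_tag(response, "passed").ok_or("missing <passed> tag")? {
        "true" => true,
        "false" => false,
        other => return Err(format!("unexpected <passed> value: {other:?}")),
    };
    Ok(JudgeVerdict { analysis, passed })
}
```

Rejecting malformed replies instead of defaulting to `false` keeps judge formatting mistakes distinguishable from genuine assertion failures in the combined results table.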