
- Support programmatic examples ([example](17feb260a0/crates/eval/src/examples/file_search.rs))
- Combine data-driven example declarations into a single `.toml` file ([example](17feb260a0/crates/eval/src/examples/find_and_replace_diff_card.toml))
- Run judge on individual assertions (previously called "criteria")
- Report judge and programmatic assertions in one combined table

Note: We still need to work on concept naming

<img width=400 src="https://github.com/user-attachments/assets/fc719c93-467f-412b-8d47-68821bd8a5f5">

Release Notes:

- N/A

---------

Co-authored-by: Richard Feldman <oss@rtfeldman.com>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>

21 lines
517 B
Handlebars
You are an expert software developer.
Your task is to evaluate an AI agent's messages and tool calls in this conversation:

<messages>
{{{messages}}}
</messages>

Evaluate whether or not the sequence of messages passes the following assertion:

<assertion>
{{{assertion}}}
</assertion>

Analyze the messages one by one, and structure your answer in the following XML format:

```
<analysis>{YOUR ANALYSIS HERE}</analysis>
<passed>{PASSED_ASSERTION}</passed>
```

Where `PASSED_ASSERTION` is either `true` or `false`.
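
The template asks the judge to answer with an `<analysis>` element and a `<passed>` element containing `true` or `false`. A minimal sketch of how such a response could be parsed, using only the Rust standard library, is shown below. The names `JudgeVerdict`, `extract_tag`, and `parse_judge_response` are hypothetical and chosen for illustration; this is not taken from the `eval` crate and may differ from how it actually reads judge output.

```rust
/// Verdict extracted from a judge response.
/// Hypothetical type for illustration; not part of the eval crate.
#[derive(Debug, PartialEq)]
pub struct JudgeVerdict {
    pub analysis: String,
    pub passed: bool,
}

/// Returns the trimmed text between the first `<tag>` and the next `</tag>`,
/// or `None` if either delimiter is missing.
fn extract_tag<'a>(response: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{tag}>");
    let close = format!("</{tag}>");
    let start = response.find(&open)? + open.len();
    let len = response[start..].find(&close)?;
    Some(response[start..start + len].trim())
}

/// Parses the two elements the template requests.
/// Assumption: anything other than `true` or `false` (case-insensitive)
/// inside `<passed>` is treated as an error rather than as a failure.
pub fn parse_judge_response(response: &str) -> Result<JudgeVerdict, String> {
    let analysis = extract_tag(response, "analysis")
        .ok_or_else(|| "missing <analysis> element".to_string())?;
    let raw = extract_tag(response, "passed")
        .ok_or_else(|| "missing <passed> element".to_string())?;

    let passed = if raw.eq_ignore_ascii_case("true") {
        true
    } else if raw.eq_ignore_ascii_case("false") {
        false
    } else {
        return Err(format!("unexpected <passed> value: {raw}"));
    };

    Ok(JudgeVerdict {
        analysis: analysis.to_string(),
        passed,
    })
}
```

Calling `parse_judge_response` on a reply such as `<analysis>The agent ran the search tool.</analysis>` followed by `<passed>true</passed>` yields a verdict with `passed` set to `true`. Returning an error for a malformed reply, rather than defaulting to `false`, keeps a formatting slip by the judge distinguishable from a genuinely failed assertion.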