ZIm/crates/eval/src at 4396ac9dd6307bb7d7870a8415f590cfecae16b7 - Yehowshua/ZIm

Bennet Bo Fenner 7be57baef0 agent: Fix issue with Anthropic thinking models (#33317)

cc @osyvokon

We were seeing a bunch of errors in our backend when people were using Claude models with thinking enabled. In the logs we would see

> an error occurred while interacting with the Anthropic API: invalid_request_error: messages.x.content.0.type: Expected `thinking` or `redacted_thinking`, but found `text`. When `thinking` is enabled, a final `assistant` message must start with a thinking block (preceeding the lastmost set of `tool_use` and `tool_result` blocks). We recommend you include thinking blocks from previous turns. To avoid this requirement, disable `thinking`. Please consult our documentation at https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking

However, this issue did not occur frequently and was not easily reproducible. It turns out it was triggered by us not correctly handling [Redacted Thinking Blocks](https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking#thinking-redaction).

I could consistently reproduce this issue by including this magic string in the request: `ANTHROPIC_MAGIC_STRING_TRIGGER_REDACTED_THINKING_46C9A13E193C177646C7398A98432ECCCE4C1253D5E2D82641AC0E52CC2876CB`. It forces `claude-3-7-sonnet` to emit redacted thinking blocks (confusingly, the magic string does not seem to work for `claude-sonnet-4`). As soon as we hit a tool call, Anthropic would return an error.

Thanks to @osyvokon for pointing me in the right direction 😄!

Release Notes:

- agent: Fixed an issue where Anthropic models would sometimes return an error when thinking was enabled

2025-06-24 16:23:59 +00:00
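The constraint quoted in the error message can be illustrated with a small Rust sketch. This is not the code from the commit: `ContentBlock`, `starts_with_thinking`, `replay_dropping_redacted`, and `replay_preserving_all` are hypothetical names, and the sketch assumes (the commit message does not say so explicitly) that the failure mode was redacted thinking blocks being lost when a previous assistant turn was sent back to the API.

```rust
/// Hypothetical, simplified model of the content blocks in one assistant turn.
/// The real Anthropic payload carries more fields; these are illustrative only.
#[derive(Debug, Clone, PartialEq)]
enum ContentBlock {
    Thinking { text: String },
    RedactedThinking { data: String },
    Text(String),
    ToolUse { id: String, name: String },
}

/// Checks the rule from the quoted error: with thinking enabled, the assistant
/// message must start with a `thinking` or `redacted_thinking` block.
fn starts_with_thinking(blocks: &[ContentBlock]) -> bool {
    matches!(
        blocks.first(),
        Some(ContentBlock::Thinking { .. } | ContentBlock::RedactedThinking { .. })
    )
}

/// Assumed buggy behaviour: redacted thinking blocks are skipped when the turn
/// is replayed, so the message may end up starting with a `text` block.
fn replay_dropping_redacted(blocks: &[ContentBlock]) -> Vec<ContentBlock> {
    blocks
        .iter()
        .filter(|block| !matches!(block, ContentBlock::RedactedThinking { .. }))
        .cloned()
        .collect()
}

/// Assumed fixed behaviour: every block is replayed unchanged and in order,
/// including the opaque redacted thinking payload.
fn replay_preserving_all(blocks: &[ContentBlock]) -> Vec<ContentBlock> {
    blocks.to_vec()
}
```

For a turn consisting of a redacted thinking block, a text block, and a tool call, `starts_with_thinking` holds for the output of `replay_preserving_all` but not for the output of `replay_dropping_redacted`, which matches the "Expected `thinking` or `redacted_thinking`, but found `text`" error above.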
examples	agent: Less disruptive changed file notification (#31693)	2025-06-16 18:45:24 +03:00
assertions.rs	eval: Count execution errors as failures (#30712)	2025-05-14 20:44:19 +03:00
eval.rs	Extract an agent_ui crate from agent (#33284)	2025-06-23 18:00:28 -07:00
example.rs	Extract an agent_ui crate from agent (#33284)	2025-06-23 18:00:28 -07:00
explorer.html	eval: Add HTML overview for evaluation runs (#29413)	2025-04-25 17:49:05 +03:00
explorer.rs	evals: Allow threads explorer to search for JSON files recursively (#31509)	2025-05-27 14:18:47 +00:00
ids.rs	Use `anyhow` more idiomatically (#31052)	2025-05-20 23:06:07 +00:00
instance.rs	agent: Fix issue with Anthropic thinking models (#33317)	2025-06-24 16:23:59 +00:00
judge_diff_prompt.hbs	eval: Fine-grained assertions (#29246)	2025-04-22 23:58:58 -03:00
judge_thread_prompt.hbs	eval: Fine-grained assertions (#29246)	2025-04-22 23:58:58 -03:00
tool_metrics.rs	eval: Fine-grained assertions (#29246)	2025-04-22 23:58:58 -03:00