Yehowshua/ZIm - Forgejo: Beyond coding. We Forge.

Author	SHA1	Message	Date
Nathan Sobo	d492939bed	Back off the eval to once a day for now (#29378 ) cc @maxbrunsfeld Release Notes: - N/A	2025-04-24 14:54:54 -06:00
Peter Tripp	338a6a3b7e	ci: Only run scheduled evals, not on main/release branch commits (#29238 ) Release Notes: - N/A	2025-04-22 16:55:36 -04:00
Max Brunsfeld	36d02de784	Rework eval to support interpretable scores and efficient repetitions (#29197 ) ### Problem We want to start continuously tracking our progress on agent evals over time. As part of this, we'd like the score to have a clear, interpretable meaning. Right now, it's a number from 0 to 5, but it's not clear what any particular number works. In addition, scores vary widely from run to run, because the agent's output is deterministic. We try to stabilize the score using a panel of judges, but the behavior of the agent itself varies much more widely than the judges' scores for a given run. ### Solution * explicit meanings of scores - In this PR, we're prescribing the diff and thread criteria files so that they must be unordered lists of assertions. For both the thread and the diff, rather than providing an abstract score, the judge's task is simply to count how many of these assertions are satisfied. A percentage score can be derived from this number, divided by the total number of assertions. * repetitions - Rather than running each example once, and judging it N times, we'll run the example N times. Right now, I'm just judging the output once per run, because I believe that with these more clear scoring criteria, the main source of non-determinism will be the agent's behavior, not the judge's ### Questions * accounting for diagnostic errors - Previously, the judge was asked to incorporate diagnostics into their abstract scores. Now that the "score" is determined directly from the criteria, the diagnostic will not be captured in the score. How should the diagnostics be accounted for in the eval? One thought is - let's simply count and report the number of errors remaining after the agent finishes, as a separate field of the run (along with diff score and thread score). We could consider normalizing it using the total lines of added code (like errors per 100 lines of code added) in order to give it some semblance of stability between examples. * repetitions - How many repetitions should we run on CI? Each repetition takes significant time, but I think running more than one repetition will make the scores significantly less volatile. ### Todo * [x] Fix `--concurrency` implementation so that only N tasks are spawned * [x] Support `--repetitions` efficiently (re-using the same worktree) * [x] Restructure judge prompts to count passing criteria, not compute abstract score * [x] Report total number of diagnostics in some way * [x] Format output nicely Release Notes: - N/A or Added/Fixed/Improved ... --------- Co-authored-by: Antonio Scandurra <me@as-cii.com>	2025-04-22 14:00:09 +00:00
Nathan Sobo	458ffaa134	Add new action to run agent eval (#29158 ) The old one wasn't linking, and https://github.com/zed-industries/zed/pull/29081 has a bunch of merge conflicts. Wanted to start simple/small. ## Todo * [x] Remove low-signal examples * [x] Make the eval run on a cron, on main, and on any PR with the `run-eval` label * [x] Noise in logs about failure to write settings ``` [2025-04-21T20:45:04Z ERROR settings] Failed to write settings to file "/home/runner/.config/zed/settings.json" Caused by: No such file or directory (os error 2) at path "/home/runner/.config/zed/.tmpLewFEs" ``` * [x] `Agentic loop stalled` (https://github.com/zed-industries/zed/actions/runs/14581044243/job/40897622894) * [x] Make sure that events are recorded in snowflake * [ ] Change judge criteria to be more explicit about meanings of scores Release Notes: - N/A --------- Co-authored-by: Antonio Scandurra <me@as-cii.com> Co-authored-by: Agus Zubiaga <hi@aguz.me> Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com> Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>	2025-04-21 21:30:21 -07:00

4 commits