Commit graph

12 commits

Author SHA1 Message Date
Peter Tripp
952e3713d7
ci: Switch to Namespace (#35835)
Follow-up to:
- https://github.com/zed-industries/zed/pull/35826

Release Notes:

- N/A
2025-08-07 23:16:25 +00:00
Peter Tripp
7679db99ac
ci: Switch from BuildJet to GitHub runners (#35826)
In response to an ongoing BuildJet outage, consider migrating CI to
GitHub hosted runners.

Also includes revert of (causing flaky tests):
- https://github.com/zed-industries/zed/pull/35741

Downsides:
- Cost (2x)
- Force migration to Ubuntu 22.04 from 20.04 will bump our glibc minimum
from 2.31 to 2.35. Which would break RHEL 9.x (glibc 2.34), Ubuntu 20.04
(EOL) and derivatives.

Release Notes:

- N/A
2025-08-07 16:59:11 -04:00
Peter Tripp
70bde54a2c
ci: Lint GitHub Actions workflows with actionlint (#34729)
Added [rhysd/actionlint](https://github.com/rhysd/actionlint/) a static
checker for GitHub Actions workflow files.
Install locally with `brew install actionlint` the run with
`actionlint`.

Inspired by: https://github.com/zed-industries/zed/pull/34704 which
yielded this observation:

> In github actions: 
> 1. strings are truthy
> 2. `${{ }}` will become a string if it doesn't wrap the whole value.
>
> So `if: false && true` becomes `false`
> and `if: ${{ false && true }}` becomes `false`
> but `if: false && ${{ true }}` becomes `"false && true"` which
evaluates true
> The reason you sometimes need `${{ }}` is because YAML doesn't like
`!`
> so `if: !false` is invalid yaml
> and `if: ${{ !false }}` works just fine.

Changes:
- Add `actionlint` job
- Refactor `job_spec` job to be more readable
- Fix all `actionlint` and `shellcheck` errors in Actions workflows (62
in all)
- Add `self-mini-macos` and `self-32vcpu-windows-2022` labels to
self-hosted runners. Not strictly related, but useful if you need to
take a runner out of the rotation (since `macOS`, `self-hosted`, and
`ARM64` are auto-set and cannot be added/removed).
- Change ci.yml macos_relase to target `self-mini-macos` instead of
`bundle` which was previously deprecated.

This would've caught the error fixed in
https://github.com/zed-industries/zed/pull/34704. Here's what that [job
failure](https://github.com/zed-industries/zed/actions/runs/16376993944/job/46279281842?pr=34729)
would've looked like.

Release Notes:

- N/A
2025-07-18 15:15:07 -04:00
Peter Tripp
7031ed8b87
ci: Fix duplicated/failed eval jobs (#33453)
I think this should fix the CI issues with Eval jobs:

1. Duplicated workflows because `synchronize` / `opened` were triggering
distinct runs. This caused failed job entries because the duplicated
workflows had a shared concurrency group and so one would pre-empt the
other.

3. Removes the no-op job, introduced as an attempted workaround in
   https://github.com/zed-industries/zed/pull/29420.

These should correctly show as "Skipped" now:

| Before | After |
| - | - |
| <img width="359" alt="Screenshot 2025-06-26 at 9 57 04"
src="https://github.com/user-attachments/assets/6ddd4f46-27c7-4d82-98ba-0f1166fc55e7"
/> | <img width="355" alt="Screenshot 2025-06-26 at 10 09 54"
src="https://github.com/user-attachments/assets/5faade2c-f17c-447a-9af9-6396f9e53016"
/> |

Release Notes:

- N/A
2025-06-26 11:01:16 -04:00
Peter Tripp
02dfaf7799
ci: Suppress evals on forks (#32479)
Be kind to those with Zed forks.

Example [action run on
fork](https://github.com/G36maid/freebsd-ports-zed/actions/runs/15525942275)
where [this
job](https://github.com/G36maid/freebsd-ports-zed/actions/runs/15549650437/job/43777665341)
will wait forever. Sorry @G36maid

Release Notes:

- N/A
2025-06-10 18:20:03 +00:00
Richard Feldman
04c68dc0cf
Make the default repetitions be 8, and concurrency 4 (#29576)
This is based on having observed that there is a lot of variation
between runs on `n=1` and `n=3`.

* With `n=8` two runs on the same branch give answers that seem close
enough to be reasonably consistent.
* With higher concurrency, trying to run this many repetitions seems to
lead language servers to time out a lot, causing evals to fail.

Release Notes:

- N/A
2025-04-30 15:21:19 -04:00
Marshall Bowers
9bee765d7f
ci: Fix typo (#29421)
This PR fixes a small typo in a comment added in #29420.

Release Notes:

- N/A
2025-04-25 15:00:12 +00:00
Marshall Bowers
8c553ee9f0
ci: Add no-op job for "Run Agent Eval" workflow (#29420)
This PR adds a no-op job for the "Run Agent Eval" workflow.

This aims to avoid marking the check as failed on a PR that does not
include the `run-eval` label.

Release Notes:

- N/A
2025-04-25 10:54:12 -04:00
Nathan Sobo
d492939bed
Back off the eval to once a day for now (#29378)
cc @maxbrunsfeld 

Release Notes:

- N/A
2025-04-24 14:54:54 -06:00
Peter Tripp
338a6a3b7e
ci: Only run scheduled evals, not on main/release branch commits (#29238)
Release Notes:

- N/A
2025-04-22 16:55:36 -04:00
Max Brunsfeld
36d02de784
Rework eval to support interpretable scores and efficient repetitions (#29197)
### Problem

We want to start continuously tracking our progress on agent evals over
time. As part of this, we'd like the *score* to have a clear,
interpretable meaning. Right now, it's a number from 0 to 5, but it's
not clear what any particular number works. In addition, scores vary
widely from run to run, because the agent's output is deterministic. We
try to stabilize the score using a panel of judges, but the behavior of
the agent itself varies much more widely than the judges' scores for a
given run.

### Solution

* **explicit meanings of scores** - In this PR, we're prescribing the
diff and thread criteria files so that they *must* be unordered lists of
assertions. For both the thread and the diff, rather than providing an
abstract score, the judge's task is simply to count how many of these
assertions are satisfied. A percentage score can be derived from this
number, divided by the total number of assertions.
* **repetitions** - Rather than running each example once, and judging
it N times, we'll **run** the example N times. Right now, I'm just
judging the output once per run, because I believe that with these more
clear scoring criteria, the main source of non-determinism will be the
*agent's* behavior, not the judge's

### Questions

* **accounting for diagnostic errors** - Previously, the judge was asked
to incorporate diagnostics into their abstract scores. Now that the
"score" is determined directly from the criteria, the diagnostic will
not be captured in the score. How should the diagnostics be accounted
for in the eval? One thought is - let's simply count and report the
number of errors remaining after the agent finishes, as a separate field
of the run (along with diff score and thread score). We could consider
normalizing it using the total lines of added code (like errors per 100
lines of code added) in order to give it some semblance of stability
between examples.

* **repetitions** - How many repetitions should we run on CI? Each
repetition takes significant time, but I think running more than one
repetition will make the scores significantly less volatile.

### Todo

* [x] Fix `--concurrency` implementation so that only N tasks are
spawned
* [x] Support `--repetitions` efficiently (re-using the same worktree)
* [x] Restructure judge prompts to count passing criteria, not compute
abstract score
* [x] Report total number of diagnostics in some way
* [x] Format output nicely

Release Notes:

- N/A *or* Added/Fixed/Improved ...

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
2025-04-22 14:00:09 +00:00
Nathan Sobo
458ffaa134
Add new action to run agent eval (#29158)
The old one wasn't linking, and
https://github.com/zed-industries/zed/pull/29081 has a bunch of merge
conflicts. Wanted to start simple/small.

## Todo

* [x] Remove low-signal examples
* [x] Make the eval run on a cron, on main, and on any PR with the
`run-eval` label
* [x] Noise in logs about failure to write settings
    ```
[2025-04-21T20:45:04Z ERROR settings] Failed to write settings to file
"/home/runner/.config/zed/settings.json"
    
       Caused by:
No such file or directory (os error 2) at path
"/home/runner/.config/zed/.tmpLewFEs"
    ```
* [x] `Agentic loop stalled`
(https://github.com/zed-industries/zed/actions/runs/14581044243/job/40897622894)
* [x] Make sure that events are recorded in snowflake
* [ ] Change judge criteria to be more explicit about meanings of scores

Release Notes:

- N/A

---------

Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
2025-04-21 21:30:21 -07:00