### Problem
We want to start continuously tracking our progress on agent evals over
time. As part of this, we'd like the *score* to have a clear,
interpretable meaning. Right now, it's a number from 0 to 5, but it's
not clear what any particular number works. In addition, scores vary
widely from run to run, because the agent's output is deterministic. We
try to stabilize the score using a panel of judges, but the behavior of
the agent itself varies much more widely than the judges' scores for a
given run.
### Solution
* **explicit meanings of scores** - In this PR, we're prescribing the
diff and thread criteria files so that they *must* be unordered lists of
assertions. For both the thread and the diff, rather than providing an
abstract score, the judge's task is simply to count how many of these
assertions are satisfied. A percentage score can be derived from this
number, divided by the total number of assertions.
* **repetitions** - Rather than running each example once, and judging
it N times, we'll **run** the example N times. Right now, I'm just
judging the output once per run, because I believe that with these more
clear scoring criteria, the main source of non-determinism will be the
*agent's* behavior, not the judge's
### Questions
* **accounting for diagnostic errors** - Previously, the judge was asked
to incorporate diagnostics into their abstract scores. Now that the
"score" is determined directly from the criteria, the diagnostic will
not be captured in the score. How should the diagnostics be accounted
for in the eval? One thought is - let's simply count and report the
number of errors remaining after the agent finishes, as a separate field
of the run (along with diff score and thread score). We could consider
normalizing it using the total lines of added code (like errors per 100
lines of code added) in order to give it some semblance of stability
between examples.
* **repetitions** - How many repetitions should we run on CI? Each
repetition takes significant time, but I think running more than one
repetition will make the scores significantly less volatile.
### Todo
* [x] Fix `--concurrency` implementation so that only N tasks are
spawned
* [x] Support `--repetitions` efficiently (re-using the same worktree)
* [x] Restructure judge prompts to count passing criteria, not compute
abstract score
* [x] Report total number of diagnostics in some way
* [x] Format output nicely
Release Notes:
- N/A *or* Added/Fixed/Improved ...
---------
Co-authored-by: Antonio Scandurra <me@as-cii.com>
The old one wasn't linking, and
https://github.com/zed-industries/zed/pull/29081 has a bunch of merge
conflicts. Wanted to start simple/small.
## Todo
* [x] Remove low-signal examples
* [x] Make the eval run on a cron, on main, and on any PR with the
`run-eval` label
* [x] Noise in logs about failure to write settings
```
[2025-04-21T20:45:04Z ERROR settings] Failed to write settings to file
"/home/runner/.config/zed/settings.json"
Caused by:
No such file or directory (os error 2) at path
"/home/runner/.config/zed/.tmpLewFEs"
```
* [x] `Agentic loop stalled`
(https://github.com/zed-industries/zed/actions/runs/14581044243/job/40897622894)
* [x] Make sure that events are recorded in snowflake
* [ ] Change judge criteria to be more explicit about meanings of scores
Release Notes:
- N/A
---------
Co-authored-by: Antonio Scandurra <me@as-cii.com>
Co-authored-by: Agus Zubiaga <hi@aguz.me>
Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>
Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
| [actions/setup-node](https://redirect.github.com/actions/setup-node) |
action | digest | `cdca736` -> `49933ea` |
---
### Configuration
📅 **Schedule**: Branch creation - "after 3pm on Wednesday" in timezone
America/New_York, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
Release Notes:
- N/A
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yMzguMCIsInVwZGF0ZWRJblZlciI6IjM5LjIzOC4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
| [actions/checkout](https://redirect.github.com/actions/checkout) |
action | pinDigest | -> `11bd719` |
---
### Configuration
📅 **Schedule**: Branch creation - "after 3pm on Wednesday" in timezone
America/New_York, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
Release Notes:
- N/A
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yMzguMCIsInVwZGF0ZWRJblZlciI6IjM5LjIzOC4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Improve the logic in around release artifact bundling.
- Suppress a harmless "error: no such command: `about`" from
script/generate-licenes output
- Remove checks for main branch (which will never be true)
- Only run `Upload Artifacts to release` when not using `run-bundling`.
Prevents the creation of draft releases with just linux remote server binaries)
Release Notes:
- N/A
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
|
[actions/dependency-review-action](https://redirect.github.com/actions/dependency-review-action)
| action | digest | `3b139cf` -> `67d4f4b` |
---
### Configuration
📅 **Schedule**: Branch creation - "after 3pm on Wednesday" in timezone
America/New_York, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
Release Notes:
- N/A
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yMzguMCIsInVwZGF0ZWRJblZlciI6IjM5LjIzOC4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Make various improvements to our github issue templates.
- Adjust line lengths to not wrap in constrained new issue view (85
cols) not just full screen view (95 columns)
- Remove reference to drag/drop logs to upload (recently multiple issues
with dead upload links)
- Cleanup list view
Release Notes:
- N/A
This adds a nix CI job to build the flake in debug mode for
aarch64-darwin and x86-linux. For now this job will only run when the
`run-nix` label is added to a PR.
The CI job doesn't push to cachix for now, so every build is a clean
build.
I also added a condition to the garbage collection step so it only runs
when the nix store is >50GB.
Release Notes:
- N/A
This adds a "workspace-hack" crate, see
[mozilla's](https://hg.mozilla.org/mozilla-central/file/3a265fdc9f33e5946f0ca0a04af73acd7e6d1a39/build/workspace-hack/Cargo.toml#l7)
for a concise explanation of why this is useful. For us in practice this
means that if I were to run all the tests (`cargo nextest r
--workspace`) and then `cargo r`, all the deps from the previous cargo
command will be reused. Before this PR it would rebuild many deps due to
resolving different sets of features for them. For me this frequently
caused long rebuilds when things "should" already be cached.
To avoid manually maintaining our workspace-hack crate, we will use
[cargo hakari](https://docs.rs/cargo-hakari) to update the build files
when there's a necessary change. I've added a step to CI that checks
whether the workspace-hack crate is up to date, and instructs you to
re-run `script/update-workspace-hack` when it fails.
Finally, to make sure that people can still depend on crates in our
workspace without pulling in all the workspace deps, we use a `[patch]`
section following [hakari's
instructions](https://docs.rs/cargo-hakari/0.9.36/cargo_hakari/patch_directive/index.html)
One possible followup task would be making guppy use our
`rust-toolchain.toml` instead of having to duplicate that list in its
config, I opened an issue for that upstream: guppy-rs/guppy#481.
TODO:
- [x] Fix the extension test failure
- [x] Ensure the dev dependencies aren't being unified by Hakari into
the main dependencies
- [x] Ensure that the remote-server binary continues to not depend on
LibSSL
Release Notes:
- N/A
---------
Co-authored-by: Mikayla <mikayla@zed.dev>
Co-authored-by: Mikayla Maki <mikayla.c.maki@gmail.com>
- bump our livekit version to include a fix for a crane bug (TODO: add
link when an issue is filed on crane)
- switch to a clang stdenv for both linux and macos
- manually unify versions of our notify crate
- remove old linker flags which were only needed for livekit
- fix an issue where RUSTFLAGS shadowed the rustflags from cargo configs
Release Notes:
- N/A
Swift bindings BEGONE
Release Notes:
- Switched from using the Swift LiveKit bindings, to the Rust bindings,
fixing https://github.com/zed-industries/zed/issues/9396, a crash when
leaving a collaboration session, and making Zed easier to build.
---------
Co-authored-by: Conrad Irwin <conrad.irwin@gmail.com>
Co-authored-by: Michael Sloan <michael@zed.dev>
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
| [swatinem/rust-cache](https://redirect.github.com/swatinem/rust-cache)
| action | digest | `f0deed1` -> `9d47c6a` |
---
### Configuration
📅 **Schedule**: Branch creation - "after 3pm on Wednesday" in timezone
America/New_York, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
Release Notes:
- N/A
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yMDcuMSIsInVwZGF0ZWRJblZlciI6IjM5LjIwNy4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
I'll be using this to `nix run github:zed-industries/zed/nightly` and
get an up-to-date and cached nightly build.
It'll also serve as a way to warn me when the nix build is broken,
rather than having to wait for users to report it.
Eventually and depending on the build time of the nix builds, we may
want to consider putting a nix build in CI (#17458) to prevent
breakages, but for now a best-effort nightly build that doesn't block
the job if it fails is a good start.
Resolve#19937
Release Notes:
- N/A
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
|
[actions/upload-artifact](https://redirect.github.com/actions/upload-artifact)
| action | digest | `4cec3d8` -> `ea165f8` |
---
### Configuration
📅 **Schedule**: Branch creation - "after 3pm on Wednesday" in timezone
America/New_York, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
Release Notes:
- N/A
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yMDcuMSIsInVwZGF0ZWRJblZlciI6IjM5LjIwNy4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
|
[cloudflare/wrangler-action](https://redirect.github.com/cloudflare/wrangler-action)
| action | digest | `392082e` -> `da0e0df` |
---
### Configuration
📅 **Schedule**: Branch creation - "after 3pm on Wednesday" in timezone
America/New_York, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
Release Notes:
- N/A
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yMDcuMSIsInVwZGF0ZWRJblZlciI6IjM5LjIwNy4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
| [actions/setup-node](https://redirect.github.com/actions/setup-node) |
action | digest | `1d0ff46` -> `cdca736` |
---
### Configuration
📅 **Schedule**: Branch creation - "after 3pm on Wednesday" in timezone
America/New_York, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
Release Notes:
- N/A
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yMDcuMSIsInVwZGF0ZWRJblZlciI6IjM5LjIwNy4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
| [actions/checkout](https://redirect.github.com/actions/checkout) |
action | pinDigest | -> `11bd719` |
---
### Configuration
📅 **Schedule**: Branch creation - "after 3pm on Wednesday" in timezone
America/New_York, Automerge - At any time (no schedule defined).
🚦 **Automerge**: Disabled by config. Please merge this manually once you
are satisfied.
♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the
rebase/retry checkbox.
🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.
---
- [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check
this box
---
Release Notes:
- N/A
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzOS4yMDcuMSIsInVwZGF0ZWRJblZlciI6IjM5LjIwNy4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6W119-->
Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Stalebot has a maximum operations-per-run which is set at 1000. As a
result it may require multiple runs to successfully complete.
This morning it took [three
runs](https://github.com/zed-industries/zed/actions/runs/13921563707/attempts/1)
so set it to run three times two hours apart to avoid hitting github API
limits.
Release Notes:
- N/A
This PR removes the `migration_checks` job as a required check.
This was not required before, and we shouldn't make it required, as
there are cases where we need to bypass it, as is the case in
https://github.com/zed-industries/zed/pull/26832.
Release Notes:
- N/A
Remove the git beta issue template.
Improve ci.yml `job_spec` so that changes like this will not require CI in the future.
Improve ci.yml `job_spec` ensuring `output.run_license` exported for Cargo.lock.
Release Notes:
- N/A
---------
Co-authored-by: Peter Tripp <peter@zed.dev>
Follow up to:
- https://github.com/zed-industries/zed/pull/26551
We need the "Tests Pass" step to run `if: always()`.
Turns out when it's 'skipped', it counts as 'passing' with respect to
required status checks.
Release Notes:
- N/A
Let's see if the speed of `windows-2025-32` for `windows_tests` is
fast-enough for PRs and everywhere else use `windows-2025-16`. Leaving
`windows_clippy` unchanged with `windows-2025-16`.
Release Notes:
- N/A
Refactor GitHub actions CI workflow.
- Single combined 'tests_pass' action so we only need one mandatory
check for merge queue
- Add new `job_spec` job which determines what needs to be run (+5secs)
- Do not run full CI for docs only changes (~30secs vs 10+mins)
- Only run `script/generate-licenses` if Cargo.lock changed (saves
~23secs on mac_test)
- Move prettier /docs check to ci.yml and remove docs.yml
- Run Windows tests on every PR commit
- Added new Windows runners named to reflect their OS/capacity
(windows-2025-64, windows-2025-32, windows-2025-16)
Release Notes:
- N/A
Split Windows GHA CI job into `windows_clippy` and `windows_tests`
(`cargo test` and `cargo build`). `windows_clippy` will continue to run
on every PR commit, but `windows_tests` will only be run on main. Tag a
PR `windows` if you would like to run windows tests.
Added a call to the Azure metadata service to detect the Azure hardware
used by the GitHub hosted Windows runners. This is temporary and I'll
remove once I've gathered some data (adds 5-15secs to Windows CI times)
Release Notes:
- N/A
---------
Co-authored-by: Mikayla Maki <mikayla@zed.dev>