# Tool Evals
A framework for evaluating and benchmarking AI assistant performance in the Zed editor.
## Overview
Tool Evals provides a headless environment for running assistant evaluations on code repositories. It automates the process of:
- Cloning and setting up test repositories
- Sending prompts to language models
- Allowing the assistant to use tools to modify code
- Collecting metrics on performance
- Evaluating results against known good solutions
## How It Works
The system consists of several key components:
- `Eval`: Loads test cases from the `evaluation_data` directory, clones repos, and executes evaluations
- `HeadlessAssistant`: Provides a headless environment for running the AI assistant
- `Judge`: Compares AI-generated diffs with reference solutions and scores their functional similarity
The evaluation flow:
1. An evaluation is loaded from the `evaluation_data` directory
2. The target repository is cloned and checked out at a specific commit
3. A `HeadlessAssistant` instance is created with the specified language model
4. The user prompt is sent to the assistant
5. The assistant responds and uses tools to modify code
6. Upon completion, a diff is generated from the changes
7. Results are saved, including the diff, the assistant's response, and performance metrics
8. If a reference solution exists, a `Judge` evaluates the similarity of the solution
## Setup Requirements

### Prerequisites
- Rust and Cargo
- Git
- Network access to clone repositories
- Appropriate API keys for language models and git services (Anthropic, GitHub, etc.)
### Environment Variables
Ensure you have the required API keys set, either from a dev run of Zed or via these environment variables:
- `ZED_ANTHROPIC_API_KEY` for Claude models
- `ZED_OPENAI_API_KEY` for OpenAI models
- `ZED_GITHUB_API_KEY` for the GitHub API (or similar)
## Usage

### Running a Single Evaluation
To run a specific evaluation:

```shell
cargo run -p assistant_eval -- bubbletea-add-set-window-title
```

The arguments are regex patterns for the evaluation names to run, so to run all evaluations whose names contain `bubbletea`, run:

```shell
cargo run -p assistant_eval -- bubbletea
```

To run all evaluations:

```shell
cargo run -p assistant_eval -- --all
```
## Evaluation Data Structure
Each evaluation should be placed in the `evaluation_data` directory with the following structure:

- `prompt.txt`: The user's prompt.
- `original.diff`: The `git diff` of the change anticipated for this prompt.
- `setup.json`: Information about the repo used for the evaluation.