# Tool Evals
A framework for evaluating and benchmarking AI assistant performance in the Zed editor.
## Overview
Tool Evals provides a headless environment for running assistant evaluations on code repositories. It automates the process of:
- Cloning and setting up test repositories
- Sending prompts to language models
- Allowing the assistant to use tools to modify code
- Collecting metrics on performance
- Evaluating results against known good solutions
## How It Works
The system consists of several key components:
- `Eval`: Loads test cases from the `evaluation_data` directory, clones repos, and executes evaluations
- `HeadlessAssistant`: Provides a headless environment for running the AI assistant
- `Judge`: Compares AI-generated diffs with reference solutions and scores their functional similarity
The evaluation flow:
- An evaluation is loaded from the `evaluation_data` directory
- The target repository is cloned and checked out at a specific commit (see the git sketch after this list)
- A `HeadlessAssistant` instance is created with the specified language model
- The user prompt is sent to the assistant
- The assistant responds and uses tools to modify code
- Upon completion, a diff is generated from the changes
- Results are saved, including the diff, the assistant's response, and performance metrics
- If a reference solution exists, a `Judge` evaluates the similarity of the solution
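
The repository handling itself is ordinary git. As a rough sketch of the equivalent manual steps (the repository URL, commit, working directory, and diff path below are illustrative assumptions, not the harness's actual layout):

```sh
# Illustrative only; the harness performs these steps internally.
# <repo-url> and <commit> are assumed to come from the evaluation's setup.json.
git clone <repo-url> work-dir
cd work-dir
git checkout <commit>

# ... the assistant edits files here via its tools ...

# Capture the resulting changes as a diff for judging (output path is hypothetical).
git diff > ../assistant.diff
```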
## Setup Requirements
### Prerequisites
- Rust and Cargo
- Git
- Network access to clone repositories
- Appropriate API keys for language models and git services (Anthropic, GitHub, etc.)
### Environment Variables
Ensure you have the required API keys set, either from a dev run of Zed or via these environment variables:
- `ZED_ANTHROPIC_API_KEY` for Claude models
- `ZED_OPENAI_API_KEY` for OpenAI models
- `ZED_GITHUB_API_KEY` for the GitHub API (or similar)
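
For example, assuming a POSIX shell, the keys can be exported before running any evaluations (the values below are placeholders):

```sh
export ZED_ANTHROPIC_API_KEY="<your-anthropic-key>"
export ZED_OPENAI_API_KEY="<your-openai-key>"
export ZED_GITHUB_API_KEY="<your-github-token>"
```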
## Usage
### Running a Single Evaluation
To run a specific evaluation:

```sh
cargo run -p assistant_eval -- bubbletea-add-set-window-title
```
The arguments are regex patterns matched against evaluation names, so to run all evaluations whose names contain `bubbletea`, run:

```sh
cargo run -p assistant_eval -- bubbletea
```
To run all evaluations:

```sh
cargo run -p assistant_eval -- --all
```
## Evaluation Data Structure
Each evaluation should be placed in the `evaluation_data` directory with the following structure:

- `prompt.txt`: The user's prompt.
- `original.diff`: The `git diff` of the change anticipated for this prompt.
- `setup.json`: Information about the repo used for the evaluation.
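
For example, the evaluation used in the usage section above might be laid out like this (a sketch that assumes each evaluation lives in its own subdirectory named after the evaluation):

```
evaluation_data/
└── bubbletea-add-set-window-title/
    ├── prompt.txt
    ├── original.diff
    └── setup.json
```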