# Tool Evals
A framework for evaluating and benchmarking AI assistant performance in the Zed editor.
## Overview
Tool Evals provides a headless environment for running assistant evaluations on code repositories. It automates the process of:
- Cloning and setting up test repositories
- Sending prompts to language models
- Allowing the assistant to use tools to modify code
- Collecting metrics on performance
- Evaluating results against known good solutions
## How It Works
The system consists of several key components:
- **Eval**: Loads test cases from the `evaluation_data` directory, clones repos, and executes evaluations
- **HeadlessAssistant**: Provides a headless environment for running the AI assistant
- **Judge**: Compares AI-generated diffs with reference solutions and scores their functional similarity
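To make the roles above concrete, here is a minimal, hypothetical sketch of the three components; the actual types in the crate are richer and named differently, so treat this only as an orientation aid.

```rust
// Illustrative shapes only; not the crate's real definitions.

/// A test case loaded from `evaluation_data`: the prompt, the repository to
/// run it against, and (optionally) a reference solution.
struct Eval {
    name: String,
    prompt: String,
    repo_url: String,
    revision: String,
    reference_diff: Option<String>,
}

/// Drives the assistant without a UI: sends the prompt, lets the model call
/// tools that edit the working tree, and collects the final response.
struct HeadlessAssistant {
    model_name: String,
}

/// Compares an AI-generated diff with the reference solution and produces a
/// functional-similarity score.
struct Judge;

impl Judge {
    fn score(&self, _generated_diff: &str, _reference_diff: &str) -> f32 {
        // The real judge asks a language model to compare the two diffs.
        unimplemented!()
    }
}
```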
The evaluation flow:

1. An evaluation is loaded from the `evaluation_data` directory
2. The target repository is cloned and checked out at a specific commit
3. A `HeadlessAssistant` instance is created with the specified language model
4. The user prompt is sent to the assistant
5. The assistant responds and uses tools to modify code
6. Upon completion, a diff is generated from the changes
7. Results are saved, including the diff, the assistant's response, and performance metrics
8. If a reference solution exists, a `Judge` evaluates the similarity of the solution
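The same flow, compressed into a hypothetical function so the ordering is easy to see; every step is stubbed out, and none of the names below are the crate's real API.

```rust
use std::path::{Path, PathBuf};

// Stubs standing in for the real work; each corresponds to a step above.
fn clone_and_checkout(_url: &str, _revision: &str) -> PathBuf { unimplemented!() }
fn run_headless_assistant(_worktree: &Path, _prompt: &str, _model: &str) -> String { unimplemented!() }
fn diff_worktree(_worktree: &Path) -> String { unimplemented!() }
fn judge_similarity(_generated: &str, _reference: &str) -> f32 { unimplemented!() }

/// Returns (diff, assistant response, optional judge score) for one evaluation.
fn run_one_eval(
    url: &str,
    revision: &str,
    prompt: &str,
    reference_diff: Option<&str>,
    model: &str,
) -> (String, String, Option<f32>) {
    // Clone the target repository and check out the pinned commit.
    let worktree = clone_and_checkout(url, revision);
    // Send the prompt; the assistant edits files via tool calls and returns
    // its final response text once it is done.
    let response = run_headless_assistant(&worktree, prompt, model);
    // Turn whatever the assistant changed into a diff.
    let diff = diff_worktree(&worktree);
    // If a reference solution exists, score functional similarity against it.
    let score = reference_diff.map(|reference| judge_similarity(&diff, reference));
    (diff, response, score)
}
```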
## Setup Requirements
### Prerequisites
- Rust and Cargo
- Git
- Network access to clone repositories
- Appropriate API keys for language models and git services (Anthropic, GitHub, etc.)
### Environment Variables
Ensure you have the required API keys set, either from a dev run of Zed or via these environment variables:

- `ZED_ANTHROPIC_API_KEY` for Claude models
- `ZED_OPENAI_API_KEY` for OpenAI models
- `ZED_GITHUB_API_KEY` for GitHub API (or similar)
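As a quick sanity check before a run, you can confirm the keys are visible to your shell's child processes. This standalone snippet (not part of the eval crate) just reads the variables:

```rust
fn main() {
    // Report which of the expected API keys are present in the environment.
    for key in ["ZED_ANTHROPIC_API_KEY", "ZED_OPENAI_API_KEY", "ZED_GITHUB_API_KEY"] {
        match std::env::var(key) {
            Ok(_) => println!("{key} is set"),
            Err(_) => println!("{key} is NOT set"),
        }
    }
}
```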
## Usage
### Running a Single Evaluation
To run a specific evaluation:
```
cargo run -p assistant_eval -- bubbletea-add-set-window-title
```
The arguments are regex patterns for the evaluation names to run, so to run all evaluations whose names contain `bubbletea`, run:

```
cargo run -p assistant_eval -- bubbletea
```
To run all evaluations:
```
cargo run -p assistant_eval -- --all
```
## Evaluation Data Structure
Each evaluation should be placed in the `evaluation_data` directory with the following structure:

- `prompt.txt`: The user's prompt.
- `original.diff`: The `git diff` of the change anticipated for this prompt.
- `setup.json`: Information about the repo used for the evaluation.
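For orientation, `setup.json` likely needs at least the repository location and the commit to check out; the field names below are hypothetical, so consult an existing evaluation under `evaluation_data` for the exact schema.

```rust
// Hypothetical shape of the data carried by `setup.json`; real field names may differ.
struct EvalSetup {
    /// URL of the repository the evaluation runs against.
    repo_url: String,
    /// Commit to check out, so every run starts from the same tree.
    revision: String,
}
```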