# Tool Evals

A framework for evaluating and benchmarking agent panel generations.

## Overview

Tool Evals provides a headless environment for running assistant evaluations on code repositories. It automates the process of:
- Setting up test code and repositories
- Sending prompts to language models
- Allowing the assistant to use tools to modify code
- Collecting metrics on performance and tool usage
- Evaluating results against known good solutions
## How It Works

The system consists of several key components:
- Eval: Loads exercises from the zed-ace-framework repository, creates temporary repos, and executes evaluations
- HeadlessAssistant: Provides a headless environment for running the AI assistant
- Judge: Evaluates AI-generated solutions against reference implementations and assigns scores
- Templates: Defines evaluation frameworks for different tasks (Project Creation, Code Modification, Conversational Guidance)
## Setup Requirements

### Prerequisites
- Rust and Cargo
- Git
- Python (for report generation)
- Network access to clone repositories
- Appropriate API keys for language models and git services (Anthropic, GitHub, etc.)
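To sanity-check that the core tooling is available before running anything, something like the following works in most shells (minimum versions aren't pinned here, so treat this as a quick check rather than a hard requirement):

```sh
# Confirm the basic toolchain is on PATH
rustc --version
cargo --version
git --version
python3 --version   # or `python`, depending on your system
```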
### Environment Variables

Ensure you have the required API keys set, either from a dev run of Zed or via these environment variables:

- `ZED_ANTHROPIC_API_KEY` for Claude models
- `ZED_GITHUB_API_KEY` for the GitHub API (or similar)
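For a one-off run, the simplest approach is to export the keys in the shell session that will invoke the eval. The variable names below are the ones listed above; the placeholder values are obviously not real keys:

```sh
# Export API keys for the current shell session
export ZED_ANTHROPIC_API_KEY="<your-anthropic-api-key>"
export ZED_GITHUB_API_KEY="<your-github-token>"
```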
## Usage

### Running Evaluations

```sh
# Run all tests
cargo run -p assistant_eval -- --all

# Run only specific languages
cargo run -p assistant_eval -- --all --languages python,rust

# Limit concurrent evaluations
cargo run -p assistant_eval -- --all --concurrency 5

# Limit the number of exercises per language
cargo run -p assistant_eval -- --all --max-exercises-per-language 3
```
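The flags above appear to compose, so a quick smoke-test run that limits scope in every dimension might look like this (an assumption based on the examples above, not a separately documented mode):

```sh
# Small smoke-test run: Rust exercises only, low concurrency, one exercise per language
cargo run -p assistant_eval -- --all --languages rust --concurrency 2 --max-exercises-per-language 1
```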
## Evaluation Template Types

The system supports three types of evaluation templates:
- ProjectCreation: Tests the model's ability to create new implementations from scratch
- CodeModification: Tests the model's ability to modify existing code to meet new requirements
- ConversationalGuidance: Tests the model's ability to provide guidance without writing code
## Support Repo

The `zed-industries/zed-ace-framework` repository contains the analytics and reporting scripts.
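To generate reports, a reasonable starting point is to clone that repository and follow its own README; the URL below is simply the standard GitHub address for the repo named above:

```sh
# Fetch the analytics/reporting scripts (see that repo's README for usage)
git clone https://github.com/zed-industries/zed-ace-framework.git
```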