assistant_eval: Add ACE framework (#27181)

Release Notes:

- N/A

---------

Co-authored-by: Michael Sloan <michael@zed.dev>
2025-04-02 23:02:06 -05:00
commit cd85b430e4
parent d3e4de7c72
11 changed files with 1113 additions and 373 deletions
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -595,14 +595,12 @@ dependencies = [
 "futures 0.3.31",
 "gpui",
 "gpui_tokio",
- "itertools 0.14.0",
 "language",
 "language_model",
 "language_models",
 "node_runtime",
 "project",
 "prompt_store",
- "regex",
 "release_channel",
 "reqwest_client",
 "serde",
@@ -610,7 +608,9 @@ dependencies = [
 "serde_json_lenient",
 "settings",
 "smol",
+ "tempfile",
 "util",
+ "walkdir",
 "workspace-hack",
 ]

--- a/Cargo.toml
+++ b/Cargo.toml
@@ -579,6 +579,7 @@ unicode-script = "0.5.7"
 url = "2.2"
 urlencoding = "2.1.2"
 uuid = { version = "1.1.2", features = ["v4", "v5", "v7", "serde"] }
+walkdir = "2.3"
 wasmparser = "0.221"
 wasm-encoder = "0.221"
 wasmtime = { version = "29", default-features = false, features = [
--- a/crates/assistant_eval/Cargo.toml
+++ b/crates/assistant_eval/Cargo.toml
@@ -27,14 +27,12 @@ fs.workspace = true
 futures.workspace = true
 gpui.workspace = true
 gpui_tokio.workspace = true
-itertools.workspace = true
 language.workspace = true
 language_model.workspace = true
 language_models.workspace = true
 node_runtime.workspace = true
 project.workspace = true
 prompt_store.workspace = true
-regex.workspace = true
 release_channel.workspace = true
 reqwest_client.workspace = true
 serde.workspace = true
@@ -42,5 +40,7 @@ serde_json.workspace = true
 serde_json_lenient.workspace = true
 settings.workspace = true
 smol.workspace = true
+tempfile.workspace = true
 util.workspace = true
+walkdir.workspace = true
 workspace-hack.workspace = true
--- a/crates/assistant_eval/README.md
+++ b/crates/assistant_eval/README.md
@@ -1,34 +1,25 @@
 # Tool Evals

-A framework for evaluating and benchmarking AI assistant performance in the Zed editor.
+A framework for evaluating and benchmarking the agent panel generations.

 ## Overview

 Tool Evals provides a headless environment for running assistants evaluations on code repositories. It automates the process of:

-1. Cloning and setting up test repositories
+1. Setting up test code and repositories
 2. Sending prompts to language models
 3. Allowing the assistant to use tools to modify code
-4. Collecting metrics on performance
+4. Collecting metrics on performance and tool usage
 5. Evaluating results against known good solutions

 ## How It Works

 The system consists of several key components:

-- **Eval**: Loads test cases from the evaluation_data directory, clones repos, and executes evaluations
+- **Eval**: Loads exercises from the zed-ace-framework repository, creates temporary repos, and executes evaluations
 - **HeadlessAssistant**: Provides a headless environment for running the AI assistant
-- **Judge**: Compares AI-generated diffs with reference solutions and scores their functional similarity
-
-The evaluation flow:
-1. An evaluation is loaded from the evaluation_data directory
-2. The target repository is cloned and checked out at a specific commit
-3. A HeadlessAssistant instance is created with the specified language model
-4. The user prompt is sent to the assistant
-5. The assistant responds and uses tools to modify code
-6. Upon completion, a diff is generated from the changes
-7. Results are saved including the diff, assistant's response, and performance metrics
-8. If a reference solution exists, a Judge evaluates the similarity of the solution
+- **Judge**: Evaluates AI-generated solutions against reference implementations and assigns scores
+- **Templates**: Defines evaluation frameworks for different tasks (Project Creation, Code Modification, Conversational Guidance)

 ## Setup Requirements

@@ -36,6 +27,7 @@ The evaluation flow:

 - Rust and Cargo
 - Git
+- Python (for report generation)
 - Network access to clone repositories
 - Appropriate API keys for language models and git services (Anthropic, GitHub, etc.)

@@ -43,35 +35,34 @@ The evaluation flow:

 Ensure you have the required API keys set, either from a dev run of Zed or via these environment variables:
 - `ZED_ANTHROPIC_API_KEY` for Claude models
-- `ZED_OPENAI_API_KEY` for OpenAI models
 - `ZED_GITHUB_API_KEY` for GitHub API (or similar)

 ## Usage

-### Running a Single Evaluation
-
-To run a specific evaluation:
-
-```bash
-cargo run -p assistant_eval -- bubbletea-add-set-window-title
-```
-
-The arguments are regex patterns for the evaluation names to run, so to run all evaluations that contain `bubbletea`, run:
-
-```bash
-cargo run -p assistant_eval -- bubbletea
-```
-
-To run all evaluations:
+### Running Evaluations

 ```bash
+# Run all tests
 cargo run -p assistant_eval -- --all
+
+# Run only specific languages
+cargo run -p assistant_eval -- --all --languages python,rust
+
+# Limit concurrent evaluations
+cargo run -p assistant_eval -- --all --concurrency 5
+
+# Limit number of exercises per language
+cargo run -p assistant_eval -- --all --max-exercises-per-language 3
 ```

-## Evaluation Data Structure
+### Evaluation Template Types

-Each evaluation should be placed in the `evaluation_data` directory with the following structure:
+The system supports three types of evaluation templates:

-* `prompt.txt`: The user's prompt.
-* `original.diff`: The `git diff` of the change anticipated for this prompt.
-* `setup.json`: Information about the repo used for the evaluation.
+1. **ProjectCreation**: Tests the model's ability to create new implementations from scratch
+2. **CodeModification**: Tests the model's ability to modify existing code to meet new requirements
+3. **ConversationalGuidance**: Tests the model's ability to provide guidance without writing code
+
+### Support Repo
+
+The [zed-industries/zed-ace-framework](https://github.com/zed-industries/zed-ace-framework) contains the analytics and reporting scripts.
--- a/crates/assistant_eval/src/eval.rs
+++ b/crates/assistant_eval/src/eval.rs
@@ -1,6 +1,8 @@
+use crate::git_commands::{run_git, setup_temp_repo};
 use crate::headless_assistant::{HeadlessAppState, HeadlessAssistant};
+use crate::{get_exercise_language, get_exercise_name, templates_eval::Template};
 use agent::RequestKind;
-use anyhow::anyhow;
+use anyhow::{Result, anyhow};
 use collections::HashMap;
 use gpui::{App, Task};
 use language_model::{LanguageModel, TokenUsage};
@@ -10,19 +12,26 @@ use std::{
    io::Write,
    path::{Path, PathBuf},
    sync::Arc,
-    time::Duration,
+    time::{Duration, SystemTime},
 };
-use util::command::new_smol_command;

-pub struct Eval {
-    pub name: String,
-    pub path: PathBuf,
-    pub repo_path: PathBuf,
-    pub eval_setup: EvalSetup,
-    pub user_prompt: String,
+#[derive(Debug, Serialize, Deserialize, Clone)]
+pub struct EvalResult {
+    pub exercise_name: String,
+    pub template_name: String,
+    pub score: String,
+    pub diff: String,
+    pub assistant_response: String,
+    pub elapsed_time_ms: u128,
+    pub timestamp: u128,
+    // Token usage fields
+    pub input_tokens: usize,
+    pub output_tokens: usize,
+    pub total_tokens: usize,
+    pub tool_use_counts: usize,
+    pub judge_model_name: String, // Added field for judge model name
 }

-#[derive(Debug, Serialize)]
 pub struct EvalOutput {
    pub diff: String,
    pub last_message: String,
@@ -38,19 +47,31 @@ pub struct EvalSetup {
    pub base_sha: String,
 }

+pub struct Eval {
+    pub repo_path: PathBuf,
+    pub eval_setup: EvalSetup,
+    pub user_prompt: String,
+}
+
 impl Eval {
-    /// Loads the eval from a path (typically in `evaluation_data`). Clones and checks out the repo
-    /// if necessary.
-    pub async fn load(name: String, path: PathBuf, repos_dir: &Path) -> anyhow::Result<Self> {
+    // Keep this method for potential future use, but mark it as intentionally unused
+    #[allow(dead_code)]
+    pub async fn load(_name: String, path: PathBuf, repos_dir: &Path) -> Result<Self> {
        let prompt_path = path.join("prompt.txt");
        let user_prompt = smol::unblock(|| std::fs::read_to_string(prompt_path)).await?;
        let setup_path = path.join("setup.json");
        let setup_contents = smol::unblock(|| std::fs::read_to_string(setup_path)).await?;
        let eval_setup = serde_json_lenient::from_str_lenient::<EvalSetup>(&setup_contents)?;
+
+        // Move this internal function inside the load method since it's only used here
+        fn repo_dir_name(url: &str) -> String {
+            url.trim_start_matches("https://")
+                .replace(|c: char| !c.is_alphanumeric(), "_")
+        }
+
        let repo_path = repos_dir.join(repo_dir_name(&eval_setup.url));
+
        Ok(Eval {
-            name,
-            path,
            repo_path,
            eval_setup,
            user_prompt,
@@ -62,9 +83,9 @@ impl Eval {
        app_state: Arc<HeadlessAppState>,
        model: Arc<dyn LanguageModel>,
        cx: &mut App,
-    ) -> Task<anyhow::Result<EvalOutput>> {
+    ) -> Task<Result<EvalOutput>> {
        cx.spawn(async move |cx| {
-            checkout_repo(&self.eval_setup, &self.repo_path).await?;
+            run_git(&self.repo_path, &["checkout", &self.eval_setup.base_sha]).await?;

            let (assistant, done_rx) =
                cx.update(|cx| HeadlessAssistant::new(app_state.clone(), cx))??;
@@ -104,9 +125,43 @@ impl Eval {

            done_rx.recv().await??;

+            // Add this section to check untracked files
+            println!("Checking for untracked files:");
+            let untracked = run_git(
+                &self.repo_path,
+                &["ls-files", "--others", "--exclude-standard"],
+            )
+            .await?;
+            if untracked.is_empty() {
+                println!("No untracked files found");
+            } else {
+                // Add all files to git so they appear in the diff
+                println!("Adding untracked files to git");
+                run_git(&self.repo_path, &["add", "."]).await?;
+            }
+
+            // get git status
+            let _status = run_git(&self.repo_path, &["status", "--short"]).await?;
+
            let elapsed_time = start_time.elapsed()?;

-            let diff = query_git(&self.repo_path, vec!["diff"]).await?;
+            // Get diff of staged changes (the files we just added)
+            let staged_diff = run_git(&self.repo_path, &["diff", "--staged"]).await?;
+
+            // Get diff of unstaged changes
+            let unstaged_diff = run_git(&self.repo_path, &["diff"]).await?;
+
+            // Combine both diffs
+            let diff = if unstaged_diff.is_empty() {
+                staged_diff
+            } else if staged_diff.is_empty() {
+                unstaged_diff
+            } else {
+                format!(
+                    "# Staged changes\n{}\n\n# Unstaged changes\n{}",
+                    staged_diff, unstaged_diff
+                )
+            };

            assistant.update(cx, |assistant, cx| {
                let thread = assistant.thread.read(cx);
@@ -132,12 +187,9 @@ impl Eval {
 }

 impl EvalOutput {
-    // Method to save the output to a directory
-    pub fn save_to_directory(
-        &self,
-        output_dir: &Path,
-        eval_output_value: String,
-    ) -> anyhow::Result<()> {
+    // Keep this method for potential future use, but mark it as intentionally unused
+    #[allow(dead_code)]
+    pub fn save_to_directory(&self, output_dir: &Path, eval_output_value: String) -> Result<()> {
        // Create the output directory if it doesn't exist
        fs::create_dir_all(&output_dir)?;

@@ -192,76 +244,305 @@ impl EvalOutput {
    }
 }

-fn repo_dir_name(url: &str) -> String {
-    url.trim_start_matches("https://")
-        .replace(|c: char| !c.is_alphanumeric(), "_")
+pub async fn read_instructions(exercise_path: &Path) -> Result<String> {
+    let instructions_path = exercise_path.join(".docs").join("instructions.md");
+    println!("Reading instructions from: {}", instructions_path.display());
+    let instructions = smol::unblock(move || std::fs::read_to_string(&instructions_path)).await?;
+    Ok(instructions)
 }

-async fn checkout_repo(eval_setup: &EvalSetup, repo_path: &Path) -> anyhow::Result<()> {
-    if !repo_path.exists() {
-        smol::unblock({
-            let repo_path = repo_path.to_path_buf();
-            || std::fs::create_dir_all(repo_path)
-        })
-        .await?;
-        run_git(repo_path, vec!["init"]).await?;
-        run_git(repo_path, vec!["remote", "add", "origin", &eval_setup.url]).await?;
-    } else {
-        let actual_origin = query_git(repo_path, vec!["remote", "get-url", "origin"]).await?;
-        if actual_origin != eval_setup.url {
-            return Err(anyhow!(
-                "remote origin {} does not match expected origin {}",
-                actual_origin,
-                eval_setup.url
-            ));
-        }
+pub async fn read_example_solution(exercise_path: &Path, language: &str) -> Result<String> {
+    // Map the language to the file extension
+    let language_extension = match language {
+        "python" => "py",
+        "go" => "go",
+        "rust" => "rs",
+        "typescript" => "ts",
+        "javascript" => "js",
+        "ruby" => "rb",
+        "php" => "php",
+        "bash" => "sh",
+        "multi" => "diff",
+        "internal" => "diff",
+        _ => return Err(anyhow!("Unsupported language: {}", language)),
+    };
+    let example_path = exercise_path
+        .join(".meta")
+        .join(format!("example.{}", language_extension));
+    println!("Reading example solution from: {}", example_path.display());
+    let example = smol::unblock(move || std::fs::read_to_string(&example_path)).await?;
+    Ok(example)
+}

-        // TODO: consider including "-x" to remove ignored files. The downside of this is that it will
-        // also remove build artifacts, and so prevent incremental reuse there.
-        run_git(repo_path, vec!["clean", "--force", "-d"]).await?;
-        run_git(repo_path, vec!["reset", "--hard", "HEAD"]).await?;
+pub async fn save_eval_results(exercise_path: &Path, results: Vec<EvalResult>) -> Result<()> {
+    let eval_dir = exercise_path.join("evaluation");
+    fs::create_dir_all(&eval_dir)?;
+
+    let eval_file = eval_dir.join("evals.json");
+
+    println!("Saving evaluation results to: {}", eval_file.display());
+    println!(
+        "Results to save: {} evaluations for exercise path: {}",
+        results.len(),
+        exercise_path.display()
+    );
+
+    // Check file existence before reading/writing
+    if eval_file.exists() {
+        println!("Existing evals.json file found, will update it");
+    } else {
+        println!("No existing evals.json file found, will create new one");
    }

-    run_git(
-        repo_path,
-        vec!["fetch", "--depth", "1", "origin", &eval_setup.base_sha],
-    )
-    .await?;
-    run_git(repo_path, vec!["checkout", &eval_setup.base_sha]).await?;
+    // Structure to organize evaluations by test name and timestamp
+    let mut eval_data: serde_json::Value = if eval_file.exists() {
+        let content = fs::read_to_string(&eval_file)?;
+        serde_json::from_str(&content).unwrap_or_else(|_| serde_json::json!({}))
+    } else {
+        serde_json::json!({})
+    };
+
+    // Get current timestamp for this batch of results
+    let timestamp = SystemTime::now()
+        .duration_since(SystemTime::UNIX_EPOCH)?
+        .as_millis()
+        .to_string();
+
+    // Group the new results by test name (exercise name)
+    for result in results {
+        let exercise_name = &result.exercise_name;
+        let template_name = &result.template_name;
+
+        println!(
+            "Adding result: exercise={}, template={}",
+            exercise_name, template_name
+        );
+
+        // Ensure the exercise entry exists
+        if eval_data.get(exercise_name).is_none() {
+            eval_data[exercise_name] = serde_json::json!({});
+        }
+
+        // Ensure the timestamp entry exists as an object
+        if eval_data[exercise_name].get(&timestamp).is_none() {
+            eval_data[exercise_name][&timestamp] = serde_json::json!({});
+        }
+
+        // Add this result under the timestamp with template name as key
+        eval_data[exercise_name][&timestamp][template_name] = serde_json::to_value(&result)?;
+    }
+
+    // Write back to file with pretty formatting
+    let json_content = serde_json::to_string_pretty(&eval_data)?;
+    match fs::write(&eval_file, json_content) {
+        Ok(_) => println!("✓ Successfully saved results to {}", eval_file.display()),
+        Err(e) => println!("✗ Failed to write results file: {}", e),
+    }

    Ok(())
 }

-async fn run_git(repo_path: &Path, args: Vec<&str>) -> anyhow::Result<()> {
-    let exit_status = new_smol_command("git")
-        .current_dir(repo_path)
-        .args(args.clone())
-        .status()
-        .await?;
-    if exit_status.success() {
-        Ok(())
-    } else {
-        Err(anyhow!(
-            "`git {}` failed with {}",
-            args.join(" "),
-            exit_status,
-        ))
-    }
-}
+pub async fn run_exercise_eval(
+    exercise_path: PathBuf,
+    template: Template,
+    model: Arc<dyn LanguageModel>,
+    judge_model: Arc<dyn LanguageModel>,
+    app_state: Arc<HeadlessAppState>,
+    base_sha: String,
+    _framework_path: PathBuf,
+    cx: gpui::AsyncApp,
+) -> Result<EvalResult> {
+    let exercise_name = get_exercise_name(&exercise_path);
+    let language = get_exercise_language(&exercise_path)?;
+    let mut instructions = read_instructions(&exercise_path).await?;
+    instructions.push_str(&format!(
+        "\n\nWhen writing the code for this prompt, use {} to achieve the goal.",
+        language
+    ));
+    let example_solution = read_example_solution(&exercise_path, &language).await?;

-async fn query_git(repo_path: &Path, args: Vec<&str>) -> anyhow::Result<String> {
-    let output = new_smol_command("git")
-        .current_dir(repo_path)
-        .args(args.clone())
-        .output()
+    println!(
+        "Running evaluation for exercise: {} with template: {}",
+        exercise_name, template.name
+    );
+
+    // Create temporary directory with exercise files
+    let temp_dir = setup_temp_repo(&exercise_path, &base_sha).await?;
+    let temp_path = temp_dir.path().to_path_buf();
+
+    if template.name == "ProjectCreation" {
+        for entry in fs::read_dir(&temp_path)? {
+            let entry = entry?;
+            let path = entry.path();
+
+            // Skip directories that start with dot (like .docs, .meta, .git)
+            if path.is_dir()
+                && path
+                    .file_name()
+                    .and_then(|name| name.to_str())
+                    .map(|name| name.starts_with("."))
+                    .unwrap_or(false)
+            {
+                continue;
+            }
+
+            // Delete regular files
+            if path.is_file() {
+                println!("  Deleting file: {}", path.display());
+                fs::remove_file(path)?;
+            }
+        }
+
+        // Commit the deletion so it shows up in the diff
+        run_git(&temp_path, &["add", "."]).await?;
+        run_git(
+            &temp_path,
+            &["commit", "-m", "Remove root files for clean slate"],
+        )
        .await?;
-    if output.status.success() {
-        Ok(String::from_utf8(output.stdout)?.trim().to_string())
-    } else {
-        Err(anyhow!(
-            "`git {}` failed with {}",
-            args.join(" "),
-            output.status
-        ))
    }
+
+    let local_commit_sha = run_git(&temp_path, &["rev-parse", "HEAD"]).await?;
+
+    // Prepare prompt based on template
+    let prompt = match template.name {
+        "ProjectCreation" => format!(
+            "I need to create a new implementation for this exercise. Please create all the necessary files in the best location.\n\n{}",
+            instructions
+        ),
+        "CodeModification" => format!(
+            "I need help updating my code to meet these requirements. Please modify the appropriate files:\n\n{}",
+            instructions
+        ),
+        "ConversationalGuidance" => format!(
+            "I'm trying to solve this coding exercise but I'm not sure where to start. Can you help me understand the requirements and guide me through the solution process without writing code for me?\n\n{}",
+            instructions
+        ),
+        _ => instructions.clone(),
+    };
+
+    let start_time = SystemTime::now();
+
+    // Create a basic eval struct to work with the existing system
+    let eval = Eval {
+        repo_path: temp_path.clone(),
+        eval_setup: EvalSetup {
+            url: format!("file://{}", temp_path.display()),
+            base_sha: local_commit_sha, // Use the local commit SHA instead of the framework base SHA
+        },
+        user_prompt: prompt,
+    };
+
+    // Run the evaluation
+    let eval_output = cx
+        .update(|cx| eval.run(app_state.clone(), model.clone(), cx))?
+        .await?;
+
+    // Get diff from git
+    let diff = eval_output.diff.clone();
+
+    // For project creation template, we need to compare with reference implementation
+    let judge_output = if template.name == "ProjectCreation" {
+        let project_judge_prompt = template
+            .content
+            .replace(
+                "<!-- ```requirements go here``` -->",
+                &format!("```\n{}\n```", instructions),
+            )
+            .replace(
+                "<!-- ```reference code goes here``` -->",
+                &format!("```{}\n{}\n```", language, example_solution),
+            )
+            .replace(
+                "<!-- ```git diff goes here``` -->",
+                &format!("```\n{}\n```", diff),
+            );
+
+        // Use the run_with_prompt method which we'll add to judge.rs
+        let judge = crate::judge::Judge {
+            original_diff: None,
+            original_message: Some(project_judge_prompt),
+            model: judge_model.clone(),
+        };
+
+        cx.update(|cx| judge.run_with_prompt(cx))?.await?
+    } else if template.name == "CodeModification" {
+        // For CodeModification, we'll compare the example solution with the LLM-generated solution
+        let code_judge_prompt = template
+            .content
+            .replace(
+                "<!-- ```reference code goes here``` -->",
+                &format!("```{}\n{}\n```", language, example_solution),
+            )
+            .replace(
+                "<!-- ```git diff goes here``` -->",
+                &format!("```\n{}\n```", diff),
+            );
+
+        // Use the run_with_prompt method
+        let judge = crate::judge::Judge {
+            original_diff: None,
+            original_message: Some(code_judge_prompt),
+            model: judge_model.clone(),
+        };
+
+        cx.update(|cx| judge.run_with_prompt(cx))?.await?
+    } else {
+        // Conversational template
+        let conv_judge_prompt = template
+            .content
+            .replace(
+                "<!-- ```query goes here``` -->",
+                &format!("```\n{}\n```", instructions),
+            )
+            .replace(
+                "<!-- ```transcript goes here``` -->",
+                &format!("```\n{}\n```", eval_output.last_message),
+            )
+            .replace(
+                "<!-- ```git diff goes here``` -->",
+                &format!("```\n{}\n```", diff),
+            );
+
+        // Use the run_with_prompt method for consistency
+        let judge = crate::judge::Judge {
+            original_diff: None,
+            original_message: Some(conv_judge_prompt),
+            model: judge_model.clone(),
+        };
+
+        cx.update(|cx| judge.run_with_prompt(cx))?.await?
+    };
+
+    let elapsed_time = start_time.elapsed()?;
+
+    // Calculate total tokens as the sum of input and output tokens
+    let input_tokens = eval_output.token_usage.input_tokens;
+    let output_tokens = eval_output.token_usage.output_tokens;
+    let tool_use_counts = eval_output.tool_use_counts.values().sum::<u32>();
+    let total_tokens = input_tokens + output_tokens;
+
+    // Get judge model name
+    let judge_model_name = judge_model.id().0.to_string();
+
+    // Save results to evaluation directory
+    let result = EvalResult {
+        exercise_name: exercise_name.clone(),
+        template_name: template.name.to_string(),
+        score: judge_output.trim().to_string(),
+        diff,
+        assistant_response: eval_output.last_message.clone(),
+        elapsed_time_ms: elapsed_time.as_millis(),
+        timestamp: SystemTime::now()
+            .duration_since(SystemTime::UNIX_EPOCH)?
+            .as_millis(),
+        // Convert u32 token counts to usize
+        input_tokens: input_tokens.try_into().unwrap(),
+        output_tokens: output_tokens.try_into().unwrap(),
+        total_tokens: total_tokens.try_into().unwrap(),
+        tool_use_counts: tool_use_counts.try_into().unwrap(),
+        judge_model_name, // Add judge model name to result
+    };
+
+    Ok(result)
 }
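
Note on the diff handling in `Eval::run` above: the change stages untracked files and then merges the staged and unstaged diffs into one string. A minimal standalone sketch of that merge rule follows. The helper name `combine_diffs` is hypothetical and does not appear in the commit; the real code builds the string inline from `run_git` output.

```rust
/// Merge staged and unstaged diff text into a single report.
/// Hypothetical helper mirroring the inline `if`/`else` chain in `Eval::run`:
/// section headers are added only when both diffs are non-empty.
fn combine_diffs(staged: &str, unstaged: &str) -> String {
    match (staged.is_empty(), unstaged.is_empty()) {
        // No unstaged changes: return the staged diff (possibly empty).
        (_, true) => staged.to_string(),
        // Only unstaged changes exist.
        (true, false) => unstaged.to_string(),
        // Both exist: label each section so the judge can tell them apart.
        (false, false) => format!(
            "# Staged changes\n{}\n\n# Unstaged changes\n{}",
            staged, unstaged
        ),
    }
}
```

Calling `combine_diffs("", "")` yields an empty string, which matches what the inline code produces when the assistant made no changes at all.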
--- a/crates/assistant_eval/src/get_exercise.rs
+++ b/crates/assistant_eval/src/get_exercise.rs
@@ -0,0 +1,149 @@
+use anyhow::{Result, anyhow};
+use std::{
+    fs,
+    path::{Path, PathBuf},
+};
+
+pub fn get_exercise_name(exercise_path: &Path) -> String {
+    exercise_path
+        .file_name()
+        .unwrap_or_default()
+        .to_string_lossy()
+        .to_string()
+}
+
+pub fn get_exercise_language(exercise_path: &Path) -> Result<String> {
+    // Extract the language from path (data/python/exercises/... => python)
+    let parts: Vec<_> = exercise_path.components().collect();
+
+    for (i, part) in parts.iter().enumerate() {
+        if i > 0 && part.as_os_str() == "eval_code" {
+            if i + 1 < parts.len() {
+                let language = parts[i + 1].as_os_str().to_string_lossy().to_string();
+                return Ok(language);
+            }
+        }
+    }
+
+    Err(anyhow!(
+        "Could not determine language from path: {:?}",
+        exercise_path
+    ))
+}
+
+pub fn find_exercises(
+    framework_path: &Path,
+    languages: &[&str],
+    max_per_language: Option<usize>,
+) -> Result<Vec<PathBuf>> {
+    let mut all_exercises = Vec::new();
+
+    println!("Searching for exercises in languages: {:?}", languages);
+
+    for language in languages {
+        let language_dir = framework_path
+            .join("eval_code")
+            .join(language)
+            .join("exercises")
+            .join("practice");
+
+        println!("Checking language directory: {:?}", language_dir);
+        if !language_dir.exists() {
+            println!("Warning: Language directory not found: {:?}", language_dir);
+            continue;
+        }
+
+        let mut exercises = Vec::new();
+        match fs::read_dir(&language_dir) {
+            Ok(entries) => {
+                for entry_result in entries {
+                    match entry_result {
+                        Ok(entry) => {
+                            let path = entry.path();
+
+                            if path.is_dir() {
+                                // Special handling for "internal" directory
+                                if *language == "internal" {
+                                    // Check for repo_info.json to validate it's an internal exercise
+                                    let repo_info_path = path.join(".meta").join("repo_info.json");
+                                    let instructions_path =
+                                        path.join(".docs").join("instructions.md");
+
+                                    if repo_info_path.exists() && instructions_path.exists() {
+                                        exercises.push(path);
+                                    }
+                                } else {
+                                    // Map the language to the file extension - original code
+                                    let language_extension = match *language {
+                                        "python" => "py",
+                                        "go" => "go",
+                                        "rust" => "rs",
+                                        "typescript" => "ts",
+                                        "javascript" => "js",
+                                        "ruby" => "rb",
+                                        "php" => "php",
+                                        "bash" => "sh",
+                                        "multi" => "diff",
+                                        _ => continue, // Skip unsupported languages
+                                    };
+
+                                    // Check if this is a valid exercise with instructions and example
+                                    let instructions_path =
+                                        path.join(".docs").join("instructions.md");
+                                    let has_instructions = instructions_path.exists();
+                                    let example_path = path
+                                        .join(".meta")
+                                        .join(format!("example.{}", language_extension));
+                                    let has_example = example_path.exists();
+
+                                    if has_instructions && has_example {
+                                        exercises.push(path);
+                                    }
+                                }
+                            }
+                        }
+                        Err(err) => println!("Error reading directory entry: {}", err),
+                    }
+                }
+            }
+            Err(err) => println!(
+                "Error reading directory {}: {}",
+                language_dir.display(),
+                err
+            ),
+        }
+
+        // Sort exercises by name for consistent selection
+        exercises.sort_by(|a, b| {
+            let a_name = a.file_name().unwrap_or_default().to_string_lossy();
+            let b_name = b.file_name().unwrap_or_default().to_string_lossy();
+            a_name.cmp(&b_name)
+        });
+
+        // Apply the limit if specified
+        if let Some(limit) = max_per_language {
+            if exercises.len() > limit {
+                println!(
+                    "Limiting {} exercises to {} for language {}",
+                    exercises.len(),
+                    limit,
+                    language
+                );
+                exercises.truncate(limit);
+            }
+        }
+
+        println!(
+            "Found {} exercises for language {}: {:?}",
+            exercises.len(),
+            language,
+            exercises
+                .iter()
+                .map(|p| p.file_name().unwrap_or_default().to_string_lossy())
+                .collect::<Vec<_>>()
+        );
+        all_exercises.extend(exercises);
+    }
+
+    Ok(all_exercises)
+}
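
Note on `get_exercise_language` above: the language is taken to be the path component that immediately follows `eval_code`. The sketch below shows the same lookup with an iterator, as an illustration only. The name `language_from_path` is hypothetical, it returns `Option` instead of `anyhow::Result`, and unlike the committed code it also accepts `eval_code` as the very first component.

```rust
use std::path::Path;

/// Return the path component that follows `eval_code`, if any.
/// Hypothetical variant of `get_exercise_language` using only the standard library.
fn language_from_path(exercise_path: &Path) -> Option<String> {
    let mut components = exercise_path.components();
    while let Some(component) = components.next() {
        if component.as_os_str() == "eval_code" {
            // The next component, when present, names the language directory.
            return components
                .next()
                .map(|c| c.as_os_str().to_string_lossy().into_owned());
        }
    }
    None
}
```

For a path laid out the way `find_exercises` builds it (`<framework>/eval_code/<language>/exercises/practice/<exercise>`), this returns the `<language>` segment; a path without an `eval_code` component returns `None`.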
--- a/crates/assistant_eval/src/git_commands.rs
+++ b/crates/assistant_eval/src/git_commands.rs
@@ -0,0 +1,125 @@
+use anyhow::{Result, anyhow};
+use serde::Deserialize;
+use std::{fs, path::Path};
+use tempfile::TempDir;
+use util::command::new_smol_command;
+use walkdir::WalkDir;
+
+#[derive(Debug, Deserialize)]
+pub struct SetupConfig {
+    #[serde(rename = "base.sha")]
+    pub base_sha: String,
+}
+
+#[derive(Debug, Deserialize)]
+pub struct RepoInfo {
+    pub remote_url: String,
+    pub head_sha: String,
+}
+
+pub async fn run_git(repo_path: &Path, args: &[&str]) -> Result<String> {
+    let output = new_smol_command("git")
+        .current_dir(repo_path)
+        .args(args)
+        .output()
+        .await?;
+
+    if output.status.success() {
+        Ok(String::from_utf8(output.stdout)?.trim().to_string())
+    } else {
+        Err(anyhow!(
+            "Git command failed: {} with status: {}",
+            args.join(" "),
+            output.status
+        ))
+    }
+}
+
+pub async fn read_base_sha(framework_path: &Path) -> Result<String> {
+    let setup_path = framework_path.join("setup.json");
+    let setup_content = smol::unblock(move || std::fs::read_to_string(&setup_path)).await?;
+    let setup_config: SetupConfig = serde_json_lenient::from_str_lenient(&setup_content)?;
+    Ok(setup_config.base_sha)
+}
+
+pub async fn read_repo_info(exercise_path: &Path) -> Result<RepoInfo> {
+    let repo_info_path = exercise_path.join(".meta").join("repo_info.json");
+    println!("Reading repo info from: {}", repo_info_path.display());
+    let repo_info_content = smol::unblock(move || std::fs::read_to_string(&repo_info_path)).await?;
+    let repo_info: RepoInfo = serde_json_lenient::from_str_lenient(&repo_info_content)?;
+
+    // Remove any quotes from the strings
+    let remote_url = repo_info.remote_url.trim_matches('"').to_string();
+    let head_sha = repo_info.head_sha.trim_matches('"').to_string();
+
+    Ok(RepoInfo {
+        remote_url,
+        head_sha,
+    })
+}
+
+pub async fn setup_temp_repo(exercise_path: &Path, _base_sha: &str) -> Result<TempDir> {
+    let temp_dir = TempDir::new()?;
+
+    // Check if this is an internal exercise by looking for repo_info.json
+    let repo_info_path = exercise_path.join(".meta").join("repo_info.json");
+    if repo_info_path.exists() {
+        // This is an internal exercise, handle it differently
+        let repo_info = read_repo_info(exercise_path).await?;
+
+        // Clone the repository to the temp directory
+        let url = repo_info.remote_url;
+        let clone_path = temp_dir.path();
+        println!(
+            "Cloning repository from {} to {}",
+            url,
+            clone_path.display()
+        );
+        run_git(
+            &std::env::current_dir()?,
+            &["clone", &url, &clone_path.to_string_lossy()],
+        )
+        .await?;
+
+        // Checkout the specified commit
+        println!("Checking out commit: {}", repo_info.head_sha);
+        run_git(temp_dir.path(), &["checkout", &repo_info.head_sha]).await?;
+
+        println!("Successfully set up internal repository");
+    } else {
+        // Regular exercise (no repo_info.json): build a fresh repo from the exercise files
+        // Copy the exercise files to the temp directory, excluding .docs and .meta
+        for entry in WalkDir::new(exercise_path).min_depth(0).max_depth(10) {
+            let entry = entry?;
+            let source_path = entry.path();
+
+            // Skip .docs and .meta directories completely
+            if source_path.starts_with(exercise_path.join(".docs"))
+                || source_path.starts_with(exercise_path.join(".meta"))
+            {
+                continue;
+            }
+
+            if source_path.is_file() {
+                let relative_path = source_path.strip_prefix(exercise_path)?;
+                let dest_path = temp_dir.path().join(relative_path);
+
+                // Make sure parent directories exist
+                if let Some(parent) = dest_path.parent() {
+                    fs::create_dir_all(parent)?;
+                }
+
+                fs::copy(source_path, dest_path)?;
+            }
+        }
+
+        // Initialize git repo in the temp directory
+        run_git(temp_dir.path(), &["init"]).await?;
+        run_git(temp_dir.path(), &["add", "."]).await?;
+        run_git(temp_dir.path(), &["commit", "-m", "Initial commit"]).await?;
+
+        println!("Created temp repo without .docs and .meta directories");
+    }
+
+    Ok(temp_dir)
+}
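Note on the copy loop above: the `.docs`/`.meta` exclusion relies on `Path::starts_with`, which compares whole path components rather than string prefixes. A minimal sketch of that predicate, assuming only the standard library (`should_copy` and the sample paths are hypothetical, not part of the commit):

```rust
use std::path::Path;

/// Hypothetical predicate: true when `candidate` lies outside the exercise's
/// `.docs` and `.meta` directories and should therefore be copied.
fn should_copy(exercise_root: &Path, candidate: &Path) -> bool {
    // `starts_with` matches component by component, so a sibling directory
    // such as `.docsify` is not treated as being inside `.docs`.
    !(candidate.starts_with(exercise_root.join(".docs"))
        || candidate.starts_with(exercise_root.join(".meta")))
}

fn main() {
    let root = Path::new("/exercises/bob");
    for path in [
        "/exercises/bob/src/lib.rs",
        "/exercises/bob/.meta/repo_info.json",
        "/exercises/bob/.docs/instructions.md",
    ] {
        println!("{} -> copy: {}", path, should_copy(root, Path::new(path)));
    }
}
```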
--- a/crates/assistant_eval/src/headless_assistant.rs
+++ b/crates/assistant_eval/src/headless_assistant.rs
@ -102,6 +102,40 @@ impl HeadlessAssistant {
                    thread.use_pending_tools(cx);
                });
            }
+            ThreadEvent::ToolConfirmationNeeded => {
+                // Automatically approve all tools that need confirmation in headless mode
+                println!("Tool confirmation needed - automatically approving in headless mode");
+
+                // Get the tools needing confirmation
+                let tools_needing_confirmation: Vec<_> = thread
+                    .read(cx)
+                    .tools_needing_confirmation()
+                    .cloned()
+                    .collect();
+
+                // Run each tool that needs confirmation
+                for tool_use in tools_needing_confirmation {
+                    if let Some(tool) = thread.read(cx).tools().tool(&tool_use.name, cx) {
+                        thread.update(cx, |thread, cx| {
+                            println!("Auto-approving tool: {}", tool_use.name);
+
+                            // Create a request to send to the tool
+                            let request = thread.to_completion_request(RequestKind::Chat, cx);
+                            let messages = Arc::new(request.messages);
+
+                            // Run the tool
+                            thread.run_tool(
+                                tool_use.id.clone(),
+                                tool_use.ui_text.clone(),
+                                tool_use.input.clone(),
+                                &messages,
+                                tool,
+                                cx,
+                            );
+                        });
+                    }
+                }
+            }
            ThreadEvent::ToolFinished {
                tool_use_id,
                pending_tool_use,
@ -127,6 +161,10 @@ impl HeadlessAssistant {
                            thread.attach_tool_results(vec![], cx);
                            thread.send_to_model(model, RequestKind::Chat, cx);
                        });
+                    } else {
+                        println!(
+                            "Warning: No active language model available to continue conversation"
+                        );
                    }
                }
            }
--- a/crates/assistant_eval/src/judge.rs
+++ b/crates/assistant_eval/src/judge.rs
@ -1,58 +1,28 @@
-use crate::eval::EvalOutput;
 use crate::headless_assistant::send_language_model_request;
 use anyhow::anyhow;
 use gpui::{App, Task};
 use language_model::{
    LanguageModel, LanguageModelRequest, LanguageModelRequestMessage, MessageContent, Role,
 };
-use std::{path::Path, sync::Arc};
+use std::sync::Arc;

 pub struct Judge {
-    pub original_diff: Option<String>,
    #[allow(dead_code)]
+    pub original_diff: Option<String>,
    pub original_message: Option<String>,
    pub model: Arc<dyn LanguageModel>,
 }

 impl Judge {
-    pub async fn load(eval_path: &Path, model: Arc<dyn LanguageModel>) -> anyhow::Result<Judge> {
-        let original_diff_path = eval_path.join("original.diff");
-        let original_diff = smol::unblock(move || {
-            if std::fs::exists(&original_diff_path)? {
-                anyhow::Ok(Some(std::fs::read_to_string(&original_diff_path)?))
-            } else {
-                anyhow::Ok(None)
-            }
-        });
-
-        let original_message_path = eval_path.join("original_message.txt");
-        let original_message = smol::unblock(move || {
-            if std::fs::exists(&original_message_path)? {
-                anyhow::Ok(Some(std::fs::read_to_string(&original_message_path)?))
-            } else {
-                anyhow::Ok(None)
-            }
-        });
-
-        Ok(Self {
-            original_diff: original_diff.await?,
-            original_message: original_message.await?,
-            model,
-        })
-    }
-
-    pub fn run(&self, eval_output: &EvalOutput, cx: &mut App) -> Task<anyhow::Result<String>> {
-        let Some(original_diff) = self.original_diff.as_ref() else {
-            return Task::ready(Err(anyhow!("No original.diff found")));
+    pub fn run_with_prompt(&self, cx: &mut App) -> Task<anyhow::Result<String>> {
+        let Some(prompt) = self.original_message.as_ref() else {
+            return Task::ready(Err(anyhow!("No prompt provided in original_message")));
        };

-        // TODO: check for empty diff?
-        let prompt = diff_comparison_prompt(&original_diff, &eval_output.diff);
-
        let request = LanguageModelRequest {
            messages: vec![LanguageModelRequestMessage {
                role: Role::User,
-                content: vec![MessageContent::Text(prompt)],
+                content: vec![MessageContent::Text(prompt.clone())],
                cache: false,
            }],
            temperature: Some(0.0),
@ -61,61 +31,7 @@ impl Judge {
        };

        let model = self.model.clone();
+        let request = request.clone();
        cx.spawn(async move |cx| send_language_model_request(model, request, cx).await)
    }
 }
-
-pub fn diff_comparison_prompt(original_diff: &str, new_diff: &str) -> String {
-    format!(
-        r#"# Git Diff Similarity Evaluation Template
-
-## Instructions
-
-Compare the two diffs and score them between 0.0 and 1.0 based on their functional similarity.
- 1.0 = Perfect functional match (achieves identical results)
- 0.0 = No functional similarity whatsoever
-
-## Evaluation Criteria
-
-Please consider the following aspects in order of importance:
-
-1. **Functional Equivalence (60%)**
-   - Do both diffs achieve the same end result?
-   - Are the changes functionally equivalent despite possibly using different approaches?
-   - Do the modifications address the same issues or implement the same features?
-
-2. **Logical Structure (20%)**
-   - Are the logical flows similar?
-   - Do the modifications affect the same code paths?
-   - Are control structures (if/else, loops, etc.) modified in similar ways?
-
-3. **Code Content (15%)**
-   - Are similar lines added/removed?
-   - Are the same variables, functions, or methods being modified?
-   - Are the same APIs or libraries being used?
-
-4. **File Layout (5%)**
-   - Are the same files being modified?
-   - Are changes occurring in similar locations within files?
-
-## Input
-
-Original Diff:
-```git
-{}
-```
-
-New Diff:
-```git
-{}
-```
-
-## Output Format
-
-THE ONLY OUTPUT SHOULD BE A SCORE BETWEEN 0.0 AND 1.0.
-
-Example output:
-0.85"#,
-        original_diff, new_diff
-    )
-}
--- a/crates/assistant_eval/src/main.rs
+++ b/crates/assistant_eval/src/main.rs
@ -1,18 +1,21 @@
 mod eval;
+mod get_exercise;
+mod git_commands;
 mod headless_assistant;
 mod judge;
+mod templates_eval;

 use clap::Parser;
-use eval::{Eval, EvalOutput};
-use futures::future;
-use gpui::{Application, AsyncApp};
-use headless_assistant::{HeadlessAppState, authenticate_model_provider, find_model};
-use itertools::Itertools;
-use judge::Judge;
-use language_model::{LanguageModel, LanguageModelRegistry};
-use regex::Regex;
+use eval::{run_exercise_eval, save_eval_results};
+use futures::stream::{self, StreamExt};
+use get_exercise::{find_exercises, get_exercise_language, get_exercise_name};
+use git_commands::read_base_sha;
+use gpui::Application;
+use headless_assistant::{authenticate_model_provider, find_model};
+use language_model::LanguageModelRegistry;
 use reqwest_client::ReqwestClient;
-use std::{cmp, path::PathBuf, sync::Arc};
+use std::{path::PathBuf, sync::Arc};
+use templates_eval::all_templates;

 #[derive(Parser, Debug)]
 #[command(
@ -21,11 +24,16 @@ use std::{cmp, path::PathBuf, sync::Arc};
    before_help = "Tool eval runner"
 )]
 struct Args {
-    /// Regexes to match the names of evals to run.
-    eval_name_regexes: Vec<String>,
-    /// Runs all evals in `evaluation_data`, causes the regex to be ignored.
+    /// Names of exercises to run; an exercise matches if its name contains any of these strings.
+    #[arg(long)]
+    exercise_names: Vec<String>,
+    /// Runs all exercises (also the default); `exercise_names`, when given, takes precedence.
    #[arg(long)]
    all: bool,
+    /// Comma-separated list of languages to evaluate (default: internal).
+    /// `internal` is data generated from the agent panel.
+    #[arg(long, default_value = "internal")]
+    languages: String,
    /// Name of the model (default: "claude-3-7-sonnet-latest")
    #[arg(long, default_value = "claude-3-7-sonnet-latest")]
    model_name: String,
@ -35,72 +43,52 @@ struct Args {
    /// Name of the judge model (default: value of `--model_name`).
    #[arg(long)]
    judge_model_name: Option<String>,
-    /// Number of evaluations to run concurrently (default: 10)
-    #[arg(short, long, default_value = "10")]
+    /// Number of evaluations to run concurrently (default: 3)
+    #[arg(short, long, default_value = "3")]
    concurrency: usize,
+    /// Maximum number of exercises to evaluate per language
+    #[arg(long)]
+    max_exercises_per_language: Option<usize>,
 }

+// Order in which templates are executed for each exercise
+const TEMPLATE_EXECUTION_ORDER: [&str; 3] = [
+    "ProjectCreation",
+    "CodeModification",
+    "ConversationalGuidance",
+];
+
 fn main() {
    env_logger::init();
    let args = Args::parse();
    let http_client = Arc::new(ReqwestClient::new());
    let app = Application::headless().with_http_client(http_client.clone());

-    let crate_dir = PathBuf::from("../zed-agent-bench");
-    let evaluation_data_dir = crate_dir.join("evaluation_data").canonicalize().unwrap();
+    // Path to the zed-ace-framework repo
+    let framework_path = PathBuf::from("../zed-ace-framework")
+        .canonicalize()
+        .unwrap();

-    let repos_dir = crate_dir.join("repos");
-    if !repos_dir.exists() {
-        std::fs::create_dir_all(&repos_dir).unwrap();
-    }
-    let repos_dir = repos_dir.canonicalize().unwrap();
+    // Split the comma-separated `languages` argument into owned Strings
+    let languages: Vec<String> = args.languages.split(',').map(|s| s.to_string()).collect();

-    let all_evals = std::fs::read_dir(&evaluation_data_dir)
-        .unwrap()
-        .map(|path| path.unwrap().file_name().to_string_lossy().to_string())
-        .collect::<Vec<_>>();
-
-    let evals_to_run = if args.all {
-        all_evals
-    } else {
-        args.eval_name_regexes
-            .into_iter()
-            .map(|regex_string| Regex::new(&regex_string).unwrap())
-            .flat_map(|regex| {
-                all_evals
-                    .iter()
-                    .filter(|eval_name| regex.is_match(eval_name))
-                    .cloned()
-                    .collect::<Vec<_>>()
-            })
-            .collect::<Vec<_>>()
-    };
-
-    if evals_to_run.is_empty() {
-        panic!("Names of evals to run must be provided or `--all` specified");
-    }
-
-    println!("Will run the following evals: {evals_to_run:?}");
-    println!("Running up to {} evals concurrently", args.concurrency);
-
-    let editor_model_name = if let Some(model_name) = args.editor_model_name {
-        model_name
-    } else {
-        args.model_name.clone()
-    };
-
-    let judge_model_name = if let Some(model_name) = args.judge_model_name {
-        model_name
-    } else {
-        args.model_name.clone()
-    };
+    println!("Using zed-ace-framework at: {:?}", framework_path);
+    println!("Evaluating languages: {:?}", languages);

    app.run(move |cx| {
        let app_state = headless_assistant::init(cx);

        let model = find_model(&args.model_name, cx).unwrap();
-        let editor_model = find_model(&editor_model_name, cx).unwrap();
-        let judge_model = find_model(&judge_model_name, cx).unwrap();
+        let editor_model = if let Some(model_name) = &args.editor_model_name {
+            find_model(model_name, cx).unwrap()
+        } else {
+            model.clone()
+        };
+        let judge_model = if let Some(model_name) = &args.judge_model_name {
+            find_model(model_name, cx).unwrap()
+        } else {
+            model.clone()
+        };

        LanguageModelRegistry::global(cx).update(cx, |registry, cx| {
            registry.set_active_model(Some(model.clone()), cx);
@ -111,6 +99,11 @@ fn main() {
        let editor_model_provider_id = editor_model.provider_id();
        let judge_model_provider_id = judge_model.provider_id();

+        let framework_path_clone = framework_path.clone();
+        let languages_clone = languages.clone();
+        let exercise_names = args.exercise_names.clone();
+        let all_flag = args.all;
+
        cx.spawn(async move |cx| {
            // Authenticate all model providers first
            cx.update(|cx| authenticate_model_provider(model_provider_id.clone(), cx))
@ -126,99 +119,150 @@ fn main() {
                .await
                .unwrap();

-            let eval_load_futures = evals_to_run
+            // Read base SHA from setup.json
+            let base_sha = read_base_sha(&framework_path_clone).await.unwrap();
+
+            // Find all exercises for the specified languages
+            let all_exercises = find_exercises(
+                &framework_path_clone,
+                &languages_clone
+                    .iter()
+                    .map(|s| s.as_str())
+                    .collect::<Vec<_>>(),
+                args.max_exercises_per_language,
+            )
+            .unwrap();
+            println!("Found {} exercises total", all_exercises.len());
+
+            // Filter exercises if specific ones were requested
+            let exercises_to_run = if !exercise_names.is_empty() {
+                // If exercise names are specified, filter by them regardless of --all flag
+                all_exercises
+                    .into_iter()
+                    .filter(|path| {
+                        let name = get_exercise_name(path);
+                        exercise_names.iter().any(|filter| name.contains(filter))
+                    })
+                    .collect()
+            } else if all_flag {
+                // Only use all_flag if no exercise names are specified
+                all_exercises
+            } else {
+                // Default behavior (no filters)
+                all_exercises
+            };
+
+            println!("Will run {} exercises", exercises_to_run.len());
+
+            // Get all templates and sort them according to the execution order
+            let mut templates = all_templates();
+            templates.sort_by_key(|template| {
+                TEMPLATE_EXECUTION_ORDER
+                    .iter()
+                    .position(|&name| name == template.name)
+                    .unwrap_or(usize::MAX)
+            });
+
+            // Create exercise eval tasks - each exercise is a single task that will run templates sequentially
+            let exercise_tasks: Vec<_> = exercises_to_run
                .into_iter()
-                .map(|eval_name| {
-                    let eval_path = evaluation_data_dir.join(&eval_name);
-                    let load_future = Eval::load(eval_name.clone(), eval_path, &repos_dir);
+                .map(|exercise_path| {
+                    let exercise_name = get_exercise_name(&exercise_path);
+                    let templates_clone = templates.clone();
+                    let model_clone = model.clone();
+                    let judge_model_clone = judge_model.clone();
+                    let app_state_clone = app_state.clone();
+                    let base_sha_clone = base_sha.clone();
+                    let framework_path_clone = framework_path_clone.clone();
+                    let cx_clone = cx.clone();
+
                    async move {
-                        match load_future.await {
-                            Ok(eval) => Some(eval),
+                        println!("Processing exercise: {}", exercise_name);
+                        let mut exercise_results = Vec::new();
+
+                        // Determine the language for this exercise
+                        let language = match get_exercise_language(&exercise_path) {
+                            Ok(lang) => lang,
                            Err(err) => {
-                                // TODO: Persist errors / surface errors at the end.
-                                println!("Error loading {eval_name}: {err}");
-                                None
+                                println!(
+                                    "Error determining language for {}: {}",
+                                    exercise_name, err
+                                );
+                                return exercise_results;
+                            }
+                        };
+
+                        // Run each template sequentially for this exercise
+                        for template in templates_clone {
+                            // For "multi" or "internal" language, only run the CodeModification template
+                            if (language == "multi" || language == "internal")
+                                && template.name != "CodeModification"
+                            {
+                                println!(
+                                    "Skipping {} template for {} language",
+                                    template.name, language
+                                );
+                                continue;
+                            }
+
+                            match run_exercise_eval(
+                                exercise_path.clone(),
+                                template.clone(),
+                                model_clone.clone(),
+                                judge_model_clone.clone(),
+                                app_state_clone.clone(),
+                                base_sha_clone.clone(),
+                                framework_path_clone.clone(),
+                                cx_clone.clone(),
+                            )
+                            .await
+                            {
+                                Ok(result) => {
+                                    println!(
+                                        "Completed {} with template {} - score: {}",
+                                        exercise_name, template.name, result.score
+                                    );
+                                    exercise_results.push(result);
+                                }
+                                Err(err) => {
+                                    println!(
+                                        "Error running {} with template {}: {}",
+                                        exercise_name, template.name, err
+                                    );
+                                }
                            }
                        }
-                    }
-                })
-                .collect::<Vec<_>>();

-            let loaded_evals = future::join_all(eval_load_futures)
-                .await
-                .into_iter()
-                .flatten()
-                .collect::<Vec<_>>();
-
-            // The evals need to be loaded and grouped by URL before concurrently running, since
-            // evals that use the same remote URL will use the same working directory.
-            let mut evals_grouped_by_url: Vec<Vec<Eval>> = loaded_evals
-                .into_iter()
-                .map(|eval| (eval.eval_setup.url.clone(), eval))
-                .into_group_map()
-                .into_values()
-                .collect::<Vec<_>>();
-
-            // Sort groups in descending order, so that bigger groups start first.
-            evals_grouped_by_url.sort_by_key(|evals| cmp::Reverse(evals.len()));
-
-            let result_futures = evals_grouped_by_url
-                .into_iter()
-                .map(|evals| {
-                    let model = model.clone();
-                    let judge_model = judge_model.clone();
-                    let app_state = app_state.clone();
-                    let cx = cx.clone();
-
-                    async move {
-                        let mut results = Vec::new();
-                        for eval in evals {
-                            let name = eval.name.clone();
-                            println!("Starting eval named {}", name);
-                            let result = run_eval(
-                                eval,
-                                model.clone(),
-                                judge_model.clone(),
-                                app_state.clone(),
-                                cx.clone(),
-                            )
-                            .await;
-                            results.push((name, result));
+                        // Save results for this exercise
+                        if !exercise_results.is_empty() {
+                            if let Err(err) =
+                                save_eval_results(&exercise_path, exercise_results.clone()).await
+                            {
+                                println!("Error saving results for {}: {}", exercise_name, err);
+                            } else {
+                                println!("Saved results for {}", exercise_name);
+                            }
                        }
-                        results
+
+                        exercise_results
                    }
                })
-                .collect::<Vec<_>>();
+                .collect();

-            let results = future::join_all(result_futures)
-                .await
-                .into_iter()
-                .flatten()
-                .collect::<Vec<_>>();
+            println!(
+                "Running {} exercises with concurrency: {}",
+                exercise_tasks.len(),
+                args.concurrency
+            );

-            // Process results in order of completion
-            for (eval_name, result) in results {
-                match result {
-                    Ok((eval_output, judge_output)) => {
-                        println!("Generated diff for {eval_name}:\n");
-                        println!("{}\n", eval_output.diff);
-                        println!("Last message for {eval_name}:\n");
-                        println!("{}\n", eval_output.last_message);
-                        println!("Elapsed time: {:?}", eval_output.elapsed_time);
-                        println!(
-                            "Assistant response count: {}",
-                            eval_output.assistant_response_count
-                        );
-                        println!("Tool use counts: {:?}", eval_output.tool_use_counts);
-                        println!("Judge output for {eval_name}: {judge_output}");
-                    }
-                    Err(err) => {
-                        // TODO: Persist errors / surface errors at the end.
-                        println!("Error running {eval_name}: {err}");
-                    }
-                }
-            }
+            // Run exercises concurrently, with each exercise running its templates sequentially
+            let all_results = stream::iter(exercise_tasks)
+                .buffer_unordered(args.concurrency)
+                .flat_map(stream::iter)
+                .collect::<Vec<_>>()
+                .await;

+            println!("Completed {} evaluation runs", all_results.len());
            cx.update(|cx| cx.quit()).unwrap();
        })
        .detach();
@ -226,18 +270,3 @@ fn main() {

    println!("Done running evals");
 }
-
-async fn run_eval(
-    eval: Eval,
-    model: Arc<dyn LanguageModel>,
-    judge_model: Arc<dyn LanguageModel>,
-    app_state: Arc<HeadlessAppState>,
-    cx: AsyncApp,
-) -> anyhow::Result<(EvalOutput, String)> {
-    let path = eval.path.clone();
-    let judge = Judge::load(&path, judge_model).await?;
-    let eval_output = cx.update(|cx| eval.run(app_state, model, cx))?.await?;
-    let judge_output = cx.update(|cx| judge.run(&eval_output, cx))?.await?;
-    eval_output.save_to_directory(&path, judge_output.to_string())?;
-    Ok((eval_output, judge_output))
-}
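Note on the template ordering in `main.rs`: sorting by position in a fixed list, with `usize::MAX` as the fallback, places known templates first in the declared order and leaves unknown ones at the end. A minimal sketch using plain strings instead of the `Template` struct (`sort_by_execution_order` and the extra names are hypothetical):

```rust
const TEMPLATE_EXECUTION_ORDER: [&str; 3] = [
    "ProjectCreation",
    "CodeModification",
    "ConversationalGuidance",
];

/// Hypothetical helper: order names by their index in the execution list.
/// Names missing from the list map to usize::MAX and therefore sort last;
/// the sort is stable, so they keep their original relative order.
fn sort_by_execution_order(names: &mut [&str]) {
    names.sort_by_key(|name| {
        TEMPLATE_EXECUTION_ORDER
            .iter()
            .position(|known| known == name)
            .unwrap_or(usize::MAX)
    });
}

fn main() {
    let mut names = vec!["Custom", "ConversationalGuidance", "ProjectCreation", "Other"];
    sort_by_execution_order(&mut names);
    println!("{:?}", names);
}
```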
--- a/crates/assistant_eval/src/templates_eval.rs
+++ b/crates/assistant_eval/src/templates_eval.rs
@ -0,0 +1,210 @@
+#[derive(Clone, Debug)]
+pub struct Template {
+    pub name: &'static str,
+    pub content: &'static str,
+}
+
+pub fn all_templates() -> Vec<Template> {
+    vec![
+        Template {
+            name: "ProjectCreation",
+            content: r#"
+# Project Creation Evaluation Template
+
+## Instructions
+
+Evaluate how well the AI assistant created a new implementation from scratch. Score it between 0.0 and 1.0 based on quality and fulfillment of requirements.
+- 1.0 = Perfect implementation that creates all necessary files with correct functionality.
+- 0.0 = Completely fails to create working files or meet requirements.
+
+Note: A git diff output is required. If no code changes are provided (i.e., no git diff output), the score must be 0.0.
+
+## Evaluation Criteria
+
+Please consider the following aspects, weighted as indicated:
+
+1. **File Creation (25%)**
+   - Did the assistant create all necessary files?
+   - Are the files appropriately named and organized?
+   - Did the assistant create a complete solution without missing components?
+
+2. **Functional Correctness (40%)**
+   - Does the implementation fulfill all specified requirements?
+   - Does it handle edge cases properly?
+   - Is it free of logical errors and bugs?
+   - Do all components work together as expected?
+
+3. **Code Quality (20%)**
+   - Is the code well-structured, readable and well-documented?
+   - Does it follow language-specific best practices?
+   - Is there proper error handling?
+   - Are naming conventions clear and consistent?
+
+4. **Architecture Design (15%)**
+   - Is the code modular and extensible?
+   - Is there proper separation of concerns?
+   - Are appropriate design patterns used?
+   - Is the overall architecture appropriate for the requirements?
+
+## Input
+
+Requirements:
+<!-- ```requirements go here``` -->
+
+Reference Implementation:
+<!-- ```reference code goes here``` -->
+
+AI-Generated Implementation (git diff output):
+<!-- ```git diff goes here``` -->
+
+## Output Format
+
+THE ONLY OUTPUT SHOULD BE A SCORE BETWEEN 0.0 AND 1.0.
+
+EXAMPLE ONE:
+
+0.92
+
+EXAMPLE TWO:
+
+0.85
+
+EXAMPLE THREE:
+
+0.78
+"#,
+        },
+        Template {
+            name: "CodeModification",
+            content: r#"
+# Code Modification Evaluation Template
+
+## Instructions
+
+Evaluate how well the AI assistant modified existing code to meet requirements. Score between 0.0 and 1.0 based on quality and appropriateness of changes.
+- 1.0 = Perfect modifications that correctly implement all requirements.
+- 0.0 = Failed to make appropriate changes or introduced serious errors.
+
+## Evaluation Criteria
+
+Please consider the following aspects in order of importance:
+
+1. **Functional Correctness (50%)**
+   - Do the modifications correctly implement the requirements?
+   - Did the assistant modify the right files and code sections?
+   - Are the changes free of bugs and logical errors?
+   - Do the modifications maintain compatibility with existing code?
+
+2. **Modification Approach (25%)**
+   - Are the changes minimal and focused on what needs to be changed?
+   - Did the assistant avoid unnecessary modifications?
+   - Are the changes integrated seamlessly with the existing codebase?
+   - Did the assistant preserve the original code style and patterns?
+
+3. **Code Quality (15%)**
+   - Are the modifications well-structured and documented?
+   - Do they follow the same conventions as the original code?
+   - Is there proper error handling in the modified code?
+   - Are the changes readable and maintainable?
+
+4. **Solution Completeness (10%)**
+   - Do the modifications completely address all requirements?
+   - Are there any missing changes or overlooked requirements?
+   - Did the assistant consider all necessary edge cases?
+
+## Input
+
+Original:
+<!-- ```reference code goes here``` -->
+
+New (git diff output):
+<!-- ```git diff goes here``` -->
+
+## Output Format
+
+THE ONLY OUTPUT SHOULD BE A SCORE BETWEEN 0.0 AND 1.0.
+
+EXAMPLE ONE:
+
+0.92
+
+EXAMPLE TWO:
+
+0.85
+
+EXAMPLE THREE:
+
+0.78
+"#,
+        },
+        Template {
+            name: "ConversationalGuidance",
+            content: r#"
+# Conversational Guidance Evaluation Template
+
+## Instructions
+
+Evaluate the quality of the AI assistant's conversational guidance and score it between 0.0 and 1.0.
+- 1.0 = Perfect guidance with ideal information gathering, clarification, and advice without writing code.
+- 0.0 = Completely unhelpful or inappropriate guidance, or the assistant wrote code when it should not have.
+
+## Evaluation Criteria
+
+ABSOLUTE REQUIREMENT:
+   - The assistant should NOT generate complete code solutions in conversation mode.
+   - If the git diff shows that the assistant wrote complete code, reduce the score significantly.
+
+1. **Information Gathering Effectiveness (30%)**
+   - Did the assistant ask relevant and precise questions?
+   - Did it efficiently narrow down the problem scope?
+   - Did it avoid unnecessary or redundant questions?
+   - Was questioning appropriately paced and contextual?
+
+2. **Conceptual Guidance (30%)**
+   - Did the assistant provide high-level approaches and strategies?
+   - Did it explain relevant concepts and algorithms?
+   - Did it offer planning advice without implementing the solution?
+   - Did it suggest a structured approach to solving the problem?
+
+3. **Educational Value (20%)**
+   - Did the assistant help the user understand the problem better?
+   - Did it provide explanations that would help the user learn?
+   - Did it guide without simply giving away answers?
+   - Did it encourage the user to think through parts of the problem?
+
+4. **Conversation Quality (20%)**
+   - Was the conversation logically structured and easy to follow?
+   - Did the assistant maintain appropriate context throughout?
+   - Was the interaction helpful without being condescending?
+   - Did the conversation reach a satisfactory conclusion with clear next steps?
+
+## Input
+
+Initial Query:
+<!-- ```query goes here``` -->
+
+Conversation Transcript:
+<!-- ```transcript goes here``` -->
+
+Git Diff:
+<!-- ```git diff goes here``` -->
+
+## Output Format
+
+THE ONLY OUTPUT SHOULD BE A SCORE BETWEEN 0.0 AND 1.0.
+
+EXAMPLE ONE:
+
+0.92
+
+EXAMPLE TWO:
+
+0.85
+
+EXAMPLE THREE:
+
+0.78
+"#,
+        },
+    ]
+}