Yehowshua/ZIm - Forgejo: Beyond coding. We Forge.

Author	SHA1	Message	Date
Richard Feldman	04c68dc0cf	Make the default repetitions be 8, and concurrency 4 (#29576 ) This is based on having observed that there is a lot of variation between runs on `n=1` and `n=3`. * With `n=8` two runs on the same branch give answers that seem close enough to be reasonably consistent. * With higher concurrency, trying to run this many repetitions seems to lead language servers to time out a lot, causing evals to fail. Release Notes: - N/A	2025-04-30 15:21:19 -04:00
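The trade-off described in this commit (more repetitions for stable scores, lower concurrency to avoid language server timeouts) can be illustrated with a minimal sketch of a bounded worker pool. This is not the eval's actual implementation: `run_bounded` is a hypothetical helper, it uses OS threads from the standard library rather than whatever task system the `eval` crate uses, and the values 8 and 4 are taken only from the commit title.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;
use std::thread;

/// Hypothetical helper: runs `job` once per item using at most `concurrency`
/// worker threads. Returns the results (in completion order) and the peak
/// number of jobs that were running at the same time.
fn run_bounded<T, R, F>(items: Vec<T>, concurrency: usize, job: F) -> (Vec<R>, usize)
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    let queue = Mutex::new(items.into_iter().collect::<VecDeque<T>>());
    let results = Mutex::new(Vec::new());
    let running = AtomicUsize::new(0);
    let peak = AtomicUsize::new(0);

    thread::scope(|scope| {
        // Spawn a fixed number of workers; each pulls from the shared queue.
        for _ in 0..concurrency.max(1) {
            scope.spawn(|| loop {
                // The lock guard is dropped at the end of this statement.
                let next = queue.lock().unwrap().pop_front();
                let Some(item) = next else { break };
                let now = running.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                let out = job(item);
                running.fetch_sub(1, Ordering::SeqCst);
                results.lock().unwrap().push(out);
            });
        }
    });

    (results.into_inner().unwrap(), peak.into_inner())
}

fn main() {
    // 8 repetitions, at most 4 at a time (values from the commit title).
    let repetitions: Vec<u32> = (0..8).collect();
    let (results, peak) = run_bounded(repetitions, 4, |n| n * 2);
    println!("{} results, peak concurrency {}", results.len(), peak);
}
```

Because the number of workers is fixed up front, no more than `concurrency` jobs can ever be in flight, regardless of how many repetitions are queued.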
Richard Feldman	c8685dc90f	Fix eval judging missing final response (#29638 ) Fixed issue where eval thread judges were not considering the last response in the thread. The problem was that they were getting the full list of messages from `last_request`, which (being a request!) did not have the response yet. Release Notes: - N/A	2025-04-29 23:02:46 -04:00
Richard Feldman	d566864891	Make code block eval resilient to indentation (#29633 ) This reduces spurious failures in the eval. Release Notes: - N/A	2025-04-30 02:13:13 +00:00
Richard Feldman	d7004030b3	Code block evals (#29619 ) Add a targeted eval for code block formatting, and revise the system prompt accordingly. ### Eval before, n=8 <img width="728" alt="eval before" src="https://github.com/user-attachments/assets/552b6146-3d26-4eaa-86f9-9fc36c0cadf2" /> ### Eval after prompt change, n=8 (excluding the new evals, so just testing the prompt change) <img width="717" alt="eval after" src="https://github.com/user-attachments/assets/c78c7a54-4c65-470c-b135-8691584cd73e" /> Release Notes: - N/A	2025-04-29 18:52:09 -04:00
Agus Zubiaga	fd17f2d8ae	agent: Enrich `grep` tool output with syntax information (#29601 ) The `grep` tool used to include 4 lines of context around the match, but the lines included would often be unhelpful. This PR improves this behavior by using the range of the parent syntax node that contains the full line(s) matched. The match headers will also now include symbol breadcrumbs so that the model can already gather code structure before/without reading files. ````md ### impl GitRepository for RealGitRepository › fn compare_checkpoints › L1278-1284 ```rust let result = git .run(&[ "diff-tree", "--quiet", &left.commit_sha.to_string(), &right.commit_sha.to_string(), ]) ``` ```` This positively impacts the `add_arg_to_trait_method` eval example with better diff output, fewer tool failures, and reduced total turns. Note: We have some plans to use an "elision" approach where we would combine all matches for a given file, skipping lines between them while keeping symbol declaration lines. The theory is that this would map more closely to the expected input for edits. For now, this PR is a significant improvement. Release Notes: - Agent: Enrich `grep` tool output with syntax information	2025-04-29 17:03:02 +00:00
Danilo Leal	bbe8d6a654	agent: Cancel pending in-edit user message upon new message submit (#29565 ) Previously, if you clicked on a user message to edit it, and then, while the user message has the editor pending, sent a new message via the textarea, the whole thread would be grayed out because we hadn't dismissed the to-be-edited pending user message. That's now fixed. Release Notes: - agent: Fixed a bug that would make the whole thread be grayed out upon sending a new message while a user message had a pending edit.	2025-04-28 18:51:41 -03:00
Marshall Bowers	ce93961fe0	agent: Add "max mode" toggle (#29549 ) This PR adds a "max mode" toggle to the Agent panel, for models that support it. Only visible to folks in the `new-billing` feature flag. Icon is just a placeholder. Release Notes: - N/A	2025-04-28 16:50:47 +00:00
Finn Evers	3a1bd38503	reqwest_client: Only register proxies with valid proxy URIs (#27773 ) Closes #27641 This PR fixes invalid proxy URIs being registered despite the URI not being a valid proxy URI. Whilst investigating #27641 , I noticed that currently any proxy URI passed to `ReqwestClient::proxy_and_user_agent` will be assigned to the created client, even if the URI is not a valid proxy URI. Given a test as an example: We create a URI here and pass it as a proxy to `ReqwestClient::proxy_and_user_agent`: https://github.com/zed-industries/zed/blob/main/crates/reqwest_client/src/reqwest_client.rs#L272-L273 In `ReqwestClient::proxy_and_user_agent` we take the proxy parameter here `9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L46)` and set it unconditionally here: `9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L62)` , not considering at all whether the proxy was successfully created above. In conclusion, we currently do not actually check whether a proxy was successfully created, but rather whether a URI is equal to itself, which trivially holds. The existing test for a malformed proxy URI `9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L293-L297)` does not check whether invalid proxies cause an error, but rather checks whether `http::Uri::from_static` panics on an invalid URI, [which it does as documented](https://docs.rs/http/latest/http/uri/struct.Uri.html#panics). Thus, the tests currently do not really check anything proxy-related and invalid proxies are assigned as valid proxies. --- This PR fixes the behaviour by considering whether the proxy was actually properly parsed and only assigning it if that is the case. Furthermore, it improves logging in case of errors so issues like the linked one are easier to debug (for the linked issue, the log will now state that the proxy schema is not supported). Lastly, it also updates the test for a malformed proxy URI. 
The test now actually checks that malformed proxy URIs are not registered for the client rather than testing the `http` crate. The update also initially caused the [test for a `socks4a` proxy](`9b40770e9f/crates/reqwest_client/src/reqwest_client.rs (L280C1-L282C50)`) to fail. This happened because the reqwest library introduced support for `socks4a` proxies in [version 0.12.13](https://github.com/seanmonstar/reqwest/blob/master/CHANGELOG.md#v01213). Thus, this PR includes a bump of the reqwest library to add proper support for socks4a proxies. Release Notes: - Added support for socks4a proxies. --------- Co-authored-by: Peter Tripp <peter@zed.dev>	2025-04-28 11:12:16 -04:00
tidely	f060918b57	zed: Remove unnecessary clones (#29513 ) `App::http_client` and `Client::http_client` both return an owned `Arc`, which they clone internally. This means we can remove unnecessary clones when calling these methods. Release Notes: - N/A	2025-04-27 19:23:37 -07:00
Michael Sloan	609c528ceb	Refactor markdown formatting utilities to avoid building intermediate strings (#29511 ) These were nearly always used when using `format!` / `write!` etc, so it makes sense to not have an intermediate `String`. Release Notes: - N/A	2025-04-27 19:04:51 +00:00
Marshall Bowers	b28756ae3f	eval: Use workspace dependencies (#29430 ) This PR updates the `eval` crate to use workspace dependencies. Also did a bit of cleanup of the `Cargo.toml`. Release Notes: - N/A	2025-04-25 16:11:26 +00:00
Marshall Bowers	a5405fcbd7	eval: Add support for reading from a `.env` file (#29426 ) This PR adds support for the eval to read environment variables from a `.env` file located in the `crates/eval` directory. For instance, you can use it to set your Anthropic API key: ``` ANTHROPIC_API_KEY=<secret> ``` Release Notes: - N/A	2025-04-25 15:53:02 +00:00
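As an illustration of the `.env` format shown in this commit, here is a minimal, standard-library-only sketch of parsing `KEY=VALUE` lines. The function `parse_dotenv` is hypothetical and is not how the `eval` crate reads the file (the commit does not say which mechanism it uses); quoting and escaping rules are deliberately simplified.

```rust
/// Hypothetical parser: extracts KEY=VALUE pairs from `.env`-style text.
/// Blank lines, `#` comments, and lines without `=` are skipped.
fn parse_dotenv(contents: &str) -> Vec<(String, String)> {
    contents
        .lines()
        .filter_map(|line| {
            let line = line.trim();
            if line.is_empty() || line.starts_with('#') {
                return None;
            }
            let (key, value) = line.split_once('=')?;
            let key = key.trim();
            if key.is_empty() {
                return None;
            }
            // Simplification: strip surrounding double quotes, no escapes.
            let value = value.trim().trim_matches('"');
            Some((key.to_string(), value.to_string()))
        })
        .collect()
}

fn main() {
    let sample = "# API credentials\nANTHROPIC_API_KEY=<secret>\n";
    for (key, _value) in parse_dotenv(sample) {
        // Print only the key so secrets never reach the terminal.
        println!("found variable: {key}");
    }
}
```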
Oleksiy Syvokon	3389327df5	eval: Add HTML overview for evaluation runs (#29413 ) This update generates a single self-contained .html file that shows an overview of evaluation threads in the browser. It's useful for: - Quickly reviewing results - Sharing evaluation runs - Debugging - Comparing models (TBD) Features: - Export thread JSON from the UI - Keyboard navigation (j/k or Ctrl + ←/→) - Toggle between compact and full views Generating the overview: - `cargo run -p eval` will write this file in the run dir's root. - Or you can call `cargo run -p eval --bin explorer` to generate it without running evals. Screenshot: ![image](https://github.com/user-attachments/assets/4ead71f6-da08-48ea-8fcb-2148d2e4b4db) Release Notes: - N/A	2025-04-25 17:49:05 +03:00
Michael Sloan	17ecf94f6f	Restructure agent context (#29233 ) Simplifies the data structures involved in agent context by removing caching and limiting the use of ContextId: * `AssistantContext` enum is now like an ID / handle to context that does not need to be updated. `ContextId` still exists but is only used for generating unique `ElementId`. * `ContextStore` has a `IndexMap<ContextSetEntry>`. Only need to keep a `HashSet<ThreadId>` consistent with it. `ContextSetEntry` is a newtype wrapper around `AssistantContext` which implements eq / hash on a subset of fields. * Thread `Message` directly stores its context. Fixes the following bugs: * If a context entry is removed from the strip and added again, it was reincluded in the next message. * Clicking file context in the thread that has been removed from the context strip didn't jump to the file. * Refresh of directory context didn't reflect added / removed files. * Deleted directories would remain in the message editor context strip. * Token counting requests didn't include image context. * File, directory, and symbol context deduplication relied on `ProjectPath` for identity, and so didn't handle renames. * Symbol context line numbers didn't update when shifted Known bugs (not fixed): * Deleting a directory causes it to disappear from messages in threads. Fixing this in a nice way is tricky. One easy fix is to store the original path and show that on deletion. It's weird that deletion would cause the name to "revert", though. Another possibility would be to snapshot context metadata on add (ala `AddedContext`), and keep that around despite deletion. Release Notes: - N/A	2025-04-24 21:29:33 +00:00
Richard Feldman	720dfee803	Treat invalid JSON in tool calls as failed tool calls (#29375 ) Release Notes: - N/A --------- Co-authored-by: Max <max@zed.dev> Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com>	2025-04-24 16:54:27 -04:00
Max Brunsfeld	f125353b6f	Add tree-sitter example to the eval (#29321 ) Interesting things about this example: * It's a useful, non-trivial change I made with the agent in Tree-sitter * It runs fast * It frequently showcases edit file errors * It occasionally completely errors out due to errors parsing tool call input JSON Release Notes: - N/A	2025-04-23 18:46:38 -07:00
Agus Zubiaga	8b5835de17	agent: Improve initial file search quality (#29317 ) This PR significantly improves the quality of the initial file search that occurs when the model doesn't yet know the full path to a file it needs to read/edit. Previously, the assertions in file_search often failed on main as the model attempted to guess full file paths. On this branch, it reliably calls `find_path` (previously `path_search`) before reading files. After getting the model to find paths first, I noticed it would try using `grep` instead of `path_search`. This motivated renaming `path_search` to `find_path` (continuing the analogy to unix commands) and adding system prompt instructions about proper tool selection. Note: I know the command is just called `find`, but that seemed too general. In my eval runs, the `file_search` example improved from 40% ± 10% to 98% ± 2%. The only assertion I'm seeing occasionally fail is "glob starts with `**` or project". We can probably add some instructions in that regard. Release Notes: - N/A	2025-04-23 21:24:41 -03:00
Agus Zubiaga	45d3f5168a	eval: New `add_arg_to_trait_method` example (#29297 ) Release Notes: - N/A --------- Co-authored-by: Richard Feldman <oss@rtfeldman.com>	2025-04-23 18:46:39 +00:00
Danilo Leal	8366cd0b52	agent: Render diffs for the edit file tool (#29234 ) This PR implements the `ToolCard` for the edit file tool, which allows us to display an editor with a diff in the thread view with the changes performed by the model. - [x] Fix buffer sometimes displaying empty - [x] Stop buffer from scrolling together with the thread - [x] Fix multibuffer header sometimes appearing - [x] Fix buffer height issue - [x] Implement "full height" expand button - [x] Add "Jump To File" functionality - [x] Polish and refine styles Release Notes: - agent: Added diff preview cards in the thread view for edits performed by the agent. --------- Co-authored-by: João Marcos <marcospb19@hotmail.com> Co-authored-by: Richard Feldman <oss@rtfeldman.com> Co-authored-by: Agus Zubiaga <hi@aguz.me> Co-authored-by: Conrad Irwin <conrad.irwin@gmail.com>	2025-04-23 15:43:33 -03:00
Oleksiy Syvokon	f69aeb6311	Do not log unfinished tools use that are in the middle of streaming (#29275 ) Release Notes: - N/A	2025-04-23 13:19:01 +00:00
Oleksiy Syvokon	76a78b550b	eval: Write JSON-serialized thread (#29271 ) This adds `last.message.json` file that contains the full request plus response (serialized as a message from assistant for consistency with other messages). Motivation: to capture more info and to make analysis of finished runs easier. Release Notes: - N/A	2025-04-23 15:22:19 +03:00
Agus Zubiaga	ce1a674eba	eval: Fine-grained assertions (#29246 ) - Support programmatic examples ([example](`17feb260a0/crates/eval/src/examples/file_search.rs`)) - Combine data-driven example declarations into a single `.toml` file ([example](`17feb260a0/crates/eval/src/examples/find_and_replace_diff_card.toml`)) - Run judge on individual assertions (previously called "criteria") - Report judge and programmatic assertions in one combined table Note: We still need to work on concept naming <img width=400 src="https://github.com/user-attachments/assets/fc719c93-467f-412b-8d47-68821bd8a5f5"> Release Notes: - N/A --------- Co-authored-by: Richard Feldman <oss@rtfeldman.com> Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com> Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>	2025-04-22 23:58:58 -03:00
Max Brunsfeld	36d02de784	Rework eval to support interpretable scores and efficient repetitions (#29197 ) ### Problem We want to start continuously tracking our progress on agent evals over time. As part of this, we'd like the score to have a clear, interpretable meaning. Right now, it's a number from 0 to 5, but it's not clear what any particular number means. In addition, scores vary widely from run to run, because the agent's output is non-deterministic. We try to stabilize the score using a panel of judges, but the behavior of the agent itself varies much more widely than the judges' scores for a given run. ### Solution * explicit meanings of scores - In this PR, we're prescribing the diff and thread criteria files so that they must be unordered lists of assertions. For both the thread and the diff, rather than providing an abstract score, the judge's task is simply to count how many of these assertions are satisfied. A percentage score can be derived from this number, divided by the total number of assertions. * repetitions - Rather than running each example once, and judging it N times, we'll run the example N times. Right now, I'm just judging the output once per run, because I believe that with these clearer scoring criteria, the main source of non-determinism will be the agent's behavior, not the judge's. ### Questions * accounting for diagnostic errors - Previously, the judge was asked to incorporate diagnostics into their abstract scores. Now that the "score" is determined directly from the criteria, the diagnostics will not be captured in the score. How should the diagnostics be accounted for in the eval? One thought is - let's simply count and report the number of errors remaining after the agent finishes, as a separate field of the run (along with diff score and thread score). We could consider normalizing it using the total lines of added code (like errors per 100 lines of code added) in order to give it some semblance of stability between examples. 
* repetitions - How many repetitions should we run on CI? Each repetition takes significant time, but I think running more than one repetition will make the scores significantly less volatile. ### Todo * [x] Fix `--concurrency` implementation so that only N tasks are spawned * [x] Support `--repetitions` efficiently (re-using the same worktree) * [x] Restructure judge prompts to count passing criteria, not compute abstract score * [x] Report total number of diagnostics in some way * [x] Format output nicely Release Notes: - N/A or Added/Fixed/Improved ... --------- Co-authored-by: Antonio Scandurra <me@as-cii.com>	2025-04-22 14:00:09 +00:00
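The scoring arithmetic proposed in this commit can be sketched as follows. This is only an illustration of the formulas stated in the message (passed assertions divided by total assertions, and errors per 100 added lines); the function names and the choice to return `None` for undefined cases are assumptions, not the eval's actual code.

```rust
/// Percentage of assertions the judge counted as satisfied.
/// Returns `None` when there are no assertions or the count is inconsistent.
fn assertion_score(passed: u32, total: u32) -> Option<f64> {
    if total == 0 || passed > total {
        return None;
    }
    Some(100.0 * f64::from(passed) / f64::from(total))
}

/// Remaining diagnostic errors normalized per 100 lines of added code.
/// Returns `None` when no lines were added, since the ratio is undefined.
fn errors_per_100_lines(error_count: u32, lines_added: u32) -> Option<f64> {
    if lines_added == 0 {
        return None;
    }
    Some(100.0 * f64::from(error_count) / f64::from(lines_added))
}

/// Mean score across repetitions of the same example.
fn mean(scores: &[f64]) -> Option<f64> {
    if scores.is_empty() {
        return None;
    }
    Some(scores.iter().sum::<f64>() / scores.len() as f64)
}

fn main() {
    // Hypothetical run: 3 of 4 assertions satisfied, 2 errors in 50 added lines.
    println!("{:?}", assertion_score(3, 4));
    println!("{:?}", errors_per_100_lines(2, 50));
    println!("{:?}", mean(&[50.0, 100.0]));
}
```

Keeping the diagnostic rate as a separate value, rather than folding it into the score, matches the suggestion in the message to report it as its own field of the run.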
Nathan Sobo	458ffaa134	Add new action to run agent eval (#29158 ) The old one wasn't linking, and https://github.com/zed-industries/zed/pull/29081 has a bunch of merge conflicts. Wanted to start simple/small. ## Todo * [x] Remove low-signal examples * [x] Make the eval run on a cron, on main, and on any PR with the `run-eval` label * [x] Noise in logs about failure to write settings ``` [2025-04-21T20:45:04Z ERROR settings] Failed to write settings to file "/home/runner/.config/zed/settings.json" Caused by: No such file or directory (os error 2) at path "/home/runner/.config/zed/.tmpLewFEs" ``` * [x] `Agentic loop stalled` (https://github.com/zed-industries/zed/actions/runs/14581044243/job/40897622894) * [x] Make sure that events are recorded in snowflake * [ ] Change judge criteria to be more explicit about meanings of scores Release Notes: - N/A --------- Co-authored-by: Antonio Scandurra <me@as-cii.com> Co-authored-by: Agus Zubiaga <hi@aguz.me> Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com> Co-authored-by: Thomas Mickley-Doyle <tmickleydoyle@gmail.com>	2025-04-21 21:30:21 -07:00
Michael Sloan	70c51b513b	agent eval: Default to also running typescript examples (#29185 ) Release Notes: - N/A	2025-04-21 23:59:35 +00:00
Michael Sloan	9249919b7a	Write `{result_count}.diff` and `last.diff` eval run outputs (#29181 ) These are only written when the diff has changed. `patch.diff` has been removed as it's redundant with `last.diff`. It can be convenient to open `last.diff` and use undo/redo to navigate its history. Release Notes: - N/A	2025-04-21 23:19:07 +00:00
Richard Feldman	4f2f9ff762	Streaming tool calls (#29179 ) https://github.com/user-attachments/assets/7854a737-ef83-414c-b397-45122e4f32e8 Release Notes: - Create file and edit file tools now stream their tool descriptions, so you can see what they're doing sooner. --------- Co-authored-by: Marshall Bowers <git@maxdeviant.com>	2025-04-21 22:28:32 +00:00
Thomas Mickley-Doyle	733cd6b68c	agent: Remove non-rust examples from evals (#29139 ) Release Notes: - N/A	2025-04-21 12:55:24 -07:00
Michael Sloan	0f3ac38332	Agent eval: Copy `.rules` file into eval worktree for examples based on Zed (#29116 ) Also reverts #29108, which cherry-picked the rules file for an eval example. Release Notes: - N/A	2025-04-21 12:02:44 -06:00
Conrad Irwin	9d35f0389d	debugger: More tidy up for SSH (#28993 ) Split `locator` out of DebugTaskDefinition to make it clearer when location needs to happen. Release Notes: - N/A --------- Co-authored-by: Anthony Eid <hello@anthonyeid.me> Co-authored-by: Anthony <anthony@zed.dev> Co-authored-by: Cole Miller <m@cole-miller.net>	2025-04-21 16:00:03 +00:00
Antonio Scandurra	97ab0980d1	Start tracking tool failure rates in eval (#29122 ) This pull request will print all the used tools and their failure rates. The objective goal should be to minimize that failure rate. @tmickleydoyle: this also changes the telemetry event to report `tool_metrics` as opposed to `tool_use_counts`. Ideally I'd love to be able to plot failure rates by tool and hopefully see that percentage go down. Can we do that with the data we're tracking with this pull request? Release Notes: - N/A	2025-04-21 16:16:43 +02:00
Agus Zubiaga	ceeae790b7	eval: Improve lang server idle detection (#29135 ) Brings back #29013 after it was accidentally reverted by https://github.com/zed-industries/zed/pull/28961/commits/e9bb15b9063615762c866c30aaf646acb12af1f3. Release Notes: - N/A	2025-04-21 00:17:28 +00:00
Nathan Sobo	107d8ca483	Rename regex search tool to grep and accept an include glob pattern (#29100 ) This PR renames the `regex_search` tool to `grep` because I think it conveys more meaning to the model, the idea of searching the filesystem with a regular expression. It's also one word and the model seems to be using it effectively after some additional prompt tuning. It also takes an include pattern to filter on the specific files we try to search. I'd like to encourage the model to scope its searches more aggressively, as in my testing, I'm only seeing it filter on file extension. Release Notes: - N/A	2025-04-20 00:53:30 +00:00
Michael Sloan	a91948aeb4	agent: Reorder some linux keybindings to match mac keybindings (#29107 ) Release Notes: - Made keybindings for agent panel closer to the precedence order used on Mac. This fixes use of `enter` to add context from the menu triggered by `@` referencing.	2025-04-20 00:01:43 +00:00
Michael Sloan	fbf7caf93e	Default to fast model for thread summaries and titles + don't include system prompt / context / thinking segments (#29102 ) * Adds a fast / cheaper model to providers and defaults thread summarization to this model. Initial motivation for this was that https://github.com/zed-industries/zed/pull/29099 would cause these requests to fail when used with a thinking model. It doesn't seem correct to use a thinking model for summarization. * Skips system prompt, context, and thinking segments. * If tool use is happening, allows 2 tool uses + one more agent response before summarizing. Downside of this is that there was potential for some prefix cache reuse before, especially for title summarization (thread summarization omitted tool results and so would not share a prefix for those). This seems fine as these requests should typically be fairly small. Even for full thread summarization, skipping all tool use / context should greatly reduce the token use. Release Notes: - N/A	2025-04-19 23:26:29 +00:00
Bennet Bo Fenner	bafc086d27	agent: Preserve thinking blocks between requests (#29055 ) Looks like the required backend component of this was deployed. https://github.com/zed-industries/monorepo/actions/runs/14541199197 Release Notes: - N/A --------- Co-authored-by: Antonio Scandurra <me@as-cii.com> Co-authored-by: Agus Zubiaga <hi@aguz.me> Co-authored-by: Richard Feldman <oss@rtfeldman.com> Co-authored-by: Nathan Sobo <nathan@zed.dev>	2025-04-19 20:12:03 +00:00
Michael Sloan	d88b06a5dc	Simplify language model registry + only emit change events on change (#29086 ) * Now only does default fallback logic in the registry * Only emits change events when there is actually a change Release Notes: - N/A	2025-04-19 08:26:42 +00:00
Michael Sloan	98ceffe026	Pretty tool inputs in eval output markdown + numbered assistant messages (#29082 ) Release Notes: - N/A	2025-04-19 06:59:22 +00:00
Nathan Sobo	bab28560ef	Systematically optimize agentic editing performance (#28961 ) Now that we've established a proper eval in tree, this PR reboots our agent loop back to a set of minimal tools and simpler prompts. We should aim to get this branch feeling subjectively competitive with what's on main and then merge it, and build from there. Let's invest in our eval and use it to drive better performance of the agent loop. How you can help: Pick an example, and then make the outcome faster or better. It's fine to even use your own subjective judgment, as our evaluation criteria likely need tuning as well at this point. Focus on making the agent work better in your own subjective experience first. Let's focus on simple/practical improvements to make this thing work better, then determine how we can craft our judgment criteria to lock those improvements in. Release Notes: - N/A --------- Co-authored-by: Max <max@zed.dev> Co-authored-by: Antonio <antonio@zed.dev> Co-authored-by: Agus <agus@zed.dev> Co-authored-by: Richard <richard@zed.dev> Co-authored-by: Max Brunsfeld <maxbrunsfeld@gmail.com> Co-authored-by: Antonio Scandurra <me@as-cii.com> Co-authored-by: Michael Sloan <mgsloan@gmail.com>	2025-04-19 02:47:59 +00:00
Marshall Bowers	7abe2c9c31	agent: Attach thread ID and prompt ID to telemetry events (#29069 ) This PR attaches the thread ID and the new prompt ID to telemetry events for completions in the Agent panel. Release Notes: - N/A --------- Co-authored-by: Mikayla Maki <mikayla.c.maki@gmail.com>	2025-04-18 20:41:02 +00:00
Michael Sloan	327fee4d22	Init prompt store in agent eval (#29068 ) Needed after #28915 Release Notes: - N/A	2025-04-18 20:06:34 +00:00
Michael Sloan	502a0f6535	agent: Use default prompts from prompt library in system prompt (#28915 ) Related to #28490. - Default prompts from the prompt library are now included as "user rules" in the system prompt. - Presence of these user rules is shown at the beginning of the thread in the UI. - Now uses an `Entity<PromptStore>` instead of an `Arc<PromptStore>`. Motivation for this is emitting a `PromptsUpdatedEvent`. - Now disallows concurrent reloading of the system prompt. Before this change it was possible for reloads to race. Release Notes: - agent: Added support for including default prompts from the Prompt Library as "user rules" in the system prompt. --------- Co-authored-by: Danilo Leal <daniloleal09@gmail.com>	2025-04-18 09:32:35 -06:00
Marshall Bowers	c2cd4fd7a1	agent: Show request usage in the panel (#29006 ) This PR adds a banner showing request usage in the Agent panel: <img width="640" alt="Screenshot 2025-04-17 at 5 51 46 PM" src="https://github.com/user-attachments/assets/e0eb036c-57c1-441c-bbab-7dab1c6e56d9" /> Only visible to users on the new billing. Note to Joseph: Doesn't need to be cherry-picked to Preview. Release Notes: - N/A --------- Co-authored-by: Nate <nate@zed.dev>	2025-04-17 22:16:57 +00:00
Oleksiy Syvokon	6dd622d6c3	eval: Fix git revision existence check (#28959 ) This change fixes a bug in the worktree initialization. Details: `git rev-parse --verify $HASH` just checks that $HASH is a well-formed hash and will successfully return even if $HASH doesn't exist. Release Notes: - N/A	2025-04-17 19:57:37 +03:00
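One way to check that a revision actually exists, sketched here as an assumption rather than as the fix this commit applied, is to append the `^{commit}` peeling operator so that `git rev-parse --verify` must resolve the name to a commit object in the repository. The helper names `revision_check_args` and `revision_exists` are hypothetical.

```rust
use std::path::Path;
use std::process::Command;

/// Builds arguments for `git` that succeed only if `revision` names an
/// existing commit. The `^{commit}` suffix forces an object lookup, so a
/// well-formed but unknown hash fails.
fn revision_check_args(revision: &str) -> Vec<String> {
    vec![
        "rev-parse".to_string(),
        "--verify".to_string(),
        "--quiet".to_string(),
        format!("{revision}^{{commit}}"),
    ]
}

/// Runs the check inside `repo_dir`. Returns an error only if `git` itself
/// could not be started.
fn revision_exists(repo_dir: &Path, revision: &str) -> std::io::Result<bool> {
    let output = Command::new("git")
        .current_dir(repo_dir)
        .args(revision_check_args(revision))
        .output()?;
    Ok(output.status.success())
}

fn main() {
    println!("git {}", revision_check_args("HEAD").join(" "));
    match revision_exists(Path::new("."), "HEAD") {
        Ok(found) => println!("HEAD resolves to a commit: {found}"),
        Err(error) => println!("could not run git: {error}"),
    }
}
```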
Thomas Mickley-Doyle	8de53bd89f	agent: Add git commit ID to the eval telemetry data (#28895 ) Release Notes: - N/A	2025-04-16 14:13:43 -05:00
Michael Sloan	320abe9b22	Agent Eval: Check if SHA already fetched (#28846 ) Release Notes: - N/A	2025-04-16 06:54:22 +00:00
Michael Sloan	9a9f2e71ca	Agent Eval: Initial support for running examples repeatedly (#28844 ) Not ideal as it creates a separate worktree for each repetition Release Notes: - N/A	2025-04-16 06:35:55 +00:00
Michael Sloan	609895d95f	Agent Eval: bounded concurrency (#28843 ) Release Notes: - N/A	2025-04-16 00:05:46 -06:00
Michael Sloan	da2d8bd845	Agent Eval: Distinguish tool successes and failures in log (#28839 ) Release Notes: - N/A	2025-04-15 22:51:33 -06:00
Thomas Mickley-Doyle	222d4a2546	agent: Add telemetry for eval runs (#28816 ) Release Notes: - N/A --------- Co-authored-by: Joseph <joseph@zed.dev>	2025-04-16 02:54:26 +00:00


113 commits