Add more eval examples + filtering examples by language + fix git concurrent usage (#28719)
Release Notes: - N/A --------- Co-authored-by: michael <michael@zed.dev> Co-authored-by: agus <agus@zed.dev>
This commit is contained in:
parent
a8b1ef3531
commit
d74f0735c2
76 changed files with 365 additions and 8 deletions
1
Cargo.lock
generated
1
Cargo.lock
generated
|
@ -326,7 +326,6 @@ dependencies = [
|
|||
"serde_json",
|
||||
"strum",
|
||||
"thiserror 2.0.12",
|
||||
"util",
|
||||
"workspace-hack",
|
||||
]
|
||||
|
||||
|
|
|
@ -25,5 +25,4 @@ serde.workspace = true
|
|||
serde_json.workspace = true
|
||||
strum.workspace = true
|
||||
thiserror.workspace = true
|
||||
util.workspace = true
|
||||
workspace-hack.workspace = true
|
||||
|
|
|
@ -10,7 +10,6 @@ use http_client::{AsyncBody, HttpClient, Method, Request as HttpRequest};
|
|||
use serde::{Deserialize, Serialize};
|
||||
use strum::{EnumIter, EnumString};
|
||||
use thiserror::Error;
|
||||
use util::ResultExt as _;
|
||||
|
||||
pub use supported_countries::*;
|
||||
|
||||
|
@ -363,11 +362,25 @@ pub struct RateLimitInfo {
|
|||
|
||||
impl RateLimitInfo {
|
||||
fn from_headers(headers: &HeaderMap<HeaderValue>) -> Self {
|
||||
// Check if any rate limit headers exist
|
||||
let has_rate_limit_headers = headers
|
||||
.keys()
|
||||
.any(|k| k.as_str().starts_with("anthropic-ratelimit-"));
|
||||
|
||||
if !has_rate_limit_headers {
|
||||
return Self {
|
||||
requests: None,
|
||||
tokens: None,
|
||||
input_tokens: None,
|
||||
output_tokens: None,
|
||||
};
|
||||
}
|
||||
|
||||
Self {
|
||||
requests: RateLimit::from_headers("requests", headers).log_err(),
|
||||
tokens: RateLimit::from_headers("tokens", headers).log_err(),
|
||||
input_tokens: RateLimit::from_headers("input-tokens", headers).log_err(),
|
||||
output_tokens: RateLimit::from_headers("output-tokens", headers).log_err(),
|
||||
requests: RateLimit::from_headers("requests", headers).ok(),
|
||||
tokens: RateLimit::from_headers("tokens", headers).ok(),
|
||||
input_tokens: RateLimit::from_headers("input-tokens", headers).ok(),
|
||||
output_tokens: RateLimit::from_headers("output-tokens", headers).ok(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
3
crates/eval/examples/auth_session_management/base.toml
Normal file
3
crates/eval/examples/auth_session_management/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/workos/authkit-js.git"
|
||||
revision = "949345d85782a93e8f1738ec31823948ffc26301"
|
||||
language_extension = "ts"
|
10
crates/eval/examples/auth_session_management/criteria.md
Normal file
10
crates/eval/examples/auth_session_management/criteria.md
Normal file
|
@ -0,0 +1,10 @@
|
|||
1. Add a new test case in `create-client.test.ts` for when the `returnTo` option is provided during sign-out. It verifies that the sign-out URL includes the correct `return_to` query parameter with the provided URL. The test sets up a mock client, calls signOut with a returnTo value, and asserts that the resulting URL contains the expected session_id and return_to parameters while maintaining the correct API endpoint structure.
|
||||
2. Modifies the `signOut` method in `create-client.ts` to accept an optional options parameter containing a returnTo string. Instead of directly passing the sessionId to getLogoutUrl, it now passes an object containing both the sessionId and the returnTo value from the options. The method maintains its existing behavior of checking for an access token and clearing session data when a URL is available.
|
||||
3. Updates the HTTP client tests in `http-client.test.ts` to reflect the new getLogoutUrl signature. It adds a test case for the basic logout URL and a new describe block for when returnTo is provided, verifying that the URL includes the properly encoded return_to parameter. The test ensures the URL construction handles both cases correctly.
|
||||
4. Modifies the `getLogoutUrl` method in `http-client.ts` to accept an object parameter with sessionId and returnTo properties instead of just a sessionId string. It maintains the base URL construction but now conditionally adds the return_to query parameter only when a returnTo value is provided, while always including the session_id parameter. The method handles URL construction and parameter encoding internally.
|
||||
5. Updates the session initialization logic in `create-client.ts` to check for either a `workos-has-session` cookie or a refresh token (retrieved via `getRefreshToken`). This allows the client to refresh sessions even if no `code` is present in the URL, especially in development environments.
|
||||
6. Adds corresponding test coverage in `create-client.test.ts`:
|
||||
- When no code is in the URL but the `workos-has-session` cookie exists, the session should be refreshed.
|
||||
- When devMode is enabled and a refresh token is present in localStorage, the session should be refreshed.
|
||||
- When devMode is enabled but no refresh token exists, the client should be created without making any network requests.
|
||||
- When neither a code, cookie, nor refresh token is present, the client should initialize without refreshing.
|
3
crates/eval/examples/auth_session_management/prompt.md
Normal file
3
crates/eval/examples/auth_session_management/prompt.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
I need to improve our logout feature. When users sign out, they should be able to specify a return URL to redirect to afterward. Right now, signing out just takes them to a default page, but we want to support custom redirects (like back to the homepage or a login screen). The URL should be safely included in the logout request. Make sure existing logouts still work normally when no redirect is specified.
|
||||
|
||||
Also, note that we updated how the client initializes its session. It should now check for either a `workos-has-session` cookie or a valid refresh token (even in devMode). This ensures that sessions are refreshed appropriately even without a code in the URL. Be sure this logic is covered by the minimum tests.
|
3
crates/eval/examples/checkpoint_stability/base.toml
Normal file
3
crates/eval/examples/checkpoint_stability/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/cline/cline.git"
|
||||
revision = "a26494e5cc453f9c7e148d35895fda3f74d03284"
|
||||
language_extension = "ts"
|
5
crates/eval/examples/checkpoint_stability/criteria.md
Normal file
5
crates/eval/examples/checkpoint_stability/criteria.md
Normal file
|
@ -0,0 +1,5 @@
|
|||
1. A new changeset file is created to document a patch that improves diff editing animations and enhances prompts for large file edits. An indicator showing the number of diff edits is also added next to each file path.
|
||||
2. In `diff.ts`, the error message thrown when a `SEARCH` block doesn’t match content has been updated to clarify that the mismatch could be due to out-of-order blocks.
|
||||
3. In `responses.ts`, the assistant response for diff mismatches now recommends limiting to 1–3 `SEARCH/REPLACE` blocks at a time for large files. It also simplifies fallback instructions for using the `write_to_file` tool.
|
||||
4. The `DiffViewProvider.ts` file has been updated to replace line-by-line animations with chunk-based updates for better performance. For large diffs, a smooth scrolling animation is introduced to maintain visual context. Small diffs still scroll directly.
|
||||
5. In `CodeAccordian.tsx`, a new visual indicator displays the number of `REPLACE` blocks in the code diff using a diff icon and count, providing quick insight into the volume of changes.
|
7
crates/eval/examples/checkpoint_stability/prompt.md
Normal file
7
crates/eval/examples/checkpoint_stability/prompt.md
Normal file
|
@ -0,0 +1,7 @@
|
|||
We're trying to improve both performance and usability when working with large diffs in the editor. A few areas need attention:
|
||||
|
||||
First, the current diff animation applies updates line-by-line, which can feel slow and visually jarring for large edits. Could you revise the logic so that we update the editor in larger chunks instead? For smaller diffs, direct scrolling to the edited line is fine, but for larger changes, it would be great to implement a smooth scrolling animation that steps through the affected region before settling at the final line.
|
||||
|
||||
Second, the current error message when a SEARCH block doesn't match is a bit too vague. Let's make it clearer that the issue could be due to out-of-order or imprecise SEARCH/REPLACE blocks, especially when working with multiple blocks. It might also help to add a suggestion that users try only 1–3 changes at a time for large files before retrying.
|
||||
|
||||
Finally, in the file accordion UI, it would be useful to show how many edits a file contains. Could you parse the diff content and display a count of REPLACE blocks next to the file path, maybe with a small icon for clarity?
|
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/punkpeye/awesome-mcp-servers.git"
|
||||
revision = "5480a9849b01ae8a5c1433d75ad0415975609571"
|
||||
language_extension = "md"
|
|
@ -0,0 +1,5 @@
|
|||
1. The diff shows changes to `README.md`, specifically adding a new entry to the "Tools and integrations" list. The new entry is for `@iaptic/mcp-server-iaptic`, which provides access to customer purchase and revenue data.
|
||||
2. The added line includes:
|
||||
- The GitHub repository URL
|
||||
- Three emojis: 🎖️ (possibly representing awards or achievements), 📇 (profiles or contacts), and ☁️ (cloud)
|
||||
- A description of the tool's functionality: "Connect with [iaptic](https://www.iaptic.com) to ask about your Customer Purchases, Transaction data and App Revenue statistics"
|
|
@ -0,0 +1,3 @@
|
|||
Please add a new tool entry to the README.md file's integration list: "@iaptic/mcp-server-iaptic" with GitHub link, described as "Connect with [iaptic](https://www.iaptic.com) to ask about your Customer Purchases, Transaction data and App Revenue statistics", tagged with the following emojis: 🎖️ 📇 ☁️. Place it appropriately in the existing tools section, following the current alphabetical or category-based order.
|
||||
|
||||
Edit the README file with the above, new resource
|
3
crates/eval/examples/debian_image_builder/base.toml
Normal file
3
crates/eval/examples/debian_image_builder/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/avkcode/container-tools.git"
|
||||
revision = "34137bb453b4d2dd28b08bd80e26bc3105a50ada"
|
||||
language_extension = "sh"
|
4
crates/eval/examples/debian_image_builder/criteria.md
Normal file
4
crates/eval/examples/debian_image_builder/criteria.md
Normal file
|
@ -0,0 +1,4 @@
|
|||
1. Changes to the Makefile where the parameter "--keyrign" was corrected to "--keyring" in multiple build targets including debian11, debian11-java, debian11-java-slim, debian11-graal, debian11-graal-slim, debian11-corretto, debian11-java-slim-maven, debian11-java-slim-gradle, debian11-graal-slim-maven, and debian11-graal-slim-gradle. This appears to be a typo fix across all Java-related build configurations in the Makefile.
|
||||
2. Introduces significant enhancements to the debian/mkimage.sh script, including adding a usage function with detailed documentation, improving error handling for command-line arguments, and fixing the "--keyrign" parameter to "--keyring" to match the Makefile changes. It also adds better validation for required arguments and more descriptive error messages when values are missing. The script now includes comprehensive documentation about its purpose and usage examples.
|
||||
3. Shows extensive improvements to the script's functionality and robustness, including adding tracing capabilities, better error handling, and more informative logging. It introduces new helper functions like usage(), die(), warn(), and info() for better user feedback. The script now properly checks for required commands (debootstrap, unzip, trivy) and provides installation instructions if they're missing. It also includes better system checks (Linux OS verification, root privileges check, SELinux status) and implements a more reliable way to handle GPG keys by setting up the correct directory structure and permissions before key import.
|
||||
4. Continues the script improvements with better package management, repository configuration, and container setup. It adds proper apt repository configuration in the target system, implements package installation with retries, and includes Docker-specific optimizations. The script now provides clearer output about installed packages and their sizes. It also includes better cleanup procedures and more informative completion messages with clear instructions on how to load and run the resulting Docker image. The output now includes example commands and proper formatting for better readability.
|
1
crates/eval/examples/debian_image_builder/prompt.md
Normal file
1
crates/eval/examples/debian_image_builder/prompt.md
Normal file
|
@ -0,0 +1 @@
|
|||
I need to make several improvements to our Debian image-building scripts. First, fix the typo in the `Makefile` where `--keyrign` is incorrectly used instead of `--keyring` across all build targets, including the standard Debian image and Java variants like `debian11-java`, `debian11-graal`, and `debian11-corretto`. Second, enhance the `debian/mkimage.sh` script to include proper error handling, usage documentation, and command-line argument validation. The script should check for required tools like `debootstrap`, `unzip`, and `trivy`, and provide installation instructions if they're missing. Improve the GPG key setup by ensuring the `/root/.gnupg` directory is properly configured before importing keys. Add structured logging with timestamps, warnings, and informational messages. Implement better package installation with retries and proper cleanup. Finally, include clear instructions at the end on how to load and run the generated Docker image, with example commands for verification. The script should be robust, well-documented, and fail early with meaningful error messages if system requirements aren't met.
|
3
crates/eval/examples/docs_restructure/base.toml
Normal file
3
crates/eval/examples/docs_restructure/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/YuhangSong/Arena-Baselines.git"
|
||||
revision = "801ed8566110ddc4a6ada0cc70171c636d78dbb8"
|
||||
language_extension = "py"
|
12
crates/eval/examples/docs_restructure/criteria.md
Normal file
12
crates/eval/examples/docs_restructure/criteria.md
Normal file
|
@ -0,0 +1,12 @@
|
|||
1. README.md Features Section Reorganization
|
||||
The features section has been reorganized into two subsections ("Baselines" and "Games") with markdown tables added. The previous bullet points were replaced with more structured content including supported/benchmarked status indicators. A new "Visualization" section was added with TensorBoard and port forwarding instructions.
|
||||
2. Content Relocation and File Restructuring
|
||||
The Tennis game documentation and action space details were moved from README.md to a new games.md file. The README was cleaned up by removing commented-out content and consolidating documentation sections. YAML config files (Benchmark-2T1P-Discrete.yaml and Test-Pong.yaml) were modified to replace `selfplay_recent_prob` with `playing_policy_load_recent_prob` and adjust population size options.
|
||||
3. train.py Refactoring
|
||||
Significant changes to train.py including:
|
||||
- Renamed `selfplay_recent_prob` parameter to `playing_policy_load_recent_prob`
|
||||
- Simplified the nested grid search structure by removing unnecessary loops
|
||||
- Improved policy loading logic with better checkpoint path handling
|
||||
- Enhanced error handling and logging for policy saving/reloading
|
||||
- Removed redundant code and improved code organization
|
||||
- Added more descriptive console output during policy operations
|
13
crates/eval/examples/docs_restructure/prompt.md
Normal file
13
crates/eval/examples/docs_restructure/prompt.md
Normal file
|
@ -0,0 +1,13 @@
|
|||
I need to refactor the multi-agent configuration system in our Arena-Baselines repository. The current policy_assignment parameter (self_play, independent) is too coarse. I want to replace it with a more flexible set of parameters to better support advanced training schemes like population-based training (PBT) and sophisticated self-play with historical opponents.
|
||||
|
||||
Specifically, I will introduce four new configuration parameters:
|
||||
|
||||
iterations_per_reload: Controls the frequency (in training iterations) at which policies are saved and potentially reloaded.
|
||||
num_learning_policies: Explicitly defines how many agents use policies that are actively being trained (can be an integer or 'all').
|
||||
selfplay_recent_prob: For non-learning agents (players), this determines the probability of loading the latest version of a learning policy versus loading a uniformly random historical version during reloads.
|
||||
size_population: Specifies the number of distinct policy versions maintained for each learning agent, enabling PBT-style experiments.
|
||||
To implement this, I will significantly modify train.py. This includes updating the argument parser, changing how experiment configurations are expanded (especially with grid_search), and implementing a new callback function (on_train_result). This callback will handle the periodic saving (using pickle) of learning policies to structured directories and the reloading of all policies (learning and playing) according to the new parameters (iterations_per_reload, selfplay_recent_prob, size_population). Playing policies will use deterministic actions.
|
||||
|
||||
I'll also reorganize the codebase by renaming arena/rllib_env.py to arena/arena.py and creating a new arena/utils.py file to house utility functions (like configuration helpers, ID generators, DeterministicCategorical) and constants.
|
||||
|
||||
Finally, I will update the example configuration files (Benchmark-2T1P-Discrete.yaml, Test-Pong.yaml) to remove policy_assignment and demonstrate the usage of the new parameters, including within grid_search.
|
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/calebporzio/sushi.git"
|
||||
revision = "01dd34fe3374f5fb7ce63756c0419385e31cd532"
|
||||
language_extension = "php"
|
|
@ -0,0 +1,3 @@
|
|||
1. The GitHub workflow file has been significantly updated to expand testing coverage and improve the CI process. The changes introduce a new `fail-fast: false` setting to allow all matrix combinations to complete even if some fail. The testing matrix now includes PHP 8.4 and Laravel 12.* alongside the existing versions. The configuration includes specific testbench version mappings for Laravel 12.* and removes the DBAL requirement for Laravel 11.* tests. Numerous new test combinations have been added across all Laravel versions to include PHP 8.4 testing. The dependency installation process has been restructured into separate steps - one specifically for DBAL when needed, and another for general dependencies using updated composer commands with precise version constraints.
|
||||
2. The composer.json file has been updated to support the newly added Laravel 12.* version in both the main requirements and development dependencies. The testbench package now explicitly includes versions 5.* and 10.* in its supported range. For testing tools, PHPUnit 11.* has been added to the list of supported versions while maintaining backward compatibility with older versions. These changes ensure the package can be used with the latest Laravel ecosystem components while preserving compatibility with existing installations.
|
||||
3. The test file modifications primarily focus on adapting to changes in Laravel 11+ where column type handling was updated. The changes introduce version-aware assertions that check whether to expect 'string' or 'varchar' as column types based on the Laravel version being tested. A new import for the version comparison function was added to support these conditional checks. Additional safeguards were implemented, including a check for the HandlesAnnotations trait before running database migration tests, making the test suite more robust when running in different environments. The column type assertions in multiple test methods were updated to use these version-aware checks to maintain compatibility across Laravel versions.
|
11
crates/eval/examples/expand_laravel_php_support/prompt.md
Normal file
11
crates/eval/examples/expand_laravel_php_support/prompt.md
Normal file
|
@ -0,0 +1,11 @@
|
|||
|
||||
I'd like to update our Laravel package's CI workflow and dependencies to ensure compatibility with the upcoming Laravel 12 release and PHP 8.4. Currently, our package supports Laravel versions 5.8 through 11 and PHP versions 7.1 through 8.3, and we'll need to extend this support while maintaining backward compatibility.
|
||||
|
||||
**Key Changes Needed:**
|
||||
First, we'll need to update composer.json to explicitly support Laravel 12. The CI test matrix should also be expanded to include PHP 8.4 testing across all supported Laravel versions. The workflow configuration will require adjustments to properly handle these new version combinations.
|
||||
|
||||
There are some test compatibility issues we'll need to address - particularly around how we check string column types in Laravel 11+ (where 'string' was changed to 'varchar'), and we should add conditional skipping for tests that depend on traits that might not be available in all test environments.
|
||||
|
||||
While making these changes, we could also implement some workflow improvements: enabling the fail-fast: false option to get complete test results even with individual failures, modernizing our dependency installation approach using the newer composer update syntax, and making the DBAL dependency installation conditional since it's not needed for all test cases.
|
||||
|
||||
Would you be able to help review these changes or suggest any additional considerations we should keep in mind for this compatibility update? I want to make sure we maintain stability while expanding our support coverage.
|
3
crates/eval/examples/finnish_translation/base.toml
Normal file
3
crates/eval/examples/finnish_translation/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/sdras/array-explorer.git"
|
||||
revision = "8ff1a72f7ba24d44946bf591c3586b0dcccc2381"
|
||||
language_extension = "js"
|
12
crates/eval/examples/finnish_translation/criteria.md
Normal file
12
crates/eval/examples/finnish_translation/criteria.md
Normal file
|
@ -0,0 +1,12 @@
|
|||
1. **EditorConfig Change**
|
||||
Added a new setting `quote_type = single` to the `.editorconfig` file. This specifies that single quotes should be used for quoting in the codebase.
|
||||
2. **New Finnish Locale Files**
|
||||
Added two new Finnish language files:
|
||||
- `src/locale/fi/index.js`: Contains Finnish translations for UI strings and method descriptions
|
||||
- `store/fi/index.js`: Contains Finnish translations for all array method documentation (298 lines)
|
||||
- `store/fi/meta.json`: Metadata about the Finnish translation (language code "fi", full name "Finnish", created by "sjarva")
|
||||
3. **Store Integration Updates**
|
||||
Modified `store/index.js` to:
|
||||
- Import the new Finnish locale files (`import fi from './fi/index'` and `import translationsFi from '../src/locale/fi/index'`)
|
||||
- Add Finnish to the Vuex store state (`fi`)
|
||||
- Register Finnish translations with Vue I18n (`Vue.i18n.add('fi', translationsFi)`)
|
5
crates/eval/examples/finnish_translation/prompt.md
Normal file
5
crates/eval/examples/finnish_translation/prompt.md
Normal file
|
@ -0,0 +1,5 @@
|
|||
I’m working on adding Finnish (fi) language support to our array method reference application, which helps users determine the right JavaScript array methods based on their needs. To achieve this, I’ll need to:
|
||||
|
||||
First, create the Finnish locale file containing translations for method selection options, method types (such as add, remove, find, and iterate), and primary action choices. Next, I’ll add Finnish translations to the store, covering all array methods (like splice, push, and unshift), including detailed descriptions of their behaviors, parameters, return values, and example code with outputs.
|
||||
|
||||
Additionally, I’ll generate a Finnish meta file with language metadata (language code, full name, and contributor info). Finally, I’ll update the main store index to integrate Finnish alongside existing languages like English, Spanish, and German.
|
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/vercel/ai.git"
|
||||
revision = "1766edec300deb05c84bb7fefc034af4c2bc1165"
|
||||
language_extension = "ts"
|
|
@ -0,0 +1,3 @@
|
|||
1. Introduces a new changeset file that documents a patch for the '@ai-sdk/provider' package. The changeset indicates a chore task where 'LanguageModelV2File' is being extracted, suggesting a refactoring effort to modularize the codebase by separating file-related types into their own module.
|
||||
2. Modifications to the language model v2 index file where a new export statement for 'language-model-v2-file' has been added. This change reflects the extraction mentioned in the changeset and makes the new file type available to other parts of the application. Additionally, there are significant changes to the language model v2 implementation file where the inline file type definition has been replaced with the newly extracted 'LanguageModelV2File' type, both in the main model interface and in the stream part union type, demonstrating the consolidation of file-related types into a single, reusable definition.
|
||||
3. Present the newly created 'language-model-v2-file.ts' file which defines the 'LanguageModelV2File' type with comprehensive documentation. The type includes two properties: 'mediaType' which specifies the IANA media type of the file with a reference to the official media types registry, and 'data' which can be either a base64 encoded string or binary data, with clear documentation about maintaining the original format from the API without unnecessary conversion. This new file represents the extracted type that is now being used throughout the codebase.
|
|
@ -0,0 +1 @@
|
|||
We need to improve how our language model handles file attachments by making the file type definitions more modular and reusable. Currently, file-related properties are defined inline within the model’s response and stream types, which makes maintenance harder and duplicates documentation. The goal is to extract these definitions into a dedicated type that can be shared consistently across both static responses and streaming payloads. The new type should include clear documentation about media types (referencing IANA standards) and support both base64 and binary data formats without unnecessary conversions. This change should maintain backward compatibility while centralizing the file structure definition for better type safety and readability. Focus on clean separation of concerns, and ensure the extracted type is properly exported and imported where needed.
|
3
crates/eval/examples/license_management/base.toml
Normal file
3
crates/eval/examples/license_management/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/SAP-samples/abap-cheat-sheets.git"
|
||||
revision = "262c0472eeb03e05ff8235767356a328d97850e6"
|
||||
require_lsp = false
|
3
crates/eval/examples/license_management/criteria.md
Normal file
3
crates/eval/examples/license_management/criteria.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
1. The file `.reuse/dep5` has been deleted. This file previously contained copyright and licensing information in Debian's copyright format, including details about API usage with SAP products, copyright notice (2022 SAP SE or affiliates), and Apache-2.0 license information.
|
||||
2. A new file `REUSE.toml` has been created with similar copyright and licensing information but in a different format. It includes the package name, supplier information, download location, and the same detailed disclaimer about API usage with SAP products that was in the deleted file.
|
||||
3. The new `REUSE.toml` file also contains annotations specifying that the copyright text and Apache-2.0 license apply to all files (`path = "**"`) with aggregate precedence, effectively maintaining the same licensing terms but in a different configuration format.
|
17
crates/eval/examples/license_management/prompt.md
Normal file
17
crates/eval/examples/license_management/prompt.md
Normal file
|
@ -0,0 +1,17 @@
|
|||
I need to switch our license stuff from the old .reuse/dep5 file to the new REUSE.toml format. basically same info, just different format. here's what's in the old file:
|
||||
|
||||
project name: abap-cheat-sheets
|
||||
contact: daniel reger's email
|
||||
repo link
|
||||
that long SAP API disclaimer
|
||||
copyright: SAP + contributors, 2022
|
||||
license: Apache-2.0
|
||||
need to:
|
||||
|
||||
delete the old .reuse/dep5 file
|
||||
make a new REUSE.toml with:
|
||||
same project info (name, contact, repo)
|
||||
same exact API disclaimer text
|
||||
SPDX-style copyright & license fields
|
||||
apply to all files (** glob) with aggregate precedence
|
||||
not changing any actual license terms, just updating the format. can you give me the exact REUSE.toml file we need?
|
3
crates/eval/examples/metal_i64_support/base.toml
Normal file
3
crates/eval/examples/metal_i64_support/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/huggingface/candle.git"
|
||||
revision = "3164a19a5dc18f5e0f7a063ae85a0cfd289e98f1"
|
||||
language_extension = "rs"
|
4
crates/eval/examples/metal_i64_support/criteria.md
Normal file
4
crates/eval/examples/metal_i64_support/criteria.md
Normal file
|
@ -0,0 +1,4 @@
|
|||
1. The changes improve the configurability of the `TextGeneration` struct and its initialization by refactoring generation parameters (`temperature`, `top_p`) to use non-optional types with default values, simplifying their use throughout the codebase.
|
||||
2. The argument parser is updated to enhance usability: `verbose_prompt` is renamed to a more general `verbose` flag, several arguments are given default values (e.g., `temperature`, `top_p`, `sample_len`), and optional arguments like `cache_path` and `weight_path` are now properly handled with conditional logic and fallbacks.
|
||||
3. The code loading the model configuration is updated to support deserializing from a JSON config file using Serde, and the `Config` struct is extended with a new `rope_ratio` field with a default value via a helper function, improving flexibility for different model setups.
|
||||
4. Import statements and general code layout are cleaned up for clarity and consistency, including reorganizing imports and removing unnecessary unwraps or panics, while maintaining the same core functionality of the text generation pipeline.
|
1
crates/eval/examples/metal_i64_support/prompt.md
Normal file
1
crates/eval/examples/metal_i64_support/prompt.md
Normal file
|
@ -0,0 +1 @@
|
|||
I'd like to improve the configurability and usability of the text generation script for the CodeGeeX4-9B model. Please refactor the argument parsing to set more user-friendly defaults where possible, especially for generation parameters like temperature and top-p, and change fields like verbose_prompt to a more general verbose flag. Simplify the handling of optional paths like cache or weight paths, making them truly optional with fallbacks. I also want the model config to support deserialization from a JSON file instead of relying on hardcoded defaults, including support for a rope_ratio parameter with a sensible default. Lastly, please clean up the code for consistency—such as import ordering—and ensure everything aligns with these improvements without changing the overall functionality.
|
3
crates/eval/examples/nan_diff_handling/base.toml
Normal file
3
crates/eval/examples/nan_diff_handling/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/AsyncBanana/microdiff"
|
||||
revision = "ce2055948483d01fb1e96def4ab98d6339d3b2f9"
|
||||
language_extension = "js"
|
6
crates/eval/examples/nan_diff_handling/criteria.md
Normal file
6
crates/eval/examples/nan_diff_handling/criteria.md
Normal file
|
@ -0,0 +1,6 @@
|
|||
1. **NaN Comparison Logic Update**:
|
||||
The diff modifies the comparison function to explicitly handle NaN values as equivalent. Previously, the function relied on string conversion for NaN comparison, but now it first checks if both values are NaN using Number.isNaN() before proceeding with other comparison logic. This change ensures consistent behavior when comparing NaN values in objects.
|
||||
2. **New NaN Test Suite - Object Operations**:
|
||||
A comprehensive test suite is added to verify NaN handling in object operations. The tests cover: creating new objects with NaN values, changing NaN values to other numbers, verifying no changes when NaN values remain the same, and removing properties with NaN values. Each test case validates the diff output structure and type of operation.
|
||||
3. **New NaN Test Suite - Array Operations**:
|
||||
The test suite extends to array operations with similar test cases as objects but adapted for array contexts. It tests: adding NaN to arrays, replacing NaN with other numbers, maintaining arrays with unchanged NaN values, and removing NaN elements from arrays. The tests ensure consistent behavior between object and array operations involving NaN values.
|
1
crates/eval/examples/nan_diff_handling/prompt.md
Normal file
1
crates/eval/examples/nan_diff_handling/prompt.md
Normal file
|
@ -0,0 +1 @@
|
|||
The goal of this update is to fix NaN value handling in our JavaScript object diffing functionality. Currently, the diff function fails to properly recognize that two NaN values should be treated as equal due to JavaScript's native behavior where `NaN !== NaN`. This causes incorrect change detection when comparing objects or arrays containing NaN values. The solution involves modifying the diff function to explicitly check for NaN values using `Number.isNaN()` during comparisons of object keys and values, ensuring NaN values are treated as equivalent. The implementation requires adding specific NaN equivalence checks while maintaining existing comparison logic. Additionally, comprehensive unit tests are being added to verify correct handling across various scenarios: creating objects/arrays with NaN values, changing NaN values to other values, ensuring no false positives when NaN values remain unchanged, and properly tracking removal of NaN values from both objects and arrays. This change will bring the diff behavior in line with mathematical expectations for NaN comparisons while maintaining all other existing functionality.
|
3
crates/eval/examples/optimizer_schema_refactor/base.toml
Normal file
3
crates/eval/examples/optimizer_schema_refactor/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/redis/redis-vl-python.git"
|
||||
revision = "494e5e2f8cf800b90c7383385095c2e503404bc5"
|
||||
language_extension = "py"
|
|
@ -0,0 +1,3 @@
|
|||
1. The changes involve renaming the `TestData` class to `LabeledData` across multiple files. This includes updating the import statements in `__init__.py`, `cache.py`, `router.py`, `schema.py`, and `utils.py` to reflect this new class name. The `__all__` list in `__init__.py` is also updated to export `LabeledData` instead of `TestData`. This appears to be a conceptual renaming to better reflect the purpose of the data structure.
|
||||
2. The modifications update all function signatures and type hints that previously used `TestData` to now use `LabeledData`. This affects several functions in `cache.py` including `_generate_run_cache`, `_eval_cache`, and `_grid_search_opt_cache`, as well as functions in `router.py` like `_generate_run_router` and `_eval_router`. The utility functions in `utils.py` are also updated to work with `LabeledData` instead of `TestData`.
|
||||
3. The changes introduce a new `search_step` parameter in the router optimization logic within `router.py`, with a default value of 0.10. This parameter is passed through to the `_router_random_search` function and is used in the optimization process. The test file `test_threshold_optimizer.py` is updated to explicitly set this parameter to 0.5 when calling the optimize method, demonstrating how it can be configured for different search granularities during threshold optimization.
|
1
crates/eval/examples/optimizer_schema_refactor/prompt.md
Normal file
1
crates/eval/examples/optimizer_schema_refactor/prompt.md
Normal file
|
@ -0,0 +1 @@
|
|||
I need to refactor our codebase to improve the clarity and consistency of our data model, particularly around how we handle labeled evaluation data for our threshold optimization system. Currently, the naming and structure might imply that this data is only used for testing, when in reality it represents labeled examples that power both training and evaluation. The changes should better reflect that these are curated data points with known outcomes, not just test cases. Focus on updating the core data model and ensuring all dependent components—like the cache optimizer, router, and evaluation utilities—properly reference this updated concept. The implementation should maintain all existing functionality while making the naming more semantically accurate. Where relevant, consider adding parameters to fine-tune optimization behavior, like allowing control over the granularity of threshold searches.
|
3
crates/eval/examples/rate_limit_endpoints/base.toml
Normal file
3
crates/eval/examples/rate_limit_endpoints/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/matryer/goblueprints.git"
|
||||
revision = "68041a598865cc3f4fa2acd4119081a2ea0826bf"
|
||||
language_extension = "go"
|
12
crates/eval/examples/rate_limit_endpoints/criteria.md
Normal file
12
crates/eval/examples/rate_limit_endpoints/criteria.md
Normal file
|
@ -0,0 +1,12 @@
|
|||
1. The main.go changes introduce rate-limited endpoints by creating them via `MakeEndpoints` and passing them to both HTTP and gRPC servers instead of directly using the service. This includes:
|
||||
- Adding endpoint creation before server startup
|
||||
- Modifying HTTP server to use endpoints
|
||||
- Modifying gRPC server to use endpoints
|
||||
2. The server_grpc.go changes update the gRPC server implementation to use the provided endpoints instead of creating them internally. This affects both hash and validate endpoints which are now taken from the Endpoints struct rather than being created via makeHashEndpoint/makeValidateEndpoint.
|
||||
3. The server_http.go changes mirror the gRPC server changes, modifying the HTTP server to use endpoints from the Endpoints struct rather than creating them internally for both hash and validate routes.
|
||||
4. The service.go changes include:
|
||||
- Renaming makeHashEndpoint to MakeHashEndpoint and making it public
|
||||
- Renaming makeValidateEndpoint to MakeValidateEndpoint and making it public
|
||||
- Adding new MakeEndpoints function that creates rate-limited endpoints using a token bucket (5 requests per second)
|
||||
- Adding new dependencies for rate limiting (kitrl and ratelimit packages)
|
||||
- The Endpoints struct remains the same but is now populated with rate-limited versions of the endpoints
|
18
crates/eval/examples/rate_limit_endpoints/prompt.md
Normal file
18
crates/eval/examples/rate_limit_endpoints/prompt.md
Normal file
|
@ -0,0 +1,18 @@
|
|||
Here is a more abstract, goal-oriented version of the request, without diving into implementation specifics:
|
||||
|
||||
---
|
||||
|
||||
### **Request: Add Rate Limiting to Vault Service**
|
||||
|
||||
We need to introduce rate limiting to our vault service to protect it from excessive traffic and ensure fair usage. The service currently handles password hashing and validation through both HTTP and gRPC, and we want to enforce a controlled request rate across all endpoints.
|
||||
|
||||
#### **Key Requirements:**
|
||||
- Apply a global rate limit (e.g., 5 requests per second) to prevent abuse.
|
||||
- Ensure the rate limiting works consistently across both HTTP and gRPC interfaces.
|
||||
- Refactor the service to cleanly support rate limiting without breaking existing functionality.
|
||||
- Maintain flexibility so that limits can be adjusted if needed.
|
||||
|
||||
#### **Implementation Approach (High-Level):**
|
||||
- Use a token bucket or similar algorithm for smooth rate limiting.
|
||||
- Integrate with our existing middleware/request pipeline.
|
||||
- Keep the changes minimal but scalable for future adjustments.
|
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/localtunnel/localtunnel.git"
|
||||
revision = "4c136a265c2005bcb81bf47709c8ca9b634f2fc1"
|
||||
language_extension = "js"
|
|
@ -0,0 +1,3 @@
|
|||
1. The first change replaces the `request` module import with `axios` in Tunnel.js. This is accompanied by modifications to the request parameters where `path` and `json` fields are removed and replaced with `responseType: 'json'`. The request URI construction is also slightly modified to separate the base URI from the parameters.
|
||||
2. The second chunk shows significant changes to the request handling logic in Tunnel.js. The callback-based `request` implementation is replaced with a promise-based `axios.get` approach. The error handling is restructured to use `.catch()` instead of checking for errors in the callback. The success case now extracts data from `res.data` instead of directly from the response body, and the status code check looks at `res.status` instead of `res.statusCode`.
|
||||
3. The third chunk shows changes to package.json where the `request` dependency is removed and replaced with `axios` at version 0.17.1. The dependencies are also reordered, with `debug` and `openurl` moved up and `yargs` moved to the end of the list, though their versions remain unchanged. The devDependencies section remains untouched.
|
|
@ -0,0 +1 @@
|
|||
I need help modernizing the HTTP client in my Node.js tunneling service. The current implementation uses the older `request` library, which is now deprecated, and I'd like to switch to a more modern, promise-based alternative like `axios`. The changes should maintain all existing functionality—including error handling, retry logic, and response parsing—but improve readability and maintainability by using async/await or proper promise chaining where possible. The request parameters and response handling should be updated to match the new library's conventions while preserving the same behavior for downstream consumers. Additionally, ensure the package.json dependencies are updated accordingly, removing deprecated packages and cleaning up the dependency list. The core tunneling logic should remain unchanged; this is purely about updating the HTTP client layer to be more future-proof.
|
3
crates/eval/examples/runtime_script_refactor/base.toml
Normal file
3
crates/eval/examples/runtime_script_refactor/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/thalissonvs/pydoll.git"
|
||||
revision = "9ea9e91c716b60a7cc8f11ecd865093d460f31aa"
|
||||
language_extension = "py"
|
6
crates/eval/examples/runtime_script_refactor/criteria.md
Normal file
6
crates/eval/examples/runtime_script_refactor/criteria.md
Normal file
|
@ -0,0 +1,6 @@
|
|||
1. **Added RuntimeCommands import and WebElement to page.py**
|
||||
The changes add an import for `RuntimeCommands` and `WebElement` to `page.py`. The `execute_js_script` method is renamed to `execute_script` and enhanced to support execution in the context of a WebElement. The method now uses `RuntimeCommands` for script evaluation.
|
||||
2. **Refactored Runtime-related commands from DomCommands to new RuntimeCommands class**
|
||||
The changes move all Runtime-related command templates and methods from `DomCommands` in `dom.py` to a new `runtime.py` file. This includes `EVALUATE_TEMPLATE`, `CALL_FUNCTION_ON_TEMPLATE`, `GET_PROPERTIES`, and their associated methods. The DomCommands class now uses RuntimeCommands for JavaScript evaluation.
|
||||
3. **Added Scripts constants and enhanced WebElement functionality**
|
||||
The changes add a new `Scripts` class to `constants.py` containing JavaScript snippets for common operations. The `element.py` file is significantly enhanced with new methods for script execution, visibility checking, and improved click handling. New exceptions are added to `exceptions.py` for better error handling.
|
7
crates/eval/examples/runtime_script_refactor/prompt.md
Normal file
7
crates/eval/examples/runtime_script_refactor/prompt.md
Normal file
|
@ -0,0 +1,7 @@
|
|||
I'm looking to improve our Python web automation library (pydoll) to make it more robust and maintainable, particularly around JavaScript execution and element interactions. Currently, we need to better organize our Runtime-related commands and enhance how scripts are executed in the browser context.
|
||||
|
||||
The main focus areas include creating a dedicated RuntimeCommands class to centralize all JavaScript-related operations, moving these functions out of DomCommands for cleaner separation of concerns. This new class would handle script evaluation, function calling, and property lookups. We should also enhance the existing page.execute_js_script method—renaming it to execute_script for clarity—and expand its functionality to support execution within specific WebElement contexts, including passing elements as arguments.
|
||||
|
||||
For element interactions, we need more reliable mechanisms, particularly around clicking elements. The improvements would include visibility checks, verifying elements aren't obscured, and implementing proper error handling with descriptive exceptions when interactions fail. The current click implementation should be moved to realistic_click, while the new click method would incorporate these safety checks. Additionally, we should consolidate commonly used JavaScript snippets into a centralized Scripts class for better maintainability.
|
||||
|
||||
The overall goal is to strengthen the library's reliability for automation tasks while making the codebase more organized and easier to maintain. These changes will provide better error handling, clearer structure, and more intuitive APIs for working with page elements and JavaScript execution. Would you be able to help break this down into actionable steps or suggest any improvements to this approach?
|
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/basecamp/kamal.git"
|
||||
revision = "0174b872bfc34b66852cffb58514ae079f21d299"
|
||||
language_extension = "rb"
|
|
@ -0,0 +1,7 @@
|
|||
1. The changes introduce a new `DependencyError` class in `kamal/cli.rb` alongside other error classes like `BootError` and `HookError`. This new error class will be used to handle dependency-related failures.
|
||||
2. In `kamal/cli/base.rb`, a new method `ensure_docker_installed` is added which checks for Docker and buildx plugin installation locally. It raises the new `DependencyError` with appropriate messages if either Docker or buildx plugin are not found, replacing similar functionality that was previously scattered elsewhere.
|
||||
3. The `kamal/cli/build.rb` file is modified to use the new `ensure_docker_installed` method instead of the removed `verify_local_dependencies` method. The error handling is now consistent, using `DependencyError` instead of `BuildError` for dependency-related failures.
|
||||
4. The `kamal/cli/registry.rb` file now includes a call to `ensure_docker_installed` at the start of the login method, ensuring Docker is available before attempting registry operations.
|
||||
5. The `kamal/commands/base.rb` file adds a new public method `ensure_docker_installed` that combines checks for both Docker and buildx plugin installation, moving this functionality from the Builder class.
|
||||
6. The `kamal/commands/builder.rb` file is simplified by removing the `ensure_local_dependencies_installed` method and related private methods, as this functionality has been moved to the base commands class.
|
||||
7. Test files are updated to reflect these changes, with `build_test.rb` now expecting `DependencyError` instead of `BuildError` for dependency failures, and `registry_test.rb` adding a new test case for Docker dependency checking during login.
|
|
@ -0,0 +1 @@
|
|||
I need to improve how our codebase handles Docker dependency checks and error reporting. Right now, the logic for verifying Docker and buildx installations is scattered across different classes, and the error messages aren't consistent. I'd like a more unified approach where we centralize these checks in a single place, making it easier to maintain and reuse. Additionally, we should introduce a dedicated error type for dependency-related failures instead of repurposing existing errors like BuildError. The changes should ensure that any command requiring Docker (like builds or registry logins) properly validates dependencies first, with clear error messages if something is missing. The solution should be clean, follow existing patterns in the codebase, and include any necessary test updates to reflect the new behavior.
|
3
crates/eval/examples/table_metrics_sorting/base.toml
Normal file
3
crates/eval/examples/table_metrics_sorting/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/duyet/clickhouse-monitoring.git"
|
||||
revision = "b8ab1a957115f41c916e7061b432ae00b1bbe7db"
|
||||
language_extension = "ts"
|
5
crates/eval/examples/table_metrics_sorting/criteria.md
Normal file
5
crates/eval/examples/table_metrics_sorting/criteria.md
Normal file
|
@ -0,0 +1,5 @@
|
|||
1. The SQL query in tables-overview.ts has been enhanced to include additional metrics for part sizes, both average and maximum. New fields have been added for compressed and uncompressed average part sizes with their readable formats and percentage calculations. Similarly, maximum part size metrics have been added with the same set of calculations. These additions provide more granular visibility into table partition characteristics while maintaining the existing percentage calculations relative to the maximum values across all tables.
|
||||
2. The column ordering and formatting in tables-overview.ts has been updated to accommodate the new part size metrics. The new readable_avg_part_size and readable_max_part_size columns have been added to the columns array and configured with BackgroundBar formatting. The engine column has been moved to the end of the list for better grouping of related metrics. The sortingFns configuration has been added to specify custom sorting behavior for various compressed and uncompressed size columns.
|
||||
3. The column definitions system has been enhanced to support custom sorting functions. A new sorting-fns.ts file has been created containing a sort_column_using_actual_value function that enables sorting based on underlying numeric values rather than formatted strings. The getColumnDefs function now checks for both custom and built-in sorting functions in the config and applies them appropriately to column definitions.
|
||||
4. The data table component has been updated to include custom sorting functions in its configuration. The getCustomSortingFns function is now passed to the table's sortingFns option, making these functions available for all columns. The ValueOf utility type has been added to generic.ts to support proper typing of the sorting functions.
|
||||
5. The query config type has been extended to include a new optional sortingFns property. This property allows specifying custom sorting functions for specific columns in the table configuration. The type imports have been reorganized, and CustomSortingFnNames is now properly imported and used in the QueryConfig interface.
|
1
crates/eval/examples/table_metrics_sorting/prompt.md
Normal file
1
crates/eval/examples/table_metrics_sorting/prompt.md
Normal file
|
@ -0,0 +1 @@
|
|||
I need to enhance our data table functionality to support more advanced sorting capabilities, particularly for columns that display formatted values (like readable sizes or percentages) but should sort based on their underlying raw numeric values. The table should also include additional metrics for average and maximum part sizes (both compressed and uncompressed) to give better insights into table storage characteristics. These new metrics should follow the same pattern as existing columns, with formatted readable versions, percentage calculations relative to the dataset maximum, and proper sorting behavior. The sorting system should be flexible enough to support both custom sorting logic (like comparing raw numbers behind formatted strings) and built-in sorting methods, with a clean way to configure which columns use which sorting approach. The implementation should maintain consistency with our existing column formatting system and integrate smoothly with the React Table setup we already have in place.
|
3
crates/eval/examples/tax_id_validation/base.toml
Normal file
3
crates/eval/examples/tax_id_validation/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/go-playground/validator.git"
|
||||
revision = "4676b8e43bb907ef07f3bcc4ae2a218b05d60397"
|
||||
language_extension = "go"
|
3
crates/eval/examples/tax_id_validation/criteria.md
Normal file
3
crates/eval/examples/tax_id_validation/criteria.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
1. Documentation updates in README.md, where a new validation type for Employer Identification Numbers (EIN) was added to the supported validators table. This addition was carefully positioned between the existing "e164" phone number format and "email" validators to maintain alphabetical ordering. The entry follows the established table format with pipe-separated columns and includes a clear description indicating its purpose for validating U.S. Employer Identification Numbers. Notably, this change was made without modifying any of the existing documentation entries, preserving all current validator descriptions while expanding the supported validation types.
|
||||
2. Core implementation of the EIN validation across multiple files. In baked_in.go, this involved adding an "ein" entry to the validator map that points to a newly created isEIN function, following the same pattern as other validator registrations. The isEIN() function itself implements the validation logic, checking for both length requirements (exactly 10 characters) and pattern matching using a new regular expression. The regexes.go file was updated with a new einRegexString constant defining the EIN pattern (##-#######) and corresponding regex variable initialization, utilizing the existing lazyRegexCompile helper function for consistency. Documentation was added in doc.go following the established format for validator descriptions, complete with a simple usage example. Throughout these changes, careful attention was paid to maintain consistent error handling patterns and code organization while removing unnecessary newlines in several functions to improve readability.
|
||||
3. Testing improvements and code quality enhancements, primarily in validator_test.go. A comprehensive TestEINStringValidation test case was added, covering various valid and invalid EIN formats, including tests for length requirements and hyphen positioning. This new test follows the same structure and assertion patterns as existing validation tests. Numerous code quality improvements were made throughout the test file, including grouping interface declarations, fixing comment formatting, removing unnecessary newlines in struct declarations, correcting indentation in test cases, and adding missing newlines between tests. These changes significantly improved code readability while maintaining all existing test logic and ensuring backward compatibility. The improvements demonstrate careful attention to maintaining consistent patterns throughout the test suite while adding thorough test coverage for the new EIN validation functionality.
|
10
crates/eval/examples/tax_id_validation/prompt.md
Normal file
10
crates/eval/examples/tax_id_validation/prompt.md
Normal file
|
@ -0,0 +1,10 @@
|
|||
|
||||
Add validation support for Employer Identification Numbers (EIN) to the Go validator library
|
||||
|
||||
I need to implement a new validator function for US Employer Identification Numbers (EIN) in this Go validation library. The EIN validator should:
|
||||
|
||||
1. Create a new tag called "ein" that validates if a string is a valid US Employer Identification Number
|
||||
2. Follow the pattern of ##-#######, where # is a digit (regex pattern would be ^(\d{2}-\d{7})$)
|
||||
3. Ensure the field contains exactly 10 characters (including the hyphen)
|
||||
4. Document the new validator in the README.md and doc.go files
|
||||
5. Add proper unit tests to verify validation works correctly for valid and invalid EINs
|
3
crates/eval/examples/test_infrastructure/base.toml
Normal file
3
crates/eval/examples/test_infrastructure/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/dagster-io/dagster.git"
|
||||
revision = "c9ed914a76baa6fb761a97f3236f96cd7d5361e6"
|
||||
language_extension = "py"
|
3
crates/eval/examples/test_infrastructure/criteria.md
Normal file
3
crates/eval/examples/test_infrastructure/criteria.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
1. Introduces a new docker-compose.yml file in the integration tests directory for the monitoring daemon test suite. This file defines two services: a PostgreSQL database with test credentials exposed on port 5432, and a localstack S3 service exposed on port 4566. These services provide the necessary infrastructure for running the monitoring tests.
|
||||
2. Shows significant modifications to the test_monitoring.py file, including new imports (boto3, Path, and docker_compose_cm), removal of the dagster_aws tests import, and the addition of new fixtures. The new fixtures handle docker-compose setup, provide hostnames for services, configure AWS environment variables with test credentials, and initialize an S3 bucket for testing purposes. The changes reflect a shift from using external AWS credentials to using localstack for S3 testing.
|
||||
3. Reveals structural changes to the test file, where the aws_env fixture has been moved from the bottom of the file to be grouped with other fixtures. The original implementation that relied on get_aws_creds() has been replaced with a new implementation that uses localstack with hardcoded test credentials, and the test_docker_monitoring_run_out_of_attempts function remains at the end of the file but now uses the new aws_env fixture implementation.
|
1
crates/eval/examples/test_infrastructure/prompt.md
Normal file
1
crates/eval/examples/test_infrastructure/prompt.md
Normal file
|
@ -0,0 +1 @@
|
|||
Refactor the monitoring daemon integration tests to use local Docker-managed dependencies instead of direct AWS dependencies. First, create a docker-compose.yml file with two services: a PostgreSQL container with test credentials exposed on port 5432, and a LocalStack S3 container exposed on port 4566. Next, modify the test file to remove reliance on external AWS credentials and replace them with fixtures that configure a LocalStack S3 mock. The fixtures should include session-scoped setup for hostnames, PostgreSQL connections, and AWS environment variables with hardcoded test credentials (e.g., fake access keys). Ensure the S3 fixture initializes a test bucket. Move the AWS environment fixture to align with other fixtures and update the test logic to use the new LocalStack endpoint URL, handling both local and Buildkite environments. Keep the core test cases (like monitoring run attempts) intact but adapt them to use the new Docker-based dependencies.
|
3
crates/eval/examples/tool_response_handling/base.toml
Normal file
3
crates/eval/examples/tool_response_handling/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/block/goose.git"
|
||||
revision = "d7308457fe3f1b9c7253de45b2f81ddc4f005fe5"
|
||||
language_extension = "rs"
|
3
crates/eval/examples/tool_response_handling/criteria.md
Normal file
3
crates/eval/examples/tool_response_handling/criteria.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
1. All Goose packages (`goose`, `goose-bench`, `goose-cli`, `goose-mcp`, `goose-server`) were updated from version `1.0.17` to `1.0.18` in `Cargo.lock`. These updates ensure compatibility and consistency across related packages.
|
||||
2. The `goose-app` version in `ui/desktop/package-lock.json` was also updated to `1.0.18`, maintaining alignment with the backend and shared libraries.
|
||||
3. In `App.tsx`, the destructuring of the `useConfig` hook was updated to obtain `addExtension` directly, replacing the older `addExtensionToConfig` function. All occurrences of the old function name were updated—including inside effects and async calls—to use the new unified method. This change simplifies the extension handling logic while preserving current behavior.
|
1
crates/eval/examples/tool_response_handling/prompt.md
Normal file
1
crates/eval/examples/tool_response_handling/prompt.md
Normal file
|
@ -0,0 +1 @@
|
|||
Upgrade all Goose-related packages and apps from version 1.0.17 to 1.0.18 throughout the codebase. This includes updating version references in Cargo.lock, package-lock.json, and source files where applicable. In addition, streamline the addExtension logic in App.tsx by removing the outdated addExtensionToConfig references and replacing them with the new unified addExtension function. Ensure that all function dependencies and hooks reflect this updated usage. The goal is to improve maintainability and consistency across the codebase without introducing any functional changes.
|
3
crates/eval/examples/toolbar_endpoints/base.toml
Normal file
3
crates/eval/examples/toolbar_endpoints/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/django-cms/django-cms.git"
|
||||
revision = "0b775f27300c4347be18a5bb7b1b172d6a943ccf"
|
||||
language_extension = "py"
|
3
crates/eval/examples/toolbar_endpoints/criteria.md
Normal file
3
crates/eval/examples/toolbar_endpoints/criteria.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
1. The changes add two new URL patterns ('cms_placeholder_add_plugin' and 'cms_placeholder_edit_plugin') to the list of endpoints in the toolbar middleware configuration. These endpoints will now be recognized by the toolbar system.
|
||||
2. The changes add test cases for the new toolbar endpoints in the test file. The first test case verifies that the toolbar is properly attached to requests for the 'cms_placeholder_add_plugin' admin endpoint. The test creates a mock request and checks that the toolbar attribute is present after middleware processing.
|
||||
3. The changes include a second test case that verifies toolbar functionality for the 'cms_placeholder_edit_plugin' admin endpoint. Similar to the first test, it creates a mock request with plugin ID (1) and checks for the presence of the toolbar attribute after middleware processing. This maintains consistency with the existing test for 'cms_placeholder_clear_placeholder'.
|
3
crates/eval/examples/toolbar_endpoints/prompt.md
Normal file
3
crates/eval/examples/toolbar_endpoints/prompt.md
Normal file
|
@ -0,0 +1,3 @@
|
|||
I'm working on improving the Django CMS toolbar middleware to better support plugin management functionality. Currently, the toolbar is only enabled for specific views defined in `TOOLBAR_URL_PREFIXES` in toolbar.py, but I've noticed we're missing support for two critical plugin-related operations: adding and editing plugins through the `cms_placeholder_add_plugin` and `cms_placeholder_edit_plugin` views. These views should have access to the toolbar object just like our other administrative actions, as they're fundamental to the content editing experience.
|
||||
|
||||
To implement this enhancement, we'll need to make two key changes. First, we should add both 'cms_placeholder_add_plugin' and 'cms_placeholder_edit_plugin' to the allowed URL prefixes list in cms/middleware/toolbar.py. Second, we should expand our test coverage in cms/tests/test_toolbar.py to verify that the toolbar object is properly attached to requests hitting these endpoints, maintaining consistency with how we test other toolbar-enabled views. This change will ensure a more complete and reliable toolbar experience throughout the entire plugin management workflow.
|
3
crates/eval/examples/war_and_uri_corrections/base.toml
Normal file
3
crates/eval/examples/war_and_uri_corrections/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/jetty/jetty.project.git"
|
||||
revision = "dc685b6f84e94ad2eb6a3930769e6eab0cab3fa6"
|
||||
language_extension = "java"
|
7
crates/eval/examples/war_and_uri_corrections/criteria.md
Normal file
7
crates/eval/examples/war_and_uri_corrections/criteria.md
Normal file
|
@ -0,0 +1,7 @@
|
|||
1. The changes add an import for `URIUtil` and modify the URL creation in `OSGiApp.java` to use `URIUtil.correctURI()` for proper URI handling. The modification ensures correct URI formatting before converting to URL.
|
||||
2. The changes add an import for `URIUtil` and modify the URI creation in `Util.java` to use `URIUtil.correctURI()` when handling file paths. This ensures proper URI formatting for paths starting with "file:/".
|
||||
3. The changes in both `WebInfConfiguration.java` files (EE10 and EE9 versions) refactor the war file handling logic. The modifications:
|
||||
- Add explanatory comments about looking for sibling directories
|
||||
- Change how the war path is obtained (using webApp.getPath() instead of creating new resources)
|
||||
- Restructure the conditional logic for better clarity
|
||||
- Maintain the same functionality but with improved safety checks and documentation
|
7
crates/eval/examples/war_and_uri_corrections/prompt.md
Normal file
7
crates/eval/examples/war_and_uri_corrections/prompt.md
Normal file
|
@ -0,0 +1,7 @@
|
|||
I’m working on improvements to a Jetty OSGi application’s file path handling and deployment logic. The changes focus on two main areas: URI normalization and WAR file extraction.
|
||||
|
||||
First, the URI handling logic needs updates to ensure consistent formatting, particularly when dealing with file paths. Currently, there are cases where paths aren’t properly normalized, especially when converting between file URIs and URLs. This affects both core OSGi resource resolution and utility methods that process path strings. The goal is to apply systematic corrections so that paths are reliably formatted across different scenarios.
|
||||
|
||||
Second, the WAR file extraction process requires refinement to make it more robust. The current implementation checks for pre-extracted sibling directories, but the logic could be strengthened by using the resolved webApp path directly rather than reconstructing it from strings. Additionally, the code would benefit from clearer documentation and added safeguards to handle edge cases gracefully. These changes will apply to both the EE9 and EE10 WebApp configurations, ensuring consistent behavior across versions.
|
||||
|
||||
The overarching aim is to reduce deployment failures and improve maintainability while keeping the changes backward-compatible.
|
3
crates/eval/examples/window_title_support/base.toml
Normal file
3
crates/eval/examples/window_title_support/base.toml
Normal file
|
@ -0,0 +1,3 @@
|
|||
url = "https://github.com/charmbracelet/bubbletea.git"
|
||||
revision = "bc1c475eb0263aba13ef430f191677e153dc0320"
|
||||
language_extension = "go"
|
4
crates/eval/examples/window_title_support/criteria.md
Normal file
4
crates/eval/examples/window_title_support/criteria.md
Normal file
|
@ -0,0 +1,4 @@
|
|||
1. Adds a new `setWindowTitle` method to the `standardRenderer` struct that sets the terminal window title using the OSC 0 escape sequence. It includes thread safety with mutex locking and uses `fmt.Fprintf` to send the escape sequence with the provided title.
|
||||
2. Modifies the `handleMessages` method in `standardRenderer` to handle a new `setWindowTitleMsg` message type by calling the new `setWindowTitle` method. This completes the rendering-side implementation for window title updates.
|
||||
3. Updates the event loop in the `Program` struct to properly handle `setWindowTitleMsg` messages by passing them through to the renderer without additional processing, similar to other renderer-specific messages.
|
||||
4. Adds documentation to the commands tutorial README explaining how to set window titles in Bubble Tea applications. It shows examples of using `tea.SetWindowTitle()` in both Init and Update methods, and explains its usefulness for reflecting application state in the window title.
|
11
crates/eval/examples/window_title_support/prompt.md
Normal file
11
crates/eval/examples/window_title_support/prompt.md
Normal file
|
@ -0,0 +1,11 @@
|
|||
I’d like to add the ability to set terminal window titles in our Bubble Tea framework. This would let applications dynamically update the title bar (e.g., to show status or app names).
|
||||
|
||||
Requirements:
|
||||
|
||||
Expose a user-friendly way to set titles (e.g., a SetWindowTitle command).
|
||||
Ensure it works cross-platform with standard terminal escape codes.
|
||||
Include a minimal example and docs showing usage.
|
||||
Constraints:
|
||||
|
||||
Follow existing patterns for commands/messages.
|
||||
Thread-safe rendering.
|
|
@ -39,6 +39,9 @@ struct Args {
|
|||
/// Model to use (default: "claude-3-7-sonnet-latest")
|
||||
#[arg(long, default_value = "claude-3-7-sonnet-latest")]
|
||||
model: String,
|
||||
/// Languages to run (comma-separated, e.g. "js,ts,py"). If unspecified, only Rust examples are run.
|
||||
#[arg(long, value_delimiter = ',')]
|
||||
languages: Option<Vec<String>>,
|
||||
}
|
||||
|
||||
fn main() {
|
||||
|
@ -46,6 +49,8 @@ fn main() {
|
|||
|
||||
let args = Args::parse();
|
||||
let all_available_examples = list_all_examples().unwrap();
|
||||
let languages = args.languages.unwrap_or_else(|| vec!["rs".to_string()]);
|
||||
|
||||
let example_paths = all_available_examples
|
||||
.iter()
|
||||
.filter_map(|example_path| {
|
||||
|
@ -94,6 +99,17 @@ fn main() {
|
|||
let mut examples = Vec::new();
|
||||
for example_path in example_paths {
|
||||
let example = Example::load_from_directory(&example_path, &run_dir)?;
|
||||
|
||||
if !example
|
||||
.base
|
||||
.language_extension
|
||||
.as_ref()
|
||||
.map_or(false, |lang| languages.contains(lang))
|
||||
{
|
||||
println!("Skipping {}", example.name);
|
||||
continue;
|
||||
}
|
||||
|
||||
examples.push((example_path, example));
|
||||
}
|
||||
let mut repo_urls = HashSet::new();
|
||||
|
@ -133,6 +149,10 @@ fn main() {
|
|||
|
||||
future::join_all(clone_tasks).await;
|
||||
|
||||
for (_, example) in examples.iter() {
|
||||
example.setup().await?;
|
||||
}
|
||||
|
||||
let tasks = examples
|
||||
.into_iter()
|
||||
.map(|(example_path, example)| {
|
||||
|
@ -197,7 +217,6 @@ async fn run_example(
|
|||
app_state: Arc<AgentAppState>,
|
||||
cx: &mut AsyncApp,
|
||||
) -> Result<JudgeOutput> {
|
||||
example.setup().await?;
|
||||
cx.update(|cx| example.run(model.clone(), app_state, cx))?
|
||||
.await?;
|
||||
let diff = example.repository_diff().await?;
|
||||
|
|
|
@ -115,6 +115,8 @@ impl Example {
|
|||
pub async fn setup(&self) -> Result<()> {
|
||||
let repo_path = repo_path_for_url(&self.base.url);
|
||||
|
||||
println!("{}> Fetching", self.name);
|
||||
|
||||
run_git(
|
||||
&repo_path,
|
||||
&["fetch", "--depth", "1", "origin", &self.base.revision],
|
||||
|
|
|
@ -11,6 +11,7 @@ Use the following criteria to score the above changes.
|
|||
</criteria>
|
||||
|
||||
Based on these criteria, give the test output a score between 0 and 5.
|
||||
The output score should ONLY INCLUDE whole numbers. DO NOT return decimals or floats.
|
||||
|
||||
- 5 means: changes meet all criteria
|
||||
- 0 means: changes don't meet any criteria
|
||||
|
|
|
@ -41,6 +41,10 @@ extend-exclude = [
|
|||
"docs/theme/css/",
|
||||
# Spellcheck triggers on `|Fixe[sd]|` regex part.
|
||||
"script/danger/dangerfile.ts",
|
||||
# Eval examples for prompts and criteria
|
||||
"crates/eval/examples/checkpoint_stability/criteria.md",
|
||||
"crates/eval/examples/tax_id_validation/prompt.md",
|
||||
"crates/eval/examples/tax_id_validation/criteria.md"
|
||||
]
|
||||
|
||||
[default]
|
||||
|
|
Loading…
Add table
Add a link
Reference in a new issue