Richard Feldman
2b431d3e9d
Re-add code block formatting instructions (#29574)
Re-enabled instructions about code block formatting.
In practice, the model doesn't seem to use code blocks very often, but the
instructions have no negative effect on evals. In a future PR, I'll experiment
with adding more evals around the model actually using the code blocks.
2 runs before this change (`--repetitions=8`):
```
=================================================================
AGGREGATE
=================================================================
4 examples failed to run!
Average programmatic score: 37%
Average diff score: 66%
Average thread score: 93%
-----------------------------------------------------------------
CUMULATIVE TOOL METRICS
-----------------------------------------------------------------
┌──────────────────────────────┬──────────┬──────────┬──────────┐
│ Tool │ Uses │ Failures │ Rate │
├──────────────────────────────┼──────────┼──────────┼──────────┤
│edit_file │ 398 │ 53 │ 13% │
│terminal │ 11 │ 1 │ 9% │
│create_file │ 40 │ 2 │ 5% │
│read_file │ 245 │ 8 │ 3% │
│find_path │ 48 │ 0 │ 0% │
│list_directory │ 13 │ 0 │ 0% │
│grep │ 133 │ 0 │ 0% │
│thinking │ 18 │ 0 │ 0% │
│diagnostics │ 130 │ 0 │ 0% │
└──────────────────────────────┴──────────┴──────────┴──────────┘
```
```
=================================================================
AGGREGATE
=================================================================
1 examples failed to run!
Average programmatic score: 41%
Average diff score: 68%
Average thread score: 96%
-----------------------------------------------------------------
CUMULATIVE TOOL METRICS
-----------------------------------------------------------------
┌──────────────────────────────┬──────────┬──────────┬──────────┐
│ Tool │ Uses │ Failures │ Rate │
├──────────────────────────────┼──────────┼──────────┼──────────┤
│fetch │ 1 │ 1 │ 100% │
│edit_file │ 553 │ 63 │ 11% │
│read_file │ 349 │ 3 │ 1% │
│diagnostics │ 158 │ 0 │ 0% │
│find_path │ 70 │ 0 │ 0% │
│list_directory │ 10 │ 0 │ 0% │
│thinking │ 45 │ 0 │ 0% │
│grep │ 213 │ 0 │ 0% │
│create_file │ 24 │ 0 │ 0% │
│terminal │ 17 │ 0 │ 0% │
└──────────────────────────────┴──────────┴──────────┴──────────┘
```
1 run after this change:
```
=================================================================
AGGREGATE
=================================================================
Average programmatic score: 42%
Average diff score: 74%
Average thread score: 100%
-----------------------------------------------------------------
CUMULATIVE TOOL METRICS
-----------------------------------------------------------------
┌──────────────────────────────┬──────────┬──────────┬──────────┐
│ Tool │ Uses │ Failures │ Rate │
├──────────────────────────────┼──────────┼──────────┼──────────┤
│edit_file │ 534 │ 92 │ 17% │
│read_file │ 325 │ 6 │ 2% │
│list_directory │ 6 │ 0 │ 0% │
│thinking │ 12 │ 0 │ 0% │
│create_file │ 16 │ 0 │ 0% │
│diagnostics │ 49 │ 0 │ 0% │
│grep │ 234 │ 0 │ 0% │
│find_path │ 65 │ 0 │ 0% │
│terminal │ 38 │ 0 │ 0% │
└──────────────────────────────┴──────────┴──────────┴──────────┘
```
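As a rough cross-check of the Rate column, the sketch below recomputes the `edit_file` failure rate for each run from the Uses and Failures counts in the tables above. The helper name `failure_rate` and the rounding to the nearest whole percent are assumptions for illustration, not taken from the eval harness.

```python
def failure_rate(failures: int, uses: int) -> int:
    """Return failures / uses as a whole-number percentage.

    Rounding to the nearest integer is an assumption about how the
    Rate column is produced; it is not confirmed by the eval output.
    """
    if uses == 0:
        return 0
    return round(100 * failures / uses)


# (uses, failures) for edit_file, copied from the three tables above.
edit_file_runs = {
    "before, run 1": (398, 53),
    "before, run 2": (553, 63),
    "after": (534, 92),
}

for label, (uses, failures) in edit_file_runs.items():
    print(f"{label}: {failure_rate(failures, uses)}%")
```

Under that rounding assumption, the recomputed values match the reported `edit_file` rates in all three tables.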
Release Notes:
- N/A