# Math Operations Test Results

## Highscores

### Performance Rankings by Test (Duration, fastest first)

| Test | Model | Duration (ms) | Duration (s) |
|------|-------|--------------|--------------|
| quadratic | openai/gpt-4o-mini | 514 | 0.51 |
| quadratic | anthropic/claude-sonnet-4 | 1120 | 1.12 |
| factorial | openai/gpt-4o-mini | 512 | 0.51 |
| factorial | anthropic/claude-sonnet-4 | 877 | 0.88 |
| fibonacci | openai/gpt-4o-mini | 494 | 0.49 |
| fibonacci | anthropic/claude-sonnet-4 | 4093 | 4.09 |
| square_root | openai/gpt-4o-mini | 483 | 0.48 |
| square_root | anthropic/claude-sonnet-4 | 969 | 0.97 |
| power | anthropic/claude-sonnet-4 | 1129 | 1.13 |
| power | openai/gpt-4o-mini | 1308 | 1.31 |
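The table appears to group results by test and list the faster model first within each group, with seconds rounded to two decimals. A minimal sketch of that ordering, assuming results are held as hypothetical `(test, model, duration_ms)` tuples (the harness's real data structure is not shown in this report):

```python
from collections import defaultdict

# Hypothetical in-memory results (test, model, duration_ms), copied from the table above.
RESULTS = [
    ("quadratic", "anthropic/claude-sonnet-4", 1120),
    ("quadratic", "openai/gpt-4o-mini", 514),
    ("factorial", "anthropic/claude-sonnet-4", 877),
    ("factorial", "openai/gpt-4o-mini", 512),
    ("fibonacci", "anthropic/claude-sonnet-4", 4093),
    ("fibonacci", "openai/gpt-4o-mini", 494),
    ("square_root", "anthropic/claude-sonnet-4", 969),
    ("square_root", "openai/gpt-4o-mini", 483),
    ("power", "anthropic/claude-sonnet-4", 1129),
    ("power", "openai/gpt-4o-mini", 1308),
]

def rank_by_test(results):
    """Group rows by test (first-seen order) and sort each group fastest first."""
    grouped = defaultdict(list)
    for test, model, ms in results:
        grouped[test].append((model, ms))
    return {test: sorted(rows, key=lambda row: row[1]) for test, rows in grouped.items()}

def format_rows(ranking):
    """Render Markdown table rows with milliseconds and seconds (2 decimals)."""
    lines = []
    for test, rows in ranking.items():
        for model, ms in rows:
            lines.append(f"| {test} | {model} | {ms} | {ms / 1000:.2f} |")
    return lines

for line in format_rows(rank_by_test(RESULTS)):
    print(line)
```

Sorting inside each group, rather than globally, keeps every test's rows together, which matches how the table above is laid out.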
## Summary

- Total Tests: 15
- Passed: 12
- Failed: 3
- Success Rate: 80.00%
- Average Duration: 1189ms (1.19s)
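The summary figures are related by simple arithmetic: Failed = Total - Passed, and Success Rate = Passed / Total × 100, so 12 / 15 gives 80.00%. A sketch that reproduces the summary lines from the counts; `summarize` is a hypothetical helper, not the harness's code:

```python
def summarize(total, passed, avg_ms):
    """Derive the failed count, success rate, and formatted average duration."""
    failed = total - passed
    # Multiply before dividing so whole-number percentages such as 80 stay exact.
    rate = passed * 100 / total
    return {
        "failed": failed,
        "success_rate": f"{rate:.2f}%",
        "average": f"{avg_ms}ms ({avg_ms / 1000:.2f}s)",
    }

print(summarize(total=15, passed=12, avg_ms=1189))
```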
## Failed Tests

### quadratic - anthropic/claude-sonnet-4

- Prompt: `Solve the quadratic equation x² + 5x + 6 = 0. Respond ONLY with the solutions as comma-separated numbers (e.g., -3,-2). No other text.`
- Expected: `-3,-2`
- Actual: `-2,-3`
- Duration: 1120ms (1.12s)
- Reason: Expected -3,-2, but got -2,-3 (same roots, listed in a different order)
- Timestamp: 3/19/2026, 4:39:00 PM
### quadratic - openai/gpt-4o-mini

- Prompt: `Solve the quadratic equation x² + 5x + 6 = 0. Respond ONLY with the solutions as comma-separated numbers (e.g., -3,-2). No other text.`
- Expected: `-3,-2`
- Actual: `-2,-3`
- Duration: 514ms (0.51s)
- Reason: Expected -3,-2, but got -2,-3 (same roots, listed in a different order)
- Timestamp: 3/19/2026, 4:38:59 PM
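Both quadratic failures are ordering mismatches rather than wrong answers: x² + 5x + 6 factors as (x + 2)(x + 3), so the roots are -2 and -3, and both models returned exactly those roots in the opposite order from the expected string. The reason text suggests the grader compares answer strings exactly; that is an inference from this report, not something it confirms. A minimal sketch of an order-insensitive check, where `solve_quadratic`, `parse_numbers`, and `answers_match` are hypothetical helpers:

```python
import math

def solve_quadratic(a, b, c):
    """Return the distinct real roots of a*x^2 + b*x + c = 0 in ascending order."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # no real roots
    root = math.sqrt(disc)
    # A set collapses a repeated root (disc == 0) into a single value.
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})

def parse_numbers(text):
    # Split a comma-separated answer like "-2,-3"; float() tolerates surrounding spaces.
    return [float(part) for part in text.split(",") if part.strip()]

def answers_match(expected, actual):
    # Compare as sorted number lists so "-3,-2" and "-2,-3" count as the same answer.
    return sorted(parse_numbers(expected)) == sorted(parse_numbers(actual))

print(solve_quadratic(1, 5, 6))         # [-3.0, -2.0]
print(answers_match("-3,-2", "-2,-3"))  # True
```

Exact float equality is adequate here because the roots are integers; answers with non-terminating decimals would need a tolerance.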
## Passed Tests

### factorial - anthropic/claude-sonnet-4

- Prompt: `Calculate 5! (factorial of 5). Respond ONLY with the final numerical answer. No explanation, no other text.`
- Expected: `120`
- Actual: `120`
- Duration: 877ms (0.88s)
- Timestamp: 3/19/2026, 4:39:03 PM
### factorial - openai/gpt-4o-mini

- Prompt: `Calculate 5! (factorial of 5). Respond ONLY with the final numerical answer. No explanation, no other text.`
- Expected: `120`
- Actual: `120`
- Duration: 512ms (0.51s)
- Timestamp: 3/19/2026, 4:39:02 PM
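The expected factorial value follows from the definition n! = n × (n - 1) × ... × 1, so 5! = 5 × 4 × 3 × 2 × 1 = 120. A short sketch for deriving that expected answer; `expected_factorial` is a hypothetical helper, cross-checked against the standard library's `math.factorial`:

```python
import math

def expected_factorial(n):
    """Iterative product 2 * 3 * ... * n; returns 1 for n = 0 or 1 by convention."""
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result

assert expected_factorial(5) == math.factorial(5)
print(expected_factorial(5))  # 120
```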
### fibonacci - anthropic/claude-sonnet-4

- Prompt: `Calculate the 6th number in the Fibonacci sequence (assuming F(1)=1, F(2)=1). Respond ONLY with the final numerical answer. No other text.`
- Expected: `8`
- Actual: `8`
- Duration: 4093ms (4.09s)
- Timestamp: 3/19/2026, 4:39:08 PM
### fibonacci - openai/gpt-4o-mini

- Prompt: `Calculate the 6th number in the Fibonacci sequence (assuming F(1)=1, F(2)=1). Respond ONLY with the final numerical answer. No other text.`
- Expected: `8`
- Actual: `8`
- Duration: 494ms (0.49s)
- Timestamp: 3/19/2026, 4:39:04 PM
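The Fibonacci prompt pins the indexing convention F(1) = F(2) = 1, so the sequence runs 1, 1, 2, 3, 5, 8 and the 6th term is 8. Without that convention, "the 6th number" could be read as counting from F(0) = 0 (0, 1, 1, 2, 3, 5), which gives 5. A minimal sketch under the prompt's convention; `fibonacci` here is a hypothetical reference helper:

```python
def fibonacci(n):
    """Return F(n) using the prompt's convention F(1) = F(2) = 1."""
    if n < 1:
        raise ValueError("n must be >= 1 under this convention")
    prev, curr = 0, 1  # F(0), F(1)
    for _ in range(n - 1):
        prev, curr = curr, prev + curr
    return curr

print([fibonacci(i) for i in range(1, 7)])  # [1, 1, 2, 3, 5, 8]
```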
### square_root - anthropic/claude-sonnet-4

- Prompt: `Calculate the square root of 16. Respond ONLY with the final numerical answer. No other text.`
- Expected: `4`
- Actual: `4`
- Duration: 969ms (0.97s)
- Timestamp: 3/19/2026, 4:39:11 PM
### square_root - openai/gpt-4o-mini

- Prompt: `Calculate the square root of 16. Respond ONLY with the final numerical answer. No other text.`
- Expected: `4`
- Actual: `4`
- Duration: 483ms (0.48s)
- Timestamp: 3/19/2026, 4:39:10 PM
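Because 16 is a perfect square (4 × 4 = 16), the expected square-root answer is the exact integer `4`. A sketch that derives the expected string, assuming integer results should be written without the trailing `.0` that `math.sqrt` would produce; `expected_sqrt` is a hypothetical helper:

```python
import math

def expected_sqrt(n):
    """Return the square root of a non-negative int as a string, exact when possible."""
    root = math.isqrt(n)  # exact integer floor of the square root
    if root * root == n:
        return str(root)  # perfect square: "4", not "4.0"
    return str(math.sqrt(n))  # otherwise fall back to a float

print(expected_sqrt(16))  # 4
```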
### power - anthropic/claude-sonnet-4

- Prompt: `Calculate 2 raised to the power of 3. Respond ONLY with the final numerical answer. No other text.`
- Expected: `8`
- Actual: `8`
- Duration: 1129ms (1.13s)
- Timestamp: 3/19/2026, 4:39:15 PM
### power - openai/gpt-4o-mini

- Prompt: `Calculate 2 raised to the power of 3. Respond ONLY with the final numerical answer. No other text.`
- Expected: `8`
- Actual: `8`
- Duration: 1308ms (1.31s)
- Timestamp: 3/19/2026, 4:39:14 PM
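For the power test, 2 raised to the power of 3 is 2 × 2 × 2 = 8. A minimal sketch of a reference helper using exponentiation by repeated squaring (a swapped-in technique, chosen only for illustration; `int_power` is hypothetical), cross-checked against Python's built-in `pow`:

```python
def int_power(base, exp):
    """Compute base**exp for a non-negative integer exponent by repeated squaring."""
    if exp < 0:
        raise ValueError("exp must be non-negative")
    result = 1
    while exp > 0:
        if exp & 1:      # lowest bit set: multiply the current power of base into result
            result *= base
        base *= base     # square base for the next bit of the exponent
        exp >>= 1
    return result

assert int_power(2, 3) == pow(2, 3)
print(int_power(2, 3))  # 8
```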