# Basic Operations Test Results

## Highscores

| Test | Model | Duration (ms) | Duration (s) |
|------|-------|--------------|--------------|
| addition | openai/gpt-4o-mini | 885 | 0.89 |
| division | openai/gpt-3.5-turbo | 889 | 0.89 |
| division | qwen/qwq-32b | 917 | 0.92 |
| multiplication | openai/gpt-3.5-turbo | 984 | 0.98 |
| division | openai/gpt-4o-mini | 1104 | 1.10 |
| multiplication | openai/gpt-4o-mini | 1111 | 1.11 |
| multiplication | anthropic/claude-3.5-sonnet | 1190 | 1.19 |
| division | anthropic/claude-3.5-sonnet | 1405 | 1.41 |
| multiplication | deepseek/deepseek-r1-distill-qwen-14b:free | 1558 | 1.56 |
| addition | anthropic/claude-3.5-sonnet | 1689 | 1.69 |
| division | deepseek/deepseek-r1-distill-qwen-14b:free | 3646 | 3.65 |
| addition | qwen/qwq-32b | 3807 | 3.81 |
| multiplication | qwen/qwq-32b | 5008 | 5.01 |
| division | deepseek/deepseek-r1 | 7130 | 7.13 |
| addition | openai/gpt-3.5-turbo | 10455 | 10.46 |
| addition | deepseek/deepseek-r1 | 12064 | 12.06 |

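
The table lists the passing runs in ascending order of duration, with each time shown in both milliseconds and seconds. The report does not include the code that produced it, so the following is only a minimal sketch of how such rows could be built. The helper name `ms_to_seconds`, the tuple layout, and the half-up rounding rule are assumptions chosen to reproduce the values shown (for example 1405 ms as 1.41 s).

```python
def ms_to_seconds(ms: int) -> str:
    """Format integer milliseconds as seconds with two decimals (round half up).

    Integer arithmetic is used so the result does not depend on
    floating-point rounding. Hypothetical helper, not taken from the report.
    """
    centis = (ms + 5) // 10  # hundredths of a second, rounded half up
    return f"{centis // 100}.{centis % 100:02d}"


# Three (test, model, duration_ms) entries copied from the table above.
results = [
    ("addition", "openai/gpt-3.5-turbo", 10455),
    ("division", "openai/gpt-3.5-turbo", 889),
    ("addition", "openai/gpt-4o-mini", 885),
]

# Fastest first, matching the ordering of the Highscores table.
ranked = sorted(results, key=lambda r: r[2])
rows = [f"| {test} | {model} | {ms} | {ms_to_seconds(ms)} |" for test, model, ms in ranked]
print("\n".join(rows))
```

Run on these three entries, the sketch prints them in the same relative order and with the same second values as the table.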
## Summary

- Total Tests: 18
- Passed: 16
- Failed: 2
- Success Rate: 88.89%
- Average Duration: 3639ms (3.64s)

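
The summary figures can be checked against the per-test durations listed in this report. The sketch below is an illustration, not the harness's own code: the variable names are made up, and it assumes the average covers all 18 runs, failed ones included, which is the reading that reproduces 3639ms.

```python
# Durations in milliseconds, copied from the individual test entries in this report.
passed_ms = [885, 889, 917, 984, 1104, 1111, 1190, 1405,
             1558, 1689, 3646, 3807, 5008, 7130, 10455, 12064]
failed_ms = [6405, 5258]

total = len(passed_ms) + len(failed_ms)
success_rate = 100 * len(passed_ms) / total

# Assumption: the average is taken over every run, passed or failed.
average_ms = round((sum(passed_ms) + sum(failed_ms)) / total)

print(f"Total Tests: {total}")
print(f"Success Rate: {success_rate:.2f}%")
print(f"Average Duration: {average_ms}ms")
```

Averaging only the 16 passing runs would give roughly 3365ms instead, so the reported figure appears to include the two failures.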
## Failed Tests

### addition - deepseek/deepseek-r1-distill-qwen-14b:free

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `The sum of 5 and 3 is 8. Therefore, the result is \boxed{8}.`
- Duration: 6405ms (6.41s)
- Reason: Expected 8, but got the sum of 5 and 3 is 8. therefore, the result is \boxed{8}.
- Timestamp: 4/2/2025, 3:44:40 PM

### multiplication - deepseek/deepseek-r1

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `24

  24

  The result is 24.

  24

  Here's the answer: 24

  The answer will be 24.

  24

  24

  The product of 8 and 3 is 24.

  24

  The answer is 24.

  24

  24

  24

  The result is 24.

  24

  Here's the numerical result: 24

  The answer is 24.

  24

  24

  The answer is 24.`
- Duration: 5258ms (5.26s)
- Reason: Expected 24, but got 24

  24

  the result is 24.

  24

  here's the answer: 24

  the answer will be 24.

  24

  24

  the product of 8 and 3 is 24.

  24

  the answer is 24.

  24

  24

  24

  the result is 24.

  24

  here's the numerical result: 24

  the answer is 24.

  24

  24

  the answer is 24.
- Timestamp: 4/2/2025, 3:44:53 PM

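
Both failures contain the correct number but wrap it in extra text, which suggests the harness compares the whole response against the expected value. The comparison code is not part of this report, so the sketch below is an assumption: the function name `passes` is hypothetical, and the trim-and-lowercase normalization is inferred only from the lowercased text in the `Reason` lines.

```python
def passes(expected: str, actual: str) -> bool:
    """Strict check: the entire normalized response must equal the expected value.

    Hypothetical reconstruction; the real harness may normalize differently.
    """
    return actual.strip().lower() == expected.strip().lower()


# Response recorded for the failed addition test.
boxed = r"The sum of 5 and 3 is 8. Therefore, the result is \boxed{8}."

# First lines of the response recorded for the failed multiplication test.
repeated = "24\n\n24\n\nThe result is 24."

print(passes("8", "8"))        # a bare number matches
print(passes("8", boxed))      # surrounding explanation breaks the match
print(passes("24", repeated))  # repeated answers break the match too
```

Under such a strict check, any response that is not exactly the bare number fails, even when the number itself is right.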
## Passed Tests

### addition - anthropic/claude-3.5-sonnet

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `8`
- Duration: 1689ms (1.69s)
- Timestamp: 4/2/2025, 3:44:06 PM

### addition - qwen/qwq-32b

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `8`
- Duration: 3807ms (3.81s)
- Timestamp: 4/2/2025, 3:44:10 PM

### addition - openai/gpt-4o-mini

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `8`
- Duration: 885ms (0.89s)
- Timestamp: 4/2/2025, 3:44:11 PM

### addition - openai/gpt-3.5-turbo

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `8`
- Duration: 10455ms (10.46s)
- Timestamp: 4/2/2025, 3:44:21 PM

### addition - deepseek/deepseek-r1

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `8`
- Duration: 12064ms (12.06s)
- Timestamp: 4/2/2025, 3:44:33 PM

### multiplication - anthropic/claude-3.5-sonnet

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `24`
- Duration: 1190ms (1.19s)
- Timestamp: 4/2/2025, 3:44:41 PM

### multiplication - qwen/qwq-32b

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `24`
- Duration: 5008ms (5.01s)
- Timestamp: 4/2/2025, 3:44:46 PM

### multiplication - openai/gpt-4o-mini

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `24`
- Duration: 1111ms (1.11s)
- Timestamp: 4/2/2025, 3:44:47 PM

### multiplication - openai/gpt-3.5-turbo

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `24`
- Duration: 984ms (0.98s)
- Timestamp: 4/2/2025, 3:44:48 PM

### multiplication - deepseek/deepseek-r1-distill-qwen-14b:free

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `24`
- Duration: 1558ms (1.56s)
- Timestamp: 4/2/2025, 3:44:55 PM

### division - anthropic/claude-3.5-sonnet

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 1405ms (1.41s)
- Timestamp: 4/2/2025, 3:44:56 PM

### division - qwen/qwq-32b

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 917ms (0.92s)
- Timestamp: 4/2/2025, 3:44:57 PM

### division - openai/gpt-4o-mini

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 1104ms (1.10s)
- Timestamp: 4/2/2025, 3:44:58 PM

### division - openai/gpt-3.5-turbo

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 889ms (0.89s)
- Timestamp: 4/2/2025, 3:44:59 PM

### division - deepseek/deepseek-r1

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 7130ms (7.13s)
- Timestamp: 4/2/2025, 3:45:06 PM

### division - deepseek/deepseek-r1-distill-qwen-14b:free

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 3646ms (3.65s)
- Timestamp: 4/2/2025, 3:45:10 PM