# Basic Operations Test Results

## Highscores

### Performance Rankings (Duration)

| Test | Model | Duration (ms) | Duration (s) |
|------|-------|--------------|--------------|
| addition | openai/gpt-4o-mini | 634 | 0.63 |
| addition | anthropic/claude-sonnet-4 | 1522 | 1.52 |
| addition | deepseek/deepseek-r1:free | 3394 | 3.39 |
| multiplication | anthropic/claude-sonnet-4 | 702 | 0.70 |
| multiplication | openai/gpt-4o-mini | 2765 | 2.77 |
| multiplication | deepseek/deepseek-r1:free | 3425 | 3.42 |
| division | openai/gpt-4o-mini | 564 | 0.56 |
| division | anthropic/claude-sonnet-4 | 1252 | 1.25 |
| division | deepseek/deepseek-r1:free | 4619 | 4.62 |
| web_content | anthropic/claude-sonnet-4 | 6161 | 6.16 |
| web_content | openai/gpt-4o-mini | 6225 | 6.22 |
| web_content | deepseek/deepseek-r1:free | 6879 | 6.88 |

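The per-test ordering in the rankings table can be reproduced with a short script. This is only a sketch: the `ROWS` list and the `rank_by_duration` helper are hypothetical names introduced for illustration, with the values copied from the table; the tool that generated this report is not shown here and may work differently.

```python
from collections import defaultdict

# (test, model, duration_ms) rows copied from the rankings table.
ROWS = [
    ("addition", "openai/gpt-4o-mini", 634),
    ("addition", "anthropic/claude-sonnet-4", 1522),
    ("addition", "deepseek/deepseek-r1:free", 3394),
    ("multiplication", "anthropic/claude-sonnet-4", 702),
    ("multiplication", "openai/gpt-4o-mini", 2765),
    ("multiplication", "deepseek/deepseek-r1:free", 3425),
    ("division", "openai/gpt-4o-mini", 564),
    ("division", "anthropic/claude-sonnet-4", 1252),
    ("division", "deepseek/deepseek-r1:free", 4619),
    ("web_content", "anthropic/claude-sonnet-4", 6161),
    ("web_content", "openai/gpt-4o-mini", 6225),
    ("web_content", "deepseek/deepseek-r1:free", 6879),
]


def rank_by_duration(rows):
    """Group rows by test and order each group fastest-first."""
    grouped = defaultdict(list)
    for test, model, duration_ms in rows:
        grouped[test].append((duration_ms, model))
    return {test: sorted(entries) for test, entries in grouped.items()}


rankings = rank_by_duration(ROWS)
for test, entries in rankings.items():
    duration_ms, model = entries[0]  # fastest entry for this test
    print(f"{test}: {model} ({duration_ms} ms, {duration_ms / 1000:.2f} s)")
```

The ranking considers duration only, not correctness: the fastest `web_content` entry, anthropic/claude-sonnet-4, is one of the runs listed under Failed Tests.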
## Summary

- Total Tests: 12
- Passed: 9
- Failed: 3
- Success Rate: 75.00%
- Average Duration: 3179ms (3.18s)

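The summary figures follow from the twelve durations and the pass/fail counts. The sketch below recomputes them; the variable names are hypothetical, and the half-up rounding is an assumption made to match the reported 3179ms (the exact mean is 3178.5 ms, which Python's built-in `round` would turn into 3178 because it rounds halves to the nearest even integer).

```python
# Durations in ms for all 12 runs, copied from the rankings table.
durations_ms = [634, 1522, 3394, 702, 2765, 3425,
                564, 1252, 4619, 6161, 6225, 6879]
failed = 3  # number of entries under "Failed Tests"

total = len(durations_ms)
passed = total - failed
success_rate = passed / total * 100

mean_ms = sum(durations_ms) / total  # 38142 / 12 = 3178.5
average_ms = int(mean_ms + 0.5)      # assumed half-up rounding

print(f"Total Tests: {total}")
print(f"Passed: {passed}")
print(f"Failed: {failed}")
print(f"Success Rate: {success_rate:.2f}%")
print(f"Average Duration: {average_ms}ms ({average_ms / 1000:.2f}s)")
```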
## Failed Tests

### multiplication - deepseek/deepseek-r1:free

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `The result of multiplying 8 and 3 is \boxed{24}.`
- Duration: 3425ms (3.42s)
- Reason: Expected `24`, but got `The result of multiplying 8 and 3 is \boxed{24}.`
- Timestamp: 6/6/2025, 12:54:59 AM

### web_content - anthropic/claude-sonnet-4

- Prompt: `Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.`
- Expected: `yes`
- Actual: `Looking through the table of contents in the Wikipedia article on Kenya, I can see that there is indeed a section titled "Prehistory" under the History section.` followed by a blank line and a final line `yes`
- Duration: 6161ms (6.16s)
- Reason: Expected `yes`, but got the multi-line response shown under Actual (an explanatory sentence, a blank line, then `yes`)
- Timestamp: 6/6/2025, 12:55:12 AM

### web_content - deepseek/deepseek-r1:free

- Prompt: `Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.`
- Expected: `yes`
- Actual: (empty response)
- Duration: 6879ms (6.88s)
- Reason: Model returned empty response
- Timestamp: 6/6/2025, 12:55:25 AM

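All three failures are consistent with a strict comparison between the expected string and the raw model output. The report does not show the actual check, so the `evaluate` function below is a hypothetical sketch under that assumption; it mirrors only the reason strings that appear in the entries above.

```python
def evaluate(expected: str, actual: str):
    """Return (passed, reason) under an assumed exact-match rule."""
    if actual == "":
        return False, "Model returned empty response"
    if actual != expected:
        return False, f"Expected {expected}, but got {actual}"
    return True, None


# Inputs copied from the failed and passed entries in this report.
boxed = "The result of multiplying 8 and 3 is \\boxed{24}."
verbose = (
    "Looking through the table of contents in the Wikipedia article on Kenya, "
    'I can see that there is indeed a section titled "Prehistory" under the '
    "History section.\n\nyes"
)

print(evaluate("24", boxed))     # fails: extra text around the answer
print(evaluate("yes", verbose))  # fails: explanation precedes the answer
print(evaluate("yes", ""))       # fails: empty response
print(evaluate("8", "8"))        # passes: exact match
```

Under this assumed rule, the first two failures contain the correct answer but add surrounding text, which is what the "Return only the number" and "Reply with "yes"" instructions in the prompts ask the model to avoid.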
## Passed Tests

### addition - anthropic/claude-sonnet-4

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `8`
- Duration: 1522ms (1.52s)
- Timestamp: 6/6/2025, 12:54:48 AM

### addition - openai/gpt-4o-mini

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `8`
- Duration: 634ms (0.63s)
- Timestamp: 6/6/2025, 12:54:49 AM

### addition - deepseek/deepseek-r1:free

- Prompt: `add 5 and 3. Return only the number, no explanation.`
- Expected: `8`
- Actual: `8`
- Duration: 3394ms (3.39s)
- Timestamp: 6/6/2025, 12:54:53 AM

### multiplication - anthropic/claude-sonnet-4

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `24`
- Duration: 702ms (0.70s)
- Timestamp: 6/6/2025, 12:54:53 AM

### multiplication - openai/gpt-4o-mini

- Prompt: `multiply 8 and 3. Return only the number, no explanation.`
- Expected: `24`
- Actual: `24`
- Duration: 2765ms (2.77s)
- Timestamp: 6/6/2025, 12:54:56 AM

### division - anthropic/claude-sonnet-4

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 1252ms (1.25s)
- Timestamp: 6/6/2025, 12:55:01 AM

### division - openai/gpt-4o-mini

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 564ms (0.56s)
- Timestamp: 6/6/2025, 12:55:01 AM

### division - deepseek/deepseek-r1:free

- Prompt: `divide 15 by 3. Return only the number, no explanation.`
- Expected: `5`
- Actual: `5`
- Duration: 4619ms (4.62s)
- Timestamp: 6/6/2025, 12:55:06 AM

### web_content - openai/gpt-4o-mini

- Prompt: `Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.`
- Expected: `yes`
- Actual: `yes`
- Duration: 6225ms (6.22s)
- Timestamp: 6/6/2025, 12:55:18 AM