Basic Operations Test Results
Highscores
Performance Rankings (Duration)
| Test | Model | Duration (ms) | Duration (s) |
|---|---|---|---|
| addition | openai/gpt-4o-mini | 893 | 0.89 |
| addition | deepseek/deepseek-r1-distill-qwen-14b:free | 1657 | 1.66 |
| addition | openai/gpt-3.5-turbo | 1930 | 1.93 |
| addition | openrouter/quasar-alpha | 2215 | 2.21 |
| multiplication | openai/gpt-3.5-turbo | 868 | 0.87 |
| multiplication | openrouter/quasar-alpha | 930 | 0.93 |
| multiplication | openai/gpt-4o-mini | 967 | 0.97 |
| multiplication | deepseek/deepseek-r1-distill-qwen-14b:free | 1139 | 1.14 |
| division | openai/gpt-4o-mini | 752 | 0.75 |
| division | openai/gpt-3.5-turbo | 913 | 0.91 |
| division | openrouter/quasar-alpha | 1182 | 1.18 |
| division | deepseek/deepseek-r1-distill-qwen-14b:free | 1626 | 1.63 |
| web_content | deepseek/deepseek-r1-distill-qwen-14b:free | 261 | 0.26 |
| web_content | openai/gpt-3.5-turbo | 3352 | 3.35 |
| web_content | openai/gpt-4o-mini | 8906 | 8.91 |
| web_content | openrouter/quasar-alpha | 11391 | 11.39 |
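The table appears to be grouped by test and sorted fastest-first within each group, with the seconds column derived from the millisecond value. A minimal sketch of that ordering follows, assuming rows are simple `(test, model, duration_ms)` tuples. `rank_by_duration` and `format_row` are hypothetical names for illustration, not the harness's actual code.

```python
# A few (test, model, duration_ms) rows copied from the table, deliberately out of order.
results = [
    ("addition", "openai/gpt-3.5-turbo", 1930),
    ("addition", "openai/gpt-4o-mini", 893),
    ("web_content", "openai/gpt-4o-mini", 8906),
    ("web_content", "deepseek/deepseek-r1-distill-qwen-14b:free", 261),
]

def rank_by_duration(rows):
    """Keep tests in first-seen order and sort each test's rows fastest-first."""
    test_order = list(dict.fromkeys(test for test, _, _ in rows))
    return sorted(rows, key=lambda r: (test_order.index(r[0]), r[2]))

def format_row(test, model, duration_ms):
    """Render one table row; seconds are milliseconds / 1000, two decimals."""
    return f"| {test} | {model} | {duration_ms} | {duration_ms / 1000:.2f} |"

for row in rank_by_duration(results):
    print(format_row(*row))
```

The ranking uses duration only. The 261 ms web_content entry, for instance, is listed under Failed Tests with an empty response, so a fast time here does not imply a correct answer.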
Summary
- Total Tests: 16
- Passed: 12
- Failed: 4
- Success Rate: 75.00%
- Average Duration: 2436ms (2.44s)
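The summary figures can be re-derived from the durations in the rankings table. The sketch below is one way to do that, assuming the average is taken over all 16 runs (failed ones included) and rounded to the nearest millisecond. `summarize` is a hypothetical helper, not part of the original harness.

```python
# Durations in milliseconds, copied from the rankings table.
durations_ms = {
    "addition": [893, 1657, 1930, 2215],
    "multiplication": [868, 930, 967, 1139],
    "division": [752, 913, 1182, 1626],
    "web_content": [261, 3352, 8906, 11391],
}
# Per the report, every web_content run returned an empty response.
failed_tests = {"web_content"}

def summarize(durations, failed):
    """Compute counts, success rate (%), and mean duration over all runs."""
    all_durations = [d for runs in durations.values() for d in runs]
    total = len(all_durations)
    n_failed = sum(len(runs) for test, runs in durations.items() if test in failed)
    passed = total - n_failed
    return {
        "total": total,
        "passed": passed,
        "failed": n_failed,
        "success_rate": 100 * passed / total,
        "average_ms": round(sum(all_durations) / total),
    }

summary = summarize(durations_ms, failed_tests)
print(summary)
```

The durations sum to 38982 ms, and 38982 / 16 = 2436.375, which matches the reported 2436ms (2.44s).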
Failed Tests
web_content - openai/gpt-3.5-turbo
- Prompt: Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.
- Expected: yes
- Actual: ``
- Duration: 3352ms (3.35s)
- Reason: Model returned empty response
- Timestamp: 4/7/2025, 7:03:50 PM
web_content - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.
- Expected: yes
- Actual: ``
- Duration: 261ms (0.26s)
- Reason: Model returned empty response
- Timestamp: 4/7/2025, 7:03:50 PM
web_content - openai/gpt-4o-mini
- Prompt: Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.
- Expected: yes
- Actual: ``
- Duration: 8906ms (8.91s)
- Reason: Model returned empty response
- Timestamp: 4/7/2025, 7:03:59 PM
web_content - openrouter/quasar-alpha
- Prompt: Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.
- Expected: yes
- Actual: ``
- Duration: 11391ms (11.39s)
- Reason: Model returned empty response
- Timestamp: 4/7/2025, 7:04:11 PM
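All four failures above share the reason "Model returned empty response". The report does not show how the harness decides pass or fail, so the following is only a sketch of one plausible check. `evaluate` is a hypothetical helper, and the whitespace trimming and case-insensitive comparison are assumptions.

```python
from typing import Optional, Tuple

def evaluate(expected: str, actual: Optional[str]) -> Tuple[bool, Optional[str]]:
    """Return (passed, reason); reason is None when the test passes."""
    # Assumption: surrounding whitespace is ignored before comparing.
    text = (actual or "").strip()
    if not text:
        # Mirrors the failure reason recorded in the report.
        return False, "Model returned empty response"
    # Assumption: comparison is case-insensitive ("Yes" counts as "yes").
    if text.lower() != expected.strip().lower():
        return False, f"Expected {expected!r}, got {text!r}"
    return True, None

print(evaluate("yes", ""))     # empty response, as in the web_content runs
print(evaluate("8", " 8\n"))   # trimmed match, as in the addition runs
```

Checking for an empty response before comparing keeps the failure reason specific: an empty string is reported as such rather than as a generic mismatch.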
Passed Tests
addition - openai/gpt-3.5-turbo
- Prompt: add 5 and 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 1930ms (1.93s)
- Timestamp: 4/7/2025, 7:03:33 PM
addition - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: add 5 and 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 1657ms (1.66s)
- Timestamp: 4/7/2025, 7:03:35 PM
addition - openai/gpt-4o-mini
- Prompt: add 5 and 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 893ms (0.89s)
- Timestamp: 4/7/2025, 7:03:36 PM
addition - openrouter/quasar-alpha
- Prompt: add 5 and 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 2215ms (2.21s)
- Timestamp: 4/7/2025, 7:03:38 PM
multiplication - openai/gpt-3.5-turbo
- Prompt: multiply 8 and 3. Return only the number, no explanation.
- Expected: 24
- Actual: 24
- Duration: 868ms (0.87s)
- Timestamp: 4/7/2025, 7:03:39 PM
multiplication - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: multiply 8 and 3. Return only the number, no explanation.
- Expected: 24
- Actual: 24
- Duration: 1139ms (1.14s)
- Timestamp: 4/7/2025, 7:03:40 PM
multiplication - openai/gpt-4o-mini
- Prompt: multiply 8 and 3. Return only the number, no explanation.
- Expected: 24
- Actual: 24
- Duration: 967ms (0.97s)
- Timestamp: 4/7/2025, 7:03:41 PM
multiplication - openrouter/quasar-alpha
- Prompt: multiply 8 and 3. Return only the number, no explanation.
- Expected: 24
- Actual: 24
- Duration: 930ms (0.93s)
- Timestamp: 4/7/2025, 7:03:42 PM
division - openai/gpt-3.5-turbo
- Prompt: divide 15 by 3. Return only the number, no explanation.
- Expected: 5
- Actual: 5
- Duration: 913ms (0.91s)
- Timestamp: 4/7/2025, 7:03:43 PM
division - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: divide 15 by 3. Return only the number, no explanation.
- Expected: 5
- Actual: 5
- Duration: 1626ms (1.63s)
- Timestamp: 4/7/2025, 7:03:45 PM
division - openai/gpt-4o-mini
- Prompt: divide 15 by 3. Return only the number, no explanation.
- Expected: 5
- Actual: 5
- Duration: 752ms (0.75s)
- Timestamp: 4/7/2025, 7:03:45 PM
division - openrouter/quasar-alpha
- Prompt: divide 15 by 3. Return only the number, no explanation.
- Expected: 5
- Actual: 5
- Duration: 1182ms (1.18s)
- Timestamp: 4/7/2025, 7:03:47 PM