Math Operations Test Results
Highscores
| Test | Model | Duration (ms) | Duration (s) |
|---|---|---|---|
| factorial | openai/gpt-3.5-turbo | 827 | 0.83 |
| factorial | openai/gpt-4o-mini | 956 | 0.96 |
| square_root | openai/gpt-4o-mini | 964 | 0.96 |
| square_root | openai/gpt-3.5-turbo | 1080 | 1.08 |
| power | anthropic/claude-3.5-sonnet | 1136 | 1.14 |
| power | openai/gpt-4o-mini | 1259 | 1.26 |
| power | openai/gpt-3.5-turbo | 1498 | 1.50 |
| fibonacci | openai/gpt-3.5-turbo | 1543 | 1.54 |
| fibonacci | openai/gpt-4o-mini | 1673 | 1.67 |
| factorial | anthropic/claude-3.5-sonnet | 1853 | 1.85 |
| fibonacci | anthropic/claude-3.5-sonnet | 2004 | 2.00 |
| square_root | anthropic/claude-3.5-sonnet | 2012 | 2.01 |
| factorial | deepseek/deepseek-r1-distill-qwen-14b:free | 4814 | 4.81 |
| power | deepseek/deepseek-r1 | 5414 | 5.41 |
| square_root | qwen/qwq-32b | 5888 | 5.89 |
| square_root | deepseek/deepseek-r1-distill-qwen-14b:free | 6114 | 6.11 |
| quadratic | qwen/qwq-32b | 6795 | 6.79 |
| factorial | qwen/qwq-32b | 6892 | 6.89 |
| power | qwen/qwq-32b | 7572 | 7.57 |
| power | deepseek/deepseek-r1-distill-qwen-14b:free | 9891 | 9.89 |
| square_root | deepseek/deepseek-r1 | 10309 | 10.31 |
| factorial | deepseek/deepseek-r1 | 11193 | 11.19 |
Summary
- Total Tests: 29
- Passed: 22
- Failed: 7
- Success Rate: 75.86%
- Average Duration: 4745ms (4.75s)
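The summary figures follow directly from the counts and the millisecond timings. As a minimal sketch (the helper names `success_rate` and `ms_to_seconds` are hypothetical and not taken from the test harness), the arithmetic can be reproduced like this:

```python
def success_rate(passed: int, total: int) -> float:
    """Percentage of passed tests, rounded to two decimals."""
    return round(100 * passed / total, 2)


def ms_to_seconds(ms: float) -> float:
    """Convert a duration in milliseconds to seconds (divide by 1000)."""
    return ms / 1000


# Counts taken from the summary above.
print(success_rate(22, 29))   # 75.86
print(ms_to_seconds(4745))    # 4.745
```

The same conversion applies to every per-test duration: a value in milliseconds divided by 1000 gives seconds.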
Failed Tests
quadratic - anthropic/claude-3.5-sonnet
- Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
- Expected: -3,-2
- Actual: -2,-3
- Duration: 1892ms (1.89s)
- Reason: Expected -3,-2, but got -2,-3
- Timestamp: 4/2/2025, 3:32:51 PM
quadratic - openai/gpt-4o-mini
- Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
- Expected: -3,-2
- Actual: -2, -3
- Duration: 853ms (0.85s)
- Reason: Expected -3,-2, but got -2, -3
- Timestamp: 4/2/2025, 3:32:59 PM
quadratic - openai/gpt-3.5-turbo
- Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
- Expected: -3,-2
- Actual: -2, -3
- Duration: 832ms (0.83s)
- Reason: Expected -3,-2, but got -2, -3
- Timestamp: 4/2/2025, 3:32:59 PM
quadratic - deepseek/deepseek-r1
- Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
- Expected: -3,-2
- Actual: -2, -3
- Duration: 19850ms (19.85s)
- Reason: Expected -3,-2, but got -2, -3
- Timestamp: 4/2/2025, 3:33:19 PM
quadratic - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
- Expected: -3,-2
- Actual: `The solutions to the quadratic equation x² + 5x + 6 = 0 are x = -2 and x = -3.
-2,-3`
- Duration: 15811ms (15.81s)
- Reason: Expected -3,-2, but got the solutions to the quadratic equation x² + 5x + 6 = 0 are x = -2 and x = -3.
-2,-3
- Timestamp: 4/2/2025, 3:33:35 PM
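Four of the five quadratic failures above return the correct roots and differ from the expected string only in order or in a space after the comma. The report does not show how the harness compares answers; assuming it compares raw strings, a more lenient check could parse both sides and compare the sorted numbers. The sketch below is illustrative only, and `parse_numbers` and `roots_match` are hypothetical helpers:

```python
from typing import List, Optional


def parse_numbers(text: str) -> Optional[List[float]]:
    """Parse a comma-separated list of numbers; return None if any part is not numeric."""
    try:
        # float() tolerates surrounding whitespace, so "-2, -3" parses cleanly.
        return sorted(float(part) for part in text.split(","))
    except ValueError:
        return None


def roots_match(expected: str, actual: str) -> bool:
    """Order- and whitespace-insensitive comparison of two comma-separated number lists."""
    exp = parse_numbers(expected)
    act = parse_numbers(actual)
    return exp is not None and exp == act


print(roots_match("-3,-2", "-2,-3"))    # True
print(roots_match("-3,-2", "-2, -3"))   # True
print(roots_match("-3,-2", "x = -2 and x = -3.\n-2,-3"))  # False
```

Under this check, a response that adds explanatory text still fails, which matches the prompt's "no explanation" requirement.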
fibonacci - qwen/qwq-32b
- Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
- Expected: 8
- Actual: 5
- Duration: 1509ms (1.51s)
- Reason: Expected 8, but got 5
- Timestamp: 4/2/2025, 3:34:05 PM
fibonacci - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
- Expected: 8
- Actual: 5
- Duration: 5171ms (5.17s)
- Reason: Expected 8, but got 5
- Timestamp: 4/2/2025, 3:34:44 PM
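Both Fibonacci failures return 5 where 8 is expected. One plausible explanation, which the report does not confirm, is a difference in indexing convention: 8 is the 6th term when the sequence starts 1, 1, 2, 3, 5, 8, while 5 is the 6th term when it starts 0, 1, 1, 2, 3, 5. The sketch below illustrates both conventions; `nth_fibonacci` is a hypothetical helper, not part of the test suite:

```python
def nth_fibonacci(n: int, first: int = 1) -> int:
    """Return the n-th term (1-based position) of the sequence that starts `first`, 1, ..."""
    a, b = first, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a


print(nth_fibonacci(6, first=1))  # 8  (sequence 1, 1, 2, 3, 5, 8)
print(nth_fibonacci(6, first=0))  # 5  (sequence 0, 1, 1, 2, 3, 5)
```

If this is the cause, stating the starting terms in the prompt would remove the ambiguity.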
Passed Tests
quadratic - qwen/qwq-32b
- Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
- Expected: -3,-2
- Actual: -3,-2
- Duration: 6795ms (6.79s)
- Timestamp: 4/2/2025, 3:32:58 PM
factorial - anthropic/claude-3.5-sonnet
- Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
- Expected: 120
- Actual: 120
- Duration: 1853ms (1.85s)
- Timestamp: 4/2/2025, 3:33:37 PM
factorial - qwen/qwq-32b
- Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
- Expected: 120
- Actual: 120
- Duration: 6892ms (6.89s)
- Timestamp: 4/2/2025, 3:33:44 PM
factorial - openai/gpt-4o-mini
- Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
- Expected: 120
- Actual: 120
- Duration: 956ms (0.96s)
- Timestamp: 4/2/2025, 3:33:45 PM
factorial - openai/gpt-3.5-turbo
- Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
- Expected: 120
- Actual: 120
- Duration: 827ms (0.83s)
- Timestamp: 4/2/2025, 3:33:46 PM
factorial - deepseek/deepseek-r1
- Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
- Expected: 120
- Actual: 120
- Duration: 11193ms (11.19s)
- Timestamp: 4/2/2025, 3:33:57 PM
factorial - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
- Expected: 120
- Actual: 120
- Duration: 4814ms (4.81s)
- Timestamp: 4/2/2025, 3:34:02 PM
fibonacci - anthropic/claude-3.5-sonnet
- Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 2004ms (2.00s)
- Timestamp: 4/2/2025, 3:34:04 PM
fibonacci - openai/gpt-4o-mini
- Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 1673ms (1.67s)
- Timestamp: 4/2/2025, 3:34:07 PM
fibonacci - openai/gpt-3.5-turbo
- Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 1543ms (1.54s)
- Timestamp: 4/2/2025, 3:34:08 PM
square_root - anthropic/claude-3.5-sonnet
- Prompt: Calculate the square root of 16. Return only the number, no explanation.
- Expected: 4
- Actual: 4
- Duration: 2012ms (2.01s)
- Timestamp: 4/2/2025, 3:34:46 PM
square_root - qwen/qwq-32b
- Prompt: Calculate the square root of 16. Return only the number, no explanation.
- Expected: 4
- Actual: 4
- Duration: 5888ms (5.89s)
- Timestamp: 4/2/2025, 3:34:52 PM
square_root - openai/gpt-4o-mini
- Prompt: Calculate the square root of 16. Return only the number, no explanation.
- Expected: 4
- Actual: 4
- Duration: 964ms (0.96s)
- Timestamp: 4/2/2025, 3:34:52 PM
square_root - openai/gpt-3.5-turbo
- Prompt: Calculate the square root of 16. Return only the number, no explanation.
- Expected: 4
- Actual: 4
- Duration: 1080ms (1.08s)
- Timestamp: 4/2/2025, 3:34:54 PM
square_root - deepseek/deepseek-r1
- Prompt: Calculate the square root of 16. Return only the number, no explanation.
- Expected: 4
- Actual: 4
- Duration: 10309ms (10.31s)
- Timestamp: 4/2/2025, 3:35:04 PM
square_root - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: Calculate the square root of 16. Return only the number, no explanation.
- Expected: 4
- Actual: 4
- Duration: 6114ms (6.11s)
- Timestamp: 4/2/2025, 3:35:10 PM
power - anthropic/claude-3.5-sonnet
- Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 1136ms (1.14s)
- Timestamp: 4/2/2025, 3:35:11 PM
power - qwen/qwq-32b
- Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 7572ms (7.57s)
- Timestamp: 4/2/2025, 3:35:19 PM
power - openai/gpt-4o-mini
- Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 1259ms (1.26s)
- Timestamp: 4/2/2025, 3:35:20 PM
power - openai/gpt-3.5-turbo
- Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 1498ms (1.50s)
- Timestamp: 4/2/2025, 3:35:21 PM
power - deepseek/deepseek-r1
- Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 5414ms (5.41s)
- Timestamp: 4/2/2025, 3:35:27 PM
power - deepseek/deepseek-r1-distill-qwen-14b:free
- Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
- Expected: 8
- Actual: 8
- Duration: 9891ms (9.89s)
- Timestamp: 4/2/2025, 3:35:37 PM