mono/packages/kbot/tests/unit/reports/math.md
2025-04-02 16:26:00 +02:00

Math Operations Test Results

Highscores

| Test | Model | Duration (ms) | Duration (s) |
|------|-------|---------------|--------------|
| factorial | openai/gpt-3.5-turbo | 827 | 0.83 |
| factorial | openai/gpt-4o-mini | 956 | 0.96 |
| square_root | openai/gpt-4o-mini | 964 | 0.96 |
| square_root | openai/gpt-3.5-turbo | 1080 | 1.08 |
| power | anthropic/claude-3.5-sonnet | 1136 | 1.14 |
| power | openai/gpt-4o-mini | 1259 | 1.26 |
| power | openai/gpt-3.5-turbo | 1498 | 1.50 |
| fibonacci | openai/gpt-3.5-turbo | 1543 | 1.54 |
| fibonacci | openai/gpt-4o-mini | 1673 | 1.67 |
| factorial | anthropic/claude-3.5-sonnet | 1853 | 1.85 |
| fibonacci | anthropic/claude-3.5-sonnet | 2004 | 2.00 |
| square_root | anthropic/claude-3.5-sonnet | 2012 | 2.01 |
| factorial | deepseek/deepseek-r1-distill-qwen-14b:free | 4814 | 4.81 |
| power | deepseek/deepseek-r1 | 5414 | 5.41 |
| square_root | qwen/qwq-32b | 5888 | 5.89 |
| square_root | deepseek/deepseek-r1-distill-qwen-14b:free | 6114 | 6.11 |
| quadratic | qwen/qwq-32b | 6795 | 6.79 |
| factorial | qwen/qwq-32b | 6892 | 6.89 |
| power | qwen/qwq-32b | 7572 | 7.57 |
| power | deepseek/deepseek-r1-distill-qwen-14b:free | 9891 | 9.89 |
| square_root | deepseek/deepseek-r1 | 10309 | 10.31 |
| factorial | deepseek/deepseek-r1 | 11193 | 11.19 |

Summary

  • Total Tests: 29
  • Passed: 22
  • Failed: 7
  • Success Rate: 75.86%
  • Average Duration: 4745ms (4.75s)
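
The summary figures follow from simple arithmetic: the success rate is passed tests divided by total tests, and a duration in seconds is the millisecond value divided by 1000. The sketch below reproduces that arithmetic; the function names `success_rate` and `format_duration` are hypothetical and are not taken from the kbot test harness.

```python
def success_rate(passed: int, total: int) -> str:
    """Return the pass percentage formatted with two decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return f"{passed / total * 100:.2f}%"


def format_duration(ms: float) -> str:
    """Render a duration as 'Nms (S.SSs)', dividing by 1000 for seconds."""
    return f"{ms:.0f}ms ({ms / 1000:.2f}s)"


print(success_rate(22, 29))   # 75.86%
print(format_duration(1892))  # 1892ms (1.89s)
```

With the counts from this report (22 passed out of 29), the helper yields the same 75.86% shown in the summary.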

Failed Tests

quadratic - anthropic/claude-3.5-sonnet

  • Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
  • Expected: -3,-2
  • Actual: -2,-3
  • Duration: 1892ms (1.89s)
  • Reason: Expected -3,-2, but got -2,-3
  • Timestamp: 4/2/2025, 3:32:51 PM

quadratic - openai/gpt-4o-mini

  • Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
  • Expected: -3,-2
  • Actual: -2, -3
  • Duration: 853ms (0.85s)
  • Reason: Expected -3,-2, but got -2, -3
  • Timestamp: 4/2/2025, 3:32:59 PM

quadratic - openai/gpt-3.5-turbo

  • Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
  • Expected: -3,-2
  • Actual: -2, -3
  • Duration: 832ms (0.83s)
  • Reason: Expected -3,-2, but got -2, -3
  • Timestamp: 4/2/2025, 3:32:59 PM

quadratic - deepseek/deepseek-r1

  • Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
  • Expected: -3,-2
  • Actual: -2, -3
  • Duration: 19850ms (19.85s)
  • Reason: Expected -3,-2, but got -2, -3
  • Timestamp: 4/2/2025, 3:33:19 PM

quadratic - deepseek/deepseek-r1-distill-qwen-14b:free

  • Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
  • Expected: -3,-2
  • Actual (multi-line response):
      The solutions to the quadratic equation x² + 5x + 6 = 0 are x = -2 and x = -3.

      -2,-3
  • Duration: 15811ms (15.81s)
  • Reason: Expected -3,-2, but got "the solutions to the quadratic equation x² + 5x + 6 = 0 are x = -2 and x = -3." followed by "-2,-3"
  • Timestamp: 4/2/2025, 3:33:35 PM
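
All five quadratic failures contain the correct roots; they differ from the expected string only in ordering, spacing after the comma, or an extra explanatory sentence. As a rough sketch of how a more tolerant comparison could work (an assumption for illustration, not the harness's actual check), the hypothetical helpers below parse the last non-empty line of a reply as numbers and compare them in sorted order. Whether a reply that ignores "no explanation" should still pass is a separate design decision.

```python
def normalize_number_list(response: str) -> list[float]:
    """Parse the last non-empty line of a reply as comma-separated numbers, sorted."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    if not lines:
        raise ValueError("empty response")
    # float() tolerates surrounding whitespace, so "-2, -3" parses cleanly.
    return sorted(float(part) for part in lines[-1].split(","))


def same_solutions(expected: str, actual: str) -> bool:
    """Compare two answers as unordered number lists; unparseable replies fail."""
    try:
        return normalize_number_list(expected) == normalize_number_list(actual)
    except ValueError:
        return False


print(same_solutions("-3,-2", "-2,-3"))   # True
print(same_solutions("-3,-2", "-2, -3"))  # True
print(same_solutions("-3,-2", "-2,-4"))   # False
```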

fibonacci - qwen/qwq-32b

  • Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
  • Expected: 8
  • Actual: 5
  • Duration: 1509ms (1.51s)
  • Reason: Expected 8, but got 5
  • Timestamp: 4/2/2025, 3:34:05 PM

fibonacci - deepseek/deepseek-r1-distill-qwen-14b:free

  • Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
  • Expected: 8
  • Actual: 5
  • Duration: 5171ms (5.17s)
  • Reason: Expected 8, but got 5
  • Timestamp: 4/2/2025, 3:34:44 PM
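
One plausible reading of the two `5` answers, which the report itself does not confirm, is an indexing ambiguity in the prompt: the sequence is commonly written starting either 1, 1, ... or 0, 1, ..., and the "6th number" differs between the two. The sketch below (with a hypothetical helper name) lists both variants.

```python
def fibonacci_sequence(count: int, first: int, second: int) -> list[int]:
    """Return the first `count` terms of a Fibonacci-style sequence."""
    terms = [first, second]
    while len(terms) < count:
        terms.append(terms[-1] + terms[-2])
    return terms[:count]


starting_at_one = fibonacci_sequence(6, 1, 1)   # [1, 1, 2, 3, 5, 8]
starting_at_zero = fibonacci_sequence(6, 0, 1)  # [0, 1, 1, 2, 3, 5]

print(starting_at_one[-1])   # 8, the expected value in this report
print(starting_at_zero[-1])  # 5, the value returned by the two failing models
```

If that reading is right, stating the starting terms in the prompt would remove the ambiguity.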

Passed Tests

quadratic - qwen/qwq-32b

  • Prompt: Solve the quadratic equation x² + 5x + 6 = 0. Return only the solutions as comma-separated numbers, no explanation.
  • Expected: -3,-2
  • Actual: -3,-2
  • Duration: 6795ms (6.79s)
  • Timestamp: 4/2/2025, 3:32:58 PM

factorial - anthropic/claude-3.5-sonnet

  • Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
  • Expected: 120
  • Actual: 120
  • Duration: 1853ms (1.85s)
  • Timestamp: 4/2/2025, 3:33:37 PM

factorial - qwen/qwq-32b

  • Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
  • Expected: 120
  • Actual: 120
  • Duration: 6892ms (6.89s)
  • Timestamp: 4/2/2025, 3:33:44 PM

factorial - openai/gpt-4o-mini

  • Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
  • Expected: 120
  • Actual: 120
  • Duration: 956ms (0.96s)
  • Timestamp: 4/2/2025, 3:33:45 PM

factorial - openai/gpt-3.5-turbo

  • Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
  • Expected: 120
  • Actual: 120
  • Duration: 827ms (0.83s)
  • Timestamp: 4/2/2025, 3:33:46 PM

factorial - deepseek/deepseek-r1

  • Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
  • Expected: 120
  • Actual: 120
  • Duration: 11193ms (11.19s)
  • Timestamp: 4/2/2025, 3:33:57 PM

factorial - deepseek/deepseek-r1-distill-qwen-14b:free

  • Prompt: Calculate 5! (factorial of 5). Return only the number, no explanation.
  • Expected: 120
  • Actual: 120
  • Duration: 4814ms (4.81s)
  • Timestamp: 4/2/2025, 3:34:02 PM

fibonacci - anthropic/claude-3.5-sonnet

  • Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 2004ms (2.00s)
  • Timestamp: 4/2/2025, 3:34:04 PM

fibonacci - openai/gpt-4o-mini

  • Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 1673ms (1.67s)
  • Timestamp: 4/2/2025, 3:34:07 PM

fibonacci - openai/gpt-3.5-turbo

  • Prompt: Calculate the 6th number in the Fibonacci sequence. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 1543ms (1.54s)
  • Timestamp: 4/2/2025, 3:34:08 PM

square_root - anthropic/claude-3.5-sonnet

  • Prompt: Calculate the square root of 16. Return only the number, no explanation.
  • Expected: 4
  • Actual: 4
  • Duration: 2012ms (2.01s)
  • Timestamp: 4/2/2025, 3:34:46 PM

square_root - qwen/qwq-32b

  • Prompt: Calculate the square root of 16. Return only the number, no explanation.
  • Expected: 4
  • Actual: 4
  • Duration: 5888ms (5.89s)
  • Timestamp: 4/2/2025, 3:34:52 PM

square_root - openai/gpt-4o-mini

  • Prompt: Calculate the square root of 16. Return only the number, no explanation.
  • Expected: 4
  • Actual: 4
  • Duration: 964ms (0.96s)
  • Timestamp: 4/2/2025, 3:34:52 PM

square_root - openai/gpt-3.5-turbo

  • Prompt: Calculate the square root of 16. Return only the number, no explanation.
  • Expected: 4
  • Actual: 4
  • Duration: 1080ms (1.08s)
  • Timestamp: 4/2/2025, 3:34:54 PM

square_root - deepseek/deepseek-r1

  • Prompt: Calculate the square root of 16. Return only the number, no explanation.
  • Expected: 4
  • Actual: 4
  • Duration: 10309ms (10.31s)
  • Timestamp: 4/2/2025, 3:35:04 PM

square_root - deepseek/deepseek-r1-distill-qwen-14b:free

  • Prompt: Calculate the square root of 16. Return only the number, no explanation.
  • Expected: 4
  • Actual: 4
  • Duration: 6114ms (6.11s)
  • Timestamp: 4/2/2025, 3:35:10 PM

power - anthropic/claude-3.5-sonnet

  • Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 1136ms (1.14s)
  • Timestamp: 4/2/2025, 3:35:11 PM

power - qwen/qwq-32b

  • Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 7572ms (7.57s)
  • Timestamp: 4/2/2025, 3:35:19 PM

power - openai/gpt-4o-mini

  • Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 1259ms (1.26s)
  • Timestamp: 4/2/2025, 3:35:20 PM

power - openai/gpt-3.5-turbo

  • Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 1498ms (1.50s)
  • Timestamp: 4/2/2025, 3:35:21 PM

power - deepseek/deepseek-r1

  • Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 5414ms (5.41s)
  • Timestamp: 4/2/2025, 3:35:27 PM

power - deepseek/deepseek-r1-distill-qwen-14b:free

  • Prompt: Calculate 2 raised to the power of 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 9891ms (9.89s)
  • Timestamp: 4/2/2025, 3:35:37 PM