mono/packages/kbot/tests/unit/reports/basic.md
2025-06-28 10:37:04 +02:00

Basic Operations Test Results

Highscores

Performance Rankings (Duration)

| Test | Model | Duration (ms) | Duration (s) |
|------|-------|---------------|--------------|
| addition | openai/gpt-4o-mini | 634 | 0.63 |
| addition | anthropic/claude-sonnet-4 | 1522 | 1.52 |
| addition | deepseek/deepseek-r1:free | 3394 | 3.39 |
| multiplication | anthropic/claude-sonnet-4 | 702 | 0.70 |
| multiplication | openai/gpt-4o-mini | 2765 | 2.77 |
| multiplication | deepseek/deepseek-r1:free | 3425 | 3.42 |
| division | openai/gpt-4o-mini | 564 | 0.56 |
| division | anthropic/claude-sonnet-4 | 1252 | 1.25 |
| division | deepseek/deepseek-r1:free | 4619 | 4.62 |
| web_content | anthropic/claude-sonnet-4 | 6161 | 6.16 |
| web_content | openai/gpt-4o-mini | 6225 | 6.22 |
| web_content | deepseek/deepseek-r1:free | 6879 | 6.88 |

Summary

  • Total Tests: 12
  • Passed: 9
  • Failed: 3
  • Success Rate: 75.00%
  • Average Duration: 3179ms (3.18s)
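The summary figures can be cross-checked against the per-test durations. The following is a minimal sketch, not the report generator itself: the `TestResult` shape and the `summarize` helper are hypothetical names introduced for illustration, and the rounding (nearest millisecond, two decimals for seconds and percentage) is an assumption chosen to match the figures shown above.

```typescript
// Hypothetical record shape; the real report generator may store results differently.
interface TestResult {
  test: string;
  model: string;
  durationMs: number;
  passed: boolean;
}

// Durations and pass/fail status copied from this report.
const results: TestResult[] = [
  { test: "addition", model: "openai/gpt-4o-mini", durationMs: 634, passed: true },
  { test: "addition", model: "anthropic/claude-sonnet-4", durationMs: 1522, passed: true },
  { test: "addition", model: "deepseek/deepseek-r1:free", durationMs: 3394, passed: true },
  { test: "multiplication", model: "anthropic/claude-sonnet-4", durationMs: 702, passed: true },
  { test: "multiplication", model: "openai/gpt-4o-mini", durationMs: 2765, passed: true },
  { test: "multiplication", model: "deepseek/deepseek-r1:free", durationMs: 3425, passed: false },
  { test: "division", model: "openai/gpt-4o-mini", durationMs: 564, passed: true },
  { test: "division", model: "anthropic/claude-sonnet-4", durationMs: 1252, passed: true },
  { test: "division", model: "deepseek/deepseek-r1:free", durationMs: 4619, passed: true },
  { test: "web_content", model: "anthropic/claude-sonnet-4", durationMs: 6161, passed: false },
  { test: "web_content", model: "openai/gpt-4o-mini", durationMs: 6225, passed: true },
  { test: "web_content", model: "deepseek/deepseek-r1:free", durationMs: 6879, passed: false },
];

interface Summary {
  total: number;
  passed: number;
  failed: number;
  successRate: string; // percentage with two decimals
  averageMs: number; // mean duration rounded to the nearest millisecond
  averageSeconds: string; // mean duration in seconds with two decimals
}

// Aggregate counts and the mean duration over all rows.
function summarize(rows: TestResult[]): Summary {
  const total = rows.length;
  const passed = rows.filter((r) => r.passed).length;
  const totalMs = rows.reduce((sum, r) => sum + r.durationMs, 0);
  const meanMs = total > 0 ? totalMs / total : 0;
  const rate = total > 0 ? (passed / total) * 100 : 0;
  return {
    total,
    passed,
    failed: total - passed,
    successRate: rate.toFixed(2),
    averageMs: Math.round(meanMs),
    averageSeconds: (meanMs / 1000).toFixed(2),
  };
}

const summary: Summary = summarize(results);
console.log(summary);
```

The twelve durations sum to 38142 ms, so the mean is 3178.5 ms, which rounds to the 3179 ms reported above.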

Failed Tests

multiplication - deepseek/deepseek-r1:free

  • Prompt: multiply 8 and 3. Return only the number, no explanation.
  • Expected: 24
  • Actual: The result of multiplying 8 and 3 is \boxed{24}.
  • Duration: 3425ms (3.42s)
  • Reason: Expected 24, but got The result of multiplying 8 and 3 is \boxed{24}.
  • Timestamp: 6/6/2025, 12:54:59 AM

web_content - anthropic/claude-sonnet-4

  • Prompt: Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.
  • Expected: yes
  • Actual: `Looking through the table of contents in the Wikipedia article on Kenya, I can see that there is indeed a section titled "Prehistory" under the History section.

    yes`
  • Duration: 6161ms (6.16s)
  • Reason: Expected yes, but got Looking through the table of contents in the Wikipedia article on Kenya, I can see that there is indeed a section titled "Prehistory" under the History section.

    yes
  • Timestamp: 6/6/2025, 12:55:12 AM

web_content - deepseek/deepseek-r1:free

  • Prompt: Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.
  • Expected: yes
  • Actual: ``
  • Duration: 6879ms (6.88s)
  • Reason: Model returned empty response
  • Timestamp: 6/6/2025, 12:55:25 AM
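Two of the three failures above contain the expected answer but wrap it in extra text (a `\boxed{}` wrapper, or a preamble before the final line), so an exact string comparison rejects them. The sketch below shows one possible lenient comparison. It is an illustration only: `normalizeAnswer` and `answersMatch` are hypothetical helpers, not part of kbot's test suite, and whether such leniency is desirable depends on whether the test is meant to check instruction following ("Return only the number") or only correctness.

```typescript
// Hypothetical helper: reduce a verbose model response to its final answer.
// Rule 1: if the response contains \boxed{...}, use the boxed content.
// Rule 2: otherwise use the last non-empty line.
function normalizeAnswer(raw: string): string {
  const boxed = raw.match(/\\boxed\{([^}]*)\}/);
  if (boxed) {
    return (boxed[1] ?? "").trim();
  }
  const lines = raw
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  return lines.length > 0 ? (lines[lines.length - 1] ?? "") : "";
}

// Hypothetical helper: case-insensitive comparison after normalization.
// An empty response never matches a non-empty expectation.
function answersMatch(expected: string, actual: string): boolean {
  const normalized = normalizeAnswer(actual);
  return normalized.length > 0 && normalized.toLowerCase() === expected.trim().toLowerCase();
}

// Responses taken from the failed tests in this report.
const boxedResponse = "The result of multiplying 8 and 3 is \\boxed{24}.";
const preambleResponse =
  'Looking through the table of contents in the Wikipedia article on Kenya, ' +
  'I can see that there is indeed a section titled "Prehistory" under the History section.\n\nyes';
const emptyResponse = "";

console.log(answersMatch("24", boxedResponse));
console.log(answersMatch("yes", preambleResponse));
console.log(answersMatch("yes", emptyResponse));
```

Under these rules the first two responses would count as matches, while the empty response from deepseek/deepseek-r1:free would still fail.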

Passed Tests

addition - anthropic/claude-sonnet-4

  • Prompt: add 5 and 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 1522ms (1.52s)
  • Timestamp: 6/6/2025, 12:54:48 AM

addition - openai/gpt-4o-mini

  • Prompt: add 5 and 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 634ms (0.63s)
  • Timestamp: 6/6/2025, 12:54:49 AM

addition - deepseek/deepseek-r1:free

  • Prompt: add 5 and 3. Return only the number, no explanation.
  • Expected: 8
  • Actual: 8
  • Duration: 3394ms (3.39s)
  • Timestamp: 6/6/2025, 12:54:53 AM

multiplication - anthropic/claude-sonnet-4

  • Prompt: multiply 8 and 3. Return only the number, no explanation.
  • Expected: 24
  • Actual: 24
  • Duration: 702ms (0.70s)
  • Timestamp: 6/6/2025, 12:54:53 AM

multiplication - openai/gpt-4o-mini

  • Prompt: multiply 8 and 3. Return only the number, no explanation.
  • Expected: 24
  • Actual: 24
  • Duration: 2765ms (2.77s)
  • Timestamp: 6/6/2025, 12:54:56 AM

division - anthropic/claude-sonnet-4

  • Prompt: divide 15 by 3. Return only the number, no explanation.
  • Expected: 5
  • Actual: 5
  • Duration: 1252ms (1.25s)
  • Timestamp: 6/6/2025, 12:55:01 AM

division - openai/gpt-4o-mini

  • Prompt: divide 15 by 3. Return only the number, no explanation.
  • Expected: 5
  • Actual: 5
  • Duration: 564ms (0.56s)
  • Timestamp: 6/6/2025, 12:55:01 AM

division - deepseek/deepseek-r1:free

  • Prompt: divide 15 by 3. Return only the number, no explanation.
  • Expected: 5
  • Actual: 5
  • Duration: 4619ms (4.62s)
  • Timestamp: 6/6/2025, 12:55:06 AM

web_content - openai/gpt-4o-mini

  • Prompt: Check if the content contains a section about Human prehistory. Reply with "yes" if it does, "no" if it does not.
  • Expected: yes
  • Actual: yes
  • Duration: 6225ms (6.22s)
  • Timestamp: 6/6/2025, 12:55:18 AM