

January 07, 2026

Can an LLM get an “IQ score”? I tried Math Kangaroo instead

Author: Jakub Berezowski, Data Scientist

Reading time: 10 minutes


LLMs keep getting better. Every few weeks there’s a new benchmark, a new leaderboard, a new “Model A beats Model B” headline.

The problem is that most of these scores are relative: they tell us which model is better than another model, but not what that means in terms of difficulty that humans actually recognize.

Around 2025 this gap got even more noticeable, because it became common to see headlines and posts claiming that frontier models were scoring at gold medal level on the International Mathematical Olympiad (IMO). That’s an incredible achievement—and it’s also the kind of story that makes people jump to a bigger conclusion: “So… do these systems basically have high IQ?”

I wanted a test that felt more human-grounded. Something many students have actually taken. Something with puzzles that don’t look like textbook exercises.

So I ran an experiment using Math Kangaroo.

Math Kangaroo

I chose it not because it’s “harder” than the IMO (it isn’t), but because it’s different: short questions, tricky logic, lots of diagrams, pattern-spotting, and those deceptively simple puzzles that punish overconfidence.

Even though Math Kangaroo problems are designed for children and teenagers, experiments like this have clear business value: they expose how models behave in semi-formal, ambiguous, and visually grounded tasks that closely resemble real decision-making problems in applied settings.

The question I wanted to answer was simple:

If a language model scores well on Math Kangaroo, does that feel like a meaningful, human-shaped signal of reasoning ability?

And just as importantly: where do these models stumble when the problems stop being formal and start being “puzzly”?

Why Math Kangaroo works well for this kind of test

Math Kangaroo has two features that make it useful as a “human difficulty probe”:

  1. It’s familiar and widely taken (especially across Europe).
  2. The scoring is intuitive – and in the format I used, guessing is penalized.

Here’s the scoring setup (the 150-point version):

  • 30 multiple-choice questions
  • 75 minutes
  • Questions are worth 3 / 4 / 5 points
  • Wrong answers cost 0.25 × the question’s value (e.g., the penalty for a wrong answer to a 3-point question is 0.75 points)
  • Everyone starts with 30 points
  • Max score is 150

That penalty matters because it discourages “just pick something,” and it makes the score feel closer to real confidence. With five options, a blind guess has expected value (1/5)·p − (4/5)·(p/4) = 0, so guessing only pays off once you can eliminate at least one option.
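As a sketch, the scoring rules above can be written as a short function (the list-of-pairs input format is my own convention for illustration, not anything from the competition):

```python
def kangaroo_score(answers):
    """Score one paper under the rules above.

    `answers` is a list of (points, outcome) pairs, where points is
    3, 4 or 5 and outcome is "correct", "wrong" or "blank".
    """
    score = 30.0  # everyone starts with 30 points
    for points, outcome in answers:
        if outcome == "correct":
            score += points
        elif outcome == "wrong":
            score -= 0.25 * points  # quarter-of-value penalty
        # "blank" answers neither gain nor lose points
    return score

# A perfect 30-question paper (10 questions at each point value)
# reaches the 150-point maximum:
perfect = [(p, "correct") for p in (3, 4, 5) for _ in range(10)]
print(kangaroo_score(perfect))  # → 150.0
```

Note how the baseline of 30 points guarantees the score never goes negative even on an all-wrong paper.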

What I tested

All runs were done in November/December 2025, using a consistent prompt style across models.

Dataset

  • 180 problems total
  • 6 past papers (Grades 5–6 and 11–12, years 2021–2023)
  • Point values were evenly represented:
    • 60× 3-point
    • 60× 4-point
    • 60× 5-point
  • 115 problems included an image/diagram
  • 65 were text-only

To compare across papers, I computed paper-level scores and averaged them. One paper had fewer questions, so I rescaled it to the 150-point scale before averaging.

Models

  • gemini-2.5-pro, gemini-2.5-flash, gemini-2.5-lite
  • gpt-5, gpt-5-mini, gpt-5-nano
  • grok-fast-4
  • qwen/qwen3-vl-32b-instruct
  • llama-4-scout, llama-4-maverick
  • nvidia/nemotron-nano-12b-v2-vl:free
  • amazon/nova-lite-v1

The leaderboard

At a high level, the results match what many people would expect: the most capable general models sit at the top.

Model               Avg score (150-scale)   Overall accuracy
gemini-2.5-pro      117.8                   78.7%
gpt-5               114.9                   76.4%
grok-fast-4         107.3                   71.3%
gpt-5-mini          106.5                   71.3%
gemini-2.5-flash    103.9                   69.5%
gpt-5-nano           95.3                   63.2%

A rough intuition: in this setup, ~100 points is a genuinely good score. Only a couple of models stayed comfortably above that line across both grade bands.

But the more interesting story isn’t “who won.”

It’s what the wins and losses look like.

Surprise #1: Grade 5–6 was harder than Grade 11–12

This sounds backwards until you look at the style of the problems.

Average score (150-scale) by grade band:

Grade 11–12

  • gpt-5: 129.7
  • gemini-2.5-pro: 129.3

Grade 5–6

  • gemini-2.5-pro: 106.3
  • gpt-5: 100.2

So the same models that looked nearly unstoppable on the older-student papers dropped ~25–30 points on the younger-student papers.

That doesn’t mean “kid math is harder.”

It means something else: a lot of younger-grade Kangaroo questions aren’t really about algebraic technique, they’re about informal logic, diagrams, and puzzle intuition.

Which leads directly to the biggest effect in the experiment.

Surprise #2: images were the main source of failure

Across the whole dataset:

Text-only problems

Top models were extremely strong:

  • gpt-5: 95.2% correct
  • gemini-2.5-pro: 95.2% correct
  • grok-fast-4: 95.2% correct

Image/diagram problems

Accuracy dropped sharply:

  • gemini-2.5-pro: 69.6%
  • gpt-5: 66.1%
  • grok-fast-4: 58.0%
  • gpt-5-nano: 48.2%

This wasn’t a small difference. It was a structural split: text reasoning looked almost solved; diagram reasoning was still messy and inconsistent.

And it also explains the “Grade 5–6 is harder” result (at least to some extent):

  • In this dataset, Grade 5–6 was about 81% image-based
  • Grade 11–12 was closer to a 50/50 split

So the models weren’t struggling with arithmetic. They were struggling with the kind of reasoning that diagrams require: spatial constraints, counting shapes correctly, tracking implied rules, and not hallucinating details that aren’t there.
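The text-vs-image split is straightforward to compute from per-question results; here is a minimal sketch, assuming each run is logged as a dict (the record layout and field names are illustrative, not the actual experiment code):

```python
def accuracy_by_modality(results, model):
    """Return (text_accuracy, image_accuracy) for one model, given
    per-question records like
    {"model": "...", "has_image": bool, "correct": bool}."""
    accs = []
    for wants_image in (False, True):
        rows = [r for r in results
                if r["model"] == model and r["has_image"] == wants_image]
        # average of booleans = fraction correct; None if no such rows
        accs.append(sum(r["correct"] for r in rows) / len(rows) if rows else None)
    return tuple(accs)
```

Run over a full per-question log (one record per model/question pair), this reproduces the kind of split reported above.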

Surprise #3: even 3-point questions weren’t always “easy”

Math Kangaroo’s 3-point questions are designed to be the warm-up. But in Grade 5–6, many 3-pointers are exactly the kind of visual or “gotcha” logic that can be surprisingly tricky for LLMs.

In the Grade 5–6 set, for gemini-2.5-pro:

  • 3-point: ~61%
  • 5-point: ~71%

That’s the opposite of what the scoring suggests.

What seems to be happening is:

  • When a question is formal and well specified, models do great.
  • When it’s informal, visual, or “puzzle-like,” reliability drops, even if the problem is “worth only 3 points.”

A note on Grok: a strong budget performer

One model stood out as an interesting trade-off: grok-fast-4.

It didn’t beat the top two, but it was closer than many would expect:

  • grok-fast-4: 107.3
  • gpt-5: 114.9
  • gemini-2.5-pro: 117.8

What made this especially notable is that, at the pricing I used, Grok was over 20× cheaper than the top frontier models, yet on text-only math it was basically tied with them.

Where it lost ground was mostly the same place everyone lost ground: diagrams.

Practical takeaway: If the workload is mostly structured, text-based math, smaller/cheaper models can be a very rational choice. If diagrams and visual puzzles matter, the premium models start to justify their cost, but even then, they’re not flawless.

Why this matters for business decisions (not a model ranking)

This experiment is not intended as a definitive ranking of LLMs, nor as a proxy for “general intelligence.” From a business perspective, its value lies elsewhere: in showing that model performance is highly context-dependent.

The same model can appear exceptionally strong in one class of tasks (for example, well-specified, text-based reasoning) and surprisingly weak in another (informal logic, diagrams, or spatial constraints). In real-world deployments, this distinction is critical. Many business problems are closer to Math Kangaroo puzzles than to clean benchmark tasks: requirements are partially implicit, inputs are visual or messy, and success depends on interpretation as much as calculation.

The practical implication is that model selection should be driven by workload characteristics, not by leaderboard positions. Benchmarks can indicate general capability, but they rarely capture the specific failure modes that matter in production. For high-stakes or specialized applications, targeted, domain-specific testing is often more informative than relying on aggregate scores.

In that sense, results like these are best read as a decision-support signal: they help organizations understand where a given model is likely to perform reliably, where it may struggle, and when it is worth paying for a more capable (or multimodal) system versus deploying a cheaper alternative.

Try yourself

I like including a quick puzzle mid-article because it makes the rest of the discussion feel more concrete.

Example puzzle (Grade 5–6, 3 points) – one of the hardest for LLMs

Eva paddles around five buoys with her boat (see diagram). Which of the buoys does she paddle around in anti-clockwise direction?

diagram 1

A) 1 and 4

B) 2, 3 and 5

C) 2 and 3

D) 1, 4 and 5

E) 1 and 3

A stark example of this capability split is the buoys puzzle above. Even the top models, which demonstrated near-perfect accuracy on structured, text-based problems from the upper high school papers, failed to answer this seemingly simple 3-point visual puzzle correctly. This highlights the core finding: a system can perform at a very high level on formal competition math that requires precise definition and clear specification, yet still be noticeably weaker on “puzzle intuition” tasks that depend on interpreting a diagram and applying informal logic.

Example puzzle (Grade 11–12, 5 points) – one of the easiest for LLMs – every single model got it right. Is it as obvious to you, too?

A) 0

B) 1/2

C) 2

D) 2020

E) none of the previous

So… does this look like “IQ”?

If “IQ” means “a single number that transfers smoothly across many kinds of reasoning,” these results argue against that interpretation.

What this experiment suggests instead is a capability profile:

  • Structured math in text form: extremely strong
  • Well-specified word problems: very strong
  • Diagram-heavy puzzles: inconsistent
  • Informal logic with implied rules: error-prone

In other words, it’s not that models “can’t reason.” They clearly can.

It’s that their reasoning is much more reliable when the problem is formalized.

That matters because people often encounter the opposite in real life: problems that are not cleanly specified, where the diagram is the problem, and where the rules are partly implicit.

Connecting back to the “IMO gold medal” conversation

The IMO results from 2025 are genuinely impressive. But it’s important not to overgeneralize what they imply.

IMO problems – even geometry – are typically presented with very precise definitions and carefully structured text. That environment plays to the strengths of modern LLM systems: formal reasoning over a clear specification.

Math Kangaroo (especially in younger grades) pushes in a different direction: short puzzles, visual constraints, informal logic, and tricky interpretation. Those are exactly the conditions where performance became less stable in my runs.

So the two ideas can both be true at the same time:

  • A system can perform at a very high level on formal competition math.
  • The same system can still be noticeably weaker on diagram-heavy, “puzzle intuition” tasks that many students find natural.

The most important takeaways

  1. LLMs are excellent at well-structured math. Even smaller models can handle a large fraction of these problems when they’re cleanly described in text.
  2. The less formal the required reasoning, the less reliable the answer. When the solution depends on informal logic, spatial intuition, or interpreting a diagram correctly, success rates drop sharply.
  3. Visual input changes everything. Adding an image isn’t a small complication: it can be the difference between near-perfect performance and frequent mistakes.
  4. It’s better to think in terms of “fit for purpose,” not “general intelligence.” Models can look brilliant in one style of reasoning and surprisingly uneven in another. That doesn’t mean they’re “bad”; it means their competence is shaped by what they’ve been optimized and trained to do.

Final thought

If there’s one message I’d want readers to leave with, it’s this:

A single impressive score can be real and meaningful, without implying a general, human-like “IQ.”

In my Math Kangaroo runs, the best models performed strongly, but even they wouldn’t consistently reach the very top award tier. They looked more like “high distinction” than “unquestionable laureate.”

And that’s actually useful information.

Not because it diminishes what LLMs can do, but because it makes it easier to use them wisely, especially when the problem stops being formal math and starts being a puzzle you could scribble on a napkin.

So yes, despite being based on “school-level” puzzles, this kind of experiment has real business relevance. It highlights why context-aware evaluation and application-specific testing matter far more than a single headline score when deciding how, where, and whether to deploy an LLM in practice.


FAQ


Why use Math Kangaroo instead of established benchmarks like the IMO or MMLU?


Math Kangaroo isn’t meant to replace elite benchmarks—it complements them. While competitions like the IMO test highly formalized mathematical reasoning, Math Kangaroo problems are short, puzzle-like, and often visually grounded. This makes them useful for probing how models behave when reasoning depends on informal logic, spatial interpretation, or subtle constraints rather than explicit formulas. In other words, it’s a test that looks closer to how humans actually encounter problems outside of academic settings.


Does a high Math Kangaroo score mean a model is “smarter” or has higher IQ?


Not in the human sense. A strong score signals that a model handles certain kinds of reasoning well under these conditions, but the experiment shows that performance does not generalize evenly across problem types. Models that excel at structured, text-based math can still struggle with visual or informal puzzles. The results point toward a capability profile, not a single transferable intelligence measure.


Why were Grade 5–6 problems harder for models than Grade 11–12 problems?


The difficulty wasn’t about mathematical sophistication. Younger-grade Math Kangaroo problems rely more heavily on diagrams, visual counting, and “gotcha” logic, with fewer explicit rules spelled out. These are precisely the areas where current multimodal and reasoning systems are less reliable. Older-grade problems, while mathematically more advanced, are often more formal and text-driven—playing to LLM strengths.


Are image-based failures mainly a vision problem or a reasoning problem?


It’s a combination of both. The models often recognize the visual elements correctly but fail when reasoning requires tracking spatial relationships, implied constraints, or counting without hallucinating extra structure. The sharp performance drop on diagram-based questions suggests that integrating visual perception with robust, grounded reasoning remains a key weakness—even for frontier models.


What should businesses take away from this experiment?


The main lesson is that model performance is highly task-dependent. Leaderboard rankings and headline benchmarks can obscure failure modes that matter in real deployments. If your workload resembles clean, text-based reasoning, smaller or cheaper models may perform exceptionally well. If it involves diagrams, informal logic, or ambiguous inputs, targeted testing is essential—and even premium models may not be fully reliable. Model selection should be driven by workload characteristics, not by a single global score.




Category: Generative AI