Author: Data Scientist
LLMs keep getting better. Every few weeks there’s a new benchmark, a new leaderboard, a new “Model A beats Model B” headline.
The problem is that most of these scores are relative: they tell us which model is better than another model, but not what that means in terms of difficulty that humans actually recognize.
Around 2025 this gap got even more noticeable, because it became common to see headlines and posts claiming that frontier models were scoring at gold medal level on the International Mathematical Olympiad (IMO). That’s an incredible achievement—and it’s also the kind of story that makes people jump to a bigger conclusion: “So… do these systems basically have high IQ?”
I wanted a test that felt more human-grounded. Something many students have actually taken. Something with puzzles that don’t look like textbook exercises.
So I ran an experiment using Math Kangaroo.

Not because it’s “harder” than the IMO (it isn’t), but because it’s different: short questions, tricky logic, lots of diagrams, pattern-spotting, and those deceptively simple puzzles that punish overconfidence.
Even though Math Kangaroo problems are designed for children and teenagers, experiments like this have clear business value: they expose how models behave in semi-formal, ambiguous, and visually grounded tasks that closely resemble real decision-making problems in applied settings.
The question I wanted to answer was simple:
If a language model scores well on Math Kangaroo, does that feel like a meaningful, human-shaped signal of reasoning ability?
And just as importantly: where do these models stumble when the problems stop being formal and start being “puzzly”?
Math Kangaroo has two features that make it useful as a “human difficulty probe”:
Here’s the scoring setup (the 150-point version):
That penalty matters because it discourages “just pick something,” and it makes the score feel closer to real confidence.
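As a minimal sketch of that scoring, assuming the common 30-question, 150-point Kangaroo format (ten questions each worth 3, 4, and 5 points, a 30-point starting credit, and a wrong answer costing a quarter of the question’s value; exact rules vary by country and year):

```python
def kangaroo_score(answers):
    """Score one paper under an assumed 150-point Kangaroo format:
    30 questions (10 worth 3 pts, 10 worth 4, 10 worth 5),
    a 30-point starting credit, and a wrong answer costing one
    quarter of the question's value; blanks score zero.
    `answers` maps question index (0-29) to 'correct'/'wrong'/'blank'.
    """
    values = [3] * 10 + [4] * 10 + [5] * 10
    score = 30.0
    for i, value in enumerate(values):
        outcome = answers.get(i, "blank")
        if outcome == "correct":
            score += value
        elif outcome == "wrong":
            score -= value / 4
    return score

# A perfect paper reaches the 150-point ceiling.
perfect = {i: "correct" for i in range(30)}
print(kangaroo_score(perfect))  # 150.0
```

Note that under this assumed scheme a fully blank paper still scores 30 points, which is why leaving a question unanswered can beat a shaky guess.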
All runs were done in November/December 2025, using a consistent prompt style across models.
Dataset
To compare across papers, I computed paper-level scores and averaged them. One paper had fewer questions, so I rescaled it to the 150-point scale before averaging.
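The normalization step can be sketched like this; the paper scores and the 120-point maximum below are hypothetical, illustrating only the linear rescaling described above:

```python
def rescale_to_150(score, paper_max):
    """Linearly map a paper score onto the common 150-point scale."""
    return score * 150.0 / paper_max

# Hypothetical numbers: two full 150-point papers plus one shorter
# paper whose maximum happened to be 120 points.
papers = [(118.0, 150), (105.0, 150), (96.0, 120)]
avg = sum(rescale_to_150(s, m) for s, m in papers) / len(papers)
```

A linear rescale assumes question difficulty is comparable across papers, which is a simplification; it keeps the averages on one axis without re-weighting individual questions.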
Models
At a high level, the results match what many people would expect: the most capable general models sit at the top.
| Model | Avg score (150-scale) | Overall accuracy |
|---|---|---|
| gemini-2.5-pro | 117.8 | 78.7% |
| gpt-5 | 114.9 | 76.4% |
| grok-fast-4 | 107.3 | 71.3% |
| gpt-5-mini | 106.5 | 71.3% |
| gemini-2.5-flash | 103.9 | 69.5% |
| gpt-5-nano | 95.3 | 63.2% |
A rough intuition: in this setup, ~100 points is a genuinely good score. Only a couple of models stayed comfortably above that line across both grade bands.
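To calibrate that intuition: assuming the common penalty of one quarter of a question’s value for each wrong answer (an assumption here, as penalty rules vary), a blind five-way guess has exactly zero expected value, so a pure guesser expects only the 30-point starting credit:

```python
def guess_ev(value, n_options=5):
    """Expected score change from one blind guess, assuming a wrong
    answer costs a quarter of the question's value."""
    p_correct = 1.0 / n_options
    return p_correct * value - (1 - p_correct) * value / 4

# For 3-, 4- and 5-point questions alike the expectation is zero:
# 1/5 * v  -  4/5 * v/4  =  v/5 - v/5  =  0.
evs = [guess_ev(v) for v in (3, 4, 5)]
```

Against that ~30-point chance baseline, a ~100-point average is far from trivial.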
But the more interesting story isn’t “who won.”
It’s what the wins and losses look like.
The counterintuitive result first: the models scored lower on the Grade 5–6 papers than on the Grade 11–12 papers. This sounds backwards until you look at the style of the problems.
Average score (150-scale) by grade band:
- Grade 11–12
- Grade 5–6
So the same models that looked nearly unstoppable on the older-student papers dropped ~25–30 points on the younger-student papers.
That doesn’t mean “kid math is harder.”
It means something else: a lot of younger-grade Kangaroo questions aren’t really about algebraic technique, they’re about informal logic, diagrams, and puzzle intuition.
Which leads directly to the biggest effect in the experiment.
Across the whole dataset, the sharpest split was between text-only and diagram-dependent questions:
- On text-only questions, the top models were extremely strong.
- On diagram-dependent questions, accuracy dropped sharply.
This wasn’t a small difference. It was a structural split: text reasoning looked almost solved; diagram reasoning was still messy and inconsistent.
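A minimal sketch of how that split can be tabulated; the per-question records below are entirely made up, and the `diagram` flag stands in for whatever annotation marks a question as diagram-dependent:

```python
from collections import defaultdict

# Hypothetical per-question results: whether the question needed a
# diagram, and whether the model answered it correctly.
results = [
    {"diagram": False, "correct": True},
    {"diagram": False, "correct": True},
    {"diagram": False, "correct": True},
    {"diagram": True,  "correct": False},
    {"diagram": True,  "correct": True},
    {"diagram": True,  "correct": False},
]

totals = defaultdict(lambda: [0, 0])  # modality -> [correct, total]
for r in results:
    key = "diagram" if r["diagram"] else "text"
    totals[key][0] += int(r["correct"])
    totals[key][1] += 1

accuracy = {k: c / n for k, (c, n) in totals.items()}
```

Splitting accuracy this way, rather than reporting one aggregate number, is what surfaces the structural text-vs-diagram gap.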
And it also explains the “Grade 5–6 is harder” result (at least to some extent): the younger-grade papers lean much more heavily on diagrams and visual counting, so they hit the models exactly where they are weakest.
So the models weren’t struggling with arithmetic. They were struggling with the kind of reasoning that diagrams require: spatial constraints, counting shapes correctly, tracking implied rules, and not hallucinating details that aren’t there.
Math Kangaroo’s 3-point questions are designed to be the warm-up. But in Grade 5–6, many 3-pointers are exactly the kind of visual or “gotcha” logic that can be surprisingly tricky for LLMs.
In the Grade 5–6 set, gemini-2.5-pro’s weakest tier was the 3-point questions rather than the 5-pointers. That’s the opposite of what the scoring suggests.
What seems to be happening is that the 3-point tier is calibrated for human difficulty, not model difficulty: in the younger grades, the “easy” questions lean on diagrams and gotcha logic, which is exactly where the models are least reliable.
One model stood out as an interesting trade-off: grok-fast-4.
It didn’t beat the top two, but it was closer than many would expect.
What made this especially notable in my runs is that Grok was reported (in my pricing context) to be over 20× cheaper than the top frontier models—yet on text-only math, it was basically tied with them.
Where it lost ground was mostly the same place everyone lost ground: diagrams.
Practical takeaway: If the workload is mostly structured, text-based math, smaller/cheaper models can be a very rational choice. If diagrams and visual puzzles matter, the premium models start to justify their cost, but even then, they’re not flawless.
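That trade-off can be made concrete with a rough cost-per-correct-answer comparison; the ~20× relative price gap is the article’s own pricing claim, the `relative_cost` units are arbitrary, and the accuracies are the overall figures from the results table:

```python
# Illustrative cost-effectiveness sketch, not vendor list prices.
models = {
    "gemini-2.5-pro": {"accuracy": 0.787, "relative_cost": 20.0},
    "grok-fast-4":    {"accuracy": 0.713, "relative_cost": 1.0},
}

def cost_per_correct(model):
    """Relative cost divided by accuracy: the price of one correct answer."""
    return model["relative_cost"] / model["accuracy"]

ratio = (cost_per_correct(models["gemini-2.5-pro"])
         / cost_per_correct(models["grok-fast-4"]))
# On text-heavy workloads the premium model costs an order of magnitude
# more per correct answer; it earns that back only where diagram
# accuracy is the bottleneck.
```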
This experiment is not intended as a definitive ranking of LLMs, nor as a proxy for “general intelligence.” From a business perspective, its value lies elsewhere: in showing that model performance is highly context-dependent.
The same model can appear exceptionally strong in one class of tasks (for example, well-specified, text-based reasoning) and surprisingly weak in another (informal logic, diagrams, or spatial constraints). In real-world deployments, this distinction is critical. Many business problems are closer to Math Kangaroo puzzles than to clean benchmark tasks: requirements are partially implicit, inputs are visual or messy, and success depends on interpretation as much as calculation.
The practical implication is that model selection should be driven by workload characteristics, not by leaderboard positions. Benchmarks can indicate general capability, but they rarely capture the specific failure modes that matter in production. For high-stakes or specialized applications, targeted, domain-specific testing is often more informative than relying on aggregate scores.
In that sense, results like these are best read as a decision-support signal: they help organizations understand where a given model is likely to perform reliably, where it may struggle, and when it is worth paying for a more capable (or multimodal) system versus deploying a cheaper alternative.
I like including a quick puzzle mid-article because it makes the rest of the discussion feel more concrete.
Example puzzle (Grade 5–6, 3 points) – one of the hardest for the LLMs
Eva paddles around five buoys with her boat (see diagram). Which of the buoys does she paddle around in anti-clockwise direction?

A) 1 and 4
B) 2, 3 and 5
C) 2 and 3
D) 1, 4 and 5
E) 1 and 3
A stark example of this capability split is the buoys puzzle above. Even the top models, which demonstrated near-perfect accuracy on structured, text-based problems from the upper high school papers, failed to answer this seemingly simple 3-point visual puzzle correctly. This highlights the core finding: a system can perform at a very high level on formal competition math that requires precise definition and clear specification, yet still be noticeably weaker on “puzzle intuition” tasks that depend on interpreting a diagram and applying informal logic.
Example puzzle (Grade 11–12, 5 points) – one of the easiest for the LLMs: every single model got it right. Is it as obvious to you?

A) 0
B) 1/2
C) 2
D) 2020
E) none of the previous
If “IQ” means “a single number that transfers smoothly across many kinds of reasoning,” these results argue against that interpretation.
What this experiment suggests instead is a capability profile: very strong on formal, text-specified reasoning; noticeably weaker on diagrams, informal logic, and puzzle intuition.
In other words, it’s not that models “can’t reason.” They clearly can.
It’s that their reasoning is much more reliable when the problem is formalized.
That matters because people often encounter the opposite in real life: problems that are not cleanly specified, where the diagram is the problem, and where the rules are partly implicit.
The IMO results from 2025 are genuinely impressive. But it’s important not to overgeneralize what they imply.
IMO problems – even geometry – are typically presented with very precise definitions and carefully structured text. That environment plays to the strengths of modern LLM systems: formal reasoning over a clear specification.
Math Kangaroo (especially in younger grades) pushes in a different direction: short puzzles, visual constraints, informal logic, and tricky interpretation. Those are exactly the conditions where performance became less stable in my runs.
So the two ideas can both be true at the same time: frontier models can earn gold-medal-level results on formal olympiad mathematics, and still stumble on short visual puzzles written for ten-year-olds.
If there’s one message I’d want readers to leave with, it’s this:
A single impressive score can be real and meaningful, without implying a general, human-like “IQ.”
In my Math Kangaroo runs, the best models performed strongly, but even they wouldn’t consistently reach the very top award tier. They looked more like “high distinction” rather than “unquestionable laureate.”
And that’s actually useful information.
Not because it diminishes what LLMs can do, but because it makes it easier to use them wisely, especially when the problem stops being formal math and starts being a puzzle you could scribble on a napkin.
So yes, despite being based on “school-level” puzzles, this kind of experiment has real business relevance. It highlights why context-aware evaluation and application-specific testing matter far more than a single headline score when deciding how, where, and whether to deploy an LLM in practice.
Math Kangaroo isn’t meant to replace elite benchmarks—it complements them. While competitions like the IMO test highly formalized mathematical reasoning, Math Kangaroo problems are short, puzzle-like, and often visually grounded. This makes them useful for probing how models behave when reasoning depends on informal logic, spatial interpretation, or subtle constraints rather than explicit formulas. In other words, it’s a test that looks closer to how humans actually encounter problems outside of academic settings.
Does a strong score mean the model has human-like intelligence?
Not in the human sense. A strong score signals that a model handles certain kinds of reasoning well under these conditions, but the experiment shows that performance does not generalize evenly across problem types. Models that excel at structured, text-based math can still struggle with visual or informal puzzles. The results point toward a capability profile, not a single transferable intelligence measure.
Why were the younger-grade papers harder for the models?
The difficulty wasn’t about mathematical sophistication. Younger-grade Math Kangaroo problems rely more heavily on diagrams, visual counting, and “gotcha” logic, with fewer explicit rules spelled out. These are precisely the areas where current multimodal and reasoning systems are less reliable. Older-grade problems, while mathematically more advanced, are often more formal and text-driven—playing to LLM strengths.
Are the diagram failures a perception problem or a reasoning problem?
It’s a combination of both. The models often recognize the visual elements correctly but fail when reasoning requires tracking spatial relationships, implied constraints, or counting without hallucinating extra structure. The sharp performance drop on diagram-based questions suggests that integrating visual perception with robust, grounded reasoning remains a key weakness—even for frontier models.
The main lesson is that model performance is highly task-dependent. Leaderboard rankings and headline benchmarks can obscure failure modes that matter in real deployments. If your workload resembles clean, text-based reasoning, smaller or cheaper models may perform exceptionally well. If it involves diagrams, informal logic, or ambiguous inputs, targeted testing is essential—and even premium models may not be fully reliable. Model selection should be driven by workload characteristics, not by a single global score.