How a Startup Turned AI Benchmarks Into an Intelligence Score
For more than a century, the IQ test has served as both a celebrated tool and a deeply contested measure of human cognitive ability. Now, a small startup called AI IQ has borrowed that same framework and applied it to artificial intelligence. The project assigns estimated intelligence quotients to over 50 of the most powerful language models in the world, then plots them along a standard bell curve. The resulting interactive charts at aiiq.org have ricocheted across social media, drawing both enthusiastic praise and sharp condemnation. The debate around the AI IQ results has become nearly as heated as the debate over the technology itself. Some enterprise technologists find the visualizations clarifying. Researchers, however, warn that compressing a model’s jagged capabilities into a single number creates a misleading picture. This article examines five of the most divisive findings from the project and explains what they actually mean for anyone tracking the fast-moving AI landscape.

The Mechanics Behind the Numbers
AI IQ was built by Ryan Shea, an engineer and entrepreneur best known as a co-founder of the blockchain platform Stacks. Shea also co-founded Voterbase and has invested early in several unicorns including OpenSea, Lattice, Anchorage, and Mercury. He holds a Bachelor of Science in Mechanical Engineering from Princeton University. The methodology behind the site is surprisingly straightforward on the surface, yet it contains several nuances that critics have seized upon.
The system groups twelve well-known benchmarks into four reasoning dimensions. Abstract reasoning draws from ARC-AGI-1 and ARC-AGI-2, the notoriously difficult pattern-recognition tests designed to measure general fluid intelligence. Mathematical reasoning includes FrontierMath split into Tiers 1 through 3 and Tier 4, plus AIME and ProofBench. Programmatic reasoning uses Terminal-Bench 2.0, SWE-Bench Verified, and SciCode. Academic reasoning pulls from Humanity’s Last Exam, CritPt, and GPQA Diamond.
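The grouping described above can be written down as a small lookup table. This is only a sketch of the article's description: the names `DIMENSIONS` and `dimension_of` are illustrative and are not taken from the site.

```python
# Benchmarks grouped by reasoning dimension, as listed in the article.
# DIMENSIONS and dimension_of are illustrative names, not part of aiiq.org.
DIMENSIONS = {
    "abstract": ["ARC-AGI-1", "ARC-AGI-2"],
    "math": ["FrontierMath Tiers 1-3", "FrontierMath Tier 4", "AIME", "ProofBench"],
    "programmatic": ["Terminal-Bench 2.0", "SWE-Bench Verified", "SciCode"],
    "academic": ["Humanity's Last Exam", "CritPt", "GPQA Diamond"],
}


def dimension_of(benchmark: str) -> str:
    """Return the dimension a benchmark belongs to, or raise KeyError."""
    for dimension, benchmarks in DIMENSIONS.items():
        if benchmark in benchmarks:
            return dimension
    raise KeyError(benchmark)


# Counting FrontierMath's two tier groups separately gives twelve benchmarks.
total_benchmarks = sum(len(benchmarks) for benchmarks in DIMENSIONS.values())
print(total_benchmarks)
print(dimension_of("SciCode"))
```

Counting the two FrontierMath tier groups as separate entries is what brings the total to twelve.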
Each raw benchmark score is mapped to an implied IQ using what the site describes as hand-calibrated difficulty curves. The composite IQ is a straight average of the four dimension scores: IQ = (IQ_Abstract + IQ_Math + IQ_Prog + IQ_Acad) / 4. Crucially, the methodology compresses the ceiling for benchmarks considered easier or more susceptible to data contamination, preventing them from inflating scores above 100. Harder, less gameable benchmarks retain higher ceilings. The system also handles missing data conservatively: a model must have scores on at least two of the four dimensions to receive a derived IQ, and when benchmarks are absent, the pipeline deliberately pulls scores downward rather than upward. The site states plainly that every derived IQ still averages all four dimensions, so missing coverage cannot make a model look better by omission. These design choices shape every one of the AI IQ results that have sparked such intense debate.
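The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the site's implementation: the piecewise-linear curve, its anchor points, the ceiling value passed in, and the `MISSING_DIMENSION_IQ` fill rule are all assumptions, since the site's actual calibration curves and missing-data mechanism are not described in detail here.

```python
from typing import Dict, List, Optional, Tuple

DIMENSIONS = ("abstract", "math", "programmatic", "academic")

# Hypothetical fill value for a dimension with no data. The article only
# says missing coverage pulls scores downward; this number is an assumption.
MISSING_DIMENSION_IQ = 85.0


def implied_iq(raw: float, curve: List[Tuple[float, float]], ceiling: float) -> float:
    """Map a raw benchmark score to an IQ by linear interpolation, then cap it.

    `curve` holds (raw_score, iq) anchor points with distinct raw scores.
    """
    points = sorted(curve)
    if raw <= points[0][0]:
        return min(points[0][1], ceiling)
    if raw >= points[-1][0]:
        return min(points[-1][1], ceiling)
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= raw <= x1:
            return min(y0 + (y1 - y0) * (raw - x0) / (x1 - x0), ceiling)
    raise ValueError("raw score not covered by curve")


def composite_iq(scores: Dict[str, Optional[float]]) -> Optional[float]:
    """Average all four dimension IQs; return None with fewer than two present."""
    present = [scores[d] for d in DIMENSIONS if scores.get(d) is not None]
    if len(present) < 2:
        return None  # not enough coverage for a derived IQ
    # Fill gaps with a value no higher than the model's weakest known
    # dimension, so a missing dimension can never raise the average.
    fill = min(MISSING_DIMENSION_IQ, min(present))
    filled = [scores[d] if scores.get(d) is not None else fill for d in DIMENSIONS]
    return sum(filled) / len(DIMENSIONS)  # always divide by four


# Hypothetical anchor points for one benchmark's difficulty curve.
example_curve = [(0.0, 70.0), (50.0, 100.0), (100.0, 140.0)]
print(implied_iq(25.0, example_curve, ceiling=100.0))
print(implied_iq(75.0, example_curve, ceiling=100.0))
print(composite_iq({"abstract": 130, "math": 134, "programmatic": 128, "academic": 132}))
print(composite_iq({"abstract": 130, "math": 134}))
```

In this sketch, a raw score of 75 would interpolate to 120 but is held at the ceiling, and the model with only two covered dimensions ends up well below the fully covered one. That mirrors the behavior the site describes, even though the real mechanism may differ.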
The 5 Most Divisive AI IQ Results
The following five findings from the AI IQ project have generated the strongest reactions, ranging from enthusiastic agreement to outright dismissal. Each reveals something distinct about the state of frontier AI today.
1. GPT-5.5 Sits at the Peak With an Estimated IQ of 136
According to the Frontier IQ Over Time chart on the site, OpenAI’s GPT-5.5 currently holds the top position with an estimated IQ near 136. That is the highest score of any model tracked as of mid-May 2026. On its own, this result seems straightforward: the latest flagship from OpenAI leads the pack. What makes it divisive, however, is how small the margin actually is. GPT-5.4 trails at approximately 131, while Anthropic’s Opus 4.7 sits around 132 and Google’s Gemini 3.1 Pro hovers near 131. The gap between first and fourth place is only about five IQ points.
Supporters of the AI IQ approach argue that this compression is itself a valuable insight. It confirms what many practitioners have sensed: the frontier is no longer dominated by a single player. A lead of five points on this scale may be statistically meaningful, but it does not translate to the kind of overwhelming dominance that earlier models like GPT-4 once enjoyed. Critics counter that the tight clustering actually undermines the whole exercise. If the top models are nearly indistinguishable by this metric, they argue, then the metric may simply lack the resolution to differentiate them meaningfully. Either way, the result has forced a conversation about whether the race for AI supremacy is still a race at all, or whether the field has entered a phase of convergence.
2. The Top Cluster Is Extraordinarily Tight
Beyond the single leader, the AI IQ charts reveal something striking: the top five or six models are packed into an extremely narrow band. GPT-5.5 at 136, Opus 4.7 at roughly 132, GPT-5.4 at about 131, Gemini 3.1 Pro near 131, and Opus 4.6 around 129 all sit within a range of just seven points. This clustering has become one of the most discussed features of the entire project.
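The arithmetic behind the cluster is easy to check. The snippet below uses the approximate scores quoted in this article; the variable names are illustrative.

```python
# Approximate estimated IQs as quoted in the article.
top_cluster = {
    "GPT-5.5": 136,
    "Opus 4.7": 132,
    "GPT-5.4": 131,
    "Gemini 3.1 Pro": 131,
    "Opus 4.6": 129,
}

# Spread across the whole cluster.
spread = max(top_cluster.values()) - min(top_cluster.values())

# Gap between first and fourth place.
ranked = sorted(top_cluster.items(), key=lambda item: item[1], reverse=True)
first_to_fourth = ranked[0][1] - ranked[3][1]

print(spread)           # 7
print(first_to_fourth)  # 5
```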
For enterprise buyers trying to decide which model to build into their products, this tight grouping presents both good news and a challenge. The good news is that several models now deliver comparable high-end capability. The challenge is that choosing between them requires looking beyond a single IQ number. One X user, ovsky, noted that the data confirmed their experience of Sonnet 4.6 as an absolute workhorse compared to Opus 4.5, a distinction that the aggregate IQ score alone would not capture. The tightness of the cluster has also fueled the criticism that the methodology compresses variance. Researchers argue that a seven-point spread among the world’s best AI systems may understate real differences in how these models handle specific tasks such as long-context reasoning, multilingual fluency, or tool use. The AI IQ results, in other words, may be smoothing over precisely the jagged edges that matter most in practice.
3. Opus 4.7 Outperforms GPT-5.4 on Certain Dimensions
Anthropic’s Opus 4.7 has an overall estimated IQ of approximately 132, placing it just behind GPT-5.5 but ahead of GPT-5.4. What makes this result particularly divisive is that Opus 4.7 does not lead across all four reasoning dimensions. It appears to excel specifically in abstract and academic reasoning while scoring slightly lower on mathematical and programmatic dimensions. This uneven profile has generated conflicting interpretations.
Proponents of the AI IQ framework say this kind of dimensional breakdown is exactly what makes the project useful. Rather than relying on a single leaderboard rank, a practitioner can see that Opus 4.7 might be the better choice for tasks requiring deep pattern recognition or complex academic-style queries, while GPT-5.4 may be preferable for coding or mathematical workflows. Critics, however, point out that the final IQ number still averages these dimensions, which means a model with spiky strengths can end up with a score that looks similar to a more balanced model. The worry is that decision-makers will grab the headline IQ figure without digging into the dimensional breakdown, thereby missing the nuance. This result has become a flashpoint in the broader argument about whether a single metric can ever do justice to a system as multifaceted as a large language model.
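The critics' point about averaging is simple to demonstrate. The two profiles below are hypothetical and chosen only for illustration; they are not scores from the site.

```python
from statistics import mean, pstdev

# Hypothetical dimension scores; not taken from aiiq.org.
spiky = {"abstract": 145, "math": 120, "programmatic": 118, "academic": 145}
balanced = {"abstract": 132, "math": 132, "programmatic": 132, "academic": 132}

for name, profile in (("spiky", spiky), ("balanced", balanced)):
    values = list(profile.values())
    # Same composite, very different dispersion across dimensions.
    print(name, mean(values), round(pstdev(values), 1))
```

Both profiles average to 132, yet the spiky one varies by about 13 points around that mean while the balanced one does not vary at all. Reporting a dispersion figure next to the composite is one way a chart could flag uneven models.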
4. Chinese Models Cluster in the 112 to 118 Range, Below the Frontier
The AI IQ charts show several prominent Chinese language models occupying a distinct tier below the frontier. Kimi K2.6, GLM-5, DeepSeek-V3.2, Qwen3.6, and MiniMax-M2.7 all score between approximately 112 and 118. For context, that range places them in the above-average to superior band on a human IQ scale, but clearly behind the top Western models, which cluster at roughly 129 and above.
This result has generated two very different reactions. Some observers interpret it as a realistic snapshot of the current competitive landscape. Chinese models have made remarkable progress in recent years, and a score in the 112 to 118 range still represents highly capable AI. Others argue that the benchmark selection may systematically disadvantage models trained primarily on Chinese-language data. The twelve benchmarks used by AI IQ are overwhelmingly English-centric, which could suppress scores for models that perform strongly in Chinese but lack equivalent exposure to English test sets. The site’s methodology does not explicitly account for language bias in benchmark design, and critics say this gap undermines the validity of the comparison. The result has reopened a longstanding debate about how to evaluate AI systems across languages and cultural contexts, and whether English-only benchmarks can fairly measure global progress.
5. The Fundamental Controversy: Can a Single Number Capture Intelligence?
The most divisive result of all is not any specific score but the very premise of the project. AI commentator Thibaut Mélen wrote on X that the charts make model progress much easier to understand when mapped this way instead of reading another giant leaderboard table. Business strategist Brian Vellmure echoed that sentiment, calling the approach helpful and noting that it anecdotally tracks with personal experience. But the backlash arrived just as quickly. One AI commentary account posted bluntly that the whole framework is nonsense because AI is far too jagged, adding that the map is not the territory.
This philosophical divide runs through every discussion of the AI IQ results. On one side are practitioners who find immense value in any tool that reduces complexity. Enterprise buyers, product managers, and CTOs often need a single reference point to compare dozens of rapidly evolving models. A chart with a bell curve and a single number per model provides exactly that. On the other side are researchers and methodologists who argue that compressing a high-dimensional capability space into one scalar creates a false sense of precision. They point out that a model can score highly on academic benchmarks while failing catastrophically on basic common-sense tasks, and a single IQ figure would never reveal that flaw. The debate is unlikely to be resolved soon. What is clear is that AI IQ has succeeded in one important way: it has forced a public conversation about how we evaluate intelligence, whether human or machine, and whether our existing tools are up to the task.
Why These Results Matter Beyond the Buzz
The controversy around AI IQ is not merely academic. For organizations that rely on large language models for product development, customer service, coding assistance, or content generation, the question of how to choose between models is deeply practical. A single IQ number, however imperfect, offers a starting point. The dimensional breakdowns offered by AI IQ provide additional texture. An enterprise building a coding assistant, for example, can look at the programmatic reasoning scores specifically rather than the composite IQ. A company focused on legal document analysis might prioritize academic reasoning scores.
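Choosing by dimension rather than by composite might look like the sketch below. The model names and every score are placeholders invented for illustration, and `rank_by` is a hypothetical helper, not something the site provides.

```python
# Placeholder models with hypothetical dimension scores (not from aiiq.org).
profiles = {
    "model_a": {"abstract": 138, "math": 126, "programmatic": 124, "academic": 136},
    "model_b": {"abstract": 128, "math": 134, "programmatic": 135, "academic": 127},
}


def composite(name: str) -> float:
    """Straight average of the four dimension scores."""
    return sum(profiles[name].values()) / len(profiles[name])


def rank_by(dimension: str) -> list:
    """Model names ordered from strongest to weakest on one dimension."""
    return sorted(profiles, key=lambda name: profiles[name][dimension], reverse=True)


print(composite("model_a"), composite("model_b"))  # identical composites
print(rank_by("programmatic"))  # coding assistant use case
print(rank_by("academic"))      # document analysis use case
```

Here the two placeholder models tie on the composite but swap places depending on which dimension the buyer cares about.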
The project also highlights a broader trend: as AI models become more capable and more numerous, the demand for intelligible comparisons will only grow. Leaderboards already exist, but they typically rank models by performance on individual benchmarks. AI IQ attempts something more ambitious by combining multiple benchmarks into a single metric. Whether that attempt succeeds or fails, it has already influenced the way many people talk about model capability. The phrase AI IQ has entered the vocabulary of the field, and that alone is a notable achievement for a small startup project.
What the Critics Get Right
Critics of AI IQ raise several valid points that deserve attention. The first is that IQ, even for humans, is a contested construct. Applying it to AI inherits all the same philosophical problems plus new ones specific to machines. A language model does not have a childhood, a cultural background, or a test-taking history. It has a training dataset, a parameter count, and a sampling strategy. Mapping its output to a human intelligence scale may be misleading on a fundamental level.
The second criticism concerns the benchmarks themselves. The twelve tests selected by AI IQ are a reasonable set, but they are not exhaustive. Models that excel at creative writing, dialogue coherence, or multimodal understanding may have strengths the score does not capture. Seven of the twelve benchmarks fall under the mathematical and academic dimensions, and those two dimensions make up half of the composite, which could disadvantage models optimized for other use cases. The hand-calibrated difficulty curves also introduce a subjective element that is hard to verify independently. Without open access to the calibration data, the scores rest on the judgment of a small group of people.
The third criticism is about transparency. AI IQ publishes its methodology, but the mapping from benchmark scores to IQ values involves proprietary judgments. Researchers accustomed to open benchmarks and reproducible results find this opacity frustrating. The site has not released the full dataset of raw benchmark scores alongside the derived IQs, which makes independent validation difficult. For a project that aspires to inform enterprise decision-making, this lack of openness is a significant limitation.
A Modest Defense of the Approach
Despite the criticisms, the AI IQ project offers something that the research community has not yet provided: a concise, visual, and intuitive summary of relative model capability. The academic literature is filled with papers that compare models across dozens of metrics, often with conflicting results. Practitioners do not have time to read every paper. A chart that plots models on a bell curve, color-coded by company and ranked by a single score, is immediately legible. It may not be perfect, but it is useful.
The project also surfaces patterns that are not obvious from individual benchmark scores. The tight clustering at the top, the distinct tier occupied by Chinese models, and the relative standing of older versus newer models all become visible at a glance. These patterns can validate or challenge the intuitions that practitioners have developed through hands-on experience. The ovsky comment about Sonnet 4.6 and Opus 4.5 is one example: a hands-on impression that the charts corroborated.
Furthermore, the methodology includes safeguards that address some of the most obvious pitfalls. The conservative handling of missing data, the ceiling compression for contamination-prone benchmarks, and the requirement that a model score on at least two dimensions all demonstrate a thoughtful approach. The methodology is not perfect, but it is far from naive. Ryan Shea and his team have built something that invites scrutiny, and that in itself is a contribution to the field.
What Comes Next for AI IQ and Similar Projects
The reaction to AI IQ suggests that the appetite for simplified AI evaluation is strong. Whether this specific project endures or fades, the underlying need will not disappear. As models multiply and their capabilities grow more sophisticated, the problem of comparison becomes harder, not easier. Future projects may refine the methodology by incorporating multilingual benchmarks, adding multimodal dimensions, or using adaptive testing rather than fixed benchmarks. The debate between holistic scores and dimensional profiles will continue, but the outcome is likely to be a middle ground: dashboards that offer both an overall score and a detailed breakdown, allowing users to drill down as needed.
For now, AI IQ has achieved something rare: it has made a technical topic accessible to a broad audience and provoked a genuinely useful argument about how to measure intelligence. The AI IQ results are divisive, but they are also productive. They force model providers to defend their scores, they push practitioners to think carefully about what they actually need from an AI system, and they remind everyone that intelligence, whether human or machine, resists easy quantification. That is a conversation worth having, even if the numbers themselves remain imperfect.






