Fugu vs GLM-5.2 vs Mythos: Why AI benchmarks crown different winners
New AI models from Japan's Sakana AI and China's Z.ai are posting higher scores than Anthropic's Claude Mythos on select benchmarks. Here is what that actually means.
Japan's Fugu, China's GLM-5.2 and Anthropic's Claude Mythos showcase how different AI architectures excel in different benchmark tests. (Image credit: Company logos taken from their respective websites)
The latest race among advanced artificial intelligence (AI) systems is increasingly being fought through benchmark scores, with Japan's Sakana AI and China's Z.ai (formerly Zhipu AI) reporting stronger performance than Anthropic's Claude Mythos on select evaluations.
However, the comparison is not entirely straightforward because the systems being measured are designed for different purposes.
Different models, different strengths
Claude Mythos is Anthropic's flagship frontier AI model, built as part of the company's broader Claude family of products.
Anthropic offers different Claude models for different use cases, including Opus for complex reasoning and coding tasks, Sonnet for general-purpose workloads, and Haiku for faster and lower-cost applications. Mythos sits at the top end of that stack and is positioned as Anthropic's most capable model for software engineering, reasoning, and agentic tasks.
On the other hand, Sakana AI's Fugu is not a standalone frontier model in the traditional sense. Instead, it functions as an orchestration layer that coordinates multiple frontier AI models and dynamically routes tasks to whichever model is best suited to solve them. In its technical report, Sakana AI said Fugu models are trained to "adaptively and dynamically orchestrate a team of more powerful frontier agent workers" and can achieve performance beyond a single model through what it describes as collective intelligence.
China's GLM-5.2 occupies yet another position in the market. Rather than focusing primarily on raw benchmark performance, Z.ai has marketed the model around long-horizon task completion, autonomous coding workflows and agentic execution.
Why the distinction matters
The distinction matters because many of the benchmark results now being cited compare fundamentally different approaches to AI development. Anthropic is evaluating the capabilities of a frontier foundation model, Sakana AI is testing an orchestration system built on top of multiple models, and Z.ai is focusing on agentic systems designed for longer-duration workflows.
As a result, the benchmark leaders vary depending on what is being measured.
Where the models differ
The benchmark comparisons being cited by Anthropic, Sakana AI and Z.ai are not entirely like-for-like because the systems being evaluated are built differently and, in some cases, the companies report results for different models within their product families.
Benchmark | What it measures | Claude Mythos / Opus | Fugu / Fugu Ultra | GLM-5.2
SWE-Bench Pro | Real-world software engineering | 80.3% (Mythos) | 73.7% | 62.1%
Terminal Bench 2.1 | Agentic coding via terminal use | 80.4% (Mythos Preview) | 82.1% | 82.7%
GPQA Diamond | Graduate-level science reasoning | 94.6 (Mythos Preview) | 95.5 | 91.2
CharXiv Reasoning | Scientific charts and figures | 86.1 (Mythos Preview) | 86.6 | Not reported
Humanity's Last Exam* | Multidisciplinary reasoning | 64.7% (with tools) | 50.0% | 54.7% (with tools)
FrontierSWE | Long-horizon coding tasks (20 hrs) | 75.1% (Opus 4.8) | Not reported | 74.4%
PostTrainBench | Long-horizon agentic tasks | 37.2% (Opus 4.8) | Not reported | 34.3%
MCP-Atlas | Multi-step agent workflows | 77.8% (Opus 4.8) | Not reported | 76.8%
Tool-Decathlon | Tool use and workflows | 59.9% (Opus 4.8) | Not reported | 48.2%
ExploitBench | Cybersecurity and vulnerability tasks | 78.0% (Mythos) | Not reported | Not reported
*Scores for Humanity's Last Exam are reported under different evaluation settings and may not be directly comparable. Anthropic and Z.ai report tool-enabled scores, while Sakana AI reports a text-only score.
What the comparison shows
Anthropic's models continue to lead several software engineering, cybersecurity and agentic workflow evaluations. The company reported an 80.3 per cent score for Claude Mythos on SWE-Bench Pro, a benchmark that tests whether AI systems can resolve real-world software issues in code repositories. Anthropic's Opus 4.8 also leads benchmarks such as FrontierSWE, MCP-Atlas and Tool-Decathlon, according to the figures disclosed by the respective companies.
Sakana AI reported stronger performance on some scientific reasoning and agentic coding benchmarks. According to the company's technical report, Fugu Ultra scored 95.5 on GPQA Diamond, a benchmark that evaluates graduate-level scientific reasoning, compared with 94.6 for Mythos Preview. Fugu Ultra also achieved 86.6 on CharXiv Reasoning, ahead of Mythos Preview's 86.1. On Terminal Bench 2.1, which evaluates how effectively models can interact with computing environments through terminal commands, Fugu Ultra scored 82.1 compared with 80.4 for Mythos Preview.
Z.ai's GLM-5.2 also reported stronger performance than Mythos Preview on Terminal Bench 2.1, scoring 82.7. The company further reported competitive results on long-horizon coding and agentic workflow benchmarks, including 74.4 on FrontierSWE and 76.8 on MCP-Atlas.
However, Anthropic's Opus 4.8 remained ahead on both benchmarks, posting scores of 75.1 and 77.8, respectively.
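For readers who want to check the pattern themselves, the figures reported above can be tallied with a short script. This is a minimal sketch, not an official scoring method: the names SYSTEMS, SCORES, NOT_COMPARABLE, leader and tally_leaders are illustrative, the Anthropic column mixes Mythos, Mythos Preview and Opus 4.8 exactly as the table does, and the choice to skip rows with only one reported score or with mismatched evaluation settings is an assumption made here.

```python
# Scores transcribed from the table in this article; None marks "Not reported".
# Column order: Anthropic (Mythos / Opus), Fugu / Fugu Ultra, GLM-5.2.
SYSTEMS = ("Anthropic", "Fugu", "GLM-5.2")

SCORES = {
    "SWE-Bench Pro": (80.3, 73.7, 62.1),
    "Terminal Bench 2.1": (80.4, 82.1, 82.7),
    "GPQA Diamond": (94.6, 95.5, 91.2),
    "CharXiv Reasoning": (86.1, 86.6, None),
    "Humanity's Last Exam": (64.7, 50.0, 54.7),
    "FrontierSWE": (75.1, None, 74.4),
    "PostTrainBench": (37.2, None, 34.3),
    "MCP-Atlas": (77.8, None, 76.8),
    "Tool-Decathlon": (59.9, None, 48.2),
    "ExploitBench": (78.0, None, None),
}

# Assumption: rows footnoted as using different evaluation settings are skipped.
NOT_COMPARABLE = {"Humanity's Last Exam"}


def leader(row):
    """Return (system, score) for the highest reported score in one table row."""
    reported = [(score, name) for name, score in zip(SYSTEMS, row) if score is not None]
    score, name = max(reported)
    return name, score


def tally_leaders(scores):
    """Count how many comparable benchmarks each system leads."""
    counts = {name: 0 for name in SYSTEMS}
    for benchmark, row in scores.items():
        if benchmark in NOT_COMPARABLE:
            continue
        # A "lead" only means something when at least two systems report a score.
        if sum(score is not None for score in row) < 2:
            continue
        counts[leader(row)[0]] += 1
    return counts


if __name__ == "__main__":
    for benchmark, row in SCORES.items():
        print(benchmark, "->", leader(row))
    print(tally_leaders(SCORES))
```

Under these assumptions the tally comes out to five benchmarks for Anthropic, two for Fugu and one for GLM-5.2, which is consistent with the picture above: each company can point to at least one test on which its system is ahead.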
Why benchmark wins do not always translate into overall leadership
The results illustrate why multiple companies can simultaneously claim state-of-the-art performance.
A system that excels at software engineering may not necessarily lead in scientific reasoning, while a model optimised for long-duration autonomous execution may perform differently on knowledge or cybersecurity evaluations.
Anthropic's models continue to hold advantages across software engineering, cybersecurity and several workflow-oriented benchmarks. Fugu has reported stronger performance on selected scientific reasoning and agentic coding tests, while GLM-5.2 has reported competitive or leading results on some agentic coding and long-horizon software engineering evaluations.
The results point to an increasingly fragmented AI landscape, where different systems are optimised for different tasks. Rather than producing a single dominant model, the latest generation of AI products is creating specialists in areas such as software engineering, scientific reasoning, cybersecurity and long-horizon agentic execution.