Fugu vs GLM-5.2 vs Mythos: Why AI benchmarks crown different winners

New AI models from Japan's Sakana AI and China's Z.ai are posting higher scores than Anthropic's Claude Mythos on select benchmarks. Here's what that actually means.

Japan's Fugu, China's GLM-5.2 and Anthropic's Claude Mythos showcase how different AI architectures excel in different benchmark tests. (Image credit: company logos taken from their respective websites)
Akshita Singh New Delhi
Last Updated: Jul 01 2026 | 1:08 PM IST
The latest race among advanced artificial intelligence (AI) systems is increasingly being fought through benchmark scores, with Japan's Sakana AI and China's Z.ai (formerly Zhipu AI) reporting stronger performance than Anthropic's Claude Mythos on select evaluations.
 
However, the comparison is not entirely straightforward because the systems being measured are designed for different purposes.

Different models, different strengths

Claude Mythos is Anthropic's flagship frontier AI model, built as part of the company's broader Claude family of products.
 
Anthropic offers different Claude models for different use cases, including Opus for complex reasoning and coding tasks, Sonnet for general-purpose workloads, and Haiku for faster and lower-cost applications. Mythos sits at the top end of that stack and is positioned as Anthropic's most capable model for software engineering, reasoning, and agentic tasks.
 
On the other hand, Sakana AI's Fugu is not a standalone frontier model in the traditional sense. Instead, it functions as an orchestration layer that coordinates multiple frontier AI models and dynamically routes tasks to whichever model is best suited to solve them. In its technical report, Sakana AI said Fugu models are trained to "adaptively and dynamically orchestrate a team of more powerful frontier agent workers" and can achieve performance beyond a single model through what it describes as collective intelligence.
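
To make the routing idea concrete, here is a minimal Python sketch of an orchestration layer. It is an illustration only, not Sakana AI's implementation: Fugu is described as a trained orchestrator, whereas this sketch swaps in a fixed lookup table of assumed skill scores. The worker names, task types and scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A worker is any callable that maps a prompt to a text answer.
# In a real system this would wrap a model API call; here they are stubs.
Worker = Callable[[str], str]


@dataclass
class Router:
    workers: Dict[str, Worker]
    # skill[worker][task_type] -> assumed strength between 0 and 1 (hypothetical values)
    skill: Dict[str, Dict[str, float]]

    def pick(self, task_type: str) -> str:
        """Return the worker with the highest assumed skill for this task type."""
        return max(
            self.workers,
            key=lambda name: self.skill.get(name, {}).get(task_type, 0.0),
        )

    def run(self, task_type: str, prompt: str) -> str:
        """Route the prompt to the chosen worker and tag the answer with its name."""
        name = self.pick(task_type)
        return f"[{name}] {self.workers[name](prompt)}"


# Hypothetical stub workers standing in for frontier models.
router = Router(
    workers={
        "coder_model": lambda p: f"patch for: {p}",
        "science_model": lambda p: f"derivation for: {p}",
    },
    skill={
        "coder_model": {"coding": 0.9, "science": 0.4},
        "science_model": {"coding": 0.5, "science": 0.8},
    },
)

print(router.run("coding", "fix failing unit test"))
print(router.run("science", "estimate reaction rate"))
```

The point of the sketch is that the score such a system posts on a benchmark reflects the best available worker for each task plus the quality of the routing, which is why it is not directly comparable to a single model's score.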
 
China's GLM-5.2 occupies yet another position in the market. Rather than focusing primarily on raw benchmark performance, Z.ai has marketed the model around long-horizon task completion, autonomous coding workflows and agentic execution.

Why the distinction matters

The distinction matters because many of the benchmark results now being cited compare fundamentally different approaches to AI development. Anthropic is evaluating the capabilities of a frontier foundation model, Sakana AI is testing an orchestration system built on top of multiple models, and Z.ai is focusing on agentic systems designed for longer-duration workflows.
 
As a result, the benchmark leaders vary depending on what is being measured.

Where the models differ

The benchmark comparisons being cited by Anthropic, Sakana AI and Z.ai are not entirely like-for-like because the systems being evaluated are built differently and, in some cases, the companies report results for different models within their product families.
 
Benchmark | What it measures | Claude Mythos / Opus | Fugu / Fugu Ultra | GLM-5.2
SWE-Bench Pro | Real-world software engineering | 80.3% (Mythos) | 73.7% | 62.1%
Terminal Bench 2.1 | Agentic coding via terminal use | 80.4% (Mythos Preview) | 82.1% | 82.7%
GPQA Diamond | Graduate-level science reasoning | 94.6 (Mythos Preview) | 95.5 | 91.2
CharXiv Reasoning | Scientific charts and figures | 86.1 (Mythos Preview) | 86.6 | Not reported
Humanity's Last Exam* | Multidisciplinary reasoning | 64.7% (with tools) | 50.0% | 54.7% (with tools)
FrontierSWE | Long-horizon coding tasks (20 hrs) | 75.1% (Opus 4.8) | Not reported | 74.4%
PostTrainBench | Long-horizon agentic tasks | 37.2% (Opus 4.8) | Not reported | 34.3%
MCP-Atlas | Multi-step agent workflows | 77.8% (Opus 4.8) | Not reported | 76.8%
Tool-Decathlon | Tool use and workflows | 59.9% (Opus 4.8) | Not reported | 48.2%
ExploitBench | Cybersecurity and vulnerability tasks | 78.0% (Mythos) | Not reported | Not reported
*Scores for Humanity's Last Exam are reported under different evaluation settings and may not be directly comparable. Anthropic and Z.ai report tool-enabled scores, while Sakana AI reports a text-only score.
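
As a rough illustration of how the leader changes from one benchmark to the next, the reported figures can be tallied with a short Python sketch. The scores are the company-reported numbers from the comparison above; the vendor labels, the rule that a "win" needs at least two reported scores, and the decision to leave out Humanity's Last Exam (because of its mixed evaluation settings) are assumptions made for this example. Note that each vendor column mixes different models (for instance Mythos, Mythos Preview and Opus 4.8), so the tally is indicative at best.

```python
from collections import Counter
from typing import Dict, Optional

# Company-reported scores; None means "Not reported".
# Humanity's Last Exam is left out because its scores use different settings.
SCORES: Dict[str, Dict[str, Optional[float]]] = {
    "SWE-Bench Pro": {"Anthropic": 80.3, "Sakana AI": 73.7, "Z.ai": 62.1},
    "Terminal Bench 2.1": {"Anthropic": 80.4, "Sakana AI": 82.1, "Z.ai": 82.7},
    "GPQA Diamond": {"Anthropic": 94.6, "Sakana AI": 95.5, "Z.ai": 91.2},
    "CharXiv Reasoning": {"Anthropic": 86.1, "Sakana AI": 86.6, "Z.ai": None},
    "FrontierSWE": {"Anthropic": 75.1, "Sakana AI": None, "Z.ai": 74.4},
    "PostTrainBench": {"Anthropic": 37.2, "Sakana AI": None, "Z.ai": 34.3},
    "MCP-Atlas": {"Anthropic": 77.8, "Sakana AI": None, "Z.ai": 76.8},
    "Tool-Decathlon": {"Anthropic": 59.9, "Sakana AI": None, "Z.ai": 48.2},
    "ExploitBench": {"Anthropic": 78.0, "Sakana AI": None, "Z.ai": None},
}


def leader(row: Dict[str, Optional[float]]) -> Optional[str]:
    """Return the top-scoring vendor, or None if fewer than two scores were reported."""
    reported = {vendor: s for vendor, s in row.items() if s is not None}
    if len(reported) < 2:
        return None  # a single reported score is not a comparison
    return max(reported, key=lambda vendor: reported[vendor])


leaders = {bench: leader(row) for bench, row in SCORES.items()}
wins = Counter(v for v in leaders.values() if v is not None)

for bench, who in leaders.items():
    print(f"{bench}: {who or 'no comparison (single reported score)'}")
print(dict(wins))
```

Under these assumptions each of the three vendors tops at least one benchmark, which is consistent with the article's point that the winner depends on what is being measured.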

What the comparison shows

Anthropic's models continue to lead several software engineering, cybersecurity and agentic workflow evaluations. The company reported an 80.3 per cent score for Claude Mythos on SWE-Bench Pro, a benchmark that tests whether AI systems can resolve real-world software issues in code repositories. Anthropic's Opus 4.8 also leads benchmarks such as FrontierSWE, MCP-Atlas and Tool-Decathlon, according to the figures disclosed by the respective companies. The cybersecurity lead rests on ExploitBench, where only Anthropic reported a score (78.0 per cent for Mythos).
 
Sakana AI reported stronger performance on some scientific reasoning and agentic coding benchmarks. According to the company's technical report, Fugu Ultra scored 95.5 on GPQA Diamond, a benchmark that evaluates graduate-level scientific reasoning, compared with 94.6 for Mythos Preview. Fugu Ultra also achieved 86.6 on CharXiv Reasoning, ahead of Mythos Preview's 86.1. On Terminal Bench 2.1, which evaluates how effectively models can interact with computing environments through terminal commands, Fugu Ultra scored 82.1 compared with 80.4 for Mythos Preview.
 
Z.ai's GLM-5.2 also reported stronger performance than Mythos Preview on Terminal Bench 2.1, scoring 82.7. The company further reported competitive results on long-horizon coding and agentic workflow benchmarks, including 74.4 on FrontierSWE and 76.8 on MCP-Atlas.
 
However, Anthropic's Opus 4.8 remained ahead on both benchmarks, posting scores of 75.1 and 77.8, respectively.

Why benchmark wins do not always translate into overall leadership

The results illustrate why multiple companies can simultaneously claim state-of-the-art performance.
 
A system that excels at software engineering may not necessarily lead in scientific reasoning, while a model optimised for long-duration autonomous execution may perform differently on knowledge or cybersecurity evaluations.
 
Anthropic's models continue to hold advantages across software engineering, cybersecurity and several workflow-oriented benchmarks. Fugu has reported stronger performance on selected scientific reasoning and agentic coding tests, while GLM-5.2 has reported competitive or leading results on some agentic coding and long-horizon software engineering evaluations.
 
The results point to an increasingly fragmented AI landscape, where different systems are optimised for different tasks. Rather than producing a single dominant model, the latest generation of AI products is creating specialists in areas such as software engineering, scientific reasoning, cybersecurity and long-horizon agentic execution.


First Published: Jul 01 2026 | 1:07 PM IST
