Fugu vs GLM-5.2 vs Mythos: Why AI benchmarks crown different winners

New AI models from Japan's Sakana AI and China's Z.ai are posting higher scores than Anthropic's Claude Mythos on select benchmarks. Here's what that actually means

Japan's Fugu, China's GLM-5.2 and Anthropic's Claude Mythos showcase how different AI architectures excel in different benchmark tests. (Image credit: Company logos taken from their respective websites)

Akshita Singh New Delhi

5 min read Last Updated : Jul 01 2026 | 1:08 PM IST

The latest race among advanced artificial intelligence (AI) systems is increasingly being fought through benchmark scores, with Japan's Sakana AI and China's Z.ai (formerly Zhipu AI) reporting stronger performance than Anthropic's Claude Mythos on select evaluations.

However, the comparison is not entirely straightforward because the systems being measured are designed for different purposes.

Different models, different strengths

Claude Mythos is Anthropic's flagship frontier AI model, built as part of the company's broader Claude family of products.

Anthropic offers different Claude models for different use cases, including Opus for complex reasoning and coding tasks, Sonnet for general-purpose workloads, and Haiku for faster and lower-cost applications. Mythos sits at the top end of that stack and is positioned as Anthropic's most capable model for software engineering, reasoning, and agentic tasks.

On the other hand, Sakana AI's Fugu is not a standalone frontier model in the traditional sense. Instead, it functions as an orchestration layer that coordinates multiple frontier AI models and dynamically routes tasks to whichever model is best suited to solve them. In its technical report, Sakana AI said Fugu models are trained to "adaptively and dynamically orchestrate a team of more powerful frontier agent workers" and can achieve performance beyond a single model through what it describes as collective intelligence.
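
The routing idea can be pictured with a short sketch. It is only an illustration under loud assumptions: the worker names, the proficiency numbers and the `route` and `orchestrate` helpers are hypothetical, and a fixed lookup stands in for the learned, adaptive policy that Sakana AI describes. It is not the company's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Worker:
    """A stand-in for one underlying model (hypothetical, for illustration)."""
    name: str
    strengths: Dict[str, float]   # task category -> assumed proficiency, 0 to 1
    run: Callable[[str], str]     # function that would call the model


def route(category: str, workers: List[Worker]) -> Worker:
    """Return the worker with the highest assumed proficiency for this category."""
    return max(workers, key=lambda w: w.strengths.get(category, 0.0))


def orchestrate(category: str, prompt: str, workers: List[Worker]) -> Tuple[str, str]:
    """Route the prompt to one worker and return (worker name, worker output)."""
    worker = route(category, workers)
    return worker.name, worker.run(prompt)


# Made-up workers and proficiency numbers; none of these come from Sakana AI.
WORKERS = [
    Worker("worker_a", {"coding": 0.9, "science": 0.6}, lambda p: f"[worker_a] {p}"),
    Worker("worker_b", {"coding": 0.7, "science": 0.8}, lambda p: f"[worker_b] {p}"),
]

if __name__ == "__main__":
    print(orchestrate("coding", "Fix the failing unit test", WORKERS))
    print(orchestrate("science", "Explain the result in figure 2", WORKERS))
```

In this toy version the coding prompt goes to `worker_a` and the science prompt to `worker_b`, which is why an orchestrated system's score reflects the combined strengths of its workers rather than those of any single model.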

China's GLM-5.2 occupies yet another position in the market. Rather than focusing primarily on raw benchmark performance, Z.ai has marketed the model around long-horizon task completion, autonomous coding workflows and agentic execution.

Why the distinction matters

It matters because many of the benchmark results now being cited compare fundamentally different approaches to AI development. While Anthropic is evaluating the capabilities of a frontier foundation model, Sakana AI is testing an orchestration system built on top of multiple models, and Z.ai is focusing on agentic systems designed for longer-duration workflows.

As a result, the benchmark leaders vary depending on what is being measured.

Where the models differ

The benchmark comparisons being cited by Anthropic, Sakana AI and Z.ai are not entirely like-for-like because the systems being evaluated are built differently and, in some cases, the companies report results for different models within their product families.

Benchmark	What it measures	Claude Mythos / Opus	Fugu / Fugu Ultra	GLM-5.2
SWE-Bench Pro	Real-world software engineering	80.3% (Mythos)	73.7%	62.1%
Terminal Bench 2.1	Agentic coding via terminal use	80.4% (Mythos Preview)	82.1%	82.7%
GPQA Diamond	Graduate-level science reasoning	94.6 (Mythos Preview)	95.5	91.2
CharXiv Reasoning	Scientific charts and figures	86.1 (Mythos Preview)	86.6	Not reported
Humanity's Last Exam*	Multidisciplinary reasoning	64.7% (with tools)	50.0%	54.7% (with tools)
FrontierSWE	Long-horizon coding tasks (20 hrs)	75.1% (Opus 4.8)	Not reported	74.4%
PostTrainBench	Long-horizon agentic tasks	37.2% (Opus 4.8)	Not reported	34.3%
MCP-Atlas	Multi-step agent workflows	77.8% (Opus 4.8)	Not reported	76.8%
Tool-Decathlon	Tool use and workflows	59.9% (Opus 4.8)	Not reported	48.2%
ExploitBench	Cybersecurity and vulnerability tasks	78.0% (Mythos)	Not reported	Not reported
*Scores for Humanity's Last Exam are reported under different evaluation settings and may not be directly comparable. Anthropic and Z.ai report tool-enabled scores, while Sakana AI reports a text-only score.
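
To see how the leader changes from row to row, the table can be tallied with a few lines of Python. This is a rough sketch, not an official ranking: the scores are copied from the table above, the "Anthropic" column mixes Mythos, Mythos Preview and Opus 4.8 as the table does, Humanity's Last Exam is left out because its scores were reported under different settings, and the `leaders` helper and its rule of requiring at least two reported scores are assumptions made for this illustration.

```python
from collections import Counter
from typing import Dict, Optional

# Scores copied from the table above; None marks "Not reported".
# Humanity's Last Exam is omitted because its settings are not comparable.
SCORES: Dict[str, Dict[str, Optional[float]]] = {
    "SWE-Bench Pro":      {"Anthropic": 80.3, "Fugu": 73.7, "GLM-5.2": 62.1},
    "Terminal Bench 2.1": {"Anthropic": 80.4, "Fugu": 82.1, "GLM-5.2": 82.7},
    "GPQA Diamond":       {"Anthropic": 94.6, "Fugu": 95.5, "GLM-5.2": 91.2},
    "CharXiv Reasoning":  {"Anthropic": 86.1, "Fugu": 86.6, "GLM-5.2": None},
    "FrontierSWE":        {"Anthropic": 75.1, "Fugu": None, "GLM-5.2": 74.4},
    "PostTrainBench":     {"Anthropic": 37.2, "Fugu": None, "GLM-5.2": 34.3},
    "MCP-Atlas":          {"Anthropic": 77.8, "Fugu": None, "GLM-5.2": 76.8},
    "Tool-Decathlon":     {"Anthropic": 59.9, "Fugu": None, "GLM-5.2": 48.2},
    "ExploitBench":       {"Anthropic": 78.0, "Fugu": None, "GLM-5.2": None},
}


def leaders(scores: Dict[str, Dict[str, Optional[float]]],
            min_reported: int = 2) -> Dict[str, str]:
    """Map each benchmark to its top scorer, skipping rows with too few results."""
    result: Dict[str, str] = {}
    for bench, row in scores.items():
        reported = {name: value for name, value in row.items() if value is not None}
        if len(reported) < min_reported:
            continue  # a lone score has nothing to be compared against
        result[bench] = max(reported, key=reported.get)
    return result


LEADERS = leaders(SCORES)
TALLY = Counter(LEADERS.values())

if __name__ == "__main__":
    for bench, name in LEADERS.items():
        print(f"{bench}: {name}")
    print(dict(TALLY))
```

Under these assumptions, Anthropic's models lead five of the eight comparable rows, Fugu leads two and GLM-5.2 leads one. ExploitBench is skipped because only Anthropic reported a score, so the tally depends as much on which benchmarks each company chose to report as on the numbers themselves.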

What the comparison shows

Anthropic's models continue to lead several software engineering, cybersecurity and agentic workflow evaluations. The company reported an 80.3 per cent score for Claude Mythos on SWE-Bench Pro, a benchmark that tests whether AI systems can resolve real-world software issues in code repositories. Anthropic's Opus 4.8 also leads benchmarks such as FrontierSWE, MCP-Atlas and Tool-Decathlon, according to the figures disclosed by the respective companies.

Sakana AI reported stronger performance on some scientific reasoning and agentic coding benchmarks. According to the company's technical report, Fugu Ultra scored 95.5 on GPQA Diamond, a benchmark that evaluates graduate-level scientific reasoning, compared with 94.6 for Mythos Preview. Fugu Ultra also achieved 86.6 on CharXiv Reasoning, ahead of Mythos Preview's 86.1. On Terminal Bench 2.1, which evaluates how effectively models can interact with computing environments through terminal commands, Fugu Ultra scored 82.1 compared with 80.4 for Mythos Preview.

Z.ai also reported that GLM-5.2 outperformed Mythos Preview on Terminal Bench 2.1, scoring 82.7 against 80.4. The company further reported competitive results on long-horizon coding and agentic workflow benchmarks, including 74.4 on FrontierSWE and 76.8 on MCP-Atlas.

However, Anthropic's Opus 4.8 remained ahead on both benchmarks, posting scores of 75.1 and 77.8, respectively.

Why benchmark wins do not always translate into overall leadership

The results illustrate why multiple companies can simultaneously claim state-of-the-art performance.

A system that excels at software engineering may not necessarily lead in scientific reasoning, while a model optimised for long-duration autonomous execution may perform differently on knowledge or cybersecurity evaluations.

Anthropic's models continue to hold advantages across software engineering, cybersecurity and several workflow-oriented benchmarks. Fugu has reported stronger performance on selected scientific reasoning and agentic coding tests, while GLM-5.2 has reported competitive or leading results on some agentic coding and long-horizon software engineering evaluations.

The results point to an increasingly fragmented AI landscape, where different systems are optimised for different tasks. Rather than producing a single dominant model, the latest generation of AI products is creating specialists in areas such as software engineering, scientific reasoning, cybersecurity and long-horizon agentic execution.
