The Fugu and Fugu Ultra systems have caused a lot of commotion for their grandiose claims of being at Fable and GPT-5.5 level. What are they, and should they really be compared to Fable, GPT-5.5, Mythos or Opus 4.8??? If they can, are they worth using right now???
Summary: Sakana AI's Fugu is a closed-source router that chooses among other proprietary models, while Fugu Ultra is a multi-step orchestrator that plans agent workflows for tasks such as coding and research. The body argues that both systems are hard to evaluate because they lack transparency on model pools, cost, and token usage, and it claims they underperformed on benchmarks and a Rocket League-style coding task compared with frontier models.
Everything is based on interpretations from Sakana AI's own technical report.
Fugu is a closed source orchestrator on top of closed source models. If before you didn't control the models, now you don't even control which ones are used or how much. So first things first, this is not the "AI sovereignty" some open source enthusiasts are expecting from it right now. Quite the contrary actually.
Fugu (not the Ultra version) is a classifier LLM that, at each turn, selects the model most likely to answer correctly (in other words a router model experiment, kind of like the one tried with GPT-5).
Unfortunately, this leads to -10 points on SWE-Bench Pro compared to Opus, and the gains it gets on other benchmarks are very slight.
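To make the "router" idea concrete, here is a minimal sketch of per-turn model selection. Everything in it is an assumption for illustration: the model names, the `Router` class and the keyword-based `toy_score` stand in for a learned classifier, and none of it comes from Sakana AI's report.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Router:
    """Picks one model per query by taking the argmax of a scoring function."""
    models: Sequence[str]
    # score_fn(query, model_name) -> estimated chance of a correct answer.
    # Hypothetical stand-in for a trained classifier.
    score_fn: Callable[[str, str], float]

    def pick(self, query: str) -> str:
        return max(self.models, key=lambda m: self.score_fn(query, m))


# Toy "training data": keywords each model is assumed to be good at.
KEYWORDS = {"model_a": ("code", "bug"), "model_b": ("proof", "math")}


def toy_score(query: str, model: str) -> float:
    q = query.lower()
    # A model the scorer has never seen gets no keywords and scores 0.
    return float(sum(k in q for k in KEYWORDS.get(model, ())))


router = Router(models=("model_a", "model_b", "model_c"), score_fn=toy_score)
print(router.pick("Fix this bug in my code"))
print(router.pick("Check my math proof"))
```

Note that `model_c` is in the pool but the scorer knows nothing about it, so it never wins a query. That is the toy version of why a newly added LLM would probably need the classifier to be retrained.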
A lot of benchmarks in the SWE Bench family are problematic and corrupted though, so Mythos getting such a high score in it is actually the biggest highlight of the problem.
Some arguments could be made that it reduces the cost, but no information about this has been revealed so far, hence, it's most likely the opposite.
They also have an autoresearch benchmark where they compare it to frontier models "Model A, B and C", and not being transparent about that is extremely shady.
Which models are you comparing against!!??
This system of models most likely doesn't support adding any new LLM out of the box, since you would need to retrain the classifier.
Fugu Ultra is basically an advanced plan mode and orchestrator: a model that, for a given query, outputs a plan with multiple "workflows".
The plans say things like: "spawn model A subagents to achieve this, then use model B to judge it, then summarize this with model C", which is just a test-time compute scaling strategy.
The biggest downside to this, contrary to approaches dedicating end-to-end test time compute to a single AI model???
They need to predict everything before the agents start working, which is why they limit this to 5 steps. Ideally, the orchestrator would decide what to spawn at t+1 with the information it has at t, not with the information it had at t=0.
There are also other issues, such as the Fable 5 score on Terminal Bench being wrong and them being super vague about which models are in the LLM pool (they only mention closed-source API ones).
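A rough sketch of the difference, under my own assumptions: `run_static_plan` fixes every step at t=0, while `run_adaptive` picks the next step from the current state. The step functions (`draft`, `fix`, `summarize`) are toy string transforms I made up, and the 5-step cap just mirrors the limit mentioned above.

```python
from typing import Callable, List, Optional

Step = Callable[[str], str]


def run_static_plan(query: str, steps: List[Step], max_steps: int = 5) -> str:
    """All steps are chosen up front; intermediate results cannot change them."""
    if len(steps) > max_steps:
        raise ValueError("plan exceeds the step limit")
    state = query
    for step in steps:
        state = step(state)
    return state


def run_adaptive(query: str,
                 choose_next: Callable[[str], Optional[Step]],
                 max_steps: int = 5) -> str:
    """The next step is chosen from the state at time t, not from t=0."""
    state = query
    for _ in range(max_steps):
        step = choose_next(state)
        if step is None:
            break
        state = step(state)
    return state


# Toy steps (hypothetical): the draft always contains an error.
def draft(s: str) -> str:
    return s + " | draft(error)"


def fix(s: str) -> str:
    return s.replace("(error)", "(fixed)")


def summarize(s: str) -> str:
    return s + " | summary"


def choose_next(state: str) -> Optional[Step]:
    if "draft" not in state:
        return draft
    if "(error)" in state:
        return fix  # only discoverable after seeing the draft
    if "summary" not in state:
        return summarize
    return None


print(run_static_plan("q", [draft, summarize]))
print(run_adaptive("q", choose_next))
```

The static plan ships the error through to the summary because nothing at t=0 said a fix step would be needed; the adaptive loop inserts one once the draft exists.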
The biggest and most obvious issue is that they are introducing a "test time scaling" method with "best of N" over models, and they literally NEVER REPORT the number of output tokens or cost to achieve a benchmark/task.
This is exactly the kind of thing which is extremely problematic and outdated in today's era where AI model performance continues to scale with test time compute/tokens/wall clock time/cost etc etc etc with no theoretical plateau in sight.
This is exactly what leading OpenAI research scientist Noam Brown also highlighted just recently, which I covered in depth in a recent post.
Single-point benchmarks are extremely useless: any model at or close to the frontier in today's era can hit almost any benchmark score if you give it enough compute. What matters is the y-x graph depicting model performance vs compute, latency and cost.
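Here is a minimal sketch of why the token count matters for "best of N" over models. The `Candidate` fields, model names and numbers are all made up for illustration; the only point is that the cost of the final answer is the sum over every sampled candidate, not just the winner.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    model: str          # hypothetical model name
    answer: str
    output_tokens: int  # tokens this candidate generated
    score: float        # score from some judge (assumed to exist)


def best_of_n(candidates: List[Candidate]) -> Tuple[Candidate, int]:
    """Return the highest-scoring candidate and the total tokens spent on all N."""
    best = max(candidates, key=lambda c: c.score)
    total_tokens = sum(c.output_tokens for c in candidates)
    return best, total_tokens


# Made-up numbers, purely illustrative.
pool = [
    Candidate("model_a", "answer 1", 1000, 0.6),
    Candidate("model_b", "answer 2", 2000, 0.9),
    Candidate("model_c", "answer 3", 1500, 0.7),
]
winner, spent = best_of_n(pool)
print(winner.model, winner.output_tokens, spent)
```

In this toy run the winner alone used 2000 tokens, but the system spent 4500 to get there. A benchmark score reported without that second number hides more than half the cost.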
OpenAI is pretty much the only lab that has shown due diligence and faithfulness to this healthy habit over the past few months.
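One way to read a performance-vs-cost graph is through its Pareto frontier: the configurations that no other configuration beats on both cost and score. The sketch below is my own illustration with invented labels and numbers, not data from any lab.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (label, cost, score); units are arbitrary


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Keep points whose score beats every point that costs the same or less."""
    frontier: List[Point] = []
    best_score = float("-inf")
    # Cheapest first; at equal cost, the higher score comes first.
    for label, cost, score in sorted(points, key=lambda p: (p[1], -p[2])):
        if score > best_score:
            frontier.append((label, cost, score))
            best_score = score
    return frontier


# Invented measurements for illustration only.
points = [
    ("a", 1.0, 40.0),
    ("b", 2.0, 55.0),
    ("c", 3.0, 50.0),   # costs more than "b" and scores lower: dominated
    ("d", 10.0, 60.0),
]
print([label for label, _, _ in pareto_frontier(points)])
```

A single headline number only tells you a system sits somewhere on the chart. Without the cost axis you can't tell whether it is on the frontier like "b" or dominated like "c".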
In my opinion, systems like this are better compared not with Opus, but with Opus with ultracode/workflows enabled; not with Kimi, but with Kimi swarm, etc.
Unfortunately, even if we were to assume that this system of models is actually Mythos, Fable and GPT-5.5 level amidst this extreme lack of transparency, would they be worth it right now???
The plain answer is a NO
LLMJUNKY used 100% of his 5 hour allowed usage in less than 1 prompt, without achieving any meaningful success.
Check out his tweet through the images I posted
Link to his prompt is in the comments.
The goal was to create a replica of Rocket League, and Fugu Ultra got severely out-mogged by GPT-5.5, Fable and Opus 4.8 while also crashing head-on into a rate limit wall and not finishing anything.
There's still a long way to go for now.
Dawg
