GPT-5.5 Pro Solves PhD Math in Hours; Claude Learns to Dream

By Artūras Malašauskas, May 13, 2026

Fields medalist Timothy Gowers reports GPT-5.5 Pro produced doctoral-level number theory research in roughly an hour, while Anthropic introduces a memory-refinement process called "dreaming" for Claude agents.

The OpenAI model GPT-5.5 Pro has reportedly crossed a threshold that mathematicians have been tracking for years. Fields medalist Timothy Gowers documented the model solving open problems in additive number theory with what he called "no serious mathematical input" from himself. The work took roughly an hour to complete. Meanwhile, Anthropic unveiled a new capability for Claude Managed Agents that the company is calling "dreaming" — a memory-refinement process that runs while agents are idle.

Both developments emerged within the same week, according to coverage from R&D World. The timing is less coincidental than it appears. Frontier models are converging on the same capability: sustained, multi-step reasoning that compounds over time.

Gowers fed the model open problems from a recent paper by number theorist Mel Nathanson. The model produced what Gowers described as "PhD-level research in an hour or so." One problem took 17 minutes and 5 seconds. The model improved an existing exponential bound to a quadratic bound by swapping out a component in Nathanson's proof for a more efficient variant. The core idea was well known in combinatorics, but its application to this particular problem wasn't obvious.

A more generalized version proved harder. After 16 minutes and 41 seconds, the model delivered a first improvement. Junior researcher Isaac Rajagopal called this a routine modification of his own work. Gowers then asked for a stronger bound. After another 31 minutes and 40 seconds, the model improved the bound from exponential to polynomial. Rajagopal judged the results "almost certainly correct" and called the key idea "completely original."

The bar for mathematicians is now proving what LLMs cannot prove. Gowers put it bluntly: "The lower bound for contributing to mathematics will now be to prove something that LLMs can't prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting." He noted that PhD students could use LLMs as a tool. The real task will be creating something in collaboration with LLMs that the models can't do alone.

This is not the first time AI has touched mathematical research. Google DeepMind published an "AI co-mathematician" workbench designed around the actual workflow of open-ended math: ideation, literature search, computational exploration, theorem proving, failed hypothesis tracking and theory building. The paper reports state-of-the-art results on hard problem-solving benchmarks, including 48% on FrontierMath Tier 4.

Anthropic's dreaming feature operates on a different axis. The system reviews prior sessions, finds patterns, updates memory and helps agents improve across jobs. It runs as a scheduled asynchronous workflow while the agent is not actively working. The system looks at the agent's memory store, the transcripts of prior conversations, and the outcomes of prior tasks. Then it does four things: merges duplicate information across sessions, removes outdated entries that no longer apply, highlights recurring patterns, and reorganizes the agent's memory layer so future sessions can build on the previous ones.
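
The four steps above can be sketched in a few lines of Python. This is a minimal illustration of the general idea, not Anthropic's implementation: the `MemoryEntry` fields, the `consolidate` function, and the `min_repeats` threshold are all hypothetical names invented here, and the real system presumably relies on a model rather than exact text matching to decide what counts as a duplicate or an outdated entry.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    """One remembered fact. Every field is a hypothetical stand-in."""
    topic: str
    text: str
    session: int         # index of the session that recorded the entry
    stale: bool = False  # True once a later session showed it no longer applies


def consolidate(entries, min_repeats=2):
    """Run one offline pass and return (memory, recurring_topics)."""
    # Remove outdated entries that no longer apply.
    live = [e for e in entries if not e.stale]

    # Highlight recurring patterns: topics seen at least min_repeats times.
    counts = Counter(e.topic for e in live)
    recurring = sorted(t for t, n in counts.items() if n >= min_repeats)

    # Merge duplicates across sessions, keeping the most recent copy.
    newest = {}
    for e in live:
        key = (e.topic, e.text.strip().lower())
        if key not in newest or e.session > newest[key].session:
            newest[key] = e

    # Reorganize: group by topic, most recent entries first.
    memory = {}
    for e in sorted(newest.values(), key=lambda e: (e.topic, -e.session)):
        memory.setdefault(e.topic, []).append(e.text)
    return memory, recurring


entries = [
    MemoryEntry("style", "Use British spelling", session=1),
    MemoryEntry("style", "use British spelling", session=3),
    MemoryEntry("deploy", "Deploy on Fridays", session=1, stale=True),
    MemoryEntry("deploy", "Deploy on Tuesdays", session=2),
]
memory, recurring = consolidate(entries)
```

In this toy run the two spelling notes collapse into one, the stale deployment rule is dropped, and "style" is flagged as a recurring topic because it came up in more than one session.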

Anthropic says Harvey saw roughly 6x higher completion rates in tests. Netflix is using multiagent orchestration to process logs from hundreds of builds. Dreaming is still a research-preview feature, while the outcomes and multiagent orchestration features have moved into public beta. Dreaming was introduced at the Code with Claude developers' conference in San Francisco.

CEO Dario Amodei disclosed that Anthropic had planned for 10x annualized growth in the first quarter of 2026. The actual result was 80x. API volume on the Claude platform is up nearly 70 times year-over-year. The average developer using Claude Code now spends 20 hours per week working with the tool. That single disclosure explains nearly every news cycle Anthropic has been at the center of over the last 30 days.

The compute implications are stark. Anthropic signed an agreement to use all capacity at SpaceX's Colossus 1 data center. The collaboration adds more than 300 megawatts and more than 220,000 NVIDIA GPUs within the month. The company also has compute deals with Amazon and Microsoft Azure, plus TPU agreements with Google. None of it makes sense at 10x growth. All of it makes sense at 80x.

There is a useful frame for understanding why this matters more than typical product news, and it has to do with what an AI agent actually is. Until this week, the most accurate description of an AI agent was "a tool that performs tasks on your behalf." You give it a job. It does the job. It does the job again next time you ask. Each session is independent. The agent might be very good at the job, but it is not getting better at the job. It is repeating its performance.

What dreaming changes is that the agent now compounds. A Claude Managed Agent running in an enterprise that has dreaming enabled is improving every night while no one is using it. It is reviewing what worked, what did not work, what the team's actual preferences turned out to be, and updating its internal model of how to do the job. The next time you give it the job, it does it slightly better. Over enough cycles, the improvements compound. That is not a tool. That is a compounding asset. And it is a category of asset that did not exist on enterprise balance sheets a week ago.

The original session transcripts remain untouched. The dreaming process operates on a separate layer, which allows teams to safely review changes before letting them take effect. This is genuinely a different category of AI capability from anything that has shipped previously at this scale. Until now, AI agents were either stateless or had narrow context windows that filled up quickly. Dreaming creates something closer to long-term memory.
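
The review-before-effect idea can be illustrated with a small staging layer. This is a sketch under stated assumptions rather than a description of Anthropic's API: the `StagedMemory` class and its `propose`, `diff`, `approve`, and `reject` methods are hypothetical names, and the sketch models only the separation between proposed changes and the memory an agent actually reads.

```python
from dataclasses import dataclass, field


@dataclass
class StagedMemory:
    """Hypothetical review layer: proposals wait here until a human approves."""
    live: dict                                   # memory the agent reads
    pending: dict = field(default_factory=dict)  # key -> new value (None = delete)

    def propose(self, key, value):
        """Record a suggested change without touching live memory."""
        self.pending[key] = value

    def diff(self):
        """Return (key, old_value, new_value) tuples for a reviewer."""
        return [(k, self.live.get(k), v) for k, v in sorted(self.pending.items())]

    def approve(self, keys=None):
        """Apply the selected proposals (all of them when keys is None)."""
        selected = list(self.pending) if keys is None else list(keys)
        for k in selected:
            value = self.pending.pop(k)
            if value is None:
                self.live.pop(k, None)
            else:
                self.live[k] = value

    def reject(self):
        """Discard every remaining proposal."""
        self.pending.clear()


mem = StagedMemory(live={"tone": "formal", "vendor": "Acme"})
mem.propose("tone", "concise")
mem.propose("vendor", None)
review = mem.diff()      # nothing has changed in mem.live yet
mem.approve(["tone"])    # accept one change
mem.reject()             # drop the rest
```

Keeping proposals in their own structure means a rejected change leaves no trace in the memory the agent uses, which is the property the separate layer is meant to provide.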

On the math side, the implications are more immediate. Open problems from recent papers used to be a valuable training ground for early-career researchers. That bar just got raised. Gowers noted that LLMs have reached the point where, if an open problem has an easy argument that human mathematicians missed, there is a good chance the model will find it. Separately, a customized GPT-5.5 variant helped discover a new Lean-verified proof related to asymptotic properties of off-diagonal Ramsey numbers.

Separately, OpenAI launched Daybreak for AI-assisted cyber defense. The system combines GPT-5.5, Codex Security and tiered access for verified defensive workflows. Daybreak offers secure code review, vulnerability triage, malware analysis and patch validation. The push also formalizes a split between general-purpose and specialized cyber access, with stronger account controls for high-risk security tasks.

Google flagged real adversaries using AI-augmented exploit research. The Google Threat Intelligence Group reported that attackers used it to identify and exploit a 2FA bypass vulnerability in a server administration tool. The report also describes threat actors using middleware projects to aggregate API keys and support persistent access, which makes agent permissions, identity and tool access the main failure surface.

Alphabet made several announcements of its own: it launched Googlebook and announced Gemma 4 inference optimization. Meanwhile, its Isomorphic Labs spinoff raised $2.1 billion for AI drug design. The round links DeepMind's research lineage, Demis Hassabis' Nobel-era credibility, Alphabet/GV backing and pharma partnerships into a commercial AI-drug-discovery platform.

None of this is inevitable. The math results depend on problems having "easy arguments that human mathematicians missed." That's a narrow class. The dreaming feature is still in research preview. The compute deals depend on demand continuing to outstrip supply. Whether customers will actually pay for all of this remains the real question.

The industry keeps announcing breakthroughs. The models keep getting better. The question is whether the people who need to use them can actually afford the electricity bill.

Artūras Malašauskas is an AI Systems Integrator with 20+ years of production-grade web engineering experience. He has designed, shipped, and scaled enterprise Python/PHP systems for logistics, SaaS, and public-sector clients. For the past year, he has focused exclusively on AI integrations: deploying open-source LLMs, building generative media pipelines (image, audio, video), and engineering multi-agent workflows for real production environments. His standard: reproducibility, security, and cost-efficient inference, with no vaporware. He documents and evaluates emerging AI tooling, separating verified capabilities from marketing noise. Technical editor at: muza-ai.eu, ai-verslas.lt, ai-naujinos.lt
