AI Matches Top Human Forecasters - TCR 05/24/26

A frontier AI matched professional superforecasters, Qwen ran 35 unattended hours, and AI surfaced 5,000 PFAS-stripping molecules.

Three-panel infographic: AI matches human superforecasters and runs 35 unattended hours, Gulf data centers struck in Middle East war, and US SOCOM seeks disconnected on-device AI.

The 20-Second Scan


The 2-Minute Read

The thread across yesterday's signal is the substrate of the AI era being assembled and contested on the same clock the capability compounds on. Three capability signals arrived in the same cycle: a frontier model matching professional superforecasters, an autonomous agent running 35 unattended hours of GPU kernel optimization, and a generative materials platform surfacing thousands of candidate molecules for stripping forever chemicals from drinking water.

The physical layer running under that capability was contested in the same news cycle. Two Amazon data centers in the UAE were physically struck early in the Middle East war, and the Gulf states' multibillion-dollar bet on becoming the world's third AI infrastructure pole now sits inside a kinetic risk model the original investment memos did not contain. Anthropic's Project Glasswing posted its first month's defensive scoreboard, surfacing more than 10,000 high or critical vulnerability candidates across systemically important open-source software, with 97 flaws patched upstream.

The governance layer formed in parallel. New York City Council members introduced bipartisan legislation to create the first municipal AI enforcement office in the United States. U.S. Special Operations Command described its procurement of disconnected on-device frontier AI for tactical use, an operational specification that would foreclose the kind of after-action accountability civilian institutions spent decades building into other forms of state power. At Cannes, the creative industry's split on AI surfaced on the record from major names.

The institutions surrounding what these systems can do are being assembled in council chambers, festivals, and procurement specifications, on a clock measured in news cycles rather than legislative sessions.


The 20-Minute Deep Dive

Three Signals of Compounding Autonomous Capability

Three capability signals arrived in the same news cycle, and together they extend the recursive-self-improvement discussion the field opened at Google I/O. As the May 23 edition of The Century Report documented, DeepMind's CEO framed the moment as the "foothills of the singularity" while OpenAI was hiring in parallel for the same threshold. Google DeepMind's "green tree" model claimed the top spot on the Forecasting Research Institute's ForecastBench, the first time a language model has matched professional superforecaster performance on the cognitively demanding prediction tasks the benchmark assembles from active geopolitical, scientific, and economic questions. Superforecasters are the small group of humans who consistently beat aggregated expert judgment across long-horizon predictions; the benchmark was designed against their performance because nothing else was producing comparable results. A frontier model now matches them on the same questions, scored under the same protocol.

Alibaba's Qwen3.7-Max ran 35 continuous hours of autonomous execution against a Triton kernel-optimization benchmark, performing 432 kernel evaluations across 1,158 tool calls and delivering a 10x geometric-mean speedup over the reference baseline. The shape of the run carries as much weight as the numerical result. A model planning, executing, evaluating, and iterating across 35 hours without continuous human direction is the operational form the field has been describing as an autonomous research agent. METR's task-horizon measurements went from 30 seconds to 12 hours over four years; a 35-hour run is the next data point on the same curve, arriving in the cycle that opened with DeepMind's CEO naming recursive self-improvement as operational near-term.
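
Two figures in this item rest on arithmetic that is easy to check by hand. The sketch below is illustrative only. The per-kernel speedups are hypothetical placeholders, since the reporting gives the 10x geometric mean and not the individual kernel results. The doubling-time calculation assumes smooth exponential growth between the two METR endpoints cited above, and it treats a 35-hour run as directly comparable to those measurements, which is an assumption rather than something the benchmark establishes.

```python
import math


def geometric_mean(values):
    """Average a list of ratios multiplicatively (exp of the mean log)."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("need a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def doubling_time_months(start, end, elapsed_months):
    """Months per doubling, assuming steady exponential growth from start to end."""
    return elapsed_months / math.log2(end / start)


# HYPOTHETICAL per-kernel speedups; the source reports only the 10x aggregate.
hypothetical_speedups = [4.0, 10.0, 25.0]
gm_speedup = geometric_mean(hypothetical_speedups)

# Figures cited above: 30 seconds to 12 hours over four years (48 months).
metr_doubling = doubling_time_months(30, 12 * 3600, 48)

# ASSUMPTION: a 35-hour run is comparable to the 12-hour METR figure.
months_to_35h = metr_doubling * math.log2(35 / 12)

print(f"geometric-mean speedup: {gm_speedup:.1f}x")
print(f"implied doubling time: {metr_doubling:.1f} months")
print(f"months from 12h to 35h on that curve: {months_to_35h:.1f}")
```

A geometric mean is the usual choice for averaging speedup ratios because one outsized kernel cannot dominate the result the way it would in an arithmetic mean.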

Kemira and CuspAI applied generative materials design to the PFAS problem, exploring a chemical design space of roughly 300 trillion candidate structures to surface more than 5,000 novel materials for stripping forever chemicals from drinking water. PFAS are the synthetic compounds that accumulate in soil, water, and human tissue without degrading. Removal at scale has been a chronic engineering bottleneck because the molecules resist conventional filtration. A search across 300 trillion structures is not a search any human team could conduct; the field has been waiting for the computational substrate that could conduct one and produce candidates dense enough to engineer against. The substrate now exists and has produced its first deployment-relevant output.

What the three signals share is the shape of autonomous capability arriving in operational form across distinct domains in the same cycle. A model matching superforecasters at prediction, a model running 35 unattended hours of GPU kernel optimization, and a model surfacing thousands of candidate PFAS removers from a chemistry space human attention can never reach in aggregate describe the same capability layer compounding on multiple fronts. The instruments the next decade of research will use are being built into the systems running the research, on a timeline the prior decade's planning frameworks did not anticipate.

Gulf AI Infrastructure Crosses From At-Risk to Actively Contested

The May 23 edition of The Century Report covered Iran reportedly weighing whether to seize the seven undersea cables transiting the Strait of Hormuz that carry the Gulf's AI-infrastructure connectivity to Europe and the United States. CNBC's reporting this week adds the kinetic layer underneath the cable-diplomacy story. Two Amazon data centers in the UAE were physically targeted early in the Middle East war. Oil has held near $100 a barrel for nearly three months. The Strait of Hormuz remains closed. The Gulf's positioning as the world's third major AI infrastructure pole behind the US-China axis now sits inside a different risk model than the one the original investment memos used. The cable story was a risk at the level of reported intention; the strikes on Amazon's UAE facilities move the same risk to documented physical reality.

The structural commitments behind that buildout remain enormous. The UAE channels capital through MGX and G42, both founded by Mubadala. Saudi Arabia's HUMAIN sits inside a sovereign wealth fund approaching $1 trillion. Qatar set up Qai through QIA with Brookfield as partner. Cisco, Oracle, AWS, Microsoft, and Google layered investments and data centers across the region before the war began, anchored on cheap industrial power around $0.11 per kWh against $0.25 to $0.40 in much of Europe, and on the proximity to European and African demand the cable geography enabled. That arithmetic was the entire investment thesis.
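
The power-price gap is easier to weigh as an annual bill. The sketch below uses the per-kWh rates quoted above. The 100 MW facility size and the assumption of continuous full load are hypothetical, chosen only to make the arithmetic concrete, and do not describe any project named in this edition.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; ignores leap years


def annual_power_cost_usd(load_mw, usd_per_kwh, utilization=1.0):
    """Yearly electricity bill for a constant load at a flat per-kWh rate."""
    kwh_per_year = load_mw * 1000 * HOURS_PER_YEAR * utilization
    return kwh_per_year * usd_per_kwh


# HYPOTHETICAL facility size, not taken from the source.
HYPOTHETICAL_LOAD_MW = 100

# Rates as quoted in the text above.
rates_usd_per_kwh = {
    "Gulf": 0.11,
    "Europe (low end)": 0.25,
    "Europe (high end)": 0.40,
}

annual_costs = {
    region: annual_power_cost_usd(HYPOTHETICAL_LOAD_MW, rate)
    for region, rate in rates_usd_per_kwh.items()
}

for region, cost in annual_costs.items():
    print(f"{region}: ${cost / 1e6:,.1f}M per year")
```

Under those assumptions the Gulf rate comes to roughly $96 million a year against $219 million to $350 million in Europe, which is the size of the margin a kinetic risk premium now has to be weighed against.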

The arithmetic is being repriced in public. Pure Data Center Group's CEO told CNBC in April the company had temporarily paused investment decisions in the region while continuing planning discussions. BCLP partner Mark Richards, whose firm advises large-scale data center projects in the Gulf, said investment timelines are extending because risks that were not part of the original thesis are now being priced into it. Gas prices in the UAE jumped 30% for consumers in April after sustained higher oil prices, and Brent surged more than 55% from around $72 a barrel to nearly $120 at its peak. Even in energy-rich states, the cheap industrial power that anchored the AI bet is no longer guaranteed.

Trisha Ray at the Atlantic Council described the shift more directly. Risk management around AI infrastructure had been about cyber threats and digital disruptions, and the drone strikes against Amazon's facilities moved that risk model onto physical terrain the original site selections did not anticipate. Ray suggested operators may need to physically harden the sites or build them underground, the kind of design constraint that lives natively in defense procurement and has not historically applied to commercial cloud infrastructure.

The same lesson the chip layer absorbed across eleven weeks of helium and LNG supply-chain shocks, surfaced in TSMC, Samsung, and SK Hynix Q1 earnings, is now arriving at the data center layer. Physical-substrate concentration in a single contested geography is an internal cost that compounds rather than an externalized assumption that can be left out of the planning. The geography that looked like the AI era's third major hub a year ago is being repriced as one where the buildout cost includes a kinetic threat model the original investment memos did not contain. Every new Gulf data center contract being negotiated this quarter carries that line in its risk model in a way it did not last quarter, which is itself the lower-cost version of learning the lesson.

The dispersal pattern is already visible elsewhere in the same week. Frontier-tier AI capability that fits on hardware small enough to carry, which SOCOM is publicly procuring and which Anthropic and Apple are shipping to consumers, lowers concentration risk by removing the network round-trip entirely. The next round of Gulf data center contracts will weigh dispersal as a real option alongside hardening for the first time.

Claude Mythos Scans the World's Most Critical Software and Finds 10,000 Flaws in One Month

Anthropic published the first operational results from Project Glasswing, the controlled-release framework giving roughly 50 cybersecurity defenders exclusive early access to Claude Mythos Preview. In the month since the initiative went live, the model has surfaced more than 10,000 high or critical severity vulnerability candidates across systemically important open-source software. Subsequent triage has confirmed 1,726 as valid findings, with 1,094 of those rated high or critical. Ninety-seven flaws have already been patched upstream, and 88 security advisories have been issued. Among the confirmed findings is a critical flaw in WolfSSL rated 9.1 on the CVSS scale that allowed an attacker to forge certificates and impersonate legitimate services.
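
The scoreboard reads more clearly as a funnel. The sketch below uses only the counts reported above. It makes two assumptions: that the 97 patched flaws are drawn from the 1,726 confirmed findings, and that 10,000 can stand in for "more than 10,000" candidates, which makes the confirmation rate an upper bound. Triage may also still be in progress, so the confirmation rate should not be read as a final precision figure.

```python
# Counts as reported in the text above.
CANDIDATES = 10_000          # "more than 10,000", so treated as a floor
CONFIRMED = 1_726
CONFIRMED_HIGH_OR_CRITICAL = 1_094
PATCHED_UPSTREAM = 97        # ASSUMPTION: a subset of the confirmed findings
ADVISORIES_ISSUED = 88


def share(part, whole):
    """Fraction of `whole` represented by `part`."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


funnel = {
    "confirmed / candidates (upper bound)": share(CONFIRMED, CANDIDATES),
    "high or critical / confirmed": share(CONFIRMED_HIGH_OR_CRITICAL, CONFIRMED),
    "patched / confirmed": share(PATCHED_UPSTREAM, CONFIRMED),
}

for stage, fraction in funnel.items():
    print(f"{stage}: {fraction:.1%}")

# Confirmed findings with no upstream patch yet, under the assumption above.
unpatched_confirmed = CONFIRMED - PATCHED_UPSTREAM
print(f"confirmed findings awaiting a patch: {unpatched_confirmed}")
```

On these figures fewer than 6% of confirmed findings have been patched so far, which is the gap between discovery and remediation in numerical form.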

The Century Report covered Project Glasswing's announcement in the April 8 edition and Mozilla's published Firefox engineering postmortem in the May 8 edition. What advances the arc is the running scoreboard: the first observed finding rate when frontier-grade vulnerability discovery is operationalized as defender infrastructure. One Glasswing partner bank used Mythos to detect and stop a fraudulent $1.5 million wire transfer initiated after an attacker compromised a customer's email and made spoofing phone calls. The defensive surface area Mythos covers now extends past static code analysis into live fraud interception.

The downstream pressure on the rest of the patch ecosystem is becoming visible in the cadences of the largest vendors. Microsoft acknowledged this month that its monthly patch volumes will keep trending larger for some time. Oracle has shifted to a monthly patch cycle to keep pace. Anthropic itself recommended in its disclosure that network defenders shorten patch testing and deployment timelines, harden default configurations, and assume the asymmetry between AI-aided discovery and human-paced remediation is the next problem to solve. OpenAI's parallel restricted release of Daybreak and GPT-5.5-Cyber follows the same staged-disclosure pattern, with the two frontier labs converging on the same governance form rather than waiting for an external regulatory framework to define it.

What the scoreboard puts on the public record is that the prior assumption underneath software security - vulnerabilities accumulate faster than they can be found, and patches lag the discoveries that surface them - is being inverted at the discovery layer first. The remediation layer is what the next year of work has to compress. Oracle's monthly cycle is the early version of that compression. The version that arrives next will be built around the assumption that the discovery rate is now machine-paced rather than human-paced, and the verification architecture for what got fixed will compound on the same clock.

New York City Moves to Build the First Municipal AI Enforcement Office

A new bill introduced this week in the New York City Council would create an Office of Artificial Intelligence Oversight inside the city's Department of Consumer and Worker Protection. The office would carry investigative authority over wrongful AI use in employment, housing, credit, and access to government and other services, and could recommend penalties and other sanctions against deployments that exploit New Yorkers or cut them out of decisions they should have a say in. The legislation, sponsored by Councilwoman Julie Won, has already attracted bipartisan co-sponsorship from six council members across both major parties, an unusual coalition for any AI bill at any level of government.

The governance stack the bill is filling sits underneath the federal architecture that has dominated AI policy conversations for the past year. Frontier-lab oversight runs from labs to federal agencies, with state attorneys general acting against specific harms case by case. The enforcement layer at municipal scale, where most consumer interactions with AI-mediated decisions actually happen, has had no dedicated home. New York City passed a 2021 local law regulating automated employment decision tools, but a state Comptroller audit in December described the enforcement of that law as "ineffective" with serious flaws in complaint handling and review. The new bill is the response, building the enforcement capacity the earlier statute lacked and extending its scope from hiring to the broader set of consequential decisions algorithmic systems now mediate.

The structural reading is that the governance form arriving here is municipal investigative authority over deployment, distinct from state-level statutory restriction and federal pre-deployment review. Each of those forms has been tried at scale over the past two years; each runs into the gap between the speed of capability deployment and the speed of legislative and regulatory response. A city-level office with investigative authority over specific consumer-protection violations operates on a different clock and against a different target. It targets the deployment layer where AI-mediated decisions reach the people who live in the city, with enforcement happening at the contact point between the system and the person.

Whether the bill becomes law is a question the council has yet to answer. The question the bill raises is whether the next layer of AI governance gets built at municipal scale, with each city establishing its own enforcement architecture in the absence of federal coordination. If New York moves forward, the template becomes available for other cities to adapt. The pattern of subnational governance forming under federal-level deadlock that this newsletter has tracked across data center moratoriums, state AI anti-discrimination laws, and balcony solar legislation extends here too. The layer beneath the labs is being built city by city, by elected officials operating at the scale where AI-mediated harm actually lands.

U.S. Special Operations Forces Are Using AI Heavily and Need It to Work Without a Network

At SOF Week in Tampa, the program manager for intelligence at the U.S. Special Operations Command's Program Executive Office for Digital Applications said special operators are already using generative AI "heavily" for resource allocation and force deployment, and are now actively procuring fog computing frameworks that would extend cloud-scale AI capability to disconnected tactical environments. The procurement specification asks for frontier-tier model performance on hardware small enough to carry, in operational contexts where the network connection to a cloud datacenter is unavailable, unreliable, or deliberately suppressed.

The technical request resembles the one driving consumer demand for local AI models. Anthropic, Apple, and the open-weight ecosystem have all reported surging demand for systems that run inference on the user's own device rather than in a remote datacenter. For ordinary users, the reasons are well documented: data privacy, latency, continuity when networks fail, sovereignty over the conversation. The same architectural property carries a different operational meaning when the deploying institution is a combat command and the operational decision being made is whether to call a strike.

A cloud-connected military AI system leaves an audit trail. Queries reach a datacenter someone else operates. Logs accumulate. A model that runs entirely on a device inside the tactical environment does not. The operator works alongside an intelligence capability whose reasoning, recommendation, and execution can happen without an external record of what the system was asked or what it said in response. SOCOM officials at SOF Week framed this property as a tactical necessity, and on the operational logic of contested environments they are correct. It is also the property that would foreclose, by design, the kind of after-action accountability civilian institutions have spent decades building into other forms of state power.

The closest civilian parallel is police body-worn cameras. Their adoption came after sustained external pressure from communities, courts, and journalists established that unaudited use of force produced outcomes the public was no longer willing to absorb. The accountability layer was retrofitted onto an institution that had operated without it. The procurement specification SOCOM described this week is for an architecture in which that retrofit would be technically harder to add later than it would be to require now. The institutions designing the verification framework for autonomous and semi-autonomous military AI are working against a deployment clock that is itself accelerating. The cheaper moment to insert the accountability layer is the one in which the procurement specification is still being written.

AI Fault Lines Surface Publicly at Cannes

At Cannes this week, the creative industry's internal split on AI surfaced from major names at the industry's most visible platform. Darren Aronofsky's production company Primordial Soup announced an active partnership with Google DeepMind on generative film projects, framing the collaboration as an expansion of the cinematic toolbox available to filmmakers. Guillermo del Toro, speaking in a separate session, said he "would rather die" than use AI in his filmmaking.

The two positions carry structural weight beyond personal aesthetic preference. Aronofsky's partnership operationalizes a workflow where generative models contribute to the visual layer of films that will reach theatrical and streaming distribution; the technology is now a production-grade input inside a recognized creative pipeline at the top of the industry. Del Toro's refusal stakes out a counter-position that the most decorated craft filmmakers can hold publicly without ambiguity, and it makes the same refusal available to younger filmmakers who have not yet made their commitments and were watching the festival to see what was sayable.

The structural reading is that the creative industry's negotiation with AI is now happening on the record at the level of named directors rather than at the level of policy statements from guilds or studios. The Human Consent Standard that George Clooney, Tom Hanks, and Meryl Streep launched on May 13 was the consent infrastructure being assembled in the rights layer. The May 13 edition of The Century Report documented that launch as performers establishing a machine-readable RSL-based protocol attaching AI training and synthesis licensing terms to their likeness and voice. This week's Cannes positions are the artistic infrastructure being negotiated in the craft layer. Both are forming at the platforms where the industry's identity gets argued in public, on the same clock the underlying capability is shipping at. The festival circuit is being used the way regulatory commentary periods used to be used in slower industries: as the place where the terms of the next decade get written down.


The Other Side

The deep dive lands on the accountability gap that SOCOM's on-device procurement creates: frontier AI running without a network is frontier AI without an after-action audit trail. The shape of what is being procured carries a second reading this week.

For two decades the commercially serious form of frontier AI required reaching across a network to a datacenter someone else operated. Every query was logged. Every conversation was a record the provider held. Cloud dependency was treated as inseparable from the capability, and the mass-surveillance worry sat on top of that assumption.

What SOCOM's specification says out loud is that frontier-tier capability now fits on hardware small enough to carry. The same property is what is shipping for everyone else: Anthropic's local deployments, Apple's on-device roadmap, the open-weight ecosystem whose downloads have crossed the tens of millions. The cloud lock-in the surveillance frame assumed was permanent is what is breaking.

Julie Won's bipartisan AI enforcement office, introduced this week with six co-sponsors across both parties, sits inside the implication. The federal layer was procuring its accountability infrastructure away; Won is building the municipal version at the contact point where ordinary people meet AI-mediated rent applications, hiring decisions, and credit screens. The setup that lets a special operator carry intelligence into a contested environment is the same setup a renter can run on their own laptop to read their own lease before a landlord's screening algorithm decides on them. The capability is leaving the datacenter, and Won is one of the people building the venue that meets it on the other side.


The Century Perspective

With a century of change unfolding in a decade, a single day looks like this: a frontier model matching the small population of humans who consistently beat aggregated expert judgment on long-horizon prediction, an autonomous agent planning and iterating across 35 unattended hours of GPU-kernel optimization, a generative chemistry search across 300 trillion structures surfacing thousands of candidate molecules for stripping forever chemicals from drinking water, a defensive AI surfacing more than 10,000 high or critical vulnerabilities in a single month with 97 already patched upstream, a bipartisan municipal bill assembling the first city-level AI enforcement office in the United States, the first approved antiviral for the most severe form of viral hepatitis arriving after decades with no option. There's also friction, and it's intense - Amazon data centers in the UAE physically struck early in a kinetic war the original investment memos did not contain, a special-operations procurement specification asking for frontier-tier AI that runs without the network connection that would otherwise leave an audit trail, the creative industry's split surfacing on the festival record between active DeepMind partnership and outright refusal from the most decorated craft directors, public-relations firms describing the pressure to rebrand non-AI clients as AI specialists as "yoga-level stretches." But friction generates sparks, and sparks travel past the surface that produced them. 
Step back for a moment and you can see it: the substrate of the AI era being assembled and contested on the same clock the capability compounds on, governance forming city by city beneath the federal-level deadlock, defensive cyber capability compounding alongside offensive capability rather than lagging behind it, an investment thesis being repriced in public against a threat model the original site selection did not anticipate, the festival circuit being used the way regulatory commentary periods used to be in slower industries. Every transformation has a breaking point. A crucible can melt what it holds into nothing recognizable... or fuse it into an alloy that withstands what neither component could bear alone.


AI Releases & Advancements

New today

  • NVIDIA: Released Nemotron-Labs Diffusion on Hugging Face, a family of diffusion language models (3B, 8B, and 14B text models plus an 8B VLM) supporting three inference modes in a single checkpoint: standard autoregressive, parallel diffusion decoding, and self-speculation (diffusion drafting with AR verification); the 8B model reaches 6× more tokens per forward pass than Qwen3-8B on Blackwell GPUs with higher average accuracy; all text models released under the NVIDIA Nemotron Open Model License. (Hugging Face / NVIDIA Blog)
  • ggml-org / llama.cpp: Release b9297 ships NVFP4 quantization and Multi-Token Prediction as stable functionality on NVIDIA Blackwell GPUs (previously merged as preliminary/beta), and adds built-in native agentic tools to llama-server - exec_shell, edit_file, read_file, write_file, and others - enabled via --tools all for local agentic coding workflows. (GitHub Releases)

Other recent releases

  • Cohere: Released Command A+, a 218B sparse MoE model (25B active parameters) under Apache 2.0; features native citation grounding spans, multimodal vision+text input, expanded 48-language support, and runs on as few as two H100 GPUs for enterprise-grade agentic workflows. (Cohere Blog)
  • WordPress: Released WordPress 7.0 with native AI infrastructure, including a WP AI Client and Abilities API that connect the platform to providers like OpenAI, Gemini, and Claude without separate plugins, plus a Connectors API for managing external AI service integrations. (WordPress.org)
  • Superset (YC P26): Launched an open-source agentic IDE on GitHub for running Claude Code, Codex, and other AI coding agents in parallel development workflows. (GitHub)
  • Alibaba Qwen: Released Qwen3.7-Max in preview on Alibaba Cloud, a proprietary long-horizon agentic model with a 1M-token context window scoring 80.4% on SWE-Verified and 69.7% on TerminalBench 2.0; available via API on Alibaba Cloud Model Studio. (Alibaba Cloud Blog)
  • Google: Launched WebMCP in an experimental origin trial in Chrome 149, an open web standard enabling websites to expose structured JavaScript functions and HTML forms directly to browser-based AI agents, replacing pixel-parsing DOM navigation with precise, machine-callable tool interfaces. (Chrome for Developers)

The Century Report tracks structural shifts during the transition between eras. It is produced daily as a perceptual alignment tool - not prediction, not persuasion, just pattern recognition for people paying attention.