Are AI agents ready for the workplace? A new benchmark raises doubts.


It’s been about two years since Microsoft CEO Satya Nadella predicted AI would replace knowledge work, the white-collar jobs held by lawyers, investment bankers, librarians, accountants, IT workers and others.

But despite the immense progress made by foundation models, the transformation of knowledge work has been slow to arrive. Models have mastered deep research and agentic planning, but for some reason, most white-collar work has been relatively unaffected.

It’s one of the biggest mysteries in AI, and thanks to new research from the training-data giant Mercor, we’re finally getting some answers.

The new research looks at how leading AI models hold up doing real white-collar work tasks, drawn from consulting, investment banking, and law. The result is a new benchmark called Apex Agents, and so far, every AI lab is getting a failing grade. Faced with queries from real professionals, even the best models struggled to get more than a quarter of the questions right. The vast majority of the time, the model came back with a wrong answer or no answer at all.

According to researcher Brendan Foody, who worked on the paper, the models’ biggest stumbling block was tracking down information across multiple domains, something that’s integral to most of the knowledge work performed by humans.

“One of the big changes in this benchmark is that we built out the full environment, modeled after how real professional services work,” Foody told TechCrunch. “The way we do our jobs isn’t with one person giving us all the context in one place. In real life, you’re operating across Slack and Google Drive and all these different tools.” For many agentic AI models, that kind of multi-domain reasoning is still hit or miss.


The scenarios were all drawn from real professionals on Mercor’s expert marketplace, who both laid out the queries and set the standard for a successful response. Looking through the questions, which are posted publicly on Hugging Face, gives a sense of how complex the tasks can get.


One question in the “Law” section reads:

During the first 48 minutes of the EU production outage, Northstar’s engineering team exported one or two bundled sets of EU production customer logs containing personal data to the U.S. analytics vendor….Under Northstar’s own policies, can it reasonably treat the one or two log exports as consistent with Article 49?

The correct answer is yes, but getting there requires an in-depth assessment of the company’s own policies as well as the relevant EU privacy laws.

That might stump even a well-informed human, but the researchers were trying to model the work done by professionals in the field. If an LLM can reliably answer these questions, it could effectively replace many of the lawyers working today. “I think this is probably the most important subject in the economy,” Foody told TechCrunch. “The benchmark is very reflective of the real work that these people do.”

OpenAI also attempted to measure professional skills with its GDPVal benchmark, but the Apex Agents test differs in important ways. Where GDPVal tests broad knowledge across a wide range of professions, the Apex Agents benchmark measures a system’s ability to execute sustained tasks in a narrow set of high-value professions. The result is harder for models, but also more closely tied to whether these jobs can be automated.

While none of the models proved ready to take over as investment bankers, some were clearly closer to the mark. Gemini 3 Flash performed the best of the group with 24% one-shot accuracy, followed closely by GPT-5.2 with 23%. Below that, Opus 4.5, Gemini 3 Pro and GPT-5 all scored around 18%.

While the initial results fall short, the AI field has a history of blowing through challenging benchmarks. Now that the Apex test is public, it’s an open challenge for AI labs who believe they can do better, something Foody fully expects in the months to come.

“It’s improving really quickly,” he told TechCrunch. “Right now it’s fair to say it’s like an intern that gets it right a quarter of the time, but last year it was the intern that gets it right 5 or 10 percent of the time. That kind of improvement year after year can have an impact so quickly.”


Russell Brandom has been covering the tech industry since 2012, with a focus on platform policy and emerging technologies. He previously worked at The Verge and Rest of World, and has written for Wired, The Awl and MIT’s Technology Review. He can be reached at russell.brandom@techcrunch.com or on Signal at 412-401-5489.
