In Brief
Posted:
12:26 PM PST · February 6, 2026
Image Credits: TechCrunch / Getty Images
Last month, I wrote about Mercor's new benchmark measuring AI agents' capabilities on professional tasks like law and corporate analysis. At the time, the scores were pretty dismal, with every major lab scoring under 25%, so we concluded lawyers were safe from AI displacement, at least for now.
But AI capabilities can change a lot in a couple of weeks.
This week's release of Opus 4.6 shook up the leaderboards, with Anthropic's new model scoring just shy of 30% in one-shot trials, and an average of 45% when given a few more cracks at the problem. Notably, the release included a bunch of new agentic features, including "agent swarms," which may have helped with this kind of multi-step problem-solving.
Regardless, the score is a huge leap from the previous state-of-the-art, and a sign that progress on foundation models isn't slowing down. Mercor CEO Brendan Foody, who was particularly impressed, said, "jumping from 18.4% to 29.8% in a few months is insane."
The APEX-Agents Leaderboard
Thirty percent is still a long way from 100%, so it's not like lawyers need to be worried about getting replaced by machines next week. But they should be a lot less confident than they were last month!