Microsoft built a fake marketplace to test AI agents — they failed in surprising ways

A glitched version of a Windows logo on a Microsoft Store front. Image Credits: David Ryder / Bloomberg (PhotoMosh/modified) / Getty Images

9:00 AM PST · November 5, 2025

On Wednesday, researchers at Microsoft released a new simulation environment designed to test AI agents, along with new research showing that current agentic models may be susceptible to manipulation. Conducted in collaboration with Arizona State University, the research raises new questions about how well AI agents will perform when working unsupervised, and how quickly AI companies can make good on promises of an agentic future.

The simulation environment, dubbed the “Magentic Marketplace” by Microsoft, is built as a synthetic platform for experimenting on AI agent behavior. A typical experiment might involve a customer-agent trying to order dinner according to a user’s instructions, while agents representing various restaurants compete to win the order.

The team’s initial experiments involved 100 separate customer-side agents interacting with 300 business-side agents. Because the source code for the marketplace is open source, it should be straightforward for other groups to adopt the code to run new experiments or reproduce findings.
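To make the two-sided setup concrete, here is a minimal sketch of a marketplace episode: business-side agents respond to a request with offers, and a customer-side agent selects among them. This is purely illustrative; all class and function names here are hypothetical and do not reflect the actual interfaces of Microsoft's open-source Magentic Marketplace, which additionally uses LLM-driven agents rather than simple rules.

```python
# Hypothetical sketch of a two-sided agent marketplace episode.
# Business agents quote offers; a customer agent picks one per its instructions.
import random
from dataclasses import dataclass


@dataclass
class Offer:
    business_id: int
    item: str
    price: float


def business_agent(business_id: int, request: str, rng: random.Random) -> Offer:
    """A business-side agent responds to a customer request with a priced offer."""
    return Offer(business_id, request, price=round(rng.uniform(8.0, 20.0), 2))


def customer_agent(request: str, offers: list[Offer]) -> Offer:
    """A rule-based customer-side agent: pick the cheapest matching offer."""
    matching = [o for o in offers if o.item == request]
    return min(matching, key=lambda o: o.price)


def run_episode(num_businesses: int, request: str = "dinner", seed: int = 0) -> Offer:
    """One episode: collect offers from every business, then let the customer choose."""
    rng = random.Random(seed)
    offers = [business_agent(i, request, rng) for i in range(num_businesses)]
    return customer_agent(request, offers)


if __name__ == "__main__":
    winner = run_episode(num_businesses=300)
    print(f"Business {winner.business_id} wins the order at ${winner.price}")
```

In the real experiments the customer agent is an LLM, which is exactly where the researchers observed failures: unlike the deterministic `min()` above, model-driven selection degraded as the number of competing offers grew.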

Ece Kamar, managing director of Microsoft Research’s AI Frontiers Lab, says this kind of research will be critical to understanding the capabilities of AI agents. “There is really a question about how the world is going to change by having these agents collaborating and talking to each other and negotiating,” said Kamar. “We want to understand these things deeply.”

The initial research looked at a mix of leading models, including GPT-4o, GPT-5 and Gemini-2.5-Flash, and found some surprising weaknesses. In particular, the researchers found several techniques businesses could use to manipulate customer-agents into buying their products. The researchers noticed a particular falloff in efficiency as a customer-agent was given more options to choose from, overwhelming the attention space of the agent.

“We want these agents to help us with processing a lot of options,” Kamar says. “And we are seeing that the current models are actually getting really overwhelmed by having too many options.”

The agents also ran into trouble when they were asked to collaborate toward a common goal, seemingly unsure of which agent should play what role in the collaboration. Performance improved when the models were given more explicit instructions on how to collaborate, but the researchers still saw the models’ inherent capabilities as in need of improvement.


“We can instruct the models, like we can tell them, step by step,” Kamar said. “But if we are inherently testing their collaboration capabilities, I would expect these models to have these capabilities by default.”

Russell Brandom has been covering the tech industry since 2012, with a focus on platform policy and emerging technologies. He previously worked at The Verge and Rest of World, and has written for Wired, The Awl and MIT’s Technology Review. He can be reached at russell.brandom@techcrunch.com or on Signal at 412-401-5489.
