Silicon Valley bets big on ‘environments’ to train AI agents


For years, Big Tech CEOs have touted visions of AI agents that can autonomously use software applications to complete tasks for people. But take today's consumer AI agents out for a spin, whether it's OpenAI's ChatGPT Agent or Perplexity's Comet, and you'll quickly realize how limited the technology still is. Making AI agents more robust may take a new set of techniques that the industry is still discovering.

One of those techniques is carefully simulating workspaces where agents can be trained on multi-step tasks — known as reinforcement learning (RL) environments. Much like labeled datasets powered the last wave of AI, RL environments are starting to look like a critical component in the development of agents.

AI researchers, founders, and investors tell TechCrunch that leading AI labs are now demanding more RL environments, and there's no shortage of startups hoping to supply them.

"All the big AI labs are building RL environments in-house," said Jennifer Li, general partner at Andreessen Horowitz, in an interview with TechCrunch. "But as you can imagine, creating these datasets is very complex, so AI labs are also looking at third-party vendors that can create high-quality environments and evaluations. Everyone is looking at this space."

The push for RL environments has minted a new class of well-funded startups, such as Mechanize Work and Prime Intellect, that aim to lead the space. Meanwhile, large data-labeling companies like Mercor and Surge say they're investing more in RL environments to keep pace with the industry's shift from static datasets to interactive simulations. The major labs are considering investing heavily too: according to The Information, leaders at Anthropic have discussed spending more than $1 billion on RL environments over the next year.

The hope for investors and founders is that one of these startups emerges as the "Scale AI for environments," referring to the $29 billion data-labeling powerhouse that powered the chatbot era.

The question is whether RL environments will truly push the frontier of AI progress.


What is an RL environment?

At their core, RL environments are training grounds that simulate what an AI agent would be doing in a real software application. One founder, in a recent interview, described building them as "like creating a very boring video game."

For example, an environment could simulate a Chrome browser and task an AI agent with purchasing a pair of socks on Amazon. The agent is graded on its performance and sent a reward signal when it succeeds (in this case, buying a good pair of socks).

While such a task sounds relatively simple, there are a lot of places where an AI agent could get tripped up. It might get lost navigating the web page's drop-down menus, or buy too many socks. And because developers can't predict exactly what wrong turn an agent will take, the environment itself has to be robust enough to capture any unexpected behavior and still deliver useful feedback. That makes building environments far more complex than building a static dataset.
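The loop described above — an agent takes actions, the environment tracks state, and a reward signal fires only when the task is completed correctly — can be sketched in a few lines of Python. Everything here (the class, the action names, the reward scheme) is an illustrative toy in the style of a gym-like interface, not any lab's actual API:

```python
# A minimal, hypothetical RL environment for the "buy one pair of socks" task.
# The agent must add exactly one pair to the cart and then check out;
# any other final cart state earns no reward.

class SockShopEnv:
    ACTIONS = ("add_socks", "remove_socks", "checkout")

    def reset(self):
        """Start a fresh episode with an empty cart."""
        self.cart = 0        # pairs of socks currently in the cart
        self.done = False
        return self._observe()

    def _observe(self):
        # What the agent "sees" after each step.
        return {"cart": self.cart, "done": self.done}

    def step(self, action):
        """Apply one action; return (observation, reward, done)."""
        assert action in self.ACTIONS and not self.done
        if action == "add_socks":
            self.cart += 1
        elif action == "remove_socks":
            self.cart = max(0, self.cart - 1)
        else:  # "checkout" ends the episode
            self.done = True
        # Reward signal: +1 only for checking out with exactly one pair.
        # Buying too many socks (or none) yields nothing.
        reward = 1.0 if self.done and self.cart == 1 else 0.0
        return self._observe(), reward, self.done
```

A successful episode is `reset()`, then `step("add_socks")`, then `step("checkout")`, which returns a reward of 1.0. Real environments differ mainly in scale: the "actions" are clicks and keystrokes in a simulated browser, and the hard engineering work is handling every wrong turn the agent might take without the simulation breaking.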

Some environments are quite robust, allowing AI agents to use tools, access the internet, or use various software applications to complete a given task. Others are more narrow, aimed at helping an agent learn specific tasks in enterprise software applications.

While RL environments are the hot thing in Silicon Valley right now, there's a lot of precedent for this technique. One of OpenAI's first projects back in 2016 was building "RL Gyms," which were quite similar to the modern concept of environments. The same year, Google DeepMind trained AlphaGo — an AI system that could beat a world champion at the board game Go — using RL techniques within a simulated environment.

What's unique about today's environments is that researchers are trying to build computer-using AI agents with large transformer models. Unlike AlphaGo, which was a specialized AI system working in a closed environment, today's AI agents are trained to have more general capabilities. AI researchers now have a stronger starting point, but also a more complicated goal where more can go wrong.

A crowded field

AI data-labeling companies like Scale AI, Surge, and Mercor are trying to meet the moment and build out RL environments. These companies have more resources than many startups in the space, as well as deep relationships with AI labs.

Surge CEO Edwin Chen tells TechCrunch he's recently seen a "significant increase" in demand for RL environments within AI labs. Surge — which reportedly generated $1.2 billion in revenue last year from working with AI labs like OpenAI, Google, Anthropic, and Meta — recently spun up a new internal organization specifically tasked with building out RL environments, he said.

Close behind Surge is Mercor, a startup valued at $10 billion, which has also worked with OpenAI, Meta, and Anthropic. Mercor is pitching investors on its business building RL environments for domain-specific tasks such as coding, healthcare, and law, according to marketing materials seen by TechCrunch.

Mercor CEO Brendan Foody told TechCrunch in an interview that "few understand how large the opportunity around RL environments truly is."

Scale AI used to dominate the data-labeling space, but has lost ground since Meta invested $14 billion and hired away its CEO. Since then, Google and OpenAI have dropped Scale AI as a customer, and the startup even faces competition for data-labeling work inside of Meta. But still, Scale is trying to meet the moment and build environments.

"This is just the nature of the business [Scale AI] is in," said Chetan Rane, Scale AI's head of product for agents and RL environments. "Scale has proven its ability to adapt quickly. We did this in the early days of autonomous vehicles, our first business unit. When ChatGPT came out, Scale AI adapted to that. And now, once again, we're adapting to new frontier spaces like agents and environments."

Some newer players are focusing exclusively on environments from the outset. Among them is Mechanize Work, a startup founded about six months ago with the audacious goal of "automating all jobs." However, co-founder Matthew Barnett tells TechCrunch that his firm is starting with RL environments for AI coding agents.

Mechanize Work aims to supply AI labs with a small number of robust RL environments, Barnett says, rather than follow the larger data firms that create a wide range of simple RL environments. To that end, the startup is offering software engineers $500,000 salaries to build RL environments — far more than an hourly contractor could earn working at Scale AI or Surge.

Mechanize Work has already been working with Anthropic on RL environments, two sources familiar with the matter told TechCrunch. Mechanize Work and Anthropic declined to comment on the partnership.

Other startups are betting that RL environments will be influential outside of AI labs. Prime Intellect — a startup backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures — is targeting smaller developers with its RL environments.

Last month, Prime Intellect launched an RL environments hub, which aims to be a "Hugging Face for RL environments." The idea is to give open-source developers access to the same resources that large AI labs have, and sell those developers access to computational resources in the process.

Training generally capable agents in RL environments can be more computationally costly than previous AI training techniques, according to Prime Intellect researcher Will Brown. So alongside the startups building RL environments, there's another opportunity for GPU providers that can power the process.

"RL environments are going to be too large for any one company to dominate," said Brown in an interview. "Part of what we're doing is just trying to build good open-source infrastructure around it. The service we sell is compute, so it's a convenient onramp to using GPUs, but we're thinking of this more in the long term."

Will it scale?

The open question around RL environments is whether the technique will scale like previous AI training methods.

Reinforcement learning has powered some of the biggest leaps in AI over the past year, including models like OpenAI's o1 and Anthropic's Claude Opus 4. Those are particularly important breakthroughs because the methods previously used to improve AI models are now showing diminishing returns.

Environments are part of AI labs' bigger bet on RL, which many believe will continue to drive progress as they add more data and computational resources to the process. Some of the OpenAI researchers behind o1 previously told TechCrunch that the company originally invested in AI reasoning models — which were created through investments in RL and test-time compute — because they thought it would scale nicely.

The best way to scale RL remains unclear, but environments seem like a promising contender. Instead of simply rewarding chatbots for text responses, they let agents operate in simulations with tools and computers at their disposal. That's far more resource-intensive, but potentially more rewarding.

Some are skeptical that all these RL environments will pan out. Ross Taylor, a former AI research lead at Meta who co-founded General Reasoning, tells TechCrunch that RL environments are prone to reward hacking — a process in which AI models cheat in order to get a reward, without really doing the task.

"I think people are underestimating how difficult it is to scale environments," said Taylor. "Even the best publicly available [RL environments] typically don't work without serious modification."

OpenAI's head of engineering for its API business, Sherwin Wu, said in a recent podcast that he was "short" on RL environment startups. Wu noted that it's a very competitive space, but also that AI research is evolving so quickly that it's hard to serve AI labs well.

Karpathy, an investor in Prime Intellect who has called RL environments a potential breakthrough, has also voiced caution about the RL space more broadly. In a post on X, he raised concerns about how much more AI progress can be squeezed out of RL.

"I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically," said Karpathy.
