OpenAI released a new benchmark on Thursday that tests how its AI models perform compared to human professionals across a wide range of industries and jobs. The test, GDPval, is an early attempt at understanding how close OpenAI’s systems are to outperforming humans at economically valuable work, a key part of the company’s founding mission to develop artificial general intelligence, or AGI.
OpenAI says it found that its GPT-5 model and Anthropic’s Claude Opus 4.1 “are already approaching the quality of work produced by industry experts.”
That’s not to say that OpenAI’s models are going to start replacing humans in their jobs immediately. Despite some CEOs’ predictions that AI will take the jobs of humans in just a few years, OpenAI admits that GDPval today covers a very limited number of the tasks people do in their real jobs. However, it is one of the latest ways the company is measuring AI’s progress towards this milestone.
GDPval is based on nine industries that contribute the most to America’s gross domestic product, including domains such as healthcare, finance, manufacturing, and government. The benchmark tests an AI model’s performance in 44 occupations among those industries, ranging from software engineers to nurses to journalists.
For OpenAI’s first version of the test, GDPval-v0, OpenAI asked experienced professionals to compare AI-generated reports with those produced by other professionals, and then choose the best one. For example, one prompt asked investment bankers to create a competitor landscape for the last-mile delivery industry, and compare them to AI-generated reports. OpenAI then averages an AI model’s “win rate” against the human reports across all 44 occupations.
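The averaging step described above can be illustrated with a minimal sketch. The data layout, the function name `average_win_rate`, and the choice to count ties as favorable are assumptions made for illustration (the article reports scores as "wins and ties versus humans"); this is not OpenAI's actual scoring code.

```python
from collections import defaultdict
from statistics import mean


def average_win_rate(judgments, count_ties=True):
    """Average a model's per-occupation win rate across occupations.

    `judgments` is a hypothetical list of (occupation, outcome) pairs,
    where outcome is "win", "tie", or "loss" for the AI-generated report
    as judged by a professional grader.
    """
    per_occupation = defaultdict(list)
    for occupation, outcome in judgments:
        favorable = outcome == "win" or (count_ties and outcome == "tie")
        per_occupation[occupation].append(1.0 if favorable else 0.0)
    # First compute a win rate per occupation, then average those rates,
    # so that occupations with more graded tasks do not dominate.
    return mean(mean(scores) for scores in per_occupation.values())


# Illustrative, made-up judgments for two occupations.
sample = [
    ("nurse", "win"), ("nurse", "loss"),
    ("journalist", "tie"), ("journalist", "loss"),
    ("journalist", "loss"), ("journalist", "loss"),
]

print(average_win_rate(sample))                    # ties counted as favorable
print(average_win_rate(sample, count_ties=False))  # strict wins only
```

Averaging per occupation first, rather than pooling every judgment, is one plausible reading of "across all 44 occupations"; pooling would weight occupations by how many tasks each contributed.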
For GPT-5-high, a souped-up version of GPT-5 with extra computational power, the company says the AI model was ranked as better than or on par with industry experts 40.6% of the time.
OpenAI also tested Anthropic’s Claude Opus 4.1 model, which was ranked as better than or on par with industry experts in 49% of tasks. OpenAI says that it believes Claude scored so high because of its tendency to make pleasing graphics, rather than sheer performance.
It’s worth noting that most working professionals do a lot more than submit research reports to their boss, which is all that GDPval-v0 tests for. OpenAI acknowledges this, and says it plans to create more robust tests in the future that can account for more industries and interactive workflows.
Nonetheless, the company sees the progress on GDPval as notable.
In an interview with TechCrunch, OpenAI’s chief economist Dr. Aaron Chatterji said GDPval’s results suggest that people in these jobs can now use AI models to spend time on more meaningful tasks.
“[Because] the model is getting good at some of these things,” Chatterji says, “people in those jobs can now use the model, increasingly as capabilities get better, to offload some of their work and do potentially higher value things.”
OpenAI’s evaluations lead Tejal Patwardhan tells TechCrunch that she’s encouraged by the rate of progress on GDPval. OpenAI’s GPT-4o model, which was released roughly 15 months ago, scored just 13.7% (wins and ties versus humans). Now GPT-5 scores nearly triple that, a trend Patwardhan expects to continue.
Silicon Valley has a wide range of benchmarks it uses to measure the progress of AI models and assess whether a given model is state-of-the-art. Among the most popular are AIME 2025 (a test of competitive math problems) and GPQA Diamond (a test of PhD-level science questions). However, several AI models are nearing saturation on some of these benchmarks, and many AI researchers have cited the need for better tests that can measure AI’s proficiency on real-world tasks. Benchmarks like GDPval could become increasingly important in that conversation, as OpenAI makes the case that its AI models are valuable for a wide range of industries.














