this post was submitted on 28 Sep 2025
2 points (55.6% liked)

Futurology


The crash of the AI stock market bubble seems all but inevitable. If/when that happens, it won't end AI itself, just some of the AI companies. Ironically, the recession it will provoke will probably only accelerate the adoption of AI to replace human workers.

Our politics has yet to catch up to the coming realities of AI and employment, but I wonder how much longer that can last.

Measuring the performance of our models on real-world tasks: We’re introducing GDPval, a new evaluation that measures model performance on economically valuable, real-world tasks across 44 occupations.

top 4 comments
[–] Varyk@sh.itjust.works 11 points 1 day ago* (last edited 1 day ago) (1 children)

This "benchmark", GDPval, does not at all evaluate the effective real-world use of AI across 44 occupations.

Lawyers, police/detective supervisors, and social workers are some of the jobs "evaluated", jobs for which we know AI currently hallucinates datasets and context.

"GDPval is an early step that doesn’t reflect the full nuance of many economic tasks. While it spans 44 occupations and hundreds of knowledge work tasks, it is limited to one-shot evaluations, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts."

OpenAI uses expertly constructed, meticulously crafted questions suited to AI, and tests a single job task a single time, specifically to avoid the problem of low AI accuracy and consistency.

GDPval doesn't even grade those 44 jobs; it grades a single task within one of those jobs and then extrapolates that single-task "success" into proficiency across an entire occupation.

This is more like "despite the contrary evidence of real-world data, if you let OpenAI cherry-pick data within extremely constrained question-response conditions, it is possible to view single, non-holistic, hypothetical answers as a success of AI".

or

"If you pretend our limited, invalid data is comprehensive and valid, then you can pretend our ai is successful."

[–] WanderingThoughts@europe.pub 2 points 23 hours ago

"where a model would need to build context or improve through multiple drafts"

Which is, of course, the majority of my work currently. My first try involved documents it refuses to parse because they're too big.

Riiiiiiight.

You know we can see and use it... right?

How many R's are in the word Battleship?