this post was submitted on 28 Sep 2025
Futurology
This "benchmark", GDPval, does not at all evaluate the effective real-world use of AI across 44 occupations.
Lawyers, police/detective supervisors, and social workers are some of the jobs "evaluated", jobs for which we know AI currently hallucinates datasets and context.
"GDPval is an early step that doesn’t reflect the full nuance of many economic tasks. While it spans 44 occupations and hundreds of knowledge work tasks, it is limited to one-shot evaluations, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts. "
OpenAI uses expertly constructed, meticulously crafted questions suited to AI, and tests a single job task a single time, specifically to avoid the problem of low AI accuracy and consistency.
GDPval doesn't even grade those 44 jobs; it grades a single task within one of those jobs and then extrapolates that single-task "success" into proficiency across an entire occupation.
This is more "despite the contrary evidence of real-world data, if you let OpenAI cherry-pick data within extremely constrained question-response conditions, it is possible to view single, non-holistic, hypothetical answers as a success of AI".
or
"If you pretend our limited, invalid data is comprehensive and valid, then you can pretend our AI is successful."
Which is, of course, the majority of my work currently. My first try involved documents it refused to parse because they were too big.