Futurology

OpenAI says AI is now approaching expert-level ability at 1,320 tasks across 44 occupations from 9 industries, and exceeds most junior employees. (self.futurology)

submitted 1 month ago by Lugh to c/futurology

The crash of the AI stock market bubble seems all but inevitable. If/when that happens, it won't end AI itself, just some of the AI companies. Ironically, the recession it will provoke will probably only accelerate the adoption of AI to replace human workers.

Our politics has yet to catch up to the coming realities of AI and employment, but I wonder how much longer that can last.

Measuring the performance of our models on real-world tasks: We’re introducing GDPval, a new evaluation that measures model performance on economically valuable, real-world tasks across 44 occupations.

[–] Varyk@sh.itjust.works 12 points 1 month ago* (last edited 1 month ago) (1 children)

This "benchmark", GDPval, does not evaluate the effective real-world use of AI across 44 occupations at all.

Lawyers, police/detective supervisors, and social workers are among the jobs "evaluated", jobs for which we know AI currently hallucinates datasets and context.

"GDPval is an early step that doesn’t reflect the full nuance of many economic tasks. While it spans 44 occupations and hundreds of knowledge work tasks, it is limited to one-shot evaluations, so it doesn’t capture cases where a model would need to build context or improve through multiple drafts."

OpenAI uses expertly constructed, meticulously crafted questions suited to AI and tests a single job task a single time, specifically to avoid the problem of low AI accuracy and consistency.

GDPval doesn't even grade those 44 jobs; it grades a single task within one of those jobs and then extrapolates that single-task "success" as proficiency across an entire occupation.

This is more like "despite the contrary evidence of real-world data, if you let OpenAI cherry-pick data within extremely constrained question-response conditions, it is possible to view single, non-holistic, hypothetical answers as a success of AI".

"If you pretend our limited, invalid data is comprehensive and valid, then you can pretend our AI is successful."

[–] WanderingThoughts@europe.pub 2 points 1 month ago

"where a model would need to build context or improve through multiple drafts"

Which is, of course, the majority of my work currently. My first try involved documents it refuses to parse because they're too big.