[–] hendrik@palaver.p3x.de 4 points 1 week ago* (last edited 1 week ago) (3 children)

But that's kind of always the issue with AI... The datasets get contaminated with the validation data, or with the benchmarks themselves... I don't see a fundamental change here? It's going to be a good benchmark at first, and once the dataset is contaminated, we need a new one... As has been the case with the previous ones... Or am I missing something here? I mean, I don't want to be overly negative... But up until autumn, you could just ask it to count the number of 'r's in 'strawberry' and it'd achieve a success rate of 10%. If this is supposed to be something substantial, it isn't.
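To make the contamination point concrete, here's a rough sketch of the kind of n-gram overlap check people use to detect it. This is purely illustrative and not how this benchmark (or any particular lab) actually does it; the function names and corpus are made up:

```python
# Naive contamination check: flag a benchmark question whose word n-grams
# already appear somewhere in the training corpus.
# Illustrative sketch only; real decontamination pipelines are far more involved.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, training_docs: list[str], n: int = 8) -> bool:
    """True if any training document shares an n-gram with the question."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_docs)

# Example: a benchmark item that was copied verbatim into the training set gets flagged.
corpus = ["Practice quiz: How many times does the letter r appear in strawberry? Answer: three."]
print(is_contaminated("How many times does the letter r appear in strawberry?", corpus))  # True
```

Once questions like this leak into the training data, the model can score well by recall rather than reasoning, which is exactly why benchmarks keep having to be replaced.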

[–] Lugh 7 points 1 week ago* (last edited 1 week ago) (2 children)
[–] hendrik@palaver.p3x.de 2 points 1 week ago (1 children)

I still don't get it. Under "Future Model Performance" they say benchmarks quickly get saturated, and maybe it's going to be the same for this one: models could achieve 50% by the end of this year... Which doesn't really sound like the "last exam" to me. But maybe it's more about the approach of coming up with good science questions, and not the exact dataset??

[–] Lugh 2 points 1 week ago

I think the easiest way to explain this is to say they are testing the ability to reason your way to an answer to a question so unique that it doesn't exist anywhere on the internet.