The Tower of Hanoi is a classic game with three pegs and multiple discs, in which you need to move all the discs on the left peg to the right peg, never stacking a larger disc on top of a smaller one. With practice, though, a bright (and patient) seven-year-old can do it.
What Apple found was that leading generative models could barely do seven discs, getting less than 80% accuracy, and pretty much couldn’t get scenarios with eight discs correct at all.
It's funny because I wrote a Scheme program to do this as a college course assignment many years ago.
It was the first of many increasingly complex assignments in an AI course. It came first because its logic requirements are very basic.
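
To give a sense of how basic the logic is, here is a minimal sketch of the standard recursive solution. It is written in Python rather than Scheme and is not the original assignment code; the function name `hanoi` and the peg labels are illustrative choices, not anything from the course or from Apple's paper.

```python
def hanoi(n, source="left", target="right", spare="middle"):
    """Return the list of (from_peg, to_peg) moves that transfers n discs."""
    if n == 0:
        return []
    # Move the top n-1 discs onto the spare peg, move the largest disc
    # to the target, then move the n-1 discs from the spare onto it.
    return (
        hanoi(n - 1, source, spare, target)
        + [(source, target)]
        + hanoi(n - 1, spare, target, source)
    )


moves = hanoi(3)
print(len(moves))  # 2**3 - 1 = 7 moves
```

The same recursion handles any number of discs; the only thing that grows is the move count, which is 2^n - 1: 127 moves for seven discs and 255 for eight.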