For example, things like this:

Some models might be able to answer this particular question correctly, but they will still fail at many simple primary-school-level math questions.

More examples in my comment here.

And one more screenshot here.