Each LLM is given the same 1000 chess puzzles to solve. See puzzles.csv. Benchmarked on Mar 25, 2024.

Model                     Solved  Solved %  Illegal Moves  Illegal Moves %  Adjusted Elo
gpt-4-turbo-preview       229     22.9%     163            16.3%            1144
gpt-4                     195     19.5%     183            18.3%            1047
claude-3-opus-20240229    72      7.2%      464            46.4%            521
claude-3-haiku-20240307   38      3.8%      590            59.0%            363
claude-3-sonnet-20240229  23      2.3%      663            66.3%            286
gpt-3.5-turbo             23      2.3%      683            68.3%            269
claude-instant-1.2        10      1.0%      707            70.7%            245
mistral-large-latest      4       0.4%      813            81.3%            149
mixtral-8x7b              9       0.9%      832            83.2%            136
gemini-1.5-pro-latest*    FAIL    -         -              -                -
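
For anyone who wants to sanity-check the percentage columns: they appear to be plain counts divided by the 1000 puzzles. Below is a minimal Python sketch under that assumption, using a few rows copied from the table. The names `TOTAL_PUZZLES`, `results` and `as_percent` are made up for illustration and are not from the benchmark's code. The Adjusted Elo column is left out because the post does not say how it is computed.

```python
# Assumption: both percentage columns are count / total puzzles.
# TOTAL_PUZZLES comes from the post's first line ("the same 1000 chess puzzles").
TOTAL_PUZZLES = 1000

# (solved, illegal_moves) per model: a small sample of rows, not the full table.
results = {
    "gpt-4-turbo-preview": (229, 163),
    "gpt-4": (195, 183),
    "claude-3-opus-20240229": (72, 464),
}


def as_percent(count, total=TOTAL_PUZZLES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


for model, (solved, illegal) in results.items():
    print(f"{model}: solved {as_percent(solved)}%, illegal moves {as_percent(illegal)}%")
```

Running the same helper over any other row is a quick way to spot a percentage that does not match its count.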

Published by the CEO of Kagi!

    • conciselyverbose@sh.itjust.works · 8 months ago (edited)

      I wonder how many of the ones they “solved” were solved only because the model had seen them discussed somewhere in its training data, considering the puzzles apparently come from a public resource.

      • Blóðbók@slrpnk.net · 8 months ago

        Yeah, I don’t know why anyone knowledgeable would expect them to be good at chess. LLMs don’t generalise, reason or spot patterns, so unless they read a chess book where the problems came from…

      • Carrolade@lemmy.world · 8 months ago

        Likely close to 100%. If you read the (rather good) article, a little further down they test whether the LLM can play an extremely simple “Connect 4” game they devised, as a way of isolating reasoning capability specifically.

        It cannot.

        Chess puzzles, in particular, are frequently shared and discussed in online chess spaces, so the LLM will have a significant amount of material to work with when it tries to predict the best response to give to the prompt.