• diz@awful.systems
    20 days ago

    Further support for the memorization claim: I posted examples of novel river crossing puzzles where LLMs completely fail (on this forum).

    Note that Apple’s actors / agents river crossing is a well-known “jealous husbands” variant, which you can ask a chatbot to explain to you. It gladly explains, even as it can’t follow its own explanation (since of course it isn’t really its own explanation but a plagiarized one, even if the wording is changed).
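
    For anyone unfamiliar with the variant: the core rule is that no wife may be in the company of another husband unless her own husband is also present. A rough sketch of that constraint in Python (my own phrasing and names, not Apple’s exact rules):

    def bank_ok(bank, couples):
        """bank: set of people on one bank (or in the boat);
        couples: list of (husband, wife) pairs."""
        for husband, wife in couples:
            if wife in bank and husband not in bank:
                # her husband is absent, so no other husband may be present
                if any(h in bank for h, _ in couples if h != husband):
                    return False
        return True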

    edit: https://awful.systems/post/4027490 and earlier https://awful.systems/post/1769506

    I think what I need to do is write up a bunch of puzzles, assign them randomly to two sets, and test & post one set while holding back the second set (not even testing it on any online chatbots). Then in a year or two, see how much the public set improves versus the held-back one.
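
    Concretely, the random assignment could be as mechanical as this (hypothetical names, just a sketch), so I can’t bias which puzzles end up in the public set:

    import random

    def split_puzzles(puzzles, seed=0):
        """Shuffle with a fixed seed; publish the first half, hold back the rest."""
        rng = random.Random(seed)
        shuffled = list(puzzles)
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        return shuffled[:half], shuffled[half:]  # (public set, held-back set)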

    • Soyweiser@awful.systems
      18 days ago

      The latter test fails if they write a specific bit of code to put out the ‘LLMs fail the river crossing’ fire, btw. Still a good test.

      • diz@awful.systems
        17 days ago

        It would have to be more than just river crossings, yeah.

        Although I’m also dubious that their LLM is good enough for universal river crossing puzzle solving using a tool. It’s not that simple: the constraints have to be translated into a format the tool understands, and the answer translated back. I was told that o3 solves my river crossing variant, but the chat log they gave had incorrect code being run and then a correct answer magically appearing, so I think it wasn’t anything quite as general as that.
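
        For illustration, the general solver a tool call would have to wrap is basically a breadth-first search over bank states, parameterized by a validity predicate. The failure point I’m describing is upstream of it: translating the puzzle’s constraints into that predicate, and the returned plan back into prose. A sketch with my own names (not whatever o3 actually ran):

        from collections import deque
        from itertools import combinations

        def solve_crossing(people, valid, boat_capacity=2):
            """BFS over (left bank, boat side) states; `valid` encodes the puzzle's constraints."""
            everyone = frozenset(people)
            start = (everyone, "left")
            queue = deque([(start, [])])
            seen = {start}
            while queue:
                (left, side), plan = queue.popleft()
                if not left:
                    return plan  # everyone has crossed
                bank = left if side == "left" else everyone - left
                for k in range(1, boat_capacity + 1):
                    for group in combinations(bank, k):
                        new_left = left - set(group) if side == "left" else left | set(group)
                        new_right = everyone - new_left
                        # both banks and the boat must stay legal after the crossing
                        if not (valid(new_left) and valid(new_right) and valid(set(group))):
                            continue
                        state = (new_left, "right" if side == "left" else "left")
                        if state not in seen:
                            seen.add(state)
                            queue.append((state, plan + [(side, group)]))
            return None  # no legal sequence of crossings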