For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I'd had very bad experiences with an old Qwen 0.5B model.

But recently I tried running smaller models like Llama 3.2 3B with an 8-bit quant and qwen2.5-1.5B-coder at full 16-bit floating point, and those performed really well too on my 6 GB VRAM GPU (GTX 1060).

So now I'm wondering: should I pull strong quants of big models, or light quants / raw 16-bit FP versions of smaller models?

What are your experiences with strong quants? I saw a video by that technovangelist guy on YouTube and he said that sometimes even 2-bit quants can be perfectly fine.
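For a rough sense of what fits in 6 GB: the weights take about params × bits ÷ 8 bytes, plus some overhead for KV cache and context. Here's a quick back-of-the-envelope sketch (the bits-per-weight values are typical for the GGUF quant types, and the overhead factor is just a guess):

```python
# Rough VRAM estimate: weights only, scaled by a guessed overhead factor
# for KV cache / activations. Real usage varies with context length.

def est_vram_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9  # gigabytes

configs = [
    ("7B @ 4-bit (Q4_K_M)", 7.0, 4.5),   # Q4_K_M works out to ~4.5 bits/weight
    ("3B @ 8-bit (Q8_0)",   3.0, 8.5),
    ("1.5B @ fp16",         1.5, 16.0),
    ("7B @ 2-bit (Q2_K)",   7.0, 2.6),
]

for name, params, bits in configs:
    print(f"{name:22s} ~{est_vram_gb(params, bits):.1f} GB")
```

By that estimate a 4-bit 7B, an 8-bit 3B and an fp16 1.5B all land under 6 GB before long contexts, which lines up with all three running on the GTX 1060.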

  • Smorty [she/her]@lemmy.blahaj.zone (OP)
    oooh, a Windows-only feature, now I see why I haven't heard of this yet. Well, too bad I guess. It's time for me to switch to AMD anyway…

    • SGforce
      Oh, that part is. But the splitting tech is built into llama.cpp.
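
      If this refers to llama.cpp's layer offloading, i.e. splitting a model between VRAM and system RAM, it's exposed as `-ngl` / `--n-gpu-layers` on the CLI and as `n_gpu_layers` in the llama-cpp-python bindings. A minimal sketch, with a placeholder model path:

      ```python
      from llama_cpp import Llama

      # Split the model between GPU and CPU: the first n_gpu_layers layers
      # are offloaded to VRAM, the rest stay in system RAM.
      llm = Llama(
          model_path="./qwen2.5-coder-1.5b-q8_0.gguf",  # placeholder path
          n_gpu_layers=20,   # raise until VRAM is nearly full, lower if it overflows
          n_ctx=4096,        # the context window (KV cache) also costs VRAM
      )

      out = llm("Write a haiku about quantisation:", max_tokens=64)
      print(out["choices"][0]["text"])
      ```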