For about half a year I stuck with 7B models at a strong 4-bit quantisation, because I had very bad experiences with an old Qwen 0.5B model.

But recently I tried running smaller models, like Llama 3.2 3B at an 8-bit quant and Qwen2.5-Coder-1.5B at full 16-bit floating point, and those also performed really well on my 6 GB VRAM GPU (GTX 1060).

So now I am wondering: should I pull strong quants of big models, or light quants / raw fp16 versions of smaller models?

What are your experiences with strong quants? I saw a video by that Technovangelist guy on YouTube, and he said that sometimes even 2-bit quants can be perfectly fine.
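For context on why the quant level matters so much on 6 GB VRAM: the back-of-the-envelope maths is just parameter count × bits per weight, plus some overhead for the KV cache and runtime buffers. A minimal Python sketch of that estimate (the ~20% overhead factor is just my assumption, not a measured value):

```python
# Rough model-memory estimate: params * bits / 8, scaled by an assumed
# ~20% overhead for KV cache, context and runtime buffers (a guess, not measured).

def est_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

for name, params, bits in [
    ("7B @ 4-bit (Q4)", 7.0, 4),
    ("7B @ fp16",       7.0, 16),
    ("3B @ 8-bit (Q8)", 3.0, 8),
    ("1.5B @ fp16",     1.5, 16),
]:
    print(f"{name:18s} ~{est_gb(params, bits):.1f} GB")
```

Which lines up with what I'm seeing: a 4-bit 7B (~4 GB) and an fp16 1.5B (~3.6 GB) both fit in 6 GB, while an fp16 7B (~17 GB) never could.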

  • SGforce · 18 hours ago

    fp8 would probably be fine, though the method used to make the quant would greatly influence that.

    I don’t know exactly how Ollama works, but I would think a more ideal model would be one of these quants:

    https://huggingface.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF

    A GGUF model would also allow some overflow into system RAM, if Ollama has that capability like some other inference backends do.
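    If you want to grab one of those quants directly, something like this should work — a minimal sketch assuming the `ollama` Python package and a running local Ollama server; the hf.co/<repo>:<tag> model reference style and the Q8_0 tag are my assumptions about how Ollama resolves HF GGUF repos:

    ```python
    # Sketch: pull a specific GGUF quant from that Hugging Face repo through Ollama
    # and run a quick prompt. Assumes the `ollama` Python package and a local server;
    # the hf.co/<repo>:<quant tag> model reference is an assumption on my part.
    import ollama

    MODEL = "hf.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF:Q8_0"

    ollama.pull(MODEL)  # downloads the chosen quant if it isn't cached yet

    resp = ollama.chat(
        model=MODEL,
        messages=[{"role": "user", "content": "Write a short Python hello-world."}],
    )
    print(resp["message"]["content"])
    ```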

    • Smorty [she/her]@lemmy.blahaj.zone (OP) · 17 hours ago

      Ollama does indeed have the ability to split a model between VRAM and RAM, but I always assumed it wouldn’t make sense, since it would massively slow down generation.

      I think Ollama already uses GGUF, since that is how you import a model from HF into Ollama anyway: you have to use the *.gguf file.

      As someone with experience in GLSL shader development, I know very well that communication between the GPU and CPU is super slow, and sending data from the GPU back to the CPU is a pretty heavy task. So I just assumed it wouldn’t make any sense. I will now try a full fp16 7B model using my 32 GB of normal RAM to check out the speed. I’ll edit this comment once I’m done and share the results.
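      To actually put a number on it, I'll measure tokens per second. A minimal sketch of how I plan to do that against Ollama's local REST API — the model tag is just a placeholder for whatever fp16 7B I end up pulling:

      ```python
      # Sketch: measure generation speed (tokens/sec) through Ollama's local REST API.
      # Assumes Ollama is listening on its default port; the model tag is a placeholder.
      import requests

      MODEL = "qwen2.5-coder:7b-instruct-fp16"  # placeholder, swap in the model you pulled

      r = requests.post(
          "http://localhost:11434/api/generate",
          json={"model": MODEL, "prompt": "Explain GGUF in one paragraph.", "stream": False},
          timeout=600,
      )
      data = r.json()

      # eval_count = generated tokens, eval_duration = generation time in nanoseconds
      tokens = data["eval_count"]
      seconds = data["eval_duration"] / 1e9
      print(f"{tokens} tokens in {seconds:.1f} s -> {tokens / seconds:.1f} tok/s")
      ```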