What should I use: big model-small quant or small model-no quant?

Smorty [she/her]@lemmy.blahaj.zone · edit-2 5 months ago

What should I use: big model-small quant or small model-no quant?

SGforce · 6 months ago

With modern methods, running a larger model split between GPU and CPU can sometimes be fast enough. Here’s an example: https://dev.to/maximsaplin/llamacpp-cpu-vs-gpu-shared-vram-and-inference-speed-3jpl

Smorty [she/her]@lemmy.blahaj.zone · 6 months ago

oooh a windows only feature, now I see why I haven’t heard of this yet. Well, too bad I guess. It’s time to switch to AMD for me anyway…

ffhein@lemmy.world · 6 months ago

The article is written in a somewhat confusing way, but if you’re on Windows you’ll most likely want to turn off Nvidia’s automatic VRAM swapping so it doesn’t happen by accident. AFAIK partial offloading with llama.cpp is much faster if you want to split the model between GPU and CPU. With swapping turned off it’s also easier to find how many layers you can offload, because the model fails to load when you set the number too high instead of quietly swapping.
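Finding the layer count by trial and error can be sketched as a binary search. This is only a rough sketch: it assumes that if n layers fit then any smaller number fits too, and that a CPU-only load (0 layers) always works. `max_offload_layers` and `make_llama_loader` are made-up names for illustration. The loader uses the `Llama` constructor from llama-cpp-python and assumes a failed load shows up as a Python exception, which may not hold for every backend or driver setup.

```python
from typing import Callable


def max_offload_layers(try_load: Callable[[int], bool], total_layers: int) -> int:
    """Return the largest layer count in [0, total_layers] for which try_load succeeds.

    Assumes loading is monotonic (if n layers fit, fewer layers fit too)
    and that 0 layers, i.e. CPU only, always loads.
    """
    if try_load(total_layers):
        return total_layers  # the whole model fits on the GPU
    lo, hi = 0, total_layers  # invariant: lo fits, hi does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if try_load(mid):
            lo = mid
        else:
            hi = mid
    return lo


def make_llama_loader(model_path: str, n_ctx: int = 4096) -> Callable[[int], bool]:
    """Hypothetical helper: build a try_load callable on top of llama-cpp-python."""

    def try_load(n_gpu_layers: int) -> bool:
        # Imported lazily so the search above has no hard dependency.
        from llama_cpp import Llama

        try:
            Llama(
                model_path=model_path,
                n_gpu_layers=n_gpu_layers,
                n_ctx=n_ctx,
                verbose=False,
            )
            return True
        except Exception:
            # Assumption: an out-of-memory load raises instead of crashing.
            return False

    return try_load
```

Usage would look something like `max_offload_layers(make_llama_loader("model.gguf"), total_layers=40)`, where the path and the layer count are placeholders for your own model. Leave a layer or two of headroom, since the context cache also needs VRAM once you start generating.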

Also, if you want to experiment with partial offloading, maybe a 12B model at around Q4 would be more interesting than the same 7B model at higher precision? I haven’t checked whether anything new has come out in the last couple of months, but Mistral Nemo is fairly good IMO, though you might need to limit the context to 4k or so.
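For a rough feel of why a 12B model at Q4 lands in the same ballpark as a 7B model at 8 bits, and why the context limit matters, here is some back-of-the-envelope arithmetic. The bits-per-weight figures (about 4.5 for a Q4-style quant, about 8.5 for a Q8-style one) and the Mistral Nemo shape numbers (40 layers, 8 KV heads, head dimension 128) are my assumptions, not measurements, so check the actual GGUF file size and model config before relying on them.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 n_ctx: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB (keys + values, fp16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem / 2**30


# Assumed bits per weight; real quant formats vary a little.
print(f"12B @ ~Q4:  {weight_gib(12, 4.5):.1f} GiB")
print(f" 7B @ ~Q8:  {weight_gib(7, 8.5):.1f} GiB")
print(f" 7B @ fp16: {weight_gib(7, 16):.1f} GiB")

# Assumed Mistral Nemo shape: 40 layers, 8 KV heads, head_dim 128.
print(f"KV cache @ 4k ctx:  {kv_cache_gib(40, 8, 128, 4096):.3f} GiB")
print(f"KV cache @ 32k ctx: {kv_cache_gib(40, 8, 128, 32768):.3f} GiB")
```

Under those assumptions the cache grows linearly with context, so capping the context at 4k frees up VRAM that can go to a few more offloaded layers.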

SGforce · 6 months ago

Oh, that part is. But the splitting tech is built into llama.cpp.