Incredibly, running a local LLM (large language model) on just the CPU is possible with Llama.cpp! However, it can be pretty slow: I get about 1 token every 2 seconds with a 34-billion-parameter model on an 11th-gen Intel Framework laptop with 64GB of RAM.

I have an external Nvidia GPU connected to my Pop!_OS laptop, and I’ve used the following technique to compile Llama.cpp with CLBlast (a BLAS adapter library) to speed up various LLMs (such as codellama-34b.Q4_K_M.gguf). As a rough estimate, I get about a 5x speed-up on my Nvidia 3080 Ti.
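
For orientation before the Docker-based walkthrough, here’s a minimal sketch of what a bare-metal CLBlast build of Llama.cpp looks like, assuming the CLBlast and OpenCL development packages are already installed; the `-ngl` value below is just an illustrative layer count to tune for your VRAM:

```bash
# Clone and build Llama.cpp with CLBlast enabled (build flag from upstream llama.cpp)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CLBLAST=1

# Run a quantized model, offloading layers to the GPU via OpenCL
# (-ngl 40 is an example value; raise or lower it to fit your card's memory)
./main -m models/codellama-34b.Q4_K_M.gguf -ngl 40 -p "Write a hello world program in C"
```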

Here’s how to compile Llama.cpp inside a Docker container on Pop!_OS.