I’ve been looking into self-hosting LLMs or Stable Diffusion models using something like LocalAI and/or Ollama and LibreChat.

Some questions to get a nice discussion going:

  • Any of you have experience with this?
  • What are your motivations?
  • What are you using in terms of hardware?
  • Considerations regarding energy efficiency and associated costs?
  • What about renting a GPU? Privacy implications?

  • robber@lemmy.ml (OP) · 7 months ago

    So you access the models directly via terminal? Is that convenient? Also, do you get satisfying inference speed and quality with a 16GB card?

    • Audalin@lemmy.world · 7 months ago (edited)

      Mostly via terminal, yeah. It’s convenient when you’re used to it - I am.

      Let’s see, my current inference speeds are:

      • ~60-65 tok/s for an 8B model in Q5_K/Q6_K (entirely in VRAM);
      • ~36 tok/s for a 14B model in Q6_K (entirely in VRAM);
      • ~4.5 tok/s for a 35B model in Q5_K_M (16/41 layers in VRAM);
      • ~12.5 tok/s for an 8x7B model in Q4_K_M (18/33 layers in VRAM);
      • ~4.5 tok/s for a 70B model in Q2_K (44/81 layers in VRAM);
      • ~2.5 tok/s for a 70B model in Q3_K_L (28/81 layers in VRAM).

      As for quality, I try to avoid quantisation below Q5, or at least Q4. I also don’t see any point in using Q8/f16/f32 - the difference from Q6 is minimal. Other than that, it really depends on the model - for instance, Llama 3 8B is smarter than many older 30B+ models.
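
      If it helps to see why the bigger models only fit partially in VRAM, here is a rough back-of-the-envelope sketch. It assumes weight size is roughly parameters times bits per weight, that all layers are the same size, and it ignores the KV cache and other overhead. The function names and the example numbers (3.0 bits per weight, 14 GiB of usable VRAM) are made up for illustration, not measured values.

```python
import math


def approx_weight_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Rough size of the quantised weights in GiB (ignores metadata and overhead)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def layers_that_fit(weights_gib: float, n_layers: int, vram_budget_gib: float) -> int:
    """How many equally sized layers fit in the VRAM budget (assumes uniform layers)."""
    per_layer_gib = weights_gib / n_layers
    return min(n_layers, math.floor(vram_budget_gib / per_layer_gib))


if __name__ == "__main__":
    # Hypothetical inputs: a 70B model at ~3.0 bits per weight, 81 layers,
    # and 14 GiB of a 16 GB card left for weights after cache and buffers.
    size = approx_weight_gib(70, 3.0)
    print(f"approx. weights: {size:.1f} GiB")
    print(f"layers in VRAM: {layers_that_fit(size, 81, 14.0)} of 81")
```

      Treat the result as a starting point only: the real number of layers you can offload depends on context length and on the exact quantisation format.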

      • robber@lemmy.ml (OP) · 7 months ago

        Thanks! Glad to see the 8x7B not performing too badly - I assume that’s a Mistral model? Also, do you know whether the CPU significantly affects inference speed in such a setup?

        • Audalin@lemmy.world · 7 months ago

          If your CPU isn’t ancient, it’s mostly about memory speed. VRAM is very fast, DDR5 RAM is reasonably fast, and swap is slow even on a modern SSD.
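
          To put the memory-speed point into numbers, here is a minimal sketch. It assumes generation is purely bandwidth-bound, i.e. every weight of a dense model is read once per token from wherever it lives, so it gives an optimistic ceiling rather than a prediction. The function name and the bandwidth figures in the example are illustrative assumptions, not measurements from my setup.

```python
def est_tokens_per_s(weights_gb: float, frac_in_vram: float,
                     vram_gbps: float, ram_gbps: float) -> float:
    """Upper-bound tokens/s if each token requires reading all weights once.

    weights_gb   -- size of the quantised weights in GB
    frac_in_vram -- fraction of the weights offloaded to the GPU (0.0 to 1.0)
    vram_gbps    -- assumed VRAM bandwidth in GB/s
    ram_gbps     -- assumed system RAM bandwidth in GB/s
    """
    if not 0.0 <= frac_in_vram <= 1.0:
        raise ValueError("frac_in_vram must be between 0 and 1")
    # Time per token is the sum of the time spent reading each part.
    seconds_per_token = (weights_gb * frac_in_vram / vram_gbps
                         + weights_gb * (1.0 - frac_in_vram) / ram_gbps)
    return 1.0 / seconds_per_token


if __name__ == "__main__":
    # Hypothetical numbers: 25 GB of weights, 400 GB/s VRAM, 60 GB/s DDR5.
    for frac in (1.0, 0.5, 0.0):
        rate = est_tokens_per_s(25, frac, 400, 60)
        print(f"{frac:.0%} in VRAM: up to ~{rate:.1f} tok/s")
```

          The slow part dominates: once a sizeable share of the weights sits in system RAM, the RAM term sets the pace, which is why partial offloading drops the speed so sharply.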

          8x7B is Mixtral, yeah.