You can look at the stats on how much of the model fits in VRAM. The lower the percentage, the slower it goes, although I imagine that's not the only constraint. Some models are probably faster than others regardless, but I really haven't done a lot of experimenting. It's too slow on my card to even compare output quality across models. Once I have 2k tokens in context, even a 7B model takes a second or more per token. I have about the slowest card that ollama even says you can use. I think there is one worse card.
ETA: I’m pulling the 14B Abliterated model now for testing. I haven’t had good luck running a model this big before, but I’ll let you know how it goes.
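For anyone wondering where to see those stats: ollama reports them itself. A minimal sketch (the model name is just an example, substitute whatever you've pulled):

```shell
# Show loaded models; the PROCESSOR column reports the CPU/GPU split,
# e.g. "100% GPU" when the model fully fits in VRAM,
# or something like "48%/52% CPU/GPU" when it spills over.
ollama ps

# Run with --verbose to print timing stats after each response,
# including "eval rate" in tokens per second.
ollama run mistral --verbose
```

The "eval rate" line is the easiest number to compare across models on the same card.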