llama.cpp
The Only Inference Engine You'll Ever Need™
I found this guide, which seems very comprehensive, but it has a few sections that assume knowledge I don't have without suggesting a clear route to gain it.
For the section just following "Grab the content of SmolLM2 1.7B Instruct", I assume it boils down to opening the MSYS shell mentioned earlier and running this command through it? "GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct"
That's for quantizing a model yourself. You can instead (read that as "should") download an already quantized model. You can find quantized models linked from the HuggingFace page of your model of choice. (Pro tip: quants by Bartowski, Unsloth, and Mradermacher are high quality.)
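For example, with the Hugging Face CLI you can pull a single GGUF file directly. The repo name and filename below are just illustrative; take the exact ones from whichever quant page you pick:

    # install the CLI if you don't already have it
    pip install -U "huggingface_hub[cli]"
    # download one pre-quantized GGUF into the current directory
    huggingface-cli download bartowski/SmolLM2-1.7B-Instruct-GGUF SmolLM2-1.7B-Instruct-Q4_K_M.gguf --local-dir .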
And then you just run it.
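"Run it" with llama.cpp usually means either the command-line tool or the bundled server. A rough sketch, reusing the example filename from above (the flags are just sensible starting points, adjust to taste):

    # quick one-off generation in the terminal
    ./llama-cli -m SmolLM2-1.7B-Instruct-Q4_K_M.gguf -p "Hello!" -n 128
    # or serve it over HTTP with an OpenAI-compatible API
    ./llama-server -m SmolLM2-1.7B-Instruct-Q4_K_M.gguf -c 4096 --port 8080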
You can also use KoboldCpp or Open WebUI as friendly front ends for llama.cpp.
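Both of those can point at the llama-server endpoint from the sketch above. A quick sanity check that the server is answering (assuming the example port 8080) looks like:

    # llama-server exposes an OpenAI-compatible chat completions endpoint
    curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages":[{"role":"user","content":"Say hi"}]}'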
Also, to answer your question, yes.
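And in case you do want the do-it-yourself route the guide describes, it boils down to roughly this sketch (script and binary names come from a current llama.cpp checkout; the output filenames are only examples):

    # clone the repo without resolving the big LFS files up front
    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct
    # then fetch the actual weight files
    cd SmolLM2-1.7B-Instruct && git lfs pull && cd ..
    # convert the HF checkpoint to a full-precision GGUF with the script shipped in llama.cpp
    python convert_hf_to_gguf.py SmolLM2-1.7B-Instruct --outfile SmolLM2-1.7B-Instruct-F16.gguf
    # quantize it down to something that fits in VRAM
    ./llama-quantize SmolLM2-1.7B-Instruct-F16.gguf SmolLM2-1.7B-Instruct-Q4_K_M.gguf Q4_K_M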
LM Studio has both Vulkan and ROCm backends, but performance on Vulkan is better right now.
I have been running gpt-oss-20b on a 9060 XT 16 GB at a solid 20 tokens/sec.
I have the same GPU and I use koboldcpp with Vulkan as the backend. It works perfectly fine. I run a 12B model and it's extremely fast; I could probably even fit a bigger model into VRAM. Using tabbyAPI for EXL2 models didn't work for me; it always generated gibberish (I tried two different models). For context, I'm on Linux, so maybe that's not an issue on other operating systems.
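For reference, launching koboldcpp with the Vulkan backend looks roughly like this (the model path is a placeholder and flag names can change between releases, so double-check against --help):

    # start koboldcpp with Vulkan, offloading all layers to the GPU
    python koboldcpp.py --model your-model-Q4_K_M.gguf --usevulkan --gpulayers 99 --contextsize 8192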