After trying multiple AI interfaces, including LMStudio, Msty.ai, and Koboldcpp's built-in UI, I finally settled on Open-WebUI as my daily driver. I installed the Docker version and linked it to my Koboldcpp server.
I also linked Open-WebUI with my Venice.ai API keys, so I can access state-of-the-art models too.
I liked Open-WebUI's follow-up questions as well as its tagging system, but I didn't like that it uses the current model to generate them. These features add extra cost because each one re-sends the whole conversation as context, and in the worst case they can quadruple the time and cost of inference.
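To make that worst case concrete, here's a rough back-of-envelope sketch (the token counts are illustrative, not measured): the chat reply processes the conversation once, and each background task (title, tags, follow-up questions) re-sends it.

```python
# Rough sketch: why background tasks can roughly 4x inference cost
# when they run on the same (expensive) model as the chat itself.
# Token counts are made up for illustration.

def total_prompt_tokens(conversation_tokens: int, background_tasks: int) -> int:
    # The reply itself processes the conversation once;
    # each background task re-sends the full conversation as context.
    return conversation_tokens * (1 + background_tasks)

chat_only = total_prompt_tokens(8000, 0)   # just the reply
with_tasks = total_prompt_tokens(8000, 3)  # + title, tags, follow-ups

print(with_tasks / chat_only)  # 4.0 -> the "4x" worst case
```

Pointing those three background tasks at a tiny local task model makes their cost negligible, which is exactly the motivation for the next step.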
So, I went into the settings and changed the Task Model... Still, I needed a local model small enough to run fast, but with a big enough context window.
I tested FunctionGemma, SmolLM2-360M, and LFM2-350M. All three were fast enough for my purposes... While none of them gave me 100% consistent results, LFM2-350M gave me the most consistently good output of the bunch.
For now, my daily-driver local model is Ling-mini-2.0 at 14-20 tokens/s, with LFM2-350M as the task model at 50-60 tokens/s, both running at Q4 quantization. For my AMD Ryzen 7640HS APU without a dedicated GPU, those are the best results I can hope for.
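As a quick sanity check on what those speeds mean in practice, here's a tiny sketch (with assumed reply lengths, not benchmarks) of the wait time at each model's rate:

```python
# Estimate how long a reply takes at a given generation speed.
# Reply lengths are assumptions for illustration.

def seconds_for(tokens: int, tok_per_s: float) -> float:
    return tokens / tok_per_s

# ~400-token chat reply on Ling-mini-2.0 at ~17 tok/s (mid of 14-20)
main_reply = seconds_for(400, 17)
# ~60-token tags/follow-ups on LFM2-350M at ~55 tok/s (mid of 50-60)
task_reply = seconds_for(60, 55)

print(round(main_reply, 1), round(task_reply, 1))
```

The point: the background tasks finish in about a second on the small model, so they no longer add noticeable delay on top of the main reply.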
With the pace at which Small Language Models are improving, I believe I'll have a good-enough model in a few months. I'm so excited... What about you?
Related Threads
@ahmadmanga/re-leothreads-n1tb5evr
@ahmadmanga/re-leothreads-2zlbwapby