Ok, this is pretty deep in the weeds of geekville, so if you don't know what a q8 quant is, you probably want to skip this one.
My current local AI server is an AMD Strix Halo 395+ with 128GB of RAM. This is relatively new tech with a large pool of shared memory, similar to a Mac, which can be used to run very large AI models at moderately decent speeds.
The main model I run is OpenAI's open-weight model, GPT-OSS-120B, which is a very large model that takes around 75GB of VRAM at q8 and around 50GB at q4. Most GPUs have 8-16GB of VRAM, and even the top-of-the-line Nvidia 5090 only has 32GB.
My old GPU (Nvidia 3090) has around 936GB/s of memory bandwidth, where the AMD 395+ (Strix Halo) has around 253GB/s. While the Strix Halo has no slot for adding a GPU directly, you can use the second M.2 NVMe slot to connect one via Oculink. The M.2 slot only offers four PCIe lanes, quite a bit slower than the full 16-lane slot you'd normally use for a GPU, but for inference (asking AI questions, not building new AI models) this is not a major concern. The reason is that once you load the model into the GPU, it doesn't talk to the CPU that much. If you were fine-tuning (making new models), this would be a major problem.
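To put some rough numbers on why the x4 link mostly matters at load time, here's a back-of-the-envelope sketch. The ~1.97 GB/s per-lane figure for PCIe 4.0 is an approximation of usable bandwidth, and the 24GB figure simply assumes you fill the 3090's VRAM with model layers:

```python
# Why a x4 link is tolerable for inference: the big transfer is the one-time
# model load; after that, per-token CPU<->GPU traffic is small.
# Assumption: ~1.97 GB/s of usable bandwidth per PCIe 4.0 lane.

PCIE4_GBPS_PER_LANE = 1.97

def load_time_s(model_gb: float, lanes: int) -> float:
    """Seconds to push model_gb of weights over the given number of lanes."""
    return model_gb / (PCIE4_GBPS_PER_LANE * lanes)

# Loading ~24 GB of layers (a full 3090) over x4 vs x16:
print(f"x4 : {load_time_s(24, 4):.1f} s")   # ~3.0 s
print(f"x16: {load_time_s(24, 16):.1f} s")  # ~0.8 s
```

A few extra seconds at startup, once per model load, is a price most people will happily pay.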
This means hooking up an external GPU via Oculink looked like a very viable option. I've been looking around and talking to a lot of people with Strix Halos, and no one has done sufficient testing on this option. I have seen a handful of users with eGPU setups, but no one has tested how it performs with AI.
Since I already had an Nvidia 3090 sitting on a shelf, I thought I'd give it a try. The 3090 is a very good card for AI: it has 24GB of VRAM, it is fairly quick, and it is cheaper than current-gen GPUs. I've seen systems using 8-12 3090s to run large models.
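As a quick sanity check on those multi-3090 builds, here's the card count you'd need just to hold this model's weights at each quant level, using the VRAM figures quoted above (this ignores KV cache and activations, so real builds need headroom):

```python
# How many 24GB 3090s does it take just to hold the model weights?
# Model sizes taken from the figures above: ~75GB at q8, ~50GB at q4.
import math

CARD_VRAM_GB = 24

def cards_needed(model_gb: float) -> int:
    return math.ceil(model_gb / CARD_VRAM_GB)

print(f"q8 (~75GB): {cards_needed(75)} cards")  # 4
print(f"q4 (~50GB): {cards_needed(50)} cards")  # 3
```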
I did expect to lose some performance due to the x4 bus and the Oculink protocol. After some benchmarking, the results are less than I wanted, but roughly what I expected.
The stock AMD 395+ scores with a small prompt:
prompt eval time = 1034.63 ms / 277 tokens ( 3.74 ms per token, 267.73 tokens per second)
eval time = 2328.85 ms / 97 tokens ( 24.01 ms per token, 41.65 tokens per second)
total time = 3363.48 ms / 374 tokens
With both the AMD 395+ and the 3090:
prompt eval time = 864.31 ms / 342 tokens ( 2.53 ms per token, 395.69 tokens per second)
eval time = 994.16 ms / 55 tokens ( 18.08 ms per token, 55.32 tokens per second)
total time = 1858.47 ms / 397 tokens
That's a 47.8% improvement in prompt processing (the AMD 395+'s weak spot) and 32.8% faster token generation. Not bad really; getting 55 tokens/sec is pretty impressive considering how large this model is.
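If you want to check those percentages yourself, they fall straight out of the tokens-per-second numbers in the llama.cpp timing lines above:

```python
# Deriving the quoted gains from the raw benchmark output.
pp_before, pp_after = 267.73, 395.69   # prompt processing, tokens/sec
tg_before, tg_after = 41.65, 55.32     # token generation, tokens/sec

pp_gain = (pp_after / pp_before - 1) * 100
tg_gain = (tg_after / tg_before - 1) * 100
print(f"prompt processing: +{pp_gain:.1f}%")  # +47.8%
print(f"token generation:  +{tg_gain:.1f}%")  # +32.8%
```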
I'm not sure if I can squeeze more out of this setup. I don't actually use it for much, as I prefer even faster speeds for day-to-day stuff, and I am planning a much bigger system once I get my stock system fully functional.
Since the EVO-X2 does not have an Oculink port, I had to use this M.2-to-Oculink 4i cable from Amazon ($24.73).
I used this Minisforum DEG1 eGPU dock to mount the power supply and GPU, also picked up on Amazon for $100 shipped.
There is a slightly nicer but much more expensive eGPU dock that includes a power supply for $260, also from Amazon.
I have a lot of high end power supplies lying around and I wasn't even sure this would work.
My next project is to try my 5090 and see what sort of gains I can get from that: it is not only twice as fast, it also has an additional 8GB of VRAM, which allows many more layers to be loaded.
I might even try putting two 3090s in the machine, using up both M.2 slots and booting off a flash drive. The model will be really slow to load, but once it is loaded, it really doesn't need any drive access. Although this is getting silly; putting in a 5090 would be a much better option.