TheMightyCat

joined 2 weeks ago
[–] TheMightyCat@ani.social 1 points 1 day ago (1 children)

Why do core counts and memory type matter when the table includes memory bandwidth and TFLOP16?

The H200 has HBM and a lot of tensor cores, which is reflected in its high stats in the table, and the AMD GPUs don't have CUDA cores.

I know a major deterioration is to be expected, but how major? Even in an extreme case where only 10% of the total power is usable, it's still competitive against the H200, since you can get way more for the price, even if you can only use 10% of that.
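A rough way to put a number on that break-even point, using the H200 NVL and RX 9070 rows from the table in the original post. This is only a sketch: it assumes usable performance is a flat fraction of the summed spec-sheet figures, which ignores interconnect and software limits, and the function name is made up for illustration.

```python
# Figures copied from the table in the original post (prices in EUR).
H200 = {"price": 36284, "bandwidth_tbs": 4.89, "tflops_fp16": 1671}
RX9070 = {"price": 619, "bandwidth_tbs": 0.6446, "tflops_fp16": 72.25}


def break_even_efficiency(big, small, metric):
    """Fraction of the small cards' summed `metric` that must be usable
    for a same-budget cluster to match one `big` card (naive model)."""
    n_cards = big["price"] // small["price"]  # whole cards the budget buys
    aggregate = n_cards * small[metric]
    return big[metric] / aggregate


for metric in ("bandwidth_tbs", "tflops_fp16"):
    eff = break_even_efficiency(H200, RX9070, metric)
    print(f"{metric}: need {eff:.1%} of the aggregate to match one H200")
```

With these numbers the break-even works out to about 13% of the summed bandwidth and about 40% of the summed FP16 throughput, so under this simple model a cluster running at 10% would land somewhat below a single H200 rather than level with it.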

[–] TheMightyCat@ani.social 1 points 1 day ago (3 children)

Thanks! I'll go check it out.

[–] TheMightyCat@ani.social 1 points 1 day ago (6 children)

My target model is Qwen/Qwen3-235B-A22B-FP8, ideally at its maximum context length of 131K, but I'm willing to compromise. I find it hard to give a concrete t/s answer; let's put it around 50. At max load there will probably be around 8 concurrent users, but these situations will be rare enough that optimizing for a single user is probably more worthwhile.

My current setup is already: Xeon w7-3465X, 128 GB DDR5, 2x 4090

It gets nice enough performance loading 32B models completely in VRAM, but I am skeptical that a similar system can run a 671B model at anything faster than a snail's pace. I currently run vLLM because it has higher performance with tensor parallelism than llama.cpp, but I shall check out ik_llama.cpp.
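For a sense of scale, here is a back-of-the-envelope sketch of what that target implies. It assumes FP8 stores about 1 byte per parameter and that single-user decoding is limited by how fast the active weights can be read from memory. It ignores KV cache, activations, and interconnect overhead, so treat the results as lower bounds; the helper names are hypothetical.

```python
# Assumptions (not measured): ~1 byte per parameter at FP8, and decode speed
# bounded by reading the active weights once per generated token.
TOTAL_PARAMS = 235e9    # "235B" in the model name
ACTIVE_PARAMS = 22e9    # "A22B": parameters active per token
BYTES_PER_PARAM = 1.0   # FP8
TARGET_TPS = 50         # target tokens per second from the comment


def weight_memory_gb(params, bytes_per_param=BYTES_PER_PARAM):
    """Memory needed just to hold the weights, in GB."""
    return params * bytes_per_param / 1e9


def min_bandwidth_tbs(active_params, tokens_per_s, bytes_per_param=BYTES_PER_PARAM):
    """Memory bandwidth needed to stream the active weights, in TB/s."""
    return active_params * bytes_per_param * tokens_per_s / 1e12


print(f"weights: ~{weight_memory_gb(TOTAL_PARAMS):.0f} GB")
print(f"bandwidth for {TARGET_TPS} t/s: "
      f"~{min_bandwidth_tbs(ACTIVE_PARAMS, TARGET_TPS):.2f} TB/s")
```

Under these assumptions the weights alone take roughly 235 GB and 50 t/s for one user needs on the order of 1.1 TB/s of effective bandwidth, before any context is cached.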

[–] TheMightyCat@ani.social 1 points 1 day ago

While I would still say it's excessive to respond with "😑", I was too quick in waving these issues away.

Another commenter explained that residential power physically cannot supply enough to match high-end GPUs, which is why they could be worth it even for selfhosters.

[–] TheMightyCat@ani.social 1 points 1 day ago

Thanks. While I would still like to know the performance scaling of a cheap cluster, this does answer the question: pay way more for high-end cards like the H200 for greater efficiency, or pay less and have to deal with these issues.

[–] TheMightyCat@ani.social 2 points 1 day ago* (last edited 1 day ago) (3 children)
  • I know the more bandwidth the better, but I wonder how it scales. I can only test my own setup, which is less than optimal for this purpose with PCIe 4.0 x16 and no P2P, but it goes as follows: a single 4090 gets 40.9 t/s while 2 get 58.5 t/s using tensor parallelism, tested on Qwen/Qwen3-8B-FP8 with vLLM. I am really curious how this scales over more than 2 PCIe 5.0 cards with P2P, which all cards listed here except the 5090 support.
  • The theory goes that while the H200 has a very impressive bandwidth of 4.89 TB/s, for the same price you can get 37 TB/s spread across 58 RX 9070s. Whether this actually works in practice I don't know.
  • I don't need to build a datacenter; I'm fine with building a rack myself in my garage. And I don't think that requires higher volumes than just purchasing at different retailers.
  • I intend to run at FP8, so I wanted to show that instead of FP16, but it's surprisingly difficult to find the numbers for that. Only the H200 datasheet clearly displays FP8 Tensor Core performance. The RTX PRO 6000 datasheet keeps it vague, only mentioning AI TOPS, which they define as Effective FP4 TOPS with sparsity, and they didn't even bother writing a datasheet for the 5090, only saying 3352 AI TOPS, which I suppose is FP4 then. The AMD datasheets only list FP16 and INT8 matrix; whether INT8 matrix is equal to FP8 I don't know. So FP16 was the common denominator for all the cards I could find without comparing apples with oranges.
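The two-card measurement and the 58-card figure above can be turned into numbers with a few lines. This is a minimal sketch with hypothetical helper names; it only restates the figures from this comment and the table, and it says nothing about how scaling behaves beyond two cards.

```python
def tp_scaling(single_tps, multi_tps, n_gpus):
    """Return (speedup, parallel efficiency) for a tensor-parallel run."""
    speedup = multi_tps / single_tps
    return speedup, speedup / n_gpus


def same_budget_aggregate(budget_eur, card_price_eur, card_bandwidth_tbs):
    """Return (whole cards the budget buys, their summed bandwidth in TB/s)."""
    n_cards = int(budget_eur // card_price_eur)
    return n_cards, n_cards * card_bandwidth_tbs


# Measured: one 4090 at 40.9 t/s, two at 58.5 t/s (Qwen/Qwen3-8B-FP8, vLLM).
speedup, efficiency = tp_scaling(40.9, 58.5, 2)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")

# H200 NVL price spent on RX 9070s, using the price and bandwidth from the table.
n_cards, total_tbs = same_budget_aggregate(36284, 619, 0.6446)
print(f"{n_cards} cards, {total_tbs:.1f} TB/s summed")
```

The second card adds about 43% throughput, which is roughly 72% parallel efficiency on this PCIe 4.0 setup without P2P. The summed bandwidth is a spec-sheet total, not something a single model shard can use directly.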
[–] TheMightyCat@ani.social -3 points 1 day ago (4 children)

Well, a scam for selfhosters; for datacenters it's different, of course.

I'm looking to upgrade to my first dedicated built server, coming from only SBCs, so I'm not sure how much of a concern heat will be, but space and power shouldn't be an issue. (Within reason, of course.)

 
| GPU | VRAM | Price (€) | Bandwidth (TB/s) | TFLOP16 | €/GB | €/TB/s | €/TFLOP16 |
|---|---|---|---|---|---|---|---|
| NVIDIA H200 NVL | 141GB | 36284 | 4.89 | 1671 | 257 | 7423 | 21 |
| NVIDIA RTX PRO 6000 Blackwell | 96GB | 8450 | 1.79 | 126.0 | 88 | 4720 | 67 |
| NVIDIA RTX 5090 | 32GB | 2299 | 1.79 | 104.8 | 71 | 1284 | 22 |
| AMD RADEON 9070XT | 16GB | 665 | 0.6446 | 97.32 | 41 | 1031 | 7 |
| AMD RADEON 9070 | 16GB | 619 | 0.6446 | 72.25 | 38 | 960 | 8.5 |
| AMD RADEON 9060XT | 16GB | 382 | 0.3223 | 51.28 | 23 | 1186 | 7.45 |
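The three ratio columns are just the price divided by each spec, so they can be recomputed from the raw columns. A small sketch with a hypothetical `cost_ratios` helper, using the listed values as-is; expect small differences from the table because of rounding.

```python
# (name, vram_gb, price_eur, bandwidth_tbs, tflops_fp16), copied from the table.
GPUS = [
    ("NVIDIA H200 NVL", 141, 36284, 4.89, 1671),
    ("NVIDIA RTX PRO 6000 Blackwell", 96, 8450, 1.79, 126.0),
    ("NVIDIA RTX 5090", 32, 2299, 1.79, 104.8),
    ("AMD RADEON 9070XT", 16, 665, 0.6446, 97.32),
    ("AMD RADEON 9070", 16, 619, 0.6446, 72.25),
    ("AMD RADEON 9060XT", 16, 382, 0.3223, 51.28),
]


def cost_ratios(vram_gb, price_eur, bandwidth_tbs, tflops_fp16):
    """Return (EUR per GB, EUR per TB/s, EUR per FP16 TFLOP)."""
    return (
        price_eur / vram_gb,
        price_eur / bandwidth_tbs,
        price_eur / tflops_fp16,
    )


for name, *specs in GPUS:
    per_gb, per_tbs, per_tflop = cost_ratios(*specs)
    print(f"{name:<32}{per_gb:>8.1f}{per_tbs:>9.0f}{per_tflop:>8.2f}")
```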

This post is part "hear me out" and part asking for advice.

Looking at the table above, AI GPUs are a pure scam, and it would make much more sense (at least looking at this) to use gaming GPUs instead, either through a Frankenstein of PCIe switches or a high-bandwidth network.

So my question is whether somebody has built a similar setup and what their experience has been, what the expected overhead and performance hit is, and whether it can be made up for by just having way more raw performance for the same price.

[–] TheMightyCat@ani.social 12 points 3 days ago

Necessity is the mother of innovation; that is why the Chinese do have domestic manufacturing of processors and the EU doesn't.

What it will take, in my opinion, is American processors becoming unviably expensive (tariffs) or unavailable altogether (export controls) for the will and market for EU domestic processors to arise.

[–] TheMightyCat@ani.social 29 points 3 days ago

https://dare-riscv.eu/

Which is exactly what's happening

[–] TheMightyCat@ani.social 5 points 1 week ago

As long as Russia is fighting, China gets cheap oil
