this post was submitted on 23 Aug 2024
Linux
My experience is that AMD's virtual memory system for VRAM is buggy, and those bugs cause kernel crashes. A few tips:
If running both cards is overstressing your PSU, you might be suffering from voltage drops when your GPUs draw maximum power. I was able to run games absolutely fine on my previous PSU, but running diffusion models caused it to collapse. Try just a single card to see if it helps stability.
Make sure your kernel is as recent as possible. There have been a number of fixes in the 6.x series, and I have seen stability go up. Remember: Docker images still use your host OS kernel.
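To check which kernel you are actually running, something like the following should do. It's only a sketch: the helper names `kernel_major` and `kernel_minor` are made up for illustration, and the "older than 6.x" threshold is just my reading of the tip above, not an official cutoff.

```shell
# Print the running kernel release. Run inside a container, this still
# reports the host's kernel, because containers share it.
kernel_release="$(uname -r)"
echo "Running kernel: ${kernel_release}"

# Hypothetical helpers: split a release string such as "6.8.0-41-generic"
# into its major and minor version numbers.
kernel_major() {
    printf '%s\n' "${1%%.*}"
}

kernel_minor() {
    rest="${1#*.}"                      # drop "6."    -> "8.0-41-generic"
    printf '%s\n' "${rest%%[!0-9]*}"    # keep digits  -> "8"
}

# Assumed threshold: warn on anything older than the 6.x series.
if [ "$(kernel_major "$kernel_release")" -lt 6 ]; then
    echo "Kernel is older than the 6.x series; consider upgrading."
fi
```

Running the same script on the host and inside a Docker container should print the same release, which is a quick way to confirm that updating the image alone won't change the kernel.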
If you can, disable the desktop (e.g. `systemctl isolate multi-user.target`) and run the web GUI over the network from another machine. If you're running ComfyUI, that means adding `--listen` to the command line options. It's normally the desktop environment that triggers the crashes, when it tries to access something in VRAM that has been swapped out to normal RAM to make room for your models. Giving the whole GPU to the one task boosts stability massively. It's not the desktop environment's fault; the GPU driver should handle the situation.
When you get a crash, often it's just the GPU that has crashed and not the machine (this won't be true of a power supply issue). `ssh`ing in and shutting down cleanly can save your filesystems the trauma of a hard reboot. If you don't have another machine, grab an `ssh` client for your phone, like Juice SSH on Android. (Not affiliated. It just works for me.)
Using `rocm-smi` to reset the card after a crash might bring things back, but not always. Obviously you have to do this over the network, as your display has gone.
Be aware of your VRAM usage (`amdgpu_top`) and try to avoid overcommitting it. It sucks, but if you can avoid swapping VRAM everything goes better. Low memory modes on the tools can help. ComfyUI has `--lowvram`, for example, which more aggressively removes things from VRAM when it's finished using them. It slows down generations a bit, but that's better than crashing.
With this I've been running SDXL on an 8GB RX7600 pretty successfully (~1s per iteration). I've been thinking about upgrading, but I think I'll wait for the RX8000 series now. It's possible the underlying problem is something in the GPU hardware, as AMD are definitely improving things with software changes but not solving it once and for all. I'm also hopeful that they will upgrade the VRAM across the range. The 16GB 7600XT says to me that they know <16GB isn't practical anymore, so the high end also has to go up, right?
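For the VRAM-usage tip above, if you just want a number over ssh without an interactive tool, the amdgpu driver exposes byte counters in sysfs. A rough sketch, with assumptions: the card may not be `card0` on your machine (override `CARD_DIR` if so), and `vram_percent` is a made-up helper name.

```shell
# Hypothetical helper: integer percentage of VRAM in use,
# given used and total byte counts.
vram_percent() {
    echo $(( $1 * 100 / $2 ))
}

# Assumption: the GPU is card0. On multi-GPU or iGPU systems it may be
# card1 or higher, so allow an override via CARD_DIR.
card_dir="${CARD_DIR:-/sys/class/drm/card0/device}"

if [ -r "$card_dir/mem_info_vram_used" ] && [ -r "$card_dir/mem_info_vram_total" ]; then
    used="$(cat "$card_dir/mem_info_vram_used")"
    total="$(cat "$card_dir/mem_info_vram_total")"
    echo "VRAM: $(( used / 1048576 )) MiB of $(( total / 1048576 )) MiB ($(vram_percent "$used" "$total")%)"
else
    echo "No amdgpu VRAM counters found at $card_dir"
fi
```

Wrapping it in `watch -n 1` while a generation runs gives a crude view of how close you are to overcommitting.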