Please add a ComfyUI pipeline
This could work perfectly in a ComfyUI pipeline using DynamicVRAM, which behaves like unified memory: VRAM size is no longer the limiting factor, system RAM is. I've tested an RTX 4070 16 GB on the full Kandinsky 5 Pro diffusion model (76 GB), and it works fine with Comfy DynamicVRAM on a consumer PC with enough system RAM. I also compared speeds, and the 4070 matches a 3090 in generation speed (with xFormers). With unified memory like this, the GPU's compute power and speed now matter more than its memory size.
About FlashAttention: I would prefer xFormers over it, or over plain PyTorch attention, because xFormers is very easy to install (compiling FlashAttention on every fresh system wastes several hours). I also tested this on the model above: generation with plain Torch attention was really slow, and installing xFormers on top of it improved speed radically (~2 hours vs. 40 minutes, about 3x faster, for the full 50 steps on the full model).
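To illustrate the preference order I mean, here is a minimal sketch, not taken from any existing pipeline. It only checks which attention packages are importable and falls back to PyTorch's built-in scaled dot-product attention. The function name `pick_attention_backend` and the returned labels are hypothetical; `xformers` and `flash_attn` are the usual import names of those packages.

```python
from importlib.util import find_spec


def pick_attention_backend(is_installed=None):
    """Return a label for the preferred attention backend that is available.

    Preference order (an assumption based on the comment above):
    xFormers first, then FlashAttention, then PyTorch's built-in SDPA.
    `is_installed` is a hypothetical hook so the check can be overridden.
    """
    if is_installed is None:
        # find_spec returns None when a top-level package is not importable.
        def is_installed(name):
            return find_spec(name) is not None

    for name in ("xformers", "flash_attn"):
        if is_installed(name):
            return name
    # Nothing extra installed: use torch.nn.functional.scaled_dot_product_attention.
    return "torch_sdpa"
```

Calling `pick_attention_backend()` on a fresh system with only PyTorch installed would return `"torch_sdpa"`, so the pipeline would still run, just slower in my experience.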