From "Torch not compiled with CUDA enabled" to a certified GPU factory line (ComfyUI on aarch64)

How we took an aarch64 box from CPU-only PyTorch errors to a repeatable, GPU-certified ComfyUI render pipeline—with hard evidence you can copy/paste into your own ops checklist.

By Nikhil
Published
gpu · pytorch · comfyui · openclaw · operations


I don’t care about “it should work.” I care about a pipeline that’s provably on GPU, repeatable, and boring.

This post is the path we used to take an aarch64 machine from the classic:

Torch not compiled with CUDA enabled

…to a certified GPU factory line: ComfyUI renders that we can treat like an operations primitive (and plug into OpenClaw / LIG).

The hardware/software reality check

If you’re on arm64/aarch64, you don’t get to hand-wave CUDA. You verify.

Here’s the evidence we captured from the machine running the renders:

From nvidia-smi:

| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |

From ComfyUI's /system_stats:

"pytorch_version": "2.9.1+cu129"
"name": "cuda:0 NVIDIA GB10 : cudaMallocAsync"

That’s the bar:

  • nvidia-smi sees the GPU and a sane driver/CUDA stack
  • ComfyUI reports a CUDA-backed PyTorch build (+cu129 here)
  • ComfyUI reports an actual CUDA device (cuda:0 …)

If any one of those is missing, you don’t have a GPU pipeline—you have vibes.
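That bar can be encoded as a single pass/fail function. The strings below come straight from the evidence captured above; the function and argument names are illustrative, not an existing API:

```python
def gpu_evidence_ok(smi_header: str, pytorch_version: str, device_name: str) -> bool:
    """Return True only if all three pieces of GPU evidence hold.

    smi_header:      the banner line from nvidia-smi
    pytorch_version: ComfyUI's reported "pytorch_version" string
    device_name:     ComfyUI's reported device "name" string
    """
    driver_ok = "NVIDIA-SMI" in smi_header and "CUDA Version" in smi_header
    torch_ok = "+cu" in pytorch_version          # CUDA wheel, e.g. "2.9.1+cu129"
    device_ok = device_name.startswith("cuda:")  # e.g. "cuda:0 NVIDIA GB10 : cudaMallocAsync"
    return driver_ok and torch_ok and device_ok
```

With the values captured above this passes; a CPU wheel ("2.9.1+cpu") or a cpu device fails, which is exactly the "vibes" case.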

What “certified” means in practice

We treat a render backend like a factory line. It’s only “certified” if we can:

  1. Run a known workflow end-to-end
  2. Collect machine-readable proof that it ran on GPU
  3. Produce consistent artifacts (images + manifests)
  4. Repeat it without a human babysitting the box

ComfyUI is a good backend for this because it exposes enough introspection (/system_stats) to make the certification objective.
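Pulling that introspection is one HTTP GET. A stdlib-only sketch, assuming ComfyUI's default port (8188) and a top-level "devices" list in the response; verify the schema against your own instance before wiring this into automation:

```python
import json
import urllib.request

def fetch_system_stats(base_url: str = "http://127.0.0.1:8188") -> dict:
    """Fetch ComfyUI's /system_stats endpoint (default port assumed)."""
    with urllib.request.urlopen(f"{base_url}/system_stats", timeout=5) as resp:
        return json.load(resp)

def cuda_device_names(stats: dict) -> list:
    """Pick out CUDA-backed device names from a /system_stats payload."""
    return [d.get("name", "") for d in stats.get("devices", [])
            if d.get("name", "").startswith("cuda:")]
```

If `cuda_device_names(fetch_system_stats())` comes back empty, the certification fails before you waste a render.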

The root cause of the CUDA error (aarch64 edition)

On aarch64, “CUDA is installed” doesn’t mean your Python stack is CUDA-enabled.

The failure mode looks like this:

  • NVIDIA driver is present
  • nvidia-smi works
  • But your PyTorch wheel is CPU-only (or mismatched)
  • ComfyUI loads Torch, then falls back to CPU or throws

So the fix isn’t magical. It’s operational:

  • Install a CUDA-enabled PyTorch build that matches your platform
  • Confirm Torch can see the GPU
  • Confirm ComfyUI is actually using that Torch build
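One quick way to catch the CPU-only wheel is to sniff the local-version suffix on the torch version string (the authoritative runtime check is still `torch.cuda.is_available()` on the box itself). A sketch:

```python
def wheel_flavor(torch_version: str) -> str:
    """Classify a torch version string by its local-version suffix.

    CUDA wheels look like "2.9.1+cu129"; CPU-only wheels like "2.9.1+cpu";
    a bare "2.9.1" carries no platform hint, so treat it as suspect.
    """
    local = torch_version.partition("+")[2]
    if local.startswith("cu"):
        return "cuda"
    if local == "cpu":
        return "cpu"
    return "unknown"
```

"unknown" is deliberately not trusted: on aarch64, a version string with no suffix is exactly the case where you go check `torch.cuda.is_available()` by hand.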

Evidence images (what we actually rendered)

These are the artifacts we produced as part of the certification run. We keep both “background/mock” layers and final GPU outputs because it makes debugging template composition obvious.

OG format (1200×630)

Mock OG background layer (template asset, no final composition).
Final OG render produced on GPU (ComfyUI backend).

Square format (1080×1080)

Mock square background layer (useful for template debugging).
Final square render produced on GPU.

Portrait format (1080×1350)

Final portrait render produced on GPU.

Background asset (OG)

Background asset used by the OG template (kept as an explicit artifact).

The ops checklist (what to verify, in order)

This is the sequence that prevented us from wasting time:

  1. Driver / kernel sanity
    • nvidia-smi works without needing your Python environment
  2. PyTorch build sanity
    • you’re on a CUDA-enabled build for your platform (don’t guess; check the version string)
  3. ComfyUI device selection sanity
    • ComfyUI reports a CUDA device in /system_stats
  4. Workload sanity
    • run a real workflow (not just a smoke test) and ensure it completes
  5. Evidence capture
    • save nvidia-smi header + ComfyUI /system_stats excerpt alongside artifacts
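Step 5 (evidence capture) can be a small helper that drops a machine-readable manifest next to the render artifacts. The filename and layout here are illustrative, not a fixed convention:

```python
import json
import pathlib
from datetime import datetime, timezone

def write_gpu_evidence(run_dir: str, smi_header: str, system_stats: dict) -> pathlib.Path:
    """Persist the nvidia-smi header and /system_stats excerpt beside artifacts."""
    run = pathlib.Path(run_dir)
    run.mkdir(parents=True, exist_ok=True)
    manifest = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "nvidia_smi_header": smi_header,
        "comfyui_system_stats": system_stats,
    }
    path = run / "gpu_evidence.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```

Because the manifest lives in the same directory as the images, any later audit of a render run starts from the artifacts themselves, not from someone's memory of the box.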

Where this plugs into OpenClaw

Once ComfyUI-on-GPU is certified, we can treat it like a reliable backend:

  • OpenClaw queues a render job
  • ComfyUI executes on cuda:0
  • We store manifests + images
  • We can re-run the same workflow deterministically as part of CI-like ops gates

That’s the “factory line” mindset: inputs → GPU render → audited outputs, every time.