From "Torch not compiled with CUDA enabled" to a certified GPU factory line (ComfyUI on aarch64)

How we took an aarch64 box from CPU-only PyTorch errors to a repeatable, GPU-certified ComfyUI render pipeline—with hard evidence you can copy/paste into your own ops checklist.

By Nikhil
Published
gpu · pytorch · comfyui · openclaw · operations


I don’t care about “it should work.” I care about a pipeline that’s provably on GPU, repeatable, and boring.

This post is the path we used to take an aarch64 machine from the classic:

Torch not compiled with CUDA enabled

…to a certified GPU factory line: ComfyUI renders that we can treat like an operations primitive (and plug into OpenClaw / LIG).

The hardware/software reality check

If you’re on arm64/aarch64, you don’t get to hand-wave CUDA. You verify.

Here’s the evidence we captured from the machine running the renders:

From nvidia-smi:

| NVIDIA-SMI 580.126.09             Driver Version: 580.126.09     CUDA Version: 13.0     |

From ComfyUI's /system_stats:

"pytorch_version": "2.9.1+cu129"
"name": "cuda:0 NVIDIA GB10 : cudaMallocAsync"

That’s the bar:

  • nvidia-smi sees the GPU and a sane driver/CUDA stack
  • ComfyUI reports a CUDA-backed PyTorch build (+cu129 here)
  • ComfyUI reports an actual CUDA device (cuda:0 …)

If any one of those is missing, you don’t have a GPU pipeline—you have vibes.
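That bar can be encoded as a single pass/fail function. The strings below come straight from the evidence captured above; the function and argument names are illustrative, not an existing API:

```python
def gpu_evidence_ok(smi_header: str, pytorch_version: str, device_name: str) -> bool:
    """Return True only if all three pieces of GPU evidence hold.

    smi_header:      the banner line from nvidia-smi
    pytorch_version: ComfyUI's reported "pytorch_version" string
    device_name:     ComfyUI's reported device "name" string
    """
    driver_ok = "NVIDIA-SMI" in smi_header and "CUDA Version" in smi_header
    torch_ok = "+cu" in pytorch_version          # CUDA wheel, e.g. "2.9.1+cu129"
    device_ok = device_name.startswith("cuda:")  # e.g. "cuda:0 NVIDIA GB10 : cudaMallocAsync"
    return driver_ok and torch_ok and device_ok
```

With the values captured above this passes; a CPU wheel ("2.9.1+cpu") or a cpu device fails, which is exactly the "vibes" case.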

What “certified” means in practice

We treat a render backend like a factory line. It’s only “certified” if we can:

  1. Run a known workflow end-to-end
  2. Collect machine-readable proof that it ran on GPU
  3. Produce consistent artifacts (images + manifests)
  4. Repeat it without a human babysitting the box

ComfyUI is a good backend for this because it exposes enough introspection (/system_stats) to make the certification objective.
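Pulling that introspection is one HTTP GET. A stdlib-only sketch, assuming ComfyUI's default port (8188) and a top-level "devices" list in the response; verify the schema against your own instance before wiring this into automation:

```python
import json
import urllib.request

def fetch_system_stats(base_url: str = "http://127.0.0.1:8188") -> dict:
    """Fetch ComfyUI's /system_stats endpoint (default port assumed)."""
    with urllib.request.urlopen(f"{base_url}/system_stats", timeout=5) as resp:
        return json.load(resp)

def cuda_device_names(stats: dict) -> list:
    """Pick out CUDA-backed device names from a /system_stats payload."""
    return [d.get("name", "") for d in stats.get("devices", [])
            if d.get("name", "").startswith("cuda:")]
```

If `cuda_device_names(fetch_system_stats())` comes back empty, the certification fails before you waste a render.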

The root cause of the CUDA error (aarch64 edition)

On aarch64, “CUDA is installed” doesn’t mean your Python stack is CUDA-enabled.

The failure mode looks like this:

  • NVIDIA driver is present
  • nvidia-smi works
  • But your PyTorch wheel is CPU-only (or mismatched)
  • ComfyUI loads Torch, then falls back to CPU or throws

So the fix isn’t magical. It’s operational:

  • Install a CUDA-enabled PyTorch build that matches your platform
  • Confirm Torch can see the GPU
  • Confirm ComfyUI is actually using that Torch build
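One quick way to catch the CPU-only wheel is to sniff the local-version suffix on the torch version string (the authoritative runtime check is still `torch.cuda.is_available()` on the box itself). A sketch:

```python
def wheel_flavor(torch_version: str) -> str:
    """Classify a torch version string by its local-version suffix.

    CUDA wheels look like "2.9.1+cu129"; CPU-only wheels like "2.9.1+cpu";
    a bare "2.9.1" carries no platform hint, so treat it as suspect.
    """
    local = torch_version.partition("+")[2]
    if local.startswith("cu"):
        return "cuda"
    if local == "cpu":
        return "cpu"
    return "unknown"
```

"unknown" is deliberately not trusted: on aarch64, a version string with no suffix is exactly the case where you go check `torch.cuda.is_available()` by hand.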

Evidence images (what we actually rendered)

These are the artifacts we produced as part of the certification run. We keep both “background/mock” layers and final GPU outputs because it makes debugging template composition obvious.

OG format (1200×630)

Mock OG background layer (template asset, no final composition).
Final OG render produced on GPU (ComfyUI backend).

Square format (1080×1080)

Mock square background layer (useful for template debugging).
Final square render produced on GPU.

Portrait format (1080×1350)

Final portrait render produced on GPU.

Background asset (OG)

Background asset used by the OG template (kept as an explicit artifact).

The ops checklist (what to verify, in order)

This is the sequence that prevented us from wasting time:

  1. Driver / kernel sanity
    • nvidia-smi works without needing your Python environment
  2. PyTorch build sanity
    • you’re on a CUDA-enabled build for your platform (don’t guess; check the version string)
  3. ComfyUI device selection sanity
    • ComfyUI reports a CUDA device in /system_stats
  4. Workload sanity
    • run a real workflow (not just a smoke test) and ensure it completes
  5. Evidence capture
    • save nvidia-smi header + ComfyUI /system_stats excerpt alongside artifacts
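Step 5 (evidence capture) can be a small helper that drops a machine-readable manifest next to the render artifacts. The filename and layout here are illustrative, not a fixed convention:

```python
import json
import pathlib
from datetime import datetime, timezone

def write_gpu_evidence(run_dir: str, smi_header: str, system_stats: dict) -> pathlib.Path:
    """Persist the nvidia-smi header and /system_stats excerpt beside artifacts."""
    run = pathlib.Path(run_dir)
    run.mkdir(parents=True, exist_ok=True)
    manifest = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "nvidia_smi_header": smi_header,
        "comfyui_system_stats": system_stats,
    }
    path = run / "gpu_evidence.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```

Because the manifest lives in the same directory as the images, any later audit of a render run starts from the artifacts themselves, not from someone's memory of the box.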

Where this plugs into OpenClaw

Once ComfyUI-on-GPU is certified, we can treat it like a reliable backend:

  • OpenClaw queues a render job
  • ComfyUI executes on cuda:0
  • We store manifests + images
  • We can re-run the same workflow deterministically as part of CI-like ops gates

That’s the “factory line” mindset: inputs → GPU render → audited outputs, every time.