Riffusion gamepad control is the most interesting AI-music workflow we have run. Unlike MusicGen, which is essentially batch generation, Riffusion exposes a continuous interpolation between two prompts during inference — which is exactly the shape a gamepad stick wants to drive. The right stick becomes a latent-space joystick, sweeping between "thick dub bass" and "shimmering ambient pad" in real time. This guide wires Riffusion to a PS5 DualSense via Universal Controller MIDI and a small Python loop.
- What it is: Stable Diffusion fine-tuned on spectrograms. Image → ISTFT → audio.
- What the gamepad does: the right stick rides the prompt-blend alpha, L2 sets the inference steps, R2 fires a generation, and the touchpad scrubs through seeds.
- Speed: 3–6 seconds per 5-second clip on a 3090. Fast enough to feel jam-able.
- Quality ceiling: lo-fi, phasey, perfect for textures, bad for polished mixes.
Why spectrogram diffusion is gamepad-friendly
Most AI-music models are sequence-to-sequence: they read a prompt, sample tokens, decode audio. The interaction surface is essentially "wait, then listen". Riffusion is different: it builds on Stable Diffusion, which gives you a continuous latent space, and each generation can sit anywhere between two prompts (and two seeds) via a single blend parameter, alpha. That continuous space is the gamepad's natural habitat. A stick is two axes; this rig gives it two matching controls, alpha on X and seed-nudge on Y. Everything else is plumbing.
The mapping that actually works
| Input | MIDI | Param | What it does |
|---|---|---|---|
| Right stick X | CC 12 | alpha | Interpolation between prompt A (0) and prompt B (1) |
| Right stick Y | CC 13 | seed-nudge | Walks the seed by ±32 around the base |
| Left stick Y | CC 14 | CFG scale | How tightly the model sticks to the prompt |
| Touchpad X | CC 16 | seed scrub | Absolute seed across the touchpad surface (0–1024) |
| L2 trigger | CC 15 | inference steps | 15 (fast/grainy) to 50 (slow/clean) |
| R2 trigger | Note 60 | fire | Run inference with current state |
| X (Cross) | Note 61 | swap A/B | Promote current blend to new prompt A |
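The continuous rows of the table can also be written down as data, which keeps every range in one place. The sketch below is one possible way to do that; `CC_MAP`, `scale_cc` and `apply_cc` are illustrative names of ours, not part of Riffusion or Universal Controller MIDI, and the ranges are simply the ones used in this guide.

```python
# Hypothetical helper: names and layout are illustrative, not a library API.
# control number -> (parameter name, low, high, round to integer?)
CC_MAP = {
    12: ("alpha", 0.0, 1.0, False),   # right stick X
    13: ("nudge", -32, 31, True),     # right stick Y
    14: ("cfg", 3.0, 15.0, False),    # left stick Y
    15: ("steps", 15, 50, True),      # L2 trigger
    16: ("seed", 0, 1024, True),      # touchpad X
}


def scale_cc(value, lo, hi, as_int=False):
    """Map a 7-bit MIDI CC value (0-127) linearly onto [lo, hi]."""
    value = max(0, min(127, value))  # clamp out-of-range input
    scaled = lo + (value / 127) * (hi - lo)
    return round(scaled) if as_int else scaled


def apply_cc(state, control, value):
    """Update state in place for a known control; ignore unknown ones."""
    if control in CC_MAP:
        name, lo, hi, as_int = CC_MAP[control]
        state[name] = scale_cc(value, lo, hi, as_int)
    return state


state = {"alpha": 0.5, "cfg": 7.0, "steps": 25}
apply_cc(state, 12, 127)  # stick pushed fully right: pure prompt B
apply_cc(state, 15, 0)    # L2 released: fastest, grainiest setting
print(state)
```

Keeping the ranges in a table like this makes it easier to retune a control (say, narrowing the CFG range) without touching the MIDI loop itself.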
The Python loop
Riffusion exposes a clean pipeline class. The loop below listens to MIDI, reads CC state, and fires inference on note-on. Roughly fifty lines including imports.
import os
import time

import mido
import torch
from riffusion.riffusion_pipeline import RiffusionPipeline
from riffusion.spectrogram_image_converter import SpectrogramImageConverter

# Pipeline and converter calls are written as used in this guide;
# check them against the riffusion version you have installed.
pipe = RiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",
    torch_dtype=torch.float16,
).to("cuda")
converter = SpectrogramImageConverter()

# base_seed comes from the touchpad, nudge from right stick Y.
state = {"alpha": 0.5, "base_seed": 42, "nudge": 0, "cfg": 7.0, "steps": 25}
PROMPT_A = "thick dub bass 90 bpm subwoofer"
PROMPT_B = "shimmering ambient pad reverberant"

os.makedirs("./watched", exist_ok=True)  # make sure the output folder exists
inp = mido.open_input("Universal Controller MIDI")

for msg in inp:
    if msg.type == "control_change":
        if msg.control == 12:    # right stick X: prompt blend, 0.0 to 1.0
            state["alpha"] = msg.value / 127
        elif msg.control == 13:  # right stick Y: seed nudge, -32 to +31
            state["nudge"] = (msg.value - 64) // 2
        elif msg.control == 14:  # left stick Y: CFG scale, 3.0 to 15.0
            state["cfg"] = 3.0 + (msg.value / 127) * 12.0
        elif msg.control == 15:  # L2: inference steps, 15 to 50
            state["steps"] = 15 + int((msg.value / 127) * 35)
        elif msg.control == 16:  # touchpad X: absolute seed, 0 to 1024
            state["base_seed"] = round(msg.value / 127 * 1024)
    # R2 press; a note_on with velocity 0 means note-off, so ignore it.
    elif msg.type == "note_on" and msg.note == 60 and msg.velocity > 0:
        seed = max(0, state["base_seed"] + state["nudge"])
        image = pipe(
            prompt_a=PROMPT_A,
            prompt_b=PROMPT_B,
            alpha=state["alpha"],
            seed_a=seed,
            seed_b=seed + 1,
            num_inference_steps=state["steps"],
            guidance_scale=state["cfg"],
        ).images[0]
        segment = converter.audio_from_spectrogram_image(image)
        segment.export(f"./watched/{int(time.time())}.wav", format="wav")
What happens when you sweep the stick
Alpha is the magic axis. At 0.0 the output is purely prompt A — thick dub bass. At 1.0 it is purely prompt B — ambient pad. Anywhere in between, Riffusion does a genuine latent blend rather than a crossfade. Sweep the stick mid-set and you get a slow morph from sub-bass into shimmer that feels like the model is composing in front of you. The seed-nudge on Y means you can hold the same blend but walk through nearby variations — same shape, different details.
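To make the difference between a latent blend and a crossfade concrete, here is a small NumPy sketch. It assumes the blend between the two seeds' starting noise is a spherical interpolation (slerp), a common choice for diffusion latents; we have not checked that it matches Riffusion's internals line for line, and the `slerp` function below is our own illustration rather than the library's code.

```python
import numpy as np


def slerp(t, v0, v1, dot_threshold=0.9995):
    """Spherically interpolate between two arrays of the same shape."""
    a, b = v0.ravel(), v1.ravel()
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    if abs(cos_theta) > dot_threshold:
        # Nearly parallel vectors: plain linear interpolation is stable here.
        return (1 - t) * v0 + t * v1
    theta = np.arccos(cos_theta)
    w0 = np.sin((1 - t) * theta) / np.sin(theta)
    w1 = np.sin(t * theta) / np.sin(theta)
    return w0 * v0 + w1 * v1


# Two noise tensors standing in for seed_a and seed_b.
# (4, 64, 64) is the latent shape Stable Diffusion uses for a 512x512 image.
noise_a = np.random.default_rng(42).standard_normal((4, 64, 64))
noise_b = np.random.default_rng(43).standard_normal((4, 64, 64))

alpha = 0.5  # stick centred
mid_slerp = slerp(alpha, noise_a, noise_b)
mid_lerp = (1 - alpha) * noise_a + alpha * noise_b  # the "crossfade" version

print("slerp std:", mid_slerp.std())
print("lerp std: ", mid_lerp.std())
```

The slerp midpoint keeps a standard deviation close to 1, like the noise the model expects to start from, while the linear midpoint shrinks to roughly 0.7. That is one reason a mid-stick blend can still sound like a coherent clip rather than a washed-out average of two.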
The touchpad as a seed library
The DualSense touchpad is 52 mm × 23 mm of capacitive surface that maps cleanly to an absolute seed range. Drag your finger across it and the seed jumps to wherever your touch lands. This is the single biggest workflow win — instead of cycling seeds with a button until you find one you like, you scrub. We default to mapping the touchpad to seeds 0–1024; that is plenty to find a good one and easy to remember a position by feel.
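A rough sketch of the touchpad mapping, assuming the touchpad X position arrives as an ordinary 7-bit CC value (0 to 127). The helper names `touch_to_seed` and `seed_to_touch` are hypothetical and only here for illustration.

```python
MAX_SEED = 1024  # top of the seed range used in this guide


def touch_to_seed(cc_value, max_seed=MAX_SEED):
    """Map a 7-bit touchpad position (0-127) to an absolute seed."""
    cc_value = max(0, min(127, cc_value))
    return round(cc_value / 127 * max_seed)


def seed_to_touch(seed, max_seed=MAX_SEED):
    """Inverse mapping: which touchpad position lands nearest a given seed."""
    return round(seed / max_seed * 127)


# Neighbouring touchpad positions land about 8 seeds apart.
seeds = [touch_to_seed(v) for v in range(128)]
gaps = {b - a for a, b in zip(seeds, seeds[1:])}
print(min(seeds), max(seeds), sorted(gaps))
```

Because a 7-bit controller only resolves 128 positions, the touchpad reaches about one seed in eight across the range. The seed-nudge on the right stick covers the seeds in between.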
Inference steps vs. trigger pressure
The left trigger is variable, so we wire it to inference steps. Light press = 15 steps = grainy and fast = great for jamming. Hard press = 50 steps = clean and slow = great for the take you actually want to keep. This gives the trigger a real performance role rather than wasting it as a binary fire button.
Output quality, honestly
Riffusion outputs are not pristine. The spectrogram-to-audio conversion loses high-frequency information and adds a soft phasey artefact, especially on transients. For pads, drones, ambient beds, and lo-fi loops this is fine — often it sounds intentional. For anything where you need tight transients (drums, lead synths) you will want to layer over the top. Treat Riffusion like a texture generator with a great control surface, not a finished-track engine.
Living with the GPU bill
Running locally on a 3090 or 4090 costs nothing per generation. Running through a hosted endpoint is closer to $0.01–0.02 per clip. For a 60-minute jam with one fire every 20 seconds, that is 180 clips: about $1.80–3.60 on hosted, free on local. The local path is the right call once you are jamming more than once a week.
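The session arithmetic is easy to rerun for your own habits. This is a back-of-envelope sketch using the per-clip prices quoted above; the variable names are ours, and the prices are this guide's estimate rather than any provider's published rate.

```python
session_minutes = 60
seconds_between_fires = 20
cost_low, cost_high = 0.01, 0.02  # USD per hosted clip, as estimated above

# Number of generations fired over the whole session.
clips = session_minutes * 60 // seconds_between_fires
hosted_low = clips * cost_low
hosted_high = clips * cost_high

print(f"{clips} clips, ${hosted_low:.2f} to ${hosted_high:.2f} hosted")
```

Change `seconds_between_fires` to match how often you actually pull the trigger; at one fire every 10 seconds the hosted bill doubles.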
The verdict
Riffusion is the AI-music model that most rewards a gamepad. The alpha axis was made for a stick; the seed scrub was made for a touchpad; the trigger pressure was made for inference steps. If you have already built a MusicGen pilot rig, Riffusion bolts onto the same bridge — same MIDI mapping, different generation backend. Universal Controller MIDI handles the controller plumbing; the fifty-line Python loop is all you need.