
Riffusion Spectrogram — Gamepad XY as Latent Pilot

Drive Riffusion's spectrogram diffusion from a PS5 DualSense. Right stick rides prompt-blend alpha, L2 sets the step count, R2 fires inference, and the touchpad scrubs seeds live.

By Aidxn Design

Riffusion gamepad control is the most interesting AI-music workflow we have run. Unlike MusicGen, which is essentially batch generation, Riffusion exposes a continuous interpolation between two prompts at inference time, which is exactly the shape a gamepad stick wants to drive. The right stick becomes a latent-space joystick, sweeping between "thick dub bass" and "shimmering ambient pad" from one generated clip to the next. This guide wires Riffusion to a PS5 DualSense via Universal Controller MIDI and a small Python loop.

TL;DR
  • What it is: Stable Diffusion fine-tuned on spectrogram images. Image → estimated phase (Griffin-Lim) → inverse STFT → audio.
  • What the gamepad does: right stick rides prompt-blend alpha, L2 sets inference steps, R2 fires inference, touchpad scrubs through seeds.
  • Speed: 3–6 seconds per 5-second clip on a 3090. Fast enough to feel jam-able.
  • Quality ceiling: lo-fi, phasey, perfect for textures, bad for polished mixes.

Why spectrogram diffusion is gamepad-friendly

Most AI-music models work like sequence-to-sequence systems: they read a prompt, sample tokens, and decode audio. The interaction surface is essentially "wait, then listen". Riffusion is different. It is Stable Diffusion under the hood, so both the prompt conditioning and the starting noise are continuous vectors that can be interpolated at inference time. That continuity is the gamepad's natural habitat. A stick gives you two continuous axes, and the model gives you two parameters to put on them: alpha and seed-nudge. Everything else is plumbing.

The mapping that actually works

Input         | MIDI    | Param           | What it does
--------------|---------|-----------------|---------------------------------------------------
Right stick X | CC 12   | alpha           | Interpolation between prompt A (0) and prompt B (1)
Right stick Y | CC 13   | seed-nudge      | Walks the seed by ±32 around the base
Left stick Y  | CC 14   | CFG scale       | How tightly the model sticks to the prompt
Touchpad X    | CC 16   | seed scrub      | Absolute seed across the touchpad surface (0–1024)
L2 trigger    | CC 15   | inference steps | 15 (fast/grainy) to 50 (slow/clean)
R2 trigger    | Note 60 | fire            | Run inference with current state
X (Cross)     | Note 61 | swap A/B        | Promote current blend to new prompt A
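
The scaling behind each continuous control in the table can be written as small pure functions, which makes the ranges easy to check without a controller plugged in. This is a minimal sketch: the function names are hypothetical (they belong to neither Riffusion nor mido), and the formulas assume standard 7-bit CC values (0 to 127).

```python
# Hypothetical helper names, for illustration only.
# Each function maps a 7-bit MIDI CC value (0..127) to one generation parameter.

def cc_to_alpha(value: int) -> float:
    """CC 12: blend position, 0.0 (prompt A) to 1.0 (prompt B)."""
    return value / 127


def cc_to_seed_nudge(value: int) -> int:
    """CC 13: seed offset around the base, -32 to +31, zero at CC value 64."""
    return (value - 64) // 2


def cc_to_cfg(value: int) -> float:
    """CC 14: guidance (CFG) scale, 3.0 to 15.0."""
    return 3.0 + (value / 127) * 12.0


def cc_to_steps(value: int) -> int:
    """CC 15: inference steps, 15 to 50."""
    return 15 + int((value / 127) * 35)


def cc_to_base_seed(value: int) -> int:
    """CC 16: absolute seed, 0 to 1016 in steps of 8."""
    return value * 8


if __name__ == "__main__":
    for cc_value in (0, 64, 127):
        print(cc_value, cc_to_alpha(cc_value), cc_to_seed_nudge(cc_value),
              cc_to_cfg(cc_value), cc_to_steps(cc_value), cc_to_base_seed(cc_value))
```

Two details fall out of the arithmetic: a centred stick (CC value 64) gives an alpha of about 0.504 rather than exactly 0.5, and a 7-bit touchpad axis reaches 128 distinct seeds, topping out at 1016.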

The Python loop

Riffusion exposes a clean pipeline class. The loop below listens to MIDI, reads CC state, and fires inference on note-on. Roughly fifty lines including imports.

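import os
import time

import mido
import torch
from riffusion.riffusion_pipeline import RiffusionPipeline
from riffusion.spectrogram_image_converter import SpectrogramImageConverter
from riffusion.spectrogram_params import SpectrogramParams

pipe = RiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",
    torch_dtype=torch.float16,
).to("cuda")

# The converter takes the spectrogram settings as `params`; library defaults are assumed here.
converter = SpectrogramImageConverter(params=SpectrogramParams())

# base_seed comes from the touchpad; nudge is the right-stick offset around it.
state = {"alpha": 0.5, "base_seed": 42, "nudge": 0, "cfg": 7.0, "steps": 25}
PROMPT_A = "thick dub bass 90 bpm subwoofer"
PROMPT_B = "shimmering ambient pad reverberant"

OUT_DIR = "./watched"
os.makedirs(OUT_DIR, exist_ok=True)  # the export below fails if the folder is missing

inp = mido.open_input("Universal Controller MIDI")
for msg in inp:  # blocks until the next MIDI message arrives
    if msg.type == "control_change":
        if msg.control == 12:    # right stick X: blend A -> B
            state["alpha"] = msg.value / 127
        elif msg.control == 13:  # right stick Y: -32..+31 around the base seed
            state["nudge"] = (msg.value - 64) // 2
        elif msg.control == 14:  # left stick Y: CFG scale 3.0..15.0
            state["cfg"] = 3.0 + (msg.value / 127) * 12.0
        elif msg.control == 15:  # L2: inference steps 15..50
            state["steps"] = 15 + int((msg.value / 127) * 35)
        elif msg.control == 16:  # touchpad X: absolute base seed 0..1016
            state["base_seed"] = msg.value * 8
    # A note_on with velocity 0 is a note-off in disguise, so ignore it.
    elif msg.type == "note_on" and msg.note == 60 and msg.velocity > 0:
        seed = max(0, state["base_seed"] + state["nudge"])
        # Call signature as used in this article; adapt it if your installed
        # riffusion version exposes a different inference entry point.
        image = pipe(
            prompt_a=PROMPT_A,
            prompt_b=PROMPT_B,
            alpha=state["alpha"],
            seed_a=seed,
            seed_b=seed + 1,
            num_inference_steps=state["steps"],
            guidance_scale=state["cfg"],
        ).images[0]
        segment = converter.audio_from_spectrogram_image(image)
        segment.export(f"{OUT_DIR}/{int(time.time())}.wav", format="wav")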

What happens when you sweep the stick

Alpha is the magic axis. At 0.0 the output is purely prompt A — thick dub bass. At 1.0 it is purely prompt B — ambient pad. Anywhere in between, Riffusion does a genuine latent blend rather than a crossfade. Sweep the stick mid-set and you get a slow morph from sub-bass into shimmer that feels like the model is composing in front of you. The seed-nudge on Y means you can hold the same blend but walk through nearby variations — same shape, different details.
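
One way to picture the difference between a latent blend and a crossfade: the blend mixes the starting vectors before generation, while a crossfade mixes two finished clips. The sketch below is illustrative only. It assumes the blend is a spherical interpolation (slerp) between two noise vectors, a common choice for diffusion latents; this article does not show Riffusion's internals, and the function names here are hypothetical.

```python
import math


def lerp(a, b, alpha):
    """Straight-line mix of two vectors (what a naive crossfade does)."""
    return [(1 - alpha) * x + alpha * y for x, y in zip(a, b)]


def slerp(a, b, alpha, eps=1e-6):
    """Spherical interpolation: move along the arc between a and b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding drift
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Vectors are (anti)parallel, so the arc is undefined; fall back to lerp.
        return lerp(a, b, alpha)
    weight_a = math.sin((1 - alpha) * omega) / sin_omega
    weight_b = math.sin(alpha * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


if __name__ == "__main__":
    noise_a = [1.0, 0.0]
    noise_b = [0.0, 1.0]
    print("lerp  midpoint norm:", norm(lerp(noise_a, noise_b, 0.5)))
    print("slerp midpoint norm:", norm(slerp(noise_a, noise_b, 0.5)))
```

For two orthogonal unit vectors, the straight-line midpoint shrinks to a length of about 0.707, while the slerp midpoint keeps a length of 1. Keeping the magnitude steady across the sweep is the usual argument for interpolating noise this way.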

The touchpad as a seed library

The DualSense touchpad is 52 mm × 23 mm of capacitive surface that maps cleanly to an absolute seed range. Drag your finger across it and the seed jumps to wherever your touch lands. This is the single biggest workflow win: instead of cycling seeds with a button until you find one you like, you scrub. We map the touchpad to roughly seeds 0–1024. Because a MIDI CC carries only 128 values, the scrub lands on every eighth seed (0, 8, 16, up to 1016), which is plenty to find a good one and easy to remember a position by feel.

Inference steps vs. trigger pressure

The left trigger is variable, so we wire it to inference steps. Light press = 15 steps = grainy and fast = great for jamming. Hard press = 50 steps = clean and slow = great for the take you actually want to keep. This gives the trigger a real performance role rather than wasting it as a binary fire button.

Output quality, honestly

Riffusion outputs are not pristine. The spectrogram-to-audio conversion loses high-frequency information and adds a soft phasey artefact, especially on transients. For pads, drones, ambient beds, and lo-fi loops this is fine — often it sounds intentional. For anything where you need tight transients (drums, lead synths) you will want to layer over the top. Treat Riffusion like a texture generator with a great control surface, not a finished-track engine.

Living with the GPU bill

Running locally on a 3090 or 4090 costs nothing per generation. Running through a hosted endpoint is closer to $0.01–0.02 per clip. For a 60-minute jam with one fire every 20 seconds, that is 180 clips: roughly $1.80 to $3.60 on hosted, free on local. The local path is the right call once you are jamming more than once a week.
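
The jam-session arithmetic is easy to rerun for your own habits. A minimal sketch, using the per-clip prices quoted above as inputs; the function name is hypothetical.

```python
def jam_cost(minutes, seconds_between_fires, price_low, price_high):
    """Return (clip count, low estimate, high estimate) for one hosted session."""
    clips = (minutes * 60) // seconds_between_fires
    return clips, clips * price_low, clips * price_high


if __name__ == "__main__":
    clips, low, high = jam_cost(60, 20, 0.01, 0.02)
    print(f"{clips} clips, ${low:.2f} to ${high:.2f} hosted")
    # prints: 180 clips, $1.80 to $3.60 hosted
```

Halving the gap between fires to 10 seconds doubles the clip count, and the hosted bill with it.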

The verdict

Riffusion is the AI-music model that most rewards a gamepad. The alpha axis was made for a stick; the seed scrub was made for a touchpad; the trigger pressure was made for inference steps. If you have already built a MusicGen pilot rig, Riffusion bolts onto the same bridge — same MIDI mapping, different generation backend. Universal Controller MIDI handles the controller plumbing; the fifty-line Python loop is all you need.
