How does Riffusion actually work?

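It is Stable Diffusion fine-tuned to generate spectrogram images instead of photographs. A spectrogram image stores magnitude but not phase, so the phase is estimated (Riffusion uses the Griffin-Lim algorithm) before an inverse short-time Fourier transform turns the image back into audio. The clever bit is reusing all the image-diffusion tooling for sound.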

Is the audio quality good?

Mid. Spectrogram-to-audio loses high-frequency detail and adds a faint phasey artefact. Great for textures, lo-fi loops, and pads. Bad for vocals or anything you want to drop into a polished mix without extra work.

How fast is generation?

About 3–6 seconds per 5-second clip on an RTX 3090 at 25 inference steps. Much faster than MusicGen, but lower fidelity.

Does the latent-space interpolation matter?

It is the whole point. A static prompt-to-spectrogram model is a one-shot trick. Interpolating between two prompts while the model denoises gives you a continuous morph the gamepad can ride live.

Can I run it on Apple Silicon?

Yes — Riffusion runs on MPS via PyTorch. Slower than CUDA but functional. An M3 Max does a 5-second clip in 8–10 seconds.

Riffusion Spectrogram — Gamepad XY as Latent Pilot

Riffusion gamepad control is the most interesting AI-music workflow we have run. Unlike MusicGen, which is essentially batch generation, Riffusion exposes a continuous interpolation between two prompts during inference — which is exactly the shape a gamepad stick wants to drive. The right stick becomes a latent-space joystick, sweeping between "thick dub bass" and "shimmering ambient pad" in real time. This guide wires Riffusion to a PS5 DualSense via Universal Controller MIDI and a small Python loop.

TL;DR

What it is: Stable Diffusion fine-tuned on spectrograms. Image → ISTFT → audio.
What the gamepad does: right stick rides prompt-blend alpha, L2 sets inference steps, R2 fires generation, touchpad scrubs through seeds.
Speed: 3–6 seconds per 5-second clip on a 3090. Fast enough to feel jam-able.
Quality ceiling: lo-fi, phasey, perfect for textures, bad for polished mixes.

Why spectrogram diffusion is gamepad-friendly

Most AI-music models are sequence-to-sequence: they read a prompt, sample tokens, decode audio. The interaction surface is essentially "wait, then listen". Riffusion is different: because it is Stable Diffusion underneath, two prompts (and two seeds) can be blended in latent space, with a single number, alpha, setting the position between them. That blend is the gamepad's natural habitat. A stick is two continuous axes; the controls it drives here are alpha, which is continuous, and seed-nudge, which steps through neighbouring seeds. Everything else is plumbing.

The mapping that actually works

Input	MIDI	Param	What it does
Right stick X	CC 12	alpha	Interpolation between prompt A (0) and prompt B (1)
Right stick Y	CC 13	seed-nudge	Walks the seed by ±32 around the base
Left stick Y	CC 14	CFG scale	How tightly the model sticks to the prompt
Touchpad X	CC 16	seed scrub	Absolute seed across the touchpad surface (0–1024)
L2 trigger	CC 15	inference steps	15 (fast/grainy) to 50 (slow/clean)
R2 trigger	Note 60	fire	Run inference with current state
X (Cross)	Note 61	swap A/B	Promote current blend to new prompt A
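
The scaling behind the table can be sketched as a small lookup of pure functions, which makes it easy to check the ranges without a controller or a GPU attached. This is a minimal sketch: the CC numbers come from the table and the scale factors follow the loop in this guide, while the names `CC_MAP`, `apply_cc`, `effective_seed`, `base_seed`, and `seed_nudge` are hypothetical and exist only for illustration.

```python
# Hypothetical helper names; CC numbers and ranges follow this guide's mapping.
CC_MAP = {
    12: ("alpha", lambda v: v / 127),                  # 0.0 (prompt A) to 1.0 (prompt B)
    13: ("seed_nudge", lambda v: (v - 64) // 2),       # -32 to +31 around the base seed
    14: ("cfg", lambda v: 3.0 + (v / 127) * 12.0),     # 3.0 to 15.0
    15: ("steps", lambda v: 15 + int((v / 127) * 35)), # 15 to 50
    16: ("base_seed", lambda v: v * 8),                # 0 to 1016 over 7-bit MIDI
}


def apply_cc(state, control, value):
    """Return a new state dict with one 7-bit CC value (0-127) applied."""
    if control not in CC_MAP:
        return state  # unmapped controller: leave the state untouched
    name, scale = CC_MAP[control]
    return {**state, name: scale(value)}


def effective_seed(state):
    """Combine the scrubbed base seed with the stick nudge, never below zero."""
    return max(0, state["base_seed"] + state["seed_nudge"])


state = {"alpha": 0.5, "base_seed": 42, "seed_nudge": 0, "cfg": 7.0, "steps": 25}
state = apply_cc(state, 16, 64)  # touchpad centre -> base seed 512
state = apply_cc(state, 13, 70)  # slight stick push -> nudge of +3
print(effective_seed(state))     # 515
```

Keeping the base seed and the nudge as separate fields is a design choice: the stick then walks around whichever seed the touchpad last landed on, instead of around a fixed number.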

The Python loop

Riffusion exposes a clean pipeline class. The loop below listens to MIDI, reads CC state, and fires inference on note-on. Roughly fifty lines including imports.

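import os
import time

import mido
import torch
from riffusion.riffusion_pipeline import RiffusionPipeline
from riffusion.spectrogram_image_converter import SpectrogramImageConverter

# NOTE: the pipeline and converter calls are written as this guide uses them;
# check the constructor and call signatures against your installed Riffusion.
pipe = RiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1",
    torch_dtype=torch.float16,
).to("cuda")

converter = SpectrogramImageConverter()

OUT_DIR = "./watched"
os.makedirs(OUT_DIR, exist_ok=True)  # export fails if the folder is missing

PROMPT_A = "thick dub bass 90 bpm subwoofer"
PROMPT_B = "shimmering ambient pad reverberant"

# base_seed comes from the touchpad, nudge from right stick Y. Tracking them
# separately lets the nudge walk around whichever seed was last scrubbed to.
state = {"alpha": 0.5, "base_seed": 42, "nudge": 0, "cfg": 7.0, "steps": 25}

inp = mido.open_input("Universal Controller MIDI")
for msg in inp:
    if msg.type == "control_change":
        if msg.control == 12:    # right stick X: blend, 0.0 (A) to 1.0 (B)
            state["alpha"] = msg.value / 127
        elif msg.control == 13:  # right stick Y: seed nudge, -32 to +31
            state["nudge"] = (msg.value - 64) // 2
        elif msg.control == 14:  # left stick Y: CFG scale, 3.0 to 15.0
            state["cfg"] = 3.0 + (msg.value / 127) * 12.0
        elif msg.control == 15:  # L2: inference steps, 15 to 50
            state["steps"] = 15 + int((msg.value / 127) * 35)
        elif msg.control == 16:  # touchpad X: absolute seed, 0 to 1016
            state["base_seed"] = msg.value * 8
    # A note_on with velocity 0 means note-off, so require velocity > 0 to fire.
    elif msg.type == "note_on" and msg.note == 60 and msg.velocity > 0:
        seed = max(0, state["base_seed"] + state["nudge"])
        image = pipe(
            prompt_a=PROMPT_A,
            prompt_b=PROMPT_B,
            alpha=state["alpha"],
            seed_a=seed,
            seed_b=seed + 1,
            num_inference_steps=state["steps"],
            guidance_scale=state["cfg"],
        ).images[0]
        segment = converter.audio_from_spectrogram_image(image)
        # Millisecond timestamp avoids overwriting clips fired in the same second.
        path = os.path.join(OUT_DIR, f"{int(time.time() * 1000)}.wav")
        segment.export(path, format="wav")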

What happens when you sweep the stick

Alpha is the magic axis. At 0.0 the output is purely prompt A — thick dub bass. At 1.0 it is purely prompt B — ambient pad. Anywhere in between, Riffusion does a genuine latent blend rather than a crossfade. Sweep the stick mid-set and you get a slow morph from sub-bass into shimmer that feels like the model is composing in front of you. The seed-nudge on Y means you can hold the same blend but walk through nearby variations — same shape, different details.
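
The difference between a latent blend and a crossfade can be illustrated with a few lines of plain Python. The text above does not spell out how the blend is computed, so treat this as a sketch under an assumption: a common approach in diffusion interpolation is to interpolate prompt embeddings linearly and noise latents spherically (slerp), because a straight average of two noise vectors shrinks their magnitude. The function names `lerp` and `slerp` and the two-element toy vectors are illustrative only.

```python
import math


def lerp(a, b, t):
    """Straight-line blend between two vectors (what a crossfade does)."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def slerp(a, b, t, eps=1e-6):
    """Spherical blend: moves along the arc between a and b.

    Assumes both vectors are non-zero. Falls back to lerp when the vectors
    are nearly parallel or opposite, where the arc is ill-defined.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < eps:
        return lerp(a, b, t)
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


a, b = [1.0, 0.0], [0.0, 1.0]      # toy stand-ins for two noise latents
halfway_lerp = lerp(a, b, 0.5)     # magnitude shrinks to about 0.707
halfway_slerp = slerp(a, b, 0.5)   # magnitude stays at about 1.0
print(norm(halfway_lerp), norm(halfway_slerp))
```

At the midpoint the linear blend loses roughly 30% of its magnitude while the spherical blend keeps it, which is one reason the halfway point of a latent blend can sound like a coherent clip rather than two quieter clips layered together.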

The touchpad as a seed library

The DualSense touchpad is 52 mm × 23 mm of capacitive surface that maps cleanly to an absolute seed range. Drag your finger across it and the seed jumps to wherever your touch lands. This is the single biggest workflow win: instead of cycling seeds with a button until you find one you like, you scrub. We default to mapping the touchpad to seeds 0–1024 (over 7-bit MIDI, the loop's value × 8 scaling tops out at 1016); that is plenty to find a good one and easy to remember a position by feel.

Inference steps vs. trigger pressure

The left trigger is variable, so we wire it to inference steps. Light press = 15 steps = grainy and fast = great for jamming. Hard press = 50 steps = clean and slow = great for the take you actually want to keep. This gives the trigger a real performance role rather than wasting it as a binary fire button.

Output quality, honestly

Riffusion outputs are not pristine. The spectrogram-to-audio conversion loses high-frequency information and adds a soft phasey artefact, especially on transients. For pads, drones, ambient beds, and lo-fi loops this is fine — often it sounds intentional. For anything where you need tight transients (drums, lead synths) you will want to layer over the top. Treat Riffusion like a texture generator with a great control surface, not a finished-track engine.
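
Why transients suffer can be shown with a toy example. A spectrogram keeps the magnitude of each frequency bin and throws away the phase, and phase is what pins a transient to its moment in time. The sketch below uses a naive DFT on a 16-sample click and rebuilds it with the phase set to zero. That is the crudest possible stand-in and an assumption for illustration only: a real converter estimates phase rather than zeroing it, so the damage is much milder, but the failure mode is the same kind.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (fine for a 16-sample toy signal)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


# A single click at sample 5: a minimal "transient".
signal = [0.0] * 16
signal[5] = 1.0

spectrum = dft(signal)
magnitudes = [abs(c) for c in spectrum]            # all a spectrogram column keeps
rebuilt = idft([complex(m, 0.0) for m in magnitudes])

# The magnitude spectrum is unchanged, but the click has moved to sample 0.
peak_index = max(range(len(rebuilt)), key=lambda i: abs(rebuilt[i]))
print(peak_index)
```

Both signals have identical magnitude spectra, yet the click lands in the wrong place. Sustained material such as pads and drones is far less sensitive to this than drums, which matches the advice above to layer tight transients over the top.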

Living with the GPU bill

Running locally on a 3090 or 4090 costs nothing per generation. Running through a hosted endpoint is closer to $0.01–0.02 per clip. For a 60-minute jam with one fire every 20 seconds, that is 180 clips: about $1.80–3.60 on hosted, free on local. The local path is the right call once you are jamming more than once a week.
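
The arithmetic is simple enough to rerun for your own session length and firing rate. This sketch uses the per-clip price range quoted above; `jam_cost` is a hypothetical helper name, and real hosted pricing will vary by provider.

```python
def jam_cost(minutes, seconds_between_fires, price_per_clip):
    """Return (clip count, total cost) for a session on a hosted endpoint."""
    clips = (minutes * 60) // seconds_between_fires
    return clips, clips * price_per_clip


clips, low = jam_cost(60, 20, 0.01)   # cheap end of the quoted range
_, high = jam_cost(60, 20, 0.02)      # expensive end of the quoted range
print(f"{clips} clips, ${low:.2f} to ${high:.2f}")
```

For the session described above this gives 180 clips and a cost between about $1.80 and $3.60.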

The verdict

Riffusion is the AI-music model that most rewards a gamepad. The alpha axis was made for a stick; the seed scrub was made for a touchpad; the trigger pressure was made for inference steps. If you have already built a MusicGen pilot rig, Riffusion bolts onto the same bridge — same MIDI mapping, different generation backend. Universal Controller MIDI handles the controller plumbing; the fifty-line Python loop is all you need.