Whisper v3 gamepad workflows are the niche use of AI tooling we keep returning to — transcribe a vocal, extract word-level timestamps, convert them into MIDI cues, fire them from a controller button during the show. Whisper does the boring transcription. The gamepad arms and releases the cue track. The result is lyric-synced visuals, lights, and stingers without a single hand-clicked timeline marker.
- Model:
whisper-large-v3ordistil-large-v3(Hugging Face). - Granularity: word-level timestamps, not segment-level.
- Conversion: each word becomes a MIDI note on channel 16.
- Gamepad role: mute/unmute the lyric MIDI track to arm/disarm cue firing.
- Use cases: visuals, lights, OBS scenes, sample stings, subtitle overlays.
Why Whisper for live work
OpenAI released Whisper Large V3 with word-level timestamping baked in. The model is permissively licensed (MIT), runs on a laptop GPU or even on Apple Silicon CPU at usable speed, and handles a wide range of accents. The openai/whisper repo has the canonical implementation; the Hugging Face transformers port adds a streaming option and faster batching.
For live performance work we want the timestamps, not the text. Word boundaries are good to within ~80 ms on clean recordings, which is well inside the threshold where a triggered visual or lighting cue feels synced to the lyric.
Step one — transcribe
Run the model with return_timestamps="word". Here's the minimum-viable pipeline:
# pip install transformers accelerate torch soundfile librosa
from transformers import pipeline
import json
pipe = pipeline(
"automatic-speech-recognition",
model="openai/whisper-large-v3",
chunk_length_s=30,
return_timestamps="word",
device="mps", # or "cuda", or -1 for CPU
)
result = pipe("vocal.wav")
# result["chunks"] = [
# { "text": " The", "timestamp": [0.0, 0.18] },
# { "text": " quick", "timestamp": [0.18, 0.42] },
# ...
# ]
with open("lyrics.json", "w") as f:
json.dump(result["chunks"], f, indent=2)
On an M2 Pro, a 3-minute vocal takes about 25 seconds with Whisper Large V3. Distil-Whisper V3 trims that to under 5 seconds with negligible word error rate cost — switch to distil-whisper/distil-large-v3 if you're batching.
Step two — convert words to MIDI
Each word becomes a MIDI note. We use note 60 on channel 16 by convention — channel 16 keeps lyric cues out of the way of performance MIDI on channels 1–4. Duration matches the word's spoken length so the note-off fires at the word's end (handy for visual fades).
# pip install mido
import json
import mido
mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)
# Mido uses ticks; assume 480 ticks per beat, 120 BPM, so 1 s = 960 ticks.
TICKS_PER_SEC = 960
last_tick = 0
with open("lyrics.json") as f:
chunks = json.load(f)
for c in chunks:
start_s, end_s = c["timestamp"]
if start_s is None or end_s is None:
continue
start_tick = int(start_s * TICKS_PER_SEC)
end_tick = int(end_s * TICKS_PER_SEC)
delta_on = start_tick - last_tick
track.append(mido.Message("note_on", note=60, velocity=100,
channel=15, time=max(0, delta_on)))
track.append(mido.Message("note_off", note=60, velocity=0,
channel=15, time=max(1, end_tick - start_tick)))
last_tick = end_tick
mid.save("lyrics.mid")
Drop lyrics.mid on a MIDI track in your DAW, line up bar 1 with the vocal start, and you've got a click-track of every lyric in the song.
Step three — gamepad arming
The MIDI clip will fire every word on playback. That's too eager. We want to arm cue firing only during specific sections of the show. Bind a DualSense button to mute/unmute the lyric track:
| Input | MIDI | Function |
|---|---|---|
| Cross | Note 60, ch 1 | Toggle lyric track mute (cue firing on/off) |
| Square | Note 61, ch 1 | Skip to next lyric section marker |
| Triangle | Note 62, ch 1 | Re-arm (re-mute, ready for next verse) |
| Circle | Note 63, ch 1 | Kill all cue firing (panic mute) |
| L1 hold | Note 64, ch 1 | Manual cue override — every press fires the next word |
| R2 trigger | CC 22, ch 1 | Cue intensity (downstream visual amplitude) |
Now the controller is a cue-arming surface. Hit Cross at the start of a verse — every word in that verse fires its bound event. Hit Cross again at the end of the verse — firing stops. The performer keeps both hands free for the mic.
What the cues actually do
The lyric MIDI track is routed wherever you need cues to land:
- Resolume — each note triggers the next clip in a column. Result: visuals advance per word.
- Ableton Drum Rack — each note fires a vocal-chop sting, building a layered pad of every word in the verse.
- MIDI-to-DMX bridge — each note bumps stage-light intensity. Lyric-synced lighting without a programmer.
- OBS scene switching — bridge to OBS WebSocket, each word toggles a lower-third text overlay.
- QLab — see the QLab cue guide for the full theatre rig.
The honest limits of Whisper for lyrics
Whisper was trained on speech, not song. That shows up as three predictable problems:
- Melisma collapse. One held vowel across multiple notes often gets transcribed as a single word. You lose the visual rhythm on those passages — fix by hand-splitting in the JSON.
- Doubled words. Some vocal styles (esp. trap doubling, ad-libs) produce two transcripts of the same word offset by ~150 ms. Deduplicate by collapsing words whose start times are within 100 ms of each other.
- Mumble-rap territory. The model's WER climbs past 15% on highly stylised vocal delivery. If the vocal is clear, you're fine. If it's deliberately blurry, plan extra cleanup time.
# Quick dedupe of doubled words
def dedupe(chunks, min_gap_s=0.1):
out = []
for c in chunks:
start = c["timestamp"][0]
if start is None: continue
if out and abs(out[-1]["timestamp"][0] - start) < min_gap_s:
continue
out.append(c)
return out End-to-end check
Once you've built one of these, the recipe sticks:
vocal.wav ──► Whisper V3 ──► lyrics.json ──► mido ──► lyrics.mid
│
▼
DAW track (ch 16)
│
▼
┌───────────────┴───────────────┐
│ │
Resolume cues DMX light cues
│ │
▼ ▼
┌─ Visual sting per word ─┐ Stage lights per word
└─────────────────────────┘
▲
│
DualSense Cross button
(arms / disarms firing) Where this is honest about AI's value
The model is doing the part nobody enjoys — listening to a track and timestamping every word. A human can do it, but it takes hours. Whisper does it in seconds with acceptable accuracy and the gamepad gives the performer the on/off switch that keeps the cues musical instead of robotic. Pair this with our Twitch stinger workflow for streaming or the DMX bridge for lighting. Universal Controller MIDI ties the gamepad to the cue arm.