Blog AI 9 min read

Whisper V3 Lyric Transcription — Gamepad-Fired Cues

Transcribe vocals with Whisper V3, get word-level timestamps, convert to MIDI notes, fire them from a gamepad button. Lyric-synced visuals, lights, stings.

By Aidxn Design

Whisper v3 gamepad workflows are the niche use of AI tooling we keep returning to — transcribe a vocal, extract word-level timestamps, convert them into MIDI cues, fire them from a controller button during the show. Whisper does the boring transcription. The gamepad arms and releases the cue track. The result is lyric-synced visuals, lights, and stingers without a single hand-clicked timeline marker.

TL;DR
  • Model: whisper-large-v3 or distil-large-v3 (Hugging Face).
  • Granularity: word-level timestamps, not segment-level.
  • Conversion: each word becomes a MIDI note on channel 16.
  • Gamepad role: mute/unmute the lyric MIDI track to arm/disarm cue firing.
  • Use cases: visuals, lights, OBS scenes, sample stings, subtitle overlays.

Why Whisper for live work

OpenAI released Whisper Large V3 with word-level timestamping baked in. The model is permissively licensed (MIT), runs on a laptop GPU or even on Apple Silicon CPU at usable speed, and handles a wide range of accents. The openai/whisper repo has the canonical implementation; the Hugging Face transformers port adds a streaming option and faster batching.

For live performance work we want the timestamps, not the text. Word boundaries are good to within ~80 ms on clean recordings, which is well inside the threshold where a triggered visual or lighting cue feels synced to the lyric.

Step one — transcribe

Run the model with return_timestamps="word". Here's the minimum-viable pipeline:

# pip install transformers accelerate torch soundfile librosa
from transformers import pipeline
import json

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
    return_timestamps="word",
    device="mps",  # or "cuda", or -1 for CPU
)

result = pipe("vocal.wav")

# result["chunks"] = [
#   { "text": " The", "timestamp": [0.0, 0.18] },
#   { "text": " quick", "timestamp": [0.18, 0.42] },
#   ...
# ]

with open("lyrics.json", "w") as f:
    json.dump(result["chunks"], f, indent=2)

On an M2 Pro, a 3-minute vocal takes about 25 seconds with Whisper Large V3. Distil-Whisper V3 trims that to under 5 seconds with negligible word error rate cost — switch to distil-whisper/distil-large-v3 if you're batching.

Step two — convert words to MIDI

Each word becomes a MIDI note. We use note 60 on channel 16 by convention — channel 16 keeps lyric cues out of the way of performance MIDI on channels 1–4. Duration matches the word's spoken length so the note-off fires at the word's end (handy for visual fades).

# pip install mido
import json
import mido

mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)

# Mido uses ticks; assume 480 ticks per beat, 120 BPM, so 1 s = 960 ticks.
TICKS_PER_SEC = 960
last_tick = 0

with open("lyrics.json") as f:
    chunks = json.load(f)

for c in chunks:
    start_s, end_s = c["timestamp"]
    if start_s is None or end_s is None:
        continue
    start_tick = int(start_s * TICKS_PER_SEC)
    end_tick = int(end_s * TICKS_PER_SEC)

    delta_on = start_tick - last_tick
    track.append(mido.Message("note_on",  note=60, velocity=100,
                              channel=15, time=max(0, delta_on)))
    track.append(mido.Message("note_off", note=60, velocity=0,
                              channel=15, time=max(1, end_tick - start_tick)))
    last_tick = end_tick

mid.save("lyrics.mid")

Drop lyrics.mid on a MIDI track in your DAW, line up bar 1 with the vocal start, and you've got a click-track of every lyric in the song.

Step three — gamepad arming

The MIDI clip will fire every word on playback. That's too eager. We want to arm cue firing only during specific sections of the show. Bind a DualSense button to mute/unmute the lyric track:

InputMIDIFunction
CrossNote 60, ch 1Toggle lyric track mute (cue firing on/off)
SquareNote 61, ch 1Skip to next lyric section marker
TriangleNote 62, ch 1Re-arm (re-mute, ready for next verse)
CircleNote 63, ch 1Kill all cue firing (panic mute)
L1 holdNote 64, ch 1Manual cue override — every press fires the next word
R2 triggerCC 22, ch 1Cue intensity (downstream visual amplitude)

Now the controller is a cue-arming surface. Hit Cross at the start of a verse — every word in that verse fires its bound event. Hit Cross again at the end of the verse — firing stops. The performer keeps both hands free for the mic.

What the cues actually do

The lyric MIDI track is routed wherever you need cues to land:

  • Resolume — each note triggers the next clip in a column. Result: visuals advance per word.
  • Ableton Drum Rack — each note fires a vocal-chop sting, building a layered pad of every word in the verse.
  • MIDI-to-DMX bridge — each note bumps stage-light intensity. Lyric-synced lighting without a programmer.
  • OBS scene switching — bridge to OBS WebSocket, each word toggles a lower-third text overlay.
  • QLab — see the QLab cue guide for the full theatre rig.

The honest limits of Whisper for lyrics

Whisper was trained on speech, not song. That shows up as three predictable problems:

  1. Melisma collapse. One held vowel across multiple notes often gets transcribed as a single word. You lose the visual rhythm on those passages — fix by hand-splitting in the JSON.
  2. Doubled words. Some vocal styles (esp. trap doubling, ad-libs) produce two transcripts of the same word offset by ~150 ms. Deduplicate by collapsing words whose start times are within 100 ms of each other.
  3. Mumble-rap territory. The model's WER climbs past 15% on highly stylised vocal delivery. If the vocal is clear, you're fine. If it's deliberately blurry, plan extra cleanup time.
# Quick dedupe of doubled words
def dedupe(chunks, min_gap_s=0.1):
    out = []
    for c in chunks:
        start = c["timestamp"][0]
        if start is None: continue
        if out and abs(out[-1]["timestamp"][0] - start) < min_gap_s:
            continue
        out.append(c)
    return out

End-to-end check

Once you've built one of these, the recipe sticks:

vocal.wav ──► Whisper V3 ──► lyrics.json ──► mido ──► lyrics.mid
                                                          │
                                                          ▼
                                                    DAW track (ch 16)
                                                          │
                                                          ▼
                                          ┌───────────────┴───────────────┐
                                          │                               │
                                    Resolume cues               DMX light cues
                                          │                               │
                                          ▼                               ▼
                              ┌─ Visual sting per word ─┐    Stage lights per word
                              └─────────────────────────┘
                                          ▲
                                          │
                                  DualSense Cross button
                                  (arms / disarms firing)

Where this is honest about AI's value

The model is doing the part nobody enjoys — listening to a track and timestamping every word. A human can do it, but it takes hours. Whisper does it in seconds with acceptable accuracy and the gamepad gives the performer the on/off switch that keeps the cues musical instead of robotic. Pair this with our Twitch stinger workflow for streaming or the DMX bridge for lighting. Universal Controller MIDI ties the gamepad to the cue arm.

Keep reading

More setup walkthroughs