Why use Whisper for this instead of Vocaloid alignment or forced aligners?

Whisper V3 is open-weights, runs on a laptop, and handles accented English better than the academic aligners. Word-level timestamps are accurate enough for live cue triggering — usually within 80 ms of the true onset.

Should I use Whisper Large V3 or Distil-Whisper?

Large V3 is more accurate. Distil-Whisper V3 is 6× faster with about 1% word error rate cost. For live prep, accuracy wins. For batch jobs across a whole album, Distil wins.

Can this run live, not just pre-rendered?

Technically yes, with streaming Whisper. In practice the latency adds up — 200–500 ms even on optimised setups. We prep cues from a recording, not from the live mic.

What is the gamepad actually firing?

Anything MIDI. Visual cues in Resolume, sample one-shots in Ableton, lighting changes via MIDI-to-DMX, even subtitle overlays in OBS. The bridge does not care.

Does Whisper struggle with sung vocals?

It is built for speech. Held notes and melisma sometimes split one word into two, or merge two words into one. Clean the timestamp JSON by hand before the performance.

Is there a CC-licensed alternative to Whisper?

You may not need one. The Whisper code and weights are MIT-licensed, so they are free for commercial use. NVIDIA Canary and AssemblyAI are commercial alternatives if accent coverage matters.

Whisper V3 Lyric Transcription — Gamepad-Fired Cues

Whisper v3 gamepad workflows are the niche use of AI tooling we keep returning to — transcribe a vocal, extract word-level timestamps, convert them into MIDI cues, fire them from a controller button during the show. Whisper does the boring transcription. The gamepad arms and releases the cue track. The result is lyric-synced visuals, lights, and stingers without a single hand-clicked timeline marker.

TL;DR

Model: whisper-large-v3 or distil-large-v3 (Hugging Face).
Granularity: word-level timestamps, not segment-level.
Conversion: each word becomes a MIDI note on channel 16.
Gamepad role: mute/unmute the lyric MIDI track to arm/disarm cue firing.
Use cases: visuals, lights, OBS scenes, sample stings, subtitle overlays.

Why Whisper for live work

Whisper Large V3 can emit word-level timestamps at inference time: pass word_timestamps=True in the openai/whisper repo, or return_timestamps="word" in the Hugging Face transformers pipeline. The model is permissively licensed (MIT), runs on a laptop GPU or even on Apple Silicon CPU at usable speed, and handles a wide range of accents. The openai/whisper repo has the canonical implementation; the Hugging Face transformers port adds a streaming option and faster batching.

For live performance work we want the timestamps, not the text. Word boundaries are good to within ~80 ms on clean recordings, which is well inside the threshold where a triggered visual or lighting cue feels synced to the lyric.
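One way to check that tolerance on your own material is to hand-mark a few word onsets in the DAW and compare them with Whisper's. A minimal sketch: the function name onset_errors_ms is hypothetical, and the sample numbers are made up for illustration, not measurements.

```python
def onset_errors_ms(predicted_s, reference_s):
    """Absolute onset error in milliseconds for paired word onsets."""
    if len(predicted_s) != len(reference_s):
        raise ValueError("need one reference onset per predicted onset")
    return [abs(p - r) * 1000.0 for p, r in zip(predicted_s, reference_s)]


# Illustrative values only: Whisper onsets vs. hand-marked onsets, in seconds.
whisper_onsets = [0.00, 0.18, 0.42, 0.97]
hand_onsets = [0.02, 0.21, 0.40, 1.10]

errors = onset_errors_ms(whisper_onsets, hand_onsets)
# Indices of words whose onset misses the 80 ms tolerance.
outside_tolerance = [i for i, e in enumerate(errors) if e > 80.0]
print(f"worst error: {max(errors):.0f} ms, outside 80 ms: {outside_tolerance}")
```

A handful of words per song is usually enough to tell whether a recording is in the clean range or needs manual cleanup.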

Step one — transcribe

Run the model with return_timestamps="word". Here's the minimum-viable pipeline:

# pip install transformers accelerate torch soundfile librosa
from transformers import pipeline
import json

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,
    return_timestamps="word",
    device="mps",  # or "cuda", or -1 for CPU
)

result = pipe("vocal.wav")

# result["chunks"] = [
#   { "text": " The", "timestamp": [0.0, 0.18] },
#   { "text": " quick", "timestamp": [0.18, 0.42] },
#   ...
# ]

with open("lyrics.json", "w") as f:
    json.dump(result["chunks"], f, indent=2)

On an M2 Pro, a 3-minute vocal takes about 25 seconds with Whisper Large V3. Distil-Whisper V3 trims that to under 5 seconds with negligible word error rate cost — switch to distil-whisper/distil-large-v3 if you're batching.
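Before converting, it can help to scan the chunks for entries that will misbehave downstream. A minimal sketch, assuming the chunk layout shown in the comment above; find_problem_chunks and the sample data are hypothetical helpers for illustration, not part of transformers.

```python
def find_problem_chunks(chunks):
    """Return (index, reason) pairs for chunks that need manual attention."""
    problems = []
    prev_start = 0.0
    for i, c in enumerate(chunks):
        start, end = c["timestamp"]
        if start is None or end is None:
            problems.append((i, "missing timestamp"))
            continue
        if end <= start:
            problems.append((i, "zero or negative duration"))
        if start < prev_start:
            problems.append((i, "starts before previous word"))
        prev_start = start
    return problems


# Made-up sample in the same shape as result["chunks"].
sample = [
    {"text": " The", "timestamp": [0.0, 0.18]},
    {"text": " quick", "timestamp": [0.18, 0.18]},
    {"text": " brown", "timestamp": [0.42, None]},
    {"text": " fox", "timestamp": [0.10, 0.55]},
]
problems = find_problem_chunks(sample)
for index, reason in problems:
    print(index, sample[index]["text"].strip(), reason)
```

In practice you would load lyrics.json with json.load and pass the list straight in.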

Step two — convert words to MIDI

Each word becomes a MIDI note. We use note 60 on channel 16 by convention — channel 16 keeps lyric cues out of the way of performance MIDI on channels 1–4. Duration matches the word's spoken length so the note-off fires at the word's end (handy for visual fades).

# pip install mido
import json
import mido

mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)

# Mido uses ticks; assume 480 ticks per beat, 120 BPM, so 1 s = 960 ticks.
TICKS_PER_SEC = 960
last_tick = 0

with open("lyrics.json") as f:
    chunks = json.load(f)

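for c in chunks:
    start_s, end_s = c["timestamp"]
    if start_s is None or end_s is None:
        continue  # skip words without a usable timestamp
    # Clamp so an overlapping word never yields a negative delta, and give
    # every note at least one tick so note_off follows note_on.
    start_tick = max(last_tick, int(start_s * TICKS_PER_SEC))
    end_tick = max(start_tick + 1, int(end_s * TICKS_PER_SEC))

    # mido channels are zero-indexed: channel=15 is MIDI channel 16.
    track.append(mido.Message("note_on",  note=60, velocity=100,
                              channel=15, time=start_tick - last_tick))
    track.append(mido.Message("note_off", note=60, velocity=0,
                              channel=15, time=end_tick - start_tick))
    last_tick = end_tick  # cursor matches what was actually written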

mid.save("lyrics.mid")

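Drop lyrics.mid on a MIDI track in your DAW and line up the clip start with the start of vocal.wav, since the timestamps count from the beginning of the file rather than from the first sung word. Keep the project at 120 BPM so that 960 ticks still equal one second. You now have a click-track of every lyric in the song.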
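The 960 ticks-per-second constant in the conversion script holds only at 120 BPM with 480 ticks per beat. For a project at another tempo, the same arithmetic can be sketched as below. The helper names are hypothetical, and this version rounds to the nearest tick where the script above truncates with int().

```python
def ticks_per_second(bpm, ticks_per_beat=480):
    """Ticks elapsed in one second at a fixed tempo."""
    return ticks_per_beat * bpm / 60.0


def seconds_to_ticks(seconds, bpm, ticks_per_beat=480):
    """Convert a timestamp in seconds to the nearest MIDI tick."""
    return round(seconds * ticks_per_second(bpm, ticks_per_beat))


print(ticks_per_second(120))        # the constant used in the script
print(seconds_to_ticks(0.5, 90))    # half a second at 90 BPM
```

At a fixed tempo this is all the conversion needs. A song with tempo changes would need a tempo map instead, which is out of scope here.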

Step three — gamepad arming

The MIDI clip will fire every word on playback. That's too eager. We want to arm cue firing only during specific sections of the show. Bind a DualSense button to mute/unmute the lyric track:

Input	MIDI	Function
Cross	Note 60, ch 1	Toggle lyric track mute (cue firing on/off)
Square	Note 61, ch 1	Skip to next lyric section marker
Triangle	Note 62, ch 1	Re-arm (re-mute, ready for next verse)
Circle	Note 63, ch 1	Kill all cue firing (panic mute)
L1 hold	Note 64, ch 1	Manual cue override — every press fires the next word
R2 trigger	CC 22, ch 1	Cue intensity (downstream visual amplitude)

Now the controller is a cue-arming surface. Hit Cross at the start of a verse — every word in that verse fires its bound event. Hit Cross again at the end of the verse — firing stops. The performer keeps both hands free for the mic.
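If you would rather gate the cues in a script than with the DAW's track mute, the arming logic is a small state machine. A minimal sketch under stated assumptions: CueGate is a hypothetical class, messages are plain (type, channel, note) tuples rather than any particular MIDI library's objects, and only the Cross and Circle rows of the table are modelled.

```python
from dataclasses import dataclass

LYRIC_CHANNEL = 15    # zero-indexed, i.e. MIDI channel 16
CONTROL_CHANNEL = 0   # zero-indexed, i.e. MIDI channel 1
TOGGLE_NOTE = 60      # Cross: toggle cue firing
PANIC_NOTE = 63       # Circle: kill all cue firing


@dataclass
class CueGate:
    armed: bool = False

    def handle(self, msg_type, channel, note):
        """Update state; return True if the message should be forwarded."""
        if channel == CONTROL_CHANNEL:
            if msg_type == "note_on" and note == TOGGLE_NOTE:
                self.armed = not self.armed
            elif msg_type == "note_on" and note == PANIC_NOTE:
                self.armed = False
            return False  # control messages are never forwarded as cues
        if channel == LYRIC_CHANNEL:
            # Always pass note_off so disarming mid-word cannot leave a
            # note stuck on downstream.
            return msg_type == "note_off" or self.armed
        return False


gate = CueGate()
events = [
    ("note_on", 15, 60),   # word while disarmed: dropped
    ("note_on", 0, 60),    # Cross pressed: arm
    ("note_on", 15, 60),   # word while armed: forwarded
    ("note_on", 0, 63),    # Circle pressed: panic, disarm
    ("note_off", 15, 60),  # note_off still forwarded
    ("note_on", 15, 60),   # dropped again
]
forwarded = [gate.handle(*e) for e in events]
print(forwarded)
```

Letting note_off through while disarmed is a deliberate choice: a visual or light that was switched on by a word should still get its release.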

What the cues actually do

The lyric MIDI track is routed wherever you need cues to land:

Resolume — each note triggers the next clip in a column. Result: visuals advance per word.
Ableton Drum Rack — each note fires a vocal-chop sting, building a layered pad of every word in the verse.
MIDI-to-DMX bridge — each note bumps stage-light intensity. Lyric-synced lighting without a programmer.
OBS scene switching — bridge to OBS WebSocket, each word toggles a lower-third text overlay.
QLab — see the QLab cue guide for the full theatre rig.
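For the MIDI-to-DMX route, 7-bit MIDI values have to become 8-bit DMX levels somewhere along the chain. How a given bridge does this is not specified here, so treat the following as a sketch of one linear mapping. The function names are hypothetical, and the intensity input is the R2 trigger's CC 22 from the table above.

```python
def midi_to_dmx(value):
    """Map a 7-bit MIDI value (0-127) linearly onto a DMX level (0-255)."""
    if not 0 <= value <= 127:
        raise ValueError("MIDI data bytes are 0-127")
    return round(value * 255 / 127)


def cue_level(velocity, intensity_cc):
    """Scale a word's note velocity by the intensity controller value."""
    scaled = velocity * intensity_cc / 127  # result stays within 0-127
    return midi_to_dmx(round(scaled))


print(midi_to_dmx(127))      # full MIDI value maps to full DMX level
print(cue_level(100, 0))     # intensity at zero blacks the cue out
```

A real rig may prefer a dimmer curve over a straight line, since perceived brightness is not linear in DMX level.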

The honest limits of Whisper for lyrics

Whisper was trained on speech, not song. That shows up as three predictable problems:

Melisma collapse. One held vowel across multiple notes often gets transcribed as a single word. You lose the visual rhythm on those passages — fix by hand-splitting in the JSON.
Doubled words. Some vocal styles (esp. trap doubling, ad-libs) produce two transcripts of the same word offset by ~150 ms. Deduplicate by collapsing repeats of the same word whose start times are within 200 ms of each other, a window wide enough to catch that offset.
Mumble-rap territory. The model's WER climbs past 15% on highly stylised vocal delivery. If the vocal is clear, you're fine. If it's deliberately blurry, plan extra cleanup time.

# Quick dedupe of doubled words: drop a chunk when it repeats the
# previously kept word and starts within max_gap_s of it.
def dedupe(chunks, max_gap_s=0.2):
    out = []
    for c in chunks:
        start = c["timestamp"][0]
        if start is None:
            continue
        if out:
            prev = out[-1]
            same_word = prev["text"].strip().lower() == c["text"].strip().lower()
            if same_word and abs(prev["timestamp"][0] - start) < max_gap_s:
                continue
        out.append(c)
    return out
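Melisma collapse can be patched in the same spirit. The sketch below splits one held word into equal-length pieces so that each piece fires its own cue. split_chunk is a hypothetical helper, and equal spacing is an assumption: the real note boundaries still need checking by ear.

```python
def split_chunk(chunk, parts):
    """Split one word chunk into `parts` equal-length chunks, same text."""
    start, end = chunk["timestamp"]
    if start is None or end is None or parts < 1:
        raise ValueError("chunk needs both timestamps and parts >= 1")
    step = (end - start) / parts
    return [
        {
            "text": chunk["text"],
            "timestamp": [
                round(start + i * step, 3),
                round(start + (i + 1) * step, 3),
            ],
        }
        for i in range(parts)
    ]


# Made-up example: one word held across three notes.
held = {"text": " love", "timestamp": [12.0, 13.2]}
pieces = split_chunk(held, 3)
for p in pieces:
    print(p["timestamp"])
```

Replace the original entry in the JSON with the returned pieces, then nudge the boundaries by hand where the melody is uneven.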

End-to-end check

Once you've built one of these, the recipe sticks:

vocal.wav ──► Whisper V3 ──► lyrics.json ──► mido ──► lyrics.mid
                                                          │
                                                          ▼
                                                    DAW track (ch 16)
                                                          │
                                                          ▼
                                          ┌───────────────┴───────────────┐
                                          │                               │
                                    Resolume cues               DMX light cues
                                          │                               │
                                          ▼                               ▼
                              ┌─ Visual sting per word ─┐    Stage lights per word
                              └─────────────────────────┘
                                          ▲
                                          │
                                  DualSense Cross button
                                  (arms / disarms firing)

Where this is honest about AI's value

The model is doing the part nobody enjoys — listening to a track and timestamping every word. A human can do it, but it takes hours. Whisper does it in seconds with acceptable accuracy and the gamepad gives the performer the on/off switch that keeps the cues musical instead of robotic. Pair this with our Twitch stinger workflow for streaming or the DMX bridge for lighting. Universal Controller MIDI ties the gamepad to the cue arm.
