NHJ Audio Standard & Pipeline

How Not-Happy-Jan's audio is built, stored, mixed, and tuned. This is the canonical reference for the asset fidelity tiers, the single-stream mixer, the on/off-hold sequence, the experience modes, and every runtime control. Written to be teachable — the why sits next to the what.

1. The one architectural fact that drives everything

All of NHJ's audio runs through one sounddevice.OutputStream — a single mixer (src/nhj/duck_player.py) that sums named buses on one timeline. Every source, whatever its format, is decoded to 48 kHz stereo float32 at playback (_decode_file, _Decoder via PyAV).

The consequence: storage format sets only the fidelity ceiling and the file size, not the playback path. A mono 24 kHz clunk and a 48 kHz-stereo copy of it sound identical once they hit the bus — so storing SFX at 48 kHz stereo is pure waste. We store each asset class at the lowest tier that loses nothing audible for its role, and let the mixer upsample.
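To make the point concrete, here is a minimal sketch of lifting a mono 24 kHz clip to the 48 kHz stereo bus format. It is not the project's `_decode_file` (which decodes via PyAV); `to_bus_format` is a hypothetical helper, and linear interpolation is an assumed stand-in for whatever resampler the real decoder uses.

```python
def to_bus_format(mono, src_rate=24_000, bus_rate=48_000):
    """Upsample a mono clip to bus_rate and duplicate it to left/right.

    Hypothetical helper: linear interpolation stands in for the real resampler.
    Returns a list of (left, right) float frames.
    """
    if not mono:
        return []
    n_out = len(mono) * bus_rate // src_rate
    frames = []
    for i in range(n_out):
        pos = i * src_rate / bus_rate          # fractional read position in the source
        j = int(pos)
        frac = pos - j
        nxt = mono[min(j + 1, len(mono) - 1)]  # hold the last sample at the clip end
        s = mono[j] * (1.0 - frac) + nxt * frac
        frames.append((s, s))                  # mono source: identical left and right
    return frames
```

Whatever the storage tier, the bus only ever sees frames like these, which is why a 48 kHz stereo copy of a mono clunk buys nothing at playback.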

2. Fidelity tiers

Class	Format	Channels / rate	Why
Music (hold bed)	AAC `.m4a`, ~128 kbps	stereo 48 k	The showcase — "very cool music." Compressed is transparent; stereo matters.
Tones (`call_on_hold`, `call_off_hold`, `failed_transfer`)	PCM `.wav` s16	mono 24 k	Point sounds; phone tones live < 4 kHz. No stereo/HF to lose.
Handset (pickup/putdown pool)	PCM `.wav` s16	mono 24 k	Mono-recorded clunks. 24 k Nyquist (12 kHz) covers the transient rattle.
Ambient beds (pub/party/beach/outdoor/callcentre/special-forces-radio)	PCM `.wav` s16	mono 24 k	Scene beds play one-shot; mode beds can loop under voice. Stereo width is inaudible there.
Voice (scenes / intros / returns / hot-mic asides)	PCM `.wav` s16	mono 24 k	TTS-native (Qwen3-TTS emits 24 k mono). Already optimal.

Rule of thumb: music gets the high-fidelity tier (stereo 48 kHz AAC); everything else is mono 24 kHz PCM, one consistent tier for SFX, ambient, and voice. Converting the SFX and ambient assets from 48 kHz stereo to this standard cut them from ~10.5 MB to ~2.7 MB with no audible change.
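The size saving follows directly from the PCM data rate. A quick worked check, assuming the earlier 48 kHz stereo files were also 16-bit PCM (the document does not state their format):

```python
def pcm_bytes_per_second(rate_hz, channels, bits=16):
    """Uncompressed PCM data rate, ignoring the small WAV header."""
    return rate_hz * channels * bits // 8

old_rate = pcm_bytes_per_second(48_000, 2)   # 192000 bytes per second
new_rate = pcm_bytes_per_second(24_000, 1)   # 48000 bytes per second
expected_ratio = old_rate / new_rate         # 4.0
observed_ratio = 10.5 / 2.7                  # about 3.9, from the figures above
```

Half the sample rate and half the channels predict a 4x reduction, which lines up with the observed ~10.5 MB to ~2.7 MB.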

Masters are never thrown away: the full-resolution source .m4a files live in archive/originals/ (gitignored), so any asset can be re-derived with a different window or level. The whole audio/ tree is gitignored; only the code that builds and plays it is tracked.

3. The buses

The mixer (Mixer._callback) sums four buses every ~21 ms block:

Bus	Concurrency	Ducking	Carries
music	single bed (looping playlist)	ducks under speech/fx; rides the long-session floor	the hold music
fx	concurrent one-shots	none (sits on top)	going-on-hold putdown + tone
voice	sequential queue (FIFO)	—	off-hold tone+pickup, scene/aside lines, the character verdict
ambient	concurrent low beds	ducked to `NHJ_AMBIENT_DUCK` (0.35) under voice	scene beds, call-centre room tone, looping mode beds

Auto-sidechain: whenever any fx/voice is sounding, the music target drops to NHJ_MUZAK_DUCK_GAIN (0.22). A per-block linear ramp on _music_gain smooths the dip and recovery. pause() freezes the music bed at its exact sample position (skips the bed in the callback) so it resumes seamlessly; voice/fx/ambient always sound regardless of pause.
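A minimal sketch of the sidechain idea, not the code in `Mixer._callback`: pick a target gain from whether voice or fx is sounding, then move the music gain toward it linearly across the block. `music_target`, `ramp_block`, and the `max_step` cap are hypothetical names and values chosen for illustration.

```python
DUCK_GAIN = 0.22  # NHJ_MUZAK_DUCK_GAIN default from this document

def music_target(voice_or_fx_active, full=1.0, duck=DUCK_GAIN):
    """Target music gain for the next block: ducked while voice/fx sounds."""
    return duck if voice_or_fx_active else full

def ramp_block(current, target, frames, max_step=0.05):
    """Per-sample gains for one block, moving linearly toward target.

    max_step (hypothetical) caps how far the gain may travel in one block,
    which is what turns an abrupt duck into a smooth dip and recovery.
    Returns (gains, gain_at_end_of_block).
    """
    delta = max(-max_step, min(max_step, target - current))
    gains = [current + delta * (i + 1) / frames for i in range(frames)]
    return gains, current + delta
```

Calling `ramp_block` once per callback and carrying the returned end gain into the next block gives a continuous gain curve with no steps at block boundaries.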

Voice levelling: the pre-rendered banks are loudness-normalised offline (about -19 LUFS), but live TTS comes out at whatever level the model produced. The adapter therefore levels each live clip to a matching target before caching and playback: _normalize_voice_wav applies a single full-clip RMS gain to NHJ_VOICE_TARGET_DBFS (default -19.5 dBFS), limited by a -1.5 dBFS peak ceiling. Because it is pure gain, timbre is unchanged, and on any error it returns the clip untouched, so levelling can never break playback. Live and banked lines then sit consistently on the bus.
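The levelling step can be sketched in a few lines. This is an illustration of the described behaviour (one RMS gain, a peak ceiling, fail-safe return), not the body of `_normalize_voice_wav`; `level_clip` is a hypothetical name and it works on a plain list of float samples rather than a WAV file.

```python
import math

def level_clip(samples, target_dbfs=-19.5, peak_ceiling_dbfs=-1.5):
    """Apply one gain to the whole clip: RMS to target, peak under the ceiling.

    Sketch only. On any error the clip is returned untouched.
    """
    try:
        rms = math.sqrt(sum(s * s for s in samples) / len(samples))
        peak = max(abs(s) for s in samples)
        if rms == 0.0:
            return samples                       # silence: nothing to level
        gain = 10 ** (target_dbfs / 20) / rms    # gain that lands RMS on target
        ceiling = 10 ** (peak_ceiling_dbfs / 20)
        gain = min(gain, ceiling / peak)         # never push the peak past the ceiling
        return [s * gain for s in samples]
    except Exception:
        return samples                           # levelling must never break playback
```

A single gain for the whole clip is what keeps this timbre-safe: nothing is compressed or filtered, only scaled.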

Cross-process: the Stop-hook → worker → adapter pipeline doesn't touch audio directly — it drops play-events (JSON) into ~/.config/nhj/audio-events/, which the mixer ingests via pump_events(). One timeline, no racing afplay processes.
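One way such a drop-directory handoff can work is sketched below. The file naming, the atomic rename, and the event fields (`bus`, `path`) are assumptions for illustration; `ingest_events` mirrors the role of `pump_events()` but is not its real implementation.

```python
import itertools
import json
import os
import tempfile
import time
from pathlib import Path

_seq = itertools.count()  # tie-breaker for events written in the same nanosecond

def drop_event(events_dir, event):
    """Producer side: write one play-event atomically (temp file, then rename)."""
    events_dir = Path(events_dir)
    events_dir.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=events_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(event, fh)
    final = events_dir / f"{time.time_ns():020d}-{next(_seq):06d}.json"
    os.replace(tmp, final)       # a reader never sees a half-written .json
    return final

def ingest_events(events_dir):
    """Consumer side: read pending events oldest-first, deleting each one."""
    events = []
    for path in sorted(Path(events_dir).glob("*.json")):
        try:
            events.append(json.loads(path.read_text()))
        except (OSError, json.JSONDecodeError):
            pass                 # skip unreadable events rather than stall the mixer
        path.unlink(missing_ok=True)
    return events
```

Because only the mixer process ever plays audio, producers can be short-lived and crash-prone without affecting the timeline.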

4. The on/off-hold sequence (gapless by construction)

The call-centre metaphor: Claude working = you're on hold (music). Claude done = Jan comes off hold and delivers the verdict.

Going ON hold   :  [intro / scene / hot-mic aside]  →  putdown clunk + call_on_hold tone  →  music
Coming OFF hold :  call_off_hold tone + handset pickup  →  the character's verdict (or silence)

Gapless guarantee — the requirement was music → click → receiver with no audible gaps. Two timed one-shots can't promise that, so Mixer.play_sequence() concatenates the tone + clunk into a single buffer (zero samples between, per-part gain baked in) and queues it on the sequential voice bus. The character's verdict — when its TTS lands — queues straight behind it on the same FIFO, so it butts on with no gap. If no verdict fires, you simply get tone + clunk → silence.
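The concatenate-then-queue idea is small enough to sketch. `concat_sequence` is a hypothetical stand-in for what `Mixer.play_sequence()` is described as doing, and the sample values are toy data.

```python
from collections import deque

def concat_sequence(parts):
    """Join (samples, gain) parts into one buffer with nothing in between."""
    out = []
    for samples, gain in parts:
        out.extend(s * gain for s in samples)   # per-part gain baked in
    return out

voice_queue = deque()                            # sequential FIFO, like the voice bus
tone, clunk = [0.5, 0.5], [1.0, -1.0, 0.0]       # toy stand-ins for real clips
voice_queue.append(concat_sequence([(tone, 0.85), (clunk, 0.9)]))
voice_queue.append([0.1, 0.2])                   # the verdict queues straight behind
```

The gap cannot exist because there is no second playback call to schedule: the tone and clunk are one buffer by the time the bus sees them.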

The verdict is produced by a separate process (TTS takes 1–3 s). If it's slower than the ~1.5 s tone+clunk, there's a short natural beat before she speaks — like a real pickup. The three sounds that must be gapless always are; the verdict-after-clunk is gapless when the line is ready in time.

The handset pool is shared and serves as both pickup and putdown — a random clip each time keeps it from sounding canned. Rarely (NHJ_FAILED_TRANSFER_PROB) a botched-transfer gag replaces the off-hold pair.
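As a rough sketch of that selection (the function name and the clip file names are hypothetical; 0.05 is the documented default for NHJ_FAILED_TRANSFER_PROB):

```python
import random

def off_hold_parts(handset_pool, rng, failed_transfer_prob=0.05):
    """Pick the sounds for one coming-off-hold moment."""
    if rng.random() < failed_transfer_prob:
        return ["failed_transfer"]                   # the rare botched-transfer gag
    return ["call_off_hold", rng.choice(handset_pool)]

rng = random.Random(7)                               # seeded only to make demos repeatable
pool = ["handset_01.wav", "handset_02.wav", "handset_03.wav"]  # hypothetical names
parts = off_hold_parts(pool, rng)
```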

5. Experience modes: `nhj mode <…>` (or `NHJ_MODE`)

Mode	Hold music	Voice filter	Extras
normal	yes	—	full hold-music experience (default)
agent-vibes	no	—	just the off-hold notification (tone + pickup) + the verdict. Going-on-hold is silent. A lighter daily driver.
call-centre	yes	phone bandpass (300–3400 Hz + drive) on the whole voice bus	looping call-centre room bed (murmur + key ticks) under the whole mode

Set persistently with nhj mode call-centre (writes the audio_mode and ambient_mode settings), or per-run with NHJ_MODE=agent-vibes. The controller re-reads the mode every tick, so it applies live.
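A sketch of how a per-tick mode lookup might resolve, under two assumptions the document does not confirm: that the per-run `NHJ_MODE` wins over the persisted `audio_mode` setting, and that an unknown name falls back to `normal`. The feature flags restate the table above; `MODES` and `resolve_mode` are hypothetical names.

```python
MODES = {
    "normal":      {"hold_music": True,  "phone_filter": False, "room_bed": False},
    "agent-vibes": {"hold_music": False, "phone_filter": False, "room_bed": False},
    "call-centre": {"hold_music": True,  "phone_filter": True,  "room_bed": True},
}

def resolve_mode(env, settings):
    """Return the active mode name; cheap enough to call on every tick."""
    # Assumption: the per-run env var takes precedence over the saved setting.
    name = env.get("NHJ_MODE") or settings.get("audio_mode") or "normal"
    return name if name in MODES else "normal"   # assumption: unknown means default
```

Re-resolving on every tick rather than caching at startup is what lets a mode change apply live.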

6. Long-session ducking

Long inferences shouldn't drone the bed at constant volume. The music floor eases from full down to NHJ_LONGHOLD_DUCK (0.15) once a continuous inference passes NHJ_LONGHOLD_AFTER (20 s), over a NHJ_LONGHOLD_RAMP (4 s) glide, then snaps back to full when it goes idle and the next hold begins. It rides the same music_floor target the sidechain uses, so it composes cleanly with speech ducking. On by default; NHJ_LONGHOLD=0 disables.
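The floor is a simple function of how long the current inference has been running. A sketch using the documented defaults, assuming the glide is linear (the document only says it "eases"); `longhold_floor` is a hypothetical name.

```python
def longhold_floor(elapsed_s, after_s=20.0, ramp_s=4.0, duck=0.15, enabled=True):
    """Music floor for a continuous inference that has run for elapsed_s seconds.

    Defaults mirror NHJ_LONGHOLD_AFTER, NHJ_LONGHOLD_RAMP and NHJ_LONGHOLD_DUCK.
    """
    if not enabled or elapsed_s <= after_s:
        return 1.0                                   # full level until the threshold
    progress = min(1.0, (elapsed_s - after_s) / ramp_s)
    return 1.0 + (duck - 1.0) * progress             # assumed linear glide to duck
```

With the defaults the floor is 1.0 up to 20 s, 0.575 at 22 s (halfway through the glide), and 0.15 from 24 s onward. Resetting `elapsed_s` to zero when the next hold begins gives the snap back to full.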

7. Compression: `NHJ_COMP=1` (off by default)

A soft-knee tanh saturation + makeup on the master sum, replacing the bare clip limiter — glues the music + voice + fx and tames inter-bus peaks. Off by default so it doesn't colour the stock sound; NHJ_COMP_DRIVE (1.6) sets the knee. (This is instantaneous waveshaping, not a time-constant compressor — the long-session ducking handles macro-dynamics; this just controls peaks.)
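For intuition, here is one common form of tanh waveshaping with makeup gain. The exact makeup NHJ applies is not stated in this document, so the `1 / tanh(drive)` normalisation below is an assumption, and `soft_clip` is a hypothetical name.

```python
import math

def soft_clip(x, drive=1.6):
    """Instantaneous tanh saturation with makeup so that 1.0 maps to 1.0.

    drive mirrors NHJ_COMP_DRIVE. The makeup form is an assumption.
    """
    y = math.tanh(drive * x) / math.tanh(drive)
    return max(-1.0, min(1.0, y))    # hard clamp kept as a backstop for |x| > 1
```

Small signals are boosted (the slope at zero is `drive / tanh(drive)`), while samples near full scale are squeezed toward the ceiling. There is no memory between samples, which is why this controls peaks but cannot shape dynamics over time.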

8. Levels: runtime dials, not baked

Stored assets are normalised at build time (SFX peak-normalised to −1 dBFS, beds loudness-normalised to −20 LUFS with loudnorm); their role level is applied at play time via NHJ_*_VOLUME env vars. This keeps re-balancing a one-line change with no re-encoding, which matters for tuning and for the teaching narrative. SFX that always pair with a line stay separate (the sequential bus concatenates them gaplessly at runtime), preserving full variation rather than welding fixed pairs.
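A sketch of what "level as a runtime dial" can look like. `env_gain` is a hypothetical helper; falling back to the default on an unparsable value and clamping at zero are assumptions, not documented behaviour.

```python
import os

def env_gain(name, default, env=None):
    """Read a linear gain from an env var, falling back to the default."""
    env = os.environ if env is None else env
    try:
        return max(0.0, float(env.get(name, default)))   # assumed: no negative gains
    except (TypeError, ValueError):
        return default                                    # assumed: bad value = default

def apply_gain(samples, gain):
    """Scale a normalised clip to its role level at play time."""
    return [s * gain for s in samples]
```

For example, `apply_gain(clip, env_gain("NHJ_HANDSET_VOLUME", 0.9))` re-balances the handset without touching the stored file.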

9. Modular / swappable voices

Voice routing goes through character_by_name() → a voices/<name>/ directory, and every spoken clip is an independent voice-bus one-shot. Swapping a character for a different TTS, a fully dynamic voice, or the user's own recordings is just pointing a character at a different voice set — no pipeline change. This is exactly why SFX are kept separate from speech: a pre-merged clunk+line would weld that door shut.

9a. Speech speed: applied once, at playback

Speed is applied in exactly one place: by resampling the clip at playback on the voice bus. It is never sent to the TTS backend — the deployed Qwen model ignores a synthesis-speed kwarg, and applying it at both synthesis and playback would double it. Resampling at playback means bank, cache, and live-synthesised clips all speed up identically (they're stored speed-neutral; the cache stays reusable at any speed).

Precedence: an explicit speed from the marker/MCP (e.g. [Jan:ok|speed=1.3]) overrides the character band's speed; speed=1.0 defers to the band's natural pace. Trade-off: resampling shifts pitch (faster = higher) — a pitch-preserving time-stretch would need a DSP dependency we deliberately don't take.
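Both rules fit in a short sketch. The function names are hypothetical and linear interpolation is an assumed stand-in for the mixer's actual resampler.

```python
def effective_speed(explicit, band_speed):
    """An explicit speed other than 1.0 wins; 1.0 (or none) defers to the band."""
    return band_speed if explicit in (None, 1.0) else explicit

def resample_for_speed(samples, speed):
    """Read the clip at `speed` source samples per output sample.

    Faster playback yields fewer output samples and, as a side effect, higher pitch.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / speed))
    out = []
    for i in range(n_out):
        pos = i * speed                                  # fractional source position
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]      # hold the last sample at the end
        out.append(samples[j] * (1.0 - frac) + nxt * frac)
    return out
```

The pitch cost is easy to quantify: a speed factor s shifts pitch by 12 * log2(s) semitones, so speed=1.3 raises it by roughly 4.5 semitones.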

9b. Output-stream resilience

The single OutputStream is opened and kept alive by a bounded-backoff keeper (StreamKeeper in duck_player.py). A transient PortAudio failure on the initial open no longer kills the controller, and when the default output device switches (or a reopen fails) it retries with capped exponential backoff — even if the device name doesn't change again — instead of going dead until restart. Failures stay observable (one log line each) without a tight error loop; queued voice keeps buffering and plays once the stream recovers.
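Capped exponential backoff is a standard pattern; a minimal sketch follows. It is not the `StreamKeeper` code: the base delay, cap, and function names are hypothetical, and the stream opener is passed in as a callable so the example stays independent of sounddevice.

```python
def backoff_delays(base_s=0.5, cap_s=30.0, factor=2.0):
    """Yield retry delays that grow by `factor` and never exceed cap_s."""
    delay = base_s
    while True:
        yield min(delay, cap_s)
        delay = min(delay * factor, cap_s)

def open_with_backoff(open_stream, sleep, log, max_attempts=None):
    """Call open_stream() until it succeeds, sleeping between failures."""
    for attempt, delay in enumerate(backoff_delays(), start=1):
        try:
            return open_stream()
        except Exception as exc:   # the real keeper would catch PortAudio errors
            log(f"stream open failed (attempt {attempt}): {exc}; retry in {delay}s")
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
```

One log line per failure keeps the problem observable, and the cap bounds both the log volume and how long recovery can lag behind the device coming back.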

10. Control reference

Modes & settings

Control	Default	Effect
`nhj mode normal|agent-vibes|call-centre` / `NHJ_MODE`	normal	experience mode
`NHJ_MUZAK_PLAYER`	`duck`	selects this single-stream controller
`NHJ_MUZAK_VOLUME`	0.5	master gain

Ducking

Control	Default	Effect
`NHJ_MUZAK_DUCK_GAIN`	0.22	music floor while speech/fx sounds (sidechain)
`NHJ_AMBIENT_DUCK`	0.35	ambient bed level under voice
`NHJ_LONGHOLD` / `_AFTER` / `_RAMP` / `_DUCK`	1 / 20 / 4 / 0.15	long-session muzak fade

Hold sounds

Control	Default	Effect
`NHJ_MUZAK_TRANSFER` / `_VOLUME`	1 / 0.85	on/off-hold tones on/off, level
`NHJ_HANDSET` / `_VOLUME`	1 / 0.9	handset pickup/putdown on/off, level
`NHJ_PUTDOWN`	1	putdown clunk going on hold
`NHJ_FAILED_TRANSFER` / `_PROB` / `_VOLUME`	1 / 0.05 / 0.9	botched-transfer gag
`NHJ_MUZAK_INTRO` / `_VOLUME`	1 / 0.9	"hold please" intro

Scenes & asides

Control	Default	Effect
`NHJ_SCENES` / `NHJ_SCENES_PROB`	1 / 0.08	easter-egg hold vignettes
`NHJ_OPENMIC_PROB`	0.05	hot-mic asides (she forgets to hit hold)

Filters

Control	Default	Effect
`NHJ_COMP` / `NHJ_COMP_DRIVE`	0 / 1.6	master soft-knee saturation
`NHJ_MODE_AMBIENT_VOLUME`	0.5	looping mode-bed level

11. Building assets

# Scene + hot-mic lines (TTS on :9992) + fill MISSING placeholder beds (never clobbers reals)
nhj build-scenes                 # --force-beds to regenerate the synthetic placeholders

# Music tracks live as audio/music/inference_muzak_NNN.m4a (stereo 48k AAC) — kept as-is.

SFX one-shots and beds are derived from masters in archive/originals/ with ffmpeg at the standard:

# SFX tone:  trim silence → peak-normalise → mono 24k
ffmpeg -i src.m4a -af "<trim>,volume=<g>dB" -ar 24000 -ac 1 -c:a pcm_s16le out.wav

# Ambient bed:  windowed → loudnorm -20 LUFS → 0.4s fades → mono 24k
ffmpeg -ss <t> -t 12 -i src.m4a -af "loudnorm=I=-20:TP=-1.5:LRA=11,afade=t=in:d=0.4,afade=t=out:st=11.6:d=0.4" \
       -ar 24000 -ac 1 -c:a pcm_s16le bed.wav

# Handset pool:  energy-envelope splice of a multi-hit take → many normalised mono-24k clips
#                (merge a clunk's internal micro-gaps; drop blips; pad; de-click edges)

Synthetic placeholders (scenes.generate_placeholder_beds, generate_callcentre_bed) emit at the standard so out-of-box demos match production.

12. TODO / roadmap

Real CC0 field recordings to replace the synthetic outdoor and callcentre beds (and any beds you want richer). The hooks are ready — drop a .wav in audio/ambient/.
Typing / phone one-shots sprinkled in call-centre mode (presently folded into the synth bed's key-ticks).
Modular-voice UX — expose the voice-swap (different TTS / user-recorded voices) as a first-class setting; the architecture already supports it.
Time-constant compressor (attack/release) if the instantaneous saturation proves too blunt for "glue."