This isn't a caption system purpose-built for VJ shows. It's a general-purpose captioning intelligence service designed to serve a range of contexts, from deaf-first accessibility (A.EYE.ECHO) to creative live-show work (WallSpace).
We start with live music + visuals because if the pipeline works with music bleed, crowd noise, and multiple speakers, the cleaner use cases are downstream. When you see "VJ audio" or "live-show context" below, read it as "our hardest calibration target," not "our only use case."
Since v2.6.0 shipped, Matt (in collaboration with ChatGPT Plus) has produced a formal architecture paper that now serves as the strategic spine for this project. We've adopted it as the v3.0.0 foundation. The short version: treat the BBC Subtitle Guidelines as a compliance floor (the minimum safe readability / accessibility behaviour), and allow advanced features (emotion, prosody, WallSpace visuals, immersive placement) only as controlled extensions above that floor.
Standards-based captioning should define the minimum safe and readable behaviour of the system, while advanced features are implemented only as controlled extensions above that baseline. Decision order: readability first, compliance second, enhancement third, expression fourth.
The full specification set Matt prepared is referenced throughout this v3.0.0 plan. Click any card to download the PDF.
Matt's architecture uses BBC Subtitle Guidelines as the compliance floor. For the deaf-accessibility use case in A.EYE.ECHO this floor is strict — deaf users depend on it daily, and standards compliance is non-negotiable. For WallSpace's creative / VJ contexts, a caption may legitimately operate with different constraints: a live show, small localized audience, or experimental art installation has different priorities than a daily accessibility conversation.
Matt's paper treats the compliance floor as invariant across all contexts. Jack's read is that for WallSpace — particularly with smaller audiences, localized VJ shows, and experimental art contexts — more flexibility may be appropriate. The architecture already supports this through Rule Class D (project-specific advanced rules) and presentation profiles, but we need to agree on exactly which rules bend and which stay fixed.
Action: Jack to review this framing with Matt. The specific A-class rules that must hold for WallSpace in all modes, and the B/C/D-class rules that can be overridden in Immersive / Experimental WallSpace profiles, need explicit agreement.
| | A.EYE.ECHO (deaf-first) | WallSpace (creative) |
|---|---|---|
| Compliance floor | Strict. BBC baseline always enforced. | Flexible within clear limits. Class A rules still hold. |
| Default profile | Compliance or Accessible Enhanced | Live/Low Latency, Immersive, or Experimental |
| Enhancement scope | Limited to readability-preserving additions | Reactive typography, spatial placement, WallSpace visual control allowed |
| Override path | None beyond built-in profiles | Class D rules + per-layer profile selection |
| Logging requirement | Full decision log (verbose) | Configurable per profile (minimal in live shows, verbose in R&D) |
Matt's architecture replaces our earlier "signal-first pipeline" with a five-layer system. Our existing DSP / emotion / phoneme work from v2.6.0 doesn't go away — it becomes the contents of Layer 1 (Input Engines). Everything above Layer 1 is new infrastructure for governing how that signal data becomes visible captions.
| Component | What it owns | What it must not do |
|---|---|---|
| Input Engines | Generate structured data | Make UI or rendering decisions |
| Rules Engine | Validate data, enforce constraints | Decide how to display anything |
| Policy Engine | Decide rendering behaviour, resolve conflicts, approve enhancements | Invent rules; override constraint priorities |
| Renderer | Execute layout and display | Modify caption content or make creative calls |
| Enhancement Layer | Propose controlled modifications | Enforce behaviour independently |
Lower-priority behaviours must yield to higher-priority constraints without exception.
| Class | Name | Role | Who uses it |
|---|---|---|---|
| A | Hard accessibility | Must never be violated. Defines the floor. | Both apps |
| B | Preferred presentation | Strong defaults with context-sensitive exceptions | Both apps; stricter in Echo |
| C | Controlled expressive | Limited stylistic behaviour already recognised by BBC guidelines | Enabled in Enhanced/Immersive profiles |
| D | Project-specific advanced | Features beyond BBC (reactive typography, WallSpace integration, immersive placement) | WallSpace-specific; override path |
| Profile | Description | Primary use |
|---|---|---|
| Compliance / Broadcast-safe | Strict BBC adherence. No experimental enhancements. | A.EYE.ECHO default; conference captioning |
| Accessible Enhanced | Baseline + limited semantic additions (emotion tags, sound annotation) | A.EYE.ECHO daily use; podcast post-processing |
| Live / Low Latency | Immediacy prioritised; relaxed segmentation; readability floor held | Live conferences, news, Q&A |
| Immersive / Spatial | Spatial placement and motion enabled with motion constraints + fallback | VJ shows, art installations, VR/AR |
| Experimental / Expressive | Advanced enhancements unlocked; Class D rules available | WallSpace R&D, experimental performance |
A sixth profile — Offline / Post-Processing — will likely be added for podcasts, audiobooks, and recorded interview analysis. Flagged as a gap in Matt's initial spec; to discuss.
This session focused on making the voice analysis and emotion detection pipeline actually work end-to-end, and then deeply researching where to take it next. Here are the concrete improvements shipping in v2.6.0:
NEW: Added autocorrelation-based fundamental frequency (F0) detection covering 80-500 Hz (bass to soprano). New consonant transient detector catches hard attacks (p/t/k/s sounds) via energy delta analysis. Pitch direction tracking (rising/falling/stable) enables question detection and excitement mapping.
NEW: When someone says positive words ("that's great") with negative vocal tone (falling pitch, low energy), the system now detects the contradiction and trusts the voice over the text, because tone is harder to fake than words.
FIX: Voice analysis was being extracted but wasn't reaching the emotion scoring engine. Fixed auto-enable so voice features flow through to emotion blending automatically. No more silent failures.
IMPROVED: Transient detection now accumulates over a 1-second rolling window instead of just catching individual spikes. This catches patterns like rapid-fire consonants in angry speech — 3+ transients/sec flags anger. Threshold lowered and strength shown in debug panel for tuning.
DEBUG: New collapsible debug panel shows real-time emotion scores, voice feature values, and blending weights. Polls voice features at 200ms intervals (not just on text events) so you can see the voice emotion shifting even between words. Essential for tuning and demonstrating the system.
MATT'S TASKS: Added emotion test triggers for development, debug state inspection, and temporal alignment between voice features and text events so emotions don't lag behind speech (Matt's Tasks 3-5).
All references below to multi-engine ASR, streaming Web Speech, or native-speech bridges describe historical v2.6.0 state and are retained for context only. The live pipeline is Whisper-only until Moonshine v2 lands.
Here's everything the voice analysis pipeline can do right now, before the next-gen upgrades. Rows marked Abandoned v2.6.1 are kept in the table so the evolution from v2.6.0 is traceable.
| Layer | Implementation | Status |
|---|---|---|
| ASR (Primary) | whisper.cpp subprocess (tiny.en / base.en / small.en) | Working, 2-3s latency |
| ASR (Streaming) | Web Speech API | Abandoned v2.6.1 — crashes renderer in Electron |
| ASR (Native) | SFSpeechRecognizer (native addon) | Abandoned v2.6.1 — crashes on both Rosetta & arm64 |
| ASR (Next-Gen) | Moonshine v2 (local, streaming, replaces all above) | Planned — Phase 1 |
| Music Lyrics | Shazam fingerprint + LRCLIB synced lyrics | Working |
| DSP Features | VoiceFeatureExtractor (18 features @ 100ms) | Working |
| Emotion (Text) | Lexicon-based (800+ terms, phrase matching, negation) | Working |
| Emotion (Voice) | Heuristic rules (pitch, energy, transients, volume) | Working |
| Emotion (Blend) | Weighted fusion + sarcasm detection | NEW in v2.6.0 |
| Beat/Kick | FFT onset detection, MIDI clock, tap tempo | Working |
"You take a level obviously not just, you cross reference the text with a kind of sonic analysis
and you try to provide a tone of voice tag... These models will do consonants really well.
Transient analysis is very important."
— Gadi Sassoon, DSP Engineer (25 years, Berkeley College of Music)
Not all commonly cited Whisper limitations are actually observed problems in WallSpace. Some come from Gadi's consultation (his experience with call transcripts and offline tools), others are confirmed in our codebase with specific mitigations in place.
| Limitation | Source | WallSpace Status |
|---|---|---|
| 2-3 second latency | Confirmed | Actively measured via transcriptionService.ts latency tracking. Default 5s chunks + inference time. Compensated with manual latency offset slider (-500 to +500ms) and auto-calibrate button. |
| Hallucinations on silence | Confirmed | Observed enough to hardcode filter patterns in whisperBridge.ts: (music), (applause), "you", "thank you", dot strings. Mitigated via silence detection (RMS < -60dB skips transcription entirely) and hallucination filtering before display. |
| Queue drops | Confirmed | Explicit backpressure logic in whisperBridge.ts: if queue > 1 item, oldest chunk is dropped. Comment in code: "Whisper is slower than real-time." Intentional trade-off to prevent OOM in live streaming. |
| No emotion data | Factual | By design — Whisper is ASR-only, outputs text with no tone/emotion metadata. Workaround in place: parallel DSP pipeline (VoiceFeatureExtractor) + text lexicon analysis provide emotion independently. |
| Drops consonants | From Gadi | Reported by Gadi from his call transcript experience, not from WallSpace bug reports. Our transient detection (v2.6.0) monitors consonant attacks via DSP but doesn't currently correct Whisper output. Phoneme-level correction planned for Phase 4. |
| Accent struggles | From Gadi | Gadi mentioned struggles with "Globish" and non-native speakers from his tools. No evidence of this in WallSpace. Multilingual models (tiny/base/small) are available alongside English-only variants. No bug reports or workarounds for accents. |
| Gap | Description | Current State |
|---|---|---|
| Signal-First | DSP should lead; text transcription is secondary to the audio signal | Partial |
| Transient Analysis | Consonant edges carry meaning ASR models miss entirely | Basic (v2.6.0) |
| Tone-of-Voice Tags | Cross-reference text with sonic analysis for tone metadata | Basic (v2.6.0) |
| Formant Analysis | Vowel structure (F1/F2/F3) for accent/speaker profiling | Not implemented |
| Decision Matrix | Not a linear pipeline but a matrix of DSP + emotion + semantics | Not implemented |
| Clockless Analysis | Real-time requires careful buffering strategy decisions | Partial |
100x faster than Whisper Large v3 on MacBook Pro (107ms vs 11,286ms). Better WER accuracy. Streaming encoder with sliding-window attention for bounded low-latency. Incremental audio caching — subsequent calls only process new audio. Sizes from 26M (Tiny) to 245M (Medium Streaming). Same subprocess integration pattern as current whisper.cpp.
CoreML-compiled Whisper on Apple Neural Engine. OpenAI-compatible HTTP local server — can be bundled as Electron subprocess. Streaming, word timestamps, VAD, speaker diarization built in.
ASR/TTS/VAD/diarization via ONNX Runtime in WebAssembly (50KB gzipped). Could run speech recognition directly in Electron renderer — no subprocess needed. 12 language bindings, fully offline.
| Service | WER | Streaming Latency | Price | Notes |
|---|---|---|---|---|
| Deepgram Nova-3 | ~5.26% | <300ms | $0.0077/min | $200 free credit |
| AssemblyAI Universal-3 | ~6.68% | ~150ms P50 | ~$0.01/min | 30% fewer hallucinations than Whisper |
| GPT-4o-mini-transcribe | Better than Whisper | Low | ~$0.006/min | WebSocket streaming, accent-resilient |
| Google Chirp 3 | Competitive | Low | Usage-based | Built-in denoiser, speaker diarization |
These models understand raw audio directly — tone, emotion, background noise — not just convert speech to text. This is the paradigm shift Gadi described.
End-to-end audio understanding: tone, emotion, background noise filtering. Responds to user's tone of voice. Live API with bidirectional audio streaming.
Speech + natural sounds + music in one encoder. Voice chat mode (no text needed) + audio analysis mode. Excels at ASR, emotion recognition, acoustic scene classification.
Combined ASR + emotion recognition + audio event detection in one model. Could replace both Whisper AND heuristic emotion detection.
Dedicated emotion classifier: angry, happy, neutral, sad. Lightweight, runs alongside existing ASR pipeline. Multiple model sizes.
Single binary codebook at 0.16-0.65 kbps. Preserves speaker identity AND emotion in reconstructed speech. Outperforms SpeechTokenizer, Mimi, EnCodec. Use case: encode vocal characteristics into compact tokens for speaker profiling, emotion encoding, and network transmission.
There is no off-the-shelf "consonant transient detector" ML model. ASR models like Whisper treat audio as a sequence of words — they don't preserve the signal-level detail of how those words were spoken. The consonant edges (the p/t/k/s attacks) carry critical emotional and clarity information that gets discarded in the text-only pipeline. Our current energy-delta approach is a good start. The upgrade path adds formant analysis, MFCC features, spectral flux, and eventually phoneme classification.
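For reference, here is a minimal TypeScript sketch of the energy-delta approach combined with the 1-second rolling window from the v2.6.0 notes. The frame size, thresholds, and the 3+/sec anger cue are illustrative assumptions, not the values shipped in VoiceFeatureExtractor.

```typescript
// Illustrative sketch of the energy-delta transient approach described above.
// Frame size, thresholds, and the rolling-window rate are assumptions for
// demonstration; the shipped VoiceFeatureExtractor values may differ.
interface TransientEvent {
  timeMs: number;
  strength: number; // energy jump relative to the previous frame
}

class TransientDetector {
  private prevRms = 0;
  private events: TransientEvent[] = [];

  constructor(
    private deltaThreshold = 0.12, // minimum RMS jump to count as a hard attack
    private windowMs = 1000        // rolling window used for rate estimation
  ) {}

  /** Feed one ~100ms analysis frame of PCM samples. */
  processFrame(samples: Float32Array, timeMs: number): void {
    let sum = 0;
    for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
    const rms = Math.sqrt(sum / samples.length);

    const delta = rms - this.prevRms;
    if (delta > this.deltaThreshold) {
      this.events.push({ timeMs, strength: delta });
    }
    this.prevRms = rms;

    // Drop events that have fallen outside the rolling window.
    const cutoff = timeMs - this.windowMs;
    this.events = this.events.filter((e) => e.timeMs >= cutoff);
  }

  /** Transients per second over the rolling window (3+/sec is the anger cue). */
  ratePerSecond(): number {
    return this.events.length / (this.windowMs / 1000);
  }
}
```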
| Feature | What It Does | Why It Matters | Phase |
|---|---|---|---|
| Formant Extraction (F1/F2/F3) | LPC spectral envelope peak-picking | Vowel height/frontness, accent profiling, speaker ID | Phase 2 |
| MFCC (13 coefficients) | Mel filterbank + DCT | Phoneme classification, consonant type detection | Phase 2 |
| Spectral Flux | Frame-to-frame spectral change | More robust consonant edge detection in noise | Phase 2 |
| Harmonic-to-Noise Ratio | Voiced vs unvoiced segment detection | Distinguish vowels from consonants precisely | Phase 2 |
| wav2vec2 Phoneme Classifier | ONNX model for phoneme-level detection | Classify specific consonants (p/t/k/b/d/g/s/z/f/v) | Phase 4 |
| Montreal Forced Aligner | Post-hoc phoneme-transcript alignment | Find where consonants were dropped/mumbled | Phase 4 |
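To make the Phase 2 "spectral flux" row concrete, this is roughly what the computation looks like. The magnitude spectra are assumed to come from the existing Web Audio analyser, and the normalisation is a placeholder.

```typescript
// Spectral flux: sum of positive frame-to-frame magnitude changes.
// Bin count and normalisation here are illustrative; in WallSpace the spectra
// would come from the existing Web Audio AnalyserNode frames.
function spectralFlux(prev: Float32Array, curr: Float32Array): number {
  let flux = 0;
  const bins = Math.min(prev.length, curr.length);
  for (let i = 0; i < bins; i++) {
    const diff = curr[i] - prev[i];
    if (diff > 0) flux += diff; // only rising energy marks an onset / consonant edge
  }
  return flux / bins; // normalise so the threshold is independent of FFT size
}
```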
Per Matt's architecture paper, build order prioritises a stable, auditable core before any innovation layers. This replaces the earlier Phase 1–5 capability roadmap from v2.6.0. The capability work (Moonshine, formants, emotion2vec, phoneme analysis) now fits inside Layer 1 (Input Engines) — so those items move inside Phase 1 and Phase 6, not independent phases.
Phase 1: Rules Matrix. Goal: Convert BBC Subtitle Guidelines into a machine-readable rules matrix. Deliverable: rules-matrix.json consumable by the Rules Engine.
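To make the deliverable concrete, here is one possible shape for a rules-matrix entry, sketched as a TypeScript type with a single example rule. Field names, the rule id, and the 37-character value are illustrative assumptions; the real schema is Matt's Phase 1 deliverable.

```typescript
// Hypothetical shape for one rules-matrix entry. Field names and the sample
// rule are illustrative only; the actual schema is defined in Phase 1.
type RuleClass = 'A' | 'B' | 'C' | 'D';

interface CaptionRule {
  id: string;               // stable identifier, e.g. "BBC-LINE-LENGTH"
  ruleClass: RuleClass;     // A = hard accessibility floor, D = project-specific
  description: string;      // human-readable summary of the guideline
  constraint: {
    property: string;       // what the Rules Engine measures
    operator: '<=' | '>=' | '==';
    value: number | string;
  };
  overridableIn: string[];  // profiles allowed to relax it (empty for Class A)
}

const exampleRule: CaptionRule = {
  id: 'BBC-LINE-LENGTH',
  ruleClass: 'A',
  description: 'Maximum characters per caption line',
  // 37 is the commonly cited teletext-era figure; confirm against the current guidelines.
  constraint: { property: 'line_length_chars', operator: '<=', value: 37 },
  overridableIn: [], // Class A: never overridden, in any profile
};
```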
Phase 2: Canonical Caption Data Model. Goal: One normalised schema every input engine writes to and every downstream layer reads.
Phase 3: Rules Engine. Goal: Consumes the Rules Matrix + Canonical Model; emits constraints + scores.
Phase 4: Policy Engine. Goal: Central decision authority — resolves conflicts, applies profiles.
Phase 5: Renderer. Goal: Deterministic renderer that strictly follows Policy Engine instructions.
Phase 6: Enhancement Layer. Goal: Emotion, prosody, reactive typography, WallSpace integration, spatial placement. All as proposals that the Policy Engine approves / modifies / rejects.
Phase 7: Testing Suite. Goal: Every enhancement regression-tested against the compliance floor.
Concerns flagged in review of Matt's architecture, and how we plan to address each.
| Concern | Why it matters | Mitigation |
|---|---|---|
| Performance envelope | Five layers + enhancement + decision log could exceed our <200ms target | Decision Log spec defines minimal / standard / verbose levels. Live profile uses minimal. Validate <200ms before Phase 5 ships. |
| BBC floor vs WallSpace flex | Live VJ, small-audience, and experimental contexts may need different constraints than deaf-accessibility defaults | Class D (project-specific) rules + WallSpace-specific profiles. Class A rules still hold. Every override logged. To review with Matt. |
| Rewrite cost | v2.6.0 has shipping signal-first code; seven phases looks like big-bang | Acceptable — Jack and Matt are effectively the only users of latest. v2.6.0 work was a single-day exploration; rewrite is fine if the architecture unlocks something better. ML-assisted layers will replace hand-tuned heuristics anyway. |
| Profile-switching UX undefined | How does a user move from Compliance to Immersive mid-session? | Likely per-caption-layer in WallSpace (each layer gets a profile); per-session-default in A.EYE.ECHO. To be designed in Phase 4. |
| Offline audio profile missing | Podcasts, audiobooks, recorded interviews have fundamentally different latency constraints | Add a sixth Offline / Post-Processing profile. Same rule/policy framework, relaxed timing. Flagged for discussion with Matt. |
| Decision-log volume in live contexts | 30+ caption updates/sec × full log = heavy I/O | Live profile uses minimal logging + sampling (Decision Log spec §13–14). Log material decisions only. Verbose mode available for R&D. |
| Claude's role inside the system | Matt specifies Claude as a constrained reasoning engine, not a generative assistant | Accepted. Claude does rule extraction, compliance scoring, gap analysis — always within the architecture, always emitting decision-log-compatible output. |
Matt's Gap → Fix → Implementation Matrix v1 (2026-04-20) flagged that the v3.0.0 architecture was strong but the execution layer was incomplete — no benchmark framework, no migration path from the current 2.6.1 system, an under-specified canonical data model, and an abstract policy engine. This section tightens those pieces so the plan is buildable and testable. Every block below responds to a numbered gap in Matt's matrix.
Gap #10 (Rule Flexibility — Class A/B/C/D formalisation) is intentionally
left open. Matt's matrix calls for explicit definitions of Class B (preferred) and Class C
(expressive), plus a rules.json structure and policy-engine rejection/override
logging. That decision is bound up with the Two-Track Compliance question
(A.EYE.ECHO strict floor vs WallSpace flex profiles) which we still need to agree on in
person before committing the rule classes to code. See the matching row in “Known
Tensions” above.
Without a fixed benchmark suite, every model decision (Whisper vs Moonshine vs Deepgram, DSP improvements, phoneme/consonant work) becomes subjective. We add an offline benchmark harness that scores every candidate model against the same audio, same metrics, same pass/fail thresholds.
Metrics recorded per run:
- latency_partial_ms — first partial token
- latency_final_ms — finalised caption
- WER / CER
- hallucination_rate
- speaker_accuracy
- emotion_accuracy (human-rated)
- consonant_confidence_score

Pass/fail thresholds:
- latency_final_ms < 300 (target)
- WER < 10% (clean speech)
- WER < 20% (live noisy)
- hallucination_rate < 2%
- speaker_accuracy > 85%

Ranking order: latency → accuracy → stability (failure rate). We pick the best model per use case (live music, conference, mobile 1-on-1), not a single global winner. Corpus curation — particularly accents and live-music-bleed samples — is a Jack ↔ Matt open item.
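A minimal sketch of how the pass/fail floor and ranking order could be encoded, assuming the metric names above; the result shape and the stability field are illustrative.

```typescript
// Sketch of the offline benchmark pass/fail check using the thresholds above.
// Metric names mirror the list; the result shape is illustrative.
interface BenchmarkResult {
  model: string;              // e.g. "whisper-small.en", "moonshine-v2-base"
  latency_partial_ms: number;
  latency_final_ms: number;
  wer_clean: number;          // 0-1, clean speech corpus
  wer_live: number;           // 0-1, live noisy corpus
  hallucination_rate: number; // 0-1
  speaker_accuracy: number;   // 0-1
  failure_rate: number;       // 0-1, crashed or dropped runs (stability)
}

function passesFloor(r: BenchmarkResult): boolean {
  return (
    r.latency_final_ms < 300 &&
    r.wer_clean < 0.10 &&
    r.wer_live < 0.20 &&
    r.hallucination_rate < 0.02 &&
    r.speaker_accuracy > 0.85
  );
}

// Ranking order from the spec: latency first, then accuracy, then stability.
function rank(results: BenchmarkResult[]): BenchmarkResult[] {
  return [...results]
    .filter(passesFloor)
    .sort(
      (a, b) =>
        a.latency_final_ms - b.latency_final_ms ||
        a.wer_live - b.wer_live ||
        a.failure_rate - b.failure_rate
    );
}
```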
We already have a working v2.6.1 pipeline. Building the v3.0.0 architecture as a hot-swap rewrite is how working systems break. Instead, each layer lands in observe-only mode first, gated behind a feature flag, with a single-flag rollback.
| Phase | What ships | Risk if wrong |
|---|---|---|
| M1 — Canonical adapter | Wrap current pipeline so it emits CanonicalCaption objects alongside existing output. No behaviour change. | None — current renderer still drives output. |
| M2 — Rules Engine observe-only | Evaluate every caption against rules matrix. Log violations only. Does not affect what the user sees. | None — log volume only. |
| M3 — Policy Engine shadow mode | Generate RenderInstruction decisions for every caption. Do not apply them. Compare against live output in dashboards. | None — decisions written to decision log only. |
| M4 — Dual rendering | Run current renderer live + policy renderer to a hidden test surface. Visual A/B diff. | GPU cost; mitigated by sampling. |
| M5 — Feature flag cutover | ENABLE_POLICY_RENDER=true flips live output to the new stack. | Mitigated by mandatory rollback — a single flag reverts the entire new stack and the system immediately runs on the unchanged 2.6.1 pipeline. |
Rollback contract (mandatory): no phase is allowed to land without a verified one-flag rollback to the previous phase. Current pipeline code stays in the tree until M5 has been green for an agreed soak period.
Phase 2 listed the fields conceptually. Matt's gap review calls for a strict schema —
streaming (is_partial + revision_id), token-level timing,
per-token confidence, overlapping speakers, explicit uncertainty flags, audio-context typing,
and source-engine tracking. The TypeScript interface below is the canonical definition every
input engine writes and every downstream layer consumes.
Schema lives as both a TypeScript interface and a JSON schema in the shared
@wallspace/captions-core package consumed by A.EYE.ECHO and WallSpace.
Validation runs in CI; any engine emitting a non-conforming object fails the build.
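As a placeholder until the committed @wallspace/captions-core definition lands, here is a sketch of the kind of interface the gap review calls for. Field names and types are assumptions that cover the requirements listed above, not the agreed schema.

```typescript
// Illustrative sketch only; the committed interface lives in
// @wallspace/captions-core. Fields cover the requirements listed above:
// streaming revisions, token timing, per-token confidence, overlapping
// speakers, uncertainty, audio context, and source-engine tracking.
interface CanonicalToken {
  text: string;
  start_ms: number;
  end_ms: number;
  confidence: number;        // 0-1, per token
}

interface CanonicalCaption {
  id: string;
  is_partial: boolean;       // streaming: still being revised
  revision_id: number;       // increments on each revision of the same caption
  tokens: CanonicalToken[];
  speakers: string[];        // more than one entry = overlapping speakers
  uncertain: boolean;        // explicit uncertainty flag for downstream layers
  audio_context: 'speech' | 'music' | 'mixed' | 'silence';
  source_engine: string;     // e.g. "whisper.cpp", "moonshine-v2", "deepgram-nova-3"
}
```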
Phase 4 states the priority order. The gap review wants deterministic process: how we score, how conflicts resolve, when we fall back to a safe mode. This is that spec.
SAFE fallback mode = static bottom-centre placement, no enhancements, reduced font scale, no spatial positioning. It is the pipeline's “degrade gracefully” target; the accessibility-testing hard constraint (below) means SAFE mode must always remain ≥ baseline comprehension.
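A sketch of what the SAFE fallback could look like as a Policy Engine output, assuming a hypothetical RenderInstruction shape (the real one is Phase 4 work).

```typescript
// Hypothetical RenderInstruction shape; the real spec is Phase 4 work.
interface RenderInstruction {
  placement: 'bottom-center' | 'spatial';
  fontScale: number;          // 1.0 = baseline size
  enhancements: string[];     // emotion tint, reactive typography, etc.
  motion: boolean;
}

const SAFE_MODE: RenderInstruction = {
  placement: 'bottom-center', // static, no spatial positioning
  fontScale: 0.9,             // reduced font scale per the SAFE description
  enhancements: [],           // no enhancements of any kind
  motion: false,
};
```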
Accessibility is central to the system but we have no structured way to say “feature X made things better/worse.” We define four repeatable scenarios, four metrics, and one hard constraint.
No new feature may reduce comprehension score vs baseline. Regression on any scenario blocks the feature. Results stored as JSON per test run; baseline vs current tracked over time. Matt is primary user-tester for A.EYE.ECHO; WallSpace needs a second deaf/HoH tester cohort (open item).
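A sketch of the per-run JSON record and the hard-constraint check, with placeholder scenario and metric names; only the "no regression vs baseline" rule is taken from the text above.

```typescript
// Sketch of the per-run record described above. Scenario and metric names
// are placeholders; the regression rule is the important part.
interface AccessibilityRun {
  feature: string;               // feature under test
  scenario: string;              // one of the four repeatable scenarios
  comprehension: number;         // 0-100, human-rated
  baselineComprehension: number; // same scenario with the feature disabled
}

/** A feature is blocked if it regresses comprehension on ANY scenario. */
function featureBlocked(runs: AccessibilityRun[]): boolean {
  return runs.some((r) => r.comprehension < r.baselineComprehension);
}
```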
Emotion inference is mentioned conceptually throughout the plan (Gadi's framing). It becomes a real product, ethical, and legal concern the moment it ships. We lock down the rules now, before any ML emotion model lands.
Emotion inference must NEVER be enabled without explicit user awareness. First-run onboarding must show the emotion setting; silent telemetry of emotion data is banned. This complements the Spotify/media compliance guardrails — emotion data is display-only, not a layer, and never leaves the device unless the user opts into a cloud emotion service.
The cloud-ASR path (Deepgram, fal, or a shared WallSpace service) has good architecture but no defined behaviour under failure, latency spikes, or cost overruns. We pin numbers.
The v3.0.0 plan is Apple-centric (Core Audio, Metal, Vulkan-on-Metal). Windows needs an explicit parity plan or it will drift. Matt's own dev environment is Windows-capable — this is not hypothetical.
| Layer | Mac | Windows |
|---|---|---|
| Audio capture | Core Audio / Screen Capture Kit | WASAPI |
| ASR runtime | whisper.cpp subprocess, Moonshine subprocess | Subprocess (Moonshine / Whisper), optional ONNX runtime |
| GPU inference | Metal | Vulkan (if supported), fallback CPU |
Windows parity checklist (per release):
The Jack ↔ Matt split was described in prose. This table makes ownership explicit. Deadlines are TBD pending our in-person meeting — Matt holds the Boomtown + cohort context that should set realistic dates.
| Phase | Owner | Deliverable | Metric | Deadline |
|---|---|---|---|---|
| P1 | Matt | rules-matrix.json (BBC-derived) | Coverage % of BBC guidelines | TBD |
| P2 | Jack | Canonical schema (TS + JSON) | Validation pass in CI | TBD |
| P3 | Both | Rules Engine | Test pass rate | TBD |
| P4 | Jack | Policy Engine | Decision accuracy vs expected | TBD |
| P5 | Jack | Renderer (baseline + SAFE mode) | Visual stability / no reflow | TBD |
| P6 | Both | Enhancement layers (emotion, prosody, spatial) | Regression pass vs Phase 7 suite | TBD |
| P7 | Both | Testing suite (unit + perception) | Full scenario coverage (A/B/C/D) | TBD |
One of the core goals of the Decision Log specification is cross-system consistency between WallSpace and A.EYE.ECHO. The decision log becomes the common language the two apps use to stay coherent even as they evolve on different cadences with different contributors.
A.EYE.ECHO is open-source (MIT). Matt is focusing on pushing that side forward; Jack is focusing on WallSpace and the shared cloud services. Both apps share a caption codebase and this architecture, so work on one benefits the other.
Matt leads the Rules Matrix extraction (Phase 1) given his depth on the BBC document, and owns the A.EYE.ECHO implementation of Phases 2–7. Jack leads shared-service architecture (Layers 2–5 as reusable packages) and the WallSpace-specific Phase 6 enhancement integrations (visual / creative).
| Component | Shared | App-specific |
|---|---|---|
| Rules Matrix | Class A rules | Class D rules per app |
| Canonical Caption Data Model | Schema identical | — |
| Rules Engine | Core engine | — |
| Policy Engine | Core logic + priority order | Profile definitions |
| Renderer | Baseline text rendering logic | Platform-specific output (React Native for Echo, Electron + Scope for WallSpace) |
| Enhancement Layer | Proposal interface | Implementations differ (haptics for Echo; Scope visuals for WallSpace) |
| Decision Log | Schema + constraint vocabulary | Storage backend |
Per Matt's architecture paper, Claude is not used as a general-purpose assistant. It operates strictly within the defined architecture as a constrained reasoning and validation engine. All documents in the spec set (Rules Matrix, Canonical Caption Data Model, Policy Engine, Renderer, Enhancement Layer, Decision Log) are provided to Claude as authoritative inputs.
12x GPU performance boost via Vulkan API. Immediate improvement with no architecture changes.
Frame-to-frame spectral change for consonant edges. More robust than current energy delta method.
Download, benchmark against current whisper.cpp. If it works: immediate 100x speed improvement.
Same Whisper model, 299x faster inference via cloud LPU. Quick test with no local changes needed.
LPC-based formant extraction. Gives us vowel space analysis and accent profiling capability.
$200 free credit. WebSocket streaming API. Could be our cloud fallback for best accuracy.
"The future of voice AI is not better transcription, but deeper audio-native understanding that combines signal processing with semantic reasoning." — Claude Analysis Report
"You cross-reference the text with a kind of sonic analysis and you try to provide a tone-of-voice tag. For instance, they will do consonants really well. Transient analysis is very important." — Gadi Sassoon, Vocal Forensics Consultation
"The hard engineering question is also quite interesting... I used to do vocal synthesis with four months in Csound in 2003. The processes that I have been developing are designed for the original design for video editing." — Gadi Sassoon, on bridging DSP and real-time systems
"I've built a really super crazy stack of agents that has been growing and growing... one of the things they build is basically a models librarian which runs on a constant cron job and scrubs the internet for the latest developments in AI models specifically with a particular interest in audio." — Gadi Sassoon, on staying current with audio AI research
A.EYE.ECHO is a React Native / Expo mobile app (com.wallspace.aeyeecho) built for deaf and hard-of-hearing
accessibility. It uses native speech APIs exclusively — no Whisper, no ML models, no DSP.
The philosophy: leverage what the OS already does well, and focus engineering on accessibility UX.
| Capability | WallSpace (Electron) | A.EYE.ECHO (Mobile) |
|---|---|---|
| Platform | Electron (macOS / Win / Linux) | Expo / React Native (iOS / Android) |
| Speech Engine | Whisper subprocess + Web Speech + Native | expo-speech-recognition (native only) |
| DSP Features | 18 features @ 100ms | None |
| Emotion Analysis | Lexicon + voice + sarcasm | None |
| Translation | CTranslate2 offline → DeepL → LibreTranslate | DeepL → LibreTranslate |
| Speaker ID | Spectral centroid profiling | Camera face + lip-sync correlation |
| Diarization | Spectral centroid shift | Energy + timing heuristics |
| Sign Language | None | ASL (26 letters, Vision hand pose) |
| Haptic Feedback | None | 6 patterns (expo-haptics) |
| URL Ingest | None | YouTube, HLS, direct media |
| Caption Sharing | None | WebSocket relay (room codes) |
| Persistence | Session-only (JSON/SRT) | SQLite (sessions + segments) |
| Beat / Music | FFT onset, MIDI, tap tempo | None |
| Scope Integration | Real-time prompt modifiers | None |
Voice features can drive accessibility-specific outputs on mobile that don't exist yet. Emotion detection maps to haptic intensity patterns — deaf users could feel the emotional tone of speech through their phone's vibration motor. Pitch contour maps to caption text styling (italic for questions, bold for emphasis). Volume maps to caption font size (whisper → shouting). These are novel accessibility features that neither iOS nor Android provide natively.
Beyond the phone: Haptic feedback isn't limited to mobile vibration motors. Deaf and hard-of-hearing audience members at live events may wear haptic wearables — vests (SubPac, Woojer), wristbands (Basslet), or seat transducers — that translate sound into physical sensation. WallSpace could drive these devices from the stage, sending both speech emotion haptics (feel the tone of a speaker) and music-reactive haptics (feel the beat, bass, and dynamics). WallSpace already has beat/kick detection (FFT onset, MIDI clock, tap tempo) and frequency band analysis (sub/bass/mid/high) — this data is ready to drive haptic output.
| Voice Feature | Echo (Phone) | WallSpace (Visuals) | Haptic Wearables (Live Events) | Feasibility |
|---|---|---|---|---|
| Emotion | Phone vibration patterns | Caption color tint + Scope prompts | Vest/wristband intensity + zone mapping | Text lexicon (free) |
| Pitch direction | Caption styling (italic) | Caption styling + question detection | Rising/falling sensation on body | Light DSP |
| Volume | Caption font size scaling | Caption size + output emphasis | Haptic intensity scaling | Amplitude available |
| Speaking rate | Caption scroll speed | Caption pacing + scene timing | Pulse rhythm matching speech cadence | Timing heuristics |
| Speaker change | Triple-pulse vibration | Speaker label + color switch | Directional haptic (left/right speaker) | Already in both |
| Beat / kick | Not implemented | Visual triggers + scene changes | Bass transducer pulses on beat | WallSpace has FFT onset |
| Frequency bands | Not implemented | Audio-reactive layer effects | Sub/bass/mid/high mapped to body zones | WallSpace has 7 bands |
| Transients | Alert vibration | Scope visual intensity | Sharp tactile clicks on consonants | Needs raw audio |
| Trembling | Gentle double pulse | Visual softening effect | Subtle tremor sensation | Needs DSP or cloud |
| Sarcasm | Visual indicator (~) | Caption annotation + mood shift | Contradictory pulse (sharp then soft) | Needs text + voice sync |
At live music and speech events, deaf audience members increasingly use haptic wearable technology to experience sound physically. WallSpace is uniquely positioned to drive these devices because it already has the audio analysis pipeline running in real-time:
Devices: SubPac M2X (backpack/vest), Woojer Vest Edge, Basslet (wristband),
ButtKicker (seat mount), custom Arduino/ESP32 builds via Bluetooth LE or OSC.
WallSpace's existing OSC bridge (src/main/oscBridge.ts) could output haptic control
messages alongside visual triggers — same data, different output modality.
Available today: expo-av metering (dB level).
Limitation: expo-av provides only dB amplitude, not raw PCM buffers.
For real DSP (pitch, spectral centroid, transients), you need AVAudioEngine.installTap (iOS)
or AudioRecord (Android) via a custom Expo native module.
To run DSP directly on mobile without a cloud service, you'd need custom native modules:
- iOS: AVAudioEngine with installTap(onBus:) for raw PCM buffers, plus the Accelerate framework for vDSP FFT; could run pitch detection + RMS + basic spectral analysis natively
- Android: AudioRecord for raw PCM; basic DSP feasible natively
- A shared onAudioBuffer(Float32Array) callback would allow porting a subset of VoiceFeatureExtractor — but maintaining Swift + Kotlin implementations is significant engineering
WallSpace uses Web Audio API's AudioWorklet for real-time DSP in the Electron renderer process.
React Native has no equivalent. expo-av provides only amplitude metering.
The practical path forward: (a) basic amplitude/timing features locally,
(b) heavy DSP via a shared cloud service that accepts audio chunks and returns enriched data.
This avoids the significant native engineering of building cross-platform audio buffer access.
A cloud service both apps share for heavy audio processing. Mobile gets capabilities it can't run locally. Desktop gets a cloud fallback when local processing is insufficient. Built on the existing wallspace.studio Cloudflare infrastructure.
| Endpoint | Method | Purpose | Auth |
|---|---|---|---|
| wss://wallspace.studio/api/audio/stream | WebSocket | Send audio chunks, receive enriched transcripts in real-time | JWT |
| POST /api/audio/analyze | HTTP | One-shot analysis of an audio buffer (batch mode) | JWT |
| GET /api/audio/models | HTTP | List available processing models and capabilities | Public |
| POST /api/audio/session | HTTP | Create or end a processing session | JWT |
Each active audio session maps to a Durable Object instance. The DO holds: current speaker profile, emotion history (for hysteresis smoothing), accumulated transient buffer (1-second window), session metadata. Audio chunks arrive via WebSocket, get processed by external API (Deepgram), results streamed back. Durable Objects provide: per-session state without external database, WebSocket hibernation (cost-efficient idle sessions), automatic cleanup on disconnect.
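A rough sketch of the per-session Durable Object, assuming @cloudflare/workers-types; the state fields and the analyzeChunk placeholder are illustrative, and hibernation/cleanup are omitted for brevity.

```typescript
// Sketch of the per-session Durable Object described above. Field names and
// the analyzeChunk placeholder are illustrative; DurableObjectState and the
// Response webSocket option come from @cloudflare/workers-types.
type Env = Record<string, unknown>;

export class AudioSessionDO {
  // Emotion history kept for hysteresis smoothing, per the session state list.
  private emotionHistory: Array<{ t: number; emotion: string }> = [];

  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(_request: Request): Promise<Response> {
    // Each active session is one WebSocket carrying audio chunks.
    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair);
    server.accept();

    server.addEventListener('message', async (event) => {
      const chunk = event.data as ArrayBuffer;
      const enriched = await this.analyzeChunk(chunk); // external ASR + emotion
      server.send(JSON.stringify(enriched));
    });

    return new Response(null, { status: 101, webSocket: client });
  }

  private async analyzeChunk(_chunk: ArrayBuffer) {
    // Placeholder: the real implementation forwards to Deepgram and enriches
    // the response with emotion data before streaming it back.
    return { transcript: '', emotion: 'neutral', history: this.emotionHistory.length };
  }
}
```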
Worker receives audio from client, forwards to Deepgram Nova-3, enriches response with emotion data before returning.
Cloudflare Workers AI for on-edge inference. Run whisper-tiny or emotion classification directly on Cloudflare's GPU fleet.
Both apps already share the wallspace.studio domain. The existing JWT auth system
(/api/auth/login, /api/auth/me, Google/GitHub/Apple SSO) authenticates
both Electron and mobile clients. The mobile app stores the JWT in expo-secure-store.
The existing _shared.ts auth helper already has CORS support and token verification.
No new auth infrastructure needed.
WallSpace's translationService.ts was literally ported from A.EYE.ECHO.
The two files are 90% identical — same DeepL client, same LibreTranslate fallback, same LRU cache.
Time to extract shared code into packages both apps import from.
| Service | WallSpace File | Echo File | Overlap | Action |
|---|---|---|---|---|
| Translation | renderer/services/translationService.ts | src/services/translationService.ts | 90% | Extract @wallspace/translation |
| Emotion Lexicon | renderer/utils/sentimentAnalyzer.ts | none | Pure TS | Copy to Echo (zero deps) |
| Caption Network | none | src/services/captionNetworkService.ts | Echo only | Port to WallSpace |
| Diarization | VoiceFeatureExtractor (spectral) | src/services/audioDiarization.ts | Different approach | Merge: timing + spectral |
| Vibration | none | src/services/vibrationService.ts | Echo only | Add emotion → haptic map |
| DB Schema | session-only (JSON/SRT) | src/services/database.ts | Partial | Align with cloud D1 schema |
| Types | Various renderer types | src/types/index.ts | ~80% | Extract @wallspace/types |
@wallspace/translation
- TranslationCache (LRU, 200 entries)
Platform-specific bits stay separate: WallSpace keeps CTranslate2 offline via electronAPI; Echo keeps expo-constants API key resolution.
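A sketch of the shared translation entry point this package could expose, assuming hypothetical primary/fallback client signatures; only the DeepL-then-LibreTranslate order and the 200-entry cache size come from the existing code.

```typescript
// Sketch only: the Translator signatures are hypothetical stand-ins for the
// DeepL and LibreTranslate clients; cache size (200) comes from existing code.
type Translator = (text: string, target: string) => Promise<string>;

async function translate(
  text: string,
  target: string,
  cache: Map<string, string>, // simple FIFO-style cache (the real TranslationCache is a proper LRU)
  primary: Translator,        // DeepL client
  fallback: Translator        // LibreTranslate client
): Promise<string> {
  const key = `${target}:${text}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit;

  let result: string;
  try {
    result = await primary(text, target);
  } catch {
    result = await fallback(text, target); // same fallback chain both apps use
  }

  cache.set(key, result);
  if (cache.size > 200) {
    // Evict the oldest entry (Map preserves insertion order).
    cache.delete(cache.keys().next().value as string);
  }
  return result;
}
```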
@wallspace/types
- TranscriptSegment
- Speaker, SpeakerProfile
- Emotion, EmotionResult
- TranscriptionStatus
- WhisperLanguage
Both apps define these nearly identically.
Union them (Echo adds source?: 'speech' | 'sign-language').
@wallspace/emotion
- WEIGHTED_LEXICON (800+ terms)
- analyzeTextEmotion()
- EMOTION_VISUALS mapping
- emotionToHaptic() (new)
Pure TypeScript, zero DOM dependencies.
Runs in React Native without modification.
Add emotionToHaptic mapping for mobile.
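A sketch of what emotionToHaptic() could look like; the pattern values are assumptions to be tuned with deaf/HoH testers, and the Emotion union mirrors the four classes mentioned earlier (angry, happy, neutral, sad).

```typescript
// Sketch of the proposed emotionToHaptic() addition. Pattern values are
// assumptions to be tuned with testers; on Echo they would map to expo-haptics
// calls, on WallSpace to OSC messages for wearables.
type Emotion = 'happy' | 'sad' | 'angry' | 'neutral';

interface HapticPattern {
  pulses: number;    // number of pulses in the pattern
  intensity: number; // 0-1, scaled by confidence below
  gapMs: number;     // spacing between pulses
}

function emotionToHaptic(emotion: Emotion, confidence: number): HapticPattern {
  const base: Record<Emotion, HapticPattern> = {
    happy:   { pulses: 2, intensity: 0.6, gapMs: 120 },
    sad:     { pulses: 1, intensity: 0.3, gapMs: 0 },
    angry:   { pulses: 3, intensity: 0.9, gapMs: 80 }, // rapid and strong, mirrors the transient-rate cue
    neutral: { pulses: 1, intensity: 0.4, gapMs: 0 },
  };
  const p = base[emotion];
  // Scale intensity by detection confidence so uncertain reads stay subtle.
  return { ...p, intensity: p.intensity * confidence };
}
```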
Option A (recommended start): npm workspace monorepo — add packages/ directory
to crt-wall-controller, reference from both projects. Simpler dev workflow, instant iteration.
Option B (later): Published @wallspace/ scoped packages on npm — cleaner separation,
works with any project structure, but adds publish/version overhead.
Start with A for speed, migrate to B when the shared API is stable.
WallSpace.Studio is a commercial creative tool. A.EYE.ECHO is open-source (MIT-licensed) so it stays freely available to the deaf community. Matt's going to focus on pushing A.EYE.ECHO forward while Jack focuses on WallSpace and the shared cloud services, so both sides of the caption system move in parallel and benefit from the same underlying work.
The two surfaces share a caption codebase but have different funding targets. Cloud ASR services (Deepgram, AssemblyAI, Workers AI) have per-minute costs that add up — especially for a service intended to stay free for deaf users. We start with what costs nothing: native OS speech APIs, existing code reuse, and pure TypeScript logic that runs on any platform. Paid cloud services come later, funded either through grants / community support on the A.EYE.ECHO side, or via WallSpace commercial revenue subsidizing both.
WallSpace has a complete native SFSpeechRecognizer implementation —
the same free, on-device Apple speech engine that Echo uses. It's fully built:
native C++/Objective-C addon (native/speech-recognition/src/speech_mac.mm),
main process bridge (src/main/nativeSpeechBridge.ts),
renderer service (src/renderer/services/nativeSpeechEngine.ts),
IPC plumbing, 25+ language support, 55-second auto-restart.
Abandoned in v2.6.1: Despite adding entitlements, signing, notarizing, and switching to native arm64 builds, both SFSpeechRecognizer and Web Speech API consistently crash the Electron renderer process. Tested on Rosetta x64 (SIGTRAP), arm64 native (grey screen / SIGSEGV), with and without active Apple Developer agreement. The crash occurs at the native addon boundary and is not recoverable via try-catch. Moonshine v2 replaces all three speech engines (Whisper, native, web) with a single local streaming ML model — that is the path forward.
| Requirement | Status | Action |
|---|---|---|
| Native addon (speech_mac.mm) | Complete | No changes needed |
| IPC bridge + renderer service | Complete | No changes needed |
| Hardened runtime | Enabled | Already in electron-builder config |
| Speech entitlement in plist | Added in v2.6.0 | com.apple.security.speech-recognition added to entitlements.mac.plist |
| Apple Developer certificate | Working | Team ID 9K65QDV874 — signing successful |
| Notarized build | v2.6.0 shipped | Signed, notarized, published April 12, 2026 |
Path forward: Moonshine v2 provides local streaming ASR without native addon dependencies — runs as WASM or subprocess, no SFSpeechRecognizer, no Web Speech API, no Electron renderer crash risk. Replaces Whisper too (lower latency, streaming output). Echo continues using native speech APIs on iOS where they work reliably.
Everything in this phase costs nothing — no API keys, no subscriptions, no cloud services. Pure code reuse, native OS capabilities, and TypeScript logic that already exists.
Outcome: Despite full implementation (entitlements, signing, notarization, arm64 native build), SFSpeechRecognizer and Web Speech API both crash the Electron renderer on every tested configuration. Replaced by Moonshine v2 in the roadmap below.
Added the com.apple.security.speech-recognition entitlement. Status: Abandoned. Whisper remains the only ASR engine until Moonshine v2.
Goal: Give Echo text-based emotion analysis with zero additional dependencies
- Copy sentimentAnalyzer.ts to Echo — pure TypeScript, zero DOM or Electron deps
- VibrationService
Effort: Half a day | Cost: $0
Goal: Deaf users feel the emotional tone of speech — on phone and wearable devices
- VibrationService to accept EmotionResult from lexicon
Effort: 1-2 days | Cost: $0 (OSC output is free, wearable hardware is user-provided)
Goal: Visual representation of how loud someone is speaking
- expo-av (dB level)
- volumeCategory in VoiceFeatureExtractor — reuse the thresholds
Effort: Half a day | Cost: $0
Goal: Eliminate 90% duplicated code between both apps
- Extract TranslationCache, DeepL client, LibreTranslate client into @wallspace/translation
Effort: Half a day | Cost: $0
Goal: WallSpace can broadcast/receive captions to/from mobile devices
- Port CaptionNetworkService to WallSpace (WebSocket relay + room codes)
- The existing relay (caption-relay.glitch.me) is free-tier hosted
Effort: 1-2 days | Cost: $0 (Glitch free tier)
Goal: Captions adapt pacing to speech speed
Effort: Half a day | Cost: $0
These features use paid APIs. Development and testing can happen now using free credits and careful usage management. Production-scale always-on use requires funding. The key principle: free engines run by default, paid engines are opt-in and never persist across restarts.
Deepgram is not an always-on replacement for native speech — it's a premium mode you switch on when you need the best accuracy or cloud-based voice emotion analysis, then switch off. Free native ASR handles day-to-day captioning; Deepgram handles demos, live events, and testing.
| Usage Pattern | Monthly Cost | $200 Credit Lasts |
|---|---|---|
| 2-hour live events, 2x/month | $1.85 | ~9 years |
| 1 hour/day testing & development | $14 | ~14 months |
| 4 hours/day regular use | $55 | ~108 days |
| 12 hours/day always-on (avoid this) | $166 | ~36 days |
Deepgram is one of several cost-based APIs in the WallSpace ecosystem (Deepgram, DeepL, fal.ai/RunPod for Scope GPU, future cloud services). A comprehensive API cost management initiative is needed across all paid services — unified cost tracking dashboard, per-service budgets, usage alerts, and spend reporting. Matt already has tickets scoped around this topic. For now, Deepgram follows the same patterns as other cost-based APIs in the app (manual enable, session tracking, warnings). A unified cost management system will be planned and addressed as a separate initiative.
Goal: Validate cloud ASR quality + voice emotion pipeline, manage costs carefully
Effort: 1-2 days | Cost: $0 prefunding ($200 credit), ~$0.0077/min after
Goal: Both apps get enriched captions via wallspace.studio/api/audio
Effort: 1-2 days | Cost: Deepgram usage + Cloudflare Workers (free tier generous)
Goal: Replace Glitch free-tier relay with production-grade infrastructure
- Migrate caption-relay.glitch.me to wallspace.studio Durable Objects
Effort: 1-2 days | Cost: Cloudflare Workers Paid ($5/mo base)
Goal: Replace heuristic emotion rules with trained ML models
Effort: 1-2 weeks | Cost: Workers AI pricing or GPU hosting
Goal: On-device voice feature extraction without cloud dependency
- AVAudioEngine.installTap (iOS) + AudioRecord (Android)
Effort: 1-2 weeks | Cost: $0 (engineering time only)
Goal: Merge camera-based (Echo) + audio-based (WallSpace) speaker identification
Effort: 2+ weeks | Cost: $0 (engineering time only)
SFSpeechRecognizer sessions auto-terminate at ~60s.
Echo already handles this with auto-restart at 55s (SESSION_RESTART_MS = 55_000).
WallSpace's native bridge uses the same pattern (speech_mac.mm restarts at 55s).

Total cost of the free phase: $0
Deepgram premium mode: $1-5/mo if toggled sparingly.
Native mobile DSP: $0 runtime, weeks of dev time.
Open question: is expo-secure-store sufficient for JWT storage on mobile?