Vocal Forensics & Next-Gen Audio Intelligence

Signal-First Architecture for Real-Time Captioning & Accessibility
WallSpace.Studio | v3.1.0 Development Plan | Updated April 20, 2026

Scope: one captioning intelligence engine, many use cases

This isn't a caption system purpose-built for VJ shows. It's a general-purpose captioning intelligence service designed to serve deaf-first daily accessibility (A.EYE.ECHO), live music and VJ performance (WallSpace), and the quieter downstream contexts named throughout this plan: conferences, podcasts, audiobooks, and recorded interviews.

We start with live music + visuals because if the pipeline works with music bleed, crowd noise, and multiple speakers, the cleaner use cases are downstream. When you see "VJ audio" or "live-show context" below, read it as "our hardest calibration target," not "our only use case."

8 commits this session · 18 voice features extracted · 7 emotion categories · 3 ASR engines active · <200ms target latency · 7 implementation phases

🔗 Strategic Frame: Matt's Architecture Paper

Since v2.6.0 shipped, Matt (in collaboration with ChatGPT Plus) produced a formal architecture paper that becomes the new strategic spine for this project. We've adopted it as the v3.0.0 foundation. The short version: treat the BBC Subtitle Guidelines as a compliance floor (the minimum safe readability / accessibility behaviour), and allow advanced features (emotion, prosody, WallSpace visuals, immersive placement) only as controlled extensions above that floor.

Core thesis

Standards-based captioning should define the minimum safe and readable behaviour of the system, while advanced features are implemented only as controlled extensions above that baseline. Decision order: readability first, compliance second, enhancement third, expression fourth.

Matt's source documents

The full specification set Matt prepared, referenced throughout this v3.0.0 plan:

  • Advanced Captioning Architecture Paper v3 (PDF): the overarching paper — start here
  • Rules Matrix Specification v1 (PDF): machine-readable rules derived from BBC
  • Presentation Policy Engine Spec v1 (PDF): the decision authority — resolves conflicts, applies profiles
  • Renderer Specification v1 (PDF): layout + display execution; no creative liberties
  • Enhancement Layer Specification v1 (PDF): emotion, prosody, WallSpace integration (proposal-only)
  • Decision Log Specification v1 (PDF): audit trail + WallSpace/Echo common language
  • BBC Subtitle Guidelines v1.2.5 (PDF): baseline standards source — the compliance floor

Two-Track Compliance: A.EYE.ECHO Strict Floor vs WallSpace Flexibility

Matt's architecture uses BBC Subtitle Guidelines as the compliance floor. For the deaf-accessibility use case in A.EYE.ECHO this floor is strict — deaf users depend on it daily, and standards compliance is non-negotiable. For WallSpace's creative / VJ contexts, a caption may legitimately operate with different constraints: a live show, small localized audience, or experimental art installation has different priorities than a daily accessibility conversation.

Slight divergence from Matt's plan — to review together

Matt's paper treats the compliance floor as invariant across all contexts. Jack's read is that for WallSpace — particularly with smaller audiences, localized VJ shows, and experimental art contexts — more flexibility may be appropriate. The architecture already supports this through Rule Class D (project-specific advanced rules) and presentation profiles, but we need to agree on exactly which rules bend and which stay fixed.

Action: Jack to review this framing with Matt. The specific A-class rules that must hold for WallSpace in all modes, and the B/C/D-class rules that can be overridden in Immersive / Experimental WallSpace profiles, need explicit agreement.

How the two tracks split

| | A.EYE.ECHO (deaf-first) | WallSpace (creative) |
|---|---|---|
| Compliance floor | Strict. BBC baseline always enforced. | Flexible within clear limits. Class A rules still hold. |
| Default profile | Compliance or Accessible Enhanced | Live/Low Latency, Immersive, or Experimental |
| Enhancement scope | Limited to readability-preserving additions | Reactive typography, spatial placement, WallSpace visual control allowed |
| Override path | None beyond built-in profiles | Class D rules + per-layer profile selection |
| Logging requirement | Full decision log (verbose) | Configurable per profile (minimal in live shows, verbose in R&D) |

🛠 Five-Layer Architecture

Matt's architecture replaces our earlier "signal-first pipeline" with a five-layer system. Our existing DSP / emotion / phoneme work from v2.6.0 doesn't go away — it becomes the contents of Layer 1 (Input Engines). Everything above Layer 1 is new infrastructure for governing how that signal data becomes visible captions.

Audio In
  |
  v
[Layer 1: Input Engines] — STT, DSP, emotion, phonemes, speaker diarisation, shot detection
  |   Produce structured evidence, never display decisions
  v
[Layer 2: Canonical Caption Data Model] — one normalized schema for everything
  |   Start/end time, text, speaker_id, caption_type, confidence, metadata
  v
[Layer 3: Standards Rules Engine] — validates / scores / repairs against BBC rules
  |   Applies the Rules Matrix, generates constraints
  v
[Layer 4: Presentation Policy Engine] — resolves conflicts, applies profiles
  |   Central decision authority; final instructions for renderer
  v
[Layer 5: Renderer + Enhancement Layers] — executes display, proposes controlled modifications
  |
  +---> [Decision Log] — audit trail, WallSpace/Echo common language

Decision ownership model

| Component | What it owns | What it must not do |
|---|---|---|
| Input Engines | Generate structured data | Make UI or rendering decisions |
| Rules Engine | Validate data, enforce constraints | Decide how to display anything |
| Policy Engine | Decide rendering behaviour, resolve conflicts, approve enhancements | Invent rules; override constraint priorities |
| Renderer | Execute layout and display | Modify caption content or make creative calls |
| Enhancement Layer | Propose controlled modifications | Enforce behaviour independently |

Conflict resolution priority (mandatory, invariant)

  1. Readability and accessibility
  2. Synchronisation and timing accuracy
  3. Visual safety (non-obstruction)
  4. Speaker clarity
  5. Enhancement behaviour

Lower-priority behaviours must yield to higher-priority constraints without exception.
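
The priority order is simple enough to encode directly. A minimal sketch of the invariant as data plus a comparator (type and function names here are illustrative, not from Matt's spec):

// Hypothetical encoding of the mandatory priority order.
type ConstraintCategory =
  | "readability"       // 1. readability and accessibility
  | "timing"            // 2. synchronisation and timing accuracy
  | "visual_safety"     // 3. visual safety (non-obstruction)
  | "speaker_clarity"   // 4. speaker clarity
  | "enhancement"       // 5. enhancement behaviour

const PRIORITY: ConstraintCategory[] = [
  "readability", "timing", "visual_safety", "speaker_clarity", "enhancement",
]

interface Constraint {
  category: ConstraintCategory
  apply: () => void
}

// When two constraints conflict, the lower-priority one yields, without exception.
function winner(a: Constraint, b: Constraint): Constraint {
  return PRIORITY.indexOf(a.category) <= PRIORITY.indexOf(b.category) ? a : b
}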

📋 Rule Classes & Presentation Profiles

Four rule classes

| Class | Name | Role | Who uses it |
|---|---|---|---|
| A | Hard accessibility | Must never be violated. Defines the floor. | Both apps |
| B | Preferred presentation | Strong defaults with context-sensitive exceptions | Both apps; stricter in Echo |
| C | Controlled expressive | Limited stylistic behaviour already recognised by BBC guidelines | Enabled in Enhanced/Immersive profiles |
| D | Project-specific advanced | Features beyond BBC (reactive typography, WallSpace integration, immersive placement) | WallSpace-specific; override path |

Five presentation profiles

| Profile | Description | Primary use |
|---|---|---|
| Compliance / Broadcast-safe | Strict BBC adherence. No experimental enhancements. | A.EYE.ECHO default; conference captioning |
| Accessible Enhanced | Baseline + limited semantic additions (emotion tags, sound annotation) | A.EYE.ECHO daily use; podcast post-processing |
| Live / Low Latency | Immediacy prioritised; relaxed segmentation; readability floor held | Live conferences, news, Q&A |
| Immersive / Spatial | Spatial placement and motion enabled with motion constraints + fallback | VJ shows, art installations, VR/AR |
| Experimental / Expressive | Advanced enhancements unlocked; Class D rules available | WallSpace R&D, experimental performance |

A sixth profile — Offline / Post-Processing — will likely be added for podcasts, audiobooks, and recorded interview analysis. Flagged as a gap in Matt's initial spec; to discuss.

What We Built Today (v2.5.1 → v2.6.0)

This session focused on making the voice analysis and emotion detection pipeline actually work end-to-end, and then deeply researching where to take it next. Here are the concrete improvements shipping in v2.6.0:

F0 Pitch Extraction & Consonant Transients · NEW

Added autocorrelation-based fundamental frequency (F0) detection covering 80-500 Hz (bass to soprano). New consonant transient detector catches hard attacks (p/t/k/s sounds) via energy delta analysis. Pitch direction tracking (rising/falling/stable) enables question detection and excitement mapping.

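For reference, the core of an autocorrelation F0 estimator is only a few lines. A stripped-down sketch over one PCM frame (the shipped VoiceFeatureExtractor is more elaborate; the 80-500 Hz search bounds come from the paragraph above, the 0.3 voicing gate is an invented illustration):

// Minimal autocorrelation pitch estimator for a single audio frame.
function estimateF0(frame: Float32Array, sampleRate: number): number | null {
  const minLag = Math.floor(sampleRate / 500)   // 500 Hz upper bound (soprano)
  const maxLag = Math.floor(sampleRate / 80)    // 80 Hz lower bound (bass)
  let bestLag = 0
  let bestCorr = 0
  for (let lag = minLag; lag <= maxLag; lag++) {
    let corr = 0
    for (let i = 0; i + lag < frame.length; i++) corr += frame[i] * frame[i + lag]
    if (corr > bestCorr) { bestCorr = corr; bestLag = lag }
  }
  // Normalise against total energy as a crude voicing / confidence gate
  let energy = 0
  for (let i = 0; i < frame.length; i++) energy += frame[i] * frame[i]
  if (energy === 0 || bestCorr / energy < 0.3) return null   // unvoiced or silent
  return sampleRate / bestLag
}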

Sarcasm Detection via Text+Voice Contradiction · NEW

When someone says positive words ("that's great") with negative vocal tone (falling pitch, low energy), the system now detects the contradiction and trusts the voice over the text. Because tone is harder to fake than words.

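A toy version of the contradiction rule, to make the idea concrete (the thresholds are invented; the shipped path is a weighted fusion, not a single if-statement):

// Illustrative contradiction check: positive words + negative tone = sarcasm.
interface TextEmotion { valence: number }   // -1 (negative) .. +1 (positive)
interface VoiceState {
  pitchDirection: "rising" | "falling" | "stable"
  energy: number                            // 0 .. 1
}

function detectSarcasm(text: TextEmotion, voice: VoiceState): boolean {
  const positiveWords = text.valence > 0.4
  const negativeTone = voice.pitchDirection === "falling" && voice.energy < 0.3
  // Words and tone contradict: trust the voice and flag sarcasm.
  return positiveWords && negativeTone
}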

Voice Features → Emotion Pipeline Fixed

Voice analysis was being extracted but wasn't reaching the emotion scoring engine. Fixed auto-enable so voice features flow through to emotion blending automatically. No more silent failures.


Transient Accumulation (1-Second Window) · IMPROVED

Transient detection now accumulates over a 1-second rolling window instead of just catching individual spikes. This catches patterns like rapid-fire consonants in angry speech — 3+ transients/sec flags anger. Threshold lowered and strength shown in debug panel for tuning.

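The rolling window is essentially a timestamp queue. A minimal sketch (the 3-per-second anger threshold is from the paragraph above; the class itself is illustrative):

// 1-second rolling window of transient timestamps.
class TransientWindow {
  private hits: number[] = []   // timestamps in ms

  record(nowMs: number): void {
    this.hits.push(nowMs)
    // Evict anything older than one second
    while (this.hits.length > 0 && nowMs - this.hits[0] > 1000) this.hits.shift()
  }

  countLastSecond(): number {
    return this.hits.length
  }

  // 3+ transients/sec is treated upstream as an anger cue
  flagsAnger(): boolean {
    return this.hits.length >= 3
  }
}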

Emotion Scoring Debug Panel

New collapsible debug panel shows real-time emotion scores, voice feature values, and blending weights. Polls voice features at 200ms intervals (not just on text events) so you can see the voice emotion shifting even between words. Essential for tuning and demonstrating the system.


Emotion Test Hook & Temporal Alignment

Matt's Tasks 3-5: Added emotion test triggers for development, debug state inspection, and temporal alignment between voice features and text events so emotions don't lag behind speech.



📌 Current System State (v2.6.1)

What's actually running today (v2.6.1 — first public release)

All references below to multi-engine ASR, streaming Web Speech, or native-speech bridges describe historical v2.6.0 state and are retained for context only. The live pipeline is Whisper-only until Moonshine v2 lands.

Here's everything the voice analysis pipeline can do right now, before the next-gen upgrades. Rows marked Abandoned v2.6.1 are kept in the table so the evolution from v2.6.0 is traceable.

| Layer | Implementation | Status |
|---|---|---|
| ASR (Primary) | whisper.cpp subprocess (tiny.en / base.en / small.en) | Working, 2-3s latency |
| ASR (Streaming) | Web Speech API (Chromium) | Abandoned v2.6.1 — crashes renderer in Electron |
| ASR (Native) | SFSpeechRecognizer bridge (macOS) | Abandoned v2.6.1 — crashes on both Rosetta & arm64 |
| ASR (Next-Gen) | Moonshine v2 (local, streaming, replaces all above) | Planned — Phase 1 |
| Music Lyrics | Shazam fingerprint + LRCLIB synced lyrics | Working |
| DSP Features | VoiceFeatureExtractor (18 features @ 100ms) | Working |
| Emotion (Text) | Lexicon-based (800+ terms, phrase matching, negation) | Working |
| Emotion (Voice) | Heuristic rules (pitch, energy, transients, volume) | Working |
| Emotion (Blend) | Weighted fusion + sarcasm detection | NEW in v2.6.0 |
| Beat/Kick | FFT onset detection, MIDI clock, tap tempo | Working |

Voice Features Extracted in Real-Time

Pitch & Tone

  • F0 frequency (80-500 Hz)
  • Pitch confidence (0-1)
  • Pitch direction (rising/falling/stable)
  • Spectral centroid (brightness)

Energy & Dynamics

  • RMS level (0-1)
  • Voice energy (300-3000 Hz)
  • Volume category (whisper→shouting)
  • Energy variance (trembling)

Transients & Rhythm

  • Transient strength (0-1)
  • Has transient (boolean)
  • Recent transient count (1s window)
  • Zero-crossing rate

Voice Activity

  • Is speaking (VAD)
  • Silence duration (ms)
  • Speaking rate (WPM)
  • Is trembling (emotional)

Speaker ID

  • Speaker change detected
  • Speaker ID index
  • Timbre descriptor
  • Intensity descriptor

Emotion Output

  • 7 emotions + neutral
  • Pastel color mapping
  • Scope prompt modifiers
  • Hysteresis smoothing

🔍 The Problem: Why Whisper Isn't Enough

Key Insight from Gadi Sassoon's Consultation

"You take a level obviously not just, you cross reference the text with a kind of sonic analysis and you try to provide a tone of voice tag... These models will do consonants really well. Transient analysis is very important." — Gadi Sassoon, DSP Engineer (25 years, Berkeley College of Music)

Whisper Limitations — Evidence-Based Assessment

Not all commonly cited Whisper limitations are actually observed problems in WallSpace. Some come from Gadi's consultation (his experience with call transcripts and offline tools), others are confirmed in our codebase with specific mitigations in place.

| Limitation | Source | WallSpace Status |
|---|---|---|
| 2-3 second latency | Confirmed | Actively measured via transcriptionService.ts latency tracking. Default 5s chunks + inference time. Compensated with manual latency offset slider (-500 to +500ms) and auto-calibrate button. |
| Hallucinations on silence | Confirmed | Observed enough to hardcode filter patterns in whisperBridge.ts: (music), (applause), "you", "thank you", dot strings. Mitigated via silence detection (RMS < -60dB skips transcription entirely) and hallucination filtering before display. |
| Queue drops | Confirmed | Explicit backpressure logic in whisperBridge.ts: if queue > 1 item, oldest chunk is dropped. Comment in code: "Whisper is slower than real-time." Intentional trade-off to prevent OOM in live streaming. |
| No emotion data | Factual | By design — Whisper is ASR-only, outputs text with no tone/emotion metadata. Workaround in place: parallel DSP pipeline (VoiceFeatureExtractor) + text lexicon analysis provide emotion independently. |
| Drops consonants | From Gadi | Reported by Gadi from his call transcript experience, not from WallSpace bug reports. Our transient detection (v2.6.0) monitors consonant attacks via DSP but doesn't currently correct Whisper output. Phoneme-level correction planned for Phase 4. |
| Accent struggles | From Gadi | Gadi mentioned struggles with "Globish" and non-native speakers from his tools. No evidence of this in WallSpace. Multilingual models (tiny/base/small) are available alongside English-only variants. No bug reports or workarounds for accents. |
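
The silence and hallucination mitigations in the table reduce to two cheap checks before anything reaches the display. A sketch of that gate (pattern list abridged from the whisperBridge.ts behaviour described above; the exact regexes are illustrative):

// Pre-display gate: skip silent chunks, then drop known Whisper hallucinations.
const HALLUCINATION_PATTERNS: RegExp[] = [
  /^\(music\)$/i,
  /^\(applause\)$/i,
  /^you$/i,
  /^thank you\.?$/i,
  /^\.+$/,            // dot strings
]

function shouldDisplay(text: string, rmsDb: number): boolean {
  if (rmsDb < -60) return false            // silence: skip transcription entirely
  const t = text.trim()
  if (t.length === 0) return false
  return !HALLUCINATION_PATTERNS.some((pattern) => pattern.test(t))
}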

What We Need

Gadi's Gaps Identified

| Gap | Description | Current State |
|---|---|---|
| Signal-First | DSP should lead; text transcription is secondary to the audio signal | Partial |
| Transient Analysis | Consonant edges carry meaning ASR models miss entirely | Basic (v2.6.0) |
| Tone-of-Voice Tags | Cross-reference text with sonic analysis for tone metadata | Basic (v2.6.0) |
| Formant Analysis | Vowel structure (F1/F2/F3) for accent/speaker profiling | Not implemented |
| Decision Matrix | Not a linear pipeline but a matrix of DSP + emotion + semantics | Not implemented |
| Clockless Analysis | Real-time requires careful buffering strategy decisions | Partial |

🤖 Next-Gen Model Landscape

Tier 1: Whisper Replacements (Highest Impact)

Moonshine v2 (TOP PICK)

Local · MIT License · Streaming · 100x Faster

100x faster than Whisper Large v3 on MacBook Pro (107ms vs 11,286ms). Better WER accuracy. Streaming encoder with sliding-window attention for bounded low-latency. Incremental audio caching — subsequent calls only process new audio. Sizes from 26M (Tiny) to 245M (Medium Streaming). Same subprocess integration pattern as current whisper.cpp.

WhisperKit Local Server

Apple Silicon · Neural Engine

CoreML-compiled Whisper on Apple Neural Engine. OpenAI-compatible HTTP local server — can be bundled as Electron subprocess. Streaming, word timestamps, VAD, speaker diarization built in.

Sherpa-ONNX

In-Browser · 50KB WASM

ASR/TTS/VAD/diarization via ONNX Runtime in WebAssembly (50KB gzipped). Could run speech recognition directly in Electron renderer — no subprocess needed. 12 language bindings, fully offline.

Cloud APIs (Best Accuracy)

| Service | WER | Streaming Latency | Price | Notes |
|---|---|---|---|---|
| Deepgram Nova-3 | ~5.26% | <300ms | $0.0077/min | $200 free credit |
| AssemblyAI Universal-3 | ~6.68% | ~150ms P50 | ~$0.01/min | 30% fewer hallucinations than Whisper |
| GPT-4o-mini-transcribe | Better than Whisper | Low | ~$0.006/min | WebSocket streaming, accent-resilient |
| Google Chirp 3 | Competitive | Low | Usage-based | Built-in denoiser, speaker diarization |

Audio-Native LLMs (Beyond Transcription)

These models understand raw audio directly — tone, emotion, background noise — not just convert speech to text. This is the paradigm shift Gadi described.

Gemini 2.5 Native Audio

Cloud · Emotion-Aware

End-to-end audio understanding: tone, emotion, background noise filtering. Responds to user's tone of voice. Live API with bidirectional audio streaming.

Qwen2-Audio (Open Source)

Open Source · Local GPU

Speech + natural sounds + music in one encoder. Voice chat mode (no text needed) + audio analysis mode. Excels at ASR, emotion recognition, acoustic scene classification.

Speech Emotion Recognition (Supplement to ASR)

SenseVoice-Small (Alibaba) · STRONG PICK

Open Source · ASR + Emotion

Combined ASR + emotion recognition + audio event detection in one model. Could replace both Whisper AND heuristic emotion detection.

emotion2vec (FunASR)

Open Source · Lightweight

Dedicated emotion classifier: angry, happy, neutral, sad. Lightweight, runs alongside existing ASR pipeline. Multiple model sizes.

Voice Tokenization (Future)

FocalCodec (State of the Art)

NeurIPS 2025 · Identity + Emotion

Single binary codebook at 0.16-0.65 kbps. Preserves speaker identity AND emotion in reconstructed speech. Outperforms SpeechTokenizer, Mimi, EnCodec. Use case: encode vocal characteristics into compact tokens for speaker profiling, emotion encoding, and network transmission.

🛠 Proposed Architecture: Signal-First Pipeline

Audio In (mic / system audio)
  |
  v
[Layer 1: Audio Ingestion] — 10-40ms chunks, AudioWorklet
  |
  +---> [Layer 2a: DSP Features] — FFT, pitch, formants, MFCC, transients, ZCR
  |       (existing VoiceFeatureExtractor, enhanced)
  |
  +---> [Layer 2b: Voice Tokenization] — FocalCodec (identity + emotion encoding)
  |       (future: compact voice fingerprint)
  |
  +---> [Layer 3: ASR Engine] — Moonshine v2 (local, streaming)
  |                             + Deepgram Nova-3 (cloud fallback)
  |        |
  |        v
  |     Raw Transcript
  |
  +---> [Layer 4: Emotion] — two parallel paths:
  |        |
  |        +---> emotion2vec / SenseVoice (ML-based from audio)
  |        +---> Lexicon analysis (from transcript text)
  |        |
  |        v
  |     [Fusion: weighted blend with sarcasm detection]
  |
  v
[Layer 5: Decision Matrix] — combines all signals:
  - DSP features (pitch contour, transients, formants)
  - ASR text (words, confidence)
  - Emotion (audio + text blended)
  - Speaker ID (voice fingerprint)
  - Beat / music context
  |
  v
[Layer 6: Output] — caption display + Scope prompt modifiers + visual triggers
  |
  +---> [Agentic Loop] — re-analyze ambiguous segments

🎵 Consonant & Transient Analysis — The Gadi Gap

The Core Problem

There is no off-the-shelf "consonant transient detector" ML model. ASR models like Whisper treat audio as a sequence of words — they don't preserve the signal-level detail of how those words were spoken. The consonant edges (the p/t/k/s attacks) carry critical emotional and clarity information that gets discarded in the text-only pipeline. Our current energy-delta approach is a good start. The upgrade path adds formant analysis, MFCC features, spectral flux, and eventually phoneme classification.

DSP Feature Upgrade Path

| Feature | What It Does | Why It Matters | Phase |
|---|---|---|---|
| Formant Extraction (F1/F2/F3) | LPC spectral envelope peak-picking | Vowel height/frontness, accent profiling, speaker ID | Phase 2 |
| MFCC (13 coefficients) | Mel filterbank + DCT | Phoneme classification, consonant type detection | Phase 2 |
| Spectral Flux | Frame-to-frame spectral change | More robust consonant edge detection in noise | Phase 2 |
| Harmonic-to-Noise Ratio | Voiced vs unvoiced segment detection | Distinguish vowels from consonants precisely | Phase 2 |
| wav2vec2 Phoneme Classifier | ONNX model for phoneme-level detection | Classify specific consonants (p/t/k/b/d/g/s/z/f/v) | Phase 4 |
| Montreal Forced Aligner | Post-hoc phoneme-transcript alignment | Find where consonants were dropped/mumbled | Phase 4 |
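
Of these, spectral flux is the simplest to add: it is just the positive frame-to-frame change in the magnitude spectrum. A compact sketch:

// Spectral flux: sum of positive magnitude changes between consecutive FFT frames.
// Sharp consonant attacks (p/t/k/s) spike the flux even in broadband noise,
// which is what makes it more robust than a raw energy delta.
function spectralFlux(prevMagnitudes: Float32Array, currMagnitudes: Float32Array): number {
  let flux = 0
  for (let bin = 0; bin < currMagnitudes.length; bin++) {
    const diff = currMagnitudes[bin] - prevMagnitudes[bin]
    if (diff > 0) flux += diff   // half-wave rectification: onsets only
  }
  return flux
}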

🚀 Seven-Phase Implementation Order

Per Matt's architecture paper, build order prioritises a stable, auditable core before any innovation layers. This replaces the earlier Phase 1–5 capability roadmap from v2.6.0. The capability work (Moonshine, formants, emotion2vec, phoneme analysis) now fits inside Layer 1 (Input Engines) — so those items move inside Phase 1 and Phase 6, not independent phases.

Phase 1: Rules Matrix Extraction (Start Here)

Goal: Convert BBC Subtitle Guidelines into a machine-readable rules matrix.

  • Formalise each BBC rule with ID, source section, class (A/B/C/D), thresholds, evaluation method
  • Tag rules that differ between A.EYE.ECHO (strict) and WallSpace (flex) contexts
  • Mostly a research / writing task — Matt-led given his BBC depth; Jack refines
  • Deliverable: rules-matrix.json consumable by the Rules Engine
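
As a strawman for discussion, one rules-matrix entry might look like this (the field names follow the bullets above; the shape and the threshold number are guesses, not Matt's final schema):

// Illustrative rule entry; shape inferred from the Phase 1 bullets, not final.
interface CaptionRule {
  id: string
  source_section: string          // where in the BBC guidelines the rule comes from
  class: "A" | "B" | "C" | "D"
  threshold?: Record<string, number>
  evaluation: "deterministic" | "heuristic" | "ml_assisted"
  context?: { echo: "strict" | "flex"; wallspace: "strict" | "flex" }
}

const exampleRule: CaptionRule = {
  id: "timing.max_presentation_rate",
  source_section: "BBC Subtitle Guidelines: presentation rate",
  class: "A",
  threshold: { max_words_per_minute: 160 },   // placeholder value, not the BBC figure
  evaluation: "deterministic",
  context: { echo: "strict", wallspace: "flex" },
}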

Phase 2: Canonical Caption Data Model (Foundation)

Goal: One normalised schema every input engine writes to and every downstream layer reads.

  • Required fields: start_time, end_time, text, speaker_id, caption_type
  • Optional: confidence, line-break candidates, shot/scene references, style suggestions, enhancement eligibility
  • TypeScript interface + JSON schema
  • Shared between WallSpace and A.EYE.ECHO via a common package

Phase 3: Standards Rules Engine (Core Infrastructure)

Goal: Consumes the Rules Matrix + Canonical Model; emits constraints + scores.

  • Evaluation methods: deterministic, heuristic, ML-assisted
  • Scoring: pass / warning / fail → aggregated readability / timing / layout / compliance scores
  • Auto-fix strategies: extend_duration, split_caption, resegment, suppress_feature
  • Emits structured constraints that the Policy Engine consumes

Phase 4: Presentation Policy Engine (Decision Authority)

Goal: Central decision authority — resolves conflicts, applies profiles.

  • Mandatory priority order: Readability → Timing → Visual Safety → Speaker Clarity → Enhancement
  • Initial three profiles: Compliance, Accessible Enhanced, Live/Low Latency
  • Immersive + Experimental profiles follow in Phase 6
  • Generates final rendering instructions; outputs to Renderer + Decision Log

Phase 5: Baseline Renderer Hardening (Shipped + Refined)

Goal: Deterministic renderer that strictly follows Policy Engine instructions.

  • Safe region enforcement (avoid faces, UI overlays, clipping)
  • Reflow prevention during active display
  • Collision handling with repositioning / size reduction / fallback
  • Fallback modes: static bottom-centre, reduced font, no enhancements

Phase 6: Guarded Enhancement Layers (Expressive Work)

Goal: Emotion, prosody, reactive typography, WallSpace integration, spatial placement. All as proposals that the Policy Engine approves / modifies / rejects.

  • Input-Engines upgrades: Moonshine v2, emotion2vec, phoneme analysis (wav2vec2), formants/MFCC
  • External visual integration: trigger WallSpace visuals, drive shader parameters, lighting control
  • Immersive + Experimental profiles activated
  • Spatial positioning with motion constraints

Phase 7: Regression Testing (Continuous)

Goal: Every enhancement regression-tested against the compliance floor.

  • Rules tests: detect WPM / line breaks / sync drift / shot straddling / speaker ambiguity
  • Rendering tests: safe zones, font scaling, overlay avoidance, stable alignment
  • Perception tests (human): comprehension, comfort, fatigue, trust, user preference
  • Failure-case scenarios: fallback behaviour verified

Known Tensions & Mitigations

Concerns flagged in review of Matt's architecture, and how we plan to address each.

| Concern | Why it matters | Mitigation |
|---|---|---|
| Performance envelope | Five layers + enhancement + decision log could exceed our <200ms target | Decision Log spec defines minimal / standard / verbose levels. Live profile uses minimal. Validate <200ms before Phase 5 ships. |
| BBC floor vs WallSpace flex | Live VJ, small-audience, and experimental contexts may need different constraints than deaf-accessibility defaults | Class D (project-specific) rules + WallSpace-specific profiles. Class A rules still hold. Every override logged. To review with Matt. |
| Rewrite cost | v2.6.0 has shipping signal-first code; seven phases looks like big-bang | Acceptable — Jack and Matt are effectively the only users of latest. v2.6.0 work was a single-day exploration; rewrite is fine if the architecture unlocks something better. ML-assisted layers will replace hand-tuned heuristics anyway. |
| Profile-switching UX undefined | How does a user move from Compliance to Immersive mid-session? | Likely per-caption-layer in WallSpace (each layer gets a profile); per-session-default in A.EYE.ECHO. To be designed in Phase 4. |
| Offline audio profile missing | Podcasts, audiobooks, recorded interviews have fundamentally different latency constraints | Add a sixth Offline / Post-Processing profile. Same rule/policy framework, relaxed timing. Flagged for discussion with Matt. |
| Decision-log volume in live contexts | 30+ caption updates/sec × full log = heavy I/O | Live profile uses minimal logging + sampling (Decision Log spec §13–14). Log material decisions only. Verbose mode available for R&D. |
| Claude's role inside the system | Matt specifies Claude as a constrained reasoning engine, not a generative assistant | Accepted. Claude does rule extraction, compliance scoring, gap analysis — always within the architecture, always emitting decision-log-compatible output. |

Execution Layer — Matt's Gap Review Response

Matt's Gap → Fix → Implementation Matrix v1 (2026-04-20) flagged that the v3.0.0 architecture was strong but the execution layer was incomplete — no benchmark framework, no migration path from the current 2.6.1 system, an under-specified canonical data model, and an abstract policy engine. This section tightens those pieces so the plan is buildable and testable. Every block below responds to a numbered gap in Matt's matrix.

Deferred to Jack ↔ Matt review

Gap #10 (Rule Flexibility — Class A/B/C/D formalisation) is intentionally left open. Matt's matrix calls for explicit definitions of Class B (preferred) and Class C (expressive), plus a rules.json structure and policy-engine rejection/override logging. That decision is bound up with the Two-Track Compliance question (A.EYE.ECHO strict floor vs WallSpace flex profiles) which we still need to agree on in person before committing the rule classes to code. See the matching row in “Known Tensions” above.

1. Benchmark Framework (Gap #2, CRITICAL)

Without a fixed benchmark suite, every model decision (Whisper vs Moonshine vs Deepgram, DSP improvements, phoneme/consonant work) becomes subjective. We add an offline benchmark harness that scores every candidate model against the same audio, same metrics, same pass/fail thresholds.

/benchmarks
  /audio
    clean_speech.wav
    conference.wav
    live_music_bleed.wav
    accents.wav
    overlapping_speakers.wav
    mumbled_consonants.wav
    sarcasm_cases.wav
  /runs
    <timestamp>/<model>.json
    <timestamp>/<model>.csv
  run.ts        # npm run benchmark

Metrics

  • latency_partial_ms — first partial token
  • latency_final_ms — finalised caption
  • WER / CER
  • hallucination_rate
  • speaker_accuracy
  • emotion_accuracy (human-rated)
  • consonant_confidence_score

Pass / Fail thresholds

  • latency_final_ms < 300 (target)
  • WER < 10% (clean speech)
  • WER < 20% (live noisy)
  • hallucination_rate < 2%
  • speaker_accuracy > 85%
  • Result per model: PASS / WARNING / FAIL

Ranking order: latency → accuracy → stability (failure rate). We pick the best model per use case (live music, conference, mobile 1-on-1), not a single global winner. Corpus curation — particularly accents and live-music-bleed samples — is a Jack ↔ Matt open item.

2. Staged Migration Path, M1 → M5 (Gap #3, CRITICAL)

We already have a working v2.6.1 pipeline. Building the v3.0.0 architecture as a hot-swap rewrite is how working systems break. Instead, each layer lands in observe-only mode first, gated behind a feature flag, with a single-flag rollback.

| Phase | What ships | Risk if wrong |
|---|---|---|
| M1 — Canonical adapter | Wrap current pipeline so it emits CanonicalCaption objects alongside existing output. No behaviour change. | None — current renderer still drives output. |
| M2 — Rules Engine observe-only | Evaluate every caption against rules matrix. Log violations only. Does not affect what the user sees. | None — log volume only. |
| M3 — Policy Engine shadow mode | Generate RenderInstruction decisions for every caption. Do not apply them. Compare against live output in dashboards. | None — decisions written to decision log only. |
| M4 — Dual rendering | Run current renderer live + policy renderer to a hidden test surface. Visual A/B diff. | GPU cost; mitigated by sampling. |
| M5 — Feature flag cutover | ENABLE_POLICY_RENDER=true flips live output to the new stack. | Mitigated by mandatory rollback — a single flag reverts the entire new stack and the system immediately runs on the unchanged 2.6.1 pipeline. |

Rollback contract (mandatory): no phase is allowed to land without a verified one-flag rollback to the previous phase. Current pipeline code stays in the tree until M5 has been green for an agreed soak period.

3. Canonical Caption — Strict Schema (Gap #4, CRITICAL)

Phase 2 listed the fields conceptually. Matt's gap review calls for a strict schema — streaming (is_partial + revision_id), token-level timing, per-token confidence, overlapping speakers, explicit uncertainty flags, audio-context typing, and source-engine tracking. The TypeScript interface below is the canonical definition every input engine writes and every downstream layer consumes.

type CanonicalCaption = {
  id: string
  // ordering (critical for streaming updates)
  sequence_id: number          // ensures correct ordering of updates

  // timing
  start_ms: number
  end_ms: number

  // token-level detail (for precision + phoneme alignment)
  tokens?: {
    text: string
    start_ms: number
    end_ms: number
    confidence?: number
  }[]

  // text state
  text: string
  is_partial: boolean          // true = streaming partial, false = final
  revision_id: number          // increments on each update

  // confidence
  confidence_overall: number
  confidence_tokens?: number[]

  // speaker
  speaker_id?: string
  speaker_confidence?: number

  // audio context
  audio_type: "speech" | "music" | "mixed" | "noise"

  // uncertainty flags (explicit system awareness)
  uncertainty: {
    lexical?: boolean
    timing?: boolean
    speaker?: boolean
    emotion?: boolean
  }

  // source tracking
  source_engine: "whisper" | "moonshine" | "deepgram"

  // timestamps
  created_at: number
  updated_at?: number

  // control + fallback behaviour
  fallback_applied?: boolean
  suppression_reason?: string

  // debug / traceability
  debug?: {
    raw_text?: string
    processing_time_ms: number
  }
}

Schema lives as both a TypeScript interface and a JSON schema in the shared @wallspace/captions-core package consumed by A.EYE.ECHO and WallSpace. Validation runs in CI; any engine emitting a non-conforming object fails the build.

4. Policy Engine — Decision Pipeline (Gap #5, CRITICAL)

Phase 4 states the priority order. The gap review wants a deterministic process: how we score, how conflicts resolve, and when we fall back to a safe mode. This is that spec.

Input:
  - canonicalCaption
  - constraints (from Rules Engine)

Process:
  1. Score:
       readability_score
       timing_score
       safety_score
       speaker_score
  2. Apply priority:
       readability > timing > safety > speaker > enhancement
  3. Resolve conflicts:
       - extend_duration
       - split_caption
       - suppress_enhancement
  4. Fallback trigger:
       if confidence_overall < threshold:
         fallback_mode = SAFE

Output:
  RenderInstruction JSON (+ decision log entry)

SAFE fallback mode = static bottom-centre placement, no enhancements, reduced font scale, no spatial positioning. It is the pipeline's “degrade gracefully” target; the accessibility-testing hard constraint (below) means SAFE mode must always remain ≥ baseline comprehension.

5. Accessibility Testing Framework (Gap #6, HIGH)

Accessibility is central to the system but we have no structured way to say “feature X made things better/worse.” We define four repeatable scenarios, four metrics, and one hard constraint.

Test scenarios

  • A — 1-on-1 conversation
  • B — group discussion (3+ speakers)
  • C — noisy environment (cafe / street)
  • D — live event captions (music, crowd)

Metrics

  • Comprehension accuracy (% correct answers to probe questions)
  • Latency perception (user rating 1–5)
  • Fatigue (time-to-fatigue self-report)
  • Lipreading support (qualitative + rating)

Hard constraint

No new feature may reduce comprehension score vs baseline. Regression on any scenario blocks the feature. Results stored as JSON per test run; baseline vs current tracked over time. Matt is primary user-tester for A.EYE.ECHO; WallSpace needs a second deaf/HoH tester cohort (open item).

6. Emotional Sovereignty — Privacy & Control Model (Gap #7, HIGH)

Emotion inference is mentioned conceptually throughout the plan (Gadi's framing). It becomes a real product, ethical, and legal concern the moment it ships. We lock down the rules now, before any ML emotion model lands.

Config:
  emotion_enabled = true | false
  emotion_storage = "none" | "session" | "persistent"

Default rules:
  - emotion_enabled = ON
  - emotion_storage = "session"   # discarded on app close
  - user can disable completely in settings
  - logs redact emotion fields unless debug mode is explicitly on

UI requirements:
  [ ] Enable emotion detection
  [ ] Store emotion data
  (both surfaced in onboarding + settings, not buried)

Decision log:
  emotion_inference: logged (only if enabled)

Hard rule

Emotion inference must NEVER be enabled without explicit user awareness. First-run onboarding must show the emotion setting; silent telemetry of emotion data is banned. This complements the Spotify/media compliance guardrails — emotion data is display-only, not a layer, and never leaves the device unless the user opts into a cloud emotion service.

7. Cloud Path — Operational Constraints (Gap #8)

The cloud-ASR path (Deepgram, fal, or a shared WallSpace service) has good architecture but no defined behaviour under failure, latency spikes, or cost overruns. We pin numbers.

Latency budget:
  - ingest:    20ms
  - ASR:      100ms
  - response:  50ms
  ─────────────────
  end-to-end < 200ms target

Fallback:
  if WebSocket fails: revert to local Whisper (degraded but available)

Reconnect:
  - exponential backoff
  - session resume (preserve sequence_id)

Cost controls:
  - max minutes per session
  - rate limit per user

Queue:
  - max queue size = 2
  - drop oldest if overflow

Failure priority:
  1. Maintain caption output (even degraded)
  2. Reduce latency
  3. Disable enhancements if required
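
The reconnect contract is worth pinning as code early, since it interacts with the canonical schema's sequence_id. A sketch (function names are illustrative; the backoff curve and session-resume behaviour follow the block above):

// Exponential-backoff reconnect that resumes the session at the last sequence_id.
async function reconnectWithResume(
  connect: (resumeFrom: number) => Promise<WebSocket>,
  lastSequenceId: number,
  maxAttempts = 6,
): Promise<WebSocket> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      // Server is expected to replay results from sequence_id + 1
      return await connect(lastSequenceId)
    } catch {
      // 500ms, 1s, 2s, 4s, ... capped at 10s
      const delayMs = Math.min(500 * 2 ** attempt, 10_000)
      await new Promise((resolve) => setTimeout(resolve, delayMs))
    }
  }
  // Failure priority rule 1: keep captions flowing, even degraded
  throw new Error("cloud ASR unreachable; caller reverts to local Whisper")
}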

8. Windows Path (Gap #9)

The v3.0.0 plan is Apple-centric (Core Audio, Metal, Vulkan-on-Metal). Windows needs an explicit parity plan or it will drift. Matt's own dev environment is Windows-capable — this is not hypothetical.

| Layer | Mac | Windows |
|---|---|---|
| Audio capture | Core Audio / Screen Capture Kit | WASAPI |
| ASR runtime | whisper.cpp subprocess, Moonshine subprocess | Subprocess (Moonshine / Whisper), optional ONNX runtime |
| GPU inference | Metal | Vulkan (if supported), fallback CPU |

Windows parity checklist (per release):

9. Delivery Matrix — Phase → Owner → Deliverable → Metric → Deadline (Gap #11)

The Jack ↔ Matt split was described in prose. This table makes ownership explicit. Deadlines are TBD pending our in-person meeting — Matt holds the Boomtown + cohort context that should set realistic dates.

| Phase | Owner | Deliverable | Metric | Deadline |
|---|---|---|---|---|
| P1 | Matt | rules-matrix.json (BBC-derived) | Coverage % of BBC guidelines | TBD |
| P2 | Jack | Canonical schema (TS + JSON) | Validation pass in CI | TBD |
| P3 | Both | Rules Engine | Test pass rate | TBD |
| P4 | Jack | Policy Engine | Decision accuracy vs expected | TBD |
| P5 | Jack | Renderer (baseline + SAFE mode) | Visual stability / no reflow | TBD |
| P6 | Both | Enhancement layers (emotion, prosody, spatial) | Regression pass vs Phase 7 suite | TBD |
| P7 | Both | Testing suite (unit + perception) | Full scenario coverage (A/B/C/D) | TBD |

🔄 Cross-System Architecture: WallSpace & A.EYE.ECHO

One of the core goals of the Decision Log specification is cross-system consistency between WallSpace and A.EYE.ECHO. The decision log becomes the common language the two apps use to stay coherent even as they evolve on different cadences with different contributors.

Division of work — how Jack and Matt split it

A.EYE.ECHO is open-source (MIT). Matt is focusing on pushing that side forward; Jack is focusing on WallSpace and the shared cloud services. Both apps share a caption codebase and this architecture, so work on one benefits the other.

Matt leads the Rules Matrix extraction (Phase 1) given his depth on the BBC document, and owns the A.EYE.ECHO implementation of Phases 2–7. Jack leads shared-service architecture (Layers 2–5 as reusable packages) and the WallSpace-specific Phase 6 enhancement integrations (visual / creative).

Shared vs app-specific

| Component | Shared | App-specific |
|---|---|---|
| Rules Matrix | Class A rules | Class D rules per app |
| Canonical Caption Data Model | Schema identical | |
| Rules Engine | Core engine | |
| Policy Engine | Core logic + priority order | Profile definitions |
| Renderer | Baseline text rendering logic | Platform-specific output (React Native for Echo, Electron + Scope for WallSpace) |
| Enhancement Layer | Proposal interface | Implementations differ (haptics for Echo; Scope visuals for WallSpace) |
| Decision Log | Schema + constraint vocabulary | Storage backend |

🧠 Role of Claude (Constrained Reasoning Engine)

Per Matt's architecture paper, Claude is not used as a general-purpose assistant. It operates strictly within the defined architecture as a constrained reasoning and validation engine. All documents in the spec set (Rules Matrix, Canonical Caption Data Model, Policy Engine, Renderer, Enhancement Layer, Decision Log) are provided to Claude as authoritative inputs.

Claude must

  • Interpret BBC-derived rules via the Rules Matrix
  • Evaluate caption data against those rules
  • Apply policy-driven decision logic
  • Generate decision-log-compatible output
  • Identify violations, gaps, inconsistencies
  • Propose fixes within system constraints

Claude must not

  • Invent new rules or behaviours outside supplied specs
  • Override constraint priorities
  • Introduce presentation behaviour not governed by the Policy Engine
  • Make decisions without emitting a decision log entry


Quick Wins (This Month)

Upgrade whisper.cpp to v1.8.3

12x GPU performance boost via Vulkan API. Immediate improvement with no architecture changes.

Add Spectral Flux

Frame-to-frame spectral change for consonant edges. More robust than current energy delta method.

Evaluate Moonshine v2

Download, benchmark against current whisper.cpp. If it works: immediate 100x speed improvement.

Try Groq Whisper API

Same Whisper model, 299x faster inference via cloud LPU. Quick test with no local changes needed.

Add Formant Estimation (F1/F2)

LPC-based formant extraction. Gives us vowel space analysis and accent profiling capability.

Test Deepgram Nova-3

$200 free credit. WebSocket streaming API. Could be our cloud fallback for best accuracy.

💬 Key Insights

"The future of voice AI is not better transcription, but deeper audio-native understanding that combines signal processing with semantic reasoning." — Claude Analysis Report
"You cross-reference the text with a kind of sonic analysis and you try to provide a tone-of-voice tag. For instance, they will do consonants really well. Transient analysis is very important." — Gadi Sassoon, Vocal Forensics Consultation
"The hard engineering question is also quite interesting... I used to do vocal synthesis with four months in Csound in 2003. The processes that I have been developing are designed for the original design for video editing." — Gadi Sassoon, on bridging DSP and real-time systems
"I've built a really super crazy stack of agents that has been growing and growing... one of the things they build is basically a models librarian which runs on a constant cron job and scrubs the internet for the latest developments in AI models specifically with a particular interest in audio." — Gadi Sassoon, on staying current with audio AI research

📱 How A.EYE.ECHO Works Today

A.EYE.ECHO is a React Native / Expo mobile app (com.wallspace.aeyeecho) built for deaf and hard-of-hearing accessibility. It uses native speech APIs exclusively — no Whisper, no ML models, no DSP. The philosophy: leverage what the OS already does well, and focus engineering on accessibility UX.

14 services · 2 platforms (iOS/Android) · 26 ASL letters recognized · 6 haptic patterns · 3 audio sources

Mobile Pipeline

Audio In (mic / system audio / URL ingest)
  |
  v
[expo-speech-recognition]
  |
  +---> iOS: SFSpeechRecognizer (on-device, 55s auto-restart)
  +---> Android: Google SpeechRecognizer (on-device)
  |
  v
[TranscriptSegment] — partial + final results, hallucination filtered
  |
  +---> [AudioDiarization] — energy + timing + pause heuristics
  +---> [SpeakerService] — camera face detection + lip-sync correlation
  +---> [TranslationService] — DeepL → LibreTranslate
  +---> [VibrationService] — expo-haptics (6 patterns)
  +---> [CaptionNetworkService] — WebSocket relay broadcast
  +---> [Database] — SQLite persistence (expo-sqlite)
  |
  v
[Caption Display] — face-anchored, speaker-colored, accessible fonts
  +---> [ASL Recognition] — Apple Vision hand pose (21 joints)

Platform Comparison

| Capability | WallSpace (Electron) | A.EYE.ECHO (Mobile) |
|---|---|---|
| Platform | Electron (macOS / Win / Linux) | Expo / React Native (iOS / Android) |
| Speech Engine | Whisper subprocess + Web Speech + Native | expo-speech-recognition (native only) |
| DSP Features | 18 features @ 100ms | None |
| Emotion Analysis | Lexicon + voice + sarcasm | None |
| Translation | CTranslate2 offline → DeepL → LibreTranslate | DeepL → LibreTranslate |
| Speaker ID | Spectral centroid profiling | Camera face + lip-sync correlation |
| Diarization | Spectral centroid shift | Energy + timing heuristics |
| Sign Language | None | ASL (26 letters, Vision hand pose) |
| Haptic Feedback | None | 6 patterns (expo-haptics) |
| URL Ingest | None | YouTube, HLS, direct media |
| Caption Sharing | None | WebSocket relay (room codes) |
| Persistence | Session-only (JSON/SRT) | SQLite (sessions + segments) |
| Beat / Music | FFT onset, MIDI, tap tempo | None |
| Scope Integration | Real-time prompt modifiers | None |

Feature Distribution

Echo Has, WallSpace Doesn't

  • ASL recognition — Apple Vision hand pose, 21 joints, geometry-based finger classification
  • Haptic feedback — 6 vibration patterns (speech start/end, speaker change, punctuation grammar)
  • Face-anchored speakers — MLKit face detection + lip-sync correlation with audio amplitude
  • URL ingest — YouTube (Piped API + react-native-ytdl), HLS streams, direct media files
  • Caption sharing — WebSocket relay with 6-digit room codes for multi-device broadcasting
  • SQLite persistence — full session/segment/speaker history with offline-first architecture
  • Power management — battery-adaptive modes (full / balanced / saver)
  • Accessible fonts — OpenDyslexic, Atkinson Hyperlegible, configurable 24-120pt

WallSpace Has, Echo Doesn't

  • 18 real-time DSP features — F0 pitch, transients, spectral centroid, ZCR, energy variance
  • Emotion analysis — 800+ term weighted lexicon, phrase matching, negation, sarcasm detection
  • Voice+text emotion blending — weighted fusion with confidence scoring
  • Whisper subprocess — multiple model sizes (tiny/base/small), multilingual
  • CTranslate2 offline translation — Opus-MT models, no network needed
  • AudioWorklet processing — real-time PCM capture + DSP on audio thread
  • Scope integration — emotion → AI prompt modifiers for generative visuals
  • Beat/kick detection — FFT onset, MIDI clock, tap tempo for music-reactive visuals

🔨 Enhancing Echo with Vocal Forensics

Key Opportunity

Voice features can drive accessibility-specific outputs on mobile that don't exist yet. Emotion detection maps to haptic intensity patterns — deaf users could feel the emotional tone of speech through their phone's vibration motor. Pitch contour maps to caption text styling (italic for questions, bold for emphasis). Volume maps to caption font size (whisper → shouting). These are novel accessibility features that neither iOS nor Android provide natively.

Beyond the phone: Haptic feedback isn't limited to mobile vibration motors. Deaf and hard-of-hearing audience members at live events may wear haptic wearables — vests (SubPac, Woojer), wristbands (Basslet), or seat transducers — that translate sound into physical sensation. WallSpace could drive these devices from the stage, sending both speech emotion haptics (feel the tone of a speaker) and music-reactive haptics (feel the beat, bass, and dynamics). WallSpace already has beat/kick detection (FFT onset, MIDI clock, tap tempo) and frequency band analysis (sub/bass/mid/high) — this data is ready to drive haptic output.

Voice Feature → Platform-Specific Outputs

| Voice Feature | Echo (Phone) | WallSpace (Visuals) | Haptic Wearables (Live Events) | Feasibility |
|---|---|---|---|---|
| Emotion | Phone vibration patterns | Caption color tint + Scope prompts | Vest/wristband intensity + zone mapping | Text lexicon (free) |
| Pitch direction | Caption styling (italic) | Caption styling + question detection | Rising/falling sensation on body | Light DSP |
| Volume | Caption font size scaling | Caption size + output emphasis | Haptic intensity scaling | Amplitude available |
| Speaking rate | Caption scroll speed | Caption pacing + scene timing | Pulse rhythm matching speech cadence | Timing heuristics |
| Speaker change | Triple-pulse vibration | Speaker label + color switch | Directional haptic (left/right speaker) | Already in both |
| Beat / kick | Not implemented | Visual triggers + scene changes | Bass transducer pulses on beat | WallSpace has FFT onset |
| Frequency bands | Not implemented | Audio-reactive layer effects | Sub/bass/mid/high mapped to body zones | WallSpace has 7 bands |
| Transients | Alert vibration | Scope visual intensity | Sharp tactile clicks on consonants | Needs raw audio |
| Trembling | Gentle double pulse | Visual softening effect | Subtle tremor sensation | Needs DSP or cloud |
| Sarcasm | Visual indicator (~) | Caption annotation + mood shift | Contradictory pulse (sharp then soft) | Needs text + voice sync |

Haptic Wearables: The Live Event Opportunity

At live music and speech events, deaf audience members increasingly use haptic wearable technology to experience sound physically. WallSpace is uniquely positioned to drive these devices because it already has the audio analysis pipeline running in real-time:

Speech Haptics

  • Emotion intensity → vibration strength
  • Speaker identity → directional zones
  • Volume dynamics → amplitude mapping
  • Consonant transients → tactile clicks

Music Haptics

  • Beat/kick detection → bass transducer pulses
  • 7 frequency bands → body zone mapping (sub in chest, highs in wrists)
  • MIDI clock sync → rhythmic patterns
  • Onset detection → dynamic intensity curves

Devices: SubPac M2X (backpack/vest), Woojer Vest Edge, Basslet (wristband), ButtKicker (seat mount), custom Arduino/ESP32 builds via Bluetooth LE or OSC. WallSpace's existing OSC bridge (src/main/oscBridge.ts) could output haptic control messages alongside visual triggers — same data, different output modality.
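
As a sketch of what that output could look like with the node-osc package (the Client API is real; the /haptic address space and argument layout are invented here, since any wearable bridge would define its own):

// Hypothetical haptic trigger over OSC via node-osc.
import { Client } from "node-osc"

const haptics = new Client("192.168.1.50", 9000)   // example wearable-bridge IP/port

// e.g. /haptic/pulse <zone> <intensity 0..1>
function sendBeatPulse(intensity: number, bodyZone: "chest" | "wrist" | "seat"): void {
  haptics.send("/haptic/pulse", bodyZone, intensity)
}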

What Can Run On-Device vs Cloud

Mobile-Feasible (On-Device)

  • RMS amplitude — available via expo-av metering (dB level)
  • Speaking rate — word count / segment duration (timing heuristics)
  • Silence detection — pause duration from segment gaps
  • Text-based emotion — lexicon analysis is pure TypeScript, no audio needed

Limitation: expo-av provides only dB amplitude, not raw PCM buffers. For real DSP (pitch, spectral centroid, transients), you need AVAudioEngine.installTap (iOS) or AudioRecord (Android) via a custom Expo native module.
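
Everything in the on-device list needs nothing beyond segment timing and the dB meter. Speaking rate, for instance, is pure arithmetic over the transcript (a sketch; Segment stands in for Echo's TranscriptSegment):

// Speaking rate from transcript timing alone; no raw audio access required.
interface Segment { text: string; startMs: number; endMs: number }

function speakingRateWpm(segment: Segment): number {
  const words = segment.text.trim().split(/\s+/).filter(Boolean).length
  const minutes = (segment.endMs - segment.startMs) / 60_000
  return minutes > 0 ? words / minutes : 0
}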

Cloud-Required (Too Heavy for Mobile)

  • Full VoiceFeatureExtractor — needs AudioWorklet equivalent (doesn't exist in RN)
  • ML emotion models — emotion2vec, SenseVoice (large ONNX models)
  • Formant analysis — LPC computation on raw audio
  • MFCC extraction — Mel filterbank + DCT on raw spectral data
  • Transient detection — energy delta analysis on raw waveform
  • Sarcasm detection — synchronized text + voice analysis pipeline

Native Audio Analysis Alternatives

To run DSP directly on mobile without a cloud service, you'd need custom native modules: an Expo module exposing AVAudioEngine.installTap on iOS, or AudioRecord on Android.

The AudioWorklet Gap

WallSpace uses Web Audio API's AudioWorklet for real-time DSP in the Electron renderer process. React Native has no equivalent. expo-av provides only amplitude metering. The practical path forward: (a) basic amplitude/timing features locally, (b) heavy DSP via a shared cloud service that accepts audio chunks and returns enriched data. This avoids the significant native engineering of building cross-platform audio buffer access.

Shared Web Service Architecture

A cloud service both apps share for heavy audio processing. Mobile gets capabilities it can't run locally. Desktop gets a cloud fallback when local processing is insufficient. Built on the existing wallspace.studio Cloudflare infrastructure.

A.EYE.ECHO (Mobile)              WallSpace (Electron)
        |                                |
  audio chunks (opus/PCM)          audio chunks (PCM)
     via WebSocket                    via WebSocket
        |                                |
        v                                v
+=========================================================+
|               wallspace.studio/api/audio                |
|           Cloudflare Worker + Durable Object            |
|                                                         |
|    [Deepgram Nova-3]            [emotion2vec]           |
|      streaming ASR              speech emotion          |
|      WER ~5.26%                 ML-based                |
|           |                          |                  |
|           v                          v                  |
|              [Response Assembler]                       |
|    { transcript, emotion, voiceFeatures,                |
|      speakerId, hapticTrigger, confidence }             |
|                                                         |
|    Auth: wallspace.studio JWT (existing)                |
|    Storage: D1 (session logs) + R2 (audio cache)        |
+=========================================================+
        |                                |
        v                                v
  Enriched captions                Cloud fallback
  + haptic triggers                when local DSP unavailable
  + emotion colors

Service Endpoints

| Endpoint | Method | Purpose | Auth |
|---|---|---|---|
| wss://wallspace.studio/api/audio/stream | WebSocket | Send audio chunks, receive enriched transcripts in real-time | JWT |
| POST /api/audio/analyze | HTTP | One-shot analysis of an audio buffer (batch mode) | JWT |
| GET /api/audio/models | HTTP | List available processing models and capabilities | Public |
| POST /api/audio/session | HTTP | Create or end a processing session | JWT |

Cloudflare Durable Objects for Session State

Existing Infra · WebSocket Native · Per-Session State

Each active audio session maps to a Durable Object instance. The DO holds: current speaker profile, emotion history (for hysteresis smoothing), accumulated transient buffer (1-second window), session metadata. Audio chunks arrive via WebSocket, get processed by external API (Deepgram), results streamed back. Durable Objects provide: per-session state without external database, WebSocket hibernation (cost-efficient idle sessions), automatic cleanup on disconnect.
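
A compressed sketch of the Durable Object shape (WebSocketPair, accept(), and the 101 upgrade response are the real Workers API; the session fields mirror the paragraph above, and the processing step is stubbed):

// Per-session Durable Object: one instance per active audio session.
export class AudioSession {
  private emotionHistory: string[] = []    // hysteresis smoothing
  private transientBuffer: number[] = []   // 1-second accumulation window

  async fetch(request: Request): Promise<Response> {
    if (request.headers.get("Upgrade") !== "websocket") {
      return new Response("expected websocket", { status: 426 })
    }
    const pair = new WebSocketPair()
    const [client, server] = [pair[0], pair[1]]
    server.accept()
    server.addEventListener("message", async (event) => {
      // Forward the chunk to the ASR/emotion backends, stream the result back
      const enriched = await this.processChunk(event.data)
      server.send(JSON.stringify(enriched))
    })
    return new Response(null, { status: 101, webSocket: client })
  }

  private async processChunk(chunk: unknown): Promise<object> {
    // Stub: Deepgram + emotion2vec calls and response assembly go here
    return { type: "enriched-caption" }
  }
}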

Option A: Deepgram Proxy

Worker receives audio from client, forwards to Deepgram Nova-3, enriches response with emotion data before returning.

  • Advantage: Best WER accuracy (~5.26%), streaming, speaker diarization built in
  • Cost: $0.0077/min ($200 free credit available)
  • Latency: <300ms end-to-end

Option B: Workers AI (On-Edge)

Cloudflare Workers AI for on-edge inference. Run whisper-tiny or emotion classification directly on Cloudflare's GPU fleet.

  • Advantage: No external API dependency, lower latency for nearby PoPs
  • Limitation: Model availability, cold start latency
  • Cost: Workers AI pricing (pay per inference)

Response Format

{ "type": "enriched-caption", "transcript": { "text": "That's just great.", "confidence": 0.94, "isFinal": true }, "emotion": { "primary": "anger", "confidence": 0.72, "source": "voice+text", "sarcasmDetected": true }, "voiceFeatures": { "pitchHz": 185, "pitchDirection": "falling", "volumeCategory": "loud", "rmsLevel": 0.71 }, "speaker": { "id": "spk_01", "changeDetected": false }, "hapticTrigger": { "pattern": "anger-pulse", "intensity": "strong" }, "timestamp": { "startMs": 14200, "endMs": 15800 } }

Auth Integration

Both apps already share the wallspace.studio domain. The existing JWT auth system (/api/auth/login, /api/auth/me, Google/GitHub/Apple SSO) authenticates both Electron and mobile clients. The mobile app stores the JWT in expo-secure-store. The existing _shared.ts auth helper already has CORS support and token verification. No new auth infrastructure needed.

🔄 Code Reuse Opportunities

Already Ported Once

WallSpace's translationService.ts was literally ported from A.EYE.ECHO. The two files are 90% identical — same DeepL client, same LibreTranslate fallback, same LRU cache. Time to extract shared code into packages both apps import from.

Reuse Audit

| Service | WallSpace File | Echo File | Overlap | Action |
|---|---|---|---|---|
| Translation | renderer/services/translationService.ts | src/services/translationService.ts | 90% | Extract @wallspace/translation |
| Emotion Lexicon | renderer/utils/sentimentAnalyzer.ts | none | Pure TS | Copy to Echo (zero deps) |
| Caption Network | none | src/services/captionNetworkService.ts | Echo only | Port to WallSpace |
| Diarization | VoiceFeatureExtractor (spectral) | src/services/audioDiarization.ts | Different approach | Merge: timing + spectral |
| Vibration | none | src/services/vibrationService.ts | Echo only | Add emotion → haptic map |
| DB Schema | session-only (JSON/SRT) | src/services/database.ts | Partial | Align with cloud D1 schema |
| Types | Various renderer types | src/types/index.ts | ~80% | Extract @wallspace/types |

Proposed Shared Packages

@wallspace/translation

  • TranslationCache (LRU, 200 entries)
  • DeepL HTTP client + language mappings
  • LibreTranslate HTTP client
  • Rate limiting + fallback cascade

Platform-specific bits stay separate: WallSpace keeps CTranslate2 offline via electronAPI; Echo keeps expo-constants API key resolution.

@wallspace/types

  • TranscriptSegment
  • Speaker, SpeakerProfile
  • Emotion, EmotionResult
  • TranscriptionStatus
  • WhisperLanguage

Both apps define these nearly identically. Union them (Echo adds source?: 'speech' | 'sign-language').

@wallspace/emotion

  • WEIGHTED_LEXICON (800+ terms)
  • analyzeTextEmotion()
  • EMOTION_VISUALS mapping
  • emotionToHaptic() (new)

Pure TypeScript, zero DOM dependencies. Runs in React Native without modification. Add emotionToHaptic mapping for mobile.

Package Strategy: Monorepo First

Option A (recommended start): npm workspace monorepo — add packages/ directory to crt-wall-controller, reference from both projects. Simpler dev workflow, instant iteration.
Option B (later): Published @wallspace/ scoped packages on npm — cleaner separation, works with any project structure, but adds publish/version overhead.
Start with A for speed, migrate to B when the shared API is stable.
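
Option A is essentially a one-file change in the repo root (paths assume packages/ lives inside crt-wall-controller, as above):

{
  "name": "crt-wall-controller",
  "private": true,
  "workspaces": ["packages/*"]
}

Each shared package (e.g. packages/emotion exposed as @wallspace/emotion) keeps its own package.json and is referenced from both projects as a workspace dependency until the shared API is stable enough to publish.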

🎯 Recommendations: Echo Integration Roadmap

Funding Reality: Free First, Paid Later

WallSpace.Studio is a commercial creative tool. A.EYE.ECHO is open-source (MIT-licensed) so it stays freely available to the deaf community. Matt's going to focus on pushing A.EYE.ECHO forward while Jack focuses on WallSpace and the shared cloud services, so both sides of the caption system move in parallel and benefit from the same underlying work.

The two surfaces share a caption codebase but have different funding targets. Cloud ASR services (Deepgram, AssemblyAI, Workers AI) have per-minute costs that add up — especially for a service intended to stay free for deaf users. We start with what costs nothing: native OS speech APIs, existing code reuse, and pure TypeScript logic that runs on any platform. Paid cloud services come later, funded either through grants / community support on the A.EYE.ECHO side, or via WallSpace commercial revenue subsidizing both.

Code Signing & Free Native ASR: Shipped in v2.6.0, Abandoned in v2.6.1

WallSpace has a complete native SFSpeechRecognizer implementation — the same free, on-device Apple speech engine that Echo uses. It's fully built: native C++/Objective-C addon (native/speech-recognition/src/speech_mac.mm), main process bridge (src/main/nativeSpeechBridge.ts), renderer service (src/renderer/services/nativeSpeechEngine.ts), IPC plumbing, 25+ language support, 55-second auto-restart.

Abandoned in v2.6.1: Despite adding entitlements, signing, notarizing, and switching to native arm64 builds, both SFSpeechRecognizer and Web Speech API consistently crash the Electron renderer process. Tested on Rosetta x64 (SIGTRAP), arm64 native (grey screen / SIGSEGV), with and without active Apple Developer agreement. The crash occurs at the native addon boundary and is not recoverable via try-catch. Moonshine v2 replaces all three speech engines (Whisper, native, web) with a single local streaming ML model — that is the path forward.

| Requirement | Status | Action |
|---|---|---|
| Native addon (speech_mac.mm) | Complete | No changes needed |
| IPC bridge + renderer service | Complete | No changes needed |
| Hardened runtime | Enabled | Already in electron-builder config |
| Speech entitlement in plist | Added in v2.6.0 | com.apple.security.speech-recognition added to entitlements.mac.plist |
| Apple Developer certificate | Working | Team ID 9K65QDV874 — signing successful |
| Notarized build | v2.6.0 shipped | Signed, notarized, published April 12, 2026 |

Path forward: Moonshine v2 provides local streaming ASR without native addon dependencies — runs as WASM or subprocess, no SFSpeechRecognizer, no Web Speech API, no Electron renderer crash risk. Replaces Whisper too (lower latency, streaming output). Echo continues using native speech APIs on iOS where they work reliably.

Phase 1: Zero-Cost Foundation (Start Here)

Everything in this phase costs nothing — no API keys, no subscriptions, no cloud services. Pure code reuse, native OS capabilities, and TypeScript logic that already exists.

1. Fix Code Signing + Enable Native Speech on WallSpace Abandoned v2.6.1

Outcome: Despite full implementation (entitlements, signing, notarization, arm64 native build), SFSpeechRecognizer and Web Speech API both crash the Electron renderer on every tested configuration. Replaced by Moonshine v2 in the roadmap below.

  • Add com.apple.security.speech-recognition entitlement ✓ (done, didn't help)
  • Build a signed, notarized release ✓ (done, didn't help)
  • Switch to arm64 native (no Rosetta) ✓ (done, still crashes)
  • Renew Apple Developer agreement ✓ (done, still crashes)
  • Conclusion: The native addon boundary in Electron is fundamentally unstable for speech APIs. Moonshine v2 (pure JS/WASM, no native addon) is the correct path.

Status: Abandoned. Whisper remains the only ASR engine until Moonshine v2.

2. Copy Emotion Lexicon to Echo $0

Goal: Give Echo text-based emotion analysis with zero additional dependencies

  • Copy sentimentAnalyzer.ts to Echo — pure TypeScript, zero DOM or Electron deps
  • Wire into transcript pipeline: every caption gets an emotion tag
  • Map emotions to vibration patterns via existing VibrationService
  • Result: Deaf users feel the emotional tone of speech via haptics — novel accessibility feature

Effort: Half a day  |  Cost: $0
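
A rough sketch of the pipeline wiring, assuming sentimentAnalyzer.ts exposes an analyze(text) function returning an EmotionResult (the actual export names should be confirmed against the WallSpace source before copying):

```typescript
// Echo transcript pipeline: tag each caption with a lexicon-based emotion.
// analyze() and EmotionResult come from the copied sentimentAnalyzer.ts;
// their exact signatures are assumptions here.
import { analyze, EmotionResult } from "./sentimentAnalyzer";

interface Caption {
  text: string;
  startMs: number;
  endMs: number;
  emotion?: EmotionResult;
}

export function tagEmotion(caption: Caption): Caption {
  // Pure TypeScript pass: no DOM, no Electron, no native modules.
  return { ...caption, emotion: analyze(caption.text) };
}
```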

3. Emotion-Driven Haptic Patterns $0

Goal: Deaf users feel the emotional tone of speech — on phone and wearable devices

  • Echo (phone): Extend VibrationService to accept EmotionResult from lexicon
  • WallSpace (wearables): Output emotion + beat data via OSC to haptic vests/wristbands at live events
  • Anger: strong rapid pulses
  • Sadness: slow gentle pulses
  • Joy: light double-tap
  • Fear: increasing intensity
  • WallSpace already has the OSC bridge + beat detection + 7 frequency bands — map to haptic zones
  • Result: Emotional context through touch, from phone vibration to full-body haptic vest

Effort: 1-2 days  |  Cost: $0 (OSC output is free, wearable hardware is user-provided)
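
On the WallSpace side, a sketch of the wearable broadcast using node-osc as an example transport. WallSpace's existing OSC bridge may expose its own send API instead, and the address paths and host here are illustrative:

```typescript
// Broadcast emotion + beat data over OSC to haptic wearables at a live event.
import { Client } from "node-osc";

const osc = new Client("192.168.1.50", 9000); // wearable hub address (example)

export function sendHapticFrame(
  emotion: string,
  intensity: number, // 0..1, e.g. derived from the lexicon's confidence
  beat: boolean
): void {
  osc.send("/haptic/emotion", emotion, intensity);
  if (beat) osc.send("/haptic/beat", 1);
}
```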

4. Volume-Based Caption Styling $0

Goal: Visual representation of how loud someone is speaking

  • Echo already has amplitude metering via expo-av (dB level)
  • Map volume levels to caption font size: whisper (small, light) → shouting (large, bold)
  • WallSpace already has volumeCategory in VoiceFeatureExtractor — reuse the thresholds
  • Result: Deaf users see volume visually without any new audio processing

Effort: Half a day  |  Cost: $0
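
A sketch of the mapping, with illustrative dB thresholds; the plan is to reuse the volumeCategory thresholds from WallSpace's VoiceFeatureExtractor rather than invent new ones:

```typescript
// Map metered dB level (from expo-av on Echo) to caption styling.
type VolumeStyle = { fontSize: number; fontWeight: "300" | "400" | "700" };

export function volumeToStyle(db: number): VolumeStyle {
  if (db < -40) return { fontSize: 14, fontWeight: "300" }; // whisper: small, light
  if (db < -20) return { fontSize: 18, fontWeight: "400" }; // conversational
  if (db < -10) return { fontSize: 22, fontWeight: "400" }; // raised voice
  return { fontSize: 28, fontWeight: "700" };               // shouting: large, bold
}
```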

5. Extract Shared Translation Package $0

Goal: Eliminate the roughly 90% of translation code currently duplicated between the two apps

  • Factor out TranslationCache, DeepL client, LibreTranslate client into @wallspace/translation
  • Both apps already use LibreTranslate as free fallback — no API key required
  • DeepL free tier (500K chars/month) available when API key is configured
  • Keep platform-specific adapters separate (CTranslate2 for Electron, expo-constants for mobile)

Effort: Half a day  |  Cost: $0
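
A sketch of the package's core surface, a fallback chain in front of the engine clients. Class and method names are proposals, not existing exports:

```typescript
// @wallspace/translation: try DeepL when a key is configured, fall back to
// the free LibreTranslate client; TranslationCache would sit in front.
export interface Translator {
  translate(text: string, targetLang: string): Promise<string>;
}

export class ChainTranslator implements Translator {
  constructor(private engines: Translator[]) {}

  async translate(text: string, targetLang: string): Promise<string> {
    for (const engine of this.engines) {
      try {
        return await engine.translate(text, targetLang);
      } catch {
        // Fall through to the next engine (e.g., DeepL quota hit).
      }
    }
    throw new Error("All translation engines failed");
  }
}
```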

6. Caption Network: WallSpace ↔ Echo $0

Goal: WallSpace can broadcast/receive captions to/from mobile devices

  • Port Echo's CaptionNetworkService to WallSpace (WebSocket relay + room codes)
  • Existing Glitch relay (caption-relay.glitch.me) is free-tier hosted
  • Enables: CRT wall showing captions from mobile users in audience
  • Result: Live performance captioning where audience phones feed the wall

Effort: 1-2 days  |  Cost: $0 (Glitch free tier)
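
A sketch of the ported handshake. The join/caption message shape is an assumption based on Echo's room-code model and should be confirmed against CaptionNetworkService before porting:

```typescript
// Join a relay room and receive captions broadcast by other participants.
const RELAY_URL = "wss://caption-relay.glitch.me";

export function joinRoom(
  roomCode: string,
  onCaption: (text: string) => void
): WebSocket {
  const ws = new WebSocket(RELAY_URL);
  ws.onopen = () => ws.send(JSON.stringify({ type: "join", room: roomCode }));
  ws.onmessage = (ev) => {
    const msg = JSON.parse(ev.data);
    if (msg.type === "caption") onCaption(msg.text);
  };
  return ws;
}
```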

7. Speaking Rate → Caption Scroll Speed $0

Goal: Captions adapt pacing to speech speed

  • Compute WPM from segment timing (word count / duration) — no audio analysis needed
  • Fast speech: captions scroll faster, auto-truncate older lines
  • Slow speech: captions hold longer, auto-pause on silence
  • Result: Better readability for deaf users without any cloud or DSP dependency

Effort: Half a day  |  Cost: $0
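
The whole feature reduces to two small functions; threshold values below are starting points for tuning, not tested constants:

```typescript
// Derive scroll pacing from segment timing alone: no audio analysis needed.
export function wordsPerMinute(wordCount: number, durationMs: number): number {
  return durationMs > 0 ? (wordCount / durationMs) * 60_000 : 0;
}

export function holdTimeMs(wpm: number): number {
  if (wpm > 180) return 2_000;  // fast speech: shorter hold, faster scroll
  if (wpm < 110) return 5_000;  // slow speech: hold captions longer
  return 3_500;                 // typical conversational pace
}
```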

Phase 2: Cloud-Powered Upgrades Test & Validate Now, Scale When Funded

These features use paid APIs. Development and testing can happen now using free credits and careful usage management. Production-scale always-on use requires funding. The key principle: free engines run by default, paid engines are opt-in and never persist across restarts.

Deepgram: On-Demand Premium Toggle

Deepgram is not an always-on replacement for native speech — it's a premium mode you switch on when you need the best accuracy or cloud-based voice emotion analysis, then switch off. Free native ASR handles day-to-day captioning; Deepgram handles demos, live events, and testing.

Usage Pattern | Monthly Cost | $200 Credit Lasts
2-hour live events, 2x/month | $1.85 | ~9 years
1 hour/day testing & development | $14 | ~14 months
4 hours/day regular use | $55 | ~108 days
12 hours/day always-on (avoid this) | $166 | ~36 days
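
These figures follow directly from the ~$0.0077/min Nova-3 streaming rate noted in item 8; a quick sanity check:

```typescript
// Verify the cost table against the ~$0.0077/min rate from item 8.
const RATE_PER_MIN = 0.0077;

function monthlyCost(hoursPerMonth: number): number {
  return hoursPerMonth * 60 * RATE_PER_MIN;
}

console.log(monthlyCost(4));        // two 2-hour events: ~$1.85
console.log(monthlyCost(30));       // 1 hour/day: ~$13.86
console.log(monthlyCost(120));      // 4 hours/day: ~$55.44
console.log(200 / monthlyCost(4));  // credit lasts ~108 months (~9 years)
```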

Required Safeguards

  • Default OFF: paid engines never persist across app restarts
  • Manual toggle only; free native/local engines remain the default path
  • Session cost counter visible whenever a paid engine is active
  • Spend warnings before a session crosses its budget

Broader API Cost Management (Planned)

Deepgram is one of several cost-based APIs in the WallSpace ecosystem (Deepgram, DeepL, fal.ai/RunPod for Scope GPU, future cloud services). A comprehensive API cost management initiative is needed across all paid services — unified cost tracking dashboard, per-service budgets, usage alerts, and spend reporting. Matt already has tickets scoped around this topic. For now, Deepgram follows the same patterns as other cost-based APIs in the app (manual enable, session tracking, warnings). A unified cost management system will be planned and addressed as a separate initiative.
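
As a placeholder until that initiative is scoped, a sketch of the per-service shape a unified tracker could standardize on; all names here are proposals:

```typescript
// Unified per-service budget record, shared by Deepgram, DeepL, fal.ai, etc.
export interface ServiceBudget {
  service: "deepgram" | "deepl" | "fal" | "runpod";
  monthlyBudgetUsd: number;
  spentUsd: number;
}

export function shouldWarn(b: ServiceBudget, threshold = 0.8): boolean {
  // Fire a usage alert once spend crosses the threshold fraction of budget.
  return b.monthlyBudgetUsd > 0 && b.spentUsd / b.monthlyBudgetUsd >= threshold;
}
```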

8. Deepgram Integration (Test & Validate) $200 Free Credit

Goal: Validate cloud ASR quality + voice emotion pipeline, manage costs carefully

  • Add Deepgram Nova-3 as an engine option in the caption engine selector
  • WebSocket streaming integration (same pattern as existing engines)
  • Default: OFF. Never persists across app restarts. Manual toggle only.
  • Add session cost counter + spend warning system to caption panel
  • Test and validate: accuracy vs native speech, latency, speaker diarization quality
  • $200 free credit is sufficient for months of development and testing

Effort: 1-2 days  |  Cost: $0 up front ($200 free credit), ~$0.0077/min after
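
A minimal streaming sketch for the main process, based on Deepgram's documented WebSocket endpoint; model name, query params, and the response shape should be verified against current Deepgram docs before integration:

```typescript
// Stream 16-bit PCM to Deepgram Nova-3 and log interim transcripts.
import WebSocket from "ws";

const url =
  "wss://api.deepgram.com/v1/listen" +
  "?model=nova-3&interim_results=true&encoding=linear16&sample_rate=16000";

const dg = new WebSocket(url, {
  headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` },
});

dg.on("message", (data) => {
  const msg = JSON.parse(data.toString());
  const transcript = msg.channel?.alternatives?.[0]?.transcript;
  if (transcript) console.log(transcript); // forward to the caption pipeline
});

// Feed raw audio chunks from the capture pipeline as binary frames.
export function sendAudio(chunk: Buffer): void {
  if (dg.readyState === WebSocket.OPEN) dg.send(chunk);
}
```

The session cost counter would hang off the same connection: minutes of audio sent times the per-minute rate.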

9. Shared Cloud Audio Service When Funded

Goal: Both apps get enriched captions via wallspace.studio/api/audio

  • Cloudflare Worker + Durable Object proxying to Deepgram
  • Adds emotion classification from audio (not just text) to the response
  • Mobile gets: cloud ASR + voice-based emotion when premium mode is on
  • Same toggle-on/off safeguards apply — free native speech remains the default

Effort: 1-2 days  |  Cost: Deepgram usage + Cloudflare Workers (free tier generous)

10. Migrate Caption Relay to Cloudflare Durable Objects When Funded

Goal: Replace Glitch free-tier relay with production-grade infrastructure

  • Move from caption-relay.glitch.me to wallspace.studio Durable Objects
  • Better reliability, no cold starts, WebSocket hibernation
  • Unified auth with existing wallspace.studio JWT system

Effort: 1-2 days  |  Cost: Cloudflare Workers Paid ($5/mo base)
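
A compact sketch of what the room object could look like on the Workers runtime, using the standard WebSocketPair API; hibernation and the JWT auth mentioned above are omitted for brevity:

```typescript
// Durable Object caption relay: one instance per room code, fan-out to peers.
export class CaptionRoom {
  private sockets = new Set<WebSocket>();

  async fetch(request: Request): Promise<Response> {
    const { 0: client, 1: server } = new WebSocketPair();
    server.accept();
    this.sockets.add(server);
    server.addEventListener("message", (ev) => {
      // Relay each caption to every other participant in the room.
      for (const ws of this.sockets) if (ws !== server) ws.send(ev.data);
    });
    server.addEventListener("close", () => this.sockets.delete(server));
    return new Response(null, { status: 101, webSocket: client });
  }
}
```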

11. ML Emotion Models (Cloud) Future

Goal: Replace heuristic emotion rules with trained ML models

  • emotion2vec or SenseVoice-Small via cloud inference
  • Workers AI or dedicated GPU endpoint
  • Same cost management principles: opt-in, tracked, budgeted

Effort: 1-2 weeks  |  Cost: Workers AI pricing or GPU hosting

Phase 3: Deep Engineering Long-Term

12. Mobile DSP Expo Module Engineering

Goal: On-device voice feature extraction without cloud dependency

  • Custom Expo module: AVAudioEngine.installTap (iOS) + AudioRecord (Android)
  • Port subset of VoiceFeatureExtractor: RMS, basic pitch, spectral centroid
  • Would give Echo on-device emotion from voice (not just text) — completely free at runtime

Effort: 1-2 weeks  |  Cost: $0 (engineering time only)
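
The JS side of the ported subset is small. A sketch assuming the native layer (AVAudioEngine tap / AudioRecord) delivers PCM Float32Array frames, plus a magnitude spectrum from whatever FFT the module ships:

```typescript
// Minimal on-device feature extraction for the proposed Expo module.
export function rms(frame: Float32Array): number {
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

export function spectralCentroid(
  magnitudes: Float32Array, // FFT magnitude bins, DC..Nyquist
  sampleRate: number
): number {
  // Weighted mean of bin frequencies; bin k maps to k * sr / (2 * bins).
  let num = 0, den = 0;
  for (let k = 0; k < magnitudes.length; k++) {
    const freq = (k * sampleRate) / (2 * magnitudes.length);
    num += freq * magnitudes[k];
    den += magnitudes[k];
  }
  return den > 0 ? num / den : 0;
}
```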

13. Unified Speaker Detection Engineering

Goal: Merge camera-based (Echo) + audio-based (WallSpace) speaker identification

  • Echo sends face IDs + lip-sync correlation data
  • WallSpace sends spectral centroid profiles
  • Fused speaker identity across platforms and sessions

Effort: 2+ weeks  |  Cost: $0 (engineering time only)
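
At its simplest, fusion is a weighted combination of the two evidence streams. A sketch standing in for whatever fusion model eventually ships; the weights and the 0..1 score convention are assumptions:

```typescript
// Fuse camera-based (Echo) and audio-based (WallSpace) speaker evidence.
interface SpeakerEvidence {
  faceLipSyncScore: number;   // Echo: 0..1 lip-to-audio correlation
  spectralMatchScore: number; // WallSpace: 0..1 centroid-profile match
}

export function fusedSpeakerScore(
  e: SpeakerEvidence,
  faceWeight = 0.6 // bias toward visual evidence when a face is tracked
): number {
  return faceWeight * e.faceLipSyncScore + (1 - faceWeight) * e.spectralMatchScore;
}
```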

Mobile Constraints to Remember

Phase 1: Free ($0)

  • Code signing → native speech (abandoned; Moonshine v2 is the replacement path)
  • Emotion lexicon → Echo (pure TS)
  • Emotion → phone haptics + wearable OSC
  • Beat/frequency → music haptics via OSC
  • Volume → caption font size
  • Translation package extraction
  • Caption network port
  • Speaking rate → scroll speed

Total cost: $0

Phase 2: Test Now, Scale When Funded

  • Deepgram (test with $200 credit)
  • Always off on restart, manual toggle
  • Session cost tracker + warnings
  • Durable Object caption relay
  • ML emotion models (cloud)
  • Broader API cost mgmt (planned)

$1-5/mo if toggled sparingly

Phase 3: Engineering

  • Mobile native DSP module
  • Unified speaker identity
  • On-device ML emotion
  • FocalCodec voice tokenization
  • Decision matrix fusion

$0 runtime, weeks of dev time

📝 For Review: Questions for Gadi, Matt & Echo

For Gadi

  • Which formant extraction method do you recommend for real-time? LPC vs cepstral?
  • Your "time-window script" concept — what chunk size worked best in your offline tools?
  • Any specific Chinese models from your librarian agent we should evaluate?
  • Would you want to contribute DSP code directly? We can set up a branch for testing.
  • Your emotional sovereignty papers — any framework we should adopt for the decision matrix?
  • PLAUD recorder — any insights from their approach worth borrowing?

For Matt

  • With the new transient detection in v2.6.0, have you noticed consonant-heavy speech being handled differently?
  • Is the emotion debug panel useful for understanding what the system is "hearing"?
  • Deepgram vs Whisper — want to A/B test both during a session?
  • Would phoneme-level confidence indicators in captions be useful? (e.g., dim uncertain words)
  • Speaker change detection — is it triggering correctly in multi-person conversations?
  • Priority: faster transcription (Phase 1) or better emotion accuracy (Phase 3)?

For Echo Integration

  • Should the cloud service use Deepgram proxy or Workers AI for inference?
  • Which emotion-to-haptic mappings feel right for deaf users? Need user testing.
  • Should caption relay migrate from Glitch to wallspace.studio Durable Objects?
  • Which Echo features should WallSpace adopt first? (URL ingest? Caption sharing? ASL?)
  • Monorepo or published npm packages for shared code?
  • Is expo-secure-store sufficient for JWT storage on mobile?