This isn't a caption system purpose-built for VJ shows. It's a general-purpose captioning intelligence service designed to serve a range of contexts, from deaf-first accessibility (A.EYE.ECHO) to creative live-show work (WallSpace).
We start with live music + visuals because if the pipeline works with music bleed, crowd noise, and multiple speakers, the cleaner use cases are downstream. When you see "VJ audio" or "live-show context" below, read it as "our hardest calibration target," not "our only use case."
Since v2.6.0 shipped, Matt (in collaboration with ChatGPT Plus) has produced a formal architecture paper that now serves as the strategic spine for this project. We've adopted it as the v3.0.0 foundation. The short version: treat the BBC Subtitle Guidelines as a compliance floor (the minimum safe readability / accessibility behaviour), and allow advanced features (emotion, prosody, WallSpace visuals, immersive placement) only as controlled extensions above that floor.
Standards-based captioning should define the minimum safe and readable behaviour of the system, while advanced features are implemented only as controlled extensions above that baseline. Decision order: readability first, compliance second, enhancement third, expression fourth.
The full specification set Matt prepared is referenced throughout this v3.0.0 plan. Click any card to download the PDF.
Matt's architecture uses BBC Subtitle Guidelines as the compliance floor. For the deaf-accessibility use case in A.EYE.ECHO this floor is strict — deaf users depend on it daily, and standards compliance is non-negotiable. For WallSpace's creative / VJ contexts, a caption may legitimately operate with different constraints: a live show, small localized audience, or experimental art installation has different priorities than a daily accessibility conversation.
Matt's paper treats the compliance floor as invariant across all contexts. Jack's read is that for WallSpace — particularly with smaller audiences, localized VJ shows, and experimental art contexts — more flexibility may be appropriate. The architecture already supports this through Rule Class D (project-specific advanced rules) and presentation profiles, but we need to agree on exactly which rules bend and which stay fixed.
Action: Jack to review this framing with Matt. The specific A-class rules that must hold for WallSpace in all modes, and the B/C/D-class rules that can be overridden in Immersive / Experimental WallSpace profiles, need explicit agreement.
| | A.EYE.ECHO (deaf-first) | WallSpace (creative) |
|---|---|---|
| Compliance floor | Strict. BBC baseline always enforced. | Flexible within clear limits. Class A rules still hold. |
| Default profile | Compliance or Accessible Enhanced | Live/Low Latency, Immersive, or Experimental |
| Enhancement scope | Limited to readability-preserving additions | Reactive typography, spatial placement, WallSpace visual control allowed |
| Override path | None beyond built-in profiles | Class D rules + per-layer profile selection |
| Logging requirement | Full decision log (verbose) | Configurable per profile (minimal in live shows, verbose in R&D) |
Matt's architecture replaces our earlier "signal-first pipeline" with a five-layer system. Our existing DSP / emotion / phoneme work from v2.6.0 doesn't go away — it becomes the contents of Layer 1 (Input Engines). Everything above Layer 1 is new infrastructure for governing how that signal data becomes visible captions.
| Component | What it owns | What it must not do |
|---|---|---|
| Input Engines | Generate structured data | Make UI or rendering decisions |
| Rules Engine | Validate data, enforce constraints | Decide how to display anything |
| Policy Engine | Decide rendering behaviour, resolve conflicts, approve enhancements | Invent rules; override constraint priorities |
| Renderer | Execute layout and display | Modify caption content or make creative calls |
| Enhancement Layer | Propose controlled modifications | Enforce behaviour independently |
Lower-priority behaviours must yield to higher-priority constraints without exception.
| Class | Name | Role | Who uses it |
|---|---|---|---|
| A | Hard accessibility | Must never be violated. Defines the floor. | Both apps |
| B | Preferred presentation | Strong defaults with context-sensitive exceptions | Both apps; stricter in Echo |
| C | Controlled expressive | Limited stylistic behaviour already recognised by BBC guidelines | Enabled in Enhanced/Immersive profiles |
| D | Project-specific advanced | Features beyond BBC (reactive typography, WallSpace integration, immersive placement) | WallSpace-specific; override path |
| Profile | Description | Primary use |
|---|---|---|
| Compliance / Broadcast-safe | Strict BBC adherence. No experimental enhancements. | A.EYE.ECHO default; conference captioning |
| Accessible Enhanced | Baseline + limited semantic additions (emotion tags, sound annotation) | A.EYE.ECHO daily use; podcast post-processing |
| Live / Low Latency | Immediacy prioritised; relaxed segmentation; readability floor held | Live conferences, news, Q&A |
| Immersive / Spatial | Spatial placement and motion enabled with motion constraints + fallback | VJ shows, art installations, VR/AR |
| Experimental / Expressive | Advanced enhancements unlocked; Class D rules available | WallSpace R&D, experimental performance |
A sixth profile — Offline / Post-Processing — will likely be added for podcasts, audiobooks, and recorded interview analysis. Flagged as a gap in Matt's initial spec; to discuss.
This session focused on making the voice analysis and emotion detection pipeline actually work end-to-end, and then deeply researching where to take it next. Here are the concrete improvements shipping in v2.6.0:
NEW: Added autocorrelation-based fundamental frequency (F0) detection covering 80-500 Hz (bass to soprano). New consonant transient detector catches hard attacks (p/t/k/s sounds) via energy delta analysis. Pitch direction tracking (rising/falling/stable) enables question detection and excitement mapping.
NEW: When someone says positive words ("that's great") with negative vocal tone (falling pitch, low energy), the system now detects the contradiction and trusts the voice over the text, because tone is harder to fake than words.
FIX: Voice analysis was being extracted but wasn't reaching the emotion scoring engine. Fixed auto-enable so voice features flow through to emotion blending automatically. No more silent failures.
IMPROVED: Transient detection now accumulates over a 1-second rolling window instead of just catching individual spikes. This catches patterns like rapid-fire consonants in angry speech — 3+ transients/sec flags anger. Threshold lowered and strength shown in debug panel for tuning.
DEBUG: New collapsible debug panel shows real-time emotion scores, voice feature values, and blending weights. Polls voice features at 200ms intervals (not just on text events) so you can see the voice emotion shifting even between words. Essential for tuning and demonstrating the system.
MATT'S TASKS: Added emotion test triggers for development, debug state inspection, and temporal alignment between voice features and text events so emotions don't lag behind speech (Matt's Tasks 3-5).
All references below to multi-engine ASR, streaming Web Speech, or native-speech bridges describe historical v2.6.0 state and are retained for context only. The live pipeline is Whisper-only until Moonshine v2 lands.
Here's everything the voice analysis pipeline can do right now, before the next-gen upgrades. Rows marked Abandoned v2.6.1 are kept in the table so the evolution from v2.6.0 is traceable.
| Layer | Implementation | Status |
|---|---|---|
| ASR (Primary) | whisper.cpp subprocess (tiny.en / base.en / small.en) | Working, 2-3s latency |
| ASR (Streaming) | Web Speech API | Abandoned v2.6.1 — crashes renderer in Electron |
| ASR (Native) | SFSpeechRecognizer (native addon) | Abandoned v2.6.1 — crashes on both Rosetta & arm64 |
| ASR (Next-Gen) | Moonshine v2 (local, streaming, replaces all above) | Planned — Phase 1 |
| Music Lyrics | Shazam fingerprint + LRCLIB synced lyrics | Working |
| DSP Features | VoiceFeatureExtractor (18 features @ 100ms) | Working |
| Emotion (Text) | Lexicon-based (800+ terms, phrase matching, negation) | Working |
| Emotion (Voice) | Heuristic rules (pitch, energy, transients, volume) | Working |
| Emotion (Blend) | Weighted fusion + sarcasm detection | NEW in v2.6.0 |
| Beat/Kick | FFT onset detection, MIDI clock, tap tempo | Working |
"You take a level obviously not just, you cross reference the text with a kind of sonic analysis
and you try to provide a tone of voice tag... These models will do consonants really well.
Transient analysis is very important."
— Gadi Sassoon, DSP Engineer (25 years, Berkeley College of Music)
Not all commonly cited Whisper limitations are actually observed problems in WallSpace. Some come from Gadi's consultation (his experience with call transcripts and offline tools), others are confirmed in our codebase with specific mitigations in place.
| Limitation | Source | WallSpace Status |
|---|---|---|
| 2-3 second latency | Confirmed | Actively measured via transcriptionService.ts latency tracking. Default 5s chunks + inference time. Compensated with manual latency offset slider (-500 to +500ms) and auto-calibrate button. |
| Hallucinations on silence | Confirmed | Observed enough to hardcode filter patterns in whisperBridge.ts: (music), (applause), "you", "thank you", dot strings. Mitigated via silence detection (RMS < -60dB skips transcription entirely) and hallucination filtering before display. |
| Queue drops | Confirmed | Explicit backpressure logic in whisperBridge.ts: if queue > 1 item, oldest chunk is dropped. Comment in code: "Whisper is slower than real-time." Intentional trade-off to prevent OOM in live streaming. |
| No emotion data | Factual | By design — Whisper is ASR-only, outputs text with no tone/emotion metadata. Workaround in place: parallel DSP pipeline (VoiceFeatureExtractor) + text lexicon analysis provide emotion independently. |
| Drops consonants | From Gadi | Reported by Gadi from his call transcript experience, not from WallSpace bug reports. Our transient detection (v2.6.0) monitors consonant attacks via DSP but doesn't currently correct Whisper output. Phoneme-level correction planned for Phase 4. |
| Accent struggles | From Gadi | Gadi mentioned struggles with "Globish" and non-native speakers from his tools. No evidence of this in WallSpace. Multilingual models (tiny/base/small) are available alongside English-only variants. No bug reports or workarounds for accents. |
| Gap | Description | Current State |
|---|---|---|
| Signal-First | DSP should lead; text transcription is secondary to the audio signal | Partial |
| Transient Analysis | Consonant edges carry meaning ASR models miss entirely | Basic (v2.6.0) |
| Tone-of-Voice Tags | Cross-reference text with sonic analysis for tone metadata | Basic (v2.6.0) |
| Formant Analysis | Vowel structure (F1/F2/F3) for accent/speaker profiling | Not implemented |
| Decision Matrix | Not a linear pipeline but a matrix of DSP + emotion + semantics | Not implemented |
| Clockless Analysis | Real-time requires careful buffering strategy decisions | Partial |
100x faster than Whisper Large v3 on MacBook Pro (107ms vs 11,286ms). Better WER accuracy. Streaming encoder with sliding-window attention for bounded low-latency. Incremental audio caching — subsequent calls only process new audio. Sizes from 26M (Tiny) to 245M (Medium Streaming). Same subprocess integration pattern as current whisper.cpp.
CoreML-compiled Whisper on Apple Neural Engine. OpenAI-compatible HTTP local server — can be bundled as Electron subprocess. Streaming, word timestamps, VAD, speaker diarization built in.
ASR/TTS/VAD/diarization via ONNX Runtime in WebAssembly (50KB gzipped). Could run speech recognition directly in Electron renderer — no subprocess needed. 12 language bindings, fully offline.
| Service | WER | Streaming Latency | Price | Notes |
|---|---|---|---|---|
| Deepgram Nova-3 | ~5.26% | <300ms | $0.0077/min | $200 free credit |
| AssemblyAI Universal-3 | ~6.68% | ~150ms P50 | ~$0.01/min | 30% fewer hallucinations than Whisper |
| GPT-4o-mini-transcribe | Better than Whisper | Low | ~$0.006/min | WebSocket streaming, accent-resilient |
| Google Chirp 3 | Competitive | Low | Usage-based | Built-in denoiser, speaker diarization |
These models understand raw audio directly — tone, emotion, background noise — not just convert speech to text. This is the paradigm shift Gadi described.
End-to-end audio understanding: tone, emotion, background noise filtering. Responds to user's tone of voice. Live API with bidirectional audio streaming.
Speech + natural sounds + music in one encoder. Voice chat mode (no text needed) + audio analysis mode. Excels at ASR, emotion recognition, acoustic scene classification.
Combined ASR + emotion recognition + audio event detection in one model. Could replace both Whisper AND heuristic emotion detection.
Dedicated emotion classifier: angry, happy, neutral, sad. Lightweight, runs alongside existing ASR pipeline. Multiple model sizes.
Single binary codebook at 0.16-0.65 kbps. Preserves speaker identity AND emotion in reconstructed speech. Outperforms SpeechTokenizer, Mimi, EnCodec. Use case: encode vocal characteristics into compact tokens for speaker profiling, emotion encoding, and network transmission.
There is no off-the-shelf "consonant transient detector" ML model. ASR models like Whisper treat audio as a sequence of words — they don't preserve the signal-level detail of how those words were spoken. The consonant edges (the p/t/k/s attacks) carry critical emotional and clarity information that gets discarded in the text-only pipeline. Our current energy-delta approach is a good start. The upgrade path adds formant analysis, MFCC features, spectral flux, and eventually phoneme classification.
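For reference, here is a minimal TypeScript sketch of the energy-delta approach combined with the 1-second rolling window from the v2.6.0 notes. The frame size, thresholds, and the 3+/sec anger cue are illustrative assumptions, not the values shipped in VoiceFeatureExtractor.

```typescript
// Illustrative sketch of the energy-delta transient approach described above.
// Frame size, thresholds, and the rolling-window rate are assumptions for
// demonstration; the shipped VoiceFeatureExtractor values may differ.
interface TransientEvent {
  timeMs: number;
  strength: number; // energy jump relative to the previous frame
}

class TransientDetector {
  private prevRms = 0;
  private events: TransientEvent[] = [];

  constructor(
    private deltaThreshold = 0.12, // minimum RMS jump to count as a hard attack
    private windowMs = 1000        // rolling window used for rate estimation
  ) {}

  /** Feed one ~100ms analysis frame of PCM samples. */
  processFrame(samples: Float32Array, timeMs: number): void {
    let sum = 0;
    for (let i = 0; i < samples.length; i++) sum += samples[i] * samples[i];
    const rms = Math.sqrt(sum / samples.length);

    const delta = rms - this.prevRms;
    if (delta > this.deltaThreshold) {
      this.events.push({ timeMs, strength: delta });
    }
    this.prevRms = rms;

    // Drop events that have fallen outside the rolling window.
    const cutoff = timeMs - this.windowMs;
    this.events = this.events.filter((e) => e.timeMs >= cutoff);
  }

  /** Transients per second over the rolling window (3+/sec is the anger cue). */
  ratePerSecond(): number {
    return this.events.length / (this.windowMs / 1000);
  }
}
```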
| Feature | What It Does | Why It Matters | Phase |
|---|---|---|---|
| Formant Extraction (F1/F2/F3) | LPC spectral envelope peak-picking | Vowel height/frontness, accent profiling, speaker ID | Phase 2 |
| MFCC (13 coefficients) | Mel filterbank + DCT | Phoneme classification, consonant type detection | Phase 2 |
| Spectral Flux | Frame-to-frame spectral change | More robust consonant edge detection in noise | Phase 2 |
| Harmonic-to-Noise Ratio | Voiced vs unvoiced segment detection | Distinguish vowels from consonants precisely | Phase 2 |
| wav2vec2 Phoneme Classifier | ONNX model for phoneme-level detection | Classify specific consonants (p/t/k/b/d/g/s/z/f/v) | Phase 4 |
| Montreal Forced Aligner | Post-hoc phoneme-transcript alignment | Find where consonants were dropped/mumbled | Phase 4 |
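To make the Phase 2 "spectral flux" row concrete, this is roughly what the computation looks like. The magnitude spectra are assumed to come from the existing Web Audio analyser, and the normalisation is a placeholder.

```typescript
// Spectral flux: sum of positive frame-to-frame magnitude changes.
// Bin count and normalisation here are illustrative; in WallSpace the spectra
// would come from the existing Web Audio AnalyserNode frames.
function spectralFlux(prev: Float32Array, curr: Float32Array): number {
  let flux = 0;
  const bins = Math.min(prev.length, curr.length);
  for (let i = 0; i < bins; i++) {
    const diff = curr[i] - prev[i];
    if (diff > 0) flux += diff; // only rising energy marks an onset / consonant edge
  }
  return flux / bins; // normalise so the threshold is independent of FFT size
}
```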
Per Matt's architecture paper, build order prioritises a stable, auditable core before any innovation layers. This replaces the earlier Phase 1–5 capability roadmap from v2.6.0. The capability work (Moonshine, formants, emotion2vec, phoneme analysis) now fits inside Layer 1 (Input Engines) — so those items move inside Phase 1 and Phase 6, not independent phases.
Phase 1: Rules Matrix. Goal: Convert BBC Subtitle Guidelines into a machine-readable rules matrix. Deliverable: rules-matrix.json consumable by the Rules Engine.
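To make the deliverable concrete, here is one possible shape for a rules-matrix entry, sketched as a TypeScript type with a single example rule. Field names, the rule id, and the 37-character value are illustrative assumptions; the real schema is Matt's Phase 1 deliverable.

```typescript
// Hypothetical shape for one rules-matrix entry. Field names and the sample
// rule are illustrative only; the actual schema is defined in Phase 1.
type RuleClass = 'A' | 'B' | 'C' | 'D';

interface CaptionRule {
  id: string;               // stable identifier, e.g. "BBC-LINE-LENGTH"
  ruleClass: RuleClass;     // A = hard accessibility floor, D = project-specific
  description: string;      // human-readable summary of the guideline
  constraint: {
    property: string;       // what the Rules Engine measures
    operator: '<=' | '>=' | '==';
    value: number | string;
  };
  overridableIn: string[];  // profiles allowed to relax it (empty for Class A)
}

const exampleRule: CaptionRule = {
  id: 'BBC-LINE-LENGTH',
  ruleClass: 'A',
  description: 'Maximum characters per caption line',
  // 37 is the commonly cited teletext-era figure; confirm against the current guidelines.
  constraint: { property: 'line_length_chars', operator: '<=', value: 37 },
  overridableIn: [], // Class A: never overridden, in any profile
};
```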
Phase 2: Canonical Caption Data Model. Goal: One normalised schema every input engine writes to and every downstream layer reads.
Phase 3: Rules Engine. Goal: Consumes the Rules Matrix + Canonical Model; emits constraints + scores.
Phase 4: Policy Engine. Goal: Central decision authority — resolves conflicts, applies profiles.
Phase 5: Renderer. Goal: Deterministic renderer that strictly follows Policy Engine instructions.
Phase 6: Enhancement Layer. Goal: Emotion, prosody, reactive typography, WallSpace integration, spatial placement. All as proposals that the Policy Engine approves / modifies / rejects.
Phase 7: Testing Suite. Goal: Every enhancement regression-tested against the compliance floor.
Concerns flagged in review of Matt's architecture, and how we plan to address each.
| Concern | Why it matters | Mitigation |
|---|---|---|
| Performance envelope | Five layers + enhancement + decision log could exceed our <200ms target | Decision Log spec defines minimal / standard / verbose levels. Live profile uses minimal. Validate <200ms before Phase 5 ships. |
| BBC floor vs WallSpace flex | Live VJ, small-audience, and experimental contexts may need different constraints than deaf-accessibility defaults | Class D (project-specific) rules + WallSpace-specific profiles. Class A rules still hold. Every override logged. To review with Matt. |
| Rewrite cost | v2.6.0 has shipping signal-first code; seven phases looks like big-bang | Acceptable — Jack and Matt are effectively the only users of latest. v2.6.0 work was a single-day exploration; rewrite is fine if the architecture unlocks something better. ML-assisted layers will replace hand-tuned heuristics anyway. |
| Profile-switching UX undefined | How does a user move from Compliance to Immersive mid-session? | Likely per-caption-layer in WallSpace (each layer gets a profile); per-session-default in A.EYE.ECHO. To be designed in Phase 4. |
| Offline audio profile missing | Podcasts, audiobooks, recorded interviews have fundamentally different latency constraints | Add a sixth Offline / Post-Processing profile. Same rule/policy framework, relaxed timing. Flagged for discussion with Matt. |
| Decision-log volume in live contexts | 30+ caption updates/sec × full log = heavy I/O | Live profile uses minimal logging + sampling (Decision Log spec §13–14). Log material decisions only. Verbose mode available for R&D. |
| Claude's role inside the system | Matt specifies Claude as a constrained reasoning engine, not a generative assistant | Accepted. Claude does rule extraction, compliance scoring, gap analysis — always within the architecture, always emitting decision-log-compatible output. |
Matt's Gap → Fix → Implementation Matrix v1 (2026-04-20) flagged that the v3.0.0 architecture was strong but the execution layer was incomplete — no benchmark framework, no migration path from the current 2.6.1 system, an under-specified canonical data model, and an abstract policy engine. This section tightens those pieces so the plan is buildable and testable. Every block below responds to a numbered gap in Matt's matrix.
Gap #10 (Rule Flexibility — Class A/B/C/D formalisation) is intentionally
left open. Matt's matrix calls for explicit definitions of Class B (preferred) and Class C
(expressive), plus a rules.json structure and policy-engine rejection/override
logging. That decision is bound up with the Two-Track Compliance question
(A.EYE.ECHO strict floor vs WallSpace flex profiles) which we still need to agree on in
person before committing the rule classes to code. See the matching row in “Known
Tensions” above.
Without a fixed benchmark suite, every model decision (Whisper vs Moonshine vs Deepgram, DSP improvements, phoneme/consonant work) becomes subjective. We add an offline benchmark harness that scores every candidate model against the same audio, same metrics, same pass/fail thresholds.
Metrics recorded per run:
- latency_partial_ms — first partial token
- latency_final_ms — finalised caption
- WER / CER
- hallucination_rate
- speaker_accuracy
- emotion_accuracy (human-rated)
- consonant_confidence_score

Pass/fail thresholds:
- latency_final_ms < 300 (target)
- WER < 10% (clean speech)
- WER < 20% (live noisy)
- hallucination_rate < 2%
- speaker_accuracy > 85%

Ranking order: latency → accuracy → stability (failure rate). We pick the best model per use case (live music, conference, mobile 1-on-1), not a single global winner. Corpus curation — particularly accents and live-music-bleed samples — is a Jack ↔ Matt open item.
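A minimal sketch of how the pass/fail floor and ranking order could be encoded, assuming the metric names above; the result shape and the stability field are illustrative.

```typescript
// Sketch of the offline benchmark pass/fail check using the thresholds above.
// Metric names mirror the list; the result shape is illustrative.
interface BenchmarkResult {
  model: string;              // e.g. "whisper-small.en", "moonshine-v2-base"
  latency_partial_ms: number;
  latency_final_ms: number;
  wer_clean: number;          // 0-1, clean speech corpus
  wer_live: number;           // 0-1, live noisy corpus
  hallucination_rate: number; // 0-1
  speaker_accuracy: number;   // 0-1
  failure_rate: number;       // 0-1, crashed or dropped runs (stability)
}

function passesFloor(r: BenchmarkResult): boolean {
  return (
    r.latency_final_ms < 300 &&
    r.wer_clean < 0.10 &&
    r.wer_live < 0.20 &&
    r.hallucination_rate < 0.02 &&
    r.speaker_accuracy > 0.85
  );
}

// Ranking order from the spec: latency first, then accuracy, then stability.
function rank(results: BenchmarkResult[]): BenchmarkResult[] {
  return [...results]
    .filter(passesFloor)
    .sort(
      (a, b) =>
        a.latency_final_ms - b.latency_final_ms ||
        a.wer_live - b.wer_live ||
        a.failure_rate - b.failure_rate
    );
}
```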
We already have a working v2.6.1 pipeline. Building the v3.0.0 architecture as a hot-swap rewrite is how working systems break. Instead, each layer lands in observe-only mode first, gated behind a feature flag, with a single-flag rollback.
| Phase | What ships | Risk if wrong |
|---|---|---|
| M1 — Canonical adapter | Wrap current pipeline so it emits CanonicalCaption objects alongside existing output. No behaviour change. | None — current renderer still drives output. |
| M2 — Rules Engine observe-only | Evaluate every caption against rules matrix. Log violations only. Does not affect what the user sees. | None — log volume only. |
| M3 — Policy Engine shadow mode | Generate RenderInstruction decisions for every caption. Do not apply them. Compare against live output in dashboards. | None — decisions written to decision log only. |
| M4 — Dual rendering | Run current renderer live + policy renderer to a hidden test surface. Visual A/B diff. | GPU cost; mitigated by sampling. |
| M5 — Feature flag cutover | ENABLE_POLICY_RENDER=true flips live output to the new stack. | Mitigated by mandatory rollback — a single flag reverts the entire new stack and the system immediately runs on the unchanged 2.6.1 pipeline. |
Rollback contract (mandatory): no phase is allowed to land without a verified one-flag rollback to the previous phase. Current pipeline code stays in the tree until M5 has been green for an agreed soak period.
Phase 2 listed the fields conceptually. Matt's gap review calls for a strict schema —
streaming (is_partial + revision_id), token-level timing,
per-token confidence, overlapping speakers, explicit uncertainty flags, audio-context typing,
and source-engine tracking. The TypeScript interface below is the canonical definition every
input engine writes and every downstream layer consumes.
Schema lives as both a TypeScript interface and a JSON schema in the shared
@wallspace/captions-core package consumed by A.EYE.ECHO and WallSpace.
Validation runs in CI; any engine emitting a non-conforming object fails the build.
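As a placeholder until the committed @wallspace/captions-core definition lands, here is a sketch of the kind of interface the gap review calls for. Field names and types are assumptions that cover the requirements listed above, not the agreed schema.

```typescript
// Illustrative sketch only; the committed interface lives in
// @wallspace/captions-core. Fields cover the requirements listed above:
// streaming revisions, token timing, per-token confidence, overlapping
// speakers, uncertainty, audio context, and source-engine tracking.
interface CanonicalToken {
  text: string;
  start_ms: number;
  end_ms: number;
  confidence: number;        // 0-1, per token
}

interface CanonicalCaption {
  id: string;
  is_partial: boolean;       // streaming: still being revised
  revision_id: number;       // increments on each revision of the same caption
  tokens: CanonicalToken[];
  speakers: string[];        // more than one entry = overlapping speakers
  uncertain: boolean;        // explicit uncertainty flag for downstream layers
  audio_context: 'speech' | 'music' | 'mixed' | 'silence';
  source_engine: string;     // e.g. "whisper.cpp", "moonshine-v2", "deepgram-nova-3"
}
```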
Phase 4 states the priority order. The gap review wants deterministic process: how we score, how conflicts resolve, when we fall back to a safe mode. This is that spec.
SAFE fallback mode = static bottom-centre placement, no enhancements, reduced font scale, no spatial positioning. It is the pipeline's “degrade gracefully” target; the accessibility-testing hard constraint (below) means SAFE mode must always remain ≥ baseline comprehension.
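A sketch of what the SAFE fallback could look like as a Policy Engine output, assuming a hypothetical RenderInstruction shape (the real one is Phase 4 work).

```typescript
// Hypothetical RenderInstruction shape; the real spec is Phase 4 work.
interface RenderInstruction {
  placement: 'bottom-center' | 'spatial';
  fontScale: number;          // 1.0 = baseline size
  enhancements: string[];     // emotion tint, reactive typography, etc.
  motion: boolean;
}

const SAFE_MODE: RenderInstruction = {
  placement: 'bottom-center', // static, no spatial positioning
  fontScale: 0.9,             // reduced font scale per the SAFE description
  enhancements: [],           // no enhancements of any kind
  motion: false,
};
```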
Accessibility is central to the system but we have no structured way to say “feature X made things better/worse.” We define four repeatable scenarios, four metrics, and one hard constraint.
No new feature may reduce comprehension score vs baseline. Regression on any scenario blocks the feature. Results stored as JSON per test run; baseline vs current tracked over time. Matt is primary user-tester for A.EYE.ECHO; WallSpace needs a second deaf/HoH tester cohort (open item).
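A sketch of the per-run JSON record and the hard-constraint check, with placeholder scenario and metric names; only the "no regression vs baseline" rule is taken from the text above.

```typescript
// Sketch of the per-run record described above. Scenario and metric names
// are placeholders; the regression rule is the important part.
interface AccessibilityRun {
  feature: string;               // feature under test
  scenario: string;              // one of the four repeatable scenarios
  comprehension: number;         // 0-100, human-rated
  baselineComprehension: number; // same scenario with the feature disabled
}

/** A feature is blocked if it regresses comprehension on ANY scenario. */
function featureBlocked(runs: AccessibilityRun[]): boolean {
  return runs.some((r) => r.comprehension < r.baselineComprehension);
}
```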
Emotion inference is mentioned conceptually throughout the plan (Gadi's framing). It becomes a real product, ethical, and legal concern the moment it ships. We lock down the rules now, before any ML emotion model lands.
Emotion inference must NEVER be enabled without explicit user awareness. First-run onboarding must show the emotion setting; silent telemetry of emotion data is banned. This complements the Spotify/media compliance guardrails — emotion data is display-only, not a layer, and never leaves the device unless the user opts into a cloud emotion service.
The cloud-ASR path (Deepgram, fal, or a shared WallSpace service) has good architecture but no defined behaviour under failure, latency spikes, or cost overruns. We pin numbers.
The v3.0.0 plan is Apple-centric (Core Audio, Metal, Vulkan-on-Metal). Windows needs an explicit parity plan or it will drift. Matt's own dev environment is Windows-capable — this is not hypothetical.
| Layer | Mac | Windows |
|---|---|---|
| Audio capture | Core Audio / Screen Capture Kit | WASAPI |
| ASR runtime | whisper.cpp subprocess, Moonshine subprocess | Subprocess (Moonshine / Whisper), optional ONNX runtime |
| GPU inference | Metal | Vulkan (if supported), fallback CPU |
Windows parity checklist (per release):
The Jack ↔ Matt split was described in prose. This table makes ownership explicit. Deadlines are TBD pending our in-person meeting — Matt holds the Boomtown + cohort context that should set realistic dates.
| Phase | Owner | Deliverable | Metric | Deadline |
|---|---|---|---|---|
| P1 | Matt | rules-matrix.json (BBC-derived) | Coverage % of BBC guidelines | TBD |
| P2 | Jack | Canonical schema (TS + JSON) | Validation pass in CI | TBD |
| P3 | Both | Rules Engine | Test pass rate | TBD |
| P4 | Jack | Policy Engine | Decision accuracy vs expected | TBD |
| P5 | Jack | Renderer (baseline + SAFE mode) | Visual stability / no reflow | TBD |
| P6 | Both | Enhancement layers (emotion, prosody, spatial) | Regression pass vs Phase 7 suite | TBD |
| P7 | Both | Testing suite (unit + perception) | Full scenario coverage (A/B/C/D) | TBD |
One of the core goals of the Decision Log specification is cross-system consistency between WallSpace and A.EYE.ECHO. The decision log becomes the common language the two apps use to stay coherent even as they evolve on different cadences with different contributors.
A.EYE.ECHO is open-source (MIT). Matt is focusing on pushing that side forward; Jack is focusing on WallSpace and the shared cloud services. Both apps share a caption codebase and this architecture, so work on one benefits the other.
Matt leads the Rules Matrix extraction (Phase 1) given his depth on the BBC document, and owns the A.EYE.ECHO implementation of Phases 2–7. Jack leads shared-service architecture (Layers 2–5 as reusable packages) and the WallSpace-specific Phase 6 enhancement integrations (visual / creative).
| Component | Shared | App-specific |
|---|---|---|
| Rules Matrix | Class A rules | Class D rules per app |
| Canonical Caption Data Model | Schema identical | — |
| Rules Engine | Core engine | — |
| Policy Engine | Core logic + priority order | Profile definitions |
| Renderer | Baseline text rendering logic | Platform-specific output (React Native for Echo, Electron + Scope for WallSpace) |
| Enhancement Layer | Proposal interface | Implementations differ (haptics for Echo; Scope visuals for WallSpace) |
| Decision Log | Schema + constraint vocabulary | Storage backend |
Per Matt's architecture paper, Claude is not used as a general-purpose assistant. It operates strictly within the defined architecture as a constrained reasoning and validation engine. All documents in the spec set (Rules Matrix, Canonical Caption Data Model, Policy Engine, Renderer, Enhancement Layer, Decision Log) are provided to Claude as authoritative inputs.
12x GPU performance boost via Vulkan API. Immediate improvement with no architecture changes.
Frame-to-frame spectral change for consonant edges. More robust than current energy delta method.
Download, benchmark against current whisper.cpp. If it works: immediate 100x speed improvement.
Same Whisper model, 299x faster inference via cloud LPU. Quick test with no local changes needed.
LPC-based formant extraction. Gives us vowel space analysis and accent profiling capability.
$200 free credit. WebSocket streaming API. Could be our cloud fallback for best accuracy.
"The future of voice AI is not better transcription, but deeper audio-native understanding that combines signal processing with semantic reasoning." — Claude Analysis Report
"You cross-reference the text with a kind of sonic analysis and you try to provide a tone-of-voice tag. For instance, they will do consonants really well. Transient analysis is very important." — Gadi Sassoon, Vocal Forensics Consultation
"The hard engineering question is also quite interesting... I used to do vocal synthesis with four months in Csound in 2003. The processes that I have been developing are designed for the original design for video editing." — Gadi Sassoon, on bridging DSP and real-time systems
"I've built a really super crazy stack of agents that has been growing and growing... one of the things they build is basically a models librarian which runs on a constant cron job and scrubs the internet for the latest developments in AI models specifically with a particular interest in audio." — Gadi Sassoon, on staying current with audio AI research
A.EYE.ECHO is a React Native / Expo mobile app (com.wallspace.aeyeecho) built for deaf and hard-of-hearing
accessibility. It uses native speech APIs exclusively — no Whisper, no ML models, no DSP.
The philosophy: leverage what the OS already does well, and focus engineering on accessibility UX.
| Capability | WallSpace (Electron) | A.EYE.ECHO (Mobile) |
|---|---|---|
| Platform | Electron (macOS / Win / Linux) | Expo / React Native (iOS / Android) |
| Speech Engine | Whisper subprocess + Web Speech + Native | expo-speech-recognition (native only) |
| DSP Features | 18 features @ 100ms | None |
| Emotion Analysis | Lexicon + voice + sarcasm | None |
| Translation | CTranslate2 offline → DeepL → LibreTranslate | DeepL → LibreTranslate |
| Speaker ID | Spectral centroid profiling | Camera face + lip-sync correlation |
| Diarization | Spectral centroid shift | Energy + timing heuristics |
| Sign Language | None | ASL (26 letters, Vision hand pose) |
| Haptic Feedback | None | 6 patterns (expo-haptics) |
| URL Ingest | None | YouTube, HLS, direct media |
| Caption Sharing | None | WebSocket relay (room codes) |
| Persistence | Session-only (JSON/SRT) | SQLite (sessions + segments) |
| Beat / Music | FFT onset, MIDI, tap tempo | None |
| Scope Integration | Real-time prompt modifiers | None |
Voice features can drive accessibility-specific outputs on mobile that don't exist yet. Emotion detection maps to haptic intensity patterns — deaf users could feel the emotional tone of speech through their phone's vibration motor. Pitch contour maps to caption text styling (italic for questions, bold for emphasis). Volume maps to caption font size (whisper → shouting). These are novel accessibility features that neither iOS nor Android provide natively.
Beyond the phone: Haptic feedback isn't limited to mobile vibration motors. Deaf and hard-of-hearing audience members at live events may wear haptic wearables — vests (SubPac, Woojer), wristbands (Basslet), or seat transducers — that translate sound into physical sensation. WallSpace could drive these devices from the stage, sending both speech emotion haptics (feel the tone of a speaker) and music-reactive haptics (feel the beat, bass, and dynamics). WallSpace already has beat/kick detection (FFT onset, MIDI clock, tap tempo) and frequency band analysis (sub/bass/mid/high) — this data is ready to drive haptic output.
| Voice Feature | Echo (Phone) | WallSpace (Visuals) | Haptic Wearables (Live Events) | Feasibility |
|---|---|---|---|---|
| Emotion | Phone vibration patterns | Caption color tint + Scope prompts | Vest/wristband intensity + zone mapping | Text lexicon (free) |
| Pitch direction | Caption styling (italic) | Caption styling + question detection | Rising/falling sensation on body | Light DSP |
| Volume | Caption font size scaling | Caption size + output emphasis | Haptic intensity scaling | Amplitude available |
| Speaking rate | Caption scroll speed | Caption pacing + scene timing | Pulse rhythm matching speech cadence | Timing heuristics |
| Speaker change | Triple-pulse vibration | Speaker label + color switch | Directional haptic (left/right speaker) | Already in both |
| Beat / kick | Not implemented | Visual triggers + scene changes | Bass transducer pulses on beat | WallSpace has FFT onset |
| Frequency bands | Not implemented | Audio-reactive layer effects | Sub/bass/mid/high mapped to body zones | WallSpace has 7 bands |
| Transients | Alert vibration | Scope visual intensity | Sharp tactile clicks on consonants | Needs raw audio |
| Trembling | Gentle double pulse | Visual softening effect | Subtle tremor sensation | Needs DSP or cloud |
| Sarcasm | Visual indicator (~) | Caption annotation + mood shift | Contradictory pulse (sharp then soft) | Needs text + voice sync |
At live music and speech events, deaf audience members increasingly use haptic wearable technology to experience sound physically. WallSpace is uniquely positioned to drive these devices because it already has the audio analysis pipeline running in real-time:
Devices: SubPac M2X (backpack/vest), Woojer Vest Edge, Basslet (wristband),
ButtKicker (seat mount), custom Arduino/ESP32 builds via Bluetooth LE or OSC.
WallSpace's existing OSC bridge (src/main/oscBridge.ts) could output haptic control
messages alongside visual triggers — same data, different output modality.
Available today: expo-av metering (dB level).
Limitation: expo-av provides only dB amplitude, not raw PCM buffers.
For real DSP (pitch, spectral centroid, transients), you need AVAudioEngine.installTap (iOS)
or AudioRecord (Android) via a custom Expo native module.
To run DSP directly on mobile without a cloud service, you'd need custom native modules:
- iOS: AVAudioEngine with installTap(onBus:) for raw PCM buffers, plus the Accelerate framework for vDSP FFT; could run pitch detection + RMS + basic spectral analysis natively
- Android: AudioRecord for raw PCM; basic DSP feasible natively
- A shared onAudioBuffer(Float32Array) callback would allow porting a subset of VoiceFeatureExtractor — but maintaining Swift + Kotlin implementations is significant engineering
WallSpace uses Web Audio API's AudioWorklet for real-time DSP in the Electron renderer process.
React Native has no equivalent. expo-av provides only amplitude metering.
The practical path forward: (a) basic amplitude/timing features locally,
(b) heavy DSP via a shared cloud service that accepts audio chunks and returns enriched data.
This avoids the significant native engineering of building cross-platform audio buffer access.
A cloud service both apps share for heavy audio processing. Mobile gets capabilities it can't run locally. Desktop gets a cloud fallback when local processing is insufficient. Built on the existing wallspace.studio Cloudflare infrastructure.
| Endpoint | Method | Purpose | Auth |
|---|---|---|---|
| wss://wallspace.studio/api/audio/stream | WebSocket | Send audio chunks, receive enriched transcripts in real-time | JWT |
| POST /api/audio/analyze | HTTP | One-shot analysis of an audio buffer (batch mode) | JWT |
| GET /api/audio/models | HTTP | List available processing models and capabilities | Public |
| POST /api/audio/session | HTTP | Create or end a processing session | JWT |
Each active audio session maps to a Durable Object instance. The DO holds: current speaker profile, emotion history (for hysteresis smoothing), accumulated transient buffer (1-second window), session metadata. Audio chunks arrive via WebSocket, get processed by external API (Deepgram), results streamed back. Durable Objects provide: per-session state without external database, WebSocket hibernation (cost-efficient idle sessions), automatic cleanup on disconnect.
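A rough sketch of the per-session Durable Object, assuming @cloudflare/workers-types; the state fields and the analyzeChunk placeholder are illustrative, and hibernation/cleanup are omitted for brevity.

```typescript
// Sketch of the per-session Durable Object described above. Field names and
// the analyzeChunk placeholder are illustrative; DurableObjectState and the
// Response webSocket option come from @cloudflare/workers-types.
type Env = Record<string, unknown>;

export class AudioSessionDO {
  // Emotion history kept for hysteresis smoothing, per the session state list.
  private emotionHistory: Array<{ t: number; emotion: string }> = [];

  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(_request: Request): Promise<Response> {
    // Each active session is one WebSocket carrying audio chunks.
    const pair = new WebSocketPair();
    const [client, server] = Object.values(pair);
    server.accept();

    server.addEventListener('message', async (event) => {
      const chunk = event.data as ArrayBuffer;
      const enriched = await this.analyzeChunk(chunk); // external ASR + emotion
      server.send(JSON.stringify(enriched));
    });

    return new Response(null, { status: 101, webSocket: client });
  }

  private async analyzeChunk(_chunk: ArrayBuffer) {
    // Placeholder: the real implementation forwards to Deepgram and enriches
    // the response with emotion data before streaming it back.
    return { transcript: '', emotion: 'neutral', history: this.emotionHistory.length };
  }
}
```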
Worker receives audio from client, forwards to Deepgram Nova-3, enriches response with emotion data before returning.
Cloudflare Workers AI for on-edge inference. Run whisper-tiny or emotion classification directly on Cloudflare's GPU fleet.
Both apps already share the wallspace.studio domain. The existing JWT auth system
(/api/auth/login, /api/auth/me, Google/GitHub/Apple SSO) authenticates
both Electron and mobile clients. The mobile app stores the JWT in expo-secure-store.
The existing _shared.ts auth helper already has CORS support and token verification.
No new auth infrastructure needed.
WallSpace's translationService.ts was literally ported from A.EYE.ECHO.
The two files are 90% identical — same DeepL client, same LibreTranslate fallback, same LRU cache.
Time to extract shared code into packages both apps import from.
| Service | WallSpace File | Echo File | Overlap | Action |
|---|---|---|---|---|
| Translation | renderer/services/translationService.ts | src/services/translationService.ts | 90% | Extract @wallspace/translation |
| Emotion Lexicon | renderer/utils/sentimentAnalyzer.ts | none | Pure TS | Copy to Echo (zero deps) |
| Caption Network | none | src/services/captionNetworkService.ts | Echo only | Port to WallSpace |
| Diarization | VoiceFeatureExtractor (spectral) | src/services/audioDiarization.ts | Different approach | Merge: timing + spectral |
| Vibration | none | src/services/vibrationService.ts | Echo only | Add emotion → haptic map |
| DB Schema | session-only (JSON/SRT) | src/services/database.ts | Partial | Align with cloud D1 schema |
| Types | Various renderer types | src/types/index.ts | ~80% | Extract @wallspace/types |
@wallspace/translation
- TranslationCache (LRU, 200 entries)
Platform-specific bits stay separate: WallSpace keeps CTranslate2 offline via electronAPI; Echo keeps expo-constants API key resolution.
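A sketch of the shared translation entry point this package could expose, assuming hypothetical primary/fallback client signatures; only the DeepL-then-LibreTranslate order and the 200-entry cache size come from the existing code.

```typescript
// Sketch only: the Translator signatures are hypothetical stand-ins for the
// DeepL and LibreTranslate clients; cache size (200) comes from existing code.
type Translator = (text: string, target: string) => Promise<string>;

async function translate(
  text: string,
  target: string,
  cache: Map<string, string>, // simple FIFO-style cache (the real TranslationCache is a proper LRU)
  primary: Translator,        // DeepL client
  fallback: Translator        // LibreTranslate client
): Promise<string> {
  const key = `${target}:${text}`;
  const hit = cache.get(key);
  if (hit !== undefined) return hit;

  let result: string;
  try {
    result = await primary(text, target);
  } catch {
    result = await fallback(text, target); // same fallback chain both apps use
  }

  cache.set(key, result);
  if (cache.size > 200) {
    // Evict the oldest entry (Map preserves insertion order).
    cache.delete(cache.keys().next().value as string);
  }
  return result;
}
```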
@wallspace/types
- TranscriptSegment
- Speaker, SpeakerProfile
- Emotion, EmotionResult
- TranscriptionStatus
- WhisperLanguage
Both apps define these nearly identically.
Union them (Echo adds source?: 'speech' | 'sign-language').
@wallspace/emotion
- WEIGHTED_LEXICON (800+ terms)
- analyzeTextEmotion()
- EMOTION_VISUALS mapping
- emotionToHaptic() (new)
Pure TypeScript, zero DOM dependencies.
Runs in React Native without modification.
Add emotionToHaptic mapping for mobile.
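A sketch of what emotionToHaptic() could look like; the pattern values are assumptions to be tuned with deaf/HoH testers, and the Emotion union mirrors the four classes mentioned earlier (angry, happy, neutral, sad).

```typescript
// Sketch of the proposed emotionToHaptic() addition. Pattern values are
// assumptions to be tuned with testers; on Echo they would map to expo-haptics
// calls, on WallSpace to OSC messages for wearables.
type Emotion = 'happy' | 'sad' | 'angry' | 'neutral';

interface HapticPattern {
  pulses: number;    // number of pulses in the pattern
  intensity: number; // 0-1, scaled by confidence below
  gapMs: number;     // spacing between pulses
}

function emotionToHaptic(emotion: Emotion, confidence: number): HapticPattern {
  const base: Record<Emotion, HapticPattern> = {
    happy:   { pulses: 2, intensity: 0.6, gapMs: 120 },
    sad:     { pulses: 1, intensity: 0.3, gapMs: 0 },
    angry:   { pulses: 3, intensity: 0.9, gapMs: 80 }, // rapid and strong, mirrors the transient-rate cue
    neutral: { pulses: 1, intensity: 0.4, gapMs: 0 },
  };
  const p = base[emotion];
  // Scale intensity by detection confidence so uncertain reads stay subtle.
  return { ...p, intensity: p.intensity * confidence };
}
```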
Option A (recommended start): npm workspace monorepo — add packages/ directory
to crt-wall-controller, reference from both projects. Simpler dev workflow, instant iteration.
Option B (later): Published @wallspace/ scoped packages on npm — cleaner separation,
works with any project structure, but adds publish/version overhead.
Start with A for speed, migrate to B when the shared API is stable.
WallSpace.Studio is a commercial creative tool. A.EYE.ECHO is open-source (MIT-licensed) so it stays freely available to the deaf community. Matt's going to focus on pushing A.EYE.ECHO forward while Jack focuses on WallSpace and the shared cloud services, so both sides of the caption system move in parallel and benefit from the same underlying work.
The two surfaces share a caption codebase but have different funding targets. Cloud ASR services (Deepgram, AssemblyAI, Workers AI) have per-minute costs that add up — especially for a service intended to stay free for deaf users. We start with what costs nothing: native OS speech APIs, existing code reuse, and pure TypeScript logic that runs on any platform. Paid cloud services come later, funded either through grants / community support on the A.EYE.ECHO side, or via WallSpace commercial revenue subsidizing both.
WallSpace has a complete native SFSpeechRecognizer implementation —
the same free, on-device Apple speech engine that Echo uses. It's fully built:
native C++/Objective-C addon (native/speech-recognition/src/speech_mac.mm),
main process bridge (src/main/nativeSpeechBridge.ts),
renderer service (src/renderer/services/nativeSpeechEngine.ts),
IPC plumbing, 25+ language support, 55-second auto-restart.
Abandoned in v2.6.1: Despite adding entitlements, signing, notarizing, and switching to native arm64 builds, both SFSpeechRecognizer and Web Speech API consistently crash the Electron renderer process. Tested on Rosetta x64 (SIGTRAP), arm64 native (grey screen / SIGSEGV), with and without active Apple Developer agreement. The crash occurs at the native addon boundary and is not recoverable via try-catch. Moonshine v2 replaces all three speech engines (Whisper, native, web) with a single local streaming ML model — that is the path forward.
| Requirement | Status | Action |
|---|---|---|
| Native addon (speech_mac.mm) | Complete | No changes needed |
| IPC bridge + renderer service | Complete | No changes needed |
| Hardened runtime | Enabled | Already in electron-builder config |
| Speech entitlement in plist | Added in v2.6.0 | com.apple.security.speech-recognition added to entitlements.mac.plist |
| Apple Developer certificate | Working | Team ID 9K65QDV874 — signing successful |
| Notarized build | v2.6.0 shipped | Signed, notarized, published April 12, 2026 |
Path forward: Moonshine v2 provides local streaming ASR without native addon dependencies — runs as WASM or subprocess, no SFSpeechRecognizer, no Web Speech API, no Electron renderer crash risk. Replaces Whisper too (lower latency, streaming output). Echo continues using native speech APIs on iOS where they work reliably.
Everything in this phase costs nothing — no API keys, no subscriptions, no cloud services. Pure code reuse, native OS capabilities, and TypeScript logic that already exists.
Outcome: Despite full implementation (entitlements, signing, notarization, arm64 native build), SFSpeechRecognizer and Web Speech API both crash the Electron renderer on every tested configuration. Replaced by Moonshine v2 in the roadmap below.
Added the com.apple.security.speech-recognition entitlement. Status: Abandoned. Whisper remains the only ASR engine until Moonshine v2.
Goal: Give Echo text-based emotion analysis with zero additional dependencies
- Copy sentimentAnalyzer.ts to Echo — pure TypeScript, zero DOM or Electron deps
- VibrationService
Effort: Half a day | Cost: $0
Goal: Deaf users feel the emotional tone of speech — on phone and wearable devices
- VibrationService to accept EmotionResult from lexicon
Effort: 1-2 days | Cost: $0 (OSC output is free, wearable hardware is user-provided)
Goal: Visual representation of how loud someone is speaking
- expo-av (dB level)
- volumeCategory in VoiceFeatureExtractor — reuse the thresholds
Effort: Half a day | Cost: $0
Goal: Eliminate 90% duplicated code between both apps
- Extract TranslationCache, DeepL client, LibreTranslate client into @wallspace/translation
Effort: Half a day | Cost: $0
Goal: WallSpace can broadcast/receive captions to/from mobile devices
- Port CaptionNetworkService to WallSpace (WebSocket relay + room codes)
- The existing relay (caption-relay.glitch.me) is free-tier hosted
Effort: 1-2 days | Cost: $0 (Glitch free tier)
Goal: Captions adapt pacing to speech speed
Effort: Half a day | Cost: $0
These features use paid APIs. Development and testing can happen now using free credits and careful usage management. Production-scale always-on use requires funding. The key principle: free engines run by default, paid engines are opt-in and never persist across restarts.
Deepgram is not an always-on replacement for native speech — it's a premium mode you switch on when you need the best accuracy or cloud-based voice emotion analysis, then switch off. Free native ASR handles day-to-day captioning; Deepgram handles demos, live events, and testing.
| Usage Pattern | Monthly Cost | $200 Credit Lasts |
|---|---|---|
| 2-hour live events, 2x/month | $1.85 | ~9 years |
| 1 hour/day testing & development | $14 | ~14 months |
| 4 hours/day regular use | $55 | ~108 days |
| 12 hours/day always-on (avoid this) | $166 | ~36 days |
Deepgram is one of several cost-based APIs in the WallSpace ecosystem (Deepgram, DeepL, fal.ai/RunPod for Scope GPU, future cloud services). A comprehensive API cost management initiative is needed across all paid services — unified cost tracking dashboard, per-service budgets, usage alerts, and spend reporting. Matt already has tickets scoped around this topic. For now, Deepgram follows the same patterns as other cost-based APIs in the app (manual enable, session tracking, warnings). A unified cost management system will be planned and addressed as a separate initiative.
Goal: Validate cloud ASR quality + voice emotion pipeline, manage costs carefully
Effort: 1-2 days | Cost: $0 prefunding ($200 credit), ~$0.0077/min after
Goal: Both apps get enriched captions via wallspace.studio/api/audio
Effort: 1-2 days | Cost: Deepgram usage + Cloudflare Workers (free tier generous)
Goal: Replace Glitch free-tier relay with production-grade infrastructure
- Migrate caption-relay.glitch.me to wallspace.studio Durable Objects
Effort: 1-2 days | Cost: Cloudflare Workers Paid ($5/mo base)
Goal: Replace heuristic emotion rules with trained ML models
Effort: 1-2 weeks | Cost: Workers AI pricing or GPU hosting
Goal: On-device voice feature extraction without cloud dependency
- AVAudioEngine.installTap (iOS) + AudioRecord (Android)
Effort: 1-2 weeks | Cost: $0 (engineering time only)
Goal: Merge camera-based (Echo) + audio-based (WallSpace) speaker identification
Effort: 2+ weeks | Cost: $0 (engineering time only)
SFSpeechRecognizer sessions auto-terminate at ~60s.
Echo already handles this with auto-restart at 55s (SESSION_RESTART_MS = 55_000).
WallSpace's native bridge uses the same pattern (speech_mac.mm restarts at 55s).

Total cost of the free phase: $0
Deepgram premium mode: $1-5/mo if toggled sparingly.
Native mobile DSP: $0 runtime, weeks of dev time.
Open question: is expo-secure-store sufficient for JWT storage on mobile?