Claude/fix health check ap3 de #88

Open
mirai-gpro wants to merge 70 commits into aigc3d:master from mirai-gpro:claude/fix-health-check-ap3De

Conversation

@mirai-gpro

No description provided.

Test scripts to verify A2E (Audio2Expression) lip sync quality
with Japanese audio input, before investing in ZIP motion replacement
or VHAP Japanese FLAME params.

Includes:
- generate_test_audio.py: EdgeTTS Japanese/English/Chinese audio samples
- test_a2e_cpu.py: A2E model loading, Wav2Vec2 feature extraction, ZIP validation
- save_a2e_output.py: Capture A2E 52-dim ARKit blendshape output
- analyze_blendshapes.py: Lip sync quality scoring and language comparison
- setup_oac_env.py: Auto-detect known OpenAvatarChat issues (CPU mode, deps, config)
- chat_with_lam_jp.yaml: Corrected config (Gemini API + EdgeTTS ja-JP-NanamiNeural)
- run_all_tests.py: Master test runner
- TEST_PROCEDURE.md: Step-by-step test procedure

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Fix RuntimeError: Input data type <class 'list'> is not supported.
- diagnose_onnx_error.py: Tests SileroVAD ONNX, SenseVoice, data flow
- patch_vad_handler.py: Fixes timestamp[0] NoneType bug, adds defensive
  numpy type checking on ONNX inputs, handles 2/3-output model variants
- setup_oac_env.py: Adds VAD handler bug detection (check 7/7)

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Simple test script that verifies environment, model files,
data_bundle.py fix, Wav2Vec2 loading, and A2E module import.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Gemini's OpenAI-compatible API sometimes returns delta.content as dict/list
instead of string, causing TypeError in set_main_data(). This patch script
detects and safely converts non-string content before passing to data_bundle.
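A minimal sketch of the conversion described above (a hypothetical helper, not the actual patch script): anything non-string coming in as `delta.content` is JSON-encoded before it reaches `set_main_data()`.

```python
import json

def coerce_content_to_str(content):
    """Defensively convert a streaming delta's `content` field to str.

    Gemini's OpenAI-compatible endpoint occasionally emits dict or list
    payloads where a plain string is expected; JSON-encode those so the
    downstream data bundle never sees a non-string.
    """
    if content is None:
        return ""
    if isinstance(content, str):
        return content
    if isinstance(content, (dict, list)):
        return json.dumps(content, ensure_ascii=False)
    return str(content)
```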

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
gemini-2.0-flash returns 404 "no longer available to new users".
The error dict then cascades into the set_main_data TypeError.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
SenseVoice auto-detection defaults to Chinese (<|zh|>), causing
Japanese speech to be misrecognized as Chinese text. This patch
forces language="ja" in the generate() call.

- patch_asr_language.py: Auto-patches asr_handler_sensevoice.py
- chat_with_lam_jp.yaml: Added language: "ja" to SenseVoice config
- TEST_PROCEDURE.md: Added Step 4.5 for patch application

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Instead of creating a separate config file, this script patches
the existing working config/chat_with_lam.yaml with 3 changes:
1. TTS voice → ja-JP-NanamiNeural
2. LLM system_prompt → Japanese
3. ASR language → ja

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Root cause analysis from production logs:
- 1st ASR call: rtf=0.629 (1.25s) - OK
- 2nd ASR call: rtf=15.027 (29.83s) - GPU memory exhausted, CPU fallback
- fastrtc 60s timeout triggers, resets frame pipeline → system unresponsive

Fix: Add torch.cuda.empty_cache() + gc.collect() after each SenseVoice
and LAM inference to free GPU memory between calls. Also adds startup
wrapper with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
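The cleanup pattern can be sketched as a decorator (an illustrative wrapper, not the shipped patch). `torch` is imported lazily so the sketch also runs on CPU-only machines.

```python
import functools
import gc

def free_gpu_after(fn):
    """Run gc.collect() and torch.cuda.empty_cache() after each call,
    so cached CUDA blocks are released between SenseVoice / LAM
    inference calls instead of accumulating until exhaustion."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        finally:
            gc.collect()
            try:
                import torch
                if torch.cuda.is_available():
                    torch.cuda.empty_cache()
            except ImportError:
                pass  # CPU-only environment: nothing to release
    return wrapper
```

Wrapping each inference entry point (e.g. `free_gpu_after(model.generate)`) keeps the fix localized rather than scattering cleanup calls through the handlers.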

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Create the missing Audio2Expression inference service that bridges
gourmet-support backend (which already has A2E hooks in /api/tts/synthesize)
with the actual Wav2Vec2 + LAM A2E decoder pipeline.

Services:
- audio2exp-service: Flask API accepting MP3 audio, returning 52-dim
  ARKit blendshape coefficients at 30fps. Includes Wav2Vec2 feature
  extraction and fallback mode when A2E decoder is unavailable.
- Frontend ExpressionManager: Maps A2E blendshapes to GVRM bone system,
  syncing with audio playback via currentTime.

Architecture: TTS → MP3 → audio2exp-service → 52-dim blendshapes → frontend

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
The a2e_engine now searches multiple patterns for the checkpoint:
- models/LAM_audio2exp_streaming.tar (flat, user's actual layout)
- models/LAM_audio2exp/pretrained_models/*.tar (OpenAvatarChat layout)
- models/LAM_audio2exp/*.tar (intermediate layout)
Falls back to rglob search if none match.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Full drop-in replacement for gourmet-sp's concierge-controller.ts with
Audio2Expression integration applied. Key changes marked with ★ comments:
- ExpressionManager import and initialization
- session_id added to /api/tts/synthesize requests
- A2E expression data used for lip sync when available
- FFT-based lip sync preserved as fallback
- Proper cleanup in stopAvatarAnimation() and dispose()

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Replaces the scaffold version with the real concierge-controller.ts from
gourmet-sp (claude/test-concierge-modal-rewGs branch). A2E integration is
already built-in via applyExpressionFromTts() + lamAvatarController.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
uvicorn is an ASGI server (FastAPI/Starlette) and cannot serve Flask
(WSGI). This caused the Cloud Run container to fail to start and listen
on the port, resulting in deployment timeout.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Covers all components: backend (gourmet-support), frontend (gourmet-sp),
audio2exp-service, A2E frontend patches, official HF Spaces ZIP generation
procedure, test suite, deployment config, and end-to-end data flow diagrams.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
The audio2exp-service returns frames as arrays of numbers (number[][]),
but applyExpressionFromTts expected objects with a .weights property
({weights: number[]}[]), causing TypeError and empty frame buffer.

Changed f.weights[i] to frameData[i] to match the actual backend format.
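The accepted-format logic (later hardened to handle both shapes) looks like this as a Python analog of the TypeScript check `Array.isArray(f) ? f : f.weights`:

```python
def frame_weights(frame):
    """Accept a frame as either a raw list of floats (the number[][]
    format audio2exp-service actually returns) or a {"weights": [...]}
    object (the shape the frontend originally expected)."""
    if isinstance(frame, list):
        return frame
    return frame.get("weights", [])
```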

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…AvatarController)

The previous implementation used window.lamAvatarController which doesn't
exist in this codebase, causing lip sync to completely fail (buffer=0,
jaw=0, mouth=0). Additionally, the data format was wrong (f.weights[i]
vs the actual number[][] response).

Now uses ExpressionManager (vrm-expression-manager.ts) which:
- Correctly handles the number[][] frame format from audio2exp-service
- Syncs to audioElement.currentTime for accurate lip sync timing
- Maps ARKit blendshapes (jawOpen, mouthFunnel, etc.) to GVRM bone system
- Calls renderer.updateLipSync() directly

Changes:
- Import ExpressionManager and initialize in init()
- Replace lamAvatarController dependency with ExpressionManager
- Add expressionManager.stop() in stopAvatarAnimation()
- All 5 call sites (speakTextGCP, speakResponseInChunks x2, shop TTS x2)
  now correctly drive lip sync through ExpressionManager

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
The import '../avatar/vrm-expression-manager' caused a Vite build error
because that file doesn't exist in gourmet-sp's src/scripts/avatar/.

Solution: inline the ExpressionManager class directly into
concierge-controller.ts. This eliminates the need to copy a separate
file into gourmet-sp and avoids import resolution issues.

The ARKIT_INDEX map is trimmed to only the 7 mouth-related blendshapes
actually used for lip sync (jawOpen, mouthFunnel, mouthPucker, etc.)

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Root cause: this.guavaRenderer doesn't exist on CoreController.
LAMAvatar.astro has its own animation loop with buffer/ttsActive state.
The ExpressionManager approach was completely wrong architecture.

Correct approach: use window.lamAvatarController exposed by LAMAvatar.astro
- setExternalTtsPlayer(): links ttsPlayer so LAMAvatar can track playback
- queueExpressionFrames(): feeds A2E frames into LAMAvatar's buffer
- clearFrameBuffer(): clears buffer on stop/new segment

Changes:
- Remove inlined ExpressionManager class (120 lines of dead code)
- Restore lamAvatarController.setExternalTtsPlayer() with retry (500ms x 20)
- applyExpressionFromTts: convert number[][] → {name: value}[] and queue
- stopAvatarAnimation: call clearFrameBuffer() to close mouth

Console should now show:
- "[Concierge] ✅ Linked ttsPlayer with LAMAvatar controller"
- "[Concierge] A2E: N frames queued @ 30fps"
- LAM Health: buffer>0, ttsActive=true during speech

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
… code

Read the ACTUAL LAMAvatar.astro, lam-websocket-manager.ts, and
audio-sync-player.ts from gourmet-sp to understand the real architecture.

Key findings:
- LAMAvatar.getExpressionData() is called at 60fps by renderer
- It reads frameBuffer[floor(ttsPlayer.currentTime * frameRate)]
- Requires: externalTtsPlayer linked, frameBuffer filled, ttsActive=true
- ttsActive is set by play event (requires setExternalTtsPlayer first)

4 chains must ALL work for lip sync:
  Chain1: Backend must return expression data (needs AUDIO2EXP_SERVICE_URL)
  Chain2: setExternalTtsPlayer must link ttsPlayer with LAMAvatar
  Chain3: applyExpressionFromTts must convert & queue frames
  Chain4: LAMAvatar renders from frameBuffer synced to currentTime

Added diagnostic logs at each chain point:
  [A2E Chain1] expression received or null (backend config issue)
  [A2E Chain2] setExternalTtsPlayer success or LAMAvatar not found
  [A2E Chain3] frames queued with jawOpen sample value

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…meBuffer, support both frame formats

Compared with the ORIGINAL gourmet-sp concierge-controller.ts (from
claude/test-concierge-modal-rewGs branch) and found 2 bugs:

1. stopAvatarAnimation() called clearFrameBuffer() which resets
   fadeOutStartTime=null, breaking LAMAvatar's graceful 200ms fade-out.
   The ORIGINAL code trusts LAMAvatar's own ended event handler.
   → Removed clearFrameBuffer() from stopAvatarAnimation()

2. Frame data format mismatch:
   - Original gourmet-sp: f.weights[i] (expects {weights: number[]}[])
   - audio2exp-service: number[][] (raw arrays)
   → Now supports BOTH formats: Array.isArray(f) ? f : f.weights

Key fact: before A2E changes, lip sync was working via the renderer's
built-in FFT analysis. The A2E code path was dead code (AUDIO2EXP_SERVICE_URL
not set). These changes ensure A2E is a pure overlay that doesn't break
the existing FFT lip sync.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Root cause: When AUDIO2EXP_SERVICE_URL is set, the backend returns
expression data. The original code's applyExpressionFromTts used
f.weights[i] on raw number[] arrays, causing TypeError → caught by
outer try/catch → isAISpeaking=false → STT worked (lucky bug).

My both-format fix removed this error, so audio playback proceeds.
But if the browser blocks autoplay (fires play then immediate pause),
onended never fires → playPromise never resolves → initializeSession
hangs → buttons never enabled → STT completely broken.

Fix: Add onpause deadlock prevention to ALL 8 play-and-wait patterns,
matching the existing pattern in ack playback (line 588):
  this.ttsPlayer.onpause = () => {
    if (this.ttsPlayer.currentTime < 0.1) done();
  };

This detects "play then immediate pause" (autoplay block) and resolves
the promise, preventing deadlock. Normal mid-playback pauses (currentTime
> 0.1) are not affected.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Minimize the diff against the original gourmet-sp concierge-controller.ts.
The only substantive change is the applyExpressionFromTts method:
- Frame format: f.weights[i] → Array.isArray(f) ? f : (f.weights || [])
  (supports the number[][] format from audio2exp-service)
- Errors are handled as non-fatal via try/catch
- All other methods (speakTextGCP, STT, sendMessage, etc.) are identical to the original

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…ration

Previous patches removed all GVRM renderer integration (import, guavaRenderer,
setupAudioAnalysis, startLipSyncLoop) and replaced with non-existent
window.lamAvatarController calls, causing all A2E data to be silently dropped
and lip sync to degrade to basic jaw flapping.

This rewrite is based on the actual production concierge-controller.ts with
minimal A2E additions:
- Restore GVRM import, guavaRenderer, setupAudioAnalysis, startLipSyncLoop
- Add a2eFrames/a2eFrameRate/a2eNames properties for expression storage
- Add setA2EFrames() to store expression data from TTS response
- Add computeMouthOpenness() to convert 52-dim ARKit blendshapes to scalar
- Modify startLipSyncLoop() to use A2E frames when available, FFT as fallback
- Override speakTextGCP() with inline fetch to include session_id
- Add session_id to ALL TTS requests (ack, chunks, shop flow)

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…t GVRM)

Root cause: The patch was based on gourmet-support's concierge-controller.ts
which uses GVRM renderer, but the actual deployed frontend (gourmet-sp) uses
LAMAvatar.astro with a completely different rendering pipeline.

Previous patch problems:
- Added GVRM import/renderer that doesn't exist in gourmet-sp
- Missing linkTtsPlayer() - LAMAvatar never received ttsPlayer reference
  -> ttsActive=false, buffer=0, lip sync completely dead
- Added setupAudioAnalysis()/startLipSyncLoop() for FFT - unnecessary with LAMAvatar
- Called clearFrameBuffer() in stopAvatarAnimation() - breaks LAMAvatar fade-out

Fix: Use the exact gourmet-sp version which correctly:
- Links ttsPlayer to LAMAvatar via setExternalTtsPlayer() in init()
- Sends A2E frames via applyExpressionFromTts() -> lamAvatarController.queueExpressionFrames()
- Lets LAMAvatar handle all lip sync rendering internally
- Does NOT call clearFrameBuffer() in stopAvatarAnimation()

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
…rpolate frames

Changes to applyExpressionFromTts():
1. Mouth blendshape amplification: Scale jawOpen (1.4x), mouthFunnel/Pucker (1.5x),
   mouthSmile (1.3x), mouthStretch (1.2x) etc. for more visible Japanese vowel
   distinctions (あ/い/う/え/お)
2. Frame interpolation: 30fps→60fps via linear interpolation between consecutive
   frames, matching the renderer's ~60fps render loop for smoother animation
3. Diagnostic logging: jawOpen/mouthFunnel/mouthSmile max/avg values logged per
   expression segment for live quality monitoring
4. LinkTtsPlayer retry: Multiple retry attempts (500ms, 1s, 2s, 4s) with logging
   to reliably connect ttsPlayer to LAMAvatar even with async initialization

Quality context: A2E streaming model (wav2vec2-base-960h, no transformer) produces
subtle Japanese phoneme variations. Frontend amplification makes these visible.
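The interpolation step (change 2 above) can be sketched per-channel; the real patch lives in `applyExpressionFromTts()`, and for N input frames this produces roughly `factor * N` output frames (exactly `factor * (N - 1) + 1`).

```python
def interpolate_frames(frames, factor=2):
    """Linearly interpolate between consecutive blendshape frames,
    e.g. 30fps -> 60fps with factor=2, to match a ~60fps render loop.
    Each frame is a list of blendshape values."""
    if len(frames) < 2 or factor < 2:
        return list(frames)
    out = []
    for a, b in zip(frames, frames[1:]):
        out.append(a)
        for k in range(1, factor):
            t = k / factor
            out.append([x + (y - x) * t for x, y in zip(a, b)])
    out.append(frames[-1])
    return out
```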

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
… objects)

The user rewrote audio2exp-service with a2e_engine.py (Flask) which returns
frames as plain arrays [[0.1, ...], ...] instead of the old FastAPI format
[{"weights": [0.1, ...]}, ...].

Frontend now detects both formats: Array.isArray(f) ? f : f.weights

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Step 1: Add __testLipSync() diagnostic to concierge-controller.ts patch
  - Generates 5 Japanese vowel patterns (あいうえお) with known ARKit values
  - Creates silent WAV audio, queues frames to LAMAvatar, plays through ttsPlayer
  - Verifies whether renderer supports full 52-dim blendshapes

Step 3: Fix a2e_engine.py to use the proper LAM INFER pipeline
  - Restore LAM_Audio2Expression module (engines, models, utils, configs)
  - Rewrite _load_a2e_decoder → _try_load_infer_pipeline using INFER.build()
  - Use infer_streaming_audio() with context for chunked processing
  - Includes full postprocessing: smooth_mouth, frame_blending, savitzky_golay,
    symmetrize, eye_blinks
  - Falls back to Wav2Vec2 energy-based approximation when INFER unavailable
  - Add librosa, scipy, addict to requirements.txt
  - Add libsndfile to Dockerfile

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Three issues fixed during local testing:
1. transformers v5.x requires ignore_mismatched_sizes=True and
   attn_implementation="eager" for Wav2Vec2Model.from_pretrained()
2. HuggingFace checkpoint is double-wrapped (tar.gz containing
   pretrained_models/lam_audio2exp_streaming.tar) - auto-extract
3. Bare except in infer.py swallowed tracebacks and crashed on
   uninitialized output_dict - now logs actual error and recovers

Result: audio2exp-service starts with mode="infer" and produces
52-dim ARKit blendshapes from audio input.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Exclude downloaded model weights (wav2vec2, LAM checkpoint ~1.1GB)
from version control.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
Flask's app.run() auto-loads .env files, which crashes with
UnicodeDecodeError if a non-UTF-8 .env exists in the path.
Pass load_dotenv=False since env vars are set externally.

https://claude.ai/code/session_01RyVVZ8QGYAn4hoWN6YBteM
claude and others added 30 commits February 24, 2026 01:24
pucker reaching 1.0 (raw ~0.35 × 2.5x boost) caused FLAME LBS
numerical overflow (jaw=1.56e+23), destroying the face mesh mid-speech.

Changes:
- Add BLENDSHAPE_SAFE_MAX=0.7 clamp for all amplified channels
- Reduce pucker boost from 2.5→1.0 (raw already sufficient at ~0.35)
- Reduce funnel boost from 2.5→2.0 (prevent approaching limit)

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…en motion

A2E model outputs weak jawOpen (avg~0.05) but excessive mouthLowerDown
(raw~0.84), causing "lower lip pull" instead of natural jaw opening.

Changes:
- jawOpen: 0.85→1.0 (restore to let jaw drive mouth opening)
- mouthLowerDown: 0.75→0.35 (suppress dominant lip-pull artifact)
- mouthUpperUp: 0.85→0.5 (suppress similarly excessive upper lip)

This shifts visual motion from "lip pulling" to "jaw opening" which
is anatomically correct for speech.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
__testLipSync() diagnostic proved LAMAvatar.astro only passes jawOpen
and mouthLowerDown to the WebGL renderer - all other 50 blendshapes
(funnel, pucker, smile, stretch) are silently ignored.

- Reorganize MOUTH_AMPLIFY with clear sections: rendered vs pending-patch
- Adjust mouthLowerDown 0.35→0.5 (one of only 2 rendered values)
- Add LAMAVATAR_PATCH.md: documents the getExpressionData() fix needed
  in gourmet-sp repo to pass full 52-dim blendshape dict

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
The closed-source SDK (gaussian-splat-renderer-for-lam) only uses jawOpen
and mouthLowerDown for FLAME mesh deformation — all other 50 blendshapes
are silently ignored. LAMAvatar.astro was already returning full 52-dim
data; the bottleneck is the SDK, not our code.

Added remapForSdkLimitation() to synthesize composite jawOpen/lowerDown
from the full blendshape data:
- jawOpen += smile*0.5 + funnel*0.35 + pucker*0.2 + stretch*0.3
- lowerDown += smile*0.6 + stretch*0.4 - pucker*0.15
This encodes vowel-specific mouth shapes into the 2 available channels.
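As a scalar-per-frame sketch (channel extraction from the 52-dim vector omitted; this version was later removed once the SDK turned out to support all channels):

```python
def remap_for_sdk_limitation(jaw, lower, smile, funnel, pucker, stretch):
    """Fold the vowel-shaping channels into the 2 channels the SDK was
    then believed to render, using the composite weights quoted above."""
    jaw_out = jaw + smile * 0.5 + funnel * 0.35 + pucker * 0.2 + stretch * 0.3
    lower_out = lower + smile * 0.6 + stretch * 0.4 - pucker * 0.15
    return jaw_out, lower_out
```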

Also enhanced diagnostic logging (health check + TTS-Sync) to show
funnel, smile, pucker, stretch alongside jaw/mouth values.

Updated LAMAVATAR_PATCH.md to correct the initial hypothesis and
document the actual SDK limitation and composite workaround.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Two fixes for LAMAvatar.astro:

1. SDK feedback loop explosion (jaw=500 million):
   - Root cause: SDK writes FLAME LBS overflow values back to the
     returned expressionData object (passed by reference)
   - Fix: sanitizeExpressionData() at frame start breaks feedback loop
   - Fix: safeReturnExpression() returns shallow copy to prevent
     SDK from polluting internal state

2. Mouth over-opening (mouth=0.58 for natural speech):
   - Composite weights were too aggressive for real A2E output
   - Reduced: smile*0.5→0.3, funnel*0.35→0.2, stretch*0.3→0.2 (jaw)
   - Reduced: smile*0.6→0.35, stretch*0.4→0.25, pucker*0.15→0.1 (mouth)
   - Expected mouth range: 0.3-0.4 (vs 0.5-0.6 before)
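A Python sketch of the sanitize-plus-copy idea from fix 1 (the real fix is in LAMAvatar.astro; the 0.7 clamp is the BLENDSHAPE_SAFE_MAX from the earlier overflow commit):

```python
import math

BLENDSHAPE_SAFE_MAX = 0.7

def sanitize_expression(data):
    """Zero out non-finite entries and clamp values to a safe range,
    breaking the SDK write-back feedback loop. Returns a new dict
    (shallow copy) so the renderer cannot mutate internal state."""
    clean = {}
    for name, value in data.items():
        if not isinstance(value, (int, float)) or not math.isfinite(value):
            value = 0.0
        clean[name] = max(0.0, min(float(value), BLENDSHAPE_SAFE_MAX))
    return clean
```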

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…suppress

A2E model outputs weak jawOpen (avg~0.055) but excessive lowerDown (raw~0.87).
Previous 1.0x/0.5x amplify still resulted in mouth dominating jaw by 1.8-3.3x.

Changes:
- concierge-controller.ts: jawOpen 1.0→2.5, lowerDown 0.5→0.25
- LAMAvatar.astro remap: jaw composite weights increased (smile 0.3→0.35,
  funnel 0.2→0.25, stretch 0.2→0.25), mouth composite decreased
  (smile 0.35→0.2, stretch 0.25→0.15)

Expected: jaw > mouth in most frames (natural jaw-led lip movement)

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
jawOpen 2.5x caused frequent 0.7 clamp hits (mouth opening too wide).
1.5x keeps jaw > mouth ratio while staying in natural range (~0.3-0.55).
Composite jaw weights reverted to conservative: smile*0.3, funnel*0.2, stretch*0.2.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…ition fade-out bug

Issue 1: Mouth movement too large
- jawOpen amplify: 1.5→1.0 (no boost, let JAW_MAX cap handle peaks)
- Added JAW_MAX=0.30 and MOUTH_MAX=0.20 explicit caps in remapForSdkLimitation()
- Previous approach (amplify tuning) couldn't prevent peaks from exceeding 0.5+
- Cap approach: preserves natural dynamic range for moderate frames, clips peaks

Issue 2: Lip sync stopping during TTS chunk transitions
- Root cause: clearFrameBuffer() sets ttsActive=false, but old audio's ended=true
  remains → getExpressionData() triggers premature fade-out to neutral
- Fix: added ttsTransitioning flag set in clearFrameBuffer(), cleared in play handler
- Fade-out logic now checks !ttsTransitioning to avoid false triggers

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Original TEST_PROCEDURE.md assumed the rendering pipeline would use all 52
ARKit blendshapes. Live testing revealed the SDK only uses jawOpen +
mouthLowerDown (2 channels). A2E data quality for Japanese is sufficient;
the bottleneck is rendering.

Revised plan focuses on:
- Phase 0: SDK internal investigation (shader intercept, npm decompile)
- Phase 1B: WebGL shader patch to enable 52-channel rendering
- Phase 2: Alternative renderer (Three.js + custom FLAME) as fallback

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…-channel limitation findings

- Clarify WebGL 2.0 (not WebGPU), device GPU (not CPU-only)
- Add section 1.4: SDK 2-channel limitation details and verification history
- Add section 4.5: lip sync tuning rounds (4 iterations) and TTS bug fix
- Add section 7.2: SDK internal structure investigation results
- Update section 5.1: SDK 2-channel breakthrough as top priority
- Update section 8: Phase 0 shader intercept as next action
- Update section 9: commit history for tuning and investigation phases

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Monkey-patch WebGL2RenderingContext methods BEFORE SDK import to capture:
- shaderSource: all vertex/fragment shader source code
- createShader: shader type tracking
- transformFeedbackVaryings: TF program output variables (key for FLAME deformation)
- linkProgram: shader-to-program mapping
- texImage2D: expression texture candidates (52-dim related sizes)
- getUniformLocation/getAttribLocation: expression-related uniform/attribute names

Features:
- Auto-classifies shaders by expression/blendshape keywords
- Highlights Transform Feedback programs (where FLAME LBS happens)
- Flags potential expression textures (52-channel data)
- Full shader source dump for expression-related shaders
- window.__LAM_SHADER_INTERCEPT.analyze() for structured analysis in DevTools
- All data accessible via window.__LAM_SHADER_INTERCEPT global

Purpose: Identify which blendshape indices the SDK reads in its TF vertex
shader, confirming the 2-channel (jawOpen + mouthLowerDown) limitation
and evaluating feasibility of a 52-channel shader patch (Phase 1B).

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…boneTexture writes

Add hooks for uniform1i, uniform1f, and texSubImage2D to determine:
- bsCount runtime value (how many blendshapes the shader processes)
- gaussianSplatCount value
- boneTexture weight data at texel 20+ (where BS weights are packed)

This is critical for Phase 0 SDK investigation: the shader's blendshape
loop is generic (for i < bsCount), so the 2-channel limitation must be
in the JS code that sets bsCount and packs weights into boneTexture.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Full 26,292-char vertex shader (#2) captured via WebGL intercept.
Contains the critical blendshape loop: for(i < bsCount) that proves
the shader supports N channels, not hardcoded to 2.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…endshapes

SDK source analysis revealed the 2-channel limitation was a misdiagnosis:
- SDK's setExpression() copies ALL expression data to splatMesh.bsWeight
- morphTargetDictionary maps ALL blendshape names to boneTexture indices
- Shader's for(i < bsCount) loop applies ALL blendshapes from the model

Changes:
- Remove remapForSdkLimitation() calls (was degrading quality by
  compressing 52 channels into jaw/lowerDown)
- Add logSdkInternals() to verify morphTargetDictionary, bsCount,
  useFlame, gaussianSplatCount at runtime
- Update MOUTH_AMPLIFY to natural values (no more extreme suppression
  of lowerDown or extreme boost of smile)
- Remove JAW_MAX/MOUTH_MAX caps (no longer needed without remapping)

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…akthrough

- Correct 2-channel misdiagnosis: SDK actually supports all 52 blendshapes
- Document complete data flow: getExpressionData → updateBS → setExpression
  → bsWeight → boneTexture → shader for(i<bsCount) loop
- Update next actions: focus on quality evaluation and MOUTH_AMPLIFY tuning
- Mark Phase 0 SDK investigation as complete
- Add commit history for SDK analysis phase

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…pam)

Only log float uniforms once (headBoneIndex, splatScale) instead of
every frame. Removes visibleRegionFadeStartRadius and other per-frame
uniforms that cause console spam.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Phase 0 investigation confirmed bsCount=51 at runtime - SDK supports
all 52 ARKit blendshape channels. The shader intercept is no longer
needed and was causing console log spam (visibleRegionFadeStartRadius).

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
A2E outputs raw mouthLowerDown ~0.84, which even after SAFE_MAX clamp
(0.7) causes unnaturally wide mouth opening. jawOpen effective max is
only ~0.42 (0.28*1.5), creating an imbalance.

Reduce mouthLowerDown amplify from 1.0 to 0.5 so effective max becomes
~0.42, matching jawOpen for natural lip sync balance.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
jawOpen (amp 1.5x) spikes to 0.48-0.60 while mouthLowerDown adds
0.17-0.22 on top. Combined 0.6-0.7 creates unnatural wide mouth.
Reducing to 1.0x keeps raw max ~0.40, giving natural jaw+mouth sum.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
mouthLowerDown was dominating all phonemes, making every sound look
like "あ" (mouth open). Reduce lowerDown 0.5→0.3x and boost vowel
channels: funnel 1.5→2.5 (う/お), pucker 1.0→1.5 (う), smile 2.0→3.5
(い), stretch 1.5→2.0 (え). This increases the relative visibility
of vowel-specific mouth shapes.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
…lip sync

Two issues fixed:
1. A2E output has "dead zones" (values drop to 0.001) during speech,
   causing mumbling effect. Per-segment dynamic range compression
   boosts weak frames toward segment mean (40% blend).
2. Symmetric EMA allowed rapid mouth closing during brief zero frames.
   Asymmetric EMA: fast attack (α=0.82) for crisp openings, slow
   decay (α=0.45) to prevent jarring closures.
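The asymmetric EMA from fix 2 can be sketched per channel (the exact update form `y = α·x + (1−α)·y_prev` is an assumption about the implementation):

```python
def asymmetric_ema(values, attack=0.82, decay=0.45):
    """Asymmetric exponential smoothing for one mouth channel: fast
    attack when the value rises (crisp openings), slow decay when it
    falls (no snap-shut during brief zero frames)."""
    y = 0.0
    out = []
    for x in values:
        alpha = attack if x > y else decay
        y = alpha * x + (1.0 - alpha) * y
        out.append(y)
    return out
```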

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
The mouth movements were too small (mumbling) because jawOpen
and lowerDown were over-reduced. Now safe to increase them because
bidirectional dynamic range compression controls peaks while boosting
valleys. Changes:
- jawOpen: 1.0→2.0 (compression tames raw 0.40→0.70 peaks to ~0.35)
- lowerDown: 0.3→0.5 (compression brings raw 0.84 peaks to ~0.30)
- Compression: one-sided→bidirectional, factor 0.4→0.5
  Both peaks AND valleys compressed toward segment mean

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Per-channel compression (compressing each channel independently toward
its mean) was destroying vowel shape differences - "あ" and "い" ended
up looking the same. New approach: normalize total mouth energy per
frame while preserving relative channel proportions (= phoneme shape).

- Energy floor 0.35: weak frames get proportionally scaled up
- Energy ceiling 1.8: peak frames get proportionally scaled down
- Channel RATIOS preserved: jaw/smile/pucker proportions stay intact
- Result: consistent amplitude WITH phoneme differentiation
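A sketch of the per-frame energy normalization (treating a frame's mouth channels as a list and total energy as their sum, which is an assumption about the metric):

```python
def normalize_mouth_energy(frame, floor=0.35, ceiling=1.8):
    """Uniformly scale a frame's mouth channels so total energy stays in
    [floor, ceiling]. Because every channel is scaled by the same factor,
    channel RATIOS (the phoneme shape) are preserved."""
    energy = sum(frame)
    if energy <= 0.0:
        return list(frame)
    if energy < floor:
        scale = floor / energy
    elif energy > ceiling:
        scale = ceiling / energy
    else:
        scale = 1.0
    return [v * scale for v in frame]
```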

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Reduce exaggeration and snappiness for natural-looking speech:
- jawOpen: 2.0→1.5, lowerDown: 0.5→0.4 (less extreme opening)
- smile: 3.5→3.0, funnel: 2.5→2.0, stretch: 2.0→1.8 (softer shapes)
- Energy floor: 0.35→0.15 (allow natural quiet moments)
- Energy ceiling: 1.8→1.2 (reduce peak exaggeration)
- EMA attack: 0.82→0.60 (smoother transitions, less snapping)
- EMA decay: 0.45→0.50 (slightly more responsive closing)

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
The previous round (c1cb895) was too exaggerated/snappy; the one before (c03e410) was too mumbly.
Tuned to midpoint values between the two:
- jawOpen: 1.5→1.7, lowerDown: 0.4→0.45
- smile: 3.0→3.2, funnel: 2.0→2.2, stretch: 1.8→1.9
- energyFloor: 0.15→0.25, energyCeiling: 1.2→1.5
- attackAlpha: 0.60→0.70, decayAlpha: 0.50→0.48

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Even the midpoint values still looked mumbly → insufficient amplitude was the main cause.
The earlier "too snappy" problem turned out to be caused by EMA speed, not amplitude.
→ Raise amplitude to 90% of the exaggerated version while keeping EMA at 65%,
  aiming for clearly visible movement with natural transitions.

- jawOpen: 1.7→1.9, lowerDown: 0.45→0.48
- smile: 3.2→3.4, funnel: 2.2→2.4, stretch: 1.9→2.0
- energyFloor: 0.25→0.30, energyCeiling: 1.5→1.6
- attackAlpha: 0.70→0.75 (more conservative than the snappy version's 0.82)
- decayAlpha: 0.48→0.47

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Root cause: energyFloor is applied before the EMA, but with decayAlpha=0.47,
3-4 consecutive weak frames let the EMA drag the value back down,
canceling out the floor's effect.

Countermeasures:
- decayAlpha: 0.47→0.35 (core fix: slow mouth closing dramatically so the
  previous frame's mouth shape is held through A2E dead zones)
- jawOpen: 1.9→2.2 (beyond the exaggerated version, to compensate for A2E's low raw output)
- energyFloor: 0.30→0.45 (lift weak frames more aggressively)
- energyCeiling: 1.6→1.8, attackAlpha: 0.75→0.78
- smile: 3.4→3.5, funnel: 2.4→2.5, lowerDown: 0.48→0.50

Simulation after the decay reduction (jaw=0.2 → 4 weak frames):
old: 0.13→0.09→0.07→0.06 (mumbling)
new: 0.16→0.13→0.11→0.10 (mouth shape held)

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3
Previous problem: decayAlpha=0.35 (slow decay) fixed the dead-zone mumbling,
but the previous vowel's shape lingered too long, desynchronizing vowel transitions.

Fundamentally, decay speed alone cannot achieve both "no mumbling" and
"vowel sync" at the same time.

New approach: two-stage floor
1. Step 2.3: pre-EMA floor (0.45) — amplitude boost
2. Step 2.5: EMA smoothing — decay=0.50 tracks vowel transitions
3. Step 2.7: post-EMA floor (0.18) — rescues dead zones after the EMA
   → guarantees minimum energy while preserving channel ratios, so vowel shapes are intact

Restoring EMA decay=0.50 recovers vowel-switch timing, while the post-EMA
floor solves the "mouth closes completely after decay" problem.

https://claude.ai/code/session_01TUEGRBQaaga67AXVGbNUs3