feat: Add Grok realtime voice CLI #36

jonastemplestein · 2026-01-07T22:39:33Z

Add Grok realtime voice CLI for real-time voice conversations via XAI's WebSocket API. Users can speak into a microphone, receive spoken responses, and optionally type messages instead.

Features

GrokVoiceClient: WebSocket client for XAI realtime voice API with session management
AudioCapture: Microphone input via sox CLI (PCM 16-bit @ 24kHz)
AudioPlayback: Speaker output via sox CLI for low-latency playback
Voice modes: Voice mode for real conversations, text mode for typed input
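
For a sense of scale, the audio format above (PCM 16-bit @ 24kHz) implies a fixed byte rate. The sketch below works that out; the function name is hypothetical and mono audio is an assumption, since the description does not state the channel count.

```typescript
// Bytes of raw PCM audio for a given duration.
// Defaults mirror the PR description (16-bit samples at 24 kHz).
// ASSUMPTION: one channel (mono); the PR does not state the channel count.
function pcmByteLength(
  durationMs: number,
  sampleRateHz: number = 24_000,
  bytesPerSample: number = 2,
  channels: number = 1
): number {
  // Multiply before dividing so whole-millisecond inputs stay exact.
  const samples = Math.round((durationMs * sampleRateHz) / 1000)
  return samples * bytesPerSample * channels
}

const bytesPer100Ms = pcmByteLength(100) // 2400 samples * 2 bytes
const bytesPerSecond = pcmByteLength(1000) // 24000 samples * 2 bytes
```

Under these assumptions a 100 ms chunk is 4800 bytes and one second of audio is 48000 bytes.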

Usage

bun run mini-agent voice --voice ara — Voice mode with Ara voice
bun run mini-agent voice --text — Text mode (type messages)
bun run mini-agent voice --instructions "..." — Custom system instructions

Requires XAI_API_KEY env var and sox for voice mode.

🤖 Generated with Claude Code

Note

Adds realtime voice support and unifies text/voice under a single session API.

Voice module: GrokVoiceClient (WebSocket to XAI), AudioCapture/AudioPlayback via sox, and a new voice CLI command; wired into src/cli/commands.ts
Unified abstraction: new src/unified/ with domain.ts, makeUnifiedSession, HttpTransportLive (OpenAI-compatible chat), WsTransportLive (Grok voice), and a demo CLI that supports tools and writes YAML event logs
Deps: adds ws and @types/ws; lockfile updated
Docs/Artifacts: PLAN.md added and sample YAML session logs included

Written by Cursor Bugbot for commit ced4a65. This will update automatically on new commits.

Add voice command for real-time voice conversations with Grok AI via XAI's realtime WebSocket API. Supports voice mode (microphone/speaker via sox) and text mode for typing messages.

Key components:
- GrokVoiceClient: WebSocket client for XAI realtime voice API
- AudioCapture: Microphone input using sox CLI
- AudioPlayback: Speaker output using sox CLI
- Voice CLI command with --voice, --text, --instructions options

Requires XAI_API_KEY env var and sox for voice mode (brew install sox).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Bypass Schema encoding in favor of direct JSON construction for WebSocket messages to reduce complexity. Update session config to match XAI API requirements: pcm16 format, Whisper transcription, and enhanced VAD settings (threshold, padding, silence detection).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cursor · 2026-01-07T22:45:38Z

src/voice/audio-capture.ts

+          }
+
+          return Chunk.fromIterable(buffers)
+        }),


Audio chunking loses partial data between stream chunks

Medium Severity

The Stream.mapChunks callback creates a fresh accumulation buffer on each invocation (line 57), so no state is preserved between calls. When upstream audio data doesn't align with chunkSize boundaries, the partial data at the end of each upstream chunk is emitted immediately as an undersized buffer instead of being carried over. The next upstream chunk then starts accumulating from zero rather than continuing with the leftover bytes. As a result, inconsistently sized audio chunks are sent to the WebSocket API, which may cause audio quality issues or inefficient network usage. A stateful approach such as Stream.mapAccum is needed to accumulate correctly across chunk boundaries.
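
The stateful accumulation described above can be sketched in plain TypeScript, independent of Effect. This is not the PR's code: `FixedSizeChunker` is a hypothetical helper that shows the leftover bytes being carried from one call to the next, which is the state a Stream.mapAccum-style fix would have to thread through.

```typescript
// Hypothetical helper (not from the PR): re-chunks arbitrary input into
// fixed-size buffers, carrying leftover bytes across calls.
class FixedSizeChunker {
  private readonly chunkSize: number
  private leftover: Uint8Array = new Uint8Array(0)

  constructor(chunkSize: number) {
    if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
      throw new RangeError("chunkSize must be a positive integer")
    }
    this.chunkSize = chunkSize
  }

  // Returns every full chunk available after appending `input`.
  push(input: Uint8Array): Uint8Array[] {
    const combined = new Uint8Array(this.leftover.length + input.length)
    combined.set(this.leftover, 0)
    combined.set(input, this.leftover.length)

    const full: Uint8Array[] = []
    let offset = 0
    while (combined.length - offset >= this.chunkSize) {
      full.push(combined.slice(offset, offset + this.chunkSize))
      offset += this.chunkSize
    }
    // Keep the tail for the next call instead of emitting it undersized.
    this.leftover = combined.slice(offset)
    return full
  }

  // Call once at end of stream to drain the final partial chunk, if any.
  flush(): Uint8Array | undefined {
    if (this.leftover.length === 0) return undefined
    const rest = this.leftover
    this.leftover = new Uint8Array(0)
    return rest
  }
}
```

With a chunk size of 4, pushing 6 bytes yields one full chunk and holds 2 bytes back; pushing 3 more yields a second full chunk, and only `flush` emits the final undersized remainder.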

cursor · 2026-01-07T22:45:38Z

src/voice/client.ts

+          ws.on("error", (error) => {
+            Effect.runSync(Effect.logError(`WebSocket error: ${error.message}`))
+            resume(Effect.fail(error as Error))
+          })


WebSocket errors after connection silently ignored

Medium Severity

Only the first call to the Effect.async callback's resume function has any effect. When the WebSocket connects successfully, resume(Effect.void) is called on the "open" event (line 145). If a WebSocket error occurs after the connection is established, the "error" handler calls resume(Effect.fail(error)), but this does nothing because resume was already invoked. Errors during an active session (network failures, authentication issues, server errors) are therefore logged but never propagated to the caller, so streams silently stop working and the application has no chance to handle the failure.
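
One way to picture a fix is to treat connection errors and session errors as two separate channels. The sketch below uses plain TypeScript and a Promise, which like resume settles only once. `SocketLike`, `connect`, and `onSessionError` are hypothetical names for illustration, not the PR's API or the `ws` package's types.

```typescript
// Minimal event-source shape for this sketch (hypothetical, not the `ws` types).
interface SocketLike {
  on(event: "open" | "error", handler: (error?: Error) => void): void
}

interface Connection {
  // Settles exactly once: resolves on "open", rejects on an error before "open".
  opened: Promise<void>
  // Errors that arrive after the connection was established.
  sessionErrors: Error[]
  onSessionError(listener: (error: Error) => void): void
}

function connect(socket: SocketLike): Connection {
  let settled = false
  const sessionErrors: Error[] = []
  const listeners: Array<(error: Error) => void> = []

  const opened = new Promise<void>((resolve, reject) => {
    socket.on("open", () => {
      settled = true
      resolve()
    })
    socket.on("error", (error) => {
      const err = error ?? new Error("unknown WebSocket error")
      if (!settled) {
        // Connection phase: the one-shot result is still available.
        settled = true
        reject(err)
        return
      }
      // Session phase: the one-shot result is used up, so route the error
      // to a separate channel instead of dropping it.
      sessionErrors.push(err)
      for (const listener of listeners) listener(err)
    })
  })

  return {
    opened,
    sessionErrors,
    onSessionError: (listener) => {
      listeners.push(listener)
    }
  }
}
```

In an Effect codebase the second channel could instead be a failure pushed into the queues that back the output streams; the sketch only illustrates that the one-shot callback cannot carry it.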

src/voice/client.ts

cursor · 2026-01-07T22:45:38Z

src/voice/cli.ts

+    yield* connection.close
+  }).pipe(
+    Effect.provide(VoiceLayer),
+    Effect.catchAll((error) => Console.error(`Error: ${error instanceof Error ? error.message : String(error)}`))


Resources not cleaned up when errors occur

Medium Severity

The cleanup code (lines 125-129) that interrupts fibers and closes the player/connection only executes if the text or voice mode block completes without error. If runTextMode or the mic stream throws (e.g., sox crashes, WebSocket disconnects), the error is caught by Effect.catchAll but the cleanup code is skipped entirely. The forked fibers (audioPlaybackFiber, transcriptFiber, userTranscriptFiber), the player process, and the WebSocket connection will remain open, causing resource leaks. The cleanup logic needs to be wrapped in Effect.ensuring or similar to guarantee execution.
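
The guarantee that Effect.ensuring provides is the same one `try`/`finally` gives in plain TypeScript. As a minimal sketch, assuming synchronous finalizers for brevity, the hypothetical helper `withResources` below runs every registered finalizer whether the body returns or throws.

```typescript
type Finalizer = () => void

// Hypothetical helper (not from the PR): runs `body`, then every registered
// finalizer in reverse registration order, even if `body` throws.
// A finalizer that throws would skip the remaining ones; a fuller version
// would guard each call.
function withResources<A>(body: (register: (finalizer: Finalizer) => void) => A): A {
  const finalizers: Finalizer[] = []
  try {
    return body((finalizer) => {
      finalizers.push(finalizer)
    })
  } finally {
    for (const finalizer of finalizers.reverse()) finalizer()
  }
}

// Usage: the names below stand in for the connection, player, and fibers
// mentioned in the review; the error simulates a crash in voice mode.
const cleanupLog: string[] = []
try {
  withResources((register) => {
    register(() => cleanupLog.push("close connection"))
    register(() => cleanupLog.push("close player"))
    register(() => cleanupLog.push("interrupt fibers"))
    throw new Error("sox crashed")
  })
} catch {
  // The error still propagates to the caller, but only after cleanup has run.
}
```

Reverse order means the fibers stop before the player and connection they depend on are closed.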

cursor · 2026-01-07T22:45:38Z

src/voice/client.ts

+            Effect.runSync(Queue.shutdown(transcriptQueue))
+            Effect.runSync(Queue.shutdown(userTranscriptQueue))
+            Effect.runSync(Queue.shutdown(eventQueue))
+          })


readyQueue not shutdown causes infinite hang on early close

High Severity

When the WebSocket closes, the "close" handler shuts down audioQueue, transcriptQueue, userTranscriptQueue, and eventQueue, but not readyQueue. If the WebSocket connects but then closes before session.updated is received (e.g., authentication failure, server rejection, or network issue), waitForReady on line 206 (Queue.take(readyQueue)) will block forever. The CLI will hang indefinitely with no error message and no way to recover. The readyQueue needs to be shut down in the close handler.

Additional Locations (1)

src/voice/client.ts#L205-L206
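
The hang comes from a waiter that only a "ready" signal can release. A minimal sketch of the alternative, in plain TypeScript with hypothetical names (`ReadyGate`, `markReady`, `markClosed`), is a gate that settles exactly once with either outcome, so an early close wakes the waiter with a failure.

```typescript
type ReadyResult = { ok: true } | { ok: false; reason: string }

// Hypothetical helper (not from the PR): a one-shot gate that settles with
// "ready" or "closed", whichever comes first.
class ReadyGate {
  private result: ReadyResult | undefined = undefined
  private waiters: Array<(result: ReadyResult) => void> = []

  // Calls back immediately if already settled, otherwise once settled.
  wait(callback: (result: ReadyResult) => void): void {
    if (this.result !== undefined) {
      callback(this.result)
      return
    }
    this.waiters.push(callback)
  }

  // Call when the session is confirmed (session.updated in the review).
  markReady(): void {
    this.settle({ ok: true })
  }

  // Call from the close handler so pending waiters are released.
  markClosed(reason: string): void {
    this.settle({ ok: false, reason })
  }

  private settle(result: ReadyResult): void {
    if (this.result !== undefined) return // first outcome wins
    this.result = result
    const pending = this.waiters
    this.waiters = []
    for (const waiter of pending) waiter(result)
  }
}
```

Shutting down readyQueue in the close handler, as the review suggests, would have a comparable effect on a blocked Queue.take: the waiter is released instead of hanging.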

Introduces a transport-agnostic conversation interface that works with both:
- HTTP-based chat completions (OpenAI-compatible APIs)
- WebSocket-based voice APIs (Grok realtime)

Key components:
- domain.ts: Core types (ConversationEvent union, LlmTransport service)
- http-transport.ts: Wraps OpenAiChatClient as stateless transport
- ws-transport.ts: Wraps GrokVoiceClient as stateful transport
- demo.ts: CLI demo with YAML event logging on exit

The unified session provides consistent API (sendText, sendAudio, events stream) regardless of transport. Demo supports multiple providers (openrouter, xai, groq, etc).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
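
The idea of one session API over interchangeable transports can be sketched as follows. Only the names sendText, sendAudio, and ConversationEvent come from the commit message. The event fields, `SketchTransport`, `makeSketchSession`, and the echo transport are assumptions made for illustration and are not the contents of src/unified/.

```typescript
// ASSUMPTION: hypothetical event shapes; the real union in domain.ts may differ.
type ConversationEvent =
  | { kind: "text"; role: "user" | "assistant"; text: string }
  | { kind: "audio"; role: "user" | "assistant"; chunk: Uint8Array }

// A transport turns one outgoing event into zero or more incoming events.
interface SketchTransport {
  send(event: ConversationEvent, emit: (reply: ConversationEvent) => void): void
}

interface SketchSession {
  sendText(text: string): void
  sendAudio(chunk: Uint8Array): void
  onEvent(listener: (event: ConversationEvent) => void): void
}

function makeSketchSession(transport: SketchTransport): SketchSession {
  const listeners: Array<(event: ConversationEvent) => void> = []
  const publish = (event: ConversationEvent): void => {
    for (const listener of listeners) listener(event)
  }
  const dispatch = (event: ConversationEvent): void => {
    publish(event) // callers see their own input in the event stream
    transport.send(event, publish) // followed by whatever the transport answers
  }
  return {
    sendText: (text) => dispatch({ kind: "text", role: "user", text }),
    sendAudio: (chunk) => dispatch({ kind: "audio", role: "user", chunk }),
    onEvent: (listener) => {
      listeners.push(listener)
    }
  }
}

// Stand-in transport that echoes text; a real one would call HTTP or a WebSocket.
const echoTransport: SketchTransport = {
  send: (event, emit) => {
    if (event.kind === "text") {
      emit({ kind: "text", role: "assistant", text: `echo: ${event.text}` })
    }
  }
}
```

Because the session only depends on the `SketchTransport` shape, swapping the echo transport for an HTTP-backed or WebSocket-backed one would not change the calling code.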

cursor · 2026-01-08T10:31:10Z

src/unified/demo.ts

+
+process.on("exit", () => {
+  writeEventLog()
+})


Event log written twice on SIGINT exit

Low Severity

The SIGINT handler calls writeEventLog() and then process.exit(0). The 'exit' handler also calls writeEventLog(). When the user presses Ctrl+C, both handlers execute, creating two YAML log files with slightly different timestamps. The log is written once explicitly and once via the exit handler triggered by process.exit(0).
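
A common remedy for this kind of double invocation is an idempotent wrapper. The sketch below is a generic `once` helper, not code from the PR; the `writeEventLog` stand-in only counts calls.

```typescript
// Wraps a function so that only the first call runs it.
function once(fn: () => void): () => void {
  let called = false
  return () => {
    if (called) return
    called = true
    fn()
  }
}

// Stand-in for the demo's writeEventLog; it only counts invocations here.
let writes = 0
const writeEventLogOnce = once(() => {
  writes += 1
})

// Simulates Ctrl+C: the SIGINT handler writes, then process.exit(0)
// triggers the "exit" handler, which writes again.
writeEventLogOnce()
writeEventLogOnce()
```

Registering the wrapped function in both the SIGINT and exit handlers would leave a single log file per run. Dropping the explicit call from the SIGINT handler and relying on the exit handler alone is another option.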

cursor · 2026-01-08T10:31:10Z

src/unified/demo.ts

+        return player.write(event.chunk)
+      }
+      return handleEvent(event)
+    }),


AudioDelta events not logged in voice mode

Low Severity

In voiceDemo, the Stream.tap callback returns player.write(event.chunk) for AudioDelta events without calling handleEvent. Because logEvent is only called inside handleEvent, audio delta events are never written to the YAML event file. All other event types are logged via handleEvent, but audio playback events are silently omitted from the log.
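
One way to avoid the gap is to log before branching, so no early return can bypass it. The sketch below is plain TypeScript with hypothetical names: `DemoEvent`, the `_tag` discriminator, `TextDelta`, and the `Handlers` interface are assumptions, and here `handle` is assumed not to log on its own.

```typescript
// ASSUMPTION: hypothetical event union; only AudioDelta and `chunk` come from the review.
type DemoEvent =
  | { _tag: "AudioDelta"; chunk: Uint8Array }
  | { _tag: "TextDelta"; text: string }

interface Handlers {
  log(entry: string): void
  play(chunk: Uint8Array): void
  handle(event: DemoEvent): void // assumed not to log by itself in this sketch
}

function makeTap(handlers: Handlers): (event: DemoEvent) => void {
  return (event) => {
    // Log first, so the audio branch below cannot skip it.
    // Audio is summarised by size to keep raw bytes out of the log.
    const entry =
      event._tag === "AudioDelta" ? `AudioDelta(${event.chunk.length} bytes)` : event._tag
    handlers.log(entry)

    if (event._tag === "AudioDelta") {
      handlers.play(event.chunk)
      return
    }
    handlers.handle(event)
  }
}
```

Logging a size summary rather than the chunk itself is a design choice of this sketch; whether raw audio belongs in the YAML log is for the author to decide.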

src/voice/cli.ts

cursor · 2026-01-08T10:31:10Z

src/voice/cli.ts

+  AudioCapture.Default,
+  AudioPlayback.Default,
+  BunCommandExecutor.layer
+)


Missing BunFileSystem layer dependency for command executor

Medium Severity

The VoiceLayer uses BunCommandExecutor.layer directly without providing BunFileSystem.layer as a dependency. In contrast, demo.ts correctly composes these as BunCommandExecutor.layer.pipe(Layer.provide(BunFileSystem.layer)). The checkSoxAvailable function uses Command.string() which requires the command executor. If BunCommandExecutor depends on BunFileSystem, this layer composition would fail at runtime when the voice command is executed.

- Add ToolDefinition and ToolHandler types
- Track pendingToolCalls in ConversationContext
- HTTP transport: convert tools to OpenAI format, stream tool call deltas
- Voice client: support tools in session config, handle function_call events
- Add sendToolResult method to voice connection

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

cursor · 2026-01-20T15:08:52Z

unified-demo-http-2026-01-08T13-19-09-396Z.yaml

+      usage:
+        prompt_tokens: 24
+        completion_tokens: 13
+        total_tokens: 37


Test output YAML files accidentally committed

Low Severity

Three YAML files containing test/debug output from running the demo CLI were accidentally committed. These files are generated by src/unified/demo.ts, which writes event logs to process.cwd() on exit. They contain actual API responses with timestamps; a matching pattern (e.g., unified-demo-*.yaml) should be added to .gitignore to prevent future accidental commits.

Additional Locations (2)

unified-demo-voice-2026-01-08T13-19-50-260Z.yaml#L1-L784

unified-demo-voice-2026-01-08T13-21-13-135Z.yaml#L1-L244

jonastemplestein and others added 2 commits January 7, 2026 22:04

cursor bot reviewed Jan 7, 2026


cursor bot reviewed Jan 8, 2026


jonastemplestein and others added 3 commits January 8, 2026 13:04

WIP

ffedc64

woah

adf61f4

cursor bot reviewed Jan 20, 2026



2 participants
