# Voice Integration

> AI voice agents via Twilio and OpenAI Realtime API

Struere integrates with Twilio for telephony and OpenAI's Realtime API for voice-to-voice AI conversations. Voice agents handle inbound and outbound phone calls with real-time speech recognition, natural voice synthesis, and optional auditor-based validation.

## Architecture

```
Phone Call (inbound or outbound)
    |
    v
Twilio Media Streams (WebSocket, g711_ulaw)
    |
    v
Voice Gateway (Fly.io, Hono + Bun)
    |
    v
OpenAI Realtime API (voice-to-voice, WebSocket)
    |
    v                       (optional)
Voice Agent responds  <---  Auditor Agent polls /v1/chat
                            validates data, injects corrections
```

The voice gateway is a standalone service (`platform/voice-gateway/`) that bridges Twilio's Media Streams with OpenAI's Realtime API. Audio flows as raw g711_ulaw between Twilio and OpenAI with no transcoding.
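The pass-through can be pictured as two message translations. The sketch below is not the gateway's code: the function and type names are hypothetical, and it assumes Twilio's Media Streams `media` message shape and the OpenAI Realtime `input_audio_buffer.append` and audio delta events. Because both sides carry base64-encoded g711_ulaw, the payload is forwarded untouched.

```typescript
// Hypothetical helpers; names are illustrative, not taken from platform/voice-gateway.

type TwilioMediaEvent = {
  event: "media"
  streamSid: string
  media: { payload: string } // base64-encoded g711_ulaw audio
}

type RealtimeAppendEvent = {
  type: "input_audio_buffer.append"
  audio: string // base64-encoded audio in the session's input format
}

type RealtimeAudioDelta = {
  type: string // e.g. "response.audio.delta"; the exact name depends on the API version
  delta: string // base64-encoded audio chunk
}

// Caller audio: Twilio -> OpenAI. The payload is forwarded as-is (no transcoding).
function twilioToRealtime(msg: TwilioMediaEvent): RealtimeAppendEvent {
  return { type: "input_audio_buffer.append", audio: msg.media.payload }
}

// Agent audio: OpenAI -> Twilio. The stream SID tells Twilio which call to play into.
function realtimeToTwilio(evt: RealtimeAudioDelta, streamSid: string): TwilioMediaEvent {
  return { event: "media", streamSid, media: { payload: evt.delta } }
}
```

A real bridge would also handle Twilio's `start` and `stop` messages and configure the Realtime session's audio format to g711_ulaw; those steps are omitted here.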

## Dual-Agent Architecture

Voice calls support two modes:

### Single Agent

A voice agent handles the call directly. The agent's system prompt and tools are loaded from its config and sent to the OpenAI Realtime session.

### Dual Agent (Voice + Auditor)

A voice agent handles the conversation while an auditor agent runs in the background:

- **Voice agent** -- The OpenAI Realtime session that speaks with the caller. Follows a script, asks questions, collects information.
- **Auditor agent** -- A standard Struere text agent that reviews the call transcript at a fixed interval. The voice gateway sends it each transcript delta via `/v1/chat`. It validates collected data, fills entities, and can inject corrections back into the voice call.

The polling interval is configurable via `pollInterval` (default 5000 ms). When the auditor calls `voice.inject`, the voice agent speaks the correction in its own voice.
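One way to picture the periodic polling is a cursor over the transcript, so each poll carries only the turns the auditor has not yet seen. This is a minimal sketch under that assumption; `TranscriptCursor`, `TranscriptTurn`, and `formatDelta` are hypothetical names, and the real poller may track state differently.

```typescript
// Hypothetical sketch; the real poller lives in the voice gateway and may differ.

type TranscriptTurn = { role: "caller" | "agent"; text: string }

class TranscriptCursor {
  private sent = 0 // number of turns already delivered to the auditor

  // Returns only the turns added since the previous call, then advances the cursor.
  nextDelta(transcript: readonly TranscriptTurn[]): TranscriptTurn[] {
    const delta = transcript.slice(this.sent)
    this.sent = transcript.length
    return delta
  }
}

// Formats a delta as the plain-text message an auditor might receive.
function formatDelta(delta: readonly TranscriptTurn[]): string {
  return delta.map((turn) => `${turn.role}: ${turn.text}`).join("\n")
}
```

An empty delta is a natural signal to skip a poll, which avoids spending auditor tokens while nobody is speaking.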

## Database Tables

### voiceConnections

Stores the connection between an organization and a Twilio phone number. Scoped by environment.

| Field | Type | Description |
|-------|------|-------------|
| `organizationId` | `Id<"organizations">` | The owning organization |
| `environment` | `"development" \| "production" \| "eval"` | Environment scope |
| `status` | `"disconnected" \| "connected" \| "removed"` | Current connection state |
| `label` | `string?` | Display label for the connection |
| `twilioAccountSid` | `string` | Twilio Account SID |
| `twilioPhoneNumber` | `string` | The Twilio phone number |
| `phoneNumberSid` | `string?` | Twilio Phone Number SID |
| `agentId` | `Id<"agents">?` | Agent assigned to handle inbound calls |
| `routerId` | `Id<"routers">?` | Router assigned to handle inbound calls |

## Setup

### Quick start: bind a phone to a single agent

This is the simplest path: one agent answers every inbound call on the phone number. Use it when you don't need multi-agent routing or custom voice settings.

**Step 1. Configure Twilio credentials.**

```bash
bunx struere integration twilio --account-sid <sid> --auth-token <token>
```

**Step 2. Connect a phone number to your agent.**

```bash
bunx struere integration twilio \
  --phone-number +1XXXXXXXXXX \
  --agent <agent-slug>
```

This creates a `voiceConnections` row binding the phone number directly to the agent. Inbound calls reach the agent through OpenAI Realtime with the platform defaults: `provider: "openai-realtime"`, `voice: "alloy"`, single-agent mode (no auditor), `pollInterval: 5000`. The model, turn detection, and noise reduction fall back to OpenAI Realtime built-in defaults.

To override voice/model defaults or run multiple agents on the same number, see "Advanced: routing between agents" below.

### Advanced: routing between agents

Use a router when you need either of:

- Multiple agents on the same phone number (e.g. an intake agent that hands off to a support agent)
- Custom `voiceConfig` -- voice, model, auditor, turn detection, or noise reduction

Voice configuration lives on the router definition. **When you provide `voiceConfig`, `auditorAgent` is required at runtime**, even though the SDK type marks it optional. For single-agent mode, omit `voiceConfig` entirely rather than passing an empty `auditorAgent`.

```typescript
import { defineRouter } from 'struere'

export default defineRouter({
  name: "Phone Support",
  slug: "phone-support",
  mode: "classify",
  agents: [
    { slug: "intake-agent", description: "Handles new caller intake and data collection" },
    { slug: "support-agent", description: "Handles technical support questions" },
  ],
  classifyModel: { model: "openai/gpt-5-mini" },
  fallback: "intake-agent",
  voiceConfig: {
    provider: "openai-realtime",
    model: "gpt-realtime-mini",
    voice: "coral",
    auditorAgent: "form-auditor",
    pollInterval: 5000,
    turnDetection: {
      type: "semantic_vad",
      eagerness: "medium",
    },
    noiseReduction: "near_field",
  },
})
```

Bind the router to the phone number:

```bash
bunx struere integration twilio \
  --phone-number +1XXXXXXXXXX \
  --router <router-slug>
```

### Disconnect a phone number

Remove a single phone number from your voice setup without touching Twilio credentials:

```bash
bunx struere integration twilio --remove-phone +1XXXXXXXXXX
```

This soft-deletes the voice connection. You can reconnect the same number later with `--phone-number` + `--agent` (or `--router`).

To remove the entire Twilio integration (credentials AND all phone connections for the current environment), use:

```bash
bunx struere integration twilio --remove
```

## Footguns

Behaviors that aren't obvious from the type signatures.

### Symptom: Caller hears a confused vanilla model
**Cause:** `voice.call` was invoked without `agentSlug` -- the LLM didn't pass it because the system prompt didn't reference it. The voice gateway falls back to "You are a helpful voice assistant." with no application context.

**Fix:** Ensure the agent's system prompt instructs the LLM to pass `agentSlug` literally when invoking `voice.call`. As of CLI v0.14.8, `sync` blocks this case at validation time.

### Symptom: Sync fails with `voiceConfig.auditorAgent references unknown agent: undefined`
**Cause:** The SDK type marks `auditorAgent` as optional, but the runtime in `platform/voice-gateway/src/auditor/poller.ts` starts the poller whenever `voiceConfig` is provided, and crashes when the slug is undefined.

**Fix:** Set `auditorAgent` explicitly. For single-agent setups it can self-reference the same agent (e.g. `auditorAgent: 'voice-suplente'`). Or omit `voiceConfig` entirely to use platform defaults.

### Symptom: Voice agent doesn't know the match/customer/order it's calling about
**Cause:** Voice sessions don't inherit context from the agent that invoked `voice.call`. Each `voice.call` spawns an isolated thread (`platform/voice-gateway/src/ws/media-stream.ts:221`).

**Fix:** Thread context explicitly in the orchestrator's message -- pass IDs and key fields literally in the `agent.chat` message that triggers the voice agent.
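A small helper can make the literal hand-off harder to forget. The sketch below is illustrative only: `buildVoiceHandoffMessage` and the example field names are hypothetical, not part of the Struere SDK.

```typescript
// Hypothetical helper: the field names passed in are examples, not a platform schema.

type CallContext = Record<string, string | number>

// Builds a message that states every ID and key field literally,
// so the voice agent does not depend on context it never received.
function buildVoiceHandoffMessage(instruction: string, context: CallContext): string {
  const lines = Object.entries(context).map(([key, value]) => `- ${key}: ${value}`)
  return [instruction, "Context:", ...lines].join("\n")
}
```

For example, `buildVoiceHandoffMessage("Call the customer about their order.", { orderId: "ord_123", customerName: "Ana" })` yields the instruction followed by a `Context:` list with one line per field.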

### Symptom: Inbound call still routes to the old agent after you renamed/deleted it
**Cause:** `voiceConnections` stores `agentId`/`routerId`, and `/v1/voice/config` doesn't validate that the referenced agent still exists. When it doesn't, the call silently falls back to the vanilla model.

**Fix:** When renaming or deleting agents, reassign the voice connection via `bunx struere integration twilio --phone-number ... --agent <new-slug>` or via the dashboard.

### Symptom: Voice connection stuck in `pending` status
**Cause:** Race in `media-stream.ts:269` where status is set after thread creation; if thread creation hangs, status never advances.

**Fix:** Hang up and retry. If persistent, clear with `--remove-phone` and reconnect.

### Symptom: Auditor injects corrections too aggressively, agent gets repeatedly cut off
**Cause:** `pollInterval` set very low.

**Fix:** Keep `pollInterval >= 5000`. Values below 3000ms can race and produce overlapping injections.
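The two thresholds above can be captured in a pre-deploy check. This is a sketch, not a platform feature: `checkPollInterval` is a hypothetical function, and treating values under 3000 ms as an error rather than a warning is a choice made for this example.

```typescript
// Thresholds taken from the guidance above; the function itself is hypothetical.
const RECOMMENDED_MIN_MS = 5000
const RACE_RISK_BELOW_MS = 3000

type PollIntervalCheck = { level: "ok" | "warn" | "error"; message: string }

function checkPollInterval(pollInterval: number): PollIntervalCheck {
  if (!Number.isFinite(pollInterval) || pollInterval <= 0) {
    return { level: "error", message: "pollInterval must be a positive number of milliseconds" }
  }
  if (pollInterval < RACE_RISK_BELOW_MS) {
    return { level: "error", message: `pollInterval ${pollInterval}ms can produce overlapping injections` }
  }
  if (pollInterval < RECOMMENDED_MIN_MS) {
    return { level: "warn", message: `pollInterval ${pollInterval}ms is below the recommended ${RECOMMENDED_MIN_MS}ms` }
  }
  return { level: "ok", message: "pollInterval is within the recommended range" }
}
```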

See [Platform Gotchas](/platform/gotchas) for cross-cutting silent failures across the platform.

## Voice Configuration Reference

> Voice configuration is set on a router via `voiceConfig`. When a phone is bound directly to an agent (no router), the platform uses these defaults: `provider: "openai-realtime"`, `voice: "alloy"`, single-agent mode (no auditor), `pollInterval: 5000`. Override these by switching to a router.

The `voiceConfig` object on a router controls how voice calls are handled.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `provider` | `string` | `"openai-realtime"` | Voice provider. Currently only `"openai-realtime"` is supported. |
| `model` | `string` | `"gpt-realtime-mini"` | OpenAI Realtime model. Options: `"gpt-realtime-mini"`, `"gpt-realtime-1.5"` |
| `voice` | `string` | `"alloy"` | Voice for speech synthesis |
| `auditorAgent` | `string?` | -- | Slug of the auditor agent for dual-agent mode. Typed as optional, but required at runtime whenever `voiceConfig` is set. |
| `pollInterval` | `number` | `5000` | Auditor polling interval in milliseconds |
| `turnDetection` | `object` | `semantic_vad, medium` | How the model detects when the user has finished speaking |
| `noiseReduction` | `string?` | -- | Noise reduction mode: `"near_field"` or `"far_field"` |
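To make the defaults column concrete, the sketch below merges a partial `voiceConfig` over the documented defaults. The `VoiceConfig` type and `resolveVoiceConfig` function are hypothetical, written only for illustration; the platform's own merge logic is not shown in this document and may differ (for example, it may merge `turnDetection` field by field).

```typescript
// Hypothetical type mirroring the reference table; not exported by the SDK.
type VoiceConfig = {
  provider: string
  model: string
  voice: string
  auditorAgent?: string
  pollInterval: number
  turnDetection: { type: string; eagerness?: string }
  noiseReduction?: "near_field" | "far_field"
}

// Defaults copied from the reference table above.
const VOICE_DEFAULTS: VoiceConfig = {
  provider: "openai-realtime",
  model: "gpt-realtime-mini",
  voice: "alloy",
  pollInterval: 5000,
  turnDetection: { type: "semantic_vad", eagerness: "medium" },
}

// Shallow merge: any field set on the router replaces the default wholesale.
function resolveVoiceConfig(overrides: Partial<VoiceConfig>): VoiceConfig {
  return { ...VOICE_DEFAULTS, ...overrides }
}
```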

### Available Voices

| Voice |
|-------|
| `alloy` |
| `ash` |
| `ballad` |
| `coral` |
| `echo` |
| `sage` |
| `shimmer` |
| `verse` |
| `marin` |
| `cedar` |

### Turn Detection

Turn detection determines when the model considers the user's turn to be complete.

**Semantic VAD** (recommended) -- Uses semantic understanding to detect turn boundaries:

```typescript
turnDetection: {
  type: "semantic_vad",
  eagerness: "medium",
}
```

| Field | Type | Options | Description |
|-------|------|---------|-------------|
| `eagerness` | `string` | `"low"`, `"medium"`, `"high"`, `"auto"` | How eagerly the model responds. Lower values wait longer for the user to finish. |

**Server VAD** -- Traditional voice activity detection based on audio levels:

```typescript
turnDetection: {
  type: "server_vad",
  threshold: 0.5,
  silenceDurationMs: 500,
  prefixPaddingMs: 300,
}
```

| Field | Type | Description |
|-------|------|-------------|
| `threshold` | `number` | Audio level threshold (0.0 to 1.0) |
| `silenceDurationMs` | `number` | Milliseconds of silence before turn ends |
| `prefixPaddingMs` | `number` | Milliseconds of audio to include before detected speech |
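The OpenAI Realtime API names these settings in snake_case (`silence_duration_ms`, `prefix_padding_ms`), so the camelCase fields above need translating at some point. The sketch below shows one plausible mapping; `toRealtimeTurnDetection` is a hypothetical name, and whether the gateway performs the conversion this way is an assumption.

```typescript
type SemanticVad = { type: "semantic_vad"; eagerness?: "low" | "medium" | "high" | "auto" }
type ServerVad = {
  type: "server_vad"
  threshold?: number
  silenceDurationMs?: number
  prefixPaddingMs?: number
}
type TurnDetection = SemanticVad | ServerVad

// Converts the camelCase config into the snake_case object the Realtime API uses.
function toRealtimeTurnDetection(cfg: TurnDetection): Record<string, string | number> {
  if (cfg.type === "semantic_vad") {
    // "medium" mirrors the default listed in the configuration reference.
    return { type: "semantic_vad", eagerness: cfg.eagerness ?? "medium" }
  }
  if (cfg.threshold !== undefined && (cfg.threshold < 0 || cfg.threshold > 1)) {
    throw new RangeError("threshold must be between 0.0 and 1.0")
  }
  // Only forward fields that were set, so the API's own defaults apply to the rest.
  const out: Record<string, string | number> = { type: "server_vad" }
  if (cfg.threshold !== undefined) out.threshold = cfg.threshold
  if (cfg.silenceDurationMs !== undefined) out.silence_duration_ms = cfg.silenceDurationMs
  if (cfg.prefixPaddingMs !== undefined) out.prefix_padding_ms = cfg.prefixPaddingMs
  return out
}
```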

## Voice Tools

### voice.call

Initiates an outbound voice call to a phone number. The call connects through Twilio and starts an OpenAI Realtime session with the configured voice settings.

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `phoneNumber` | `string` | Yes | Phone number to call (E.164 format) |
| `routerSlug` | `string` | No | Router slug to use for voice config and agent routing |
| `agentSlug` | `string` | No | Agent slug to handle the call directly |
| `entityType` | `string` | No | Entity type slug for form-filling scenarios |
| `entityId` | `string` | No | Existing entity ID to update during the call |
| `metadata` | `object` | No | Extra context passed to the voice agent |

**Returns:**

```typescript
{
  callSid: string
  status: "initiated"
}
```

**Example:**

```typescript
import { defineAgent } from 'struere'

export default defineAgent({
  name: "Outreach Agent",
  slug: "outreach",
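  // 'lead-followup-voice' is a placeholder; use the slug of the agent that should handle the call.
  systemPrompt: "You schedule follow-up calls with leads. When you invoke voice.call, always pass agentSlug: 'lead-followup-voice'.",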
  model: { model: "openai/gpt-5-mini" },
  tools: ["entity.query", "voice.call"],
})
```
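Arguments can be sanity-checked before the tool is invoked. The sketch below mirrors the parameter table; `VoiceCallArgs` and `validateVoiceCallArgs` are hypothetical names, and the rule that either `agentSlug` or `routerSlug` should be present is an assumption drawn from the fallback behavior described under Footguns.

```typescript
// Hypothetical type mirroring the parameter table above.
type VoiceCallArgs = {
  phoneNumber: string
  routerSlug?: string
  agentSlug?: string
  entityType?: string
  entityId?: string
  metadata?: Record<string, unknown>
}

// E.164: a leading "+", a non-zero first digit, at most 15 digits in total.
const E164 = /^\+[1-9]\d{1,14}$/

// Returns a list of problems; an empty list means the arguments look usable.
function validateVoiceCallArgs(args: VoiceCallArgs): string[] {
  const problems: string[] = []
  if (!E164.test(args.phoneNumber)) {
    problems.push("phoneNumber must be in E.164 format, e.g. +14155550123")
  }
  // Assumption: a router or an agent must be named to avoid the generic fallback prompt.
  if (!args.agentSlug && !args.routerSlug) {
    problems.push("pass agentSlug or routerSlug so the call does not fall back to the generic assistant")
  }
  return problems
}
```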

### voice.inject

Injects a message into an active voice call. This tool is designed for auditor agents -- when called, the message is spoken by the voice agent in its own voice during the call.

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `message` | `string` | Yes | Message to inject into the voice call |
| `action` | `string` | Yes | Action type: `"correction"`, `"complete"`, or `"abort"` |

**Actions:**

| Action | Behavior |
|--------|----------|
| `correction` | The voice agent speaks the correction message to the caller |
| `complete` | Signals the form/process is complete |
| `abort` | Signals the call should end |

**Example auditor agent:**

```typescript
import { defineAgent } from 'struere'

export default defineAgent({
  name: "Form Auditor",
  slug: "form-auditor",
  systemPrompt: `You validate data collected during voice calls.
When you detect incorrect or missing data, use voice.inject to correct the caller.
When all required fields are filled, use voice.inject with action "complete".`,
  model: { model: "openai/gpt-5-mini" },
  tools: ["entity.query", "entity.update", "voice.inject"],
})
```

## Auditor Correction Flow

When using dual-agent mode, the auditor correction flow works as follows:

```
Voice Gateway polls /v1/chat every {pollInterval}ms with transcript delta
    |
    v
Auditor agent processes transcript, validates data
    |
    v
Auditor calls voice.inject (tool result stored in _executionMeta.toolCallSummary)
    |
    v
Voice Gateway reads tool result from response
    |
    v
Correction injected into OpenAI Realtime session
    |
    v
Voice agent speaks correction in its own voice
```
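The middle steps of this flow can be sketched as two functions: one that pulls `voice.inject` calls out of the auditor's response, and one that turns a correction into Realtime client events. The field layout of `_executionMeta.toolCallSummary` is not specified in this document, so the `ToolCallEntry` shape below is an assumption, as are the function names. The `conversation.item.create` and `response.create` events are standard OpenAI Realtime client events, but using a system message to carry the correction is only one possible approach.

```typescript
// Hypothetical shapes: the document names _executionMeta.toolCallSummary but not its fields.
type InjectAction = "correction" | "complete" | "abort"
type ToolCallEntry = { tool: string; args?: { message?: unknown; action?: unknown } }
type ChatResponse = { _executionMeta?: { toolCallSummary?: ToolCallEntry[] } }
type Injection = { message: string; action: InjectAction }

// Finds voice.inject calls in an auditor response and keeps the well-formed ones.
function extractInjections(res: ChatResponse): Injection[] {
  const actions: readonly string[] = ["correction", "complete", "abort"]
  const out: Injection[] = []
  for (const call of res._executionMeta?.toolCallSummary ?? []) {
    if (call.tool !== "voice.inject" || !call.args) continue
    const { message, action } = call.args
    if (typeof message === "string" && typeof action === "string" && actions.includes(action)) {
      out.push({ message, action: action as InjectAction })
    }
  }
  return out
}

// Builds Realtime client events asking the voice agent to speak a correction.
function correctionEvents(injection: Injection): object[] {
  return [
    {
      type: "conversation.item.create",
      item: {
        type: "message",
        role: "system",
        content: [{ type: "input_text", text: injection.message }],
      },
    },
    { type: "response.create" }, // prompts the model to respond, in its own voice
  ]
}
```

Only the `correction` action is translated here; how `complete` and `abort` end or wrap up a call is not covered by this sketch.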

## Thread Data

Voice calls create threads with `channel: "voice"`. Thread metadata includes:

| Field | Value |
|-------|-------|
| `channel` | `"voice"` |
| `channelStatus` | `"pending"`, `"active"`, `"stopped"`, `"completed"`, or `"failed"` |
| `channelParams.callerNumber` | Caller's phone number |

System prompt access:

```
Channel: {{threadContext.channel}}
Caller: {{threadContext.params.callerNumber}}
```
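To illustrate how such placeholders resolve, here is a minimal dotted-path renderer. It is a hypothetical stand-in, not the platform's template engine, whose syntax rules and handling of missing values may differ.

```typescript
// Hypothetical renderer; the platform's real template engine may behave differently.

// Replaces {{dotted.path}} placeholders with values looked up in `scope`.
// Unknown paths are left untouched so missing data is easy to spot.
function renderTemplate(template: string, scope: Record<string, unknown>): string {
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match: string, path: string) => {
    let value: unknown = scope
    for (const key of path.split(".")) {
      if (typeof value !== "object" || value === null) return match
      value = (value as Record<string, unknown>)[key]
    }
    return typeof value === "string" || typeof value === "number" ? String(value) : match
  })
}
```

With a scope of `{ threadContext: { channel: "voice", params: { callerNumber: "+14155550123" } } }`, the two lines above would render as `Channel: voice` and `Caller: +14155550123`.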

## Billing

Voice calls are billed through the standard credit system. The voice gateway reports token usage (input + output) to Convex when the call ends, and credits are deducted based on the model's pricing.

A cleanup cron marks voice threads stuck in `"active"` status for over 2 hours as `"failed"`.
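The staleness rule amounts to a simple predicate. In the sketch below, `activeSince` is a hypothetical field holding the epoch-millisecond timestamp at which the thread became active; the actual cron's data model is not described here.

```typescript
// Hypothetical shape; only channelStatus and its values come from this document.
type VoiceThread = {
  channelStatus: "pending" | "active" | "stopped" | "completed" | "failed"
  activeSince: number // epoch milliseconds (assumed field)
}

const STALE_AFTER_MS = 2 * 60 * 60 * 1000 // 2 hours

// True when a thread has been "active" for longer than the cutoff.
function isStale(thread: VoiceThread, now: number): boolean {
  return thread.channelStatus === "active" && now - thread.activeSince > STALE_AFTER_MS
}
```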

## Required Environment Variables

| Variable | Location | Description |
|----------|----------|-------------|
| `VOICE_GATEWAY_URL` | Convex | URL of the voice gateway service |
| `VOICE_GATEWAY_SECRET` | Convex + Voice Gateway | Shared secret for gateway authentication |
| `OPENAI_API_KEY` | Voice Gateway | OpenAI API key for Realtime sessions |
| `TWILIO_AUTH_TOKEN` | Voice Gateway (optional) | For verifying inbound Twilio webhook signatures |
