# Voice Integration

> AI voice agents via Twilio and OpenAI Realtime API

Struere integrates with Twilio for telephony and OpenAI's Realtime API for voice-to-voice AI conversations. Voice agents handle inbound and outbound phone calls with real-time speech recognition, natural voice synthesis, and optional auditor-based validation.

## Architecture

```
Phone Call (inbound or outbound)
    |
    v
Twilio Media Streams (WebSocket, g711_ulaw)
    |
    v
Voice Gateway (Fly.io, Hono + Bun)
    |
    v
OpenAI Realtime API (voice-to-voice, WebSocket)
    |
    v                       (optional)
Voice Agent responds  <---  Auditor Agent polls /v1/chat
                            validates data, injects corrections
```

The voice gateway is a standalone service (`platform/voice-gateway/`) that bridges Twilio's Media Streams with OpenAI's Realtime API. Audio flows as raw g711_ulaw between Twilio and OpenAI with no transcoding.
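
Because both legs speak the same codec, the gateway's per-frame work reduces to re-wrapping the payload. A minimal sketch of that translation (function and type names are illustrative; the `media.payload` field comes from Twilio's Media Streams protocol and `input_audio_buffer.append` from OpenAI's Realtime API):

```typescript
// Sketch of the Twilio -> OpenAI frame translation (names are illustrative).
// Both sides speak g711_ulaw, so the base64 payload passes through untouched.

interface TwilioMediaFrame {
  event: "media"
  media: { payload: string } // base64-encoded g711_ulaw audio
}

interface RealtimeAppendEvent {
  type: "input_audio_buffer.append"
  audio: string // same base64 audio, no transcoding
}

function toRealtimeAppend(frame: TwilioMediaFrame): RealtimeAppendEvent {
  return { type: "input_audio_buffer.append", audio: frame.media.payload }
}
```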

## Dual-Agent Architecture

Voice calls support two modes:

### Single Agent

A voice agent handles the call directly. The agent's system prompt and tools are loaded from its config and sent to the OpenAI Realtime session.

### Dual Agent (Voice + Auditor)

A voice agent handles the conversation while an auditor agent runs in the background:

- **Voice agent** -- The OpenAI Realtime session that speaks with the caller. Follows a script, asks questions, collects information.
- **Auditor agent** -- A standard Struere text agent invoked with the call transcript every N seconds via `/v1/chat`. Validates collected data, fills entities, and can inject corrections back into the voice call.

The voice gateway sends the latest transcript to the auditor at a configurable interval (default 5 seconds). When the auditor calls `voice.inject`, the correction is spoken by the voice agent in its own voice.

## Database Tables

### voiceConnections

Stores the connection between an organization and a Twilio phone number. Scoped by environment.

| Field | Type | Description |
|-------|------|-------------|
| `organizationId` | `Id<"organizations">` | The owning organization |
| `environment` | `"development" \| "production" \| "eval"` | Environment scope |
| `status` | `"disconnected" \| "connected" \| "removed"` | Current connection state |
| `label` | `string?` | Display label for the connection |
| `twilioAccountSid` | `string` | Twilio Account SID |
| `twilioPhoneNumber` | `string` | The Twilio phone number |
| `phoneNumberSid` | `string?` | Twilio Phone Number SID |
| `agentId` | `Id<"agents">?` | Agent assigned to handle inbound calls |
| `routerId` | `Id<"routers">?` | Router assigned to handle inbound calls |
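
The table above corresponds roughly to the following TypeScript shape. This is a sketch derived from the table, not the schema source; `Id<T>` stands in for Convex's branded document ID type:

```typescript
// Sketch of a voiceConnections record, derived from the table above.
// Id<T> is an illustrative stand-in for Convex's branded document IDs.
type Id<Table extends string> = string & { __table?: Table }

interface VoiceConnection {
  organizationId: Id<"organizations">
  environment: "development" | "production" | "eval"
  status: "disconnected" | "connected" | "removed"
  label?: string
  twilioAccountSid: string
  twilioPhoneNumber: string
  phoneNumberSid?: string
  agentId?: Id<"agents">   // agent assigned to handle inbound calls
  routerId?: Id<"routers"> // router assigned to handle inbound calls
}

// Example record (all values are placeholders)
const example: VoiceConnection = {
  organizationId: "org_123" as Id<"organizations">,
  environment: "development",
  status: "connected",
  twilioAccountSid: "ACxxxxxxxx",
  twilioPhoneNumber: "+15550100",
}
```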

## Setup

### 1. Configure Twilio Integration

Provide your Twilio credentials via the CLI:

```bash
bunx struere integration twilio --account-sid <sid> --auth-token <token>
```

### 2. Add a Phone Number

Add a Twilio phone number to your organization. This creates a `voiceConnections` record and configures Twilio to send inbound calls to the voice gateway.

### 3. Assign an Agent or Router

Assign an agent or router to handle inbound calls on the phone number. When a router is assigned, its `voiceConfig` determines the voice settings for the call.

### 4. Configure Voice Settings

Voice configuration lives on the router (not individual agents). Add a `voiceConfig` block to your router definition:

```typescript
import { defineRouter } from 'struere'

export default defineRouter({
  name: "Phone Support",
  slug: "phone-support",
  mode: "classify",
  agents: [
    { slug: "intake-agent", description: "Handles new caller intake and data collection" },
    { slug: "support-agent", description: "Handles technical support questions" },
  ],
  classifyModel: { model: "openai/gpt-5-mini" },
  fallback: "intake-agent",
  voiceConfig: {
    provider: "openai-realtime",
    model: "gpt-realtime-mini",
    voice: "coral",
    auditorAgent: "form-auditor",
    pollInterval: 5000,
    turnDetection: {
      type: "semantic_vad",
      eagerness: "medium",
    },
    noiseReduction: "near_field",
  },
})
```

## Voice Configuration Reference

The `voiceConfig` object on a router controls how voice calls are handled.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `provider` | `string` | `"openai-realtime"` | Voice provider. Currently only `"openai-realtime"` is supported. |
| `model` | `string` | `"gpt-realtime-mini"` | OpenAI Realtime model. Options: `"gpt-realtime-mini"`, `"gpt-realtime-1.5"` |
| `voice` | `string` | `"alloy"` | Voice for speech synthesis |
| `auditorAgent` | `string?` | -- | Slug of the auditor agent for dual-agent mode |
| `pollInterval` | `number` | `5000` | Auditor polling interval in milliseconds |
| `turnDetection` | `object` | `{ type: "semantic_vad", eagerness: "medium" }` | How the model detects when the user has finished speaking |
| `noiseReduction` | `string?` | -- | Noise reduction mode: `"near_field"` or `"far_field"` |
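
Put together, the reference table corresponds to roughly this shape. This is a sketch inferred from the table and the router example above, not the library's exported type:

```typescript
// Sketch of the voiceConfig shape, inferred from the reference table above
// (field names follow the table; this is not the library's exported type).
interface SemanticVadConfig {
  type: "semantic_vad"
  eagerness?: "low" | "medium" | "high" | "auto"
}

interface ServerVadConfig {
  type: "server_vad"
  threshold?: number // audio level threshold, 0.0 to 1.0
  silenceDurationMs?: number
  prefixPaddingMs?: number
}

interface VoiceConfig {
  provider?: "openai-realtime"
  model?: string
  voice?: string
  auditorAgent?: string // enables dual-agent mode
  pollInterval?: number // auditor polling interval, ms
  turnDetection?: SemanticVadConfig | ServerVadConfig
  noiseReduction?: "near_field" | "far_field"
}

// The router example above, expressed against this shape
const config: VoiceConfig = {
  provider: "openai-realtime",
  model: "gpt-realtime-mini",
  voice: "coral",
  auditorAgent: "form-auditor",
  pollInterval: 5000,
  turnDetection: { type: "semantic_vad", eagerness: "medium" },
  noiseReduction: "near_field",
}
```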

### Available Voices

`alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`, `marin`, `cedar`

### Turn Detection

Turn detection determines when the model considers the user's turn to be complete.

**Semantic VAD** (recommended) -- Uses semantic understanding to detect turn boundaries:

```typescript
turnDetection: {
  type: "semantic_vad",
  eagerness: "medium",
}
```

| Field | Type | Options | Description |
|-------|------|---------|-------------|
| `eagerness` | `string` | `"low"`, `"medium"`, `"high"`, `"auto"` | How eagerly the model responds. Lower values wait longer for the user to finish. |

**Server VAD** -- Traditional voice activity detection based on audio levels:

```typescript
turnDetection: {
  type: "server_vad",
  threshold: 0.5,
  silenceDurationMs: 500,
  prefixPaddingMs: 300,
}
```

| Field | Type | Description |
|-------|------|-------------|
| `threshold` | `number` | Audio level threshold (0.0 to 1.0) |
| `silenceDurationMs` | `number` | Milliseconds of silence before turn ends |
| `prefixPaddingMs` | `number` | Milliseconds of audio to include before detected speech |

## Voice Tools

### voice.call

Initiates an outbound voice call to a phone number. The call connects through Twilio and starts an OpenAI Realtime session with the configured voice settings.

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `phoneNumber` | `string` | Yes | Phone number to call (E.164 format) |
| `routerSlug` | `string` | No | Router slug to use for voice config and agent routing |
| `agentSlug` | `string` | No | Agent slug to handle the call directly |
| `entityType` | `string` | No | Entity type slug for form-filling scenarios |
| `entityId` | `string` | No | Existing entity ID to update during the call |
| `metadata` | `object` | No | Extra context passed to the voice agent |

**Returns:**

```typescript
{
  callSid: string
  status: "initiated"
}
```

**Example:**

```typescript
import { defineAgent } from 'struere'

export default defineAgent({
  name: "Outreach Agent",
  slug: "outreach",
  systemPrompt: "You schedule follow-up calls with leads.",
  model: { model: "openai/gpt-5-mini" },
  tools: ["entity.query", "voice.call"],
})
```

### voice.inject

Injects a message into an active voice call. This tool is designed for auditor agents -- when called, the message is spoken by the voice agent in its own voice during the call.

**Parameters:**

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `message` | `string` | Yes | Message to inject into the voice call |
| `action` | `string` | Yes | Action type: `"correction"`, `"complete"`, or `"abort"` |

**Actions:**

| Action | Behavior |
|--------|----------|
| `correction` | The voice agent speaks the correction message to the caller |
| `complete` | Signals the form/process is complete |
| `abort` | Signals the call should end |
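
On the gateway side, handling an injection reduces to a small dispatch over the `action` field. A sketch (the return values and function name are hypothetical; only the three action values come from the table above):

```typescript
type InjectAction = "correction" | "complete" | "abort"

interface InjectRequest {
  message: string
  action: InjectAction
}

// Maps an injection to the gateway's next step (names are illustrative).
function dispatchInject(req: InjectRequest): "speak" | "finish" | "hangup" {
  switch (req.action) {
    case "correction":
      return "speak" // voice agent speaks the correction to the caller
    case "complete":
      return "finish" // form/process is complete; wrap up the call
    case "abort":
      return "hangup" // end the call
  }
}
```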

**Example auditor agent:**

```typescript
import { defineAgent } from 'struere'

export default defineAgent({
  name: "Form Auditor",
  slug: "form-auditor",
  systemPrompt: `You validate data collected during voice calls.
When you detect incorrect or missing data, use voice.inject to correct the caller.
When all required fields are filled, use voice.inject with action "complete".`,
  model: { model: "openai/gpt-5-mini" },
  tools: ["entity.query", "entity.update", "voice.inject"],
})
```

## Auditor Correction Flow

When using dual-agent mode, the auditor correction flow works as follows:

```
Voice Gateway polls /v1/chat every {pollInterval}ms with transcript delta
    |
    v
Auditor agent processes transcript, validates data
    |
    v
Auditor calls voice.inject (tool result stored in _executionMeta.toolCallSummary)
    |
    v
Voice Gateway reads tool result from response
    |
    v
Correction injected into OpenAI Realtime session
    |
    v
Voice agent speaks correction in its own voice
```
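
The gateway's read step can be sketched as a small extraction over the auditor's chat response. The `_executionMeta.toolCallSummary` path comes from the diagram above; the shape of each summary entry (`tool`, `args`) is an assumption for illustration:

```typescript
// Sketch: pull voice.inject calls out of an auditor /v1/chat response.
// The toolCallSummary entry shape here is assumed, not documented.
interface ToolCall {
  tool: string
  args: { message: string; action: "correction" | "complete" | "abort" }
}

interface ChatResponse {
  _executionMeta?: { toolCallSummary?: ToolCall[] }
}

function extractInjections(res: ChatResponse): ToolCall["args"][] {
  const calls = res._executionMeta?.toolCallSummary ?? []
  return calls.filter((c) => c.tool === "voice.inject").map((c) => c.args)
}
```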

## Thread Data

Voice calls create threads with `channel: "voice"`. Thread metadata includes:

| Field | Value |
|-------|-------|
| `channel` | `"voice"` |
| `channelStatus` | `"pending"`, `"active"`, `"stopped"`, `"completed"`, or `"failed"` |
| `channelParams.callerNumber` | Caller's phone number |

System prompt access:

```
Channel: {{threadContext.channel}}
Caller: {{threadContext.params.callerNumber}}
```

## Billing

Voice calls are billed through the standard credit system. The voice gateway reports token usage (input + output) to Convex when the call ends, and credits are deducted based on the model's pricing.

A cleanup cron marks voice threads stuck in `"active"` status for over 2 hours as `"failed"`.

## Required Environment Variables

| Variable | Location | Description |
|----------|----------|-------------|
| `VOICE_GATEWAY_URL` | Convex | URL of the voice gateway service |
| `VOICE_GATEWAY_SECRET` | Convex + Voice Gateway | Shared secret for gateway authentication |
| `OPENAI_API_KEY` | Voice Gateway | OpenAI API key for Realtime sessions |
| `TWILIO_AUTH_TOKEN` | Voice Gateway (optional) | For verifying inbound Twilio webhook signatures |
