Audio Overview
HPD Agent audio support has three different paths. Pick the path first, then configure providers.
| Path | Use it when | Main docs |
|---|---|---|
| Finite speech-to-text input | A user uploads or sends an audio file and the model should receive transcript text. | Speech To Text Input |
| Assistant text-to-speech output | The model produces text and the app also wants an audio artifact or playback flow. | Text To Speech Output |
| Native realtime audio | The model/provider owns realtime audio and transcript streaming during the turn. | Realtime Audio |
Input and output are separate choices. A user can send text, audio, or both; the assistant can return text, audio, or both. The runtime path decides how those shapes are bridged.
| User sends | Assistant returns | What to expect |
|---|---|---|
| Text | Text | Normal chat turn. The audio runtime does not run finite input handling. |
| Text | Audio | The model produces text, then TTS or realtime provider audio produces spoken output. |
| Text | Text + audio | Assistant text remains primary; audio is a realtime stream or synthesized artifact. |
| Audio | Text | Finite STT can inject transcript text before the normal model turn, or realtime can produce transcript text during the turn. |
| Audio | Audio | Use realtime audio for provider-owned voice loops, or finite STT input plus TTS output for pipeline-style voice UX. |
| Audio | Text + audio | Common assistant voice UX: durable text/transcript plus spoken response. |
| Text + audio | Text, audio, or both | Typed text is preserved. Audio can become transcript context, realtime media, or reference content depending on input mode. |
The important distinction is that audio input, model input, assistant output, and durability are different axes. Audio does not always mean realtime, and realtime does not always mean audio-only.
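As a rough way to read the two tables together, the sketch below maps an app's needs to the path names from the first table. `PickPaths` and its parameters are hypothetical and not part of HPD.Agent. It also simplifies the tables by assuming that a provider-owned realtime loop covers both directions, while pipeline-style input and output are chosen independently.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not an HPD.Agent API): maps what the app needs to the
// path names used in the first table of this page.
static List<string> PickPaths(bool userSendsAudio, bool assistantReturnsAudio, bool providerOwnsRealtime)
{
    var paths = new List<string>();
    bool anyAudio = userSendsAudio || assistantReturnsAudio;

    // Assumption: a provider-owned voice loop handles both directions itself.
    if (providerOwnsRealtime && anyAudio)
    {
        paths.Add("Native realtime audio");
        return paths;
    }

    // Pipeline style: input and output are independent choices.
    if (userSendsAudio) paths.Add("Finite speech-to-text input");
    if (assistantReturnsAudio) paths.Add("Assistant text-to-speech output");

    // An empty list means a normal text chat turn with no audio path.
    return paths;
}

Console.WriteLine(string.Join(" + ",
    PickPaths(userSendsAudio: true, assistantReturnsAudio: true, providerOwnsRealtime: false)));
Console.WriteLine(string.Join(" + ",
    PickPaths(userSendsAudio: true, assistantReturnsAudio: false, providerOwnsRealtime: true)));
```

The first call selects the finite STT and TTS paths together (pipeline-style voice UX); the second selects only the realtime path, even though the assistant returns text.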
Audio runtime behavior is attached with WithAudio() or WithAudioRuntimeAttachment(...).
```csharp
using HPD.Agent;
using HPD.Agent.Audio;

// chatClient is a chat client instance created elsewhere in the app.
var agent = await new AgentBuilder()
    .WithChatClient(chatClient)
    .WithAudio()
    .BuildAsync();
```

Use WithAudio() for the default attachment. Use WithAudioRuntimeAttachment(...) when the app needs explicit speech-to-text, text-to-speech, artifact, branch projection, or playback behavior.
Provider Families
Audio providers are resolved by client family, not by the chat provider alone.
| Family | Config slot | Use it for |
|---|---|---|
| Speech to text | Clients.SpeechToText | Audio input transcription, including finite and provider-streaming STT paths. |
| Text to speech | Clients.TextToSpeech | Assistant text synthesis. |
| Realtime | Clients.Realtime | Native realtime model turns. |
Source-confirmed provider families include:
| Provider key | Families |
|---|---|
| openai | speech to text, text to speech, realtime |
| elevenlabs | speech to text, text to speech |
A chat provider package alone does not imply audio support. Audio provider packages register their own family slots.
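One way an app might guard against this is a pre-flight check before wiring a family slot. The sketch below is illustrative only: the dictionary mirrors the source-confirmed table above (which may not be exhaustive), and `Supports` and the family labels are hypothetical names, not HPD.Agent APIs.

```csharp
using System;
using System.Collections.Generic;

// Families listed in the source-confirmed table above, keyed by provider key.
// The labels mirror the config slot names; the table may not be exhaustive.
var confirmedFamilies = new Dictionary<string, HashSet<string>>(StringComparer.OrdinalIgnoreCase)
{
    ["openai"] = new() { "SpeechToText", "TextToSpeech", "Realtime" },
    ["elevenlabs"] = new() { "SpeechToText", "TextToSpeech" },
};

// Hypothetical helper: true only when the provider is listed for that family.
bool Supports(string providerKey, string family) =>
    confirmedFamilies.TryGetValue(providerKey, out var families) && families.Contains(family);

Console.WriteLine(Supports("openai", "Realtime"));      // True
Console.WriteLine(Supports("elevenlabs", "Realtime"));  // False
Console.WriteLine(Supports("some-chat-only-provider", "TextToSpeech")); // False
```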
What Gets Stored
The audio runtime treats text as the durable default:
- Finite input audio can be transcribed and the transcript can be injected into the model input.
- Branch history stores derived transcript text by default, not raw input audio.
- Assistant TTS keeps assistant text as the primary output, then adds synthesized audio artifacts when configured.
- Assistant audio artifacts are stored through IContentStore when content-store capture is enabled.
This matters for privacy and replay. Store raw audio only when your app has an explicit retention policy for it.
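The defaults above can be summarized as a small decision sketch. It assumes a turn with finite speech-to-text input and text-to-speech output configured; `DurableItems` and its flags are hypothetical illustrations of the rules in the list, not HPD.Agent settings.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical summary of what ends up durable for one voice turn.
// Assumes finite STT input and TTS output are both configured.
static List<string> DurableItems(bool contentStoreCaptureEnabled, bool rawAudioRetentionPolicy)
{
    // Text is the durable default: transcript and assistant text are always kept.
    var items = new List<string> { "input transcript text", "assistant text" };

    // Synthesized audio is kept only when content-store capture is enabled.
    if (contentStoreCaptureEnabled) items.Add("assistant audio artifact");

    // Raw input audio is kept only under an explicit app retention policy.
    if (rawAudioRetentionPolicy) items.Add("raw input audio");

    return items;
}

Console.WriteLine(string.Join(", ",
    DurableItems(contentStoreCaptureEnabled: true, rawAudioRetentionPolicy: false)));
// input transcript text, assistant text, assistant audio artifact
```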
Audio uses the same generic content pipeline as other binary inputs: user bytes can be uploaded into branch-scoped content references, then resolved into provider-facing content before the model call. Audio detection happens before upload so transcripts and input metadata can still refer back to the original media. See Content Upload And Resolution.
Boundaries
HPD Agent audio does not include microphone capture, browser device handling, or WebRTC UI plumbing. Those belong in the host application. The TypeScript client and hosted text streaming APIs also do not currently expose the full audio submission and realtime configuration surface.