
Audio Overview

HPD Agent audio support has three different paths. Pick the path first, then configure providers.

| Path | Use it when | Main docs |
| --- | --- | --- |
| Finite speech-to-text input | A user uploads or sends an audio file and the model should receive transcript text. | Speech To Text Input |
| Assistant text-to-speech output | The model produces text and the app also wants an audio artifact or playback flow. | Text To Speech Output |
| Native realtime audio | The model/provider owns realtime audio and transcript streaming during the turn. | Realtime Audio |

Input and output are separate choices. A user can send text, audio, or both; the assistant can return text, audio, or both. The runtime path decides how those shapes are bridged.

| User sends | Assistant returns | What to expect |
| --- | --- | --- |
| Text | Text | Normal chat turn. Audio runtime does not run finite input handling. |
| Text | Audio | The model produces text, then TTS or realtime provider audio produces spoken output. |
| Text | Text + audio | Assistant text remains primary; audio is a realtime stream or synthesized artifact. |
| Audio | Text | Finite STT can inject transcript text before the normal model turn, or realtime can produce transcript text during the turn. |
| Audio | Audio | Use realtime audio for provider-owned voice loops, or finite STT input plus TTS output for pipeline-style voice UX. |
| Audio | Text + audio | Common assistant voice UX: durable text/transcript plus spoken response. |
| Text + audio | Text, audio, or both | Typed text is preserved. Audio can become transcript context, realtime media, or reference content depending on input mode. |

The important distinction is that audio input, model input, assistant output, and durability are different axes. Audio does not always mean realtime, and realtime does not always mean audio-only.
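As a rough illustration of how those axes combine, the sketch below maps three yes/no questions onto the paths described above. Every type and member name here (`AudioPath`, `AudioPathPlanner`, `Choose`) is hypothetical and exists only for this example; none of it is part of the HPD Agent API.

```csharp
// Hypothetical planning helper for illustration only; not an HPD Agent API.
public enum AudioPath
{
    None,               // plain text chat turn
    FiniteSpeechToText, // audio in, transcript injected before the model turn
    TextToSpeech,       // text out, synthesized audio added afterwards
    FiniteSttPlusTts,   // pipeline-style voice UX
    NativeRealtime      // provider-owned audio and transcript streaming
}

public static class AudioPathPlanner
{
    // userSendsAudio:   the turn includes an audio file or clip.
    // wantsSpokenReply: the app wants audio back, with or without text.
    // providerOwnsLoop: the model/provider streams audio itself during the turn.
    public static AudioPath Choose(bool userSendsAudio, bool wantsSpokenReply, bool providerOwnsLoop)
    {
        bool anyAudio = userSendsAudio || wantsSpokenReply;

        // Realtime only applies when audio is actually involved in the turn.
        if (providerOwnsLoop && anyAudio) return AudioPath.NativeRealtime;
        if (userSendsAudio && wantsSpokenReply) return AudioPath.FiniteSttPlusTts;
        if (userSendsAudio) return AudioPath.FiniteSpeechToText;
        if (wantsSpokenReply) return AudioPath.TextToSpeech;
        return AudioPath.None;
    }
}
```

For example, `AudioPathPlanner.Choose(true, false, false)` selects `FiniteSpeechToText`, which corresponds to the audio-in, text-out case where no provider-owned voice loop is wanted.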

Audio runtime behavior is attached with WithAudio() or WithAudioRuntimeAttachment(...).

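```csharp
using HPD.Agent;
using HPD.Agent.Audio;

// chatClient is the chat client the app has already configured elsewhere.
var agent = await new AgentBuilder()
    .WithChatClient(chatClient)
    .WithAudio() // attach the default audio runtime behavior
    .BuildAsync();
```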

Use WithAudio() for the default attachment. Use WithAudioRuntimeAttachment(...) when the app needs explicit speech-to-text, text-to-speech, artifact, branch projection, or playback behavior.

Provider Families

Audio providers are resolved by client family, not by the chat provider alone.

| Family | Config slot | Use it for |
| --- | --- | --- |
| Speech to text | Clients.SpeechToText | Audio input transcription, including finite and provider-streaming STT paths. |
| Text to speech | Clients.TextToSpeech | Assistant text synthesis. |
| Realtime | Clients.Realtime | Native realtime model turns. |

Source-confirmed provider families include:

| Provider key | Families |
| --- | --- |
| openai | speech to text, text to speech, realtime |
| elevenlabs | speech to text, text to speech |

A chat provider package alone does not imply audio support. Audio provider packages register their own family slots.
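One way to picture per-family resolution is a lookup keyed by provider and family rather than by provider alone. The sketch below encodes only the two source-confirmed rows from the table above; the `AudioFamily` enum and `AudioProviderCatalog` class are hypothetical names for this example and do not reflect how HPD Agent registers providers internally.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the provider table; not an HPD Agent API.
[Flags]
public enum AudioFamily
{
    None = 0,
    SpeechToText = 1,
    TextToSpeech = 2,
    Realtime = 4
}

public static class AudioProviderCatalog
{
    // Only the provider keys listed in the table above are included.
    private static readonly Dictionary<string, AudioFamily> Families =
        new Dictionary<string, AudioFamily>(StringComparer.OrdinalIgnoreCase)
        {
            ["openai"] = AudioFamily.SpeechToText | AudioFamily.TextToSpeech | AudioFamily.Realtime,
            ["elevenlabs"] = AudioFamily.SpeechToText | AudioFamily.TextToSpeech,
        };

    // True only when the provider key is known AND it registers the requested family.
    public static bool Supports(string providerKey, AudioFamily family)
    {
        if (family == AudioFamily.None) return false;
        return Families.TryGetValue(providerKey, out var supported)
            && (supported & family) == family;
    }
}
```

With this model, `AudioProviderCatalog.Supports("elevenlabs", AudioFamily.Realtime)` is false even though the same key supports both speech families, which is the point of resolving by family: each slot is checked on its own.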

What Gets Stored

The audio runtime treats text as the durable default:

  • Finite input audio can be transcribed and the transcript can be injected into the model input.
  • Branch history stores derived transcript text by default, not raw input audio.
  • Assistant TTS keeps assistant text as the primary output, then adds synthesized audio artifacts when configured.
  • Assistant audio artifacts are stored through IContentStore when content-store capture is enabled.

This matters for privacy and replay. Store raw audio only when your app has an explicit retention policy for it.
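The storage defaults above can be summarized as a small decision function. This is a sketch under assumptions: `AudioRetentionPolicy`, `DurableTurnPlanner`, and the string labels are hypothetical and only model the described behavior; they are not HPD Agent types, and the real capture switches may be shaped differently.

```csharp
using System.Collections.Generic;

// Hypothetical policy object for illustration; not an HPD Agent type.
public sealed class AudioRetentionPolicy
{
    // Mirrors "content-store capture is enabled" for assistant audio artifacts.
    public bool CaptureAssistantAudio { get; set; }

    // Off by default: raw input audio is kept only under an explicit retention policy.
    public bool RetainRawInputAudio { get; set; }
}

public static class DurableTurnPlanner
{
    // Returns labels for what a turn would persist under the given policy.
    public static List<string> ItemsToPersist(
        AudioRetentionPolicy policy, bool hasInputAudio, bool hasAssistantAudio)
    {
        var items = new List<string>();

        // Text is the durable default on both sides of the turn.
        if (hasInputAudio) items.Add("input-transcript");
        items.Add("assistant-text");

        // Audio is stored only when the app opted in.
        if (hasAssistantAudio && policy.CaptureAssistantAudio) items.Add("assistant-audio-artifact");
        if (hasInputAudio && policy.RetainRawInputAudio) items.Add("raw-input-audio");

        return items;
    }
}
```

With a default `AudioRetentionPolicy`, a voice turn persists only the input transcript and the assistant text; audio entries appear once the corresponding flag is set.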

Audio uses the same generic content pipeline as other binary inputs: user bytes can be uploaded into branch-scoped content references, then resolved into provider-facing content before the model call. Audio detection happens before upload so transcripts and input metadata can still refer back to the original media. See Content Upload And Resolution.

Boundaries

HPD Agent audio does not include microphone capture, browser device handling, or WebRTC UI plumbing. Those belong in the host application. The TypeScript client and hosted text streaming APIs also do not currently expose the full audio submission and realtime configuration surface.
