Audio Runtime Attachment

The audio runtime attachment is the bridge between agent turns and audio behavior. It decides how finite audio input is handled, how assistant text is synthesized, where artifacts go, and whether committed transcripts are projected into branch history.

Attach The Runtime

Use the default attachment when you want the configured audio behavior without extra code:

```csharp
using HPD.Agent;
using HPD.Agent.Audio;

// chatClient is a chat client instance created elsewhere in the application.
var agent = await new AgentBuilder()
    .WithChatClient(chatClient)
    .WithAudio()
    .BuildAsync();
```

Use WithAudioRuntimeAttachment(...) when the application needs explicit options:

```csharp
var agent = await new AgentBuilder()
    .WithChatClient(chatClient)
    .WithAudioRuntimeAttachment(audio =>
    {
        // Transcribe finite input audio and inject the transcript text.
        audio.InputMode = AudioInputMode.BatchSpeechToText;
        // Synthesize speech from the assistant's final text.
        audio.AssistantOutputSynthesisMode = AssistantOutputSynthesisMode.FinalText;
    })
    .BuildAsync();
```

There are also overloads for passing attachment options, a branch projection sink, or a session store. Use the session-store overload when committed transcript and assistant-output projections should be written into durable session history.

Configuration Precedence

Runtime options are compiled in layers:

| Layer | Purpose |
| --- | --- |
| AudioRuntimeAttachmentOptions | Base attachment defaults and explicit builder options. |
| AgentConfig.Audio | Agent-level audio policy. |
| AgentRunConfig.Audio | Per-run audio overrides. |

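When the same behavior is configured in more than one place, the most specific layer takes precedence. The run config is the most specific layer, followed by the agent-level policy and then the attachment defaults.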
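The layering can be pictured as a chain of nullable overrides. The sketch below is illustrative only: `AudioLayer` and `AudioLayering.Resolve` are hypothetical stand-ins rather than library types, modes are plain strings for brevity, and it assumes that a value left unset in a more specific layer falls through to the next layer.

```csharp
#nullable enable
using System;

// Hypothetical stand-in for one configuration layer; null means "not set here".
public sealed record AudioLayer(string? InputMode = null, string? OutputMode = null);

public static class AudioLayering
{
    // Resolves each setting from the most specific layer that sets it:
    // per-run overrides first, then agent policy, then attachment defaults.
    public static AudioLayer Resolve(AudioLayer attachment, AudioLayer? agent, AudioLayer? run) =>
        new AudioLayer(
            InputMode: run?.InputMode ?? agent?.InputMode ?? attachment.InputMode,
            OutputMode: run?.OutputMode ?? agent?.OutputMode ?? attachment.OutputMode);
}
```

In this sketch, a run layer that sets only the input mode changes that one setting and leaves the output mode to whichever lower layer defined it.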

Input Modes

| Mode | What it means |
| --- | --- |
| Auto | Use the attachment default policy. |
| None | Do not run split finite-audio input handling. |
| BatchSpeechToText | Transcribe finite input audio and inject transcript text. |
| ReferenceOnly | Keep audio as reference content without batch transcription. |
| Reject | Reject audio input for this agent or run. |
| ProviderRealtime | Leave finite-audio handling alone for native realtime provider transport. |

Use BatchSpeechToText for uploaded audio files. Use ProviderRealtime with Realtime Audio, where the realtime model path owns the audio interaction.

Input detection is content-based:

  • text-only user messages do not run finite audio input handling,
  • audio-containing messages are detected before content upload,
  • mixed text-and-audio messages keep the original text,
  • committed transcripts are added as additional text content when transcript projection into the user message is enabled,
  • ReferenceOnly keeps media identity available without batch transcription.
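The detection rules above can be sketched with a toy message model. Everything here is hypothetical: `Part`, `TextPart`, `AudioPart`, and `AudioInputSketch` are illustrative names, `transcribe` stands in for whatever speech-to-text step the application has wired, and keeping the audio part alongside the appended transcript is an assumption of the sketch rather than documented behavior.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical content parts; the real message model is defined by the library.
public abstract record Part;
public sealed record TextPart(string Text) : Part;
public sealed record AudioPart(string MediaId) : Part;

public static class AudioInputSketch
{
    // Content-based detection: only messages that carry audio need handling.
    public static bool NeedsAudioHandling(IReadOnlyList<Part> parts) =>
        parts.Any(p => p is AudioPart);

    // Keeps the original parts and, when projection is enabled, appends each
    // transcript as additional text content.
    public static IReadOnlyList<Part> ProjectTranscripts(
        IReadOnlyList<Part> parts,
        Func<AudioPart, string> transcribe,
        bool projectIntoUserMessage = true)
    {
        // Text-only messages and disabled projection leave the message untouched.
        if (!projectIntoUserMessage || !NeedsAudioHandling(parts)) return parts;

        var result = new List<Part>(parts);
        foreach (var audio in parts.OfType<AudioPart>())
            result.Add(new TextPart(transcribe(audio)));
        return result;
    }
}
```

A mixed message such as one text part plus one audio part comes back with three parts in this sketch: the original text, the original audio reference, and the transcript text.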

Output Modes

| Mode | What it means |
| --- | --- |
| Auto | Use the attachment default policy. |
| None | Do not synthesize assistant audio. |
| TextOnly | Keep assistant text only. |
| TextToSpeech | Synthesize assistant text through the TTS path. |
| ProviderRealtimeAudio | Leave assistant audio to the native realtime provider path. |

The most common non-realtime output mode is final-text TTS: the model produces text, then the runtime synthesizes that text.
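As a rough illustration of the final-text flow, the sketch below waits for the complete assistant text and only then synthesizes it. The names `FinalTextTtsSketch`, `TurnOutput`, and `CompleteTurnAsync` are hypothetical, and both delegates are placeholders for the model call and the TTS call; none of this is the runtime's actual API.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;

public static class FinalTextTtsSketch
{
    // Hypothetical turn result: text is always present, audio is optional.
    public sealed record TurnOutput(string Text, byte[]? Audio);

    // Final-text flow: obtain the full assistant text, then synthesize it once.
    // Passing null for synthesize models a text-only turn.
    public static async Task<TurnOutput> CompleteTurnAsync(
        Func<Task<string>> generateText,
        Func<string, Task<byte[]>>? synthesize)
    {
        string text = await generateText();
        byte[]? audio = synthesize is null ? null : await synthesize(text);
        return new TurnOutput(text, audio);
    }
}
```

Because synthesis consumes the finished text, the text stays the source of truth and the audio is a derived artifact that can be stored or discarded separately.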

Branch Projection

The runtime can project committed transcripts and assistant output into branch history. By default, the useful durable representation is text:

  • user audio input becomes transcript text,
  • assistant text remains the source of truth,
  • assistant audio artifacts are stored separately through IContentStore.

Content uploads and audio artifacts are branch-scoped. Normal RunAsync(..., sessionId: ...) execution supplies the active branch context for you. When you construct event input directly, include both SessionId and BranchId so upload, artifact, and branch-projection middleware can use the same durable scope. See Content Upload And Resolution for the generic upload and resolver flow.
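When building event input by hand, one option is to validate the scope up front so that no step runs with only half of it. The guard below is a minimal sketch: `DurableScope` and `ScopeGuard` are hypothetical helpers, not library types, and the real event input shape may differ.

```csharp
#nullable enable
using System;

// Hypothetical scope pair; the real event input type is defined by the library.
public sealed record DurableScope(string SessionId, string BranchId);

public static class ScopeGuard
{
    // Fails fast when either identifier is missing, so upload, artifact, and
    // projection steps never see a partial scope.
    public static DurableScope Require(string? sessionId, string? branchId)
    {
        if (string.IsNullOrWhiteSpace(sessionId))
            throw new ArgumentException("SessionId is required.", nameof(sessionId));
        if (string.IsNullOrWhiteSpace(branchId))
            throw new ArgumentException("BranchId is required.", nameof(branchId));
        return new DurableScope(sessionId, branchId);
    }
}
```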

Be deliberate before storing raw audio in durable history. Audio retention usually has stricter product and privacy requirements than text transcript retention.

Active Boundaries

Some policy objects contain lower-level privacy, trace, and projection flags. Prefer documenting behavior you have wired and tested in your app rather than exposing every internal knob. The stable user-facing model is: text is durable by default, raw media is explicit, and realtime provider transport is distinct from finite STT/TTS runtime work.
