ONNX Runtime GenAI Provider

Provider Key: onnx-runtime

Overview

The ONNX Runtime GenAI provider enables HPD-Agent to run foundation models locally using Microsoft's ONNX Runtime GenAI library. This provider supports high-performance local inference with various hardware accelerators, making it ideal for privacy-sensitive applications, offline deployments, and cost-effective AI solutions.

Key Features:

  • Local model inference (no cloud API required)
  • Multiple hardware accelerators (CPU, CUDA, DirectML, QNN, OpenVINO, TensorRT, WebGPU)
  • Streaming support for real-time responses
  • Vision support (Phi Vision and other multi-modal models)
  • Multiple model architectures (Llama, Mistral, Phi, Gemma, Qwen, and more)
  • Multi-LoRA adapter support for model customization
  • Constrained decoding (JSON schema, grammar, regex)
  • Deterministic generation with seed control
  • Prompt caching for improved performance
  • Beam search and advanced sampling strategies

For detailed API documentation, see the OnnxRuntimeProviderConfig API Reference section at the end of this page.

Quick Start

Minimal Example

csharp
using HPD.Agent;
using HPD.Agent.Providers.OnnxRuntime;

var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/phi-3-mini-4k-instruct-onnx")
    .Build();

var response = await agent.RunAsync("What is the capital of France?");
Console.WriteLine(response);

Installation

bash
dotnet add package HPD-Agent.Providers.OnnxRuntime

Dependencies:

  • Microsoft.ML.OnnxRuntimeGenAI - ONNX Runtime GenAI library
  • Microsoft.Extensions.AI - AI abstractions
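
The base Microsoft.ML.OnnxRuntimeGenAI package targets CPU execution. If you plan to use GPU acceleration, ONNX Runtime GenAI also publishes hardware-specific runtime packages; install the one matching your target (assuming HPD-Agent's provider package does not already reference one for you):

bash
# CUDA build of the GenAI runtime (NVIDIA GPUs)
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda

# DirectML build (Windows, any GPU)
dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML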

Configuration

Configuration Patterns

The ONNX Runtime provider supports all three configuration patterns. Choose the one that best fits your needs.

1. Builder Pattern (Fluent API)

Best for: Simple configurations and quick prototyping.

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.MaxLength = 2048;
            opts.Temperature = 0.7f;
            opts.DoSample = true;
            opts.TopP = 0.9f;
        })
    .Build();

2. Config Pattern (Data-Driven)

Best for: Serialization, persistence, and configuration files.

C# Config Object:

csharp
var config = new AgentConfig
{
    Name = "OnnxAgent",
    Provider = new ProviderConfig
    {
        ProviderKey = "onnx-runtime",
        ModelName = "phi-3-mini"
    }
};

var onnxOpts = new OnnxRuntimeProviderConfig
{
    ModelPath = "/path/to/phi-3-mini",
    MaxLength = 2048,
    Temperature = 0.7f,
    DoSample = true,
    TopP = 0.9f
};
config.Provider.SetTypedProviderConfig(onnxOpts);

var agent = await config.BuildAsync();

JSON Config File:

json
{
    "Name": "OnnxAgent",
    "Provider": {
        "ProviderKey": "onnx-runtime",
        "ModelName": "phi-3-mini",
        "ProviderOptionsJson": "{\"modelPath\":\"/path/to/phi-3-mini\",\"maxLength\":2048,\"temperature\":0.7,\"doSample\":true,\"topP\":0.9}"
    }
}
csharp
var agent = await AgentConfig
    .BuildFromFileAsync("onnx-config.json");

3. Hybrid Pattern (Config + Builder)

Best for: Production deployments with reusable configuration and runtime customization.

csharp
// Define base config once
var config = new AgentConfig
{
    Name = "OnnxAgent",
    Provider = new ProviderConfig
    {
        ProviderKey = "onnx-runtime",
        ModelName = "phi-3-mini"
    }
};

var onnxOpts = new OnnxRuntimeProviderConfig
{
    ModelPath = "/path/to/phi-3-mini",
    MaxLength = 2048,
    Temperature = 0.7f
};
config.Provider.SetTypedProviderConfig(onnxOpts);

// Reuse with different runtime customizations
var agent1 = new AgentBuilder(config)
    .WithServiceProvider(services)
    .WithToolkit<MathToolkit>()
    .Build();

var agent2 = new AgentBuilder(config)
    .WithServiceProvider(services)
    .WithToolkit<FileToolkit>()
    .Build();

Provider-Specific Options

The OnnxRuntimeProviderConfig class provides comprehensive configuration options organized by category:

Core Generation Parameters

csharp
configure: opts =>
{
    // Maximum tokens to generate
    opts.MaxLength = 2048;

    // Minimum tokens to generate
    opts.MinLength = 10;

    // Batch size for parallel generation
    opts.BatchSize = 1;
}

Sampling Parameters

csharp
configure: opts =>
{
    // Enable randomized sampling (false = greedy decoding)
    opts.DoSample = true;

    // Sampling temperature (0.0-2.0, higher = more random)
    opts.Temperature = 0.8f;

    // Top-K sampling (keep top K tokens)
    opts.TopK = 50;

    // Top-P nucleus sampling (0.0-1.0)
    opts.TopP = 0.9f;
}

Repetition Control

csharp
configure: opts =>
{
    // Repetition penalty (> 1.0 penalizes repetition)
    opts.RepetitionPenalty = 1.1f;

    // Prevent n-gram repetition (size of n-grams)
    opts.NoRepeatNgramSize = 3;
}

Beam Search Parameters

csharp
configure: opts =>
{
    // Number of beams for beam search (1 = greedy)
    opts.NumBeams = 4;

    // Number of sequences to return
    opts.NumReturnSequences = 2;

    // Stop when num_beams sentences are finished
    opts.EarlyStopping = true;

    // Length penalty (> 1.0 promotes longer sequences)
    opts.LengthPenalty = 1.2f;

    // Diversity penalty for beam groups
    opts.DiversityPenalty = 0.5f;
}

Determinism & Randomness

csharp
configure: opts =>
{
    // Seed for deterministic output (-1 = random)
    opts.RandomSeed = 42;
}

Stop Sequences

csharp
configure: opts =>
{
    // Custom stop sequences
    opts.StopSequences = new List<string>
    {
        "<|end|>",
        "<|user|>",
        "<|system|>"
    };
}

Performance Optimization

csharp
configure: opts =>
{
    // Share past/present KV buffers (CUDA only)
    opts.PastPresentShareBuffer = true;

    // Chunk size for prefill chunking
    opts.ChunkSize = 512;

    // Enable conversation caching
    opts.EnableCaching = true;
}

Constrained Decoding

csharp
configure: opts =>
{
    // Guidance type: "json", "grammar", "regex"
    opts.GuidanceType = "json";

    // JSON schema, grammar specification, or regex pattern
    opts.GuidanceData = @"{
        ""type"": ""object"",
        ""properties"": {
            ""name"": { ""type"": ""string"" },
            ""age"": { ""type"": ""number"" }
        },
        ""required"": [""name"", ""age""]
    }";

    // Enable fast-forward tokens for guidance
    opts.GuidanceEnableFFTokens = true;
}

Execution Provider Configuration

csharp
configure: opts =>
{
    // Execution providers (order matters - tried in sequence)
    opts.Providers = new List<string> { "cuda", "cpu" };

    // Provider-specific options
    opts.ProviderOptions = new Dictionary<string, Dictionary<string, string>>
    {
        ["cuda"] = new Dictionary<string, string>
        {
            ["device_id"] = "0",
            ["cudnn_conv_algo_search"] = "DEFAULT",
            ["gpu_mem_limit"] = "2147483648" // 2GB
        },
        ["cpu"] = new Dictionary<string, string>
        {
            ["intra_op_num_threads"] = "8"
        }
    };
}

Multi-LoRA Adapter Support

csharp
configure: opts =>
{
    // Path to adapter file
    opts.AdapterPath = "/path/to/adapters.onnx_adapter";

    // Adapter name to activate
    opts.AdapterName = "math_adapter";
}

Model Setup

Downloading Models

ONNX Runtime GenAI requires models in ONNX format. You can:

  1. Download pre-converted models from Hugging Face:

    bash
    # Example: Download Phi-3 Mini
    git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx
  2. Convert models yourself using the ONNX Runtime GenAI model builder:

    bash
    # Install model builder
    pip install onnxruntime-genai
    
    # Convert a model
    python -m onnxruntime_genai.models.builder \
        -m microsoft/Phi-3-mini-4k-instruct \
        -o ./phi-3-mini-onnx \
        -p int4 \
        -e cpu

Model Directory Structure

A valid ONNX model directory must contain:

phi-3-mini-onnx/
├── genai_config.json          # Model configuration
├── model.onnx                  # Main model file
├── model.onnx.data (optional)  # Model weights
└── tokenizer files             # Tokenizer configuration
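
Before handing the path to the builder, a quick pre-flight check catches most setup mistakes early. A minimal sketch (the path is illustrative; genai_config.json is the one file ONNX Runtime GenAI always requires):

csharp
// Sanity-check the model directory before building the agent.
// The exact set of tokenizer files varies by model.
var modelPath = "/path/to/phi-3-mini-onnx";
if (!File.Exists(Path.Combine(modelPath, "genai_config.json")))
{
    throw new FileNotFoundException(
        $"'{modelPath}' is not a valid ONNX GenAI model directory (genai_config.json missing).");
}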

Supported Models

ONNX Runtime GenAI supports a wide range of model architectures:

| Architecture | Examples | Notes |
| --- | --- | --- |
| Llama | Llama 2, Llama 3, Llama 3.1, Llama 3.2 | Including instruct and chat variants |
| Mistral | Mistral 7B, Mixtral 8x7B | MoE models supported |
| Phi | Phi-2, Phi-3, Phi-3.5 | Microsoft small language models |
| Gemma | Gemma 2B, Gemma 7B | Google's open models |
| Qwen | Qwen 1.5, Qwen 2, Qwen 2.5 | Alibaba's multilingual models |
| DeepSeek | DeepSeek Coder, DeepSeek Math | Specialized models |
| SmolLM | SmolLM3 | Small, efficient models |
| Vision | Phi Vision, Qwen-VL | Multi-modal models |
| Audio | Whisper | Speech-to-text models |

Hardware Requirements

| Hardware | Provider | Notes |
| --- | --- | --- |
| CPU | cpu | Works on all platforms, slower than GPU |
| NVIDIA GPU | cuda | Requires CUDA Toolkit 11.8+ |
| AMD GPU | rocm | Requires ROCm 5.4+ |
| Intel GPU | openvino | Requires OpenVINO 2023.0+ |
| Windows GPU | dml | DirectML, works with any GPU |
| Qualcomm NPU | qnn | Snapdragon processors |
| WebGPU | webgpu | Browser-based inference |
| TensorRT | trt | NVIDIA GPUs, optimized |

Advanced Features

Hardware Acceleration

Configure execution providers for optimal performance:

csharp
// CUDA (NVIDIA GPU)
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.Providers = new List<string> { "cuda", "cpu" };
            opts.ProviderOptions = new Dictionary<string, Dictionary<string, string>>
            {
                ["cuda"] = new Dictionary<string, string>
                {
                    ["device_id"] = "0",
                    ["gpu_mem_limit"] = "4294967296" // 4GB
                }
            };
        })
    .Build();

// DirectML (Windows, any GPU)
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.Providers = new List<string> { "dml", "cpu" };
        })
    .Build();

// OpenVINO (Intel hardware)
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.Providers = new List<string> { "openvino", "cpu" };
        })
    .Build();

Constrained Generation

Force the model to generate structured output:

JSON Schema Constraint

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.GuidanceType = "json";
            opts.GuidanceData = @"{
                ""type"": ""object"",
                ""properties"": {
                    ""name"": { ""type"": ""string"" },
                    ""age"": { ""type"": ""number"" },
                    ""email"": { ""type"": ""string"", ""format"": ""email"" }
                },
                ""required"": [""name"", ""age""]
            }";
        })
    .Build();

var response = await agent.RunAsync("Extract user info: John Doe, 30 years old, john@example.com");
// Guaranteed to return valid JSON matching the schema
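
Because the output is guaranteed to match the schema, it can be deserialized directly. A minimal sketch using System.Text.Json (the UserInfo record is illustrative, not part of the provider API, and the response is assumed to expose its text via ToString(), as the examples above imply):

csharp
using System.Text.Json;

// Deserialize the schema-constrained response into a typed object.
var user = JsonSerializer.Deserialize<UserInfo>(
    response.ToString(),
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
Console.WriteLine($"{user!.Name}, age {user.Age}");

// Illustrative type mirroring the schema above.
record UserInfo(string Name, double Age, string? Email);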

Grammar Constraint

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.GuidanceType = "grammar";
            opts.GuidanceData = @"
                root ::= sentence+
                sentence ::= noun verb noun '.'
                noun ::= 'cat' | 'dog' | 'bird'
                verb ::= 'chases' | 'loves' | 'sees'
            ";
        })
    .Build();

Multi-LoRA Adapters

Use LoRA adapters to customize base models:

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/base-model",
        configure: opts =>
        {
            opts.AdapterPath = "/path/to/adapters.onnx_adapter";
            opts.AdapterName = "math_specialist";
        })
    .Build();

var response = await agent.RunAsync("Solve: 2x + 5 = 15");

Deterministic Generation

Generate reproducible outputs:

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.RandomSeed = 42;
            opts.DoSample = true;
            opts.Temperature = 0.7f;
        })
    .Build();

// Same input + same seed = same output every time
var response1 = await agent.RunAsync("Generate a random story");
var response2 = await agent.RunAsync("Generate a random story");
// response1 == response2

Beam Search

Explore multiple generation paths with beam search:

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.NumBeams = 5;
            opts.NumReturnSequences = 3;
            opts.EarlyStopping = true;
            opts.LengthPenalty = 1.2f;
        })
    .Build();

Error Handling

The ONNX Runtime provider includes intelligent error classification and retry logic.

Error Categories

| Category | Retry Behavior | Examples |
| --- | --- | --- |
| ClientError | No retry | Invalid configuration, model not found, corrupted model |
| Transient | Exponential backoff | Out of memory, device busy, timeout |

Common Exceptions

Model Loading Errors

OnnxRuntimeGenAIException: Failed to load model

Solution: Verify model path and ensure all required files exist

Out of Memory

OnnxRuntimeGenAIException: Out of memory

Solution:

  • Reduce MaxLength or BatchSize
  • Use quantized model (int4, int8)
  • Set GPU memory limit in provider options

Invalid Configuration

OnnxRuntimeGenAIException: Invalid parameter

Solution: Check parameter ranges (e.g., temperature 0-2, topP 0-1)

Provider Not Available

OnnxRuntimeGenAIException: Provider 'cuda' not available

Solution: Verify CUDA/GPU drivers installed, or fall back to CPU
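
If you need to handle failures yourself (for example, to fall back to a smaller model after retries are exhausted), errors surface as OnnxRuntimeGenAIException from the Microsoft.ML.OnnxRuntimeGenAI package. A minimal sketch:

csharp
using Microsoft.ML.OnnxRuntimeGenAI;

try
{
    var response = await agent.RunAsync("Hello");
    Console.WriteLine(response);
}
catch (OnnxRuntimeGenAIException ex)
{
    // ClientError-category failures (bad path, invalid config) are not
    // retried by the provider; fix the configuration and try again.
    Console.Error.WriteLine($"ONNX Runtime GenAI error: {ex.Message}");
}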

Examples

Example 1: Basic Local Chat

csharp
using HPD.Agent;
using HPD.Agent.Providers.OnnxRuntime;

var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/phi-3-mini-4k-instruct-onnx")
    .Build();

var response = await agent.RunAsync("Explain quantum computing in simple terms.");
Console.WriteLine(response);

Example 2: GPU-Accelerated Generation

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/llama-3-8b-instruct-onnx",
        configure: opts =>
        {
            opts.Providers = new List<string> { "cuda", "cpu" };
            opts.MaxLength = 4096;
            opts.Temperature = 0.7f;
            opts.DoSample = true;
        })
    .Build();

var response = await agent.RunAsync("Write a detailed essay about artificial intelligence.");

Example 3: Streaming Responses

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model")
    .Build();

await foreach (var chunk in agent.RunAsync("Write a short story about space exploration."))
{
    Console.Write(chunk);
}

Example 4: Structured JSON Output

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.GuidanceType = "json";
            opts.GuidanceData = @"{
                ""type"": ""object"",
                ""properties"": {
                    ""product"": { ""type"": ""string"" },
                    ""price"": { ""type"": ""number"" },
                    ""inStock"": { ""type"": ""boolean"" }
                },
                ""required"": [""product"", ""price"", ""inStock""]
            }";
        })
    .Build();

var response = await agent.RunAsync("Extract product info: MacBook Pro for $2499, available now");
// Returns: {"product":"MacBook Pro","price":2499,"inStock":true}

Example 5: Deterministic Sampling

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.DoSample = true;
            opts.TopK = 50;
            opts.TopP = 0.9f;
            opts.Temperature = 0.8f;
            opts.RandomSeed = 12345;
            opts.RepetitionPenalty = 1.1f;
        })
    .Build();

// Reproducible creative generation
var story1 = await agent.RunAsync("Generate a creative story about robots.");
var story2 = await agent.RunAsync("Generate a creative story about robots.");
// story1 == story2 (deterministic due to seed)

Example 6: Multi-LoRA Adapter Switching

csharp
// Math-specialized agent
var mathAgent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/base-model",
        configure: opts =>
        {
            opts.AdapterPath = "/path/to/adapters.onnx_adapter";
            opts.AdapterName = "math_adapter";
        })
    .Build();

// Code-specialized agent (same base model, different adapter)
var codeAgent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/base-model",
        configure: opts =>
        {
            opts.AdapterPath = "/path/to/adapters.onnx_adapter";
            opts.AdapterName = "code_adapter";
        })
    .Build();

Example 7: Beam Search for Quality

csharp
var agent = await new AgentBuilder()
    .WithOnnxRuntime(
        modelPath: "/path/to/model",
        configure: opts =>
        {
            opts.NumBeams = 4;
            opts.NumReturnSequences = 1;
            opts.EarlyStopping = true;
            opts.LengthPenalty = 1.2f;
            opts.NoRepeatNgramSize = 3;
        })
    .Build();

var response = await agent.RunAsync("Translate to French: The weather is beautiful today.");
// Higher quality translation due to beam search

Troubleshooting

"Model path does not exist"

Problem: Invalid or missing model path.

Solution:

csharp
// Verify path exists
if (!Directory.Exists(modelPath))
{
    throw new DirectoryNotFoundException($"Model not found: {modelPath}");
}

// Use absolute path
.WithOnnxRuntime(modelPath: Path.GetFullPath("/path/to/model"))

// Or read the path from an environment variable
// (set ONNX_MODEL_PATH in your shell or deployment configuration)
.WithOnnxRuntime(modelPath: Environment.GetEnvironmentVariable("ONNX_MODEL_PATH"))

"Out of memory" errors

Problem: Model too large for available memory.

Solution:

csharp
configure: opts =>
{
    // Reduce max length
    opts.MaxLength = 1024; // Instead of 4096

    // Set GPU memory limit
    opts.ProviderOptions = new Dictionary<string, Dictionary<string, string>>
    {
        ["cuda"] = new Dictionary<string, string>
        {
            ["gpu_mem_limit"] = "2147483648" // 2GB
        }
    };

    // Use quantized model (int4 instead of fp16)
    // Download int4 quantized version of your model
}

"Provider 'cuda' not available"

Problem: CUDA provider requested but not installed.

Solution:

csharp
// Option 1: Fall back to CPU
configure: opts =>
{
    opts.Providers = new List<string> { "cuda", "cpu" }; // Tries CUDA first, falls back to CPU
}

// Option 2: Check CUDA installation
// Verify: nvidia-smi command works
// Install: CUDA Toolkit 11.8 or later

// Option 3: Use DirectML (Windows) instead
configure: opts =>
{
    opts.Providers = new List<string> { "dml", "cpu" };
}

"Temperature must be between 0 and 2"

Problem: Invalid temperature value.

Solution:

csharp
configure: opts =>
{
    opts.Temperature = 0.7f;   // Valid: within 0.0-2.0
    // opts.Temperature = 2.5f; // Invalid: outside the range ONNX Runtime accepts
}

Model loading is very slow

Problem: Model loading from disk is slow.

Solution:

csharp
// Option 1: Use SSD instead of HDD for model storage
// Option 2: Enable caching
configure: opts =>
{
    opts.EnableCaching = true;
}

// Option 3: Load model once and reuse agent instance
// Don't create new agents for each request
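
For option 3, a minimal sketch of building the agent once at startup and registering it as a singleton so every request reuses the already-loaded model (ServiceCollection comes from Microsoft.Extensions.DependencyInjection):

csharp
using Microsoft.Extensions.DependencyInjection;

// Model weights are loaded here, once, at startup.
var agent = await new AgentBuilder()
    .WithOnnxRuntime(modelPath: "/path/to/model")
    .Build();

var services = new ServiceCollection();
services.AddSingleton(agent); // resolve this same instance everywhere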

Generation is very slow

Problem: Slow inference speed.

Solution:

csharp
// Option 1: Use GPU acceleration
configure: opts =>
{
    opts.Providers = new List<string> { "cuda" };
}

// Option 2: Use quantized model (int4/int8)
// Download quantized version: significantly faster

// Option 3: Reduce max length
configure: opts =>
{
    opts.MaxLength = 512; // Instead of 2048
}

// Option 4: Enable performance optimizations
configure: opts =>
{
    opts.PastPresentShareBuffer = true; // CUDA only
    opts.ChunkSize = 512; // Enable chunking
}

OnnxRuntimeProviderConfig API Reference

Core Generation Parameters

| Property | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| ModelPath | string? | - | Required | Path to ONNX model directory |
| MaxLength | int? | ≥ 1 | Model-specific | Maximum tokens to generate |
| MinLength | int? | ≥ 0 | 0 | Minimum tokens to generate |
| BatchSize | int? | ≥ 1 | 1 | Batch size for parallel generation |

Sampling Parameters

| Property | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| DoSample | bool? | - | false | Enable randomized sampling |
| Temperature | float? | 0.0-2.0 | 1.0 | Sampling temperature |
| TopK | int? | ≥ 1 | 50 | Top-K sampling |
| TopP | float? | 0.0-1.0 | - | Nucleus sampling threshold |

Repetition Control

| Property | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| RepetitionPenalty | float? | > 0 | 1.0 | Repetition penalty (1.0 = no penalty) |
| NoRepeatNgramSize | int? | ≥ 0 | - | Prevent n-gram repetition |

Beam Search Parameters

| Property | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| NumBeams | int? | ≥ 1 | 1 | Number of beams (1 = greedy) |
| NumReturnSequences | int? | ≥ 1, ≤ NumBeams | 1 | Sequences to return |
| EarlyStopping | bool? | - | true | Stop when num_beams sequences are finished |
| LengthPenalty | float? | > 0 | 1.0 | Length penalty for beam search |
| DiversityPenalty | float? | ≥ 0 | - | Diversity penalty for beams |

Performance Optimization

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| PastPresentShareBuffer | bool? | false | Share KV buffers (CUDA only) |
| ChunkSize | int? | - | Chunk size for prefill chunking |
| EnableCaching | bool | false | Enable conversation caching |

Determinism & Randomness

| Property | Type | Range | Default | Description |
| --- | --- | --- | --- | --- |
| RandomSeed | int? | -1 or ≥ 0 | -1 | RNG seed (-1 = random) |

Stop Sequences

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| StopSequences | List<string>? | - | Custom stop sequences |

Constrained Decoding

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| GuidanceType | string? | - | Guidance type: "json", "grammar", "regex" |
| GuidanceData | string? | - | Schema/grammar/pattern for guidance |
| GuidanceEnableFFTokens | bool | false | Enable fast-forward tokens |

Execution Provider Configuration

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| Providers | List<string>? | - | Execution providers in priority order |
| ProviderOptions | Dictionary<string, Dictionary<string, string>>? | - | Provider-specific options |

Multi-LoRA Adapters

| Property | Type | Default | Description |
| --- | --- | --- | --- |
| AdapterPath | string? | - | Path to adapter file |
| AdapterName | string? | - | Adapter name to activate |
