Ollama Provider

Provider Key: ollama

Overview

The Ollama provider enables HPD-Agent to use local or remote Ollama instances for AI inference. Ollama runs large language models on your own hardware or on a server you control, providing privacy, full control, and zero per-token API costs.

Key Features:

  • Local and remote model execution
  • All Ollama models supported (Llama, Mistral, Qwen, DeepSeek, Gemma, etc.)
  • Streaming support for real-time responses
  • Function/tool calling (model-dependent)
  • Vision support for multimodal models
  • Comprehensive parameter control (40+ options)
  • Reasoning model support (DeepSeek-R1, etc.)
  • JSON output mode
  • Performance tuning (GPU offloading, threading, memory management)
  • No authentication required for local instances

For detailed API documentation, see the OllamaProviderConfig API Reference section below.

Quick Start

Minimal Example

csharp
using HPD.Agent;
using HPD.Agent.Providers.Ollama;

// Uses local Ollama at http://localhost:11434
var agent = await new AgentBuilder()
    .WithOllama(model: "llama3:8b")
    .Build();

var response = await agent.RunAsync("What is the capital of France?");
Console.WriteLine(response);

Installation

bash
dotnet add package HPD-Agent.Providers.Ollama

Dependencies:

  • OllamaSharp - .NET client library for Ollama API
  • Microsoft.Extensions.AI - AI abstractions

Prerequisites:

  • Ollama installed and running (download from https://ollama.com/download)
  • At least one model pulled: ollama pull llama3:8b

Configuration

Configuration Patterns

The Ollama provider supports all three configuration patterns. Choose the one that best fits your needs.

1. Builder Pattern (Fluent API)

Best for: Simple configurations and quick prototyping.

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "llama3:8b",
        configure: opts =>
        {
            opts.Temperature = 0.7f;
            opts.NumPredict = 2048;
            opts.NumCtx = 4096;
            opts.TopP = 0.9f;
        })
    .Build();

2. Config Pattern (Data-Driven)

Best for: Serialization, persistence, and configuration files.

C# Config Object:

csharp
var config = new AgentConfig
{
    Name = "OllamaAgent",
    Provider = new ProviderConfig
    {
        ProviderKey = "ollama",
        ModelName = "llama3:8b",
        Endpoint = "http://localhost:11434"
    }
};

var ollamaOpts = new OllamaProviderConfig
{
    Temperature = 0.7f,
    NumPredict = 2048,
    NumCtx = 4096
};
config.Provider.SetTypedProviderConfig(ollamaOpts);

var agent = await config.BuildAsync();

JSON Config File:

json
{
    "Name": "OllamaAgent",
    "Provider": {
        "ProviderKey": "ollama",
        "ModelName": "llama3:8b",
        "Endpoint": "http://localhost:11434",
        "ProviderOptionsJson": "{\"temperature\":0.7,\"numPredict\":2048,\"numCtx\":4096}"
    }
}
csharp
var agent = await AgentConfig
    .BuildFromFileAsync("ollama-config.json");

3. Hybrid Pattern (Config + Builder)

Best for: Production deployments with reusable configuration and runtime customization.

csharp
// Define base config once
var config = new AgentConfig
{
    Name = "OllamaAgent",
    Provider = new ProviderConfig
    {
        ProviderKey = "ollama",
        ModelName = "llama3:8b"
    }
};

var ollamaOpts = new OllamaProviderConfig
{
    Temperature = 0.7f,
    NumCtx = 4096
};
config.Provider.SetTypedProviderConfig(ollamaOpts);

// Reuse with different runtime customizations
var agent1 = await new AgentBuilder(config)
    .WithServiceProvider(services)
    .WithToolkit<MathToolkit>()
    .Build();

var agent2 = await new AgentBuilder(config)
    .WithServiceProvider(services)
    .WithToolkit<FileToolkit>()
    .Build();

Provider-Specific Options

The OllamaProviderConfig class provides comprehensive configuration options organized by category:

Core Parameters

csharp
configure: opts =>
{
    // Maximum tokens to generate (default: 128, -1 = infinite, -2 = fill context)
    opts.NumPredict = 2048;

    // Context window size (default: 2048)
    opts.NumCtx = 4096;
}

Sampling Parameters

csharp
configure: opts =>
{
    // Temperature: 0.0 = deterministic, 2.0 = very creative (default: 0.8)
    opts.Temperature = 0.7f;

    // Top-P nucleus sampling (default: 0.9)
    opts.TopP = 0.95f;

    // Top-K sampling - limits to K most likely tokens (default: 40)
    opts.TopK = 50;

    // Min-P - minimum probability threshold (default: 0.0)
    opts.MinP = 0.05f;

    // Typical-P - locally typical sampling (default: 1.0)
    opts.TypicalP = 0.95f;

    // Tail-free sampling Z (default: 1.0)
    opts.TfsZ = 2.0f;
}

Repetition Control

csharp
configure: opts =>
{
    // Penalize repetitions (default: 1.1)
    opts.RepeatPenalty = 1.2f;

    // Look-back window for repetitions (default: 64, -1 = num_ctx)
    opts.RepeatLastN = 128;

    // Presence penalty (default: 0.0, range: 0.0-2.0)
    opts.PresencePenalty = 0.5f;

    // Frequency penalty (default: 0.0, range: 0.0-2.0)
    opts.FrequencyPenalty = 0.5f;

    // Penalize newline tokens (default: true)
    opts.PenalizeNewline = false;
}

Determinism & Stop Sequences

csharp
configure: opts =>
{
    // Random seed for reproducible output (default: 0)
    opts.Seed = 12345;

    // Stop sequences
    opts.Stop = new[] { "\n\n", "END", "---" };
}

Mirostat Sampling

Advanced sampling for controlling perplexity:

csharp
configure: opts =>
{
    // Enable Mirostat: 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0
    opts.MiroStat = 2;

    // Mirostat learning rate (default: 0.1)
    opts.MiroStatEta = 0.15f;

    // Mirostat target entropy (default: 5.0)
    opts.MiroStatTau = 4.5f;
}

Context & Memory

csharp
configure: opts =>
{
    // Tokens to keep from initial prompt (default: 4, -1 = all)
    opts.NumKeep = 10;
}

Performance & Hardware

csharp
configure: opts =>
{
    // GPU layers to offload (default: platform-dependent)
    opts.NumGpu = 35;

    // Primary GPU for small tensors (default: 0)
    opts.MainGpu = 0;

    // Enable low VRAM mode (default: false)
    opts.LowVram = true;

    // Use FP16 for key/value cache (default: false)
    opts.F16kv = true;

    // Return logits for all tokens (default: false)
    opts.LogitsAll = true;
}

Threading & Batch Processing

csharp
configure: opts =>
{
    // CPU threads to use (default: auto-detected)
    opts.NumThread = 8;

    // Prompt processing batch size (default: 512)
    opts.NumBatch = 1024;

    // GQA groups for specific models (e.g., 8 for llama2:70b)
    opts.NumGqa = 8;
}

Memory Management

csharp
configure: opts =>
{
    // Use memory mapping (default: true)
    opts.UseMmap = false;

    // Lock model in memory (default: false)
    opts.UseMlock = true;

    // Enable NUMA support (default: false)
    opts.Numa = true;

    // Load only vocabulary, not weights (default: false)
    opts.VocabOnly = false;
}

Ollama-Specific Options

csharp
configure: opts =>
{
    // How long to keep model loaded: "5m", "1h", "-1" = indefinite
    opts.KeepAlive = "10m";

    // Response format: "json" for JSON mode
    opts.Format = "json";

    // Custom prompt template (overrides Modelfile)
    opts.Template = "{{.System}}\nUser: {{.Prompt}}\nAssistant:";

    // Enable thinking for reasoning models: true/false or "high"/"medium"/"low"
    opts.Think = true;
}

Connection & Endpoints

Local Ollama (Default)

csharp
// Uses http://localhost:11434
var agent = await new AgentBuilder()
    .WithOllama(model: "llama3:8b")
    .Build();

Remote Ollama Server

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "mistral",
        endpoint: "http://gpu-server:11434")
    .Build();

Environment Variable Configuration

bash
export OLLAMA_ENDPOINT="http://localhost:11434"
# or
export OLLAMA_HOST="http://localhost:11434"
csharp
// Automatically uses OLLAMA_ENDPOINT or OLLAMA_HOST
var agent = await new AgentBuilder()
    .WithOllama(model: "llama3:8b")
    .Build();

Endpoint Priority

  1. Explicit endpoint parameter in WithOllama()
  2. OLLAMA_ENDPOINT environment variable
  3. OLLAMA_HOST environment variable
  4. Default: http://localhost:11434
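
For example, an explicit endpoint wins even when the environment variables are set (the server names below are placeholders):

csharp
// OLLAMA_ENDPOINT / OLLAMA_HOST may be set, but the explicit
// parameter takes precedence (hypothetical hosts shown)
Environment.SetEnvironmentVariable("OLLAMA_ENDPOINT", "http://env-server:11434");

var agent = await new AgentBuilder()
    .WithOllama(
        model: "llama3:8b",
        endpoint: "http://explicit-server:11434") // this endpoint is used
    .Build();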

Supported Models

Ollama supports a wide variety of models. For the complete and up-to-date list, see the Ollama Model Library (https://ollama.com/library). Popular model families include:

  • Meta Llama - llama3:8b, llama3:70b, llama3.1, llama3.2, llama3.2-vision
  • Mistral AI - mistral, mistral-nemo, mixtral
  • Qwen - qwen3:4b, qwen3:32b, qwq (reasoning)
  • DeepSeek - deepseek-r1:8b, deepseek-r1:70b (reasoning), deepseek-coder
  • Microsoft Phi - phi4, phi4-reasoning
  • Google Gemma - gemma2, gemma2:27b
  • Code Models - codellama, deepseek-coder, qwen2.5-coder
  • Vision Models - llava, bakllava, llama3.2-vision

Model Tags

Ollama uses tags to specify model variants:

model-name:size

Examples:

  • llama3:8b - Llama 3 with 8 billion parameters
  • mistral:latest - Latest Mistral model
  • qwen3:4b - Qwen 3 with 4 billion parameters
  • deepseek-r1:70b - DeepSeek R1 reasoning model, 70B
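
The tag is passed verbatim as the model name, so switching sizes within a family is just a different tag:

csharp
// Same family, different parameter counts - only the tag changes
var smallAgent = await new AgentBuilder()
    .WithOllama(model: "llama3:8b")
    .Build();

var largeAgent = await new AgentBuilder()
    .WithOllama(model: "llama3:70b")
    .Build();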

Pulling Models

Before using a model, pull it with Ollama CLI:

bash
# Pull a specific model
ollama pull llama3:8b

# List available models
ollama list

# Remove a model
ollama rm llama3:8b
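
Because the provider is built on OllamaSharp, you can also verify a model is present programmatically. A minimal sketch using OllamaSharp's OllamaApiClient directly (this client usage is an illustration, not part of the HPD-Agent API):

csharp
using OllamaSharp;

// Sketch: confirm the model has been pulled before building the agent
var client = new OllamaApiClient(new Uri("http://localhost:11434"));
var localModels = await client.ListLocalModelsAsync();

if (!localModels.Any(m => m.Name.StartsWith("llama3:8b")))
    throw new InvalidOperationException(
        "Model not found locally - run: ollama pull llama3:8b");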

Advanced Features

JSON Output Mode

Force the model to respond with valid JSON:

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "llama3:8b",
        configure: opts => opts.Format = "json")
    .Build();

var response = await agent.RunAsync(
    "List 3 programming languages with their release years in JSON format");

Reasoning Models

Use models with thinking/reasoning capabilities:

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "deepseek-r1:8b",
        configure: opts =>
        {
            opts.Think = true; // Enable thinking mode
            opts.NumCtx = 8192; // Larger context for reasoning
            opts.Temperature = 0.6f;
        })
    .Build();

var response = await agent.RunAsync(@"
Solve this logic puzzle:
Three people have different jobs.
- Alice is not a Doctor
- Bob is not an Engineer
- The Doctor's name comes before the Engineer's alphabetically
Who has which job?");

Deterministic Generation

Produce consistent outputs for the same input:

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "llama3:8b",
        configure: opts =>
        {
            opts.Seed = 12345; // Fixed seed
            opts.Temperature = 0.0f; // No randomness
        })
    .Build();

// Same input will always produce same output
var response1 = await agent.RunAsync("Count from 1 to 10");
var response2 = await agent.RunAsync("Count from 1 to 10");
// response1 == response2

Performance Tuning for Large Models

Optimize for GPU-accelerated inference:

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "llama3:70b",
        configure: opts =>
        {
            // GPU configuration
            opts.NumGpu = 40; // Offload 40 layers to GPU
            opts.MainGpu = 0; // Primary GPU

            // CPU/Memory configuration
            opts.NumThread = 8; // 8 CPU threads
            opts.NumBatch = 512; // Batch size

            // Context size
            opts.NumCtx = 4096;
        })
    .Build();

Keep Model Loaded

Control how long models stay in memory:

csharp
configure: opts =>
{
    // Keep loaded for 30 minutes
    opts.KeepAlive = "30m";

    // Keep loaded indefinitely
    // opts.KeepAlive = "-1";

    // Unload immediately after use
    // opts.KeepAlive = "0";
}

Vision Models

Use multimodal models with image understanding:

csharp
var agent = await new AgentBuilder()
    .WithOllama(model: "llava")
    .Build();

// Vision capabilities require image input through message API
// (implementation depends on your image handling approach)
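
A hypothetical sketch of constructing an image message, assuming your application uses the Microsoft.Extensions.AI content types the provider depends on (how the message reaches the agent depends on your message-handling code):

csharp
using Microsoft.Extensions.AI;

// Hypothetical: build a multimodal user message from image bytes.
// Passing it to the agent depends on your message API.
var imageBytes = await File.ReadAllBytesAsync("photo.png");
var message = new ChatMessage(ChatRole.User,
[
    new TextContent("What is in this image?"),
    new DataContent(imageBytes, mediaType: "image/png"),
]);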

Tool/Function Calling

Enable function calling for compatible models:

csharp
public class WeatherToolkit
{
    [Function("Get current weather for a location")]
    public string GetWeather(string location)
    {
        return $"The weather in {location} is sunny, 72°F";
    }
}

var agent = await new AgentBuilder()
    .WithOllama(model: "llama3.1:8b") // Function calling support
    .WithToolkit<WeatherToolkit>()
    .Build();

var response = await agent.RunAsync("What's the weather in Seattle?");

Note: Tool calling support varies by model. Compatible models include:

  • llama3.1:8b and larger
  • mistral-nemo
  • qwen2.5:7b and larger

Error Handling

The Ollama provider includes intelligent error classification and automatic retry logic.

Error Categories

| Category | HTTP Status | Retry Behavior | Examples |
|----------|-------------|----------------|----------|
| ClientError | 400, 404 | No retry | Model not found, invalid parameters |
| Transient | 503, connection | Retry with backoff | Model loading, connection errors |
| ServerError | 500-599 | Retry with backoff | Internal server errors |
| Timeout | 408 | Retry with backoff | Request timeout |

Ollama-Specific Errors

The provider handles OllamaSharp-specific exceptions:

  • OllamaException - General Ollama errors
  • ResponseError - Invalid API responses
  • ModelDoesNotSupportToolsException - Tool calling not supported by model
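
A hedged sketch of catching these at the call site, assuming the exceptions propagate from RunAsync once the provider's retries are exhausted (exact namespaces depend on your OllamaSharp version):

csharp
try
{
    var response = await agent.RunAsync("Summarize this document.");
    Console.WriteLine(response);
}
catch (ModelDoesNotSupportToolsException)
{
    // Fall back to a tool-capable model such as llama3.1:8b
    Console.Error.WriteLine("This model cannot use the registered toolkit.");
}
catch (OllamaException ex)
{
    // A general Ollama failure that survived the retry logic
    Console.Error.WriteLine($"Ollama error: {ex.Message}");
}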

Common Issues

Model Not Found (404)

Problem: Model doesn't exist locally

Solution:

bash
ollama pull llama3:8b
ollama list  # Verify model is available

Model Loading (503)

Problem: Model is being loaded into memory

Solution: The provider automatically retries. First load may take 10-30 seconds.

Connection Refused

Problem: Ollama is not running

Solution:

bash
# Start Ollama
ollama serve

# Or on macOS/Windows, start Ollama app

Out of Memory

Problem: Model too large for available RAM/VRAM

Solution:

csharp
configure: opts =>
{
    opts.NumCtx = 2048; // Reduce context
    opts.LowVram = true; // Enable low VRAM mode
    opts.NumGpu = 20; // Reduce GPU layers
}

Or use a smaller model variant:

csharp
// Instead of llama3:70b, use:
model: "llama3:8b"  // or "llama3:7b"

Tool Calling Not Working

Problem: ModelDoesNotSupportToolsException

Solution: Use a model with function calling support:

csharp
// ✅ Supported
model: "llama3.1:8b"
model: "mistral-nemo"
model: "qwen2.5:7b"

// ❌ Not supported
model: "llama3:8b"  // Llama 3.0 doesn't support tools

Examples

Example 1: Basic Chat with Llama

csharp
using HPD.Agent;
using HPD.Agent.Providers.Ollama;

var agent = await new AgentBuilder()
    .WithOllama(model: "llama3:8b")
    .Build();

var response = await agent.RunAsync("Explain quantum computing in simple terms.");
Console.WriteLine(response);

Example 2: Creative Writing

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "mistral",
        configure: opts =>
        {
            opts.Temperature = 1.2f; // More creative
            opts.TopP = 0.95f;
            opts.RepeatPenalty = 1.2f; // Reduce repetition
        })
    .Build();

var story = await agent.RunAsync("Write a creative short story about a robot learning to paint");

Example 3: Code Generation

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "qwen2.5-coder:7b",
        configure: opts =>
        {
            opts.Temperature = 0.2f; // Low for precise code
            opts.NumPredict = 2048;
        })
    .Build();

var code = await agent.RunAsync(@"
Write a Python function to calculate fibonacci numbers using dynamic programming.
Include docstring and type hints.");

Example 4: Streaming Responses

csharp
var agent = await new AgentBuilder()
    .WithOllama(model: "llama3:8b")
    .Build();

Console.Write("Response: ");
await foreach (var chunk in agent.RunAsync("Write a short story about AI"))
{
    Console.Write(chunk);
}
Console.WriteLine();

Example 5: Multi-Model Deployment

csharp
// Fast small model for simple queries
var quickAgent = await new AgentBuilder()
    .WithOllama(
        model: "qwen3:4b",
        configure: opts =>
        {
            opts.NumCtx = 4096;
            opts.NumPredict = 512;
        })
    .Build();

// Powerful large model for complex tasks
var powerAgent = await new AgentBuilder()
    .WithOllama(
        model: "llama3:70b",
        configure: opts =>
        {
            opts.NumCtx = 8192;
            opts.NumGpu = 40;
        })
    .Build();

// Route based on complexity ('complexity' and 'prompt' come from your app)
var response = complexity == "simple"
    ? await quickAgent.RunAsync(prompt)
    : await powerAgent.RunAsync(prompt);

Example 6: JSON Structured Output

csharp
using System.Text.Json;

var agent = await new AgentBuilder()
    .WithOllama(
        model: "qwen3:4b",
        configure: opts =>
        {
            opts.Format = "json";
            opts.Temperature = 0.3f;
        })
    .Build();

var jsonResponse = await agent.RunAsync(@"
List 5 programming languages with their release years.
Format: {""languages"": [{""name"": ""..."", ""year"": 1990}]}");

// Parse the JSON response (web defaults handle the camelCase keys)
var data = JsonSerializer.Deserialize<LanguageData>(
    jsonResponse,
    new JsonSerializerOptions(JsonSerializerDefaults.Web));

record LanguageData(List<LanguageInfo> Languages);
record LanguageInfo(string Name, int Year);

Example 7: Reasoning with DeepSeek-R1

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "deepseek-r1:8b",
        configure: opts =>
        {
            opts.Think = true; // Enable thinking/reasoning
            opts.NumCtx = 8192; // Large context for reasoning chains
            opts.Temperature = 0.6f;
        })
    .Build();

var response = await agent.RunAsync(@"
A farmer has 17 sheep, and all but 9 die. How many are left?
Think through this step-by-step.");

Example 8: Remote Server with Custom Configuration

csharp
var agent = await new AgentBuilder()
    .WithOllama(
        model: "mixtral",
        endpoint: "http://gpu-server:11434",
        configure: opts =>
        {
            opts.Temperature = 0.7f;
            opts.NumCtx = 8192;
            opts.NumPredict = 4096;
            opts.KeepAlive = "1h"; // Keep loaded for 1 hour
        })
    .Build();

Troubleshooting

"Model name is required"

Problem: Missing model name in configuration.

Solution:

csharp
// ✅ Correct
.WithOllama(model: "llama3:8b")

// ❌ Incorrect
.WithOllama(model: null)

"Connection refused to localhost:11434"

Problem: Ollama is not running.

Solution:

bash
# Check if Ollama is running
curl http://localhost:11434/api/tags

# If not, start Ollama
ollama serve  # Linux/CLI
# or launch Ollama desktop app on macOS/Windows

"Model 'llama3:8b' not found"

Problem: Model not pulled locally.

Solution:

bash
# Pull the model
ollama pull llama3:8b

# Verify it's available
ollama list

# Use the exact name shown in the list

Slow First Response

Problem: Model is being loaded into memory.

Solution: This is normal. Keep the model loaded:

csharp
configure: opts => opts.KeepAlive = "30m"  // Keep in memory

Or pre-load:

bash
ollama run llama3:8b "test"  # Pre-loads the model

"Temperature must be between 0 and 2"

Problem: Invalid temperature value.

Solution:

csharp
// ✅ Valid range for Ollama
opts.Temperature = 0.7f;  // 0.0 to 2.0

// ❌ Invalid
opts.Temperature = 3.0f;

Model Using Too Much Memory

Problem: Model consuming too much RAM/VRAM.

Solution:

csharp
configure: opts =>
{
    // Reduce context size
    opts.NumCtx = 2048;  // Instead of 4096+

    // Enable low VRAM mode
    opts.LowVram = true;

    // Reduce GPU layers
    opts.NumGpu = 20;  // Instead of 40
}

Or use a quantized/smaller model:

bash
ollama pull llama3:8b  # Instead of llama3:70b

Function Calling Returns Errors

Problem: Model doesn't support tool calling.

Solution: Ensure you're using a compatible model:

csharp
// ✅ These support function calling
model: "llama3.1:8b"   // Llama 3.1+
model: "mistral-nemo"
model: "qwen2.5:7b"

// ❌ These don't support function calling
model: "llama3:8b"     // Llama 3.0
model: "mistral:7b"    // Base Mistral

OllamaProviderConfig API Reference

Core Parameters

| Property | Type | Range | Default | Description |
|----------|------|-------|---------|-------------|
| NumPredict | int? | ≥ -2 | 128 | Max tokens to generate (-1 = infinite, -2 = fill context) |
| NumCtx | int? | ≥ 1 | 2048 | Context window size |

Sampling Parameters

| Property | Type | Range | Default | Description |
|----------|------|-------|---------|-------------|
| Temperature | float? | 0.0-2.0 | 0.8 | Sampling temperature (creativity) |
| TopP | float? | 0.0-1.0 | 0.9 | Nucleus sampling threshold |
| TopK | int? | ≥ 1 | 40 | Top-K sampling limit |
| MinP | float? | 0.0-1.0 | 0.0 | Minimum probability threshold |
| TypicalP | float? | 0.0-1.0 | 1.0 | Locally typical sampling |
| TfsZ | float? | ≥ 0.0 | 1.0 | Tail-free sampling |

Repetition Control

| Property | Type | Range | Default | Description |
|----------|------|-------|---------|-------------|
| RepeatPenalty | float? | ≥ 0.0 | 1.1 | Penalize repetitions |
| RepeatLastN | int? | - | 64 | Look-back window (0 = disabled, -1 = num_ctx) |
| PresencePenalty | float? | 0.0-2.0 | 0.0 | Penalize token presence |
| FrequencyPenalty | float? | 0.0-2.0 | 0.0 | Penalize token frequency |
| PenalizeNewline | bool? | - | true | Penalize newline tokens |

Determinism

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| Seed | int? | 0 | Random seed for reproducibility |
| Stop | string[]? | - | Stop sequences |

Mirostat Sampling

| Property | Type | Range | Default | Description |
|----------|------|-------|---------|-------------|
| MiroStat | int? | 0-2 | 0 | Mirostat mode (0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0) |
| MiroStatEta | float? | ≥ 0.0 | 0.1 | Mirostat learning rate |
| MiroStatTau | float? | ≥ 0.0 | 5.0 | Mirostat target entropy |

Context & Memory

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| NumKeep | int? | 4 | Tokens to keep from prompt (-1 = all) |

Performance & Hardware

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| NumGpu | int? | Platform-dependent | GPU layers to offload |
| MainGpu | int? | 0 | Primary GPU index |
| LowVram | bool? | false | Enable low VRAM mode |
| F16kv | bool? | false | Use FP16 for key/value cache |
| LogitsAll | bool? | false | Return logits for all tokens |

Threading & Batch Processing

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| NumThread | int? | Auto-detected | CPU threads to use |
| NumBatch | int? | 512 | Prompt processing batch size |
| NumGqa | int? | - | GQA groups (model-specific) |

Memory Management

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| UseMmap | bool? | true | Use memory mapping |
| UseMlock | bool? | false | Lock model in memory |
| Numa | bool? | false | Enable NUMA support |
| VocabOnly | bool? | false | Load only vocabulary |

Ollama-Specific Options

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| KeepAlive | string? | - | How long to keep model loaded ("5m", "1h", "-1") |
| Format | string? | - | Response format ("json" for JSON mode) |
| Template | string? | - | Custom prompt template |
| Think | object? | - | Enable thinking (true/false or "high"/"medium"/"low") |

Advanced Options

| Property | Type | Default | Description |
|----------|------|---------|-------------|
| AdditionalProperties | Dictionary<string, object>? | - | Custom parameters |
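
AdditionalProperties is the escape hatch for options without a typed property. A minimal sketch, assuming keys are forwarded verbatim as Ollama request options (verify against your provider version):

csharp
configure: opts =>
{
    // Assumption: these keys pass through unchanged to the Ollama API
    opts.AdditionalProperties = new Dictionary<string, object>
    {
        ["num_predict"] = 256,
        ["top_k"] = 20
    };
}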

Additional Resources

Released under the MIT License.