A browser-based UI for launching, monitoring, and managing multiple llama.cpp server instances. LlamaMan runs as a lightweight Python container and spawns llama-server as sibling Docker containers using the official llama.cpp images. Includes an Ollama-compatible API proxy so it works as a drop-in replacement for Ollama with Open WebUI.
- Model library - scans `/models` for GGUF files, shows quant type and file size
- One-click launch - configure GPU layers, context size, threads, multi-GPU, extra args
- Preset configs - save/load per-model launch settings
- Download manager - pull models from HuggingFace with speed throttling and auto-retry on failure
- Model backup and restore - export all model metadata and presets to JSON, restore on any instance by re-queuing missing downloads automatically
- Instance management - stop, restart, remove, view live-streamed logs
- GPU VRAM indicator - per-GPU VRAM and utilization, queried natively (no running instance required)
- Container resource monitoring - live CPU%, core quota, RAM usage, and GPU assignment per running instance card
- Idle timeout - auto-sleep instances after configurable idle period, wake on next request
- Ollama-compatible proxy - OpenWebUI discovers models and auto-starts servers on demand
- Authentication - user accounts with session login, API key management with bearer tokens
- Require auth toggle - enforce bearer token authentication on all endpoints (including model loading) or leave model endpoints open
- Persistent state - instance history and configs survive container restarts
- Storage backends - JSON files (default) or MariaDB/MySQL via SQLAlchemy
- Proxy sampling overrides - force temperature, top-k, top-p, presence penalty, and repeat penalty on all proxied requests, configurable per model preset
- Docker image management - pull any llama.cpp image by name, delete old local images from the Settings UI
- Universal GPU support - single `Dockerfile` and image for NVIDIA, AMD (ROCm), Intel Arc, and CPU. GPU vendor is auto-detected at startup; `GPU_TYPE` overrides if needed. `LLAMA_IMAGE` is also auto-selected from the detected vendor.
- Native GPU monitoring - VRAM and utilization are queried inside the llamaman container directly (pynvml for NVIDIA, `/sys/class/drm` sysfs for AMD/Intel Arc), so the GPU panel works without a running llama-server instance.
- Container resource monitoring - each running instance card shows live CPU%, core quota, RAM used/limit, and GPU assignment, updated every 3 seconds via a separate poll.
- Docker image management - pull any llama.cpp image by name, delete old local images, all from the Settings UI.
- Model backup and restore - export all model metadata and presets to JSON; restore on any instance with downloads queued automatically for missing models.
- Repeat penalty in proxy sampling overrides - configurable per preset, default 0 (disabled).
- CPU quota + memory limit - setting CPU Threads now also applies a Docker `nano_cpus` quota; a new Memory Limit field caps container RAM.
LlamaMan is a lightweight Python web app with no dependency on llama.cpp itself. When you launch a model, LlamaMan uses the Docker socket to spawn a ghcr.io/ggml-org/llama.cpp:server-* container as a sibling on the host. GPU passthrough, port binding, and volume mounts are configured per-container via the Docker SDK.
Host machine
├── Docker daemon
│ ├── llamaman container (Python only - no GPU usage - only monitoring, no llama.cpp)
│ │ └── /var/run/docker.sock (talks to Docker daemon)
│ ├── llamaman-<id> container (llama.cpp:server-cuda, GPU attached)
│ └── llamaman-<id> container (llama.cpp:server-cuda, GPU attached)
└── GPU hardware
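Conceptually, the sibling spawn is a single Docker SDK call. The following Python sketch illustrates the idea; the image, ports, paths, and llama-server flags are example values, not LlamaMan's exact internals:

```python
import docker

# Illustrative sketch: spawn a sibling llama-server container via the Docker socket.
client = docker.from_env()  # talks to /var/run/docker.sock

container = client.containers.run(
    image="ghcr.io/ggml-org/llama.cpp:server-cuda",
    command=["-m", "/models/my-model.gguf", "--host", "0.0.0.0", "--port", "8080",
             "--ctx-size", "4096", "-ngl", "-1"],
    name="llamaman-abcd1234",
    detach=True,
    network="llamaman-net",
    ports={"8080/tcp": 8000},  # public host port from the 8000-8020 pool
    volumes={"/absolute/host/path/to/models": {"bind": "/models", "mode": "ro"}},
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],  # NVIDIA passthrough
)
print(container.id)
```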
To update llama.cpp - no llamaman rebuild needed:
docker pull ghcr.io/ggml-org/llama.cpp:server-cuda

Requirements:
- Docker with access to `/var/run/docker.sock`
- One of:
- NVIDIA Container Toolkit for NVIDIA GPUs
- ROCm-compatible setup for AMD GPUs
- Intel Arc with `/dev/dri` access for Intel GPUs
- The matching llama.cpp server image pulled on the host (see Quick Start)
Before starting, edit docker-compose.yml and set the two host path variables to match your volume mount sources:
- HOST_MODELS_DIR=/absolute/host/path/to/models
- HOST_LOGS_DIR=/absolute/host/path/to/logs
These must be the real paths on the Docker host. LlamaMan passes them to the Docker daemon when spawning sibling llama-server containers, so they must resolve on the host - not inside the llamaman container.
NVIDIA:
docker pull ghcr.io/ggml-org/llama.cpp:server-cuda
docker compose up --build
For native VRAM monitoring, also uncomment the `deploy.resources.reservations` block in `docker-compose.yml`.
AMD (ROCm):
docker pull ghcr.io/ggml-org/llama.cpp:server-rocm
# Edit docker-compose.yml: set LLAMA_IMAGE=ghcr.io/ggml-org/llama.cpp:server-rocm
docker compose up --build
Intel Arc:
docker pull ghcr.io/ggml-org/llama.cpp:server-sycl
# Edit docker-compose.yml: set LLAMA_IMAGE=ghcr.io/ggml-org/llama.cpp:server-sycl
docker compose up --build
CPU only:
docker pull ghcr.io/ggml-org/llama.cpp:server
# Edit docker-compose.yml: set LLAMA_IMAGE=ghcr.io/ggml-org/llama.cpp:server
docker compose up --build
- Management UI: http://localhost:5000
- Llamaman proxy (Ollama-compatible API): http://localhost:42069
- llama-server public instance ports: 8000-8020
On first launch, visit the UI to create an admin account via /setup.
Note: LlamaMan needs access to the Docker socket (`/var/run/docker.sock`) to spawn llama-server containers. This is already configured in `docker-compose.yml`. Be aware of the security implications - a container with Docker socket access can manage other containers on the host.
LlamaMan has a built-in auth system with two layers:
On first launch, /setup lets you create an admin account. After that, all browser access requires login. Session cookies authenticate UI requests.
Create API keys in the API Keys section of the UI. External clients (OpenWebUI, scripts, etc.) authenticate with:
Authorization: Bearer llm-xxxxxxxxxx
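For example, a script could list models through the proxy with a key like this (the key value is a placeholder; the response shape follows the standard Ollama `/api/tags` format):

```python
import requests

# Call the Ollama-compatible proxy with a bearer token (placeholder key).
headers = {"Authorization": "Bearer llm-xxxxxxxxxx"}
resp = requests.get("http://localhost:42069/api/tags", headers=headers, timeout=30)
resp.raise_for_status()
print([m["name"] for m in resp.json()["models"]])
```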
The "Require authentication for all endpoints" toggle (on by default) controls whether model-serving endpoints require a bearer token:
| Toggle | Model endpoints (/api/chat, /v1/chat/completions, etc.) | Management endpoints (/api/instances, etc.) | Per-instance proxy ports |
|---|---|---|---|
| ON (default) | Bearer token required | Bearer token or session required | Bearer token required |
| OFF | Open (no auth) | Bearer token or session required | Open (no auth) |
When the toggle is ON, all three port surfaces are protected:
- Port 5000 (management UI + API) - Flask `before_request` hook
- Port 42069 (Ollama-compatible proxy) - same Flask app, same hook
- Ports 8000-8020 (per-instance proxies) - WSGI-level auth check (see the sketch below)
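A minimal sketch of what such a WSGI-level bearer check might look like (illustrative only; `VALID_KEYS` and the wrapped app are placeholders, not LlamaMan internals):

```python
# Minimal WSGI middleware sketch: reject requests without a valid bearer token.
VALID_KEYS = {"llm-xxxxxxxxxx"}  # placeholder key set

def require_bearer(app):
    def middleware(environ, start_response):
        auth = environ.get("HTTP_AUTHORIZATION", "")
        token = auth[len("Bearer "):] if auth.startswith("Bearer ") else None
        if token not in VALID_KEYS:
            start_response("401 Unauthorized", [("Content-Type", "application/json")])
            return [b'{"error": "API key required"}']
        return app(environ, start_response)  # token valid: pass through to the proxied app
    return middleware
```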
When require_auth is on, configure OpenWebUI to send a valid API key:
open-webui:
environment:
- OLLAMA_BASE_URL=http://llamaman:42069
- OPENAI_API_BASE_URLS=http://llamaman:42069/v1
- OPENAI_API_KEYS=llm-your-api-key-here
Place models inside the models/ volume:
- GGUF files: any `.gguf` file (recommended - llama.cpp native format)
- HuggingFace repos: directories containing `config.json`
Or use the Download button in the UI to pull from HuggingFace.
- Select a model from the sidebar
- Configure launch settings (GPU layers, context size, idle timeout, etc.)
- Click Launch - LlamaMan spawns a llama-server container and the instance appears with a status badge
- Optionally click Save Preset to remember settings for that model
Each instance exposes an OpenAI-compatible API on its assigned port.
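For example, once an instance is healthy on port 8000 you could call it directly (a hedged sketch; the model name and prompt are placeholders):

```python
import requests

# Query a running instance's OpenAI-compatible endpoint on its public port.
payload = {
    "model": "my-model",  # optional; see the model-name check described below
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64,
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```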
When you select a GGUF model, LlamaMan reads the file's metadata to detect the total number of layers (block count). This is displayed next to the GPU Layers input so you can see exactly how many layers are available to offload (e.g. / 32). Set GPU Layers to -1 to offload all layers to GPU.
| Setting | Default | Description |
|---|---|---|
| GPU Layers | -1 | Number of layers to offload to GPU. -1 = all layers, 0 = CPU only. Total layers are autodetected from the GGUF file. |
| Context Size | 4096 | Maximum context window in tokens (--ctx-size). |
| Parallel | 1 | Number of parallel sequences the llama-server can process simultaneously (--parallel). Controls KV cache slot allocation inside the server itself. |
| Idle Timeout min | 0 | Minutes of inactivity before the server is stopped to free VRAM. 0 = disabled. See Idle Timeout. |
| Max Concurrent | 0 | Maximum number of inference requests allowed in-flight at once. 0 = unlimited. When set, incoming requests are queued and gated by a semaphore. |
| Max Queue Depth | 200 | Maximum number of requests that can wait in the queue when Max Concurrent is active. Requests beyond this limit are rejected with HTTP 429. |
| Share Queue | off | When enabled, multiple proxy-managed instances of the same model share a single request queue. Incoming requests are distributed across instances as slots become available, providing simple load balancing. |
| Embedding Model | off | Marks the instance as an embedding model. Embedding instances are excluded from the LLAMAMAN_MAX_MODELS count and will never be evicted by the proxy's LRU policy. |
| CPU Threads | (auto) | Sets both --threads N for llama-server and the container's CPU quota (--cpus N). Leave blank to let the container and llama-server use all available cores. |
| Memory Limit | (none) | Hard memory cap for the llama-server container (e.g. 32g, 8192m). Equivalent to deploy.resources.limits.memory in Docker Compose. Leave blank for no limit. |
| GPU Devices | (global default) | Comma-separated GPU indices to make visible to this container (e.g. 0,1). Overrides LLAMA_GPU_DEVICES for this instance. Leave blank to use the global default. Not supported on Intel Arc. |
| Extra Args | (empty) | Additional flags passed directly to llama-server (e.g. --flash-attn). |
| Proxy Sampling Overrides | off | When enabled, the proxy forces the configured sampling parameters on every request forwarded to this instance, regardless of what the client sends. |
| Temperature | 0.8 | Sampling temperature to enforce (range: 0.0–2.0). Only active when proxy sampling overrides are enabled. |
| Top K | 40 | Top-k sampling value to enforce (min: 0). Only active when proxy sampling overrides are enabled. |
| Top P | 0.95 | Top-p (nucleus) sampling value to enforce (range: 0.01–1.0). Only active when proxy sampling overrides are enabled. |
| Presence Penalty | 0.0 | Presence penalty to enforce (range: -2.0–2.0). Only active when proxy sampling overrides are enabled. |
| Repeat Penalty | 0.0 | Repeat penalty to enforce (range: 0.0–2.0). 0 = disabled (not injected). Only active when proxy sampling overrides are enabled. |
When Max Concurrent is set to a value greater than 0, LlamaMan places a concurrency gate in front of the instance. Requests that exceed the limit are held in a FIFO queue (up to Max Queue Depth). If the queue is also full, new requests are rejected with HTTP 429.
The gate tracks active and queued request counts, which are visible in the instance list via the API.
Parallel vs Max Concurrent: Parallel controls how many sequences the llama-server processes internally (KV cache slots). Max Concurrent is an external gate that limits how many requests LlamaMan forwards to the server at once. You can use both together - for example, Parallel=4 with Max Concurrent=4 ensures the server always has enough KV slots for the requests it receives.
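The gate behaves roughly like this simplified sketch (an illustration of the semantics, not LlamaMan's actual implementation):

```python
import threading

# Simplified concurrency gate: at most MAX_CONCURRENT requests in flight,
# at most MAX_QUEUE_DEPTH waiting for a slot; anything beyond that gets a 429.
MAX_CONCURRENT = 4
MAX_QUEUE_DEPTH = 200

slots = threading.BoundedSemaphore(MAX_CONCURRENT)
waiting = 0
lock = threading.Lock()

def handle(forward_request):
    global waiting
    with lock:
        if waiting >= MAX_QUEUE_DEPTH:
            return 429, "queue full"       # reject immediately, queue is at capacity
        waiting += 1
    slots.acquire()                        # block until an in-flight slot frees up
    with lock:
        waiting -= 1
    try:
        return forward_request()           # proxy the request to llama-server
    finally:
        slots.release()                    # free the slot for the next queued request
```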
LlamaMan queries GPU VRAM and utilization natively - no running llama-server instance required.
| Vendor | Method | Requirement |
|---|---|---|
| NVIDIA | pynvml (NVML library direct) | Uncomment the deploy.resources.reservations block in docker-compose.yml to grant the llamaman container the NVIDIA toolkit utility capability |
| AMD | /sys/class/drm sysfs | /sys/class/drm:ro volume mount (included in docker-compose.yml by default) |
| Intel Arc | /sys/class/drm sysfs | Same mount as AMD |
When native access is not configured, LlamaMan falls back to exec-ing nvidia-smi / rocm-smi inside a running llama-server container (previous behavior). Stats always reflect the full host GPU state, not just a single container's usage.
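For reference, the two native query paths look roughly like the following sketch (the pynvml calls are standard NVML bindings; the sysfs filenames shown are the amdgpu ones and can differ by driver and kernel):

```python
import pynvml

# NVIDIA path: query VRAM and utilization via NVML.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU0: {mem.used / 2**20:.0f}/{mem.total / 2**20:.0f} MiB VRAM, {util.gpu}% util")

# AMD / Intel Arc path: read DRM sysfs counters (filenames below are amdgpu's;
# other drivers may expose different files).
with open("/sys/class/drm/card0/device/mem_info_vram_used") as f:
    used = int(f.read())
with open("/sys/class/drm/card0/device/mem_info_vram_total") as f:
    total = int(f.read())
print(f"card0: {used / 2**20:.0f}/{total / 2**20:.0f} MiB VRAM")
```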
Set Idle Timeout min in the launch form (0 = disabled). When enabled:
- The manager proxies the instance port (transparent to clients)
- After N minutes of no requests, the llama-server container is stopped to free VRAM
- On the next request, a new container is spawned with the same config
- Client sees the same port/API with just a cold-start delay
For instances managed by the llamaman proxy (OpenWebUI), use the LLAMAMAN_IDLE_TIMEOUT env var instead.
When any of the following are enabled for an instance, LlamaMan inserts a WSGI proxy in front of the llama-server container on that port: Idle Timeout, Max Concurrent, or Proxy Sampling Overrides. The public port (e.g. 8000) is handled by the proxy; the llama-server container listens internally on a separate port.
The proxy enforces that requests reach the correct model. On inference endpoints (/v1/chat/completions, /v1/completions, /v1/embeddings, /completion, /chat/completions):
- If the request body includes a `"model"` field, the proxy compares it against the loaded model's filename stem (lowercased, without extension). A prefix match is accepted - e.g. `"qwen2.5-0.5b-instruct-q2"` matches `"qwen2.5-0.5b-instruct-q2_k"`. A mismatch returns HTTP 404: `{"error": "model 'wrong-model' is not loaded on this port"}`
- If the request body has no `"model"` field, the request is forwarded unconditionally.
This check applies whether the instance is currently running or sleeping. For sleeping instances, a mismatched model name prevents the wake - no container is spawned.
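The comparison described above boils down to something like this illustrative sketch:

```python
from pathlib import Path

def model_matches(requested: str, loaded_model_path: str) -> bool:
    """Prefix-match the requested model name against the loaded file's stem."""
    stem = Path(loaded_model_path).stem.lower()   # filename without extension, lowercased
    return stem.startswith(requested.lower())

# "qwen2.5-0.5b-instruct-q2" matches "qwen2.5-0.5b-instruct-q2_k.gguf"
assert model_matches("qwen2.5-0.5b-instruct-q2", "/models/qwen2.5-0.5b-instruct-q2_k.gguf")
# A mismatch causes the proxy to return HTTP 404 instead of forwarding.
assert not model_matches("wrong-model", "/models/qwen2.5-0.5b-instruct-q2_k.gguf")
```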
When an instance with idle timeout is sleeping and a request arrives:
- If the request carries a `"model"` field that does not match >> HTTP 404, no wake
- If the model matches (or there is no model field) >> a new container is spawned, the request is held until the server is healthy, then forwarded
The UI provides download-related options under Settings >> Download Settings:
- Auto-retry failed downloads - automatically retries downloads that fail due to network errors or interruptions. Off by default.
- Retry count per failed download - how many times to retry before marking a download as permanently failed (default: 3, min: 1). Only active when auto-retry is enabled.
Settings >> Docker Images lets you manage the llama.cpp server images used to spawn containers:
- Pull image by name - type any image name (e.g. `ghcr.io/ggml-org/llama.cpp:server-cuda`) and pull it directly, without it needing to be in the tracked list first
- Delete local image - each tracked image has a delete button that removes it from Docker and from the tracked list. Disabled for the active `LLAMA_IMAGE`. Returns an error if Docker refuses (e.g. image in use by a running container)
- Auto-update - optionally pull the active image on a configurable interval
Settings >> App Settings provides export and restore for model metadata and presets:
- Download Stored Models JSON - exports all scanned models with their preset configs to a timestamped JSON file. Use this to back up your configuration or migrate to a new host.
- Restore from JSON - upload a previously exported JSON. For each model in the file:
- Already present on disk: preset is merged in (existing values are not overwritten)
- Not present but has a HuggingFace source: download is queued immediately and preset is pre-populated at the expected path so it is ready when the file lands
- Not present and no known source: reported as unrestorable
The UI provides automatic cleanup under Settings >> Cleanup Settings:
- Auto-clean completed/failed downloads - removes download records older than a configurable number of hours (default: 24). Only affects completed, failed, or cancelled downloads - active downloads are never touched.
- Auto-clean stopped instances - removes stopped instance records older than a configurable number of hours (default: 24). Only affects stopped instances - running instances are never removed.
- Auto-remove stale instance records - periodically checks all `starting`/`healthy`/`sleeping` instance records against their actual Docker containers. Records whose backing container is no longer running are marked stopped. Configurable check interval (default: 5 minutes).
Cleanup runs periodically in the background. These settings only remove or update records in the UI/state - they do not delete model files.
The llamaman proxy exposes an Ollama-compatible API on port 42069 (configurable). Point OpenWebUI at it:
open-webui:
environment:
- OLLAMA_BASE_URL=http://llamaman:42069
How it works:
- OpenWebUI calls `/api/tags` -> LlamaMan returns all available GGUF models
- User selects a model in OpenWebUI -> `/api/chat` request arrives
- LlamaMan spawns a llama-server container (using saved preset or defaults)
- Waits for healthy, then proxies the request with format translation
- When the `LLAMAMAN_MAX_MODELS` limit is reached, the least-recently-used Ollama-managed model is evicted. Admin UI launched models are never evicted by the Ollama API by default (see Model eviction policy)
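You can also exercise the proxy directly without OpenWebUI. A hedged example (placeholder model name and API key; the response shape follows the Ollama chat format):

```python
import requests

# Ask the Ollama-compatible proxy for a chat completion; LlamaMan auto-starts
# a llama-server container for the model if one is not already running.
payload = {
    "model": "Mistral-7B-Instruct-v0.3-Q4_K_M",   # placeholder model name
    "messages": [{"role": "user", "content": "Say hi"}],
    "stream": False,
}
resp = requests.post(
    "http://localhost:42069/api/chat",
    json=payload,
    headers={"Authorization": "Bearer llm-xxxxxxxxxx"},  # needed when require_auth is on
    timeout=600,  # allow for cold-start of the container
)
print(resp.json()["message"]["content"])
```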
Supported Ollama endpoints: /api/tags, /api/chat, /api/generate, /api/show, /api/version, /api/ps
Also supports OpenAI-compatible endpoints with auto-start: /v1/models, /v1/chat/completions
The LLAMAMAN_MAX_MODELS limit controls how many chat models the proxy will keep loaded simultaneously. When a new model is requested and the limit is reached, the least-recently-used (LRU) chat model is evicted to make room.
Admin UI launched models have the highest priority. The two API surfaces have different eviction rights:
| Launcher | Eviction behaviour | Cannot evict |
|---|---|---|
| Admin UI | Evicts Ollama-managed models first (LRU), then admin UI models if needed | - |
| Ollama API (/api/chat, /api/generate) | Evicts Ollama-managed models (LRU) | Admin UI launched models (by default) |
| OpenAI API (/v1/chat/completions) | No eviction - starts model only if a slot is free | Everything |
If the cap is full, requests that cannot evict return HTTP 503:
model limit reached (LLAMAMAN_MAX_MODELS=N); admin-launched models cannot be evicted via the API
model limit reached (LLAMAMAN_MAX_MODELS=N); the OpenAI API does not evict running models
Two toggles in Settings >> App Settings control eviction behaviour:
- Enforce `LLAMAMAN_MAX_MODELS` for admin UI launches - when on, the admin UI silently evicts the LRU model (Ollama-managed first) before launching. When off (default), the UI prompts you to confirm before exceeding the cap.
- Allow Ollama API to evict admin-launched models - when on, the Ollama API can also evict admin UI launched models as a fallback if no Ollama-managed models are available to evict. Off by default. Has no effect on the OpenAI API, which never evicts.
- All running instances count toward the limit - both admin UI and proxy-managed instances. If you manually launch 2 models and `LLAMAMAN_MAX_MODELS=1`, the proxy sees you are already over the limit.
- Embedding models are excluded. Instances marked as Embedding Model do not count toward the limit and are never evicted. This lets you keep an embedding model loaded permanently alongside your chat models.
- `LLAMAMAN_MAX_MODELS=0` (default) disables eviction entirely. The proxy will launch models on demand without ever stopping existing ones.
Zero-config. Stores data in JSON files under DATA_DIR (/data):
- `state.json` - instances and downloads
- `presets.json` - per-model launch presets
- `users.json` - user accounts
- `settings.json` - global settings
- `api_keys.json` - API key hashes
Instance and download logs are written to LOGS_DIR (/tmp/llama-logs), which is separate from persistent data.
Create the database and a dedicated user:
CREATE DATABASE llamaman CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'llamaman'@'%' IDENTIFIED BY 'yourpassword';
GRANT ALL PRIVILEGES ON llamaman.* TO 'llamaman'@'%';
FLUSH PRIVILEGES;
Then set DATABASE_URL to enable it:
environment:
- DATABASE_URL=mysql+pymysql://llamaman:yourpassword@host:3306/llamaman
Tables are auto-created on first connection. Requires sqlalchemy and pymysql (included in requirements).
| Variable | Default | Description |
|---|---|---|
| MODELS_DIR | /models | Directory scanned for model files (container path) |
| DATA_DIR | /data | Directory for persistent config/state (JSON files) |
| LOGS_DIR | /tmp/llama-logs | Directory for instance and download logs (container path) |
| HOST_MODELS_DIR | (same as MODELS_DIR) | Host-side absolute path of the models volume - must match the left side of -v /host/path/models:/models. Passed to the Docker daemon when spawning sibling llama-server containers so they can bind-mount the same directory. |
| HOST_LOGS_DIR | (same as LOGS_DIR) | Host-side absolute path of the logs volume. Same requirement as HOST_MODELS_DIR. |
| PORT_RANGE_START | 8000 | Start of public llama-server/proxy port pool |
| PORT_RANGE_END | 8020 | End of public llama-server/proxy port pool |
| INTERNAL_PORT_RANGE_START | 9000 | Start of internal port pool used when proxy mode is enabled |
| INTERNAL_PORT_RANGE_END | 9020 | End of internal port pool used when proxy mode is enabled |
| LLAMAMAN_PROXY_PORT | 42069 | Port for the Ollama-compatible proxy |
| LLAMAMAN_MAX_MODELS | 0 | Max concurrent chat models via the proxy (LRU eviction, 0 = unlimited) |
| LLAMAMAN_IDLE_TIMEOUT | 0 | Idle timeout in minutes for proxy-managed instances (0 = disabled) |
| SECRET_KEY | (auto) | Flask session secret. Auto-derived from machine-id if unset. Set this for multi-replica deployments. |
| DATABASE_URL | (unset) | MariaDB/MySQL connection string. Unset = use JSON files. |
| HEALTH_CHECK_TIMEOUT | 3 | Timeout in seconds for instance health checks |
| MODEL_LOAD_TIMEOUT | 300 | Seconds to wait for a model to become healthy during launch/relaunch. Increase for very large models. |
| REQUEST_TIMEOUT | 300 | Timeout in seconds for upstream requests to llama-server and gate acquire waits. |
| Variable | Default | Description |
|---|---|---|
| LLAMA_IMAGE | (auto) | llama.cpp Docker image used for all spawned containers. Auto-selected from the detected GPU vendor if not set (server-cuda / server-rocm / server-sycl / server). Set explicitly to pin a specific image or version. |
| LLAMA_NETWORK | llamaman-net | Docker network that LlamaMan and all llama-server containers are attached to. Created automatically if it doesn't exist. |
| LLAMA_CONTAINER_PREFIX | llamaman- | Name prefix for spawned llama-server containers (e.g. llamaman-abcd1234). |
| GPU_TYPE | (auto-detect) | Override GPU vendor detection: cuda (NVIDIA), rocm (AMD), intel (Intel Arc). Leave unset to let LlamaMan probe the host automatically. |
| LLAMA_GPU_DEVICES | (unset = all) | Comma-separated GPU indices visible to all spawned llama-server containers, e.g. 0,1,3. Unset exposes all GPUs. Per-instance GPU Devices overrides this when set. Not supported on Intel Arc. |
All endpoints return and accept JSON.
Authentication: Management endpoints require either a session cookie (from browser login) or an Authorization: Bearer <key> header. When require_auth is enabled (default), model-serving endpoints also require a bearer token.
| Method | Endpoint | Description |
|---|---|---|
| GET | /login | Login page |
| POST | /login | Authenticate (username, password form data) |
| GET | /setup | First-run setup page |
| POST | /setup | Create first user account |
| GET | /logout | End session |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/api-keys | List all API keys (hashes stripped) |
| POST | /api/api-keys | Create a new API key ({"name": "..."}) |
| DELETE | /api/api-keys/<id> | Revoke an API key |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/instances | List all instances |
| POST | /api/instances | Launch a new instance |
| GET | /api/instances/<id> | Get a single instance |
| DELETE | /api/instances/<id> | Stop and remove an instance |
| POST | /api/instances/<id>/restart | Restart a stopped/sleeping instance |
| DELETE | /api/instances/<id>/remove | Remove a stopped instance from the list |
| GET | /api/instances/<id>/logs | Last N log lines |
| GET | /api/instances/<id>/logs/stream | SSE live log tail |
| GET | /api/next-port | Get next available port from the pool |
Launch body (POST /api/instances):
{
"model_path": "/models/my-model.gguf",
"port": 8000,
"n_gpu_layers": -1,
"ctx_size": 4096,
"threads": null,
"memory_limit": null,
"parallel": null,
"extra_args": "--flash-attn",
"gpu_devices": "",
"idle_timeout_min": 0,
"max_concurrent": 0,
"max_queue_depth": 200,
"share_queue": false,
"proxy_sampling_override_enabled": false,
"proxy_sampling_temperature": 0.8,
"proxy_sampling_top_k": 40,
"proxy_sampling_top_p": 0.95,
"proxy_sampling_presence_penalty": 0.0,
"proxy_sampling_repeat_penalty": 0.0
}
gpu_devices: comma-separated GPU indices for this instance (e.g. "0", "0,1"). Leave empty to use LLAMA_GPU_DEVICES (or all GPUs if that is also unset). Not supported on Intel Arc.
memory_limit: Docker memory cap string, e.g. "32g" or "8192m". Omit or null for no limit.
threads: when set, applies --threads N to llama-server and sets the container CPU quota to N cores.
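A minimal scripted launch might look like this (placeholder values; assumes omitted fields fall back to the defaults shown above, and authenticates with a bearer token as described earlier):

```python
import requests

# Launch a new instance through the management API (placeholder values).
body = {
    "model_path": "/models/my-model.gguf",
    "port": 8000,
    "n_gpu_layers": -1,
    "ctx_size": 4096,
}
resp = requests.post(
    "http://localhost:5000/api/instances",
    json=body,
    headers={"Authorization": "Bearer llm-xxxxxxxxxx"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # new instance record, including its assigned port and status
```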
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/downloads | List all downloads |
| POST | /api/downloads | Start a new download |
| GET | /api/downloads/<id> | Get a single download |
| DELETE | /api/downloads/<id> | Cancel an active download |
| DELETE | /api/downloads/<id>/remove | Remove a completed/failed entry |
| GET | /api/downloads/<id>/logs | Download log output |
| GET | /api/downloads/<id>/logs/stream | SSE live log tail |
Download body (POST /api/downloads):
{
"repo_id": "bartowski/Mistral-7B-Instruct-v0.3-GGUF",
"filename": "Mistral-7B-Instruct-v0.3-Q4_K_M.gguf",
"hf_token": "hf_...",
"speed_limit_mbps": 0
}
Leave filename blank to download the full repository.
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/models | List discovered models in MODELS_DIR (includes repo_id when source is known) |
| POST | /api/models/delete | Delete a model from disk ({"path": "/models/..."}) |
| GET | /api/model-layers?path=<path> | Read layer count from GGUF metadata |
| GET | /api/disk-space | Free/used space on the models volume |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/presets | List all saved presets |
| GET | /api/presets/<model_path> | Get preset for a model |
| PUT | /api/presets/<model_path> | Save/update a preset |
| DELETE | /api/presets/<model_path> | Delete a preset |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/settings | Get global settings |
| POST | /api/settings | Save global settings |
Settings body (POST /api/settings):
{
"require_auth": true,
"admin_ui_enforce_max_models": false,
"allow_ollama_api_override_admin": false,
"auto_retry_failed_downloads": false,
"retry_count_per_failed_download": 3,
"cleanup": {
"downloads_enabled": true,
"downloads_max_age_hours": 24,
"downloads_last_run_at": 1710000000,
"instances_enabled": false,
"instances_max_age_hours": 48,
"instances_last_run_at": 1710000000,
"stale_records_enabled": false,
"stale_records_interval_min": 5,
"stale_records_last_run_at": null
}
}
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/system-info | CPU usage, core count, RAM usage |
| GET | /api/gpu-info | Per-GPU VRAM and utilization (native query; falls back to container exec if native access is not configured) |
| GET | /health | Health check ({"status": "ok"}) - always open, no auth required |
| Method | Endpoint | Description |
|---|---|---|
| GET | /api/tags | List available models (Ollama format) |
| GET | /api/version | Version info |
| POST | /api/show | Model metadata |
| GET | /api/ps | Running models |
| POST | /api/chat | Chat completion (auto-starts model) |
| POST | /api/generate | Text generation (auto-starts model) |
| GET | /v1/models | List models (OpenAI format) |
| POST | /v1/chat/completions | Chat completion (OpenAI format, auto-starts model) |
| Symptom | Fix |
|---|---|
| Instance stuck on starting | Check logs via the Logs button. Common causes: OOM, model path typo, corrupt GGUF, image not pulled. |
| "Docker image not found" | Pull the matching image: docker pull ghcr.io/ggml-org/llama.cpp:server-cuda (NVIDIA), server-rocm (AMD), server-sycl (Intel Arc), or server (CPU). |
| "Docker API error" on launch | Ensure /var/run/docker.sock is mounted into the LlamaMan container (it is by default in docker-compose.yml). |
| No GPU / CUDA error | Ensure the NVIDIA Container Toolkit is installed and docker run --gpus all works on the host. |
| No GPU / ROCm error | Ensure /dev/kfd and /dev/dri exist on the host and your user is in the video/render groups. |
| No GPU / Intel Arc error | Ensure /dev/dri is accessible and your user is in the video/render groups. |
| GPU stats show unavailable | For NVIDIA: uncomment the deploy.resources.reservations block in docker-compose.yml. For AMD/Intel: ensure /sys/class/drm:ro is mounted (default in docker-compose.yml). |
| Wrong GPU vendor detected | Set GPU_TYPE=cuda, GPU_TYPE=rocm, or GPU_TYPE=intel in the environment to override auto-detection. |
| Port conflict | The form auto-suggests an unused port; adjust if needed. |
| Model not showing in OpenWebUI | Ensure OLLAMA_BASE_URL points to http://llamaman:42069. Check /api/tags returns models. |
| OpenWebUI gets 401 errors | require_auth is on (default). Create an API key in the UI and set OPENAI_API_KEYS in OpenWebUI's environment. |
| "API key required" on all requests | Either create an API key, or turn off the "Require authentication" toggle in the API Keys section. |
| Containers not cleaned up after stop | LlamaMan stops and removes containers when instances are stopped. If containers are orphaned after a crash, run docker ps --filter name=llamaman- to find and remove them manually, or restart LlamaMan (orphan adoption runs on startup). |
This project would not be possible without ggml-org/llama.cpp.
LlamaMan is licensed under the Elastic License 2.0. You may use, copy, distribute, and modify the software, subject to the following limitations:
- You may not provide the software to third parties as a hosted or managed service where the service gives users access to a substantial set of its features or functionality.
- You may not remove or obscure any licensing, copyright, or other notices of the licensor.
