Releases: nullata/llamaman
0.9.6
Docker-in-Docker architecture #37
LlamaMan no longer bundles or calls llama.cpp directly. Instead it spawns each model server as a sibling Docker container using the official ghcr.io/ggml-org/llama.cpp:server-* images via the Docker socket. This is the foundational change that everything else in this release builds on.
- LlamaMan is now a lightweight Python-only container with no GPU dependency of its own
- llama-server containers are created, started, stopped, and removed through the Docker SDK
- GPU passthrough, port binding, volume mounts, CPU quota, and memory limits are applied per-container at launch time
- Models volume is passed to sub-containers using a `MODELS_HOST_DIR` env var that resolves the actual host-side path for the bind mount
- Backing containers are always cleaned up: `stop_container` now catches errors from `stop()` and calls `remove()` regardless, so already-exited containers don't leave orphaned records
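For illustration, a minimal sketch of the sibling-container pattern with the Docker SDK for Python - the image tag, command, port mapping, and helper names here are illustrative, not LlamaMan's actual internals:

```python
import docker

client = docker.from_env()  # talks to the mounted /var/run/docker.sock

def start_llama_server(models_host_dir: str, port: int = 8000):
    # Spawn llama-server as a sibling container. The bind mount must use
    # the host-side path (hence a MODELS_HOST_DIR-style env var), because
    # the Docker daemon resolves volume sources on the host, not inside
    # the llamaman container.
    return client.containers.run(
        "ghcr.io/ggml-org/llama.cpp:server-cuda",
        command=["-m", "/models/model.gguf", "--port", str(port)],
        detach=True,
        ports={f"{port}/tcp": port},
        volumes={models_host_dir: {"bind": "/models", "mode": "ro"}},
        device_requests=[
            docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
        ],
    )

def stop_container(container) -> None:
    # Always remove, even if stop() fails (e.g. the container already
    # exited), so no orphaned records are left behind.
    try:
        container.stop()
    except docker.errors.APIError:
        pass
    container.remove(force=True)
```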
Universal GPU support - single image for all vendors
- Single `Dockerfile` - `Dockerfile.cuda` and `Dockerfile.rocm` are replaced by one `Dockerfile`. One image tag covers NVIDIA, AMD (ROCm), Intel Arc, and CPU-only
- Auto-detection at startup - LlamaMan probes the host: pynvml for NVIDIA, `/sys/class/drm` sysfs for AMD and Intel Arc. The detected vendor is logged at startup
- `LLAMA_IMAGE` auto-default - if the env var is not set, the image is selected from the detected vendor (`server-cuda` / `server-rocm` / `server-sycl` / `server`)
- `GPU_TYPE` override - set to `cuda`, `rocm`, or `intel` to skip auto-detection
- Intel Arc support - new `intel` branch in `_run_container`: mounts `/dev/dri`, adds the `video`/`render` groups, uses the `server-sycl` image by default. Per-instance GPU device selection is not supported on Intel Arc
- Single `docker-compose.yml` - the separate ROCm profile service is removed; the `/sys/class/drm:ro` mount is included by default; the NVIDIA toolkit `utility` capability block is present as a commented-out section
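A sketch of what vendor probing along these lines can look like; the function and mapping names are hypothetical, while the PCI vendor IDs (`0x1002` AMD, `0x8086` Intel) are standard:

```python
import glob

def detect_gpu_vendor() -> str:
    # NVIDIA: pynvml only initializes when the NVIDIA driver is visible.
    try:
        import pynvml
        pynvml.nvmlInit()
        if pynvml.nvmlDeviceGetCount() > 0:
            return "cuda"
    except Exception:
        pass
    # AMD / Intel Arc: check PCI vendor IDs exposed under /sys/class/drm.
    for vendor_file in glob.glob("/sys/class/drm/card*/device/vendor"):
        vendor = open(vendor_file).read().strip()
        if vendor == "0x1002":
            return "rocm"
        if vendor == "0x8086":
            return "intel"
    return "cpu"

# Map the detected vendor to a llama.cpp server image tag suffix.
IMAGE_SUFFIX = {"cuda": "server-cuda", "rocm": "server-rocm",
                "intel": "server-sycl", "cpu": "server"}
```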
Native GPU monitoring
- VRAM and utilization are now queried inside the llamaman container directly - no running llama-server instance required
- NVIDIA: uses pynvml. Requires uncommenting the `deploy.resources.reservations` block in `docker-compose.yml` to grant the toolkit `utility` capability
- AMD / Intel Arc: reads `mem_info_vram_used`, `mem_info_vram_total`, `gpu_busy_percent`, and `product_name` from `/sys/class/drm` sysfs (the `:ro` mount in the compose file)
- Falls back to the previous exec-based `nvidia-smi`/`rocm-smi` approach when native access is not configured and a container is running
- The GPU panel no longer returns an error when no llama-server containers are running
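A sketch of the sysfs read path, assuming an amdgpu-style card directory; the helper name is illustrative:

```python
from pathlib import Path

def read_amdgpu_stats(card: str = "card0") -> dict:
    # The amdgpu driver exposes VRAM counters (bytes) and a busy
    # percentage as plain-text files under the card's device directory.
    dev = Path("/sys/class/drm") / card / "device"

    def read(name: str) -> str:
        return (dev / name).read_text().strip()

    return {
        "vram_used_bytes": int(read("mem_info_vram_used")),
        "vram_total_bytes": int(read("mem_info_vram_total")),
        "gpu_busy_percent": int(read("gpu_busy_percent")),
        "product_name": read("product_name"),
    }
```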
Container resource monitoring
- Each running instance card shows live stats updated every 3 seconds: CPU%, core quota, RAM used / limit, and GPU assignment
- CPU quota is read from the instance's configured `threads` value (the Docker `nano_cpus` setting), not from `online_cpus`, which always reflects the host CPU count
- GPU assignment is resolved from the instance config against the detected GPU list - no container inspection needed
- Stats are fetched in parallel via a `ThreadPoolExecutor` to avoid blocking the UI on slow Docker API calls
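A sketch of the parallel-stats idea, using the Docker SDK's one-shot `stats(stream=False)` call and the standard Docker CPU% formula; helper names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
import docker

client = docker.from_env()

def container_stats(container) -> dict:
    # One-shot stats snapshot; each call can take a second or more,
    # which is why the instance cards are fetched in parallel.
    s = container.stats(stream=False)
    cpu_delta = (s["cpu_stats"]["cpu_usage"]["total_usage"]
                 - s["precpu_stats"]["cpu_usage"]["total_usage"])
    sys_delta = (s["cpu_stats"]["system_cpu_usage"]
                 - s["precpu_stats"].get("system_cpu_usage", 0))
    online = s["cpu_stats"].get("online_cpus", 1)
    cpu_pct = (cpu_delta / sys_delta) * online * 100 if sys_delta else 0.0
    return {
        "cpu_percent": round(cpu_pct, 1),
        "mem_used": s["memory_stats"].get("usage", 0),
        "mem_limit": s["memory_stats"].get("limit", 0),
    }

def all_stats(containers) -> list[dict]:
    # Fetch every container's stats concurrently to keep the UI responsive.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(container_stats, containers))
```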
Per-container resource limits
- CPU Threads now applies both `--threads N` to llama-server and a Docker CPU quota (`nano_cpus`) to the container, capping the cores it can use. Leave blank for no limit
- Memory Limit - new field in the launch form (e.g. `32g`, `8192m`). Sets `mem_limit` on the spawned container. Saved in presets. Leave blank for no limit
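A sketch of how a thread count can map to both knobs (`nano_cpus` is the Docker SDK's unit of 10^-9 CPUs, so one full core is 1_000_000_000); the helper is hypothetical:

```python
def resource_kwargs(threads: int | None, mem_limit: str | None) -> dict:
    # threads=4 caps the container at 4 cores' worth of CPU time.
    # Blank fields mean no limit at all.
    kwargs = {}
    if threads:
        kwargs["nano_cpus"] = threads * 1_000_000_000
    if mem_limit:                  # e.g. "32g" or "8192m"
        kwargs["mem_limit"] = mem_limit
    return kwargs

# Passed through to client.containers.run(..., **resource_kwargs(4, "32g")),
# alongside "--threads 4" on the llama-server command line.
```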
Docker image management
- Pull image by name - a new text input in the Docker Images tab lets you pull any image by name directly (e.g. `ghcr.io/ggml-org/llama.cpp:server-cuda`) without it needing to be in the tracked list first
- Delete local image - each image in the list now has a delete button that removes it from Docker and from the tracked list. Disabled for the active `LLAMA_IMAGE`. Returns an error if Docker refuses (e.g. the image is in use by a running container)
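A sketch of both operations via the Docker SDK; the error handling shown is illustrative:

```python
import docker

client = docker.from_env()

def pull_image(name: str):
    # e.g. "ghcr.io/ggml-org/llama.cpp:server-cuda"
    return client.images.pull(name)

def delete_image(name: str):
    # Surface Docker's refusal (e.g. image in use by a running
    # container) rather than force-removing.
    try:
        client.images.remove(name)
    except docker.errors.APIError as exc:
        return {"error": exc.explanation}
```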
Model backup and restore #39
- Download Stored Models JSON - exports all scanned models with their preset configs to a timestamped JSON file
- Restore from JSON - upload a previously exported backup. For each model in the file:
  - Already present on disk: preset is merged in (existing values are not overwritten)
  - Not present but has a HuggingFace source: download is queued immediately and the preset is pre-populated at the expected post-download path so it is ready when the file lands
  - Not present and no known source: reported as unrestorable
- Results are shown inline with per-model status badges (present / queued / missing / error)
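The merge rule for already-present models amounts to "imported values fill gaps only"; a minimal sketch with a hypothetical helper:

```python
def merge_preset(existing: dict, imported: dict) -> dict:
    # Values from the backup only fill gaps; anything already
    # configured on disk wins over the imported preset.
    return {**imported, **existing}

# merge_preset({"ctx": 8192}, {"ctx": 4096, "gpu_layers": 99})
# -> {"ctx": 8192, "gpu_layers": 99}
```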
Repeat penalty in proxy sampling overrides
- New Repeat Penalty field in the per-instance proxy sampling overrides section
- Default `0` (disabled - not injected into requests). Range `0`–`2.0`
- Only injected into proxied requests when set above `0`, so leaving it at the default has no effect on clients that set their own value
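A sketch of the injection rule, with a hypothetical helper name:

```python
def apply_sampling_overrides(body: dict, repeat_penalty: float = 0.0) -> dict:
    # 0.0 means disabled: the key is never injected, so clients that
    # set their own repeat_penalty pass through untouched.
    if repeat_penalty > 0:
        body["repeat_penalty"] = repeat_penalty
    return body
```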
0.8.9-4
- Display source repository info on model cards (#40) - Added UI support for showing the HuggingFace `repo_id` a model was downloaded from. CSS, JS, and template changes only - no backend changes
- Fix per-instance proxy blocking on request body read - `_extract_model_from_request` was calling `wsgi.input.read()` with no argument, which reads the raw socket until EOF (blocking until the client disconnects). Fixed by reading exactly `CONTENT_LENGTH` bytes, so the proxy no longer hangs waiting for the connection to close before forwarding (see the sketch after this list)
- Model name validation for healthy instances on per-instance proxy ports - Previously, model name validation (returning 404 for a mismatched `model` field) only ran for sleeping/stopped instances. The check now runs after all wake/wait logic, so healthy and starting instances are validated consistently - sending the wrong model name to a port always returns 404 regardless of instance state
- Docs and version bump (0.8.9-4) - README and DOCKERHUB.md updates covering per-instance proxy behavior and model validation rules, a MariaDB/MySQL setup snippet with `CREATE DATABASE`/`CREATE USER`/`GRANT` commands, and a minor docker-compose correction
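A sketch of the body-read fix in plain WSGI terms; the helper names are illustrative:

```python
import json

def _read_request_body(environ) -> bytes:
    # Read exactly CONTENT_LENGTH bytes: calling
    # environ["wsgi.input"].read() with no argument can block until
    # the client closes the connection.
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    return environ["wsgi.input"].read(length) if length > 0 else b""

def extract_model(environ):
    # Hypothetical helper: pull the "model" field out of a JSON body.
    try:
        return json.loads(_read_request_body(environ)).get("model")
    except (ValueError, AttributeError):
        return None
```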
0.8.9
Model Favorites & Notes (#35)
- Star/favorite models in the sidebar model library - click the star icon to mark favorites, which sort alphabetically at the top of the list
- Favorite toggle in settings - a star button appears in the Launch Instance tab bar (far right) for quick access
- Model notes - a new "Note" text field in the Launch Instance form lets you add a note to any model, saved automatically on blur
- Favorites and notes are stored as part of model presets and persist across sessions
- Added a `PATCH /api/presets/<path>` endpoint for lightweight partial preset updates (favorite/note only, no full preset required)
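A sketch of such a partial-update endpoint, assuming a Flask-style route; `load_preset` and `save_preset` are hypothetical storage helpers:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/api/presets/<path:model_path>", methods=["PATCH"])
def patch_preset(model_path):
    # Partial update: only the provided keys (favorite / note) change;
    # the rest of the preset is left untouched.
    patch = request.get_json(force=True)
    preset = load_preset(model_path)   # hypothetical storage helpers
    preset.update({k: v for k, v in patch.items()
                   if k in ("favorite", "note")})
    save_preset(model_path, preset)
    return jsonify(preset)
```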
Proxy Wake-on-Request by Model Name (#36)
- Fixed: when sending an OpenAI API request directly to a sleeping instance's port (e.g. `POST http://localhost:8000/v1/chat/completions`), the idle proxy now inspects the `model` field in the request body and wakes the sleeping instance if the model matches (sketched after this list)
- If the requested model doesn't match the sleeping instance, the proxy returns a clear `404` error instead of a generic failure
- If the original instance record is gone but a sleeping instance with a matching model exists on that port, the proxy finds and wakes it
- Non-inference requests (health checks, etc.) continue to wake the instance unconditionally
- The main llamaman proxy on port 42069 is unaffected - all changes are scoped to the per-instance idle proxy (ports 8000-8020)
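The wake decision can be summarized as a small decision table; this sketch uses hypothetical names, assumes these two inference paths, and returns the chosen action as a string:

```python
INFERENCE_PATHS = ("/v1/chat/completions", "/v1/completions")

def resolve_wake_action(path: str, requested_model, instance) -> str:
    # Decision flow for the per-instance idle proxy (ports 8000-8020).
    if not path.startswith(INFERENCE_PATHS):
        return "wake"            # non-inference traffic (health checks, etc.)
    if instance is None:
        return "find_and_wake"   # record gone: search sleeping instances by model
    if requested_model and requested_model != instance.model:
        return "reject_404"      # clear error instead of a generic failure
    return "wake"
```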
0.8.7
0.8.6
Model Downloads
- Added model list download for redeployment backups
- Integrated repo source download into settings store
- Failed downloads now auto-retry
Proxy / Parameter Overrides
- Added proxy support for temperature, top-p, and top-k
- Added upper bound enforcement and presence_penalty to parameter overrides
UI / Visual Improvements
- Refactored resource bars: consolidated inline CSS into proper classes, adjusted geometry, styling, and progress bar colors
- Status bar layout and color tone adjustments
- GPU polling frequency tuned
- System info card visibility state properly initialized
- General QoL visual polish
0.8.3
0.7.8
0.7.3
- Release gate slot when per-instance proxy returns 502 (the `resp=None` path never called `gate.release()`, leaking slots permanently) (#10)
- Add `requests.Timeout` to retry exception lists (`ReadTimeout` doesn't inherit from `ConnectionError`, so timeouts were never retried)
- Add `REQUEST_TIMEOUT` env var (default 300s) replacing all hardcoded `timeout=300` values (#11)
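The inheritance detail behind the fix, shown directly against the `requests` exception hierarchy:

```python
import requests

# requests.ReadTimeout subclasses requests.Timeout, not
# requests.ConnectionError, so a retry tuple of (ConnectionError,)
# alone never catches read timeouts.
RETRYABLE = (requests.ConnectionError, requests.Timeout)

assert issubclass(requests.ReadTimeout, requests.Timeout)
assert not issubclass(requests.ReadTimeout, requests.ConnectionError)
```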
0.7.2
Bug Fixes
- Preset `idle_timeout_min` ignored on API-triggered launch - When a model was auto-launched via the Ollama or OpenAI-compatible API endpoints, the `idle_timeout_min` preset setting was not passed to `launch_instance`. It always defaulted to `0` (disabled), so no idle proxy was created and the background poller never enforced the timeout. The model would run indefinitely regardless of preset configuration. (#9)
- Initial request lost during model loading - When a request triggered a model load (cold start), `_ensure_model_running` held `_llamaman_lock` for the entire model load duration (up to 300s) while polling health through the public proxy port. This caused the original request to time out before the prompt was ever forwarded to the model. Fixed by splitting the flow: `_ensure_model_running` now returns as soon as the model is launched, and each request handler waits for readiness independently on the internal port before forwarding. (#8)
- Proxy did not handle `"starting"` status - The per-instance idle proxy (ports 8000-8020) only checked for the `"sleeping"` and `"stopped"` statuses. If a request arrived at a proxy port while the model was still loading (`"starting"`), it was forwarded immediately to the unready llama-server and failed. The proxy now waits for the model to become healthy before forwarding. (#8)
- Connection errors on forward after model load - Added retry logic (3 attempts, 2s interval) for `ConnectionError`/`ConnectionRefusedError` on all request forwarding paths: Ollama streaming, Ollama non-streaming, OpenAI passthrough, and idle proxy forwarding. This handles transient failures in the brief window between health-check success and the server being fully ready for inference (a sketch follows this list). (#8)
- Preset settings ignored on model relaunch - When a sleeping or stopped model was relaunched (woken by a request or restarted), `relaunch_inactive_instance` used the config stored from the original launch. Any preset changes made while the model was inactive (e.g. adjusted GPU layers, context size, idle timeout) were ignored. The function now reloads current presets from storage before rebuilding the launch command. (#9)
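A sketch of the retry shape described above (3 attempts, 2s interval); names and the timeout value are illustrative:

```python
import time
import requests

RETRY_ATTEMPTS = 3
RETRY_INTERVAL_S = 2

def forward_with_retry(url: str, payload: dict) -> requests.Response:
    # Retries cover the brief window between a successful health check
    # and llama-server being fully ready to accept inference requests.
    for attempt in range(RETRY_ATTEMPTS):
        try:
            return requests.post(url, json=payload, timeout=300)
        except (requests.ConnectionError, ConnectionRefusedError):
            if attempt == RETRY_ATTEMPTS - 1:
                raise
            time.sleep(RETRY_INTERVAL_S)
```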
Improvements
- `_llamaman_lock` released faster - The global model launch/eviction lock is no longer held during model health polling. This unblocks concurrent launches of different models that were previously serialized behind a single long model load.
- Health checks use internal port directly - Request handlers now poll the llama-server health endpoint on the internal port rather than routing through the idle proxy, avoiding unnecessary proxy thread blocking during model startup.
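A sketch of the lock-scoping pattern; `is_running`, `launch_instance`, `wait_until_healthy`, and `internal_port` are hypothetical helpers:

```python
import threading

_llamaman_lock = threading.Lock()

def ensure_model_running(model: str) -> None:
    # Hold the global lock only for the launch/eviction decision...
    with _llamaman_lock:
        if not is_running(model):
            launch_instance(model)
    # ...and poll health on the internal port outside the lock, so a
    # slow model load no longer serializes launches of other models.
    wait_until_healthy(internal_port(model))
```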
0.7.1
Bug Fixes
- Fixed requests timing out when an instance is still loading a model. Health checks now use a dedicated `MODEL_LOAD_TIMEOUT` (default 300s) instead of the short `HEALTH_CHECK_TIMEOUT`, so large models have time to load without dropping requests. (#8)
New Environment Variable
- `MODEL_LOAD_TIMEOUT` - seconds to wait for a model to become healthy during launch/relaunch (default: 300)