Releases: nullata/llamaman

0.9.6

12 Apr 22:24
1a494b4

Docker-in-Docker architecture #37

LlamaMan no longer bundles or calls llama.cpp directly. Instead it spawns each model server as a sibling Docker container using the official ghcr.io/ggml-org/llama.cpp:server-* images via the Docker socket. This is the foundational change that everything else in this release builds on.

  • LlamaMan is now a lightweight Python-only container with no GPU dependency of its own
  • llama-server containers are created, started, stopped, and removed through the Docker SDK
  • GPU passthrough, port binding, volume mounts, CPU quota, and memory limits are applied per-container at launch time
  • Models volume is passed to sub-containers using a MODELS_HOST_DIR env var that resolves the actual host-side path for the bind mount
  • Backing containers are always cleaned up: stop_container now catches errors from stop() and calls remove() regardless, so already-exited containers don't leave orphaned records
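The cleanup rule in the last bullet can be sketched as a small helper. This is an illustrative distillation, not the project's actual code; it assumes a docker-py-style container object exposing stop() and remove():

```python
import logging

log = logging.getLogger(__name__)

def stop_container(container):
    """Stop a backing llama-server container, then remove it unconditionally.

    stop() raises for containers that have already exited; that error must
    not prevent remove(), or the dead container leaves an orphaned record.
    """
    try:
        container.stop()
    except Exception as exc:  # already exited, daemon hiccup, etc.
        log.warning("stop() failed (%s); removing anyway", exc)
    container.remove(force=True)
```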

Universal GPU support - single image for all vendors

  • Single Dockerfile - Dockerfile.cuda and Dockerfile.rocm are replaced by one Dockerfile. One image tag covers NVIDIA, AMD (ROCm), Intel Arc, and CPU-only
  • Auto-detection at startup - LlamaMan probes the host: pynvml for NVIDIA, /sys/class/drm sysfs for AMD and Intel Arc. Detected vendor is logged at startup
  • LLAMA_IMAGE auto-default - if the env var is not set, the image is selected from the detected vendor (server-cuda / server-rocm / server-sycl / server)
  • GPU_TYPE override - set to cuda, rocm, or intel to skip auto-detection
  • Intel Arc support - new intel branch in _run_container: mounts /dev/dri, adds video/render groups, uses server-sycl image by default. Per-instance GPU device selection is not supported on Intel Arc
  • Single docker-compose.yml - the separate ROCm profile service is removed; /sys/class/drm:ro mount is included by default; NVIDIA toolkit utility capability block is present as a commented-out section
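The detection flow above (GPU_TYPE override, pynvml for NVIDIA, sysfs vendor IDs for AMD/Intel, LLAMA_IMAGE default) might look roughly like the sketch below. Function names and the exact probing order are assumptions; only the sources probed and the image tags come from the notes:

```python
import os

# PCI vendor IDs as they appear in /sys/class/drm/card*/device/vendor
_VENDOR_IDS = {"0x1002": "rocm", "0x8086": "intel"}  # AMD, Intel

def detect_gpu_type(drm_root="/sys/class/drm"):
    """Return 'cuda', 'rocm', 'intel', or None (CPU-only)."""
    override = os.environ.get("GPU_TYPE")
    if override:
        return override  # explicit override skips auto-detection
    try:
        import pynvml
        pynvml.nvmlInit()
        if pynvml.nvmlDeviceGetCount() > 0:
            return "cuda"
    except Exception:
        pass  # no NVIDIA driver/library available
    if os.path.isdir(drm_root):
        for card in sorted(os.listdir(drm_root)):
            vendor_file = os.path.join(drm_root, card, "device", "vendor")
            if os.path.isfile(vendor_file):
                with open(vendor_file) as f:
                    vendor = f.read().strip()
                if vendor in _VENDOR_IDS:
                    return _VENDOR_IDS[vendor]
    return None

def default_llama_image(gpu_type):
    """Pick the llama.cpp server image unless LLAMA_IMAGE is set."""
    explicit = os.environ.get("LLAMA_IMAGE")
    if explicit:
        return explicit
    suffix = {"cuda": "server-cuda", "rocm": "server-rocm",
              "intel": "server-sycl"}.get(gpu_type, "server")
    return f"ghcr.io/ggml-org/llama.cpp:{suffix}"
```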

Native GPU monitoring

  • VRAM and utilization are now queried inside the llamaman container directly - no running llama-server instance required
  • NVIDIA: uses pynvml. Requires uncommenting the deploy.resources.reservations block in docker-compose.yml to grant toolkit utility capability
  • AMD / Intel Arc: reads mem_info_vram_used, mem_info_vram_total, gpu_busy_percent, and product_name from /sys/class/drm sysfs (the :ro mount in the compose file)
  • Falls back to the previous exec-based nvidia-smi / rocm-smi approach when native access is not configured and a container is running
  • The GPU panel no longer returns an error when no llama-server containers are running
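The sysfs read path for AMD/Intel can be sketched as below. The file names are the ones listed above; the helper and its return shape are illustrative, and availability of each file varies by driver:

```python
import os

def read_sysfs_gpu_stats(device_dir):
    """Read GPU stats from a sysfs device dir, e.g. /sys/class/drm/card0/device.

    Returns None for any file that is absent or unreadable, so partial
    driver support degrades gracefully instead of erroring.
    """
    def read(name, cast=str, default=None):
        try:
            with open(os.path.join(device_dir, name)) as f:
                return cast(f.read().strip())
        except (OSError, ValueError):
            return default
    return {
        "vram_used": read("mem_info_vram_used", int),    # bytes
        "vram_total": read("mem_info_vram_total", int),  # bytes
        "busy_percent": read("gpu_busy_percent", int),
        "name": read("product_name"),
    }
```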

Container resource monitoring

  • Each running instance card shows live stats updated every 3 seconds: CPU%, core quota, RAM used / limit, and GPU assignment
  • CPU quota is read from the instance's configured threads value (the Docker nano_cpus setting), not from online_cpus which always reflects the host CPU count
  • GPU assignment is resolved from the instance config against the detected GPU list - no container inspection needed
  • Stats are fetched in parallel via a ThreadPoolExecutor to avoid blocking the UI on slow Docker API calls
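The parallel fetch in the last bullet could be structured like this. The `fetch_one` callable and the per-future timeout are hypothetical details; the technique (a ThreadPoolExecutor so one slow Docker API call can't stall the refresh) is what the notes describe:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all_stats(containers, fetch_one, max_workers=8, timeout=5):
    """Fetch per-container stats concurrently.

    A container whose stats call is slow or fails maps to None, so the
    UI renders the rest of the cards instead of blocking on it.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {c: pool.submit(fetch_one, c) for c in containers}
        results = {}
        for c, fut in futures.items():
            try:
                results[c] = fut.result(timeout=timeout)
            except Exception:
                results[c] = None
        return results
```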

Per-container resource limits

  • CPU Threads now applies both --threads N to llama-server and a Docker CPU quota (nano_cpus) to the container, capping the cores it can use. Leave blank for no limit
  • Memory Limit - new field in the launch form (e.g. 32g, 8192m). Sets mem_limit on the spawned container. Saved in presets. Leave blank for no limit
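The two limits translate to Docker container settings roughly as follows. These helpers are illustrative (the Docker SDK's `mem_limit` also accepts strings like "32g" directly); the NanoCpus unit of 1e9 per core is Docker's documented convention:

```python
def threads_to_nano_cpus(threads):
    """Map a threads count to Docker's nano_cpus quota (1 core == 1e9).
    Blank/zero means no quota."""
    return None if not threads else int(threads) * 1_000_000_000

def parse_mem_limit(value):
    """Parse a Docker-style memory limit string like '32g' or '8192m'
    into bytes. Blank means no limit."""
    if not value:
        return None
    value = value.strip().lower()
    units = {"b": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}
    if value[-1] in units:
        return int(float(value[:-1]) * units[value[-1]])
    return int(value)  # plain byte count
```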

Docker image management

  • Pull image by name - new text input in the Docker Images tab lets you pull any image by name directly (e.g. ghcr.io/ggml-org/llama.cpp:server-cuda) without it needing to be in the tracked list first
  • Delete local image - each image in the list now has a delete button that removes it from Docker and from the tracked list. Disabled for the active LLAMA_IMAGE. Returns an error if Docker refuses (e.g. image in use by a running container)

Model backup and restore #39

  • Download Stored Models JSON - exports all scanned models with their preset configs to a timestamped JSON file
  • Restore from JSON - upload a previously exported backup. For each model in the file:
    • Already present on disk: preset is merged in (existing values are not overwritten)
    • Not present but has a HuggingFace source: download is queued immediately and preset is pre-populated at the expected post-download path so it is ready when the file lands
    • Not present and no known source: reported as unrestorable
  • Results are shown inline with per-model status badges (present / queued / missing / error)
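The "preset is merged in (existing values are not overwritten)" rule reduces to a precedence merge. A minimal sketch, assuming presets are flat dicts (the real preset shape is not specified in the notes):

```python
def merge_preset(existing, restored):
    """Merge a restored preset into one already on disk.

    Keys present in the existing preset win; the backup only fills in
    keys that are missing, so a restore never clobbers local edits.
    """
    merged = dict(restored)
    merged.update(existing)  # existing values take precedence
    return merged
```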

Repeat penalty in proxy sampling overrides

  • New Repeat Penalty field in the per-instance proxy sampling overrides section
  • Default 0 (disabled - not injected into requests). Range 0-2.0
  • Only injected into proxied requests when set above 0, so leaving it at the default has no effect on clients that set their own value
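The conditional-injection rule can be sketched as below. The override dict and function name are illustrative; the behavior (0 means disabled and never injected, values clamped to the 0-2.0 range) follows the bullets above:

```python
def apply_sampling_overrides(body, overrides):
    """Inject per-instance sampling overrides into a proxied request body.

    A repeat_penalty of 0 is 'disabled': nothing is injected, so a client
    that sets its own value in the request is left untouched.
    """
    out = dict(body)
    rp = overrides.get("repeat_penalty", 0)
    if rp and rp > 0:
        out["repeat_penalty"] = min(float(rp), 2.0)  # enforce upper bound
    return out
```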

0.8.9-4

08 Apr 14:24
1ee7f9a

  • Display source repository info on model cards (#40)
    Added UI support for showing the HuggingFace repo_id a model was downloaded from. CSS, JS, and template changes only - no backend changes

  • Fix per-instance proxy blocking on request body read
    _extract_model_from_request was calling wsgi.input.read() with no argument, which reads the raw socket until EOF (blocking until the client disconnects). Fixed by reading exactly CONTENT_LENGTH bytes, so the proxy no longer hangs waiting for the connection to close before forwarding

  • Model name validation for healthy instances on per-instance proxy ports
    Previously, model name validation (returning 404 for a mismatched "model" field) only ran for sleeping/stopped instances. Extended the check to run after all wake/wait logic so healthy and starting instances are validated consistently - sending the wrong model name to a port always returns 404 regardless of instance state

  • Docs and version bump (0.8.9-4) - README and DOCKERHUB.md updates covering per-instance proxy behavior and model validation rules, a MariaDB/MySQL setup snippet with CREATE DATABASE/CREATE USER/GRANT commands, and a minor docker-compose correction
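The body-read fix described above comes down to bounding the read by CONTENT_LENGTH. A minimal sketch (helper name illustrative, the WSGI environ keys are standard):

```python
def read_request_body(environ):
    """Read exactly CONTENT_LENGTH bytes from wsgi.input.

    Calling .read() with no size argument on a raw socket stream blocks
    until the client closes the connection - the hang this fix removes.
    """
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    return environ["wsgi.input"].read(length) if length > 0 else b""
```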

0.8.9

07 Apr 15:34
0236b5e

Model Favorites & Notes (#35)

  • Star/favorite models in the sidebar model library - click the star icon to mark favorites, which sort alphabetically at the top of the list
  • Favorite toggle in settings - a star button appears in the Launch Instance tab bar (far right) for quick access
  • Model notes - a new "Note" text field in the Launch Instance form lets you add a note to any model, saved automatically on blur
  • Favorites and notes are stored as part of model presets and persist across sessions
  • Added PATCH /api/presets/<path> endpoint for lightweight partial preset updates (favorite/note only, no full preset required)

Proxy Wake-on-Request by Model Name (#36)

  • Fixed: when sending an OpenAI API request directly to a sleeping instance's port (e.g. POST http://localhost:8000/v1/chat/completions), the idle proxy now inspects the model field in the request body and wakes the sleeping instance if the model matches
  • If the requested model doesn't match the sleeping instance, the proxy returns a clear 404 error instead of a generic failure
  • If the original instance record is gone but a sleeping instance with a matching model exists on that port, the proxy finds and wakes it
  • Non-inference requests (health checks, etc.) continue to wake the instance unconditionally
  • The main llamaman proxy on port 42069 is unaffected - all changes are scoped to the per-instance idle proxy (ports 8000-8020)
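The wake-vs-404 decision above can be distilled into a small pure function. This is a simplification for illustration (it omits the record-recovery case and the exact set of inference paths is an assumption):

```python
def proxy_wake_decision(requested_model, instance_model, path):
    """Decide how the per-instance idle proxy treats a request to a
    sleeping instance: wake it, or answer 404 on a model mismatch.

    Non-inference requests (health checks, etc.) wake unconditionally;
    inference requests wake only when the body's model field matches.
    """
    inference = path.startswith(("/v1/chat/completions", "/v1/completions"))
    if not inference:
        return "wake"
    return "wake" if requested_model == instance_model else "404"
```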

0.8.7

01 Apr 15:40

  • add embeddings endpoint (#32)
  • add standard API embeddings guard (#32)
  • auto add --embeddings server option on UI embedding model option toggle

0.8.6

31 Mar 18:49
c7b2efb

Model Downloads

  • Added model list download for redeployment backups
  • Integrated repo source download into settings store
  • Failed downloads now auto-retry

Proxy / Parameter Overrides

  • Added proxy support for temperature, top-p, and top-k
  • Added upper bound enforcement and presence_penalty to parameter overrides

UI / Visual Improvements

  • Refactored resource bars: consolidated inline CSS into proper classes; adjusted geometry, styling, and progress bar colors
  • Status bar layout and color tone adjustments
  • GPU polling frequency tuned
  • System info card visibility state properly initialized
  • General QoL visual polish

0.8.3

30 Mar 14:10
2e305ec

What's Changed

  • Re-implement eviction policy #25
  • Impl download retry #23
  • Impl app settings tab + eviction policy toggles #21 #24 #25
  • fix ram usage detection fallback #22
  • impl stale record cleanup worker #24

Full Changelog: 0.7.8...0.8.3

0.7.8

29 Mar 17:48
fdc55b8

What's Changed

  • Implement model loading, download control, and token management by @nullata in #17

Full Changelog: 0.7.3...0.7.8

0.7.3

28 Mar 19:05

  • Release gate slot when per-instance proxy returns 502 (resp=None path never called gate.release(), leaking slots permanently) (#10)
  • Add requests.Timeout to retry exception lists (ReadTimeout doesn't inherit from ConnectionError, so timeouts were never retried) (#11)
  • Add REQUEST_TIMEOUT env var (default 300s), replacing all hardcoded timeout=300 values (#11)

0.7.2

26 Mar 19:37

Bug Fixes

  • Preset idle_timeout_min ignored on API-triggered launch - When a model was auto-launched via the Ollama or OpenAI-compatible API endpoints, the idle_timeout_min preset setting was not passed to launch_instance. It always defaulted to 0 (disabled), so no idle proxy was created and the background poller never enforced the timeout. The model would run indefinitely regardless of preset configuration. (#9)

  • Initial request lost during model loading - When a request triggered a model load (cold start), _ensure_model_running held _llamaman_lock for the entire model load duration (up to 300s) while polling health through the public proxy port. This caused the original request to time out before the prompt was ever forwarded to the model. Fixed by splitting the flow: _ensure_model_running now returns as soon as the model is launched, and each request handler waits for readiness independently on the internal port before forwarding. (#8)

  • Proxy did not handle "starting" status - The per-instance idle proxy (ports 8000-8020) only checked for "sleeping" and "stopped" statuses. If a request arrived at a proxy port while the model was still loading ("starting"), it was forwarded immediately to the unready llama-server and failed. The proxy now waits for the model to become healthy before forwarding. (#8)

  • Connection errors on forward after model load - Added retry logic (3 attempts, 2s interval) for ConnectionError/ConnectionRefusedError on all request forwarding paths: Ollama streaming, Ollama non-streaming, OpenAI passthrough, and idle proxy forwarding. This handles transient failures in the brief window between health-check success and the server being fully ready for inference. (#8)

  • Preset settings ignored on model relaunch - When a sleeping or stopped model was relaunched (woken by a request or restarted), relaunch_inactive_instance used the config stored from the original launch. Any preset changes made while the model was inactive (e.g. adjusted GPU layers, context size, idle timeout) were ignored. The function now reloads current presets from storage before rebuilding the launch command. (#9)
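The connection-error retry described in the forwarding fix (3 attempts, 2s interval) might look like the sketch below; the wrapper name and the zero-argument `send` callable are assumptions:

```python
import time

def forward_with_retry(send, attempts=3, interval=2.0):
    """Retry transient connection failures when forwarding a request.

    Covers the brief window between a health-check success and the
    server being fully ready for inference; other errors propagate.
    """
    last = None
    for i in range(attempts):
        try:
            return send()
        except (ConnectionError, ConnectionRefusedError) as exc:
            last = exc
            if i < attempts - 1:
                time.sleep(interval)
    raise last
```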

Improvements

  • _llamaman_lock released faster - The global model launch/eviction lock is no longer held during model health polling. This unblocks concurrent launches of different models that were previously serialized behind a single long model load.

  • Health checks use internal port directly - Request handlers now poll the llama-server health endpoint on the internal port rather than routing through the idle proxy, avoiding unnecessary proxy thread blocking during model startup.

0.7.1

26 Mar 17:49

Bug Fixes

  • Fixed requests timing out when an instance is still loading a model. Health checks now use a dedicated MODEL_LOAD_TIMEOUT (default 300s) instead of the short HEALTH_CHECK_TIMEOUT, so large models have time to load without dropping requests. (#8)

New Environment Variable

  • MODEL_LOAD_TIMEOUT - seconds to wait for a model to become healthy during launch/relaunch (default: 300)
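Reading the variable likely amounts to a one-liner; the helper name is illustrative, the variable name and default come from the note above:

```python
import os

def model_load_timeout():
    """Seconds to wait for a model to become healthy during launch or
    relaunch; configurable via MODEL_LOAD_TIMEOUT (default 300)."""
    return int(os.environ.get("MODEL_LOAD_TIMEOUT", "300"))
```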