Releases: nullata/llamaman
0.9.6
Docker-in-Docker architecture #37
LlamaMan no longer bundles or calls llama.cpp directly. Instead it spawns each model server as a sibling Docker container using the official ghcr.io/ggml-org/llama.cpp:server-* images via the Docker socket. This is the foundational change that everything else in this release builds on.
- LlamaMan is now a lightweight Python-only container with no GPU dependency of its own
- llama-server containers are created, started, stopped, and removed through the Docker SDK
- GPU passthrough, port binding, volume mounts, CPU quota, and memory limits are applied per-container at launch time
- Models volume is passed to sub-containers using a `MODELS_HOST_DIR` env var that resolves the actual host-side path for the bind mount
- Backing containers are always cleaned up: `stop_container` now catches errors from `stop()` and calls `remove()` regardless, so already-exited containers don't leave orphaned records
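For illustration, a minimal sketch of the sibling-container pattern with the Docker SDK for Python - the image tag, command, port mapping, and helper names here are illustrative, not LlamaMan's actual internals:

```python
import docker

client = docker.from_env()  # talks to the mounted /var/run/docker.sock

def start_llama_server(models_host_dir: str, port: int = 8000):
    # Spawn llama-server as a sibling container. The bind mount must use
    # the host-side path (hence a MODELS_HOST_DIR-style env var), because
    # the Docker daemon resolves volume sources on the host, not inside
    # the llamaman container.
    return client.containers.run(
        "ghcr.io/ggml-org/llama.cpp:server-cuda",
        command=["-m", "/models/model.gguf", "--port", str(port)],
        detach=True,
        ports={f"{port}/tcp": port},
        volumes={models_host_dir: {"bind": "/models", "mode": "ro"}},
        device_requests=[
            docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
        ],
    )

def stop_container(container) -> None:
    # Always remove, even if stop() fails (e.g. the container already
    # exited), so no orphaned records are left behind.
    try:
        container.stop()
    except docker.errors.APIError:
        pass
    container.remove(force=True)
```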
Universal GPU support - single image for all vendors
- Single `Dockerfile` - `Dockerfile.cuda` and `Dockerfile.rocm` are replaced by one `Dockerfile`. One image tag covers NVIDIA, AMD (ROCm), Intel Arc, and CPU-only
- Auto-detection at startup - LlamaMan probes the host: pynvml for NVIDIA, `/sys/class/drm` sysfs for AMD and Intel Arc. The detected vendor is logged at startup
- `LLAMA_IMAGE` auto-default - if the env var is not set, the image is selected from the detected vendor (`server-cuda` / `server-rocm` / `server-sycl` / `server`)
- `GPU_TYPE` override - set to `cuda`, `rocm`, or `intel` to skip auto-detection
- Intel Arc support - new `intel` branch in `_run_container`: mounts `/dev/dri`, adds the `video`/`render` groups, uses the `server-sycl` image by default. Per-instance GPU device selection is not supported on Intel Arc
- Single `docker-compose.yml` - the separate ROCm profile service is removed; the `/sys/class/drm:ro` mount is included by default; the NVIDIA toolkit `utility` capability block is present as a commented-out section
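A sketch of what vendor probing along these lines can look like; the function and mapping names are hypothetical, while the PCI vendor IDs (`0x1002` AMD, `0x8086` Intel) are standard:

```python
import glob

def detect_gpu_vendor() -> str:
    # NVIDIA: pynvml only initializes when the NVIDIA driver is visible.
    try:
        import pynvml
        pynvml.nvmlInit()
        if pynvml.nvmlDeviceGetCount() > 0:
            return "cuda"
    except Exception:
        pass
    # AMD / Intel Arc: check PCI vendor IDs exposed under /sys/class/drm.
    for vendor_file in glob.glob("/sys/class/drm/card*/device/vendor"):
        vendor = open(vendor_file).read().strip()
        if vendor == "0x1002":
            return "rocm"
        if vendor == "0x8086":
            return "intel"
    return "cpu"

# Map the detected vendor to a llama.cpp server image tag suffix.
IMAGE_SUFFIX = {"cuda": "server-cuda", "rocm": "server-rocm",
                "intel": "server-sycl", "cpu": "server"}
```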
Native GPU monitoring
- VRAM and utilization are now queried inside the llamaman container directly - no running llama-server instance required
- NVIDIA: uses pynvml. Requires uncommenting the `deploy.resources.reservations` block in `docker-compose.yml` to grant the toolkit `utility` capability
- AMD / Intel Arc: reads `mem_info_vram_used`, `mem_info_vram_total`, `gpu_busy_percent`, and `product_name` from `/sys/class/drm` sysfs (the `:ro` mount in the compose file)
- Falls back to the previous exec-based `nvidia-smi`/`rocm-smi` approach when native access is not configured and a container is running
- The GPU panel no longer returns an error when no llama-server containers are running
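A sketch of the sysfs read path, assuming an amdgpu-style card directory; the helper name is illustrative:

```python
from pathlib import Path

def read_amdgpu_stats(card: str = "card0") -> dict:
    # The amdgpu driver exposes VRAM counters (bytes) and a busy
    # percentage as plain-text files under the card's device directory.
    dev = Path("/sys/class/drm") / card / "device"

    def read(name: str) -> str:
        return (dev / name).read_text().strip()

    return {
        "vram_used_bytes": int(read("mem_info_vram_used")),
        "vram_total_bytes": int(read("mem_info_vram_total")),
        "gpu_busy_percent": int(read("gpu_busy_percent")),
        "product_name": read("product_name"),
    }
```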
Container resource monitoring
- Each running instance card shows live stats updated every 3 seconds: CPU%, core quota, RAM used / limit, and GPU assignment
- CPU quota is read from the instance's configured `threads` value (the Docker `nano_cpus` setting), not from `online_cpus`, which always reflects the host CPU count
- GPU assignment is resolved from the instance config against the detected GPU list - no container inspection needed
- Stats are fetched in parallel via a `ThreadPoolExecutor` to avoid blocking the UI on slow Docker API calls
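A sketch of the parallel-stats idea, using the Docker SDK's one-shot `stats(stream=False)` call and the standard Docker CPU% formula; helper names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor
import docker

client = docker.from_env()

def container_stats(container) -> dict:
    # One-shot stats snapshot; each call can take a second or more,
    # which is why the instance cards are fetched in parallel.
    s = container.stats(stream=False)
    cpu_delta = (s["cpu_stats"]["cpu_usage"]["total_usage"]
                 - s["precpu_stats"]["cpu_usage"]["total_usage"])
    sys_delta = (s["cpu_stats"]["system_cpu_usage"]
                 - s["precpu_stats"].get("system_cpu_usage", 0))
    online = s["cpu_stats"].get("online_cpus", 1)
    cpu_pct = (cpu_delta / sys_delta) * online * 100 if sys_delta else 0.0
    return {
        "cpu_percent": round(cpu_pct, 1),
        "mem_used": s["memory_stats"].get("usage", 0),
        "mem_limit": s["memory_stats"].get("limit", 0),
    }

def all_stats(containers) -> list[dict]:
    # Fetch every container's stats concurrently to keep the UI responsive.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(container_stats, containers))
```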
Per-container resource limits
- CPU Threads now applies both `--threads N` to llama-server and a Docker CPU quota (`nano_cpus`) to the container, capping the cores it can use. Leave blank for no limit
- Memory Limit - new field in the launch form (e.g. `32g`, `8192m`). Sets `mem_limit` on the spawned container. Saved in presets. Leave blank for no limit
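A sketch of how a thread count can map to both knobs (`nano_cpus` is the Docker SDK's unit of 10^-9 CPUs, so one full core is 1_000_000_000); the helper is hypothetical:

```python
def resource_kwargs(threads: int | None, mem_limit: str | None) -> dict:
    # threads=4 caps the container at 4 cores' worth of CPU time.
    # Blank fields mean no limit at all.
    kwargs = {}
    if threads:
        kwargs["nano_cpus"] = threads * 1_000_000_000
    if mem_limit:                  # e.g. "32g" or "8192m"
        kwargs["mem_limit"] = mem_limit
    return kwargs

# Passed through to client.containers.run(..., **resource_kwargs(4, "32g")),
# alongside "--threads 4" on the llama-server command line.
```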
Docker image management
- Pull image by name - a new text input in the Docker Images tab lets you pull any image by name directly (e.g. `ghcr.io/ggml-org/llama.cpp:server-cuda`) without it needing to be in the tracked list first
- Delete local image - each image in the list now has a delete button that removes it from Docker and from the tracked list. Disabled for the active `LLAMA_IMAGE`. Returns an error if Docker refuses (e.g. the image is in use by a running container)
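A sketch of both operations via the Docker SDK; the error handling shown is illustrative:

```python
import docker

client = docker.from_env()

def pull_image(name: str):
    # e.g. "ghcr.io/ggml-org/llama.cpp:server-cuda"
    return client.images.pull(name)

def delete_image(name: str):
    # Surface Docker's refusal (e.g. image in use by a running
    # container) rather than force-removing.
    try:
        client.images.remove(name)
    except docker.errors.APIError as exc:
        return {"error": exc.explanation}
```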
Model backup and restore #39
- Download Stored Models JSON - exports all scanned models with their preset configs to a timestamped JSON file
- Restore from JSON - upload a previously exported backup. For each model in the file:
  - Already present on disk: preset is merged in (existing values are not overwritten)
  - Not present but has a HuggingFace source: download is queued immediately and the preset is pre-populated at the expected post-download path so it is ready when the file lands
  - Not present and no known source: reported as unrestorable
- Results are shown inline with per-model status badges (present / queued / missing / error)
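The merge rule for already-present models amounts to "imported values fill gaps only"; a minimal sketch with a hypothetical helper:

```python
def merge_preset(existing: dict, imported: dict) -> dict:
    # Values from the backup only fill gaps; anything already
    # configured on disk wins over the imported preset.
    return {**imported, **existing}

# merge_preset({"ctx": 8192}, {"ctx": 4096, "gpu_layers": 99})
# -> {"ctx": 8192, "gpu_layers": 99}
```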
Repeat penalty in proxy sampling overrides
- New Repeat Penalty field in the per-instance proxy sampling overrides section
- Default `0` (disabled - not injected into requests). Range `0`–`2.0`
- Only injected into proxied requests when set above `0`, so leaving it at the default has no effect on clients that set their own value
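A sketch of the injection rule, with a hypothetical helper name:

```python
def apply_sampling_overrides(body: dict, repeat_penalty: float = 0.0) -> dict:
    # 0.0 means disabled: the key is never injected, so clients that
    # set their own repeat_penalty pass through untouched.
    if repeat_penalty > 0:
        body["repeat_penalty"] = repeat_penalty
    return body
```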
0.8.9-4
- Display source repository info on model cards (#40) - Added UI support for showing the HuggingFace `repo_id` a model was downloaded from. CSS, JS, and template changes only - no backend changes
- Fix per-instance proxy blocking on request body read - `_extract_model_from_request` was calling `wsgi.input.read()` with no argument, which reads the raw socket until EOF (blocking until the client disconnects). Fixed by reading exactly `CONTENT_LENGTH` bytes, so the proxy no longer hangs waiting for the connection to close before forwarding (see the sketch after this list)
- Model name validation for healthy instances on per-instance proxy ports - Previously, model name validation (returning 404 for a mismatched `model` field) only ran for sleeping/stopped instances. The check now runs after all wake/wait logic, so healthy and starting instances are validated consistently - sending the wrong model name to a port always returns 404 regardless of instance state
- Docs and version bump (0.8.9-4) - README and DOCKERHUB.md updates covering per-instance proxy behavior and model validation rules, a MariaDB/MySQL setup snippet with `CREATE DATABASE`/`CREATE USER`/`GRANT` commands, and a minor docker-compose correction
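A sketch of the body-read fix in plain WSGI terms; the helper names are illustrative:

```python
import json

def _read_request_body(environ) -> bytes:
    # Read exactly CONTENT_LENGTH bytes: calling
    # environ["wsgi.input"].read() with no argument can block until
    # the client closes the connection.
    try:
        length = int(environ.get("CONTENT_LENGTH") or 0)
    except ValueError:
        length = 0
    return environ["wsgi.input"].read(length) if length > 0 else b""

def extract_model(environ):
    # Hypothetical helper: pull the "model" field out of a JSON body.
    try:
        return json.loads(_read_request_body(environ)).get("model")
    except (ValueError, AttributeError):
        return None
```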
0.8.9
Model Favorites & Notes (#35)
- Star/favorite models in the sidebar model library - click the star icon to mark favorites, which sort alphabetically at the top of the list
- Favorite toggle in settings - a star button appears in the Launch Instance tab bar (far right) for quick access
- Model notes - a new "Note" text field in the Launch Instance form lets you add a note to any model, saved automatically on blur
- Favorites and notes are stored as part of model presets and persist across sessions
- Added a `PATCH /api/presets/<path>` endpoint for lightweight partial preset updates (favorite/note only, no full preset required)
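A sketch of such a partial-update endpoint, assuming a Flask-style route; `load_preset` and `save_preset` are hypothetical storage helpers:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/api/presets/<path:model_path>", methods=["PATCH"])
def patch_preset(model_path):
    # Partial update: only the provided keys (favorite / note) change;
    # the rest of the preset is left untouched.
    patch = request.get_json(force=True)
    preset = load_preset(model_path)   # hypothetical storage helpers
    preset.update({k: v for k, v in patch.items()
                   if k in ("favorite", "note")})
    save_preset(model_path, preset)
    return jsonify(preset)
```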
Proxy Wake-on-Request by Model Name (#36)
- Fixed: when sending an OpenAI API request directly to a sleeping instance's port (e.g. `POST http://localhost:8000/v1/chat/completions`), the idle proxy now inspects the `model` field in the request body and wakes the sleeping instance if the model matches (sketched after this list)
- If the requested model doesn't match the sleeping instance, the proxy returns a clear `404` error instead of a generic failure
- If the original instance record is gone but a sleeping instance with a matching model exists on that port, the proxy finds and wakes it
- Non-inference requests (health checks, etc.) continue to wake the instance unconditionally
- The main llamaman proxy on port 42069 is unaffected - all changes are scoped to the per-instance idle proxy (ports 8000-8020)
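The wake decision can be summarized as a small decision table; this sketch uses hypothetical names, assumes these two inference paths, and returns the chosen action as a string:

```python
INFERENCE_PATHS = ("/v1/chat/completions", "/v1/completions")

def resolve_wake_action(path: str, requested_model, instance) -> str:
    # Decision flow for the per-instance idle proxy (ports 8000-8020).
    if not path.startswith(INFERENCE_PATHS):
        return "wake"            # non-inference traffic (health checks, etc.)
    if instance is None:
        return "find_and_wake"   # record gone: search sleeping instances by model
    if requested_model and requested_model != instance.model:
        return "reject_404"      # clear error instead of a generic failure
    return "wake"
```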
0.8.7
0.8.6
Model Downloads
- Added model list download for redeployment backups
- Integrated repo source download into settings store
- Failed downloads now auto-retry
Proxy / Parameter Overrides
- Added proxy support for temperature, top-p, and top-k
- Added upper bound enforcement and presence_penalty to parameter overrides
UI / Visual Improvements
- Refactored resource bars: consolidated inline CSS into proper classes, adjusted geometry, styling, and progress bar colors
- Status bar layout and color tone adjustments
- GPU polling frequency tuned
- System info card visibility state properly initialized
- General QoL visual polish
0.8.3
0.7.8
0.7.3
- Release gate slot when per-instance proxy returns 502 (the `resp=None` path never called `gate.release()`, leaking slots permanently) (#10)
- Add `requests.Timeout` to retry exception lists (`ReadTimeout` doesn't inherit from `ConnectionError`, so timeouts were never retried)
- Add `REQUEST_TIMEOUT` env var (default 300s) replacing all hardcoded `timeout=300` values (#11)
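The inheritance detail behind the fix, shown directly against the `requests` exception hierarchy:

```python
import requests

# requests.ReadTimeout subclasses requests.Timeout, not
# requests.ConnectionError, so a retry tuple of (ConnectionError,)
# alone never catches read timeouts.
RETRYABLE = (requests.ConnectionError, requests.Timeout)

assert issubclass(requests.ReadTimeout, requests.Timeout)
assert not issubclass(requests.ReadTimeout, requests.ConnectionError)
```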
0.7.2
Bug Fixes
- Preset `idle_timeout_min` ignored on API-triggered launch - When a model was auto-launched via the Ollama or OpenAI-compatible API endpoints, the `idle_timeout_min` preset setting was not passed to `launch_instance`. It always defaulted to `0` (disabled), so no idle proxy was created and the background poller never enforced the timeout. The model would run indefinitely regardless of preset configuration. (#9)
- Initial request lost during model loading - When a request triggered a model load (cold start), `_ensure_model_running` held `_llamaman_lock` for the entire model load duration (up to 300s) while polling health through the public proxy port. This caused the original request to time out before the prompt was ever forwarded to the model. Fixed by splitting the flow: `_ensure_model_running` now returns as soon as the model is launched, and each request handler waits for readiness independently on the internal port before forwarding. (#8)
- Proxy did not handle `"starting"` status - The per-instance idle proxy (ports 8000-8020) only checked for the `"sleeping"` and `"stopped"` statuses. If a request arrived at a proxy port while the model was still loading (`"starting"`), it was forwarded immediately to the unready llama-server and failed. The proxy now waits for the model to become healthy before forwarding. (#8)
- Connection errors on forward after model load - Added retry logic (3 attempts, 2s interval) for `ConnectionError`/`ConnectionRefusedError` on all request forwarding paths: Ollama streaming, Ollama non-streaming, OpenAI passthrough, and idle proxy forwarding. This handles transient failures in the brief window between health-check success and the server being fully ready for inference (a sketch follows this list). (#8)
- Preset settings ignored on model relaunch - When a sleeping or stopped model was relaunched (woken by a request or restarted), `relaunch_inactive_instance` used the config stored from the original launch. Any preset changes made while the model was inactive (e.g. adjusted GPU layers, context size, idle timeout) were ignored. The function now reloads current presets from storage before rebuilding the launch command. (#9)
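A sketch of the retry shape described above (3 attempts, 2s interval); names and the timeout value are illustrative:

```python
import time
import requests

RETRY_ATTEMPTS = 3
RETRY_INTERVAL_S = 2

def forward_with_retry(url: str, payload: dict) -> requests.Response:
    # Retries cover the brief window between a successful health check
    # and llama-server being fully ready to accept inference requests.
    for attempt in range(RETRY_ATTEMPTS):
        try:
            return requests.post(url, json=payload, timeout=300)
        except (requests.ConnectionError, ConnectionRefusedError):
            if attempt == RETRY_ATTEMPTS - 1:
                raise
            time.sleep(RETRY_INTERVAL_S)
```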
Improvements
- `_llamaman_lock` released faster - The global model launch/eviction lock is no longer held during model health polling. This unblocks concurrent launches of different models that were previously serialized behind a single long model load.
- Health checks use internal port directly - Request handlers now poll the llama-server health endpoint on the internal port rather than routing through the idle proxy, avoiding unnecessary proxy thread blocking during model startup.
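A sketch of the lock-scoping pattern; `is_running`, `launch_instance`, `wait_until_healthy`, and `internal_port` are hypothetical helpers:

```python
import threading

_llamaman_lock = threading.Lock()

def ensure_model_running(model: str) -> None:
    # Hold the global lock only for the launch/eviction decision...
    with _llamaman_lock:
        if not is_running(model):
            launch_instance(model)
    # ...and poll health on the internal port outside the lock, so a
    # slow model load no longer serializes launches of other models.
    wait_until_healthy(internal_port(model))
```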
0.7.1
Bug Fixes
- Fixed requests timing out when an instance is still loading a model. Health checks now use a dedicated `MODEL_LOAD_TIMEOUT` (default 300s) instead of the short `HEALTH_CHECK_TIMEOUT`, so large models have time to load without dropping requests. (#8)
New Environment Variable
- `MODEL_LOAD_TIMEOUT` - seconds to wait for a model to become healthy during launch/relaunch (default: 300)