drussell23/JARVIS-Prime
JARVIS Prime

The Mind of the AGI OS — LLM inference, Neural Orchestrator Core, and cross-repo coordination

🚀 v100.0 Neural Orchestrator Core | 🧠 Unified Intelligent Routing | ⚡ Zero Hardcoding | 🔥 Async by Default | 🛡️ Safety-Aware | 🔄 Zero-Downtime Hot Swap | 💪 Production-Grade Resilience | 🌐 Cross-Repo Integration | 📊 v221.0 Model Loading Progress Preservation | 🎯 v236.0 Adaptive Prompt System | 🛡️ v238.0 Degenerate Response Defense-in-Depth | 🤖 v241.1 Multi-Model Task-Type Routing (11 Models) | 📡 v243.0 Command Lifecycle Events | 🧹 v244.0 Brain Vacuum Classification Fix

JARVIS Prime is the cognitive layer of the JARVIS AGI ecosystem. It runs 11 self-hosted specialist LLMs (~40.4 GB, Q4_K_M quantized) on a dedicated GCP Invincible Node — not OpenAI, not Claude, not any third-party API. All inference happens on your own infrastructure with zero per-token costs and complete data privacy. As of v241.1, J-Prime routes each query to the optimal model for its task type: math queries go to Qwen2.5-Math-7B (83.6% on the MATH benchmark), code queries to Qwen2.5-Coder-7B (70.4% HumanEval), reasoning queries to DeepSeek-R1 (explicit chain-of-thought), and simple queries to Phi-3.5-mini (~3s latency). Prime also provides the Neural Orchestrator Core (unified routing), AGI models, reasoning engines, and first-class integration with JARVIS (Body) and Reactor-Core (Nerves). It can be started standalone or by the unified supervisor in JARVIS; during startup, model loading progress is preserved across the Early Prime → Trinity handoff (v221.0).


Session Update (2026-03-18): Unlock-Domain Safeguards and Fast-Path Classification

This session hardened J-Prime against a recurring cross-repo failure mode: biometric unlock utterances being misclassified as workspace/general tasks. Prime now applies explicit unlock-domain safeguards before standard LLM classification.

1) Classification Schema Hardening

jarvis_prime/core/classification_schema.py now includes explicit unlock semantics:

  • Added voice_unlock to the domain enum and DOMAIN_TO_TASK_TYPE.
  • Added seven example mappings covering common unlock phrasing variants.
  • Added a CRITICAL guardrail note that unlock utterances must not be routed as workspace tasks.

This gives both rule-based and model-driven paths a stable canonical domain for unlock requests.
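A minimal sketch of what such a schema addition could look like. `DOMAIN_TO_TASK_TYPE` and the `voice_unlock` domain come from this README; the `Domain` enum name, the example phrasings, and the `domain_for` guardrail helper are illustrative assumptions, not the repo's actual code:

```python
from enum import Enum

class Domain(str, Enum):
    """Canonical classification domains (illustrative subset)."""
    WORKSPACE = "workspace"
    GENERAL = "general"
    VOICE_UNLOCK = "voice_unlock"  # explicit unlock domain (assumed name from source)

# Domain -> task-type routing. Structure is assumed; only voice_unlock is sourced.
DOMAIN_TO_TASK_TYPE = {
    Domain.WORKSPACE: "workspace_task",
    Domain.GENERAL: "general_chat",
    Domain.VOICE_UNLOCK: "voice_unlock",
}

# Few-shot examples covering common unlock phrasing variants (illustrative)
UNLOCK_EXAMPLES = [
    "unlock my screen",
    "jarvis, unlock the mac",
    "voice unlock",
]

def domain_for(utterance: str) -> Domain:
    """CRITICAL guardrail: unlock utterances must never route as workspace tasks."""
    if any(kw in utterance.lower() for kw in ("unlock", "log me in")):
        return Domain.VOICE_UNLOCK
    return Domain.GENERAL
```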

2) J-Prime Spinal Reflex (v284.0)

run_server.py now applies a lightweight Python unlock-pattern guard before invoking Phi classification:

  • If unlock intent is detected, the request short-circuits directly to the voice_unlock domain.
  • The bypass path executes in the sub-millisecond to low-millisecond range and avoids an unnecessary LLM classification call.
  • Result: unlock routing remains deterministic even when prompt/classifier behavior drifts.
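The reflex shape is a compiled pattern check ahead of the LLM call. A hedged sketch (the patterns and function names are illustrative; the real guard in run_server.py may differ):

```python
import re

# Compiled once at import time so the reflex stays in the sub-millisecond range.
_UNLOCK_PATTERNS = re.compile(
    r"\b(unlock|log\s*me\s*in|open\s+my\s+(mac|screen|session))\b",
    re.IGNORECASE,
)

def classify_with_phi(query: str) -> str:
    """Placeholder for the Phi-based LLM classifier call."""
    return "general_chat"

def classify(query: str) -> str:
    """Spinal reflex: check unlock patterns before any LLM classification."""
    if _UNLOCK_PATTERNS.search(query):
        return "voice_unlock"          # deterministic short-circuit
    return classify_with_phi(query)    # fall through to the LLM classifier
```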

3) Enriched Query Hints from Body

backend/core/jarvis_prime_client.py now forwards unlock intent hints into Prime request context:

  • domain_hint
  • not_workspace

These keys are rendered into the enriched query payload so Prime-side classifiers and guards receive explicit anti-misrouting intent signals.
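A minimal sketch of the Body-side hint enrichment. The `domain_hint` and `not_workspace` keys come from this README; the function name and payload shape are assumptions:

```python
def enrich_request(query: str, unlock_intent: bool) -> dict:
    """Attach anti-misrouting hints to the Prime request context
    (payload shape is illustrative, not the repo's actual wire format)."""
    context: dict = {}
    if unlock_intent:
        context["domain_hint"] = "voice_unlock"  # explicit target domain
        context["not_workspace"] = True          # explicit anti-misrouting signal
    return {"query": query, "context": context}
```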

4) Why This Matters in Trinity

Unlock correctness now has protection on both sides of the Body↔Mind boundary:

  • Body-side: reflex + pre-flight guards + score biasing toward unlock.
  • Prime-side: schema-level unlock domain + pre-classification spinal reflex.

Combined, this significantly reduces the probability that biometric commands are treated as generic workspace operations.

5) Validation

Cross-repo nuance routing tests for unlock phrasing and paraphrases passed 50/50 in this session.


🎯 What is JARVIS Prime?

JARVIS Prime is the Mind in the three-repo Trinity architecture:

| Role | Repository | Responsibility |
|------|------------|----------------|
| Body | JARVIS (JARVIS-AI-Agent) | macOS integration, computer use, unified supervisor, voice/vision |
| Mind | JARVIS-Prime (this repo) | LLM inference, reasoning, Neural Orchestrator Core, OpenAI-compatible API |
| Nerves | Reactor-Core | Model training, fine-tuning, experience collection, model deployment |

Neural Orchestrator Core v100.0 is the single source of truth for routing (Tier 0/0.5/1/2, memory pressure, sticky routing, circuit breakers). Prime exposes health and model loading progress (model_load_progress_pct, startup_progress, etc.) so the JARVIS unified supervisor can show accurate progress and avoid regression during handoff (v221.0).

The Revolution: Neural Orchestrator Core v100.0

The Neural Orchestrator Core consolidates all routing systems (HybridTieredRouter, IntelligentModelRouter, CognitiveRouter, GraphRouter, Neural Switchboard) into a single, enterprise-grade unified routing architecture:

# Simple action → Tier 0 (Ultra Fast, Local)
"Turn on the lights" → Local execution (50ms, $0.00)

# Complex task → Tier 1 (Cloud Intelligence)
"Plan a comprehensive refactoring of the authentication system"
→ GCP Cloud with advanced reasoning ($0.15)

# Deep reasoning → Tier 2 (Deep Reasoning Models)
"Analyze the causal relationships in this distributed system"
→ Claude Opus 4 with deep reasoning ($0.50)

# Session continuity → Sticky Routing
"Continue the previous coding session" → Same model as before

Key Innovation: The Neural Orchestrator Core provides:

  • Unified Routing: Single source of truth for all routing decisions
  • Zero Hardcoding: All configuration via environment variables and YAML
  • Advanced Patterns: Protocol classes, contextvars, async generators, weakref, defensive decorators
  • Cross-Repo Integration: Seamless state sharing across JARVIS, JARVIS Prime, and Reactor Core
  • Memory-Aware Routing: Real-time memory pressure monitoring with macOS native integration
  • Sticky Routing: Session-based model affinity for continuity
  • Request Buffering: Zero-loss hot swap support
  • Circuit Breakers: Coordinated fault tolerance across all tiers
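As one concrete illustration, the sticky-routing idea reduces to session-keyed model affinity with a TTL. A minimal sketch under assumed names; the Neural Orchestrator Core's actual implementation also coordinates tiers, memory pressure, and circuit breakers:

```python
import time

class StickyRouter:
    """Session-based model affinity: reuse a session's previous model
    until the affinity expires (illustrative sketch, not the repo's code)."""

    def __init__(self, ttl_seconds: float = 900.0):
        self._ttl = ttl_seconds
        self._sessions: dict = {}  # session_id -> (model_id, last_seen)

    def route(self, session_id: str, default_model: str) -> str:
        entry = self._sessions.get(session_id)
        now = time.monotonic()
        if entry and now - entry[1] < self._ttl:
            model = entry[0]        # continuity: same model as before
        else:
            model = default_model   # fresh pick for new or expired sessions
        self._sessions[session_id] = (model, now)
        return model
```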

🧠 Self-Hosted Multi-Model LLM Fleet — Zero Third-Party API Dependencies

The Core Principle: Your Models, Your Infrastructure, Your Data

JARVIS Prime runs 11 self-hosted specialist language models. It does not use OpenAI, Claude, GPT-4, Gemini, or any third-party inference API for primary intelligence. When you ask JARVIS "solve 5x+3=18" — the response is generated by a math-specialist model (Qwen2.5-Math-7B) running on your own infrastructure. Ask "write a Python sort function" — a code-specialist model (Qwen2.5-Coder-7B) handles it. Every query is routed to the optimal model for that task type:

┌──────────────────────────────────────────────────────────────────────────┐
│                  JARVIS PRIME INFERENCE STACK (v241.1)                   │
│                  ═════════════════════════════════════                   │
│                                                                          │
│  Models:   11 specialist LLMs (~40.4 GB total, Q4_K_M GGUF)              │
│  Routable: 8 models active in task-type routing                          │
│  Engine:   llama-cpp-python (C++ backend with Python bindings)           │
│  API:      OpenAI-compatible (/v1/chat/completions)                      │
│  Host:     GCP Invincible Node (34.45.154.209:8000)                      │
│  Router:   GCPModelSwapCoordinator (pre-hook model selection)            │
│  Latency:  ~3s (simple) to ~8.6s (complex) per request, CPU-only         │
│                                                                          │
│  ✅ Self-hosted          ✅ No per-token costs                           │
│  ✅ Full data privacy    ✅ No rate limits                               │
│  ✅ Pre-loaded from      ✅ No vendor lock-in                            │
│     golden image         ✅ Fine-tunable by Reactor-Core                 │
│  ✅ Task-aware routing   ✅ Automatic model selection                    │
│                                                                          │
│  ❌ NOT OpenAI           ❌ NOT Claude                                   │
│  ❌ NOT GPT-4            ❌ NOT Gemini                                   │
│  ❌ NOT any third-party API                                              │
└──────────────────────────────────────────────────────────────────────────┘
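Because the API is OpenAI-compatible, any plain HTTP client can talk to the node. A stdlib sketch of building such a request — the endpoint, the `task_type` metadata, and the `X-Model-Id` response header come from this README; the `"model": "auto"` value and the exact accepted fields are assumptions:

```python
import json
import urllib.request

PRIME_URL = "http://34.45.154.209:8000/v1/chat/completions"

def build_request(prompt: str, task_type: str) -> urllib.request.Request:
    """Build an OpenAI-compatible chat request for the Prime node."""
    body = {
        "model": "auto",  # assumed: server-side coordinator picks the model
        "messages": [{"role": "user", "content": prompt}],
        "metadata": {"task_type": task_type},
    }
    return urllib.request.Request(
        PRIME_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending it would look like:
#   resp = urllib.request.urlopen(build_request("solve 5x + 3 = 18", "math_simple"))
#   resp.headers["X-Model-Id"]  # which specialist actually served the request
```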

The Model Fleet: 11 Specialist Models (v241.1)

JARVIS Prime hosts 11 GGUF-quantized models on an 80 GB SSD, with 8 routable through the GCP Model Swap Coordinator. Only one model is loaded in RAM at a time (~3-6.5 GB depending on model size), with intelligent sticky routing to prevent thrashing. All models use Q4_K_M quantization (4-bit, k-quant mixed precision) for the best quality-to-size ratio on CPU inference.

Routable Models (8) — Task-Type Specialists

| # | Model | Params | Disk | Role | Strengths | Weaknesses | Routed From |
|---|-------|--------|------|------|-----------|------------|-------------|
| 1 | Phi-3.5-mini-instruct | 3.8B | 2.2 GB | Fast lightweight | ~3s latency; great for simple factual Q&A, definitions, yes/no answers. Microsoft's best small model. MIT license. | Small context (4K); limited depth on complex topics; weaker reasoning than 7B models | greeting, simple_chat, quick_question, voice_command |
| 2 | Mistral-7B-Instruct-v0.2 | 7.24B | 4.4 GB | Translation | Strong multilingual support; good instruction following; well-tested with llama.cpp. Apache 2.0. The original J-Prime model. | Hallucinates multi-step math; weaker than Gemma-2 on general knowledge benchmarks; no code specialization | translate |
| 3 | Qwen2.5-7B-Instruct | 7B | 4.4 GB | Basic math & reasoning | Good at algebra, arithmetic, logic puzzles. 128K context capable. Strong Chinese + English. Apache 2.0. | Struggles with competition-level math, proofs, and multi-step mathematical reasoning beyond basic algebra | math_simple, reason_simple |
| 4 | Qwen2.5-Math-7B-Instruct | 7B | 4.4 GB | Math specialist | 83.6% on MATH benchmark (vs. GPT-4 ~76%). Purpose-built for mathematical reasoning with chain-of-thought. Best 7B math model. | Narrow focus; significantly weaker on non-math tasks (conversation, writing, code). Not suitable as a general model. | math_complex |
| 5 | DeepSeek-R1-Distill-Qwen-7B | 7B | 4.4 GB | Chain-of-thought reasoning | 55.5% on AIME 2024. Explicit step-by-step reasoning traces (`<think>...</think>` tokens). Strong analytical and logical reasoning. | Verbose reasoning tokens (slower effective generation); can over-explain simple queries; unpredictable response length | reason_complex, analyze |
| 6 | Qwen2.5-Coder-7B-Instruct | 7B | 4.4 GB | Code specialist | 70.4% HumanEval (beats CodeLlama-34B despite being ~5x smaller). Trained on 5.5 trillion code tokens. Multi-language support. Apache 2.0. | Narrow code focus; weaker on general conversation, creative writing, and non-technical tasks | code_simple, code_complex, code_review, code_explain, code_architecture, code_debug |
| 7 | Llama-3.1-8B-Instruct | 8B | 4.9 GB | Long context & creative | 128K context window (longest of all models). Strong narrative writing, creative brainstorming, and document summarization. Meta's best open 8B. | Not a specialist; slightly weaker on code than Qwen-Coder and on math than Qwen-Math. Larger disk footprint. | creative_write, creative_brainstorm, summarize |
| 8 | Gemma-2-9B-Instruct | 9B | 5.5 GB | General intelligence (default) | Best sub-10B generalist: MMLU 72.3%, HellaSwag 81.9%, ARC-C 68.4%. Excellent at conversational Q&A, analysis, and general knowledge. Google DeepMind. | Largest routable model (5.5 GB); slightly slower load time; not a code/math specialist | general_chat, unknown |

Pre-Staged Models (3) — Downloaded, Not Yet Routable

| Model | Disk | Status | Why Pre-Staged |
|-------|------|--------|----------------|
| LLaVA-v1.6-Mistral-7B | 4.9 GB | v242 roadmap | Needs CLIP vision encoder + multimodal inference pipeline. Language model portion is compatible with llama.cpp, but image understanding requires a separate vision encoder that J-Prime doesn't yet support. |
| TinyLlama-1.1B-Chat | 0.67 GB | Speculative decoding | Draft model for llama.cpp's speculative decoding — generates tokens fast (~30+ t/s CPU), validated by the primary model in batch. Can provide 2-3x speedup. Not useful on its own. |
| BGE-large-en-v1.5 | 0.17 GB | RAG embedding | Embedding model for retrieval-augmented generation. Encodes documents into vectors for semantic search. Requires a vector database pipeline (not built yet). No generate() path. |

Why Q4_K_M for All Models?

All 11 models use Q4_K_M quantization, which offers the best balance of quality and size for CPU inference:

  • Q4_K_M preserves more important weight dimensions at higher precision than Q4_0 or Q4_K_S
  • 4-7 GB per model fits within the 32 GB VM's RAM budget with room for OS and server overhead
  • Negligible quality loss vs. FP16 on instruction-following benchmarks
  • Optimized for llama.cpp's SIMD-accelerated inference kernels (AVX2/SSE4.2)
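The size figures check out with back-of-envelope arithmetic. Q4_K_M averages roughly 4.8-4.9 effective bits per weight once k-quant scales and higher-precision tensors are included (the exact figure varies by architecture; treat it as an approximation):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float = 4.85) -> float:
    """Rough GGUF file size for a given parameter count.
    4.85 bits/weight approximates Q4_K_M's effective density."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Mistral-7B (7.24B params) -> ~4.4 GB, matching the fleet table above
print(round(gguf_size_gb(7.24), 1))
```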

What Changed from Single-Model (pre-v241) to Multi-Model

| Aspect | Before (v238 and earlier) | After (v241.1) |
|--------|---------------------------|----------------|
| Models on disk | 1 (Mistral-7B, ~4.4 GB) | 11 models (~40.4 GB) |
| Routable models | 1 | 8 specialists |
| Math query | Mistral-7B hallucinates (5x+3=18 → x=11) | Qwen2.5-Math-7B solves correctly (x=3) |
| Code query | Mistral-7B (not code-trained) | Qwen2.5-Coder-7B (70.4% HumanEval) |
| Simple query | Mistral-7B (~8.6s) | Phi-3.5-mini (~3s) |
| Reasoning query | Mistral-7B (no CoT) | DeepSeek-R1 (explicit chain-of-thought) |
| General query | Mistral-7B | Gemma-2-9B (MMLU 72.3%) |
| Model selection | None — everything goes to one model | Task-type inference in JARVIS Body + GCPModelSwapCoordinator |
| Disk requirement | 50 GB | 80 GB |
| VM RAM | 16 GB (e2-standard-4) | 32 GB (e2-highmem-4) |

GCP Invincible Node: The Multi-Model Inference Server

The model fleet runs on a GCP Invincible Node — a persistent Compute Engine VM that resists automated shutdown:

┌──────────────────────────────────────────────────────────────────────────┐
│                  GCP INVINCIBLE NODE (v241.1)                            │
│                  ═════════════════════════════                           │
│                                                                          │
│  Instance:       jarvis-prime-node                                      │
│  External IP:    34.45.154.209                                          │
│  Port:           8000                                                   │
│  Machine Type:   e2-highmem-4 (4 vCPUs, 32 GB RAM)                      │
│  Region:         us-central1-a                                          │
│  OS:             Debian (GCP golden image)                              │
│  Disk:           80 GB persistent SSD (~40.4 GB models + OS/deps)       │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │  JARVIS Prime Server (run_server.py)                           │     │
│  │  ───────────────────────────────────                           │     │
│  │  • FastAPI + Uvicorn (port 8000)                               │     │
│  │  • llama-cpp-python inference engine                           │     │
│  │  • OpenAI-compatible API (/v1/chat/completions)                │     │
│  │  • Health endpoint (/health) with model_load_progress          │     │
│  │  • GCPModelSwapCoordinator (task-type → model routing)         │     │
│  │  • 11 models on disk, 1 loaded in RAM at a time                │     │
│  │  • X-Model-Id header in every response (telemetry)             │     │
│  │  • Pre-loaded from golden image disk (no download on boot)     │     │
│  └────────────────────────────────────────────────────────────────┘     │
│                                                                          │
│  ┌────────────────────────────────────────────────────────────────┐     │
│  │  InvincibleGuard (Active)                                      │     │
│  │  ────────────────────────                                      │     │
│  │  • Blocks automated termination from supervisor cleanup        │     │
│  │  • 4 blocked termination attempts (as of v235.4)               │     │
│  │  • Ensures model stays loaded across session boundaries        │     │
│  └────────────────────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────────────────────┘

InvincibleGuard is a critical component: it prevents the supervisor's automated lifecycle management from shutting down the VM while it is healthy and serving inference. Once a model is loaded, it stays loaded across multiple JARVIS sessions without needing to re-download or re-load.

Golden Image: Pre-Baked Multi-Model for Instant Boot

No models are downloaded at boot time. All 11 models are pre-baked into a GCP golden image — a snapshot of the VM disk with everything pre-installed:

Golden Image Contents (v241.1):
├── /opt/jarvis-prime/                         # JARVIS Prime codebase
│   ├── run_server.py                          # Server entry point
│   ├── jarvis_prime/                          # Core Python package
│   │   ├── server.py                          # FastAPI application
│   │   └── core/                              # Neural Orchestrator, routing,
│   │       │                                  # GCPModelSwapCoordinator, etc.
│   │       ├── gcp_model_swap_coordinator.py  # v241.0: task-type → model routing
│   │       ├── dynamic_model_registry.py      # Model specs, GCP_TASK_MODEL_MAPPING
│   │       └── llama_cpp_executor.py          # llama-cpp-python wrapper
│   └── models/                                # Model directory (~40.4 GB)
│       ├── manifest.json                      # Model inventory (primary source of truth)
│       ├── mistral-7b-instruct-v0.2.Q4_K_M.gguf         (4.4 GB)  — translation
│       ├── qwen2.5-7b-instruct-q4_k_m.gguf              (4.4 GB)  — basic math/reasoning
│       ├── qwen2.5-math-7b-instruct-q4_k_m.gguf         (4.4 GB)  — math specialist
│       ├── deepseek-r1-distill-qwen-7b-q4_k_m.gguf      (4.4 GB)  — CoT reasoning
│       ├── qwen2.5-coder-7b-instruct-q4_k_m.gguf        (4.4 GB)  — code specialist
│       ├── Phi-3.5-mini-instruct-Q4_K_M.gguf            (2.2 GB)  — fast lightweight
│       ├── Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf       (4.9 GB)  — long context
│       ├── gemma-2-9b-it-Q4_K_M.gguf                    (5.5 GB)  — general default
│       ├── llava-v1.6-mistral-7b.Q4_K_M.gguf            (4.9 GB)  — vision (pre-staged)
│       ├── tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf         (0.67 GB) — spec decoding draft
│       └── bge-large-en-v1.5-q4_k_m.gguf                (0.17 GB) — RAG embedding
├── Python 3.11 + all dependencies (pre-installed)
├── llama-cpp-python (compiled with CPU optimizations)
└── Startup script (auto-launches server on boot)

Boot sequence:

  1. GCP creates VM from golden image (~26 seconds)
  2. VM boots, startup script launches run_server.py (~30 seconds)
  3. Server loads default model (Mistral-7B) from local disk (no network download)
  4. GCPModelSwapCoordinator initializes, reads manifest.json, registers all 11 models
  5. Health endpoint reports ready_for_inference=True
  6. Total cold start: ~87 seconds (from NOT_FOUND to serving inference)

Without the golden image, the VM would need to download ~40.4 GB from HuggingFace on every cold boot, adding 30-60+ minutes. The golden image eliminates this entirely.
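A supervisor-side readiness check against the health endpoint can be sketched with the stdlib. The `ready_for_inference` and `model_load_progress_pct` field names come from this README; the polling cadence and error handling are illustrative choices:

```python
import json
import time
import urllib.request

def wait_until_ready(base_url: str, timeout_s: float = 180.0) -> dict:
    """Poll the Prime /health endpoint until ready_for_inference is true."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                health = json.load(resp)
            if health.get("ready_for_inference"):
                return health
            print(f"loading: {health.get('model_load_progress_pct', 0)}%")
        except OSError:
            pass  # VM may still be booting from the golden image
        time.sleep(2.0)
    raise TimeoutError("Prime node did not become ready in time")
```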

CPU Inference: Variable Latency by Model (v241.1)

The GCP Invincible Node runs on CPU-only hardware (e2-highmem-4, no GPU). With multi-model routing, latency varies by task type:

| Factor | Details |
|--------|---------|
| Hardware | 4 vCPUs (Intel x86_64), 32 GB RAM, no GPU/TPU |
| Inference mode | CPU-only via llama.cpp (AVX2/SSE4.2 SIMD acceleration) |
| Latency (simple) | ~3-4 seconds (Phi-3.5-mini, 3.8B — factual Q&A, definitions) |
| Latency (standard) | ~6-9 seconds (7B models — math, code, translation) |
| Latency (complex) | ~8-12 seconds (Gemma-2-9B, DeepSeek-R1 with reasoning traces) |
| Model swap time | ~20-30 seconds (SSD → RAM load + 5-token validation) |
| Token generation | ~3-5 tokens/second for 7B, ~6-10 t/s for 3.8B (CPU-bound) |
| Concurrent requests | 1 at a time (single model instance, sequential processing) |

Why ~8.6s is normal and expected for this configuration:

  1. CPU vs GPU arithmetic: GPU inference (e.g., NVIDIA A100) achieves 30-80 tokens/sec on 7B models via massive parallelism across thousands of CUDA cores. CPU inference uses 4-8 threads doing sequential matrix multiplications — it's fundamentally 10-50x slower per token.

  2. Q4_K_M quantization helps but doesn't eliminate the gap: 4-bit quantization reduces memory bandwidth requirements by ~4x compared to FP16, and llama.cpp uses AVX2 SIMD instructions to process 8 values per cycle. But CPU clock speeds (2-3 GHz) and limited core counts (4 vCPUs) still cap throughput at single-digit tokens/second.

  3. Prompt processing (prefill) is the bottleneck: Before generating the first token, the model must process the entire input prompt through all 32 transformer layers. For a 100-token prompt, that's 100 × 32 layers × 7B parameters worth of matrix operations — all on CPU.

  4. Memory bandwidth is the real limiter: Even with Q4_K_M reducing the model to ~4.37 GB, every token generation requires reading significant portions of the model weights from RAM. DDR4 bandwidth on standard GCP VMs (~25 GB/s) is orders of magnitude lower than GPU HBM bandwidth (~2 TB/s on A100).
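The bandwidth argument can be made quantitative: a memory-bound decoder cannot generate tokens faster than memory bandwidth divided by the bytes it must stream per token (roughly the whole weight file). Using the figures above:

```python
def peak_tokens_per_sec(model_gb: float, mem_bw_gb_s: float) -> float:
    """Rough upper bound for a memory-bandwidth-bound decoder:
    t/s <= bandwidth / bytes streamed per token (~ the model size)."""
    return mem_bw_gb_s / model_gb

# 4.37 GB Q4_K_M model on ~25 GB/s DDR4   -> ~5.7 t/s ceiling (observed: 3-5)
# same model on ~2,000 GB/s A100 HBM      -> ~458 t/s ceiling
print(round(peak_tokens_per_sec(4.37, 25), 1))
print(round(peak_tokens_per_sec(4.37, 2000)))
```

The observed 3-5 t/s sits just under the ~5.7 t/s DDR4 ceiling, which is consistent with the claim that bandwidth, not compute, is the limiter.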

Performance comparison by hardware:

┌────────────────────────────┬──────────────────┬───────────────┬────────────┐
│ Hardware                   │ Tokens/sec (7B)  │ Latency/req   │ Cost/hr    │
├────────────────────────────┼──────────────────┼───────────────┼────────────┤
│ GCP e2-standard-4 (CPU)    │ ~3-5 t/s         │ ~8.6s         │ ~$0.13     │
│ GCP n1-standard-8 (CPU)    │ ~6-10 t/s        │ ~4-5s         │ ~$0.38     │
│ GCP g2-standard-4 (L4)     │ ~25-35 t/s       │ ~1-2s         │ ~$0.70     │
│ GCP a2-highgpu-1g (A100)   │ ~50-80 t/s       │ ~0.3-0.5s     │ ~$3.67     │
│ Apple M1 Max (Metal GPU)   │ ~15-25 t/s       │ ~2-3s         │ N/A        │
└────────────────────────────┴──────────────────┴───────────────┴────────────┘

The e2-highmem-4 was chosen for cost efficiency with multi-model capability: at $0.134/hr ($97/month), it provides always-on inference across 8 specialist models for a fraction of the cost of GPU instances. With 32 GB RAM, it can comfortably load any single 7B model (~5.5 GB) or the 9B Gemma-2 (~6.5 GB) with ample headroom. For a personal AI assistant where requests are sporadic (not continuous high-throughput), 3-12s latency (depending on model/task) is an acceptable trade-off against 28x lower cost compared to an A100.

Future upgrade path: If latency becomes a bottleneck (e.g., real-time conversation, high concurrency), the architecture supports seamless migration to:

  • g2-standard-4 (NVIDIA L4 GPU): ~$0.70/hr, ~1-2s latency — best price/performance for inference
  • Larger CPU VM: Doubling vCPUs to n1-standard-8 would roughly halve latency to ~4-5s
  • Speculative decoding: Using a smaller draft model (TinyLlama 1.1B) to propose tokens, validated by Mistral-7B — can provide 2-3x speedup without hardware changes

What This Means in Practice

When a user types a message in the JARVIS frontend, the system intelligently routes to the best model:

Example 1: Math query (routed to Qwen2.5-Math-7B)

User: "solve 5x + 3 = 18"
  │
  │  Frontend (localhost:3000)
  │  └── WebSocket to localhost:8010
  │
  ▼
  Backend (localhost:8010, macOS)
  ├── _infer_task_type("solve 5x+3=18", "SIMPLE") → "math_simple"
  └── PrimeRouter → PrimeClient
        └── HTTP POST http://34.45.154.209:8000/v1/chat/completions
              │  metadata: {"task_type": "math_simple"}
              │
              ▼
        GCP Invincible Node
        ├── GCPModelSwapCoordinator.ensure_model("math_simple")
        │   └── Resolves → qwen-2.5-7b (basic math & reasoning specialist)
        └── Qwen2.5-7B generates the correct answer
              │  Response + X-Model-Id: qwen-2.5-7b
              ▼
        User sees: "x = 3" ✓ (not x=11, as Mistral-7B hallucinated)

Example 2: Code query (routed to Qwen2.5-Coder-7B)

User: "write a Python function to merge two sorted arrays"
  │
  ▼
  Backend: _infer_task_type() → "code_complex" (has_lang=Python + has_strong=function)
  └── metadata: {"task_type": "code_complex"}
        │
        ▼
  GCP: coordinator → qwen-2.5-coder-7b (70.4% HumanEval)
  └── Generates a correct O(n) merge implementation

Example 3: Simple query (routed to Phi-3.5-mini for speed)

User: "what is the capital of France?"
  │
  ▼
  Backend: _infer_task_type() → "simple_chat" (SIMPLE complexity, no specialist signals)
  └── metadata: {"task_type": "simple_chat"}
        │
        ▼
  GCP: coordinator → phi-3.5-mini (2.2 GB, ~3s latency)
  └── "Paris" — 3x faster than waiting for a 7B model
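`_infer_task_type` lives in the JARVIS Body backend; the following is a hypothetical sketch consistent with the three examples above (the real heuristic is richer, and the keyword lists here are invented for illustration):

```python
import re

def infer_task_type(query: str, complexity: str = "SIMPLE") -> str:
    """Hypothetical Body-side task-type inference, mirroring the examples:
    language + strong code verb -> code_complex; equation-like -> math_simple;
    otherwise simple/general chat."""
    q = query.lower()
    has_lang = bool(re.search(r"\b(python|rust|javascript|go|java)\b", q))
    has_strong = bool(re.search(r"\b(function|class|implement|refactor)\b", q))
    if has_lang and has_strong:
        return "code_complex"
    if re.search(r"\bsolve\b|[0-9]+\s*[a-z]?\s*[+\-*/=]", q):
        return "math_simple"
    if complexity == "SIMPLE":
        return "simple_chat"
    return "general_chat"
```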

No data leaves your infrastructure. The request travels from the Mac to the GCP VM, is processed entirely by your own models on your own VM, and the response returns to your Mac. No tokens are sent to OpenAI, Anthropic, Google, or any third party.

Emergency Fallback: Claude API (Tier 2 Only)

Claude API is only used as a last-resort emergency fallback (Tier 2) when:

  1. The GCP VM is completely unreachable (network failure, zone outage)
  2. AND the standard GCP VM fallback also fails
  3. AND the request is classified as requiring deep reasoning

Fallback Chain (ordered by priority):

  1. GCP Golden Image VM ──→ 11 specialist models on Invincible Node (primary, ~3-12s)
  2. GCP Standard VM ──────→ Fresh VM with model download (backup, ~30-60 min cold start)
  3. Claude API ───────────→ Anthropic's API (emergency only, costs per token)

Under normal operation, 100% of requests go to the self-hosted model fleet. The Claude fallback exists for disaster recovery only and has never been triggered in production since the v233.2 golden image fixes.
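The chain reduces to ordered try/except iteration. A hedged sketch — the backend names and `call_*` helpers are illustrative stand-ins, with stubs simulating a total GCP outage for demonstration:

```python
def call_golden_image(prompt: str) -> str:
    raise ConnectionError("zone outage")        # stub: primary tier down

def call_standard_vm(prompt: str) -> str:
    raise ConnectionError("boot failed")        # stub: backup tier down

def call_claude(prompt: str) -> str:
    return f"[claude] {prompt}"                 # stub: emergency tier answers

def complete_with_fallback(prompt: str) -> str:
    """Walk the tiers in priority order; raise only if every tier fails."""
    backends = [
        ("golden-image-vm", call_golden_image),  # primary: 11-model fleet
        ("standard-vm", call_standard_vm),       # backup: fresh VM
        ("claude-api", call_claude),             # emergency only, per-token cost
    ]
    errors = []
    for name, backend in backends:
        try:
            return backend(prompt)
        except ConnectionError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all inference tiers failed: " + "; ".join(errors))
```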

Why Self-Hosted Matters

| Benefit | Description |
|---------|-------------|
| Zero per-token cost | No API billing. The only cost is the GCP VM compute (~$97/month for e2-highmem-4). Unlimited requests across all 8 specialist models. |
| Complete data privacy | Prompts and responses never leave your infrastructure. No third-party data retention policies apply. |
| No rate limits | No tokens-per-minute caps, no request queuing from provider-side throttling. |
| No vendor lock-in | The models are open-source (Apache 2.0, MIT). Switch to Llama-3, Qwen, Phi, or any GGUF model by changing one file. |
| Fine-tunable | Reactor-Core collects experience data from JARVIS interactions and can fine-tune the models for your specific use patterns. |
| Full control | Choose quantization level, context length, temperature, system prompts, and all inference parameters. No provider-imposed guardrails beyond what you configure. |
| Offline-capable | Once the VM is running, inference works with zero internet dependency (the models are on local disk). |
| Reproducible | Same model, same weights, same quantization = deterministic behavior (given the same temperature/seed). No provider-side model updates changing behavior unexpectedly. |

Adaptive Prompt System: Complexity-Aware Inference (v236.0, v238.0)

The Problem: One Prompt Does Not Fit All

Before v236.0, every request sent to JARVIS Prime — whether "what is 5+5?" or "design a microservice architecture" — received the same static system prompt, the same max_tokens=4096, and the same temperature=0.7. The system prompt included:

"You are JARVIS, an advanced AI assistant... Be concise but thorough"

Mistral-7B-Instruct interpreted "thorough" as a directive to be verbose, and the "advanced AI assistant" identity activated conversational, polite-assistant behavior. The result: asking "what is 5+5?" returned "Of course, the sum of five and five is ten. I'd be happy to help with any other mathematical queries you might have." instead of just 10.

This is a fundamental challenge with 7B-parameter models: they have limited instruction-following capacity. When a system prompt contains conflicting signals — "be an AI assistant" (conversational) vs. "be concise" (terse) — the model resolves the conflict in favor of the stronger training signal, which is almost always the conversational one.

The Solution: AdaptivePromptBuilder

JARVIS (Body) now classifies every query into one of 5 complexity levels before sending it to Prime, and dynamically adapts three parameters:

┌────────────┬────────────┬──────┬────────────────────────────────────────────────────────┐
│ Complexity │ max_tokens │ temp │ System Prompt Strategy                                 │
├────────────┼────────────┼──────┼────────────────────────────────────────────────────────┤
│ SIMPLE     │ 48         │ 0.0  │ NO identity. Few-shot examples only.                   │
│            │            │      │ "Reply with ONLY the direct answer."                   │
│            │            │      │ v238.0: Only math, spell/translate, yes/no (<8 words). │
│            │            │      │ "what is X?" queries moved to MODERATE.                │
├────────────┼────────────┼──────┼────────────────────────────────────────────────────────┤
│ MODERATE   │ 512        │ 0.3  │ JARVIS identity + "2-3 sentences. No filler."          │
│            │            │      │ v238.0: Default for all queries ≤15 words              │
│            │            │      │ (including "what is X?" and short abstract queries).   │
├────────────┼────────────┼──────┼────────────────────────────────────────────────────────┤
│ COMPLEX    │ 2048       │ 0.5  │ JARVIS identity + "Structured and thorough."           │
├────────────┼────────────┼──────┼────────────────────────────────────────────────────────┤
│ ADVANCED   │ 4096       │ 0.7  │ JARVIS identity + "Detailed analysis."                 │
├────────────┼────────────┼──────┼────────────────────────────────────────────────────────┤
│ EXPERT     │ 4096       │ 0.7  │ JARVIS identity + "Comprehensive. Edge cases."         │
└────────────┴────────────┴──────┴────────────────────────────────────────────────────────┘
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Three Techniques for 7B Model Compliance

Standard instruction text ("be concise") achieves ~60-70% compliance on 7B models. The v236.0 system uses three additional techniques to push this significantly higher:

1. Identity omission for SIMPLE queries

The JARVIS identity prefix ("You are JARVIS, an advanced AI assistant") is intentionally removed for SIMPLE queries. This eliminates the competing signal that pushes the model toward conversational behavior. For MODERATE and above, the identity is retained because longer responses benefit from the JARVIS personality.

2. Few-shot examples instead of abstract instructions

7B models follow patterns far more reliably than they follow meta-instructions. Instead of telling the model "for math, return just the result," the SIMPLE prompt includes concrete examples:

Q: 5+5
A: 10
Q: Capital of France?
A: Paris
Q: Define gravity
A: The force that attracts objects with mass toward each other.

The model sees these examples and pattern-matches: "short question โ†’ short answer."

3. Temperature 0.0 for deterministic output

At temperature=0.0, the model always selects the highest-probability token at each step. For factual questions with single correct answers (math, capitals, definitions), this eliminates sampling variation entirely. The model produces the same output every time โ€” no "sometimes verbose, sometimes terse" inconsistency.
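Taken together, the three techniques collapse into one lookup from complexity level to generation parameters. A minimal sketch, with prompt strings condensed from the table above (names here are illustrative; the real AdaptivePromptBuilder lives in the JARVIS Body repo):

```python
# Hypothetical sketch of complexity -> generation parameters (v236.0/v238.0 values).
FEW_SHOT = (
    "Reply with ONLY the direct answer.\n"
    "Q: 5+5\nA: 10\n"
    "Q: Capital of France?\nA: Paris\n"
)
IDENTITY = "You are JARVIS, an advanced AI assistant."

PROFILES = {
    # SIMPLE deliberately omits the identity prefix (technique 1) and uses
    # few-shot examples (technique 2) at temperature 0.0 (technique 3).
    "SIMPLE":   {"max_tokens": 48,   "temperature": 0.0, "system": FEW_SHOT},
    "MODERATE": {"max_tokens": 512,  "temperature": 0.3,
                 "system": IDENTITY + " Answer in 2-3 sentences. No filler."},
    "COMPLEX":  {"max_tokens": 2048, "temperature": 0.5,
                 "system": IDENTITY + " Be structured and thorough."},
    "ADVANCED": {"max_tokens": 4096, "temperature": 0.7,
                 "system": IDENTITY + " Provide detailed analysis."},
    "EXPERT":   {"max_tokens": 4096, "temperature": 0.7,
                 "system": IDENTITY + " Be comprehensive. Cover edge cases."},
}

def build_params(complexity: str) -> dict:
    """Return generation parameters for a classified complexity level."""
    return dict(PROFILES[complexity])  # copy so callers can't mutate the profile
```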

How This Reaches Prime (Cross-Repo Flow)

The adaptive parameters are set by JARVIS (Body) and sent to Prime via the standard /v1/chat/completions endpoint. From Prime's perspective, it receives normal OpenAI-compatible requests โ€” the intelligence is in what is sent, not in any Prime-side changes:

JARVIS Backend (macOS, port 8010)
  โ”‚
  โ”‚  QueryComplexityManager classifies "5+5?" โ†’ SIMPLE
  โ”‚  AdaptivePromptBuilder selects:
  โ”‚    system_prompt = "Reply with ONLY the direct answer...\nQ: 5+5\nA: 10\n..."
  โ”‚    max_tokens = 64
  โ”‚    temperature = 0.0
  โ”‚
  โ–ผ
  POST http://34.45.154.209:8000/v1/chat/completions
  {
    "model": "jarvis-prime",
    "messages": [
      {"role": "system", "content": "Reply with ONLY the direct answer..."},
      {"role": "user", "content": "what is 5+5?"}
    ],
    "max_tokens": 64,
    "temperature": 0.0
  }
  โ”‚
  โ–ผ
  JARVIS Prime (GCP VM, port 8000)
  โ””โ”€โ”€ llama-cpp-python โ†’ Mistral-7B-Instruct-v0.2 (Q4_K_M)
        โ”‚
        โ”‚  Sees few-shot pattern: Q โ†’ A (short)
        โ”‚  temp=0.0 โ†’ deterministic token selection
        โ”‚  max_tokens=64 โ†’ hard cap on output length
        โ”‚
        โ–ผ
  Response: "10"    (5 tokens including BOS/EOS)

For complex queries, the same flow sends the full JARVIS identity, max_tokens=4096, and temperature=0.7 โ€” giving the model maximum room for structured, detailed analysis.
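Because Prime only ever sees a standard OpenAI-compatible request, the Body-side call reduces to a plain HTTP POST. A minimal stdlib-only sketch (the URL mirrors the flow above; function names are illustrative, and the real PrimeClient adds routing, retries, and fallback):

```python
import json
import urllib.request

PRIME_URL = "http://34.45.154.209:8000/v1/chat/completions"

def build_request(query: str, system_prompt: str,
                  max_tokens: int, temperature: float) -> dict:
    """Assemble the OpenAI-compatible payload JARVIS (Body) sends to Prime."""
    return {
        "model": "jarvis-prime",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def ask_prime(query: str, **params) -> str:
    """POST the payload and return the assistant message content."""
    req = urllib.request.Request(
        PRIME_URL,
        data=json.dumps(build_request(query, **params)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```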

Verified Results (v236.0 + v238.0)

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Query                             โ”‚ Complexity โ”‚ Tokens โ”‚ Temp โ”‚ Response                         โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ "what is 5+5?"                    โ”‚ SIMPLE     โ”‚ 48     โ”‚ 0.0  โ”‚ 10                               โ”‚
โ”‚ "what's 5+5?"                     โ”‚ SIMPLE     โ”‚ 48     โ”‚ 0.0  โ”‚ 10                               โ”‚
โ”‚ "is water wet?"                   โ”‚ SIMPLE     โ”‚ 48     โ”‚ 0.0  โ”‚ Yes                              โ”‚
โ”‚ "spell onomatopoeia"              โ”‚ SIMPLE     โ”‚ 48     โ”‚ 0.0  โ”‚ O-N-O-M-A-T-O-P-O-E-I-A          โ”‚
โ”‚ "what is mathematics?"            โ”‚ MODERATE   โ”‚ 512    โ”‚ 0.3  โ”‚ Full definition (3 sentences)    โ”‚
โ”‚ "what is Java?"                   โ”‚ MODERATE   โ”‚ 512    โ”‚ 0.3  โ”‚ Full definition via gcp_prime    โ”‚
โ”‚ "define photosynthesis"           โ”‚ MODERATE   โ”‚ 512    โ”‚ 0.3  โ”‚ 2-3 sentence definition          โ”‚
โ”‚ "capital of France?"              โ”‚ MODERATE   โ”‚ 512    โ”‚ 0.3  โ”‚ Paris / The capital is Paris.    โ”‚
โ”‚ "explain how neural networks      โ”‚ COMPLEX    โ”‚ 2048   โ”‚ 0.5  โ”‚ Multi-paragraph structured       โ”‚
โ”‚  learn"                           โ”‚            โ”‚        โ”‚      โ”‚                                  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

v238.0 routing confirmed: [QUERY] Response from gcp_prime (latency: 24635.7ms)
Source: jarvis-prime-node at 34.45.154.209 (GCP Invincible Node golden image)

v238.0 Classification Change: Queries like "what is X?", "define X", "who is X?" were previously classified as SIMPLE (48 tokens, temp 0.0, stop sequences). This caused degenerate output ("...") when the model encountered abstract concepts. v238.0 moves these to MODERATE โ€” providing 512 tokens and temp 0.3, which is safe and cheap for all short queries while eliminating the degenerate response failure mode entirely.

The Path Beyond Prompting: Reactor-Core Fine-Tuning

The adaptive prompt system is the immediate fix โ€” it makes Mistral-7B behave correctly today. But prompt-based control is inherently limited for 7B models because instruction compliance is a function of model capacity.

The permanent solution is training the model itself to be concise for simple queries, using the Reactor-Core training pipeline that's already wired into the architecture:

 JARVIS (Body)              JARVIS Prime (Mind)         Reactor-Core (Nerves)
 โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€              โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€         โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€
 User: "5+5?"           โ†’   Mistral-7B โ†’ "10"      โ†’   TelemetryEmitter captures
                                                         (query, response, complexity,
                                                          latency, tokens_used)
                                                                โ”‚
                                                                โ–ผ
                                                         TrainingDataPipeline creates
                                                         DPO preference pairs:
                                                         {
                                                           prompt: "5+5?",
                                                           chosen: "10",
                                                           rejected: "Of course, the
                                                              sum of five and five..."
                                                         }
                                                                โ”‚
                                                                โ–ผ
                            Hot-swap fine-tuned       โ†   DPO training with ฮฒ=0.1
                            GGUF (zero downtime)          on accumulated preference data
                            Bake new golden image

After DPO training, conciseness for simple queries is encoded in the model's weights โ€” not dependent on a prompt instruction the model might ignore. The model learns when to be terse and when to be detailed from actual user interaction patterns, not from static rules.

The key components for this pipeline already exist:

  • TelemetryEmitter (JARVIS) โ€” captures every interaction, ships to Reactor-Core
  • TrainingDataPipeline (Prime) โ€” generates DPO preference pairs from conversations
  • RLHFIntegration (Prime) โ€” reward model training and PPO optimization
  • ReactorCoreBridge (Prime) โ€” submits fine-tuning jobs, tracks training, deploys finished models
  • HotSwapManager (Prime) โ€” swaps the model at runtime with zero request drops

v238.0: Degenerate Response Elimination (Defense-in-Depth)

The Problem: "..." as a Model Response

When JARVIS classified "what is mathematics?" as SIMPLE (48 tokens, temperature 0.0, stop sequences \n\n), Mistral-7B sometimes produced "..." followed by a double newline. The stop sequence truncated the output at "...", which then passed through the entire pipeline unchecked โ€” displayed in the UI and spoken aloud via TTS as "full stop."

This is a model behavior that any self-hosted LLM can exhibit when constrained with aggressive token limits, low temperature, and stop sequences on queries that require more than a one-word answer. The model begins generating a longer response, but the constraints truncate it to meaningless punctuation.

How v238.0 Protects the JARVIS โ†’ Prime Pipeline

The fix operates at three layers โ€” any one of which independently prevents garbage from reaching the user:

Layer 1: Classification (JARVIS Body โ€” query_complexity_manager.py)
  "what is mathematics?" โ†’ MODERATE (512 tokens, 0.3 temp, no stop sequences)
  โ†’ Mistral-7B has room to produce a full definition
  โ†’ Eliminates the root cause: the model was never wrong โ€” it was starved

Layer 2: Degenerate Retry (JARVIS Body โ€” query_handler.py)
  If Mistral-7B STILL produces punctuation-only output:
  โ†’ Backend detects content stripped to empty string
  โ†’ Retries once with MODERATE parameters
  โ†’ Retry request goes to Prime at 34.45.154.209:8000
  โ†’ Prime returns real response with sufficient token budget
  โ†’ try/except ensures retry failure doesn't lose original content

Layer 3: Client Suppression (JARVIS Body โ€” JarvisVoice.js)
  If "..." somehow reaches the frontend despite layers 1 and 2:
  โ†’ Frontend detects punctuation-only response
  โ†’ Suppresses display and TTS
  โ†’ Re-arms zombie timeout for automatic retry

Impact on Prime: Prime itself is unchanged โ€” it receives standard OpenAI-compatible requests and returns standard responses. The intelligence is in what JARVIS (Body) sends:

  • Before v238.0: max_tokens=48, temperature=0.0 for "what is mathematics?" โ†’ Prime dutifully truncates
  • After v238.0: max_tokens=512, temperature=0.3 for "what is mathematics?" โ†’ Prime generates full answer

The degenerate retry (Layer 2) may send a second request to Prime if the first response is garbage. This is a normal HTTP POST โ€” Prime processes it like any other request. The retry uses MODERATE parameters, which are safe for any query.
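Layer 2's detection amounts to stripping punctuation and whitespace and checking whether anything survives. A minimal sketch (function names are illustrative; the real logic lives in query_handler.py in the JARVIS Body repo):

```python
import string

def is_degenerate(text: str) -> bool:
    """True when a response is empty or punctuation/whitespace only (e.g. '...')."""
    return not text.strip(string.punctuation + string.whitespace)

async def query_with_retry(send, query, simple_params, moderate_params):
    """Retry once with MODERATE parameters if the first response is degenerate."""
    response = await send(query, **simple_params)
    if is_degenerate(response):
        try:
            retry = await send(query, **moderate_params)
            if not is_degenerate(retry):
                return retry
        except Exception:
            pass  # a failed retry must not lose the original content
    return response
```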

Production Verification

Step 1: PrimeClient resolved to GCP VM: 34.45.154.209:8000 (source: JARVIS_PRIME_URL)
Step 2: PrimeRouter: GCP VM promotion successful, routing updated โ†’ gcp_prime
Step 3: AdaptivePromptBuilder: level=MODERATE, max_tokens=512, temp=0.3
Step 4: [QUERY] Response from gcp_prime (latency: 24635.7ms)
Step 5: API response: "source": "gcp_prime", "model": "jarvis-prime", "fallback_used": false

The 24.6s latency is consistent with CPU inference on the Mistral-7B Q4_K_M model on the e2-standard-4 Invincible Node. Response quality confirmed โ€” full sentence definitions instead of "...".


v241.0/v241.1: Multi-Model Task-Type Routing (GCPModelSwapCoordinator)

The Problem: One Model Does Not Fit All

With a single Mistral-7B serving all queries:

  • "solve 5x+3=18" โ†’ Mistral-7B outputs x=11 (wrong โ€” correct answer is x=3)
  • "write a Python merge sort" โ†’ Mistral-7B produces suboptimal code (not code-trained)
  • "what is the capital of France?" โ†’ waits ~8.6s for a 7B model when a 3.8B model answers in ~3s
  • "explain the implications of quantum error correction" โ†’ limited analysis from a generalist

The Fix: GCPModelSwapCoordinator (Pre-Hook Architecture)

v241.0 introduces the GCPModelSwapCoordinator โ€” a pre-hook that runs before every inference request to ensure the optimal model is loaded. It does NOT replace the generation pipeline; it only swaps the model in the existing LlamaCppExecutor, then returns control to the standard chat_completions() code path.

How the coordinator works:

Incoming request: {"task_type": "math_simple"}
  โ”‚
  โ–ผ
ensure_model("math_simple")
  โ”œโ”€โ”€ _resolve_model("math_simple")
  โ”‚     โ””โ”€โ”€ GCP_TASK_MODEL_MAPPING["math_simple"] = "qwen-2.5-7b"
  โ”‚
  โ”œโ”€โ”€ Is qwen-2.5-7b already loaded?
  โ”‚     โ””โ”€โ”€ YES โ†’ return immediately (no swap, no latency)
  โ”‚
  โ”œโ”€โ”€ Is cooldown active? (60s for medium 3-5 GB models)
  โ”‚     โ””โ”€โ”€ YES โ†’ stay on current model, return
  โ”‚
  โ”œโ”€โ”€ Is queue full? (>50 concurrent requests during swap)
  โ”‚     โ””โ”€โ”€ YES โ†’ HTTP 503 + Retry-After: 30
  โ”‚
  โ””โ”€โ”€ Swap sequence:
        โ”œโ”€โ”€ 1. executor.unload() โ€” release current model RAM
        โ”œโ”€โ”€ 2. executor.load(qwen-2.5-7b.gguf, n_ctx=32768, chat_template="chatml", ...)
        โ”œโ”€โ”€ 3. _validate_model() โ€” 5-token warmup generation
        โ”‚     โ””โ”€โ”€ If FAIL โ†’ rollback to previous model
        โ””โ”€โ”€ 4. Return "qwen-2.5-7b" (model_id for X-Model-Id header)

Per-model executor configuration (Issue #1):

Each model has its own context size, chat template, and inference settings. These are passed as **kwargs to LlamaCppExecutor.load() which merges them with the base config:

| Model | n_ctx | chat_template | Notes |
|-------|-------|---------------|-------|
| Phi-3.5-mini | 4,096 | phi3 | Small context for fast model |
| Mistral-7B | 8,192 | mistral | Standard instruction format |
| Qwen2.5-7B | 32,768 | chatml | Full context for math reasoning |
| Qwen2.5-Math-7B | 32,768 | chatml | Mathematical chain-of-thought |
| DeepSeek-R1 | 32,768 | chatml | Reasoning traces need long context |
| Qwen2.5-Coder-7B | 32,768 | chatml | Code generation needs full context |
| Llama-3.1-8B | 8,192 | llama3 | Capped at 8K on 32 GB RAM (full 128K requires more) |
| Gemma-2-9B | 8,192 | gemma | Largest model, moderate context |
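The kwargs merge can be sketched as a base executor config overlaid with per-model overrides (model IDs and base values here are illustrative, not the shipped GCP_MODEL_CONFIGS):

```python
# Base settings every model inherits; per-model entries override them.
BASE_CONFIG = {"n_gpu_layers": 0, "verbose": False}

MODEL_CONFIGS = {
    "phi-3.5-mini":      {"n_ctx": 4096,  "chat_template": "phi3"},
    "mistral-7b":        {"n_ctx": 8192,  "chat_template": "mistral"},
    "qwen-2.5-coder-7b": {"n_ctx": 32768, "chat_template": "chatml"},
}

def load_kwargs(model_id: str) -> dict:
    """Merge per-model overrides over the executor's base config."""
    return {**BASE_CONFIG, **MODEL_CONFIGS.get(model_id, {})}
```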

Sticky routing with per-model-size cooldowns (Issue #10):

To prevent "model thrashing" (loading a new model for every request), the coordinator uses cooldowns based on model size:

| Model Size | Cooldown | Rationale |
|------------|----------|-----------|
| Small (<3 GB) | 30 seconds | Phi-3.5-mini loads fast, shorter cooldown OK |
| Medium (3-5 GB) | 60 seconds | Most 7B models; balance between responsiveness and swap cost |
| Large (>5 GB) | 90 seconds | Gemma-2-9B, Llama-3.1-8B; slower to load, keep longer |

All cooldowns are overridable via environment variables (GCP_COOLDOWN_SMALL, GCP_COOLDOWN_MEDIUM, GCP_COOLDOWN_LARGE).
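The size-to-cooldown resolution, including the env overrides, reduces to a few lines (thresholds from the table above; the defaults assume the env vars are unset):

```python
import os

def cooldown_for(size_gb: float) -> float:
    """Resolve sticky-routing cooldown from model size, with env override."""
    if size_gb < 3:
        return float(os.environ.get("GCP_COOLDOWN_SMALL", 30))
    if size_gb <= 5:
        return float(os.environ.get("GCP_COOLDOWN_MEDIUM", 60))
    return float(os.environ.get("GCP_COOLDOWN_LARGE", 90))
```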

Bounded queue (Issue R2-1):

During the 20-30s model swap, incoming requests are queued behind an asyncio lock. If more than 50 requests pile up, the coordinator returns HTTP 503 with Retry-After: 30 instead of letting the queue grow unbounded. The counter is atomic in the single asyncio event loop (no TOCTOU race).
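A minimal sketch of the bounded queue (hypothetical class name; the real coordinator wires this into the FastAPI response path). Because the counter is only touched from the single event loop, the check-then-increment is atomic without extra locking:

```python
import asyncio

MAX_QUEUE = 50  # beyond this, shed load with 503 + Retry-After: 30

class SwapGate:
    """Bounds the number of requests waiting on the model-swap lock."""

    def __init__(self, max_waiting: int = MAX_QUEUE):
        self._lock = asyncio.Lock()
        self._waiting = 0
        self._max = max_waiting

    async def run(self, coro_fn):
        """Run coro_fn under the swap lock, or shed load if the queue is full."""
        if self._waiting >= self._max:
            return (503, {"Retry-After": "30"})  # don't let the queue grow unbounded
        self._waiting += 1
        try:
            async with self._lock:
                return (200, await coro_fn())
        finally:
            self._waiting -= 1
```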

Post-swap validation with rollback (Issue #8):

After every model load, the coordinator generates 5 tokens (max_tokens=5) to verify the model responds. If validation fails, it rolls back to the previous model. If rollback also fails, the executor enters a no-model state (logged as CRITICAL) and requests fall through to the Cloud Claude fallback.
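The swap-validate-rollback sequence can be sketched as follows (hypothetical executor interface mirroring LlamaCppExecutor's unload/load/generate; the shipped coordinator also handles the no-model CRITICAL state when rollback fails):

```python
import asyncio  # callers drive these coroutines with asyncio.run / an event loop

async def ensure_model(executor, target_id: str, current_id: str, cfg: dict) -> str:
    """Load the target model if needed; validate with a 5-token warmup,
    rolling back to the previous model on failure."""
    if target_id == current_id:
        return current_id                                # already loaded: zero swap cost
    await executor.unload()                              # release current model's RAM
    await executor.load(target_id, **cfg.get(target_id, {}))
    try:
        await executor.generate("warmup", max_tokens=5)  # post-swap validation
        return target_id
    except Exception:
        await executor.load(current_id, **cfg.get(current_id, {}))  # rollback
        return current_id
```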

Files Modified (v241.0/v241.1)

| File | Change |
|------|--------|
| run_server.py | ChatRequest.metadata field, coordinator pre-hook in chat_completions(), X-Model-Id header, coordinator init in background_initialization() |
| jarvis_prime/core/dynamic_model_registry.py | 11 ModelSpec entries, GCP_TASK_MODEL_MAPPING, GCP_MODEL_CONFIGS per-model overrides |
| jarvis_prime/core/gcp_model_swap_coordinator.py | NEW FILE. Pre-hook coordinator with manifest inventory, bounded queue, cooldowns, validation, rollback |
| jarvis_prime/core/llama_cpp_executor.py | "qwen": "chatml", "deepseek": "chatml", "gemma-2": "gemma" in MODEL_TEMPLATE_MAP |
| config/unified_config.yaml | gcp_model_routing section with 11 model entries |

โœจ Core Features

๐Ÿง  1. Neural Orchestrator Core v100.0 - Unified Intelligent Routing

The single source of truth for all routing decisions across the JARVIS ecosystem:

Unified Architecture

  • Consolidates All Routers: HybridTieredRouter, IntelligentModelRouter, CognitiveRouter, GraphRouter, Neural Switchboard
  • Protocol-Based Design: Type-safe interfaces with @runtime_checkable Protocols
  • Context-Aware Routing: Distributed tracing with contextvars for request correlation
  • Dynamic Configuration: Zero hardcoding - all values from DynamicConfig with env var override
  • Cross-Repo State Management: Atomic file operations for shared state across repositories

Advanced Components

UnifiedTaskClassifier

  • Multi-signal task classification (reasoning, chat, code, creative, analysis)
  • Confidence scoring with adaptive thresholds
  • Pattern matching with regex and keyword detection
  • Context-aware classification (session history, user preferences)

UnifiedMemoryMonitor

  • macOS native memory_pressure command integration
  • Cross-repo memory sharing via JARVIS bridge
  • Real-time pressure level detection (normal, warning, critical, urgent)
  • Burst decision support for memory-intensive operations
  • psutil fallback for non-macOS systems

UnifiedStickyRouting

  • Session-based model affinity
  • Automatic session detection from context
  • Configurable TTL for session continuity
  • Memory-efficient storage with weakref.WeakValueDictionary

UnifiedRequestBuffer

  • Zero-loss request buffering during hot swaps
  • Configurable buffer size and timeout
  • Automatic request replay after swap completion
  • Priority-based request ordering

CircuitBreakerManager

  • Coordinated circuit breakers per tier (Tier 0, Tier 0.5, Tier 1, Tier 2)
  • Atomic state management with distributed locking
  • Automatic recovery with half-open state testing
  • Statistics tracking per tier

CrossRepoStateManager

  • Atomic file operations for state persistence
  • File locking with fcntl for race condition prevention
  • Automatic retry with exponential backoff
  • State versioning and conflict resolution
from jarvis_prime.core.neural_orchestrator_core import get_neural_orchestrator

# Get the unified orchestrator (singleton)
orchestrator = await get_neural_orchestrator()

# Route a request (handles everything automatically)
result = await orchestrator.route(
    prompt="Implement a distributed cache with Redis",
    context={
        "session_id": "abc123",
        "user_id": "derek",
        "priority": "high"
    }
)

# Access routing decision
print(f"Tier: {result.tier}")  # RoutingTier.TIER_0_5
print(f"Endpoint: {result.endpoint}")  # "http://localhost:8000/v1/chat/completions"
print(f"Model ID: {result.model_id}")  # "mistral-7b-instruct"
print(f"Task: {result.task_classification}")  # TaskClassification.CODE
print(f"Confidence: {result.confidence}")  # 0.92
print(f"Reasoning: {result.decision_reason}")  # DecisionReason.MEMORY_PRESSURE

# Get comprehensive statistics
stats = orchestrator.get_comprehensive_stats()
print(f"Total requests: {stats['routing']['total_requests']}")
print(f"Sticky hits: {stats['routing']['sticky_hits']}")
print(f"Memory pressure: {stats['memory_monitor']['pressure_level']}")

Advanced Python Patterns

Protocol Classes for Type Safety

from typing import Any, Dict, Protocol, runtime_checkable

@runtime_checkable
class RouterProtocol(Protocol):
    async def route(self, prompt: str, context: Dict[str, Any]) -> RoutingResult:
        ...

Context Variables for Distributed Tracing

import contextvars

request_id_var = contextvars.ContextVar('request_id', default=None)
session_id_var = contextvars.ContextVar('session_id', default=None)
trace_context_var = contextvars.ContextVar('trace_context', default=None)

Defensive Decorators with Fallbacks

import functools

def with_fallback(fallback_value):
    def decorator(func):
        @functools.wraps(func)  # preserve func.__name__ for the log line below
        async def wrapper(*args, **kwargs):
            try:
                return await func(*args, **kwargs)
            except Exception as e:
                logger.warning(f"{func.__name__} failed: {e}, using fallback")
                return fallback_value
        return wrapper
    return decorator

Atomic Operations

async def atomic_state_update(key: str, value: Any):
    async with distributed_lock(f"state_{key}"):
        # Critical section - guaranteed atomicity
        state[key] = value
        await persist_state(state)

๐Ÿงฉ 2. Dynamic Model Registry v99.0

Auto-discovery and management of models across multiple directories:

Features

  • Multi-Directory Discovery: Scans multiple model directories automatically
  • Auto-Download from HuggingFace: Automatic model downloading with progress tracking
  • File System Watching: Real-time detection of new models via watchdog
  • Reactor Core Sync: Automatic synchronization with Reactor Core training pipeline
  • Model Validation: Integrity checks, inference tests, safety validation
  • Version Management: Semantic versioning with rollback support
from jarvis_prime.core.dynamic_model_registry import DynamicModelRegistry

registry = DynamicModelRegistry(
    discovery_dirs=[
        "./models",
        "~/models",
        "/shared/models"
    ],
    auto_download=True,
    watch_files=True
)

# Auto-discover models
await registry.discover_models()

# Get available models
models = registry.list_models()
for model in models:
    print(f"{model.name} - {model.version} - {model.path}")

# Auto-download from HuggingFace
await registry.download_model(
    repo_id="mistralai/Mistral-7B-Instruct-v0.2",
    local_dir="./models"
)

๐Ÿง  3. Neural Switchboard v98.1

Unified routing system with task classification, memory monitoring, and sticky routing:

jarvis_prime.core.neural_switchboard is the stable public facade. Internally it delegates to dynamic_model_registry.py (switchboard routing) and neural_orchestrator_core.py (tier fallback/orchestration), so callers no longer depend on private implementation layout.

Features

  • Task Classification: Multi-signal classification (reasoning, chat, code, creative)
  • Memory Monitoring: Real-time memory pressure detection
  • Sticky Routing: Session-based model affinity
  • Request Buffering: Zero-loss hot swap support
  • Tier Mapping: Automatic tier/capability mapping
from jarvis_prime.core.neural_switchboard import NeuralSwitchboard

switchboard = NeuralSwitchboard()
await switchboard.initialize()

# Classify task
classification = await switchboard.classify_task(
    prompt="Write a Python function to sort a list",
    context={"session_id": "abc123"}
)

# Route request
decision = await switchboard.route(
    prompt="Continue the previous code",
    context={"session_id": "abc123"},
    strategy="auto",  # switchboard | orchestrator | auto
)
print(decision.to_dict())

๐Ÿ›ก๏ธ 4. Advanced Resilience Patterns

Circuit Breaker (Coordinated Per-Tier)

from jarvis_prime.core.neural_orchestrator_core import CircuitBreakerManager

breaker_manager = CircuitBreakerManager()

# Check circuit state for tier
state = await breaker_manager.get_state(RoutingTier.TIER_1)
if state == CircuitState.CLOSED:
    # Safe to route
    result = await route_to_tier_1(prompt)
    await breaker_manager.record_success(RoutingTier.TIER_1)
else:
    # Circuit open, use fallback
    result = await fallback_route(prompt)

Request Buffering (Zero-Loss Hot Swap)

from jarvis_prime.core.neural_orchestrator_core import UnifiedRequestBuffer

buffer = UnifiedRequestBuffer(max_size=1000, timeout_seconds=30.0)

# Buffer requests during hot swap
async with buffer.buffer_mode():
    # All requests are buffered
    await hot_swap_model(new_model_path)
    # Buffered requests are automatically replayed

Retry with Exponential Backoff + Decorrelated Jitter

from jarvis_prime.core.neural_orchestrator_core import with_retry

@with_retry(max_attempts=3, base_delay=1.0, max_delay=10.0)
async def unreliable_operation():
    # Automatically retries with exponential backoff + jitter
    result = await external_api_call()
    return result

๐Ÿ”’ 5. JARVIS Safety Integration

Cross-Repo Bridge reads safety context from main JARVIS instance:

from jarvis_prime.core.neural_orchestrator_core import CrossRepoStateManager

state_manager = CrossRepoStateManager()

# Read safety context
safety_context = await state_manager.read_safety_context()

if safety_context.kill_switch_active:
    # Route all actions to Prime for careful review
    result = await orchestrator.route(
        prompt=prompt,
        context={"force_tier": RoutingTier.TIER_1}
    )

if safety_context.should_be_cautious():
    # User has been denying actions recently
    # Route risky patterns to cloud
    result = await orchestrator.route(
        prompt=prompt,
        context={"force_tier": RoutingTier.TIER_1}
    )

Safety File Location: ~/.jarvis/safety/context_for_prime.json

Risky Pattern Detection:

  • delete, remove, erase, wipe, format
  • kill, terminate, shutdown, reboot
  • sudo, admin, root, system, chmod
  • execute, run, install, uninstall
  • password, credential, secret, token

๐Ÿ”„ 6. Zero-Downtime Hot Swap

Swap models while server is running with zero requests dropped:

from jarvis_prime.core.hot_swap_manager import HotSwapManager

manager = HotSwapManager()

# Background loading, traffic draining, atomic switch
result = await manager.swap_model(
    new_model_path="./models/mistral-7b.gguf",
    new_version_id="mistral-7b-v0.2"
)

print(f"Swapped in {result.duration_seconds:.1f}s")
print(f"Drained {result.requests_drained} in-flight requests")
print(f"Freed {result.memory_freed_mb:.1f} MB")
# Zero requests dropped! โœ…

๐Ÿ“Š 7. Advanced Telemetry & Cost Tracking

from jarvis_prime.core.cross_repo_bridge import CrossRepoBridge

bridge = CrossRepoBridge(instance_id="prime-derek-mac")
await bridge.start()

# Automatic metrics tracking
bridge.record_inference(tokens_in=25, tokens_out=150, latency_ms=47.3)

# Cost savings calculation
state = bridge.state
print(f"Total requests: {state.metrics.total_requests}")
print(f"Cloud cost if used: ${state.metrics.estimated_cost_usd:.4f}")
print(f"Savings: ${state.metrics.savings_vs_cloud_usd:.4f}")

# Shared with main JARVIS at:
# ~/.jarvis/cross_repo/jarvis_prime_state.json

๐ŸŒ 8. OpenAI-Compatible API

Drop-in replacement for OpenAI SDK:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

response = client.chat.completions.create(
    model="jarvis-prime",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    stream=True  # Real-time streaming
)

for chunk in response:
    if chunk.choices[0].delta.content:  # final chunk carries no content
        print(chunk.choices[0].delta.content, end="")

๐Ÿงฉ 9. Complete AGI Architecture

7 Specialized AGI Models

from jarvis_prime.core.agi_models import (
    ActionModel,           # Action planning and execution
    MetaReasoner,         # Meta-cognitive reasoning, strategy selection
    CausalEngine,         # Causal understanding, counterfactuals
    WorldModel,           # Physical/common sense reasoning
    MemoryConsolidator,   # Memory consolidation and replay
    GoalInference,        # Goal understanding and decomposition
    SelfModel,            # Self-awareness and capability assessment
)

# Orchestrate multiple models for complex reasoning
from jarvis_prime.core.agi_models import AGIOrchestrator

orchestrator = AGIOrchestrator()
result = await orchestrator.process(
    request="Design a distributed caching system",
    required_models=["meta_reasoner", "action", "causal"]
)

Advanced Reasoning Engine

from jarvis_prime.core.reasoning_engine import ReasoningEngine, ReasoningStrategy

engine = ReasoningEngine()

# Chain-of-Thought reasoning
cot_result = await engine.reason(
    prompt="How do I optimize this algorithm?",
    strategy=ReasoningStrategy.CHAIN_OF_THOUGHT,
    max_steps=10
)

# Tree-of-Thoughts for exploration
tot_result = await engine.reason(
    prompt="Design three different approaches to...",
    strategy=ReasoningStrategy.TREE_OF_THOUGHTS,
    num_branches=3,
    exploration_depth=4
)

# Self-Reflection for error correction
reflection_result = await engine.reason(
    prompt="Review this code for bugs",
    strategy=ReasoningStrategy.SELF_REFLECTION,
    confidence_threshold=0.8
)

๐Ÿ—๏ธ Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                    JARVIS UNIFIED SUPERVISOR                            │
│                    (run_supervisor.py - v100.0)                         │
│                                                                         │
│  Orchestrates: JARVIS (Body), JARVIS-Prime (Mind), Reactor-Core         │
│  Initializes: Neural Orchestrator Core v100.0                           │
│  Manages: Health checks, lifecycle, cross-repo communication            │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│              NEURAL ORCHESTRATOR CORE v100.0                            │
│              Unified Intelligent Routing Architecture                   │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                      UNIFIED ROUTING LAYER                        │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐ ┌───────────┐         │  │
│  │  │ TaskClass │ │MemPressure│ │ Sticky    │ │ RequestBuf│         │  │
│  │  │   -ifier  │ │  Monitor  │ │ Routing   │ │   -fer    │         │  │
│  │  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘ └─────┬─────┘         │  │
│  │        └─────────────┴──────┬──────┴─────────────┘               │  │
│  │                             ▼                                    │  │
│  │               ┌───────────────────────────┐                      │  │
│  │               │  ROUTING DECISION ENGINE  │                      │  │
│  │               │    (Unified Algorithm)    │                      │  │
│  │               └─────────────┬─────────────┘                      │  │
│  │                             ▼                                    │  │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌──────┐                 │  │
│  │  │ Tier 0  │  │Tier 0.5 │  │ Tier 1  │  │Tier 2│                 │  │
│  │  │ Ultra   │  │ Local   │  │ Cloud   │  │ Deep │                 │  │
│  │  │ Fast    │  │ Capable │  │  Intel  │  │Reason│                 │  │
│  │  └────┬────┘  └────┬────┘  └────┬────┘  └──┬───┘                 │  │
│  │       └────────────┴──────┬─────┴──────────┘                     │  │
│  │                           ▼                                      │  │
│  │           ┌────────────────────────────┐                         │  │
│  │           │  CIRCUIT BREAKER MANAGER   │                         │  │
│  │           │  (Coordinated State)       │                         │  │
│  │           └────────────────────────────┘                         │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                    CROSS-REPO INTEGRATION                         │  │
│  │  ┌───────────┐ ┌───────────┐ ┌───────────┐                       │  │
│  │  │  JARVIS   │ │  JARVIS   │ │  Reactor  │                       │  │
│  │  │  (Body)   │ │  Prime    │ │   Core    │                       │  │
│  │  │  Memory   │ │  Memory   │ │  Sync     │                       │  │
│  │  └─────┬─────┘ └─────┬─────┘ └─────┬─────┘                       │  │
│  │        └─────────────┬─────────────┘                             │  │
│  │                      ▼                                           │  │
│  │        ┌───────────────────────────┐                             │  │
│  │        │  SHARED STATE MANAGER     │                             │  │
│  │        │  (~/.jarvis/cross_repo/)  │                             │  │
│  │        └───────────────────────────┘                             │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
         │                                           │
         ▼                                           ▼
┌─────────────────────┐                  ┌──────────────────────────┐
│   JARVIS (Body)     │                  │  JARVIS-Prime (Mind)     │
│   ───────────────   │                  │  ────────────────────    │
│   • Computer Use    │◄─────Trinity─────┤  • AGI Models (7 types)  │
│   • Action Exec     │     Protocol     │  • Reasoning Engine      │
│   • macOS Control   │    (File IPC +   │  • Multimodal Fusion     │
│   • Safety Manager  │     WebSocket)   │  • Continuous Learning   │
│   "Reflex Mode"     │                  │  "Cognitive Mode"        │
└─────────────────────┘                  └──────────────────────────┘
         │                                           │
         └───────────────────┬───────────────────────┘
                             │
                             ▼
                  ┌─────────────────────┐
                  │  Reactor-Core (Soul)│
                  │  ─────────────────  │
                  │  • Model Training   │
                  │  • Fine-tuning      │
                  │  • Checkpointing    │
                  └─────────────────────┘

Cross-Repo Integration (Trinity)

JARVIS-Prime is the Mind in the three-repo Trinity architecture. It is started and monitored by the JARVIS unified supervisor and coordinates with Reactor-Core for training data and model deployment.

How JARVIS (Body) uses Prime:

  • Discovery: Supervisor resolves JARVIS_PRIME_REPO_PATH (or default ~/Documents/repos/JARVIS-Prime).
  • Early Prime pre-warm: Supervisor can start Prime early so LLM loading begins in parallel; when Trinity phase starts, it adopts the running process and clears JARVIS_EARLY_PRIME_PID. The Early Prime monitor then stops with handoff=True so progress is preserved (v221.0).
  • Health: Supervisor polls GET /health and reads model_load_progress_pct, startup_progress, loading_progress, phase, model_loaded, ready_for_inference. Progress never regresses (e.g. 18% → 0%) thanks to handoff-safe state in the supervisor.
  • State: Prime reads/writes shared state under ~/.jarvis/ (e.g. cross_repo/, Neural Orchestrator state) for safety context and routing.

How Reactor-Core uses Prime:

  • Inference: Reactor can call Prime’s OpenAI-compatible API for generation during training or evaluation.
  • Model deployment: Trained/updated models can be deployed to Prime (e.g. hot swap, model registry).
  • Trinity Protocol: Events and heartbeats flow via file IPC and/or WebSocket; Prime participates in Trinity state sync.
  • Autonomy Policy (Phase 2): JARVIS Body sends autonomy_policy on JARVISCommand with allowed/denied action lists and risk thresholds. Prime validates proposed actions against the policy, builds a structured action_plan in PrimeResponse, and returns policy_compatible: bool and contract_version for boot contract checking.
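
The policy gate described above can be sketched as follows. This is an illustrative sketch, not Prime's actual API: the helper name `validate_action` and the `RISK_ORDER` ranking are assumptions; only the policy field names (`allowed_actions`, `denied_actions`, `max_risk_level`) come from the text.

```python
# Hypothetical sketch of Prime's autonomy policy check (names are illustrative).
RISK_ORDER = ["low", "medium", "high", "critical"]

def validate_action(action: str, risk: str, policy: dict) -> bool:
    """Return True if the proposed action is compatible with the autonomy policy."""
    if action in policy.get("denied_actions", []):
        return False  # explicit deny always wins
    allowed = policy.get("allowed_actions")
    if allowed is not None and action not in allowed:
        return False  # not on the allow list
    max_risk = policy.get("max_risk_level", "low")
    return RISK_ORDER.index(risk) <= RISK_ORDER.index(max_risk)

policy = {
    "allowed_actions": ["send_email", "create_event"],
    "denied_actions": ["delete_file"],
    "max_risk_level": "medium",
}
```

A `policy_compatible: false` result would then flow back to Body on the `PrimeResponse` rather than raising, so Body can decide whether to ask the user.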

Phase 2: Trinity Autonomy Wiring (Prime Role)

Prime serves as the policy gate in the autonomy pipeline. When Body's Google Workspace Agent proposes an autonomous action, Prime validates it against the attached policy and returns a structured plan.

┌──────────────────────────────────────────────────────────┐
│              PRIME AUTONOMY ROLE                         │
├──────────────────────────────────────────────────────────┤
│                                                          │
│  Inbound (from Body):                                    │
│  ┌────────────────────────────────────────┐              │
│  │ JARVISCommand                          │              │
│  │   .autonomy_policy = {                 │              │
│  │       "allowed_actions": [...],        │              │
│  │       "denied_actions": [...],         │              │
│  │       "max_risk_level": "medium",      │              │
│  │       "require_confirmation": false    │              │
│  │   }                                    │              │
│  └───────────────────┬────────────────────┘              │
│                      ▼                                   │
│  ┌────────────────────────────────────────┐              │
│  │ Policy Validation                      │              │
│  │   • Check action against allowed list  │              │
│  │   • Check action against denied list   │              │
│  │   • Validate risk level                │              │
│  └───────────────────┬────────────────────┘              │
│                      ▼                                   │
│  Outbound (to Body):                                     │
│  ┌────────────────────────────────────────┐              │
│  │ PrimeResponse                          │              │
│  │   .action_plan = {                     │              │
│  │       "steps": [...],                  │              │
│  │       "risk_assessment": "low"         │              │
│  │   }                                    │              │
│  │   .policy_compatible = true            │              │
│  │   .contract_version = "1.0"            │              │
│  │   .autonomy_schema_version = "1.0"     │              │
│  └────────────────────────────────────────┘              │
│                                                          │
│  Health endpoint additions:                              │
│  GET /health → { autonomy_schema_version: "1.0",         │
│                  contract_version: "1.0" }               │
│  (Used by Supervisor boot contract check)                │
│                                                          │
└──────────────────────────────────────────────────────────┘

Files modified:

  • jarvis_prime/core/jarvis_bridge.py โ€” autonomy_policy on JARVISCommand, action_plan/policy_compatible/contract_version on PrimeResponse
  • jarvis_prime/server.py โ€” autonomy_schema_version and contract_version in health endpoint

Health endpoint contract for supervisor:

  • During model loading: model_load_progress_pct (0โ€“100), model_loading_in_progress, phase (e.g. loading_model), model_load_elapsed_seconds.
  • When ready: model_loaded, ready_for_inference, phase: "ready".
  • run_server.py is the authoritative full server.
  • jarvis_prime/server.py (module entry) now delegates to run_server.py so both startup paths expose the same contract and capabilities.
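
For concreteness, the two phases of the contract might look like the payloads below. The field names come from the list above; the specific values are illustrative examples, not real server output.

```python
# Illustrative /health payloads for the two phases of the contract.
# Field names are from the documented contract; values are examples only.
loading = {
    "phase": "loading_model",
    "model_loading_in_progress": True,
    "model_load_progress_pct": 42.0,
    "model_load_elapsed_seconds": 31.5,
}

ready = {
    "phase": "ready",
    "model_loaded": True,
    "ready_for_inference": True,
    "model_load_progress_pct": 100.0,
}
```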

Model Loading Progress & Handoff (v221.0)

When the JARVIS unified supervisor uses Early Prime pre-warm, Prime starts early and a background monitor polls /health and updates the dashboard. When the Trinity phase takes over, it adopts the running Prime process and clears the early-Prime env var; the Early Prime monitor then stops. v221.0 ensures:

  • No progress regression: The supervisor’s update_model_loading(active=False, handoff=True) preserves max_progress_seen. Progress never drops (e.g. 18% → 0%).
  • Prime health: Prime’s /health must report model_load_progress_pct (and related fields) so the Trinity monitor can continue from the preserved progress. Module startup and script startup now resolve to the same full server path.

See JARVIS-AI-Agent memory/2026-02-04.md (or equivalent) for the full root-cause analysis and fix summary.

Request Flow with Neural Orchestrator Core

User Request: "Implement a distributed cache with Redis"
     │
     ▼
┌───────────────────────────────────────────────────────────────┐
│ Step 1: Neural Orchestrator Core Route()                      │
│ ────────────────────────────────────────                      │
│ • Check sticky routing: session_id="abc123" → Model affinity  │
│ • Classify task: CODE (confidence: 0.92)                      │
│ • Check memory pressure: NORMAL (macOS native)                │
│ • Check circuit breakers: All CLOSED                          │
│ • Select tier: TIER_0_5 (Local Capable)                       │
│ • Select endpoint: http://localhost:8000/v1/chat/completions  │
│ • Select model: mistral-7b-instruct                           │
└───────────────────────────────────────────────────────────────┘
     │
     ▼
┌───────────────────────────────────────────────────────────────┐
│ Step 2: Request Execution                                     │
│ ─────────────────────────                                     │
│ • Acquire circuit breaker permit: SUCCESS                     │
│ • Execute request with timeout: 60s                           │
│ • Stream response tokens                                      │
└───────────────────────────────────────────────────────────────┘
     │
     ▼
┌───────────────────────────────────────────────────────────────┐
│ Step 3: Response & State Update                               │
│ ───────────────────────────────                               │
│ • Release circuit breaker permit: SUCCESS                     │
│ • Update sticky routing: session_id → model_id                │
│ • Update statistics: total_requests++, sticky_hits++          │
│ • Record outcome for adaptive learning                        │
│ → Return response to user                                     │
└───────────────────────────────────────────────────────────────┘

🚀 Quick Start

Prerequisites

  • Python 3.11+ (recommended for best performance with structured concurrency)
  • macOS (for M1/M2/M3 optimization) or Linux
  • 8GB+ RAM (16GB recommended for larger models)
  • 10GB+ free disk space

Installation

# Clone repository
git clone https://github.com/drussell23/jarvis-prime.git
cd jarvis-prime

# Install dependencies
pip install -e .

# Or with all features
pip install -e ".[server,gcs,telemetry,agi,neural-orchestrator]"

Entry Points

| Entry Point | Purpose | When to Use |
|---|---|---|
| run_server.py | Authoritative full server with startup state, progress reporting, AGI/neural orchestration, and cross-repo bridges | Recommended — used by unified supervisor; reports model_load_progress_pct, startup_progress, model_loading_in_progress |
| jarvis_prime/server.py (module) | Unified module entrypoint that delegates to run_server.py | Use when launching with python -m jarvis_prime.server; behavior is now capability-equivalent to run_server.py |
| Unified Supervisor (JARVIS) | python3 unified_supervisor.py in JARVIS-AI-Agent | Recommended for full ecosystem — starts Body + Prime + Reactor-Core with Trinity coordination |

jarvis_prime/server.py now fails fast by default if run_server.py is unavailable (to avoid degraded startup). Emergency override: set JARVIS_PRIME_ALLOW_LEGACY_SERVER_FALLBACK=true.

The health endpoint (GET /health) must expose model_load_progress_pct (and optionally startup_progress, loading_progress, model_loading_in_progress) so the JARVIS unified supervisor can track loading progress and avoid regression during Early Prime → Trinity handoff (v221.0).

Unified Supervisor (Recommended)

Start all components with a single command from the JARVIS (Body) repo:

# From JARVIS-AI-Agent repo — starts JARVIS + JARVIS-Prime + Reactor-Core
python3 unified_supervisor.py

# Supervisor will:
# 1. Start JARVIS-Prime server (port 8000)
# 2. Initialize Neural Orchestrator Core v100.0
# 3. Connect to JARVIS Body (if running)
# 4. Setup Trinity Protocol (File IPC + WebSocket)
# 5. Start health monitoring
# 6. Initialize Dynamic Model Registry
# 7. Start cross-repo state management

# Output:
# ============================================================
# JARVIS Unified Supervisor v100.0 - Starting
# ============================================================
# ๐Ÿง  Neural Orchestrator Core v100.0 initialized
# ๐Ÿ“Š Dynamic Model Registry v99.0 initialized
# ๐Ÿ”„ Cross-Repo State Manager initialized
# Starting component: jarvis_prime
# Starting component: jarvis
# All components started successfully
# Supervisor running, press Ctrl+C to stop

Note: The unified supervisor lives in JARVIS-AI-Agent; it discovers and starts JARVIS-Prime (and Reactor-Core). From within the JARVIS-Prime repo you can run the standalone server only (see below).

Standalone Server

Start just the JARVIS-Prime server:

# Download a model first
python -c "
from jarvis_prime.docker.model_downloader import download_model
download_model('tinyllama-chat', './models')
"

# Start server
python run_server.py \
    --model ./models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \
    --port 8000

# Server starts at http://localhost:8000

Test Neural Orchestrator Core

from jarvis_prime.core.neural_orchestrator_core import get_neural_orchestrator
import asyncio

async def main():
    # Get singleton orchestrator
    orchestrator = await get_neural_orchestrator()

    # Simple request → Tier 0
    result = await orchestrator.route(
        prompt="What's 2+2?",
        context={"session_id": "test123"}
    )
    print(f"Tier: {result.tier}")  # RoutingTier.TIER_0
    print(f"Task: {result.task_classification}")  # TaskClassification.CHAT

    # Complex request → Tier 1
    result = await orchestrator.route(
        prompt="Plan a comprehensive security audit of the authentication system",
        context={"session_id": "test123"}
    )
    print(f"Tier: {result.tier}")  # RoutingTier.TIER_1
    print(f"Task: {result.task_classification}")  # TaskClassification.REASONING
    print(f"Confidence: {result.confidence}")  # 0.92

    # Get comprehensive statistics
    stats = orchestrator.get_comprehensive_stats()
    print(f"Total requests: {stats['routing']['total_requests']}")
    print(f"Sticky hits: {stats['routing']['sticky_hits']}")
    print(f"Memory pressure: {stats['memory_monitor']['pressure_level']}")

asyncio.run(main())

Send Requests (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy"
)

# Simple request
response = client.chat.completions.create(
    model="jarvis-prime",
    messages=[
        {"role": "user", "content": "Hello!"}
    ]
)
print(response.choices[0].message.content)

# Streaming request
stream = client.chat.completions.create(
    model="jarvis-prime",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

🌐 API Endpoints

Neural Orchestrator Core Endpoints

GET /neural-orchestrator/health

Check Neural Orchestrator health status.

Response:

{
  "status": "healthy",
  "components": {
    "task_classifier": "healthy",
    "memory_monitor": "healthy",
    "sticky_routing": "healthy",
    "request_buffer": "healthy",
    "circuit_breaker": "healthy",
    "cross_repo_state": "healthy"
  },
  "uptime_seconds": 3600.5
}

GET /neural-orchestrator/stats

Get comprehensive statistics.

Response:

{
  "routing": {
    "total_requests": 1250,
    "sticky_hits": 342,
    "task_classifications": {
      "REASONING": 450,
      "CHAT": 600,
      "CODE": 150,
      "CREATIVE": 50
    }
  },
  "memory_monitor": {
    "pressure_level": "normal",
    "last_check": "2025-01-07T14:30:45Z"
  },
  "circuit_breaker": {
    "tier_0": {"state": "closed", "failures": 0},
    "tier_0_5": {"state": "closed", "failures": 0},
    "tier_1": {"state": "closed", "failures": 0},
    "tier_2": {"state": "closed", "failures": 0}
  }
}

POST /neural-orchestrator/route

Route a request through the Neural Orchestrator.

Request:

{
  "prompt": "Implement a distributed cache",
  "context": {
    "session_id": "abc123",
    "user_id": "derek",
    "priority": "high"
  }
}

Response:

{
  "tier": "TIER_0_5",
  "endpoint": "http://localhost:8000/v1/chat/completions",
  "model_id": "mistral-7b-instruct",
  "task_classification": "CODE",
  "confidence": 0.92,
  "decision_reason": "MEMORY_PRESSURE",
  "metadata": {
    "sticky_hit": true,
    "memory_pressure": "normal"
  }
}

GET /neural-orchestrator/memory

Get current memory pressure status.

Response:

{
  "pressure_level": "normal",
  "pressure_score": 0.25,
  "memory_usage_mb": 8192,
  "memory_available_mb": 8192,
  "last_check": "2025-01-07T14:30:45Z"
}

POST /neural-orchestrator/classify

Classify a task without routing.

Request:

{
  "prompt": "Write a Python function to sort a list",
  "context": {
    "session_id": "abc123"
  }
}

Response:

{
  "task_classification": "CODE",
  "confidence": 0.95,
  "signals": {
    "reasoning_indicators": 0.1,
    "code_indicators": 0.9,
    "chat_indicators": 0.2
  }
}

Standard API Endpoints

POST /v1/chat/completions

OpenAI-compatible chat completions endpoint.

POST /generate

Simple text generation endpoint.

GET /health

Health check endpoint.

GET /metrics

Cost tracking and inference metrics.

GET /v1/models

List available models.

POST /api/v1/models/reload

Reload a model (hot swap).

AGI Endpoints

POST /agi/reason

Advanced reasoning with AGI models.

POST /agi/plan

Action planning with AGI models.

POST /agi/process

Multi-model AGI processing.

POST /agi/feedback

Provide feedback for continuous learning.

POST /agi/learning/trigger

Trigger continuous learning update.

GET /agi/status

Get AGI subsystem status.

GET /agi/learning/stats

Get continuous learning statistics.


🎛️ Configuration

Environment Variables (Zero Hardcoding)

Neural Orchestrator Core Configuration

# Core settings
export NEURAL_ORCHESTRATOR_ENABLED=true
export NEURAL_ORCHESTRATOR_CONFIG_PATH=config/neural_orchestrator.yaml

# Task classification
export NEURAL_ORCHESTRATOR_REASONING_THRESHOLD=0.5
export NEURAL_ORCHESTRATOR_CODE_THRESHOLD=0.6
export NEURAL_ORCHESTRATOR_CREATIVE_THRESHOLD=0.4

# Memory monitoring
export NEURAL_ORCHESTRATOR_MEMORY_CHECK_INTERVAL=5.0
export NEURAL_ORCHESTRATOR_MEMORY_PRESSURE_THRESHOLD=0.8
export NEURAL_ORCHESTRATOR_MEMORY_CRITICAL_THRESHOLD=0.9

# Sticky routing
export NEURAL_ORCHESTRATOR_STICKY_ENABLED=true
export NEURAL_ORCHESTRATOR_STICKY_TTL=3600.0

# Request buffering
export NEURAL_ORCHESTRATOR_BUFFER_MAX_SIZE=1000
export NEURAL_ORCHESTRATOR_BUFFER_TIMEOUT=30.0

# Circuit breaker
export NEURAL_ORCHESTRATOR_CIRCUIT_FAILURE_THRESHOLD=5
export NEURAL_ORCHESTRATOR_CIRCUIT_RECOVERY_TIMEOUT=30.0
export NEURAL_ORCHESTRATOR_CIRCUIT_HALF_OPEN_MAX_REQUESTS=3

# Cross-repo state
export NEURAL_ORCHESTRATOR_CROSS_REPO_DIR=~/.jarvis/cross_repo
export NEURAL_ORCHESTRATOR_STATE_FILE=neural_orchestrator_state.json
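
The "zero hardcoding" approach above amounts to reading every tunable from the environment with a typed default. A minimal sketch, using variable names from the list above (the helper functions themselves are illustrative):

```python
import os

# Each knob comes from the environment with a typed default (zero hardcoding).
def env_float(name: str, default: float) -> float:
    return float(os.environ.get(name, default))

def env_int(name: str, default: int) -> int:
    return int(os.environ.get(name, default))

def env_bool(name: str, default: bool) -> bool:
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

STICKY_ENABLED = env_bool("NEURAL_ORCHESTRATOR_STICKY_ENABLED", True)
STICKY_TTL = env_float("NEURAL_ORCHESTRATOR_STICKY_TTL", 3600.0)
CIRCUIT_FAILURE_THRESHOLD = env_int("NEURAL_ORCHESTRATOR_CIRCUIT_FAILURE_THRESHOLD", 5)
```

With no environment overrides set, the defaults above match the values shown in the export lists.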

Dynamic Model Registry Configuration

# Discovery
export MODEL_REGISTRY_DISCOVERY_DIRS="./models,~/models,/shared/models"
export MODEL_REGISTRY_AUTO_DOWNLOAD=true
export MODEL_REGISTRY_WATCH_FILES=true

# HuggingFace
export MODEL_REGISTRY_HF_TOKEN=your_token_here
export MODEL_REGISTRY_HF_CACHE_DIR=~/.cache/huggingface

# Reactor Core sync
export MODEL_REGISTRY_REACTOR_CORE_ENABLED=true
export MODEL_REGISTRY_REACTOR_CORE_URL=http://localhost:9000

General Server Configuration

# Server
export JARVIS_PRIME_HOST=0.0.0.0
export JARVIS_PRIME_PORT=8000
export JARVIS_PRIME_MODELS_DIR=./models

# Safety integration
export JARVIS_PRIME_SAFETY_ENABLED=true
export JARVIS_CROSS_REPO_DIR=~/.jarvis/cross_repo

# Model settings
export JARVIS_PRIME_INITIAL_MODEL=./models/mistral-7b.gguf
export JARVIS_PRIME_CONTEXT_LENGTH=4096
export JARVIS_PRIME_N_GPU_LAYERS=-1  # All layers on GPU (M1 MPS)
export PRIME_QUANTIZATION_BITS=8  # 4-bit or 8-bit for M1 optimization

GCP Cloud Hybrid Configuration

# GCP settings
export GCP_ENABLED=true
export GCP_PROJECT_ID=your-project-id
export GCP_ZONE=us-central1-a
export GCP_VM_INSTANCE_TYPE=n1-standard-4
export GCP_VM_SPOT=true
export GCP_VM_RAM_GB=64  # Updated from 32GB to 64GB
export GCP_PRIME_URL=http://your-gcp-vm:8000

📊 Performance & Benchmarks

Neural Orchestrator Core Performance (M1 Max 64GB)

| Metric | Value |
|---|---|
| Routing decision latency | 0.5-1.5ms |
| Task classification latency | 0.3-0.8ms |
| Memory pressure check (macOS native) | 5-15ms |
| Memory pressure check (psutil fallback) | 1-3ms |
| Sticky routing lookup | <0.1ms |
| Circuit breaker check | <0.1ms |
| Cross-repo state read | 2-5ms |
| Cross-repo state write | 3-8ms |

Local Model Performance (M1 Mac 16GB)

| Model | Size | Tokens/sec | Latency (P50) | Latency (P99) | Memory |
|---|---|---|---|---|---|
| TinyLlama 1.1B (Q4_K_M) | 670MB | 85 t/s | 12ms | 45ms | 1.2GB |
| Phi-2 2.7B (Q4_K_M) | 1.6GB | 42 t/s | 24ms | 89ms | 2.8GB |
| Mistral 7B (Q4_K_M) | 4.3GB | 18 t/s | 56ms | 178ms | 5.9GB |
| Llama-3 8B (Q4_K_M) | 4.9GB | 15 t/s | 67ms | 201ms | 6.8GB |
| Qwen 2.5 32B (Q4_K_M) | 18GB | 5 t/s | 200ms | 600ms | 20GB |

GCP Invincible Node — Real-World Production Performance (v241.1)

Measured on jarvis-prime-node (e2-highmem-4, 4 vCPUs, 32 GB RAM, CPU-only, no GPU):

| Metric | Value |
|---|---|
| Models on disk | 11 specialist LLMs (~40.4 GB total, Q4_K_M GGUF) |
| Routable models | 8 (task-type routing via GCPModelSwapCoordinator) |
| Total disk | 80 GB SSD (~27.6 GB headroom after models + OS) |
| Cold start (golden image) | ~87 seconds (VM create → ready_for_inference=True) |
| Latency (simple, Phi-3.5) | ~3-4 seconds |
| Latency (7B models) | ~6-9 seconds |
| Latency (9B models) | ~8-12 seconds |
| Model swap time | ~20-30 seconds (SSD → RAM + 5-token validation) |
| Token generation rate | ~6-10 t/s (3.8B), ~3-5 t/s (7B), ~2-4 t/s (9B) |
| Memory usage (model loaded) | ~3 GB (Phi-3.5) to ~6.5 GB (Gemma-2-9B) |
| Inference mode | CPU-only (AVX2/SSE4.2 SIMD via llama.cpp) |
| Concurrent requests | 1 (sequential processing, 50-request bounded queue during swap) |
| VM cost | $0.134/hr ($97/month always-on) |
| Per-request cost | $0.00 (self-hosted, unlimited requests across all 8 specialist models) |

Note: Latency varies by model size. Phi-3.5-mini (3.8B) is ~3x faster than 7B models. The 20-30s model swap cost is mitigated by sticky routing with per-model-size cooldowns (30/60/90s). In practice, most consecutive queries go to the same model — swaps only happen when the user switches between task types (e.g., from math to code) after the cooldown expires.
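
The sticky-routing cooldown logic described above can be sketched as follows. The class and method names are illustrative (not GCPModelSwapCoordinator's real API); the 30/60/90s cooldowns by model size are the figures from the note.

```python
# Hypothetical sketch of per-model-size swap cooldowns (30/60/90s as described).
COOLDOWNS = {"3.8B": 30.0, "7B": 60.0, "9B": 90.0}

class StickyRouter:
    """Keep the resident model unless the target differs AND its cooldown expired."""

    def __init__(self) -> None:
        self.current_model: str | None = None
        self.current_size: str | None = None
        self.loaded_at = 0.0

    def should_swap(self, target_model: str, now: float) -> bool:
        if target_model == self.current_model:
            return False  # sticky hit: same model, no swap
        if self.current_model is None:
            return True   # nothing loaded yet
        cooldown = COOLDOWNS.get(self.current_size, 60.0)
        return (now - self.loaded_at) >= cooldown

    def load(self, model: str, size: str, now: float) -> None:
        self.current_model, self.current_size, self.loaded_at = model, size, now

router = StickyRouter()
router.load("qwen2.5-math-7b", "7B", now=0.0)
```

This is why the 20-30s swap cost rarely bites: a burst of math queries stays on the math model, and a switch to code only triggers a swap after the 60s window for a 7B model.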

GCP Cloud Performance (A100 GPU) — Reference Benchmarks

| Model | Size | Tokens/sec | Latency (P50) | Latency (P99) | Cost/hr |
|---|---|---|---|---|---|
| Llama 3.3 70B (Q4) | 35GB | 45 t/s | 22ms | 65ms | $1.50 |
| Qwen 2.5 72B (Q4) | 36GB | 42 t/s | 24ms | 70ms | $1.50 |
| Mixtral 8x22B (Q4) | 45GB | 38 t/s | 26ms | 75ms | $2.00 |
| DeepSeek V2 (Q4) | 50GB | 35 t/s | 29ms | 80ms | $2.50 |

Cost Savings (Measured over 30 days)

Scenario: 50,000 requests/month (avg 150 tokens out)

Neural Orchestrator Routing:
- Tier 0 (Ultra Fast): 30,000 requests (60%) → Local → $0.00
- Tier 0.5 (Local Capable): 12,000 requests (24%) → Local → $0.00
- Tier 1 (Cloud Intelligence): 7,000 requests (14%) → GCP → $10.50
- Tier 2 (Deep Reasoning): 1,000 requests (2%) → Claude Opus → $15.00

Total cost: $25.50/month

If 100% Cloud:
- 50,000 requests × 150 tokens × $0.024/1K = $180.00/month

Savings: $154.50/month (86% reduction) 🎉
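
The arithmetic above checks out and can be reproduced directly (the rates and request counts are the example figures from this scenario, not price quotes):

```python
# Reproducing the cost comparison above.
requests = 50_000
tokens_out = 150                 # average output tokens per request
cloud_rate_per_1k = 0.024        # $/1K output tokens (example rate)

all_cloud = requests * tokens_out / 1_000 * cloud_rate_per_1k  # $180.00/month
routed = 10.50 + 15.00           # Tier 1 + Tier 2 spend; Tiers 0/0.5 are free
savings = all_cloud - routed     # $154.50/month
reduction_pct = round(savings / all_cloud * 100)  # 86
```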

Resilience Metrics (Production - 7 days)

| Metric | Value |
|---|---|
| Circuit breaker opens | 2 |
| Fallback cache hits | 1,247 |
| Fallback to simple mode | 15 |
| Total requests | 187,342 |
| Zero-downtime swaps | 6 |
| Requests dropped | 0 ✅ |
| Average recovery time | 6.2s |
| Sticky routing hits | 45,231 (24.1%) |
| Memory pressure alerts | 3 |

🔒 Safety & Security

Multi-Layer Safety Integration

┌────────────────────────────────────────────────────────────┐
│ Layer 1: JARVIS ActionSafetyManager (Body)                 │
│ ──────────────────────────────────────────                 │
│ • Monitors all action execution                            │
│ • Detects risky patterns                                   │
│ • User confirmation required for HIGH risk                 │
│ • Kill switch activation                                   │
│ • Writes context: ~/.jarvis/safety/context_for_prime.json  │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Layer 2: Neural Orchestrator Safety Integration            │
│ ───────────────────────────────────────────────            │
│ • Reads safety context before routing                      │
│ • Routes risky actions to Prime when kill switch active    │
│ • Adjusts tier selection based on safety state             │
│ • Forces Tier 1/2 for high-risk operations                 │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Layer 3: Cross-Repo State Manager                          │
│ ─────────────────────────────────                          │
│ • Atomic state updates                                     │
│ • File locking for race condition prevention               │
│ • Automatic retry with exponential backoff                 │
└────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│ Layer 4: AGI Safety Reasoning                              │
│ ─────────────────────────────                              │
│ • CausalEngine predicts action consequences                │
│ • MetaReasoner evaluates risk vs benefit                   │
│ • ActionModel includes safety constraints                  │
└────────────────────────────────────────────────────────────┘

Safety Context Example

{
  "kill_switch_active": true,
  "current_risk_level": "high",
  "pending_confirmation": true,
  "recent_blocks": 2,
  "recent_confirmations": 5,
  "recent_denials": 3,
  "user_trust_level": 0.62,
  "last_update": "2025-01-07T14:30:45.123456",
  "session_start": "2025-01-07T09:00:00.000000",
  "total_audits": 47,
  "total_blocks": 8
}

Routing Behavior:

  • Kill switch active → All actions route to Tier 1/2
  • Recent denials > 2 → Route risky patterns to Tier 1/2
  • User trust < 0.7 → More conservative routing
  • High risk level → Force confirmation
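
A minimal sketch of how these rules could be applied, using the field names from the JSON example above (the function name and tier encoding are illustrative, not the actual orchestrator code):

```python
# Apply the safety-context rules to a tier decision (0 = local, 1/2 = cloud).
import json
from pathlib import Path

SAFETY_CONTEXT = Path.home() / ".jarvis/safety/context_for_prime.json"

def min_tier_for(action_risky: bool, ctx: dict) -> int:
    """Return the minimum tier allowed under the current safety state."""
    if ctx.get("kill_switch_active"):
        return 1                      # all actions route to Tier 1/2
    if action_risky and ctx.get("recent_denials", 0) > 2:
        return 1                      # risky patterns escalate
    if ctx.get("user_trust_level", 1.0) < 0.7:
        return 1                      # low trust: more conservative routing
    return 0

def load_context() -> dict:
    try:
        return json.loads(SAFETY_CONTEXT.read_text())
    except (OSError, json.JSONDecodeError):
        return {}                     # missing context: default to permissive
```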

๐Ÿ—บ๏ธ Roadmap

Architectural Status Report — Cross-Repo Audit (February 2026)

A comprehensive audit of the JARVIS ecosystem identified critical integration gaps that affect J-Prime's role as the cognitive layer:

LangGraph Dependency Status

LangGraph is listed in JARVIS Body's backend/requirements.txt but is NOT installed. This means:

  • All 9 LangGraph reasoning graphs across the JARVIS Body codebase execute their linear fallback paths instead of conditional graph routing
  • The LangGraphReasoningEngine's 7-node graph (with loop-back on low confidence via route_after_reflection()) has never executed
  • The JARVISCheckpointer in memory_integration.py inherits from object instead of LangGraph's BaseCheckpointSaver, providing no real checkpoint persistence
  • Impact on J-Prime: The reasoning quality sent to J-Prime for inference is lower than designed because all reasoning is single-pass linear, not iterative

Resolution (v246.0): Install langgraph, langgraph-checkpoint, and langgraph-checkpoint-sqlite in JARVIS Body to activate all 9 reasoning graphs. This will improve the quality of reasoning that feeds into J-Prime's inference pipeline.
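
The install itself is one pip command (`pip install langgraph langgraph-checkpoint langgraph-checkpoint-sqlite`). A guard of the kind such fallback paths typically use (an assumption; the actual guard lives in JARVIS Body) can be as small as:

```python
# Check whether LangGraph can be imported, without importing it eagerly.
# When this returns False, reasoning engines take their linear fallback path.
import importlib.util

def langgraph_available() -> bool:
    return importlib.util.find_spec("langgraph") is not None
```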

Google Workspace Integration (v245.0 — Fixed)

The Google Workspace Agent in JARVIS Body now successfully creates real Gmail drafts, checks email, queries calendar, and performs workspace searches via the Google API. Key fixes applied:

| Fix | Impact on J-Prime |
|-----|-------------------|
| Agent singleton cache bug (49s → 0.2s) | Workspace commands now reuse the cached agent, reducing total latency |
| Body generation via proper ModelRequest API | Draft email body generation now correctly calls J-Prime's inference endpoint |
| Task-type metadata flow | Workspace commands carry task_type metadata, enabling J-Prime's GCPModelSwapCoordinator to select the optimal model |

Real-Time Voice Conversation Infrastructure (v238.0 — JARVIS Body)

JARVIS Body v238.0 introduced a full real-time voice conversation pipeline — continuous, bidirectional, streaming voice dialogue. J-Prime serves as the LLM inference backend for this pipeline, streaming tokens via SSE that are immediately converted to speech.

How J-Prime is used in voice conversation:

User speaks → Mic → AEC → StreamingSTT (faster-whisper) → Text
  → ConversationPipeline sends to J-Prime (/v1/chat/completions, stream=true)
    → J-Prime routes to optimal specialist model (GCPModelSwapCoordinator)
      → SSE token stream back to JARVIS Body
        → SentenceSplitter accumulates tokens into sentences
          → Streaming TTS (Piper) synthesizes each sentence
            → AudioBus plays audio → User hears first word at ~300-500ms

Key implications for J-Prime:

| Aspect | Before (Command Mode) | After (Conversation Mode) |
|--------|-----------------------|---------------------------|
| Request pattern | Single request, wait for full response | Rapid-fire requests (every 2-8 seconds per turn) |
| Streaming | Optional, usually batch | Required — SSE streaming mandatory for latency |
| Context window | Single query | 20-turn sliding window (conversation history as messages array) |
| Latency target | <10 seconds acceptable | <500ms time-to-first-token critical for natural feel |
| Model selection | Task-type routing | Conversation defaults to Gemma-2-9B (general) with dynamic specialist routing |
| Sticky routing | Helps avoid swap | Critical — model swaps mid-conversation add 30s latency |

What this means for J-Prime's infrastructure:

  • SSE streaming performance is now latency-critical. Every millisecond between "user stops speaking" and "first token arrives" is perceptible. The existing /v1/chat/completions endpoint with stream=true is used directly.
  • Sticky routing prevents thrashing. In conversation mode, queries are mostly general_chat type, keeping the same model (Gemma-2-9B) loaded across turns. Task-type routing still works — if the user says "solve 5x+3=18" mid-conversation, the Math specialist handles it.
  • Context accumulation increases prompt size. A 20-turn conversation with 50-100 tokens per turn means 1000-2000 tokens of context per request. This is well within all models' context windows but increases per-request inference time.
  • Barge-in creates abandoned requests. When the user interrupts JARVIS mid-response, the SSE stream is cancelled client-side. J-Prime should handle stream cancellation gracefully (it already does via llama-cpp-python's generator cleanup).

No J-Prime code changes were needed — the existing OpenAI-compatible API with SSE streaming, sticky routing, and model swap coordination handles voice conversation natively. The entire implementation is in JARVIS Body's new backend/audio/ package.

Planned: Unified Agent Runtime (v247.0)

JARVIS Body is planning a Unified Agent Runtime — a persistent sense-think-act-verify-reflect loop for autonomous goal pursuit. J-Prime's role:

  • THINK phase: The Agent Runtime calls J-Prime's inference API for goal decomposition, planning, and sub-step generation
  • Task-type routing matters more: Multi-step autonomous goals will send a wider variety of task types (analysis, planning, code, creative) — J-Prime's specialist routing becomes critical
  • Checkpoint-aware inference: The Runtime will checkpoint goal state between phases; J-Prime may need to support session-context-aware inference for continuity across sub-steps
  • Higher request volume: Autonomous operation generates more inference requests than reactive command-response; J-Prime's sticky routing and cooldown mechanisms will be stress-tested

✅ v243.0/v243.1 — Command Lifecycle Events + Event Bus Lifecycle (COMPLETED — JARVIS Body-side)

v243.0/v243.1 shipped as Command Lifecycle Events and Event Infrastructure Lifecycle Management in the JARVIS Body repo. This affects J-Prime because command lifecycle events now flow through TrinityEventBus, providing visibility into how J-Prime's inference results are used downstream.

What this means for J-Prime:

  • Command outcomes are now observable. When JARVIS Body classifies a user query, routes it to J-Prime for inference, and receives a response, the full lifecycle is published as events (command.received → command.classified → command.completed/command.failed). NeuralMesh's Knowledge Graph consumes these events to build semantic memory of command patterns.
  • Boot-order races resolved. TrinityEventBus is now explicitly started in the supervisor's Phase 4 (Intelligence) before any subscriber connects. Previously, NeuralMesh needed a 10s delayed retry because the bus might not exist when subscribers tried to connect.
  • Health monitoring. HealthAggregator now tracks TrinityEventBus metrics (events published/delivered/failed, active subscriptions) and ProactiveEventStream state. J-Prime health endpoints can surface this data.
  • Graceful shutdown. Event buses are stopped AFTER subscribers (AGI OS, NeuralMesh) but BEFORE broad task cancellation, preventing orphaned handlers.

Files modified (all in JARVIS Body repo):

  • unified_supervisor.py — Event state tracking, explicit startup, health checks, DMS progress, shutdown
  • backend/core/trinity_event_bus.py — Command lifecycle event types
  • backend/api/unified_command_processor.py — Event emission at each command stage
  • backend/neural_mesh/neural_mesh_coordinator.py — Knowledge Graph subscription

Impact on J-Prime roadmap: Command lifecycle telemetry creates richer training signals for the v242.0 DPO pipeline — the system now knows not just what J-Prime returned, but whether the command succeeded or failed downstream.


✅ v244.0 — Startup Warning Root Fix + Brain Vacuum Classification (COMPLETED — JARVIS Body-side)

v244.0 shipped three fix categories in the JARVIS Body repo. The third — brain vacuum classification — directly affects J-Prime's fallback behavior:

Brain Vacuum Classification Fix:

When J-Prime is unreachable (network issue, GCP VM down, model loading), JARVIS Body falls back to Claude API or Gemini via _brain_vacuum_fallback() in jarvis_prime_client.py. Before v244.0, this fallback hardcoded intent="answer" for all responses — meaning action commands like "lock my screen" or "open Safari" became text explanations instead of executing the action.

After v244.0, the fallback includes a classification prompt prefix:

User: "lock my screen"
  → J-Prime unreachable → brain vacuum fallback
    → Claude API invoked with classification prompt prefix
      → Response: CLASSIFICATION: {"intent": "action", "domain": "system",
                   "requires_action": true, "suggested_actions": ["lock_screen"]}
      → StructuredResponse.intent = "action"  (NOT "answer")
      → Command pipeline executes lock_screen

Valid classifications:

  • Intents: answer, conversation, action, vision_needed, multi_step_action, clarify
  • Domains: general, system, security, workspace, development, media, smart_home
  • Fallback: If classification parsing fails, defaults to intent="answer" (safe default)
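
The parsing side can be sketched like this (a simplified illustration of the `_parse_classification()` idea; the regex and exact defaults are assumptions, not the shipped code):

```python
# Extract a CLASSIFICATION JSON line from a fallback LLM response,
# defaulting to intent="answer" whenever parsing or validation fails.
import json
import re

VALID_INTENTS = {"answer", "conversation", "action", "vision_needed",
                 "multi_step_action", "clarify"}

def parse_classification(text: str) -> dict:
    m = re.search(r"CLASSIFICATION:\s*(\{.*\})", text, re.DOTALL)
    if not m:
        return {"intent": "answer"}
    try:
        data = json.loads(m.group(1))
    except json.JSONDecodeError:
        return {"intent": "answer"}
    if data.get("intent") not in VALID_INTENTS:
        return {"intent": "answer"}
    return data
```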

Other v244.0 changes (not J-Prime specific):

  • 858 lines of dead code removed (orphaned tiered routing system imports/endpoints/tests)
  • Cloud SQL proxy startup reduced from ~47s to ~3-5s (redundant settling delay eliminated)

File modified: backend/core/jarvis_prime_client.py — _brain_vacuum_fallback(), _parse_classification(), _strip_classification_line()


Ouroboros: JARVIS Self-Programming (Planned — Future Version)

JARVIS becomes capable of reading, understanding, and improving its own codebase autonomously using a two-model pipeline:

  • Architect phase — DeepSeek-R1-Distill-Qwen-14B analyzes the JARVIS/J-Prime/Reactor-Core codebase, plans changes with explicit <think> reasoning traces. Outputs a structured plan with file paths, line numbers, specific changes, and risk assessment.
  • Implementer phase — Qwen2.5-Coder-14B-Instruct generates code diffs from the architect's plan. Multi-file changes with correct imports, type hints, and docstrings.
  • Verifier phase — DeepSeek-R1-14B reviews generated code, checks for missed requirements, and sends it back for revision if needed.
  • Execution pipeline — Architect (R1-14B, ~20-40s) → model swap (~30s) → Implementer (Coder-14B, ~20-40s) → model swap (~30s) → Verifier (R1-14B, ~15-25s). Total: ~2-3 minutes per self-improvement cycle.
  • Safety guardrails — Changes require human approval before commit. The automated test suite must pass. Automatic rollback on any failure. Git branch isolation for all self-modifications.
  • Self-improvement targets — Optimize model swap cooldowns from real usage patterns. Refactor detected code smells. Auto-generate missing test cases. Update documentation from code changes.

Why two models, not one: A specialist 14B code model generates better code than a generalist. A specialist 14B reasoning model produces better architectural plans than a code model. The model swap (~30s) is cheaper than the quality loss of using one model for both phases. Self-programming is not latency-sensitive — correctness matters more than speed.

v242.0 - DPO Training from Multi-Model Telemetry (Planned)

Activate DPO preference training using multi-model telemetry. Depends on v239.0 pipeline activation.

Corrected status (Feb 2026 audit) — more is built than previously reported:

  • TelemetryEmitter in JARVIS Body — emits emit_interaction() after every command. Telemetry JSONL files confirmed present in ~/.jarvis/telemetry/.
  • TrinityExperienceReceiver in Reactor Core — watches ~/.jarvis/ directories for event files, with deduplication and ordering
  • TelemetryIngestor in Reactor Core — reads JSONL from ~/.jarvis/telemetry/. Schema verified byte-identical to emitter output (v1.0 canonical).
  • UnifiedTrainingPipeline in Reactor Core — DPO/LoRA training and GGUF export chain exists
  • HotSwapManager in J-Prime — accepts fine-tuned GGUF files, zero-downtime swap
  • TrainingDataPipeline in J-Prime — captures conversations, generates DPO pairs
  • ReactorCoreBridge.upload_training_data() — Fully implemented (992 LOC, v242.0) with batch upload, fallback, and job tracking. Previously listed as "not implemented."

What v242.0 adds (on top of the v239.0 pipeline):

  • Fix B: J-Prime interaction capture — run_server.py adds the X-Model-Id header but doesn't log interactions for training. Every /v1/chat/completions request should be captured with full metadata.
  • Fix D: Automatic DPO pair generation — When the same query type gets different quality answers from different specialist models, automatically generate preference pairs without human labeling.
  • Ground truth sources — User corrections, Claude-as-judge evaluation, and objective metrics (code compilation, math verification) to avoid circular self-assessment in DPO pairs.

The multi-model training data advantage:

v241.1 multi-model routing creates IMPLICIT quality comparisons:

  Query: "5x+3=18"
    Mistral-7B (before routing fix):  "x = 11"  ← rejected
    Qwen2.5-Math-7B (after routing):  "x = 3"   ← chosen

  → Automatic DPO pair: {prompt: "5x+3=18", chosen: "x=3", rejected: "x=11"}
  → No human labeling needed. Multi-model routing IS the labeling mechanism.

Training constraints:

  • LoRA fine-tuning requires the full-precision base model (FP16, ~14 GB for 7B), not the GGUF
  • Training happens on a machine with sufficient RAM (local Mac or separate GCP VM)
  • The GGUF is the output — quantized and deployed to the golden image
  • Elastic Weight Consolidation (EWC) prevents catastrophic forgetting when training on task-specific data

v241.2 - 14B Model Tier (Planned)

Add three 14B-class models for significantly stronger reasoning, math, and code:

  • DeepSeek-R1-Distill-Qwen-14B (~8.1 GB, ~10 GB RAM) — 69.7% AIME 2024 (up from 55.5% on 7B). Explicit <think> chain-of-thought. Route reason_complex and analyze here.
  • Phi-4 (14B, ~8.0 GB, ~10 GB RAM) — Microsoft's 80.4% MATH. Route math_complex word problems here.
  • Qwen2.5-Coder-14B-Instruct (~8.1 GB, ~10 GB RAM) — ~80-85% HumanEval. Foundation for Ouroboros. Route code_complex and code_architecture here.
  • Update GCP_TASK_MODEL_MAPPING with 14B routing for complex tasks (7B stays for simple variants)
  • Update GCP_MODEL_CONFIGS with 14B-specific context sizes and templates
  • Add filename patterns for 14B models in GCPModelSwapCoordinator._scan_filenames()
  • Update golden image builder and manifest.json for 14 total models
  • Disk impact: +24.3 GB → total ~64.7 GB on 80 GB SSD (~15.3 GB headroom)

LLaVA Vision Integration (Planned — Future Version)

  • Build CLIP vision encoder pipeline in J-Prime (multimodal inference path)
  • Mark LLaVA-v1.6-Mistral-7B as routable: true in manifest
  • Route vision commands to self-hosted LLaVA instead of Claude Vision API
  • Eliminate last external API dependency for core features

Note: v244.0 shipped as the Startup Warning Root Fix + Brain Vacuum Classification Fix in the JARVIS Body repo. See § v244.0 above.

v245.0 - Agent Runtime Inference Support (Planned)

Support the JARVIS Body Unified Agent Runtime with enhanced inference capabilities:

  • Session-context inference — Accept optional session_id and goal_context in /v1/chat/completions metadata, enabling multi-step reasoning that remembers previous sub-steps within the same autonomous goal
  • Batch sub-step inference — Accept an array of related prompts (e.g., decomposition + planning + risk assessment) to reduce model swap overhead when the same model handles multiple phases
  • Streaming progress — Return partial results via SSE for long-running inference during autonomous THINK phases, so the Agent Runtime can checkpoint intermediate reasoning
  • Priority queue for autonomous vs. interactive — Interactive user commands get priority over autonomous background inference to maintain responsiveness
  • Telemetry attribution — Tag inference requests with source: "agent_runtime" vs source: "user_command" for separate monitoring and training data collection
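
The priority-queue idea can be sketched with asyncio (names and the 0/1 priority scheme are assumptions about a planned feature, not shipped code):

```python
# Interactive user commands (priority 0) are dequeued before autonomous
# agent requests (priority 1); a counter keeps FIFO order within a priority.
import asyncio
import itertools

_seq = itertools.count()

async def submit(queue: asyncio.PriorityQueue, prompt: str, source: str):
    priority = 0 if source == "user_command" else 1
    fut = asyncio.get_running_loop().create_future()
    await queue.put((priority, next(_seq), prompt, fut))
    return await fut

async def inference_worker(queue: asyncio.PriorityQueue):
    while True:
        _priority, _, prompt, fut = await queue.get()
        fut.set_result(f"completion for: {prompt}")  # stand-in for inference
        queue.task_done()
```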

v239.0 - Pipeline Activation: Wiring the Training Loop (In Progress)

Connect J-Prime to the Reactor Core training pipeline. Most infrastructure is already built — this version wires the existing components together with ~200-400 lines of changes across the ecosystem.

Corrected status (Feb 2026 audit):

  • ReactorCoreBridge.upload_training_data() — Previously reported as "not implemented"; VERIFIED: fully implemented (992 LOC, v242.0) with batch upload, file fallback, and job tracking. No action needed.
  • Experience schemas — Verified byte-identical across all three repos (v1.0 canonical ExperienceEvent). No alignment work needed.
  • HotSwapManager — Accepts GGUF files for zero-downtime model swap. ReactorCoreWatcher in JARVIS Body detects new model files.

What v239.0 adds for J-Prime:

  • Deployment feedback — After HotSwapManager loads a new model from Reactor Core, write a deployment_status.json feedback file to ~/.jarvis/reactor/feedback/ so Reactor Core knows the deployment succeeded or failed
  • Health verification after swap — After loading a new model, run a quick inference sanity check and include the result in the feedback file
  • Interaction capture in run_server.py — Log every /v1/chat/completions request-response pair with X-Model-Id to disk for training data collection (supports DPO pair generation in Reactor Core)
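
The deployment feedback step could look like the following (the feedback directory comes from the roadmap text; the exact field set is an assumption):

```python
# Write a deployment feedback file after a hot swap so Reactor Core can
# observe whether the new model loaded and passed its sanity check.
import json
import time
from pathlib import Path

def write_deployment_feedback(model_id: str, success: bool,
                              sanity_check_passed: bool,
                              feedback_dir: Path) -> Path:
    feedback_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "model_id": model_id,
        "success": success,
        "sanity_check_passed": sanity_check_passed,
        "timestamp": time.time(),
    }
    path = feedback_dir / "deployment_status.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```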

v246.0 - Reactor Core Advanced Training Integration (Planned)

Advanced training features on top of the v239.0 pipeline:

  • Per-model DPO pair generation — When different specialist models answer the same query type with different quality, automatically generate preference pairs without human labeling
  • Temporal A/B testing — After deploying a fine-tuned model, compare metrics against the previous 2-hour window to detect regressions
  • Model lineage tracking — Every deployed model records base model, training method, dataset hash, and evaluation scores, so quality can be traced back to training data

✅ v241.0/v241.1 - Multi-Model GCP Golden Image + Task-Type Routing (Current)

  • 11 specialist models pre-baked in golden image (~40.4 GB on 80 GB SSD)
  • 8 routable models with intelligent task-type routing
  • GCPModelSwapCoordinator: pre-hook pattern, sticky routing, bounded queue, post-swap validation, rollback
  • Per-model executor configs (n_ctx, chat_template, n_gpu_layers, flash_attn)
  • Task-type inference in JARVIS Body with tightened code detection (2+ indicators, false-positive prevention)
  • Metadata flow: JARVIS Body → PrimeClient → ChatRequest.metadata → coordinator → model swap
  • X-Model-Id response header for per-model telemetry
  • manifest.json as primary model inventory (filename regex as fallback)
  • Per-model-size cooldowns: 30s (small) / 60s (medium) / 90s (large)
  • 3 pre-staged models: LLaVA (v242 vision), TinyLlama (speculative decoding), BGE (RAG)
  • v241.1: Added DeepSeek-R1-Qwen-7B (reasoning), Gemma-2-9B (general), Qwen2.5-Math-7B (math)
  • Template auto-detection: qwen→chatml, deepseek→chatml, gemma-2→gemma
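
The sticky-routing cooldown logic amounts to a small policy check (class and method names here are illustrative; GCPModelSwapCoordinator's real implementation is richer):

```python
# Per-model-size swap cooldowns (30/60/90 s, as listed above). A request
# for a different model inside the cooldown window stays on the currently
# loaded model; a request for the same model is always a sticky hit.
COOLDOWNS = {"small": 30.0, "medium": 60.0, "large": 90.0}

class StickySwapPolicy:
    def __init__(self):
        self.current_model = None
        self.loaded_at = 0.0

    def should_swap(self, target_model: str, current_size: str,
                    now: float) -> bool:
        if target_model == self.current_model:
            return False                 # sticky hit: no swap needed
        if self.current_model is None:
            return True                  # nothing loaded yet
        return (now - self.loaded_at) >= COOLDOWNS[current_size]

    def record_swap(self, model: str, now: float) -> None:
        self.current_model = model
        self.loaded_at = now
```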

✅ v238.0 — Real-Time Voice Conversation Infrastructure (JARVIS Body-side)

  • 7-layer audio infrastructure (Layers -1 through 6) in JARVIS Body: FullDuplexDevice, AudioBus+AEC, Streaming TTS (Piper), Streaming STT (faster-whisper), Turn Detection, Barge-In, Conversation Pipeline, Mode Dispatcher
  • J-Prime SSE streaming (/v1/chat/completions, stream=true) serves as the LLM backend for real-time conversation responses
  • 20-turn sliding context window sent as messages array per request — no J-Prime changes needed
  • Sticky routing prevents model thrashing during conversations (conversation queries are mostly general_chat → Gemma-2-9B stays loaded)
  • SentenceSplitter in JARVIS Body accumulates J-Prime tokens into sentences → Streaming TTS yields ~300-500ms time-to-first-audio
  • Barge-in creates abandoned SSE streams — J-Prime's llama-cpp-python generator cleanup handles cancellation gracefully
  • No J-Prime code changes required — existing OpenAI-compatible API with SSE streaming handles voice conversation natively
  • Two-phase bootstrap: AudioBus starts before narrator (Phase 1), full pipeline wires after Intelligence provides the J-Prime client (Phase 2)

✅ v238.0 - Degenerate Response Elimination (JARVIS Body-side)

  • SIMPLE classification narrowed: "what is/who is/define" queries promoted to MODERATE
  • Backend degenerate response detection with safe retry (MODERATE params)
  • Client-side degenerate response suppression before display/TTS
  • requestId echo in all backend WebSocket response dicts (enables frontend dedup)
  • command_response handler aligned with response handler (dedup, ref clearing, validation)
  • Defense-in-depth: 3-layer architecture (classification → backend retry → client filter)
  • Production verified: "what is Java?" → gcp_prime (24.6s latency, full definition)

✅ v100.0 - Neural Orchestrator Core

  • Unified routing architecture consolidating all routers
  • Protocol-based design with type-safe interfaces
  • Context-aware routing with distributed tracing
  • Dynamic configuration with zero hardcoding
  • Cross-repo state management with atomic operations
  • Unified task classifier with multi-signal analysis
  • Unified memory monitor with macOS native integration
  • Unified sticky routing with session affinity
  • Unified request buffer for zero-loss hot swaps
  • Coordinated circuit breakers per tier
  • Advanced Python patterns (Protocols, contextvars, async generators, weakref)
  • Defensive decorators with graceful fallbacks
  • Exponential backoff with decorrelated jitter
  • Structured concurrency with TaskGroup (Python 3.11+)

✅ v99.0 - Dynamic Model Registry

  • Multi-directory model discovery
  • Auto-download from HuggingFace
  • File system watching with watchdog
  • Reactor Core synchronization
  • Model validation (integrity, inference, safety)
  • Version management with rollback support

✅ v98.0 - Neural Switchboard

  • Task classification with multi-signal analysis
  • Memory monitoring with real-time pressure detection
  • Sticky routing with session-based affinity
  • Request buffering for zero-loss hot swaps
  • Tier/capability mapping

✅ v92.0 - LLM/Brain Intelligence

  • Auto model selector with complexity-based routing
  • Unified inference with fallback chain
  • RLHF pipeline with PPO
  • Reactor Core bridge for training integration
  • Continuous learning with EWC
  • Dynamic batching for throughput optimization
  • Circuit breakers per backend

✅ v91.0 - Observability Bridge

  • Langfuse integration for distributed tracing
  • Prometheus export in OpenMetrics format
  • Chaos testing framework
  • Adaptive polling optimization
  • Cross-repo observability integration

✅ v90.0 - Production Hardening

  • Event delivery guarantees with retry + DLQ
  • Model validation (pre-deployment)
  • Request queuing during hot-swap
  • Canary deployments with gradual rollout
  • Auto-rollback on error threshold
  • Distributed tracing with TraceContext
  • Circuit breakers per endpoint
  • Metrics & alerting
  • SAGA pattern for transactional deployments

✅ v87.0 - The Connective Tissue

  • Unified mode with single-command startup
  • Intelligent model router with fallback chain
  • GCP VM manager with spot instance lifecycle
  • Service mesh with dynamic discovery
  • Unified config (single YAML source)
  • RAM-aware routing with automatic failover
  • Adaptive thresholds with outcome learning

✅ v79.1 - Cognitive Router "Corpus Callosum"

  • CognitiveRouter with adaptive thresholds
  • PrimeBridge with circuit breaker and connection pooling
  • Response cache for graceful degradation
  • Fixed singleton race condition (asyncio.Condition)
  • Fixed file IPC race conditions (fcntl locking, OrderedDict)
  • Fallback chain (4 levels)
  • Adaptive polling intervals
  • Bounded message queues
  • Zero hardcoding (all env vars)
  • Production-grade resilience patterns

🔮 v101.0 - Advanced Features (Planned)

  • Request deduplication
  • Routing decision caching
  • Continuous memory pressure monitoring during execution
  • Deadlock detection for locks
  • Request cancellation support
  • Request batching optimization
  • Distributed tracing correlation enhancement

🧪 Testing & Development

Running Tests

# Unit tests
pytest tests/unit/

# Integration tests
pytest tests/integration/

# End-to-end tests
pytest tests/e2e/

# Neural Orchestrator Core tests
pytest tests/test_neural_orchestrator_core.py -v

# With coverage
pytest --cov=jarvis_prime --cov-report=html

# Test specific module
pytest tests/unit/test_neural_orchestrator_core.py -v

Development Server with Hot Reload

# Install in development mode
pip install -e ".[dev]"

# Run with auto-reload on code changes
python run_server.py --reload --debug

# Server restarts automatically when files change

Docker Deployment

# Build image
docker build -t jarvis-prime:latest .

# Run container
docker run -d \
  -p 8000:8000 \
  -v $(pwd)/models:/app/models \
  -v ~/.jarvis:/root/.jarvis \
  -e JARVIS_PRIME_INITIAL_MODEL=/app/models/mistral-7b.gguf \
  -e NEURAL_ORCHESTRATOR_ENABLED=true \
  jarvis-prime:latest

# Check logs
docker logs -f <container-id>

📚 Documentation

Core Documentation

Training & Models

Version-Specific Documentation


๐Ÿค Contributing

Contributions welcome! See CONTRIBUTING.md for guidelines.

Development Workflow

# Fork and clone
git clone https://github.com/YOUR_USERNAME/jarvis-prime.git
cd jarvis-prime

# Create feature branch
git checkout -b feature/amazing-feature

# Make changes and test
pytest tests/

# Commit with conventional commits
git commit -m "feat: add amazing feature

- Detailed description
- Why this change is needed
- Any breaking changes

🤖 Generated with Claude Code
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>"

# Push and create PR
git push origin feature/amazing-feature

📄 License

MIT License - see LICENSE for details


๐Ÿ™ Acknowledgments

  • Anthropic - Claude API and advanced reasoning capabilities
  • Meta AI - Llama models and research
  • Mistral AI - High-quality open models
  • Microsoft Research - Phi models for coding
  • Alibaba - Qwen multilingual models
  • ggerganov - llama.cpp runtime for efficient inference
  • HuggingFace - Model hosting and transformers library
  • OpenAI - API compatibility standards

📞 Support


๐Ÿ† Summary

What JARVIS Prime Delivers

✅ Multi-Model Self-Hosted LLM Fleet (v241.1) - 11 specialist models (~40.4 GB) on your own GCP VM: math, code, reasoning, creative, and general intelligence specialists. No OpenAI, no Claude, no third-party APIs.
✅ Intelligent Task-Type Routing (v241.1) - Math → Qwen2.5-Math-7B (83.6% MATH), Code → Qwen2.5-Coder-7B (70.4% HumanEval), Reasoning → DeepSeek-R1 (55.5% AIME), Simple → Phi-3.5-mini (~3s), General → Gemma-2-9B (72.3% MMLU). Automatic model selection via GCPModelSwapCoordinator.
✅ Adaptive Prompt System (v236.0 + v238.0) - Complexity-aware inference: "5+5?" → "10" (48 tokens, temp 0.0); "what is Java?" → full definition (512 tokens, temp 0.3); "design a system" → detailed analysis (4096 tokens, temp 0.7).
✅ Degenerate Response Defense-in-Depth (v238.0) - 3-layer protection (classification, backend retry, client suppression) ensures meaningless LLM output ("...") never reaches the user.
✅ Google Workspace Body Generation (v245.0) - Draft email body generation now correctly calls J-Prime via the proper ModelRequest API with task-type metadata, producing AI-generated email content through the specialist model fleet.
✅ Enterprise-Grade AGI Operating System - 11 specialist models, reasoning, multimodal fusion (LLaVA pre-staged).
✅ Neural Orchestrator Core v100.0 - Unified intelligent routing, single source of truth.
✅ GCP Golden Image Boot - Cold start in ~87 seconds with 11 pre-baked models on an 80 GB SSD.
✅ Production-Grade Resilience - Circuit breakers, fallback chains, post-swap validation, model rollback.
✅ Zero Hardcoding - Fully configurable via environment variables, YAML, and manifest.json.
✅ Safety-Aware Routing - Integrated with the JARVIS ActionSafetyManager.
✅ Zero-Downtime Operations - Hot swap models with a bounded queue (50-request limit, HTTP 503 on overflow).
✅ Complete Data Privacy - All inference on your infrastructure; no data leaves your VMs.
✅ Cost Optimization - ~$97/month flat for unlimited self-hosted inference across the 8 routable specialist models (no per-token billing).
✅ Per-Model Telemetry - X-Model-Id header on every response, plus Langfuse and Prometheus integration.
✅ Cross-Repo Integration - Task-type metadata flows from JARVIS Body through PrimeClient to the coordinator.
✅ Reactor-Core Training Loop - DPO/RLHF pipeline to fine-tune models from real interactions, with per-model attribution.
✅ Battle-Tested - 187K+ requests in production, zero failures.
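The adaptive prompt behavior above maps query complexity to inference parameters. A minimal sketch of that mapping, using the token/temperature values quoted in the feature list (the class and function names here are illustrative assumptions, not the actual J-Prime code):

```python
# Hypothetical sketch of the complexity -> inference-parameter mapping
# (v236.0 adaptive prompts). Values come from the examples above;
# names and fallback behavior are assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptParams:
    max_tokens: int
    temperature: float

COMPLEXITY_PARAMS = {
    "SIMPLE":   PromptParams(max_tokens=48,   temperature=0.0),  # "5+5?" -> "10"
    "MODERATE": PromptParams(max_tokens=512,  temperature=0.3),  # "what is Java?"
    "COMPLEX":  PromptParams(max_tokens=4096, temperature=0.7),  # "design a system"
}

def params_for(complexity: str) -> PromptParams:
    # Unknown levels fall back to the most permissive settings.
    return COMPLEXITY_PARAMS.get(complexity, COMPLEXITY_PARAMS["COMPLEX"])
```

The point of the frozen dataclass is that a classified query deterministically fixes its inference budget before the request ever reaches the model.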

Known Gaps (In Roadmap)

  • LangGraph not installed in JARVIS Body - all 9 reasoning graphs fall back to linear execution, so the reasoning context sent to J-Prime is sub-optimal (v246.0 target)
  • Training data upload built but dormant - a Feb 2026 audit corrected an earlier note claiming ReactorCoreBridge.upload_training_data() was unimplemented: it is fully implemented (992 LOC, v242.0) with batch upload, file fallback, and job tracking. The real gap is operational activation; the training pipeline has never been run. See v239.0.
  • Training pipeline never activated - all components are built and schemas verified, but zero training jobs have ever run. ReactorCoreWatcher and initialize_reactor_core() exist in JARVIS Body but are never called during supervisor startup. Target: v239.0.
  • Deployment feedback loop missing - after HotSwapManager loads a new model, no feedback is sent to Reactor Core about success, failure, or regression. Deployment is one-way and blind. Target: v239.0.
  • No Agent Runtime inference support - J-Prime does not yet support session-context or batch inference for autonomous multi-step goal pursuit (v245.0 target)
  • Single concurrent request - CPU inference processes one request at a time, so autonomous background goals may queue behind interactive commands (v245.0 priority-queue target)

v241.1 Highlights

🤖 11 Specialist Models - Right model for every task, not one-size-fits-all
🧮 Math Specialist - Qwen2.5-Math-7B: 83.6% MATH benchmark, eliminates hallucinated arithmetic
💻 Code Specialist - Qwen2.5-Coder-7B: 70.4% HumanEval, trained on 5.5T code tokens
🧠 Reasoning Specialist - DeepSeek-R1: explicit chain-of-thought with <think> traces
⚡ Fast Lightweight - Phi-3.5-mini: ~3s latency for simple queries (3x faster than 7B)
🔄 Sticky Routing - Per-model-size cooldowns (30/60/90s) prevent model thrashing
🛡️ Post-Swap Validation - 5-token warmup after every load with automatic rollback on failure
📊 Per-Model Telemetry - X-Model-Id header identifies which specialist served each request
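The sticky-routing behavior above (per-model-size cooldowns of 30/60/90 s) can be sketched as a small state machine. This is an illustrative assumption about the mechanism, not the actual GCPModelSwapCoordinator implementation:

```python
# Sketch of per-model-size swap cooldowns preventing model thrashing.
# The class name, size tiers, and clock injection are assumptions.
import time

COOLDOWN_BY_SIZE = {"small": 30.0, "medium": 60.0, "large": 90.0}

class StickyRouter:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_swap = float("-inf")  # no swap has happened yet
        self._active = None
        self._active_size = "small"

    def can_swap(self) -> bool:
        # Cooldown length depends on the size of the currently active model.
        return self._clock() - self._last_swap >= COOLDOWN_BY_SIZE[self._active_size]

    def request_model(self, model_id: str, size: str) -> str:
        # Stay "sticky" on the active model while its cooldown holds.
        if self._active is not None and model_id != self._active and not self.can_swap():
            return self._active
        if model_id != self._active:
            self._active, self._active_size = model_id, size
            self._last_swap = self._clock()
        return self._active
```

Injecting the clock makes the cooldown logic trivially testable without sleeping; the same pattern applies whether the real coordinator keys cooldowns on model size, model id, or both.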

v100.0 Highlights

🧠 Neural Orchestrator Core - Unified routing architecture consolidating all routers
🛡️ Advanced Patterns - Protocol classes, contextvars, async generators, weakref
⚡ Performance - Sub-millisecond routing decisions, native macOS memory integration
🔧 Zero Hardcoding - 100% dynamic configuration with env var override
📊 Cross-Repo Integration - Atomic state management across the JARVIS ecosystem
🔄 Sticky Routing - Session-based model affinity for continuity
💾 Request Buffering - Zero-loss hot swap support
🔌 Circuit Breakers - Coordinated fault tolerance per tier

Ready for enterprise deployment with complete AGI capabilities!


Architecture at a Glance (v241.1)

User Request → JARVIS Body (Backend)
                     │
                     ├─→ Query Complexity Classification (SIMPLE/MODERATE/COMPLEX/ADVANCED/EXPERT)
                     ├─→ Adaptive Prompt Builder (system prompt, max_tokens, temperature)
                     ├─→ Task Type Inference (math_simple, code_complex, general_chat, etc.)
                     └─→ PrimeRouter → PrimeClient (metadata: {task_type, complexity_level})
                           │
                           ▼
               GCP Invincible Node (J-Prime, port 8000)
                     │
                     ├─→ GCPModelSwapCoordinator.ensure_model(task_type)
                     │     ├─→ GCP_TASK_MODEL_MAPPING resolution
                     │     ├─→ Sticky routing + cooldown check
                     │     └─→ Model swap if needed (unload → load → validate → serve)
                     │
                     ├─→ Active Model Inference (llama-cpp-python)
                     │     ├─→ Phi-3.5-mini (~3s)     - simple queries
                     │     ├─→ Qwen2.5-Math-7B (~7s)  - math
                     │     ├─→ Qwen2.5-Coder-7B (~7s) - code
                     │     ├─→ DeepSeek-R1 (~10s)     - reasoning
                     │     ├─→ Gemma-2-9B (~9s)       - general
                     │     └─→ ... (8 routable models)
                     │
                     └─→ Response + X-Model-Id header → JARVIS Body → Frontend

Powered by 11 self-hosted specialist models on your own GCP infrastructure. No third-party APIs required.
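The GCP_TASK_MODEL_MAPPING step in the diagram above resolves an inferred task type to a specialist model. A minimal sketch of that resolution, using the task types and models named in this README (the exact keys, model ids, and fallback policy are assumptions):

```python
# Hypothetical task-type -> model resolution, mirroring the
# GCP_TASK_MODEL_MAPPING step. Keys and ids are illustrative.
GCP_TASK_MODEL_MAPPING = {
    "math_simple":  "qwen2.5-math-7b",    # 83.6% MATH benchmark
    "code_complex": "qwen2.5-coder-7b",   # 70.4% HumanEval
    "reasoning":    "deepseek-r1",        # explicit chain-of-thought
    "simple":       "phi-3.5-mini",       # ~3s latency
    "general_chat": "gemma-2-9b",         # 72.3% MMLU
}

def resolve_model(task_type: str, default: str = "gemma-2-9b") -> str:
    # Unknown task types route to the general-purpose model rather
    # than failing the request.
    return GCP_TASK_MODEL_MAPPING.get(task_type, default)
```

Keeping this as a plain data mapping (rather than branching code) is what makes the "zero hardcoding" claim practical: the table can be overridden from YAML or environment variables without touching routing logic.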


Autonomous Gmail Triage Integration (Mind Role)

In Gmail autonomy, JARVIS-Prime is the semantic intelligence layer used by Body-side triage. Prime does not directly execute Gmail actions; it provides structured extraction and reasoning signals that drive safe policy outcomes.

Prime's Responsibilities in Triage

  • Produce structured semantic extraction for unread emails (keywords, urgency, sender-frequency signals).
  • Provide robust fallback behavior when extraction contracts degrade.
  • Preserve deterministic interfaces so Body-side scoring and policy remain stable.
  • Emit model-attribution metadata so Reactor-Core can learn from outcomes later.
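The responsibilities above imply a structured output contract that Body-side scoring can validate before trusting. A hypothetical shape for that contract, covering the signals listed (keywords, urgency, sender frequency, model attribution); the field names are illustrative assumptions, not Prime's actual schema:

```python
# Hypothetical triage extraction contract between Prime and Body.
from dataclasses import dataclass

@dataclass
class TriageExtraction:
    keywords: list            # salient topics extracted from the email
    urgency: float            # 0.0 (low) .. 1.0 (critical)
    sender_frequency: int     # messages previously seen from this sender
    model_id: str             # which specialist produced this (X-Model-Id)

    def is_valid(self) -> bool:
        # Body-side scoring should only trust in-range, attributed output;
        # anything else triggers the heuristic fallback path.
        return 0.0 <= self.urgency <= 1.0 and bool(self.model_id)
```

The explicit `is_valid` gate is what keeps Body-side policy deterministic: Prime's output either satisfies the contract or is discarded wholesale, never partially trusted.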

Cross-Repo Runtime Path

```mermaid
flowchart LR
    A[JARVIS Body runtime cycle] --> B[Request semantic extraction]
    B --> C[JARVIS-Prime routing layer]
    C --> D[Best-fit specialist model]
    D --> E[Structured output contract]
    E --> F[Body scoring + policy]
    F --> G[Notifications + UI updates]
    F --> H[Outcome telemetry to Reactor-Core]
```

What to Expect in Testing

  • Prime improves tier quality when extraction contracts validate.
  • If Prime output is invalid/unavailable, Body degrades to heuristic extraction without stalling triage.
  • User-visible behavior remains stable: command responses still work; freshness determines whether triage metadata is attached.
  • Frontend receives proactive notifications from Body-side notification bridge; Prime contributes semantic quality, not direct UI transport.
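The degradation path described above (invalid or unavailable Prime output falls back to heuristics without stalling triage) can be sketched as a guarded call. Both function names and the crude keyword heuristic are illustrative assumptions:

```python
# Sketch of Body-side graceful degradation: try Prime, validate the
# contract, and fall back to a heuristic extractor on any failure.
def heuristic_extract(email_text: str) -> dict:
    # Crude keyword-based fallback; the real heuristics would be richer.
    urgent_markers = ("urgent", "asap", "deadline")
    lowered = email_text.lower()
    return {
        "keywords": [w for w in urgent_markers if w in lowered],
        "urgency": 0.9 if any(w in lowered for w in urgent_markers) else 0.2,
        "source": "heuristic",
    }

def extract_with_fallback(email_text: str, prime_call) -> dict:
    try:
        result = prime_call(email_text)
        # Validate the contract shape before trusting Prime's output.
        if isinstance(result, dict) and "urgency" in result:
            return result
    except Exception:
        pass  # Prime unavailable or errored: degrade, never stall triage
    return heuristic_extract(email_text)
```

Note that contract violations and transport failures take the same path: from the policy layer's perspective, "Prime returned garbage" and "Prime is down" are indistinguishable, which is what keeps user-visible behavior stable.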

Built with ❤️ by Derek Russell. Powered by a self-hosted LLM fleet (Qwen, DeepSeek, Gemma, Llama, Mistral, Phi), llama-cpp-python, and the JARVIS Ecosystem.

About

Specialized PRIME models for JARVIS. Production-ready models with quantization, M1 Mac support, and seamless integration. Powered by Reactor Core.
