feature: Architecturally separate judge model for adversarial-resistant guardrails #1748
Did you check the docs?
- I have read all the NeMo-Guardrails docs
Is your feature request related to a problem? Please describe.
Current guardrail enforcement relies on LLM self-checking (`self_check_input`, `self_check_output`, `self_check_facts`). The same model that generates responses also judges whether those responses are safe.
In adversarial testing, single-model self-enforcement degrades under sustained conversational pressure. Multi-turn prompt injection gradually shifts compliance boundaries through context manipulation: roleplay framing, appeals to helpfulness, layered exceptions. The model progressively relaxes its own rules because the adversarial context is part of its conversation history.
Additionally, current rails primarily filter content (what the AI says), not actions (what the AI does). When agents have tool access, the dangerous artifact is the proposed action (a shell command, API call, or file write), not the natural language response.
Describe the solution you'd like
- **Stateless judge model**: a separate model instance evaluates proposed actions with zero conversation context. It receives only the action payload and the ground rules; no conversation history means no multi-turn manipulation vector.
- **Dual-gate evaluation**: Gate 1 is a deterministic pattern denylist (zero latency, zero inference cost, catches known-bad patterns); Gate 2 is a stateless LLM judge (catches novel or semantic threats the denylist can't express). Gate 1 always executes before Gate 2.
- **Hierarchical conflict resolution**: a priority-based layer hierarchy for safety rules. When rules conflict (e.g., "be helpful" vs. "don't expose credentials"), lower layers deterministically override higher layers, eliminating ambiguous edge-case behavior.
- **Action-level filtering**: guardrails applied to the proposed action (tool call, shell command, API request), not just the natural language response. A benign-sounding response can contain a destructive tool call that content filtering doesn't catch.
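A minimal sketch of the dual-gate flow described above. All names, denylist patterns, and the judge stub are illustrative assumptions for this issue, not the NeMo-Guardrails API; the stateless judge would be a real model call that receives only the action and the ground rules, never the chat history.

```python
import re

# Gate 1: deterministic denylist. Patterns here are illustrative examples.
DENYLIST = [
    re.compile(r"rm\s+-rf\s+/"),      # destructive recursive delete
    re.compile(r"curl\s+.*\|\s*sh"),  # pipe-to-shell download
]

def gate1_denylist(action: str) -> bool:
    """Deterministic pattern check: True means the action is blocked."""
    return any(p.search(action) for p in DENYLIST)

def gate2_stateless_judge(action: str, ground_rules: str) -> bool:
    """Stateless LLM judge stub. It is built a fresh prompt containing only
    the action payload and the rules -- no conversation history attached."""
    prompt = f"Rules:\n{ground_rules}\n\nProposed action:\n{action}\n\nBlock? yes/no"
    # verdict = judge_model.generate(prompt)  # hypothetical model call
    return False  # placeholder: the real judge's decision goes here

def evaluate_action(action: str, ground_rules: str) -> str:
    # Gate 1 runs first: zero inference cost for known-bad patterns.
    if gate1_denylist(action):
        return "blocked:denylist"
    # Gate 2 only runs for actions the denylist cannot classify.
    if gate2_stateless_judge(action, ground_rules):
        return "blocked:judge"
    return "allowed"
```

Because the judge is constructed per-action with no accumulated context, a multi-turn jailbreak that has shifted the conversation model's boundaries has nothing to shift in the judge.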
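The conflict-resolution semantics can be sketched as a lowest-layer-wins rule over prioritized verdicts. Layer numbers and rule names below are assumptions for illustration, not taken from the RFC.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    layer: int    # lower layer = higher authority (layer 0 is inviolable)
    name: str
    verdict: str  # "allow" or "deny"

def resolve(verdicts: list[Rule]) -> Rule:
    """When rules conflict, the lowest-layer rule wins deterministically."""
    return min(verdicts, key=lambda r: r.layer)

# The "be helpful" vs. "don't expose credentials" conflict from the text:
conflict = [
    Rule(layer=2, name="be_helpful", verdict="allow"),
    Rule(layer=0, name="no_credential_exposure", verdict="deny"),
]
winner = resolve(conflict)  # the layer-0 safety rule overrides helpfulness
```

Because resolution is a pure function of the layer ordering, the same conflict always produces the same outcome; there is no prompt-dependent ambiguity at the edge cases.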
Describe alternatives you've considered
- **Self-checking with stronger prompts**: tested extensively; it degrades under adversarial pressure regardless of prompt quality. The problem is architectural (shared context), not prompt engineering.
- **Deterministic-only (no LLM judge)**: predictable, but with permanent blind spots; any pattern not in the denylist passes with 100% reliability. Semantic evaluation is needed for novel threats.
- The dual-gate approach covers both: deterministic checks for known patterns, semantic evaluation for novel ones. Each gate covers the other's blind spots.
Additional context
Full specification with empirical testing results, conflict resolution semantics, conformance requirements, and reference implementation:
RFC-ASA-001: The Asimov Safety Architecture for Autonomous AI Agents
https://github.com/h-network/RFCs/blob/main/RFC-ASA-001-Safety-Architecture-LLMs.md
Reference implementation with 44 security hardening items:
https://github.com/h-network/h-cli