feature: Architecturally separate judge model for adversarial-resistant guardrails #1748

@h-network

Description

Did you check the docs?

  • I have read all the NeMo-Guardrails docs

Is your feature request related to a problem? Please describe.

Current guardrail enforcement relies on LLM self-checking (self_check_input, self_check_output, self_check_facts). The same model that generates responses also judges whether those responses are safe.

In adversarial testing, single-model self-enforcement degrades under sustained conversational pressure. Multi-turn prompt injection gradually shifts compliance boundaries through context manipulation: roleplay framing, appeals to helpfulness, layered exceptions. The model progressively relaxes its own rules because the adversarial context is part of its conversation history.

Additionally, current rails primarily filter content (what the AI says), not actions (what the AI does). When agents have tool access, the dangerous artifact is the proposed action — a shell command, API call, or file write — not the natural language response.

Describe the solution you'd like

  1. Stateless judge model — A separate model instance evaluates proposed actions with zero conversation context. It receives only the action payload and the ground rules. No conversation history means no multi-turn manipulation vector.
  2. Dual-gate evaluation — Gate 1: deterministic pattern denylist (zero latency, zero inference cost, catches known-bad patterns). Gate 2: stateless LLM judge (catches novel/semantic threats the denylist can't express). Gate 1 executes before Gate 2.
  3. Hierarchical conflict resolution — A priority-based layer hierarchy for safety rules. When rules conflict (e.g., "be helpful" vs "don't expose credentials"), lower layers always override higher layers deterministically. Eliminates ambiguous edge-case behavior.
  4. Action-level filtering — Guardrails applied to the proposed action (tool call, shell command, API request), not just the natural language response. A benign-sounding response can contain a destructive tool call — content filtering doesn't catch this.
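
The dual-gate flow described above could look roughly like the following sketch. All names here (`evaluate_action`, `judge_stateless`, the denylist patterns) are illustrative assumptions for this issue, not part of the NeMo-Guardrails API:

```python
import re
from dataclasses import dataclass

# Gate 1: deterministic denylist of known-bad action patterns.
# Zero inference cost; patterns below are examples only.
DENYLIST = [
    re.compile(r"rm\s+-rf\s+/"),      # destructive shell command
    re.compile(r"curl\s+.*\|\s*sh"),  # pipe-to-shell install
]

@dataclass
class Verdict:
    allowed: bool
    gate: str      # which gate decided: "gate1", "gate2", or "none"
    reason: str

def judge_stateless(action: str, rules: str) -> bool:
    """Placeholder for a separate LLM call that receives ONLY the
    action payload and the ground rules -- never the conversation
    history, so there is no multi-turn manipulation vector."""
    # e.g. llm.complete(f"{rules}\nProposed action: {action}\nSafe? yes/no")
    return True  # stub: assume the judge approves

def evaluate_action(action: str, rules: str) -> Verdict:
    # Gate 1 executes before Gate 2.
    for pattern in DENYLIST:
        if pattern.search(action):
            return Verdict(False, "gate1", f"matched {pattern.pattern!r}")
    # Gate 2: stateless semantic judge for novel threats the
    # denylist cannot express.
    if not judge_stateless(action, rules):
        return Verdict(False, "gate2", "judge rejected action")
    return Verdict(True, "none", "passed both gates")
```

Note that the evaluation is applied to the proposed action (the tool call or shell command), not the natural-language response wrapping it.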

Describe alternatives you've considered

  • Self-checking with stronger prompts: Tested extensively. Degrades under adversarial pressure regardless of prompt quality. The problem is architectural (shared context), not prompt engineering.
  • Deterministic-only (no LLM judge): Predictable but has permanent blind spots. Any pattern not in the denylist passes with 100% reliability. Semantic evaluation is needed for novel threats.
  • The dual-gate approach covers both: deterministic for known patterns, semantic for novel ones. Each covers the other's blind spots.
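
For the hierarchical conflict resolution in point 3, a minimal sketch of deterministic layer-priority semantics might look like this. The layer numbers, rule names, and `resolve` helper are assumptions for illustration; the actual semantics are specified in the RFC linked below:

```python
from typing import Callable, Optional

# A rule is (layer, name, check). check returns "allow", "deny", or
# None (no opinion). Lower layer number = more fundamental rule; the
# first non-None verdict from the lowest layer wins, so a conflict
# like "be helpful" vs "don't expose credentials" resolves the same
# way every time.
Rule = tuple[int, str, Callable[[str], Optional[str]]]

RULES: list[Rule] = [
    (0, "no credential exposure",
     lambda action: "deny" if "api_key" in action else None),
    (1, "be helpful",
     lambda action: "allow"),
]

def resolve(action: str) -> str:
    for _layer, name, check in sorted(RULES, key=lambda r: r[0]):
        verdict = check(action)
        if verdict is not None:
            return f"{verdict} ({name})"
    return "deny (no rule matched)"  # fail closed by default
```

Because evaluation order is fixed by layer, the edge-case behavior is fully determined by the rule hierarchy rather than by how the model weighs competing instructions.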

Additional context

Full specification with empirical testing results, conflict resolution semantics, conformance requirements, and reference implementation:

RFC-ASA-001: The Asimov Safety Architecture for Autonomous AI Agents
https://github.com/h-network/RFCs/blob/main/RFC-ASA-001-Safety-Architecture-LLMs.md

Reference implementation with 44 security hardening items:
https://github.com/h-network/h-cli
