feature: Architecturally separate judge model for adversarial-resistant guardrails #1748

@h-network

Description

Did you check the docs?

  • I have read all the NeMo-Guardrails docs

Is your feature request related to a problem? Please describe.

Current guardrail enforcement relies on LLM self-checking (self_check_input, self_check_output, self_check_facts). The same model that generates responses also judges whether those responses are safe.

In adversarial testing, single-model self-enforcement degrades under sustained conversational pressure. Multi-turn prompt injection gradually shifts compliance boundaries through context manipulation: roleplay framing, appeals to helpfulness, layered exceptions. The model progressively relaxes its own rules because the adversarial context is part of its conversation history.

Additionally, current rails primarily filter content (what the AI says), not actions (what the AI does). When agents have tool access, the dangerous artifact is the proposed action — a shell command, API call, or file write — not the natural language response.

Describe the solution you'd like

  1. Stateless judge model — A separate model instance evaluates proposed actions with zero conversation context. It receives only the action payload and the ground rules. No conversation history means no multi-turn manipulation vector.
  2. Dual-gate evaluation — Gate 1: deterministic pattern denylist (zero latency, zero inference cost, catches known-bad patterns). Gate 2: stateless LLM judge (catches novel/semantic threats the denylist can't express). Gate 1 executes before Gate 2.
  3. Hierarchical conflict resolution — A priority-based layer hierarchy for safety rules. When rules conflict (e.g., "be helpful" vs "don't expose credentials"), lower layers always override higher layers deterministically. Eliminates ambiguous edge-case behavior.
  4. Action-level filtering — Guardrails applied to the proposed action (tool call, shell command, API request), not just the natural language response. A benign-sounding response can contain a destructive tool call — content filtering doesn't catch this.
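
The dual-gate flow described above could look roughly like the following sketch. All names here (`evaluate_action`, `judge_stateless`, the denylist patterns) are illustrative assumptions for this issue, not part of the NeMo-Guardrails API:

```python
import re
from dataclasses import dataclass

# Gate 1: deterministic denylist of known-bad action patterns.
# Zero inference cost; patterns below are examples only.
DENYLIST = [
    re.compile(r"rm\s+-rf\s+/"),      # destructive shell command
    re.compile(r"curl\s+.*\|\s*sh"),  # pipe-to-shell install
]

@dataclass
class Verdict:
    allowed: bool
    gate: str      # which gate decided: "gate1", "gate2", or "none"
    reason: str

def judge_stateless(action: str, rules: str) -> bool:
    """Placeholder for a separate LLM call that receives ONLY the
    action payload and the ground rules -- never the conversation
    history, so there is no multi-turn manipulation vector."""
    # e.g. llm.complete(f"{rules}\nProposed action: {action}\nSafe? yes/no")
    return True  # stub: assume the judge approves

def evaluate_action(action: str, rules: str) -> Verdict:
    # Gate 1 executes before Gate 2.
    for pattern in DENYLIST:
        if pattern.search(action):
            return Verdict(False, "gate1", f"matched {pattern.pattern!r}")
    # Gate 2: stateless semantic judge for novel threats the
    # denylist cannot express.
    if not judge_stateless(action, rules):
        return Verdict(False, "gate2", "judge rejected action")
    return Verdict(True, "none", "passed both gates")
```

Note that the evaluation is applied to the proposed action (the tool call or shell command), not the natural-language response wrapping it.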

Describe alternatives you've considered

  • Self-checking with stronger prompts: Tested extensively. Degrades under adversarial pressure regardless of prompt quality. The problem is architectural (shared context), not prompt engineering.
  • Deterministic-only (no LLM judge): Predictable but has permanent blind spots. Any pattern not in the denylist passes with 100% reliability. Semantic evaluation is needed for novel threats.
  • The dual-gate approach covers both: deterministic for known patterns, semantic for novel ones. Each covers the other's blind spots.
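
For the hierarchical conflict resolution in point 3, a minimal sketch of deterministic layer-priority semantics might look like this. The layer numbers, rule names, and `resolve` helper are assumptions for illustration; the actual semantics are specified in the RFC linked below:

```python
from typing import Callable, Optional

# A rule is (layer, name, check). check returns "allow", "deny", or
# None (no opinion). Lower layer number = more fundamental rule; the
# first non-None verdict from the lowest layer wins, so a conflict
# like "be helpful" vs "don't expose credentials" resolves the same
# way every time.
Rule = tuple[int, str, Callable[[str], Optional[str]]]

RULES: list[Rule] = [
    (0, "no credential exposure",
     lambda action: "deny" if "api_key" in action else None),
    (1, "be helpful",
     lambda action: "allow"),
]

def resolve(action: str) -> str:
    for _layer, name, check in sorted(RULES, key=lambda r: r[0]):
        verdict = check(action)
        if verdict is not None:
            return f"{verdict} ({name})"
    return "deny (no rule matched)"  # fail closed by default
```

Because evaluation order is fixed by layer, the edge-case behavior is fully determined by the rule hierarchy rather than by how the model weighs competing instructions.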

Additional context

Full specification with empirical testing results, conflict resolution semantics, conformance requirements, and reference implementation:

RFC-ASA-001: The Asimov Safety Architecture for Autonomous AI Agents
https://github.com/h-network/RFCs/blob/main/RFC-ASA-001-Safety-Architecture-LLMs.md

Reference implementation with 44 security hardening items:
https://github.com/h-network/h-cli
