
refactor(jailbreak): Use onnx instead of pickle to load model#1715

Open
erickgalinkin wants to merge 6 commits into develop from jailbreak-onnx

Conversation

@erickgalinkin
Collaborator

@erickgalinkin erickgalinkin commented Mar 11, 2026

Description

Remove use of pickle by migrating to onnxruntime.

NB: depends on push of onnx model to HuggingFace.

Summary by CodeRabbit

  • Chores
    • Updated jailbreak detection model format and improved model loading with enhanced safety mechanisms.
    • Updated dependencies including transformers, PyTorch, and runtime components to newer versions for improved performance and compatibility.

@greptile-apps
Contributor

greptile-apps bot commented Mar 11, 2026

Greptile Summary

This PR replaces the pickle-based sklearn Random Forest classifier with an ONNX model loaded via onnxruntime, and adds auto-download logic via huggingface_hub. The ONNX migration itself in models.py is clean, but checks.py has two P0 bugs introduced in this PR:

  • from model_based.models import JailbreakClassifier is neither a valid absolute nor a relative import — it will raise ModuleNotFoundError when called through the nemoguardrails package. It should be from .models import JailbreakClassifier.
  • The return value of hf_hub_download(..., cache_dir=classifier_path) is discarded. hf_hub_download stores files in a nested cache directory, not directly at classifier_path/snowflake.onnx, so the auto-download path will always fail to load the model. Using local_dir=classifier_path and capturing the return value would fix this.

Confidence Score: 2/5

Not safe to merge — two P0 bugs in checks.py will cause runtime failures on every cold start.

The ONNX migration in models.py is correct, but checks.py introduces a broken import (from model_based.models instead of from .models) and discards the return value of hf_hub_download while using cache_dir instead of local_dir, meaning the auto-download path will never place the file where the code expects it. Both bugs manifest on the first use of the endpoint in a fresh environment.

nemoguardrails/library/jailbreak_detection/model_based/checks.py requires fixes before merge; tests/test_jailbreak_model_based.py has a stale missing-deps test.

Important Files Changed

  • nemoguardrails/library/jailbreak_detection/model_based/checks.py: Two P0 bugs: a broken from model_based.models import (should be from .models) and a discarded hf_hub_download return value, meaning the downloaded file is never found at the expected path.
  • nemoguardrails/library/jailbreak_detection/model_based/models.py: Migrates the pickle-based sklearn classifier to an ONNX InferenceSession; the redundant res[1][:2] slice noted in a prior thread is functionally harmless; CPUExecutionProvider is hardcoded, which is intentional for the RF model.
  • tests/test_jailbreak_model_based.py: Most mocking is updated correctly (onnxruntime InferenceSession), but test_model_based_classifier_missing_deps still patches sklearn.ensemble, which is no longer used, making the test ineffective.
  • nemoguardrails/library/jailbreak_detection/requirements.txt: Replaces sklearn/numpy 1.23.5 with onnxruntime and updates torch/transformers; huggingface_hub is added for auto-download; torchvision is retained per a prior thread explanation.
  • nemoguardrails/library/jailbreak_detection/Dockerfile: Updates the wget target from snowflake.pkl to snowflake.onnx; straightforward change.
  • nemoguardrails/library/jailbreak_detection/Dockerfile-GPU: Same as Dockerfile; updates the wget target from .pkl to .onnx.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[initialize_model called] --> B{EMBEDDING_CLASSIFIER_PATH set?}
    B -- No --> C[Log warning, return None]
    B -- Yes --> D{snowflake.onnx present at classifier_path?}
    D -- Yes --> G
    D -- No --> E[hf_hub_download snowflake.onnx]
    E --> F["⚠️ Return value discarded\nFile stored in nested cache dir\nnot at classifier_path/snowflake.onnx"]
    F --> G["from model_based.models import JailbreakClassifier\n⚠️ ModuleNotFoundError in package context\n(should be from .models)"]
    G --> H[JailbreakClassifier snowflake.onnx]
    H --> I[SnowflakeEmbed - torch/transformers]
    H --> J[InferenceSession CPUExecutionProvider]
    I --> K[embed text → numpy array]
    K --> L["classifier.run None, X: e"]
    L --> M["res[0] → classification label"]
    L --> N["res[1][0][classification] → probability"]
    M --> O[Return bool classification, float score]
    N --> O
```

Comments Outside Diff (2)

  1. nemoguardrails/library/jailbreak_detection/model_based/checks.py, line 51-53

    Broken import path will cause ModuleNotFoundError

    from model_based.models import JailbreakClassifier is not a valid absolute import and is not a relative import (missing the leading dot). When checks.py is imported via the nemoguardrails package (e.g. as nemoguardrails.library.jailbreak_detection.model_based.checks), Python will look for a top-level model_based package which does not exist. The original code used the full absolute path; the replacement should use a relative import.

  2. nemoguardrails/library/jailbreak_detection/model_based/checks.py, line 44-56

    hf_hub_download return value discarded; file path will be wrong

    hf_hub_download does not place the file directly at classifier_path/snowflake.onnx. It stores files in a nested HuggingFace cache structure (e.g. classifier_path/models--nvidia--NemoGuard-JailbreakDetect/snapshots/<hash>/snowflake.onnx) and returns the actual resolved path. Because the return value is discarded, the subsequent JailbreakClassifier(str(Path(classifier_path).joinpath("snowflake.onnx"))) call will still not find the file, causing a runtime error on every cold start.

    The fix is to capture the returned path and pass it directly:

        # check if model is present. If not, download it.
        model_path = Path(classifier_path).joinpath("snowflake.onnx")
        if not model_path.is_file():
            from huggingface_hub import hf_hub_download
    
            downloaded = hf_hub_download(
                repo_id="nvidia/NemoGuard-JailbreakDetect",
                filename="snowflake.onnx",
                local_dir=classifier_path,
            )
            model_path = Path(downloaded)
    
        from .models import JailbreakClassifier
    
        jailbreak_classifier = JailbreakClassifier(str(model_path))

    Note: local_dir (not cache_dir) places the file directly into the target directory.
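The import failure described in comment 1 can be reproduced in isolation without the nemoguardrails package. The sketch below builds a throwaway package whose layout mimics the one in the review; all package and file names here are hypothetical stand-ins, not the real module paths:

```python
# Hypothetical reproduction of the broken import: a package "pkg" with a
# "model_based" subpackage, mirroring the layout the review describes.
import importlib
import os
import sys
import tempfile

root = tempfile.mkdtemp()
sub = os.path.join(root, "pkg", "model_based")
os.makedirs(sub)
for init in (os.path.join(root, "pkg", "__init__.py"),
             os.path.join(sub, "__init__.py")):
    open(init, "w").close()
with open(os.path.join(sub, "models.py"), "w") as f:
    f.write("JailbreakClassifier = object\n")
# checks_bad.py uses the PR's absolute-looking import
with open(os.path.join(sub, "checks_bad.py"), "w") as f:
    f.write("from model_based.models import JailbreakClassifier\n")
# checks_good.py uses the relative import the review recommends
with open(os.path.join(sub, "checks_good.py"), "w") as f:
    f.write("from .models import JailbreakClassifier\n")

sys.path.insert(0, root)
try:
    # Python looks for a top-level "model_based" package, which does not exist
    importlib.import_module("pkg.model_based.checks_bad")
    bad_import_failed = False
except ModuleNotFoundError:
    bad_import_failed = True

# The relative import resolves within the containing package
good = importlib.import_module("pkg.model_based.checks_good")
print(bad_import_failed, good.JailbreakClassifier is object)  # True True
```

This shows why the bug only surfaces when checks.py is imported through the package: a top-level `model_based` never exists on sys.path, while the relative form resolves against the containing package.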

Prompt To Fix All With AI
This is a comment left during a code review.
Path: nemoguardrails/library/jailbreak_detection/model_based/checks.py
Line: 51-53

Comment:
**Broken import path will cause `ModuleNotFoundError`**

`from model_based.models import JailbreakClassifier` is not a valid absolute import and is not a relative import (missing the leading dot). When `checks.py` is imported via the `nemoguardrails` package (e.g. as `nemoguardrails.library.jailbreak_detection.model_based.checks`), Python will look for a top-level `model_based` package which does not exist. The original code used the full absolute path; the replacement should use a relative import.

```suggestion
    from .models import (
        JailbreakClassifier,
    )
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: tests/test_jailbreak_model_based.py
Line: 81-93

Comment:
**Stale test checks `sklearn.ensemble` which is no longer imported**

The test description says "If sklearn is missing, instantiating `JailbreakClassifier` should raise `ImportError`", but `sklearn` is no longer imported anywhere in `JailbreakClassifier.__init__` — only `onnxruntime` is. Setting `sklearn.ensemble` to `None` will therefore not trigger an `ImportError` from the new code path, making this test validate the wrong dependency. It should mock `onnxruntime` as `None` instead:

```suggestion
def test_model_based_classifier_missing_deps(monkeypatch):
    """
    If onnxruntime is missing, instantiating JailbreakClassifier should raise ImportError.
    """
    monkeypatch.setitem(sys.modules, "onnxruntime", None)
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (5): Last reviewed commit: "Fix file download for checks. Update Doc..."

@coderabbitai

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

This PR migrates the jailbreak detection classifier from a pickle-based model to ONNX Runtime, updating the model initialization, inference mechanism, dependencies, and test expectations accordingly.

Changes

  • Model Format Migration (nemoguardrails/library/jailbreak_detection/model_based/models.py, nemoguardrails/library/jailbreak_detection/model_based/checks.py): Changed model serialization from pickle ("snowflake.pkl") to ONNX format ("snowflake.onnx"). Updated inference from scikit-learn's predict_proba/argmax pattern to ONNX Runtime's InferenceSession.run API. Changed device selection from "cuda:0" to "cuda" and added trust_remote_code and safe_serialization flags for transformer loading.
  • Dependency Updates (nemoguardrails/library/jailbreak_detection/requirements.txt): Updated package versions for transformers, torch, torchvision, nemoguardrails, numpy, scikit-learn, and einops. Added onnxruntime>=1.24.3 as a new dependency.
  • Test Updates (tests/test_jailbreak_model_based.py): Updated the expected model file path from "snowflake.pkl" to "snowflake.onnx" in test_initialize_model_with_valid_path.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 40.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Test Results For Major Changes (⚠️ Warning): The PR description lacks test results for this major ONNX migration affecting the model inference pipeline. Resolution: add test results demonstrating that the ONNX model produces equivalent classification results, include performance metrics, and document the regression testing methodology.

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title check (✅ Passed): The title clearly and concisely summarizes the main change: migrating from pickle to ONNX for model loading in the jailbreak detection module.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
nemoguardrails/library/jailbreak_detection/model_based/models.py (2)

25-35: ⚠️ Potential issue | 🟠 Major

Pin the model revision to an immutable commit hash.

trust_remote_code=True executes code from the Hugging Face repository at load time. Without pinning revision to a specific commit, the application will execute any code updates published to that repository in the future, creating a supply-chain vulnerability.

Add revision="<commit-sha>" (full commit hash, not a branch or tag) to both AutoTokenizer.from_pretrained() and AutoModel.from_pretrained() calls.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoguardrails/library/jailbreak_detection/model_based/models.py` around
lines 25 - 35, The AutoTokenizer and AutoModel loads
(AutoTokenizer.from_pretrained and AutoModel.from_pretrained) currently use
trust_remote_code=True without pinning a revision; update both calls to include
revision="<commit-sha>" (a full immutable commit hash string) so the tokenizer
and model load a specific commit, keeping trust_remote_code=True and preserving
existing args (safe_serialization, add_pooling_layer) while preventing future
remote code changes from being executed.

25-35: ⚠️ Potential issue | 🟠 Major

Replace safe_serialization with use_safetensors for load-time safetensors configuration.

safe_serialization is a save-time flag, not a load-time parameter. On AutoTokenizer.from_pretrained() and AutoModel.from_pretrained(), use use_safetensors=True (or use_safetensors=None to prefer safetensors if available) instead. The current parameter will be ignored or may cause errors depending on the transformers version.

Suggested fix

```python
self.tokenizer = AutoTokenizer.from_pretrained(
    "Snowflake/snowflake-arctic-embed-m-long",
    trust_remote_code=True,
    use_safetensors=True,
)
self.model = AutoModel.from_pretrained(
    "Snowflake/snowflake-arctic-embed-m-long",
    trust_remote_code=True,
    add_pooling_layer=False,
    use_safetensors=True,
)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoguardrails/library/jailbreak_detection/model_based/models.py` around
lines 25 - 35, The calls to AutoTokenizer.from_pretrained and
AutoModel.from_pretrained in the model initialization use the save-time
parameter safe_serialization; replace safe_serialization with the load-time
option use_safetensors (e.g., use_safetensors=True or use_safetensors=None to
prefer safetensors) in both self.tokenizer and self.model constructor calls
(keep trust_remote_code and add_pooling_layer as-is) so the transformers loader
actually uses safetensors at load time.
🧹 Nitpick comments (3)
nemoguardrails/library/jailbreak_detection/model_based/models.py (1)

56-57: Build X as an explicit batched ndarray before calling ONNX Runtime.

{"X": [e]} passes a Python list of vectors into ORT. Exported sklearn ONNX models usually declare X as a 2-D float tensor, so this is relying on implicit coercion in a pretty fragile place.

🧪 Suggested change

```diff
     def __call__(self, text: str) -> Tuple[bool, float]:
         e = self.embed(text)
-        res = self.classifier.run(None, {"X": [e]})
+        x = np.asarray([e], dtype=np.float32)
+        res = self.classifier.run(None, {"X": x})
```

Also add at module scope:

```python
import numpy as np
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoguardrails/library/jailbreak_detection/model_based/models.py` around
lines 56 - 57, The code passes a Python list to ONNX Runtime
(self.classifier.run(None, {"X": [e]}) ), which relies on implicit coercion;
instead, import numpy as np at module scope and build an explicit 2-D batched
ndarray for X (e.g., X = np.asarray([e], dtype=np.float32) or np.expand_dims
with dtype float32) before calling self.classifier.run(None, {"X": X}) so the
input matches the exported sklearn ONNX model's 2-D float tensor expectation.
nemoguardrails/library/jailbreak_detection/model_based/checks.py (1)

46-46: Fail fast if snowflake.onnx is missing.

A bad EMBEDDING_CLASSIFIER_PATH will currently bubble up as a low-level ONNX Runtime error. Checking is_file() here would make startup failures much easier to diagnose, especially since this migration depends on a new artifact being present.

🛠 Suggested change

```diff
-    jailbreak_classifier = JailbreakClassifier(str(Path(classifier_path).joinpath("snowflake.onnx")))
+    model_path = Path(classifier_path).joinpath("snowflake.onnx")
+    if not model_path.is_file():
+        raise FileNotFoundError(f"Jailbreak classifier not found: {model_path}")
+    jailbreak_classifier = JailbreakClassifier(str(model_path))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nemoguardrails/library/jailbreak_detection/model_based/checks.py` at line 46,
Check for the presence of the ONNX file before instantiating
JailbreakClassifier: compute the path from classifier_path and "snowflake.onnx"
(reference the variable classifier_path and the filename "snowflake.onnx"), use
Path(...).is_file() and if it returns False raise a clear, early error (e.g.,
ValueError or RuntimeError) that includes the missing path, then only call
JailbreakClassifier with that path; this ensures startup fails fast instead of
surfacing a low-level ONNX Runtime error.
tests/test_jailbreak_model_based.py (1)

230-257: Please add a unit test for the ONNX session output shape.

This assertion only locks down the filename change. The fragile part of the migration is in nemoguardrails/library/jailbreak_detection/model_based/models.py, Lines 54-62, where InferenceSession.run() output is unpacked and scored; a mocked ORT test would catch regressions there.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_jailbreak_model_based.py` around lines 230 - 257, Add a unit test
that mocks onnxruntime.InferenceSession.run to return the exact tuple/array
shapes your model code expects and assert the classifier handles the unpacking
and produces the expected output; specifically, in the new test mock
models.InferenceSession (or the attribute used inside
nemoguardrails.library.jailbreak_detection.model_based.models) so that
InferenceSession.run returns a tuple of numpy arrays with the same shapes used
by JailbreakClassifier scoring, then instantiate or call JailbreakClassifier
(via initialize_model or directly) and assert no exceptions and that the
returned score/label shapes/values match expectations; focus on covering
InferenceSession.run, JailbreakClassifier, and the unpacking logic so
regressions in the run() output shape will fail the test.
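The test CodeRabbit asks for can be sketched without onnxruntime installed by mocking the session. The `[labels, probability-maps]` return shape assumed below matches what sklearn-exported ONNX classifiers commonly emit; the `classify` helper and the `"X"` input name are illustrative stand-ins for the real module's logic, not its actual API:

```python
# Hedged sketch of the suggested regression test: a stand-in session whose
# run() mimics a sklearn-exported classifier's ONNX outputs, exercising the
# res[0] / res[1][0][label] unpacking described in the review.
from unittest.mock import MagicMock

session = MagicMock()
# Assumed output shape: [labels, probability maps keyed by class label]
session.run.return_value = [[1], [{0: 0.2, 1: 0.8}]]

def classify(session, embedding):
    res = session.run(None, {"X": [embedding]})
    label = res[0][0]          # predicted class label
    prob = res[1][0][label]    # probability of the predicted class
    return bool(label), float(prob)

label, score = classify(session, [0.0] * 8)
print(label, score)  # True 0.8
```

Locking down this unpacking in a test would catch regressions in the `run()` output handling even when the real model file is unavailable in CI.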
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nemoguardrails/library/jailbreak_detection/requirements.txt`:
- Line 14: The file ending for the requirements list is missing a trailing
newline; edit the requirements.txt entry that contains "onnxruntime>=1.24.3" and
ensure the file ends with a single newline character (add a newline after the
last line) so the end-of-file fixer / linters stop failing.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 35bd8e9f-9b1d-410e-a34a-4e98c023a745

📥 Commits

Reviewing files that changed from the base of the PR and between 2646995 and 82633e2.

📒 Files selected for processing (4)
  • nemoguardrails/library/jailbreak_detection/model_based/checks.py
  • nemoguardrails/library/jailbreak_detection/model_based/models.py
  • nemoguardrails/library/jailbreak_detection/requirements.txt
  • tests/test_jailbreak_model_based.py

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Erick Galinkin <[email protected]>
@codecov

codecov bot commented Mar 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@Pouyanpi Pouyanpi changed the title Update code to use onnx instead of pickle refactor(jailbreak): Use onnx instead of pickle to load model Mar 12, 2026


Development

Successfully merging this pull request may close these issues.

feature: make jailbreak model-based rail fully usable in the open-source distribution

1 participant