feat/new compiler #33

Open
BornPsych wants to merge 69 commits into main from
feat/new-compiler

Conversation


@BornPsych BornPsych commented Sep 8, 2025

Note

Introduce a new regex runtime and split circuit code into circom/noir modules, update WASM/SDK integration, and add a browser proving app with stricter email validation and DKIM endpoint update.

  • Core:
    • Regex Runtime: Add new regex module (extraction, padding, patterns, WASM) using zk_regex_compiler; remove old inline regex logic and types.
    • Circuit Refactor: Split into circom and noir modules; add Noir input generation and limb/REDc utilities; consolidate helpers.
    • Email Parsing/Cryptos: Switch to new regex APIs; validate email length in PaddedEmailAddr (returns Result); update DKIM key endpoint; improve padding/sha utilities.
  • WASM:
    • Update exports to new regex functions; add Noir circuit input generators; adjust existing functions and error handling.
  • Browser Test App:
    • Add browser_test (Vite setup, worker, provers for Noir/Circom, assets) for in-browser proving.
  • Misc:
    • Bump package version; add yarn.lock.

Written by Cursor Bugbot for commit bcf50a2.

Summary by CodeRabbit

  • New Features

    • Interactive browser test app with two in-browser provers, sample email fixtures, and a minimal UI to drive proof flows.
    • Runtime regex extraction and padding utilities exposed to JS/WASM.
    • Expanded circuit-input generation supporting regex-driven extraction and external/externalizable inputs.
  • Bug Fixes

    • Improved error reporting and stricter validation for inputs and email parsing.
  • Chores

    • Version bumps, updated ignore rules, new build/config files and tooling for the browser test project.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bcf50a2 and f45e528.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • Cargo.toml (3 hunks)
  • package.json (1 hunks)
🔇 Additional comments (5)
package.json (2)

13-13: Clarify the pre-release version format.

The version suffix -9 is unconventional. Pre-release versions typically follow semantic versioning conventions like 0.4.66-rc.1, 0.4.66-beta.9, or 0.4.66-alpha.1.

Is 0.4.66-9 intentional, or should this be 0.4.66-rc.1 or similar?


4-12: Verify alignment between scripts and PR scope.

The PR introduces a Vite-based browser test app, but the scripts section shows no new build or dev targets for it. Similarly, the AI summary mentions "Vite-based browser test harness," yet no Vite-related devDependencies are visible in the provided snippet.

Verify that:

  • Required devDependencies (Vite, build tools for the browser app) are declared elsewhere (e.g., workspace dependencies, root package.json in a monorepo).
  • Build/dev scripts for the browser app are intentionally omitted or defined in a separate workspace package.
Cargo.toml (3)

3-3: Version bump rationale.

The version has been bumped from 0.4.62-12 to 0.4.62-17 (a 5-patch increment in prerelease). Ensure this aligns with the scope of changes (new regex module, browser test app, circuit refactoring) and follows your project's versioning strategy.


20-20: Dual regex dependencies are justified—regex is a transitive dependency of fancy-regex.

Verification findings:

  • fancy_regex is used directly: use fancy_regex::Regex; appears in src/regex/extract.rs.
  • The external regex crate is never imported; the only regex:: references in the codebase point at the local crate::regex module.
  • fancy-regex uses a hybrid approach in which its backtracking virtual machine delegates simple pattern matching to the regex crate, so regex is pulled in as a transitive dependency.

The dual dependency is intentional and necessary for fancy-regex's functionality. It can be left as-is or clarified with a comment in Cargo.toml.


53-53: No API compatibility issues found with sp1-verifier 5.0.0.

The code uses Groth16Verifier::verify() with the signature matching the v5.0.0 API, and the GROTH16_VK_BYTES constant is available. The single usage in src/wasm.rs is fully compatible. The default-features = false configuration is intentional and correct for WASM builds.

Comment thread: Cargo.toml (outdated)

```diff
 serde = { version = "1.0.159", features = ["derive"] }
 zk-regex-apis = { git = "https://github.com/zkemail/zk-regex.git" }
-zk-regex-compiler = { git = "https://github.com/zkemail/zk-regex.git" }
+zk-regex-compiler = { git = "https://github.com/zkemail/zk-regex.git", rev = "35014c2bff69115bb92be2dc2c9b006fa69a6649" }
```

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify zk-regex-compiler revision is valid and compatible.

The dependency is pinned to a specific commit hash. Ensure this revision exists, is reachable, and doesn't introduce security or stability issues.


🏁 Script executed:

```shell
#!/bin/bash
# Verify the zk-regex-compiler rev is valid and the git repository is accessible
git ls-remote https://github.com/zkemail/zk-regex.git 35014c2bff69115bb92be2dc2c9b006fa69a6649
```

Length of output: 0



---

🏁 Script executed:

```shell
#!/bin/bash
# Alternative verification: fetch repository info and check for commit
# Try with --heads and --tags to see all refs
git ls-remote --heads --tags https://github.com/zkemail/zk-regex.git | head -50
```

Length of output: 3471


🌐 Web query:

zkemail zk-regex commit 35014c2bff69115bb92be2dc2c9b006fa69a6649

💡 Result:

The exact commit hash was not found in public zkemail/zk-regex sources (zkemail GitHub org, zk-regex documentation).


🏁 Script executed:

```shell
#!/bin/bash
# Direct attempt to fetch the specific commit
git fetch --depth=1 https://github.com/zkemail/zk-regex.git 35014c2bff69115bb92be2dc2c9b006fa69a6649 2>&1 || echo "Fetch failed or commit not found"
```

Length of output: 101


Fix zk-regex-compiler revision: commit does not exist in repository.

The pinned commit 35014c2bff69115bb92be2dc2c9b006fa69a6649 is not found in the zkemail/zk-regex repository. Cargo will fail to resolve this dependency during build. Use a valid commit hash or tag (e.g., tag 2.1.0 which is publicly available).

🤖 Prompt for AI Agents
In Cargo.toml around line 16, the dependency zk-regex-compiler pins a
non-existent commit hash; replace the rev value with a valid reference (either a
real commit hash from the zkemail/zk-regex repo or a published tag such as
"2.1.0") by updating the dependency to use rev = "<valid-hash>" or tag = "2.1.0"
(or use git = ... and branch = "<branch>" if preferred), then run cargo update
or cargo fetch and verify cargo build succeeds.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/parse_email.rs (1)

261-293: Align fallback logic between get_command() and get_command_idxes().

There's an inconsistency between these two methods:

  • get_command() (lines 261-282) has a fallback strategy: it first tries canonicalized_body (line 266), and if that fails, falls back to cleaned_body (line 272).
  • get_command_idxes() (lines 285-293) only searches in cleaned_body (line 290), with no fallback.

This inconsistency can lead to bugs:

  1. If the command exists in canonicalized_body but not cleaned_body, get_command() succeeds while get_command_idxes() fails.
  2. If the command exists in both bodies at different positions, the indices returned by get_command_idxes() won't match the substring returned by get_command().

Consider one of these solutions:

Option 1: Make get_command_idxes() match the fallback logic:

 pub fn get_command_idxes(&self, ignore_body_hash_check: bool) -> Result<(usize, usize)> {
     let regex_config = serde_json::from_str(include_str!("../regexes/command.json"))?;
     if ignore_body_hash_check {
         Ok((0, 0))
     } else {
-        let idxes = extract_substr_idxes(&self.cleaned_body, &regex_config, None, false)?[0];
-        Ok(idxes)
+        match extract_substr_idxes(&self.canonicalized_body, &regex_config, None, false) {
+            Ok(idxes) => Ok(idxes[0]),
+            Err(_) => {
+                let idxes = extract_substr_idxes(&self.cleaned_body, &regex_config, None, false)?[0];
+                Ok(idxes)
+            }
+        }
     }
 }

Option 2: Simplify both methods to only use cleaned_body (if canonicalized_body fallback is unnecessary).

♻️ Duplicate comments (4)
browser_test/src/noirProver.ts (1)

270-282: Critical: Hex decoding corrupts bytes with leading zeros.

The replace(/^0+/, "") on line 272 strips all leading zeros, corrupting bytes like 0x0a (newline) into "a" or 0x00 into "", producing odd-length hex strings that cannot decode properly. Control characters and null bytes in public outputs will be silently mangled.

Apply this diff to fix the decoding:

 function toUtf8(hex: string): string {
-  // Remove '0x' prefix and leading zeros
-  const cleanHex = hex.slice(2).replace(/^0+/, "");
+  // Remove '0x' prefix
+  let cleanHex = hex.startsWith("0x") ? hex.slice(2) : hex;
+  
+  // Ensure even length by prepending '0' if needed
+  if (cleanHex.length % 2 !== 0) {
+    cleanHex = "0" + cleanHex;
+  }
 
   // Convert the hex to a Uint8Array
   const bytes = new Uint8Array(cleanHex.length / 2);
   for (let i = 0; i < cleanHex.length; i += 2) {
     bytes[i / 2] = parseInt(cleanHex.substring(i, i + 2), 16);
   }
 
   // Use TextDecoder to convert to UTF-8
   return new TextDecoder().decode(bytes);
 }
src/circuit/noir/mod.rs (3)

26-33: Security: Reconsider defaulting body hash check to skip.

Defaulting ignore_body_hash_check to true bypasses a critical email authentication mechanism, potentially allowing body tampering to go undetected. Consider defaulting to false to enforce verification by default.

Apply this diff:

-        ParsedEmail::new_from_raw_email(email, params.ignore_body_hash_check.unwrap_or(true))
+        ParsedEmail::new_from_raw_email(email, params.ignore_body_hash_check.unwrap_or(false))

158-164: Replace unwrap with error propagation.

Line 162 uses unwrap() on precomputed_sha, which will panic if it's None. Since the function returns Result, propagate the error properly.

Apply this diff:

         // Always generate partial_body_hash when body verification is enabled
         if params.ignore_body_hash_check.is_some_and(|x| !x) {
             noir_circuit_input.partial_body_real_length =
                 Some(parsed_email.canonicalized_body.len());
-            let partial_hash = u8_to_u32(email_circuit_inputs.precomputed_sha.unwrap().as_slice())?;
+            let partial_hash = u8_to_u32(
+                email_circuit_inputs
+                    .precomputed_sha
+                    .as_ref()
+                    .ok_or_else(|| anyhow::anyhow!("Precomputed SHA is missing"))?
+                    .as_slice()
+            )?;
             noir_circuit_input.partial_body_hash = Some(partial_hash);
         }

290-295: Add null checks before unwrapping body references.

Lines 292 and 294 use unwrap() without verification. If remove_soft_line_breaks is enabled but decoded_body is None, or if body processing was skipped, these will panic.

Apply this diff:

             HaystackLocation::Body => {
                 let body = if params.remove_soft_line_breaks {
-                    noir_circuit_input.decoded_body.as_ref().unwrap()
+                    noir_circuit_input.decoded_body.as_ref()
+                        .ok_or_else(|| anyhow::anyhow!("Decoded body is missing when remove_soft_line_breaks is enabled"))?
                 } else {
-                    noir_circuit_input.body.as_ref().unwrap()
+                    noir_circuit_input.body.as_ref()
+                        .ok_or_else(|| anyhow::anyhow!("Body is missing when required for regex matching"))?
                 };
🧹 Nitpick comments (2)
browser_test/src/noirProver.ts (1)

73-80: Consider using nullish coalescing for legacy field fallback.

While the current || operator works for truthy values, using nullish coalescing (??) would be more precise for handling null/undefined specifically, avoiding issues with falsy-but-valid values like empty strings.

Apply this diff:

           parts: dr.parts.map((p) => ({
             // @ts-ignore
             is_public: p.isPublic || !!p.is_public,
             // @ts-ignore
-            regex_def: p.regexDef || p.regex_def,
+            regex_def: p.regexDef ?? p.regex_def,
             // @ts-ignore
-            ...((p.isPublic || p.is_public) && { maxLength: p.maxLength || p.max_length || 64 }),
+            ...((p.isPublic ?? p.is_public) && { maxLength: p.maxLength ?? p.max_length ?? 64 }),
           })),
src/circuit/noir/mod.rs (1)

174-180: Address the TODO: handle missing selector gracefully.

The TODO comment indicates this will fail if the selector is not found in the body (e.g., when selector doesn't account for soft line breaks). Consider adding explicit error handling or validation to provide a clearer error message.

Would you like me to help implement proper error handling for the missing selector case?

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 882ea5f and 02409b7.

⛔ Files ignored due to path filters (2)
  • browser_test/bun.lock is excluded by !**/*.lock
  • yarn.lock is excluded by !**/yarn.lock, !**/*.lock
📒 Files selected for processing (8)
  • browser_test/package.json (1 hunks)
  • browser_test/src/noirProver.ts (1 hunks)
  • package.json (1 hunks)
  • src/circuit/mod.rs (1 hunks)
  • src/circuit/noir/mod.rs (1 hunks)
  • src/circuit/noir/structs.rs (1 hunks)
  • src/cryptos.rs (10 hunks)
  • src/parse_email.rs (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • package.json
  • browser_test/package.json
🧰 Additional context used
🧬 Code graph analysis (6)
browser_test/src/noirProver.ts (1)
src/wasm.rs (2)
  • parseEmail (51-77)
  • generateNoirCircuitInputsWithRegexesAndExternalInputs (605-652)
src/circuit/noir/mod.rs (5)
src/parse_email.rs (2)
  • remove_quoted_printable_soft_breaks (323-345)
  • new_from_raw_email (49-75)
src/converters.rs (2)
  • string_to_circom_bigint_bytes (463-483)
  • vec_u8_to_bigint (265-269)
src/circuit/circom.rs (2)
  • compute_signal_length (577-579)
  • clone (122-139)
src/circuit/mod.rs (2)
  • generate_circuit_inputs (134-192)
  • new (98-113)
src/circuit/noir/utils.rs (6)
  • bn_to_limb_str_array (101-134)
  • bn_to_redc_limb_str_array (161-164)
  • get_header_sequence (195-228)
  • u8_to_u32 (304-317)
  • get_address_header_sequence (248-290)
  • trim_sha256_padding (319-350)
src/circuit/mod.rs (4)
src/cryptos.rs (5)
  • generate_partial_sha (494-557)
  • sha256_pad (426-462)
  • new (59-61)
  • new (260-262)
  • new (335-343)
src/parse_email.rs (1)
  • remove_quoted_printable_soft_breaks (323-345)
src/converters.rs (1)
  • to_circom_bigint_bytes (254-256)
src/circuit/utils.rs (1)
  • get_adjusted_selector (66-87)
src/cryptos.rs (3)
src/regex/padding.rs (1)
  • pad_string (18-22)
src/circuit/mod.rs (1)
  • new (98-113)
src/converters.rs (2)
  • hex_to_field (20-67)
  • field_to_hex (76-79)
src/parse_email.rs (2)
src/regex/patterns.rs (8)
  • extract_body_hash_idxes (106-108)
  • extract_email_addr_idxes (81-83)
  • extract_email_domain_idxes (86-88)
  • extract_from_addr_idxes (91-93)
  • extract_message_id_idxes (116-118)
  • extract_subject_all_idxes (101-103)
  • extract_timestamp_idxes (111-113)
  • extract_to_addr_idxes (96-98)
src/regex/extract.rs (1)
  • extract_substr_idxes (188-201)
src/circuit/noir/structs.rs (2)
src/cryptos.rs (1)
  • deserialize (183-226)
src/circuit/circom.rs (1)
  • deserialize (155-161)
🔇 Additional comments (25)
src/parse_email.rs (3)

6-10: LGTM!

The refactoring from external zk_regex_apis to the local crate::regex module is clean and aligns with the PR's goal of introducing a new internal regex runtime.


209-210: LGTM!

The API call updates correctly pass None for the new nfa_graph parameter, which aligns with the updated extract_substr_idxes signature.

Also applies to: 214-214, 227-228, 231-231


340-342: LGTM!

The refactoring to use resize() for padding is cleaner and more idiomatic than manual loop-based padding. Using usize::MAX as a sentinel value for padded positions is well-documented in the function comments and aligns with the new padding strategy.
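The resize-plus-sentinel pattern can be reduced to a minimal stdlib-only sketch (illustrative names only; the real logic lives in the new regex module):

```rust
// Minimal sketch: pad an index vector to a fixed width, marking padded
// slots with usize::MAX so they cannot collide with real positions.
fn pad_idxes(mut idxes: Vec<usize>, target_len: usize) -> Vec<usize> {
    // Vec::resize fills new slots with the given value (and truncates
    // if the vector is already longer than target_len).
    idxes.resize(target_len, usize::MAX);
    idxes
}

fn main() {
    let padded = pad_idxes(vec![3, 7], 5);
    assert_eq!(padded, vec![3, 7, usize::MAX, usize::MAX, usize::MAX]);
}
```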

src/cryptos.rs (8)

3-3: LGTM! Good refactoring to use centralized padding utility.

The import of pad_string from the new regex module improves code reusability and aligns with the PR's objective to consolidate padding logic.


111-127: Excellent security improvement! Validation prevents truncation collisions.

The change to return Result<Self> with explicit length validation is a critical security enhancement. By rejecting email addresses exceeding 256 bytes upfront, this prevents scenarios where different email addresses could produce identical padded representations due to silent truncation.

The error handling is clear and the documentation properly explains the security rationale.
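A minimal sketch of the validate-then-pad pattern described here (the constant and function names are illustrative stand-ins for the real PaddedEmailAddr implementation):

```rust
// Illustrative limit; the actual constant lives in src/cryptos.rs.
const MAX_EMAIL_ADDR_BYTES: usize = 256;

// Reject oversized input *before* padding, so two distinct addresses can
// never silently truncate to the same padded representation.
fn padded_email_bytes(addr: &str) -> Result<Vec<u8>, String> {
    let bytes = addr.as_bytes();
    if bytes.len() > MAX_EMAIL_ADDR_BYTES {
        return Err(format!(
            "email address is {} bytes, exceeding the {}-byte limit",
            bytes.len(),
            MAX_EMAIL_ADDR_BYTES
        ));
    }
    let mut padded = bytes.to_vec();
    padded.resize(MAX_EMAIL_ADDR_BYTES, 0); // zero-pad to fixed width
    Ok(padded)
}

fn main() {
    assert_eq!(padded_email_bytes("a@example.com").unwrap().len(), 256);
    assert!(padded_email_bytes(&"a".repeat(257)).is_err()); // over limit
    assert!(padded_email_bytes(&"a".repeat(256)).is_ok()); // boundary passes
}
```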


505-505: Good improvement! Error propagation instead of panicking.

Replacing .unwrap() with ? operator for both Regex::new() and String::from_utf8() provides better error handling by propagating errors to the caller rather than panicking at runtime.

Also applies to: 515-515


800-816: LGTM! Proper error propagation throughout the function.

The function signature change to return Result<String> is consistent with the updated PaddedEmailAddr::from_email_addr signature. All error propagation using ? operator is correct, and the function now properly handles errors from:

  • Email address validation
  • Hex-to-field conversions
  • Account salt generation

Note: This is a breaking change to the public API.


833-833: Minor style improvement.

Using .or_else() for Option chaining is slightly more idiomatic than .unwrap_or_else() in this context, though both work correctly here.


601-697: Excellent test coverage for the new validation logic!

The test suite comprehensively covers:

  • Valid email address padding and verification
  • Boundary condition (exactly 256 bytes - should pass)
  • Over-limit rejection (257 bytes - should fail)
  • Collision prevention security scenarios
  • Long but valid emails under the limit
  • Integration with calculate_account_salt for both success and error paths

The tests are well-structured with clear assertions and descriptive error messages.


38-38: DKIM API endpoint is operational and returns valid data.

The new endpoint https://archive.zk.email/api/key/domain successfully returns DKIM key records in the expected JSON format with proper structure (domain, selector, firstSeenAt, lastSeenAt, value fields). The change is verified as live and compatible.


532-532: Verify circuit-level SHA padding handling with circuit implementation team.

The comment update at line 549 asserts that "the circuit handles SHA padding", allowing the remaining body to be padded to max_remaining_body_length with zeros rather than requiring it to be a multiple of 64 bytes. However, the verification script did not locate the circuit's actual SHA padding implementation (e.g., partial_sha256_var_end) or corresponding tests that validate this behavior with variable-length inputs.

Action required: Confirm with the circuit development team that the circuit implementation correctly handles variable-length remaining body inputs (i.e., not strictly multiples of 64) before this change is merged. The codebase reference alone is insufficient to verify this assumption.

src/circuit/mod.rs (3)

1-17: LGTM! Clean module organization.

The module structure and imports are well-organized, with appropriate re-exports for public APIs.


19-75: LGTM! Well-designed data structures.

The separation between public API structs (CircuitParams, CircuitOptions) and internal implementation structs (CircuitInput) provides good encapsulation. Serde annotations are correctly applied for JSON serialization.


77-114: LGTM! Secure defaults applied.

The constructor correctly applies sensible defaults, including defaulting ignore_body_hash_check to false to enforce body hash verification by default.

browser_test/src/noirProver.ts (4)

1-15: LGTM! Clean SDK initialization.

The imports and SDK configuration are well-structured, with helpful commented alternatives for different environments.


85-147: LGTM! Robust circuit execution flow.

The Noir circuit execution is well-structured with appropriate error handling and data transformations (Map to object conversion for circuit inputs).


167-178: LGTM! Clean email fetching with error handling.

The function properly handles fetch errors and returns the email content as text.


180-198: LGTM! Proper validation of required external inputs.

The function correctly ensures all required external inputs are provided and augments them with maxLength constraints.

src/circuit/noir/structs.rs (3)

4-11: LGTM! Clean custom deserializer.

The empty string to None deserializer is correctly implemented using filter, providing clean handling of empty optional strings.
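The core of the filter-based idea can be shown in a stdlib-only sketch (serde plumbing omitted; the helper name is illustrative):

```rust
// Option::filter turns Some("") into None while passing non-empty
// values through unchanged — the essence of the custom deserializer.
fn non_empty(value: Option<String>) -> Option<String> {
    value.filter(|s| !s.is_empty())
}

fn main() {
    assert_eq!(non_empty(Some(String::new())), None);
    assert_eq!(
        non_empty(Some("selector".to_string())),
        Some("selector".to_string())
    );
    assert_eq!(non_empty(None), None);
}
```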


13-62: LGTM! Well-designed data structures.

The struct definitions are clean and appropriate for Noir circuit inputs, with good use of Option for conditional fields and the custom deserializer for sha_precompute_selector to handle empty strings elegantly.


64-77: LGTM! Complete type definitions.

The HaystackLocation enum and RegexInput struct provide clear, type-safe interfaces for regex processing in Noir circuits.

src/circuit/noir/mod.rs (4)

1-24: LGTM! Clean module organization.

Module structure, imports, and the MODULUS_BITS constant are well-organized.


35-105: LGTM! Clean circuit input construction.

The conversion of parsed email data into circuit parameters and the initialization of NoirCircuitInputs are well-structured.


223-277: LGTM! Clean integration of external inputs.

The function properly constructs circuit parameters and processes external inputs with appropriate padding and serialization.


306-443: LGTM! Robust regex processing and test coverage.

The regex input processing correctly handles capture groups with proper error checks, and the test module provides good coverage of the circuit input generation flow.

Comment thread src/circuit/mod.rs
Comment on lines +172 to +183:

```rust
let result = generate_partial_sha(
    body_for_sha,
    body_original_len,
    adjusted_selector,
    params.max_body_length,
);

// Use match to handle the result and convert any error into an anyhow::Error
let (precomputed_sha, body_remaining_padded, body_remaining_length) = match result {
    Ok((sha, remaining, len)) => (sha, remaining, len),
    Err(e) => panic!("Failed to generate partial SHA: {:?}", e),
};
```

⚠️ Potential issue | 🟠 Major

Replace panic with error propagation.

The function panics on generate_partial_sha failure, but it already returns Result<CircuitInput>. Convert the error and propagate it using ? instead of panicking.

Apply this diff:

-        // Use match to handle the result and convert any error into an anyhow::Error
-        let (precomputed_sha, body_remaining_padded, body_remaining_length) = match result {
-            Ok((sha, remaining, len)) => (sha, remaining, len),
-            Err(e) => panic!("Failed to generate partial SHA: {:?}", e),
-        };
+        // Convert the error into an anyhow::Error and propagate it
+        let (precomputed_sha, body_remaining_padded, body_remaining_length) = result
+            .map_err(|e| anyhow::anyhow!("Failed to generate partial SHA: {:?}", e))?;
🤖 Prompt for AI Agents
In src/circuit/mod.rs around lines 172 to 183, the code currently panics on a
failed generate_partial_sha call; replace the match/panic with direct error
propagation. Call generate_partial_sha(...) and use the ? operator to propagate
the error (if needed, map the error into anyhow::Error with .map_err(|e|
anyhow::anyhow!(...)) before applying ?), then destructure the returned tuple
into (precomputed_sha, body_remaining_padded, body_remaining_length).

Support both legacy and new circuit input formats by:
- Adding optional fields (max_match_length, max_haystack_length, regex_graph_json) with aliases for old field names
- Implementing legacy path that generates {name}RegexIdx format for old circuits
- Using NFAGraph-based generation only when regex_graph_json is present
- Adding helper methods to detect circuit version and calculate defaults
- Fixing typo: removeSoftLinesBreaks -> removeSoftLineBreaks in tests

This ensures circuits compiled before the new compiler integration continue to work while supporting the new format.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/circuit/circom.rs (2)

630-637: Use u64 (not i64) for capture_group_start_indices JSON numbers

Here capture_group_start_indices elements are cast to i64 while related fields (capture_group_ids, capture_group_starts) use u64, and indices are inherently non‑negative. This inconsistency is unnecessary and can be surprising when comparing Circom vs Noir outputs.

Align this with the other fields by casting to u64:

-                                    .map(|s| serde_json::Value::Number((*s as i64).into()))
+                                    .map(|s| serde_json::Value::Number((*s as u64).into()))

283-324: Silent fallback to index 0 can hide parsing failures for code/timestamp/command

code_idx, timestamp_idx, and command_idx all map Err(_) to 0, and the body-search fallback also forces 0 when padded_cleaned_body is None. This silently suppresses errors and makes it impossible to distinguish “not found / failed to parse” from “found at position 0”, which can mis-wire circuit behavior.

Consider instead:

  • Propagating the errors (?) if these indices are required, or
  • Using Option<usize> in EmailCircuitInput and encoding the optionality explicitly at the circuit boundary, or
  • If 0 is truly a safe sentinel, document that invariant clearly and ensure the circuits never treat index 0 as a valid match location.
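The Option-based alternative from the second bullet can be sketched like this (the search call is a hypothetical stand-in for the real regex extraction):

```rust
// Encode "not found" explicitly instead of reusing index 0 as a sentinel,
// so position 0 and absence are distinguishable at the circuit boundary.
fn find_timestamp_idx(haystack: &str) -> Option<usize> {
    // Hypothetical stand-in for the real regex-based extraction.
    haystack.find("t=")
}

fn main() {
    assert_eq!(find_timestamp_idx("t=1699999999"), Some(0)); // match at 0
    assert_eq!(find_timestamp_idx("no timestamp here"), None); // truly absent
}
```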
🧹 Nitpick comments (4)
src/lib.rs (1)

9-10: Consider whether regex needs to be a public module

You’re already re-exporting the curated regex surface at the crate root; if external consumers don’t need to reach into regex::* internals, you could make this mod regex; instead and keep the internal layout private. If consumers do rely on submodules, the current pub is fine.

src/circuit/circom.rs (3)

41-49: EmailCircuitParams.remove_soft_line_breaks is currently unused

The remove_soft_line_breaks flag is exposed on EmailCircuitParams but never read anywhere in this module, so toggling it has no effect. Either wire it into the email input generation (e.g., how the body is cleaned/padded) or drop the field to avoid a misleading public API.


569-581: Dead DecomposedRegexConfig construction in NFAGraph path

In the NFAGraph branch you build and populate decomposed_regex_config, but never actually use it; only gen_circuit_inputs with NFAGraph is called afterward. This is dead code that adds confusion and may trigger unused-variable warnings.

You can safely remove decomposed_regex_config (and the VecDeque dependency if unused elsewhere), or, if it’s intended for a future shared Circom/Noir path, add a brief comment to mark it as intentionally unused for now.


703-706: compute_signal_length is correct but lacks tests and can be simplified

The formula for compute_signal_length matches the intended “ceil(max_length / 31)”, but there’s a TODO and no tests here. Edge cases like max_length = 0 or exact multiples of 31 are easy to regress.

Consider both (a) adding a small unit test table (e.g., 0, 1, 30, 31, 32) and (b) simplifying the implementation to ((max_length + 30) / 31) for readability. I can sketch the tests if helpful.

Also applies to: 708-708
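The suggested simplification and the edge-case table can be written down directly (a sketch of the review's proposal, not the shipped code):

```rust
// ceil(max_length / 31) via integer arithmetic, as the review suggests.
fn compute_signal_length(max_length: usize) -> usize {
    (max_length + 30) / 31
}

fn main() {
    // The edge cases called out above: 0, 1, 30, 31, 32.
    for (input, expected) in [(0, 0), (1, 1), (30, 1), (31, 1), (32, 2)] {
        assert_eq!(compute_signal_length(input), expected);
    }
}
```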

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 02409b7 and c902807.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml (3 hunks)
  • package.json (1 hunks)
  • src/circuit/circom.rs (1 hunks)
  • src/lib.rs (2 hunks)
  • ts_tests/circuit_input.test.ts (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • package.json
🧰 Additional context used
🧬 Code graph analysis (2)
src/lib.rs (3)
src/regex/patterns.rs (8)
  • extract_body_hash_idxes (106-108)
  • extract_email_addr_idxes (81-83)
  • extract_email_domain_idxes (86-88)
  • extract_from_addr_idxes (91-93)
  • extract_message_id_idxes (116-118)
  • extract_subject_all_idxes (101-103)
  • extract_timestamp_idxes (111-113)
  • extract_to_addr_idxes (96-98)
src/regex/extract.rs (2)
  • extract_substr (374-386)
  • extract_substr_idxes (188-201)
src/regex/padding.rs (2)
  • pad_bytes (32-36)
  • pad_string (18-22)
src/circuit/circom.rs (6)
src/converters.rs (4)
  • field_to_hex (76-79)
  • hex_to_u256 (302-308)
  • string_to_circom_bigint_bytes (463-483)
  • vec_u8_to_bigint (265-269)
src/parse_email.rs (3)
  • find_index_in_body (360-372)
  • remove_quoted_printable_soft_breaks (323-345)
  • new_from_raw_email (49-75)
src/circuit/noir/utils.rs (1)
  • trim_sha256_padding (319-350)
src/circuit/mod.rs (2)
  • generate_circuit_inputs (134-192)
  • new (98-113)
src/cryptos.rs (5)
  • from (273-275)
  • new (59-61)
  • new (260-262)
  • new (335-343)
  • from_email_addr (111-128)
src/regex/extract.rs (1)
  • extract_substr_idxes (188-201)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build-test
🔇 Additional comments (5)
ts_tests/circuit_input.test.ts (1)

34-34: LGTM! Typo fix applied consistently.

The property name correction from removeSoftLinesBreaks to removeSoftLineBreaks (removing the extra 's') is applied consistently across all three test cases. This aligns with the updated WASM API mentioned in the PR objectives.

Also applies to: 79-79, 153-153

src/lib.rs (2)

11-11: WASM cfg gate for wasm module looks good

The #[cfg(target_arch = "wasm32")] around the wasm module keeps the native build surface clean while exposing the WASM helpers only where relevant; no issues here.


23-30: Re-export surface verified as consistent with usage.

Repo-wide searches show that every re-exported item is actively used:

  • extract_body_hash_idxes, extract_email_addr_idxes, extract_email_domain_idxes, extract_from_addr_idxes, extract_message_id_idxes, extract_subject_all_idxes, extract_substr, extract_substr_idxes, extract_timestamp_idxes, extract_to_addr_idxes, pad_bytes, and pad_string are all referenced in src/parse_email.rs, src/regex/patterns.rs, src/regex/wasm.rs, and src/circuit/circom.rs.
  • Only extract_substr (singular) exists; there is no evidence a plural extract_substrs helper ever existed, so no removed helper is missing from the re-export set.
  • The wrapper functions live in src/regex/patterns.rs and the core implementations in src/regex/extract.rs.

Remaining follow-up: confirm src/regex/mod.rs exports all of these items, plus the associated types (DecomposedRegexConfig, ExtractionError, etc.), for example:

#!/bin/bash
# Check the regex module's mod.rs to see what it exports
cat src/regex/mod.rs
# Check that extract.rs and patterns.rs properly export their functions
grep -n "^pub fn\|^pub struct\|^pub enum" src/regex/extract.rs src/regex/patterns.rs src/regex/types.rs
# Verify all items re-exported from lib.rs actually exist in the regex module
rg -n "pub fn extract_body_hash_idxes|pub fn extract_email_addr_idxes|pub fn extract_email_domain_idxes|pub fn extract_from_addr_idxes|pub fn extract_message_id_idxes|pub fn extract_subject_all_idxes|pub fn extract_substr|pub fn extract_substr_idxes|pub fn extract_timestamp_idxes|pub fn extract_to_addr_idxes|pub fn pad_bytes|pub fn pad_string" src/regex/
Cargo.toml (2)

53-53: Verify the sp1-verifier 5.0.0 upgrade against the release notes.

Based on my verification, v5.0.0 was released as a fix for a Plonky3 library issue affecting SP1 zkVM security, specifically a FRI size check vulnerability. However, I found that:

  1. No direct code usage detected: The codebase declares sp1-verifier as a dependency but contains no direct Rust code imports or usages of it.

  2. Security fix, not breaking changes: The release notes indicate this was primarily a security fix rather than breaking API changes.

  3. Incomplete verification: Without visible code usage of sp1-verifier or documented API breaking changes, I cannot definitively confirm whether the upgrade required code modifications.

Verify that the sp1-verifier 5.0.0 upgrade compiles successfully and that all tests pass. Confirm the dependency works as intended with the current codebase.


16-16: zk-regex-compiler revision is valid and reachable.

The commit hash a2ce67ff8ae667bc7846cf8920d7d866f2f02d4c has been verified as valid and accessible in the zkemail/zk-regex repository. The sp1-verifier upgrade from 4.0.0 to 5.0.0 shows no breaking changes in the public API based on the v5.0.0 release notes.

@coderabbitai coderabbitai Bot left a comment
Actionable comments posted: 1

♻️ Duplicate comments (2)
src/circuit/circom.rs (2)

237-330: Silent 0 defaults for indices still risk hiding extraction failures

code_idx, timestamp_idx, and command_idx default to 0 on any extraction error (Lines 290–307) and are later overwritten with find_index_in_body results when precomputed_sha is present (Lines 315–329). This keeps the previous behavior but still makes it hard to distinguish “not found / error” from a genuine index at 0.

Consider one of:

  • Propagating the errors instead of mapping Err(_) => 0, or
  • Changing these fields to Option<usize> in EmailCircuitInput and using None on extraction failure, or
  • At minimum, documenting clearly that 0 is a sentinel for “missing” and never a valid match position in your circuits.

Example (propagate errors):

-    let mut code_idx = match parsed_email.get_invitation_code_idxes(
+    let mut code_idx = parsed_email
+        .get_invitation_code_idxes(
             params
                 .as_ref()
                 .map_or(false, |p| p.ignore_body_hash_check.unwrap_or(false)),
-    ) {
-        Ok(indexes) => indexes.0,
-        Err(_) => 0,
-    };
-    let timestamp_idx = match parsed_email.get_timestamp_idxes() {
-        Ok(indexes) => indexes.0,
-        Err(_) => 0,
-    };
-    let mut command_idx =
-        match parsed_email.get_command_idxes(circuit_input_params.ignore_body_hash_check) {
-            Ok(indexes) => indexes.0,
-            Err(_) => 0,
-        };
+        )?
+        .0;
+    let timestamp_idx = parsed_email.get_timestamp_idxes()?.0;
+    let mut command_idx =
+        parsed_email.get_command_idxes(circuit_input_params.ignore_body_hash_check)?.0;

This issue was already raised in a previous review and still applies.
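
The Option<usize> alternative mentioned above can be sketched in isolation: None marks "not found" explicitly instead of the 0 sentinel. The struct and field names here are stand-ins for the real EmailCircuitInput, not the project's actual types:

```rust
// Sketch of the `Option<usize>` alternative: an extraction error becomes
// `None` rather than a fake index of 0.
struct EmailCircuitInput {
    timestamp_idx: Option<usize>,
}

fn build(timestamp: Result<(usize, usize), String>) -> EmailCircuitInput {
    EmailCircuitInput {
        // `ok()` discards the error detail; `map` keeps only the start index.
        timestamp_idx: timestamp.ok().map(|(start, _end)| start),
    }
}

fn main() {
    assert_eq!(build(Ok((42, 50))).timestamp_idx, Some(42));
    assert_eq!(build(Err("no match".into())).timestamp_idx, None);
    println!("ok");
}
```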


618-656: Use u64 consistently for capture group indices instead of i64

All other Circom capture-group-related arrays here (capture_group_ids, capture_group_starts) are serialized as u64, and the Noir side uses u64 for capture_group_start_indices as well. This one cast to i64 is inconsistent and can be confusing:

.map(|s| serde_json::Value::Number((*s as i64).into()))

Switching to u64 keeps all index-like fields non-negative and aligned:

-                                    .map(|s| serde_json::Value::Number((*s as i64).into()))
+                                    .map(|s| serde_json::Value::Number((*s as u64).into()))

This was identified in an earlier review and still applies.

🧹 Nitpick comments (4)
package.json (1)

13-13: Clarify the pre-release version format.

The version "0.4.66-15" uses a non-standard pre-release suffix. Semantic versioning typically expects formats like 0.4.66-alpha.15, 0.4.66-beta.15, or 0.4.66-rc.15 to convey meaning.

Please clarify what "-15" represents (e.g., build number, draft iteration) and consider adopting a standard pre-release identifier for clarity.

src/regex/types.rs (1)

30-71: Clarify behavior when max_bytes_per_part length doesn’t match parts

LegacyDecomposedRegexConfig::to_new_config silently defaults to 256 when max_bytes_per_part is None or too short, which can surprise callers if they miscount parts.

Consider either asserting or documenting this behavior explicitly, e.g.:

 impl LegacyDecomposedRegexConfig {
     /// Convert old config to new config with optional max_bytes per part
     pub fn to_new_config(&self, max_bytes_per_part: Option<Vec<usize>>) -> DecomposedRegexConfig {
+        if let Some(ref v) = max_bytes_per_part {
+            debug_assert!(
+                v.len() >= self.parts.len(),
+                "max_bytes_per_part shorter than parts; missing entries default to 256",
+            );
+        }
         let parts = self
             .parts
             .iter()
             .enumerate()
             .map(|(i, part)| {
                 let max_bytes = max_bytes_per_part.as_ref().and_then(|v| v.get(i).copied());
                 part.to_new_part(max_bytes)
             })
             .collect();

This keeps runtime behavior but makes the implicit default less surprising.
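
The silent-default behavior under discussion can be illustrated standalone. This is a hypothetical reduction of the lookup inside to_new_config, not the actual method:

```rust
// Missing entries in `max_bytes_per_part` silently fall back to 256,
// which is the surprising behavior the comment above flags.
fn max_bytes_for_part(max_bytes_per_part: Option<&Vec<usize>>, i: usize) -> usize {
    max_bytes_per_part
        .and_then(|v| v.get(i).copied())
        .unwrap_or(256)
}

fn main() {
    let sizes = vec![64, 128];
    assert_eq!(max_bytes_for_part(Some(&sizes), 0), 64);
    // Index 2 is out of range, so the implicit 256 default kicks in.
    assert_eq!(max_bytes_for_part(Some(&sizes), 2), 256);
    assert_eq!(max_bytes_for_part(None, 0), 256);
    println!("ok");
}
```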

src/circuit/circom.rs (2)

675-691: Minor readability tweak for external input value handling

The call to string_to_circom_bigint_bytes adds an extra reference:

string_to_circom_bigint_bytes(&external_input.value.as_deref().unwrap_or(""))?;

Since as_deref().unwrap_or("") already yields &str, you can simplify to:

-        let mut value =
-            string_to_circom_bigint_bytes(&external_input.value.as_deref().unwrap_or(""))?;
+        let mut value =
+            string_to_circom_bigint_bytes(external_input.value.as_deref().unwrap_or(""))?;

No functional change, just slightly clearer.


719-723: compute_signal_length logic is correct; consider adding tests

The calculation max_length / 31 + (if max_length % 31 != 0 { 1 } else { 0 }) correctly computes the number of 31-byte limbs, including the max_length == 0 case. Given the TODO, adding small table-driven tests (e.g., 0, 1, 31, 32, 62, 63) would lock this in.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c902807 and b9c0830.

📒 Files selected for processing (5)
  • package.json (1 hunks)
  • src/circuit/circom.rs (1 hunks)
  • src/circuit/mod.rs (1 hunks)
  • src/cryptos.rs (9 hunks)
  • src/regex/types.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/circuit/mod.rs
🧰 Additional context used
🧬 Code graph analysis (2)
src/circuit/circom.rs (5)
src/circuit/mod.rs (3)
  • generate_circuit_inputs_old (194-256)
  • generate_circuit_inputs (134-192)
  • new (98-113)
src/converters.rs (4)
  • field_to_hex (76-79)
  • hex_to_u256 (302-308)
  • string_to_circom_bigint_bytes (463-483)
  • vec_u8_to_bigint (265-269)
src/parse_email.rs (3)
  • find_index_in_body (360-372)
  • remove_quoted_printable_soft_breaks (323-345)
  • new_from_raw_email (49-75)
src/circuit/noir/utils.rs (1)
  • trim_sha256_padding (319-350)
src/regex/extract.rs (1)
  • extract_substr_idxes (188-201)
src/cryptos.rs (2)
src/regex/padding.rs (1)
  • pad_string (18-22)
src/converters.rs (2)
  • hex_to_field (20-67)
  • field_to_hex (76-79)
🔇 Additional comments (14)
package.json (1)

4-12: Verify npm scripts cover new browser_test app.

The PR introduces a browser_test Vite app. The current build scripts (wasm:build, wasm:postbuild, build) appear focused on Rust WASM compilation. Confirm that:

  1. The browser_test app has its own separate build/dev scripts (e.g., in a nested package.json), or
  2. New npm scripts are needed to support browser_test development/build.
src/cryptos.rs (8)

3-3: LGTM!

The new import for pad_string from the regex module is correctly used in the refactored from_email_addr method (line 123) and aligns with the PR's goal of consolidating padding utilities.


111-128: LGTM! Excellent security improvement.

The addition of length validation before padding prevents potential collision attacks where different email addresses exceeding 256 bytes could produce identical padded representations. The error handling is clear and the documentation thoroughly explains the security rationale.
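
The validate-then-pad pattern being praised here can be sketched as follows. MAX_EMAIL_ADDR_BYTES and the function shape are stand-ins for the real from_email_addr API, shown only to illustrate why the length check prevents collisions:

```rust
// Hypothetical sketch of validate-then-pad for email addresses.
const MAX_EMAIL_ADDR_BYTES: usize = 256;

fn pad_email_addr(addr: &str) -> Result<Vec<u8>, String> {
    let bytes = addr.as_bytes();
    // Without this check, two distinct addresses longer than the limit
    // could collapse to the same padded representation.
    if bytes.len() > MAX_EMAIL_ADDR_BYTES {
        return Err(format!(
            "email address too long: {} > {} bytes",
            bytes.len(),
            MAX_EMAIL_ADDR_BYTES
        ));
    }
    let mut padded = bytes.to_vec();
    padded.resize(MAX_EMAIL_ADDR_BYTES, 0);
    Ok(padded)
}

fn main() {
    let p = pad_email_addr("user@example.com").unwrap();
    assert_eq!(p.len(), 256);
    assert!(pad_email_addr(&"a".repeat(257)).is_err());
    println!("ok");
}
```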


559-630: LGTM! Good backward compatibility approach.

The addition of generate_partial_sha_old maintains support for the older Circom compiler that requires external padding. The 64-byte alignment check (lines 615-620) is correctly preserved for this compatibility path, while the new implementation delegates padding to the circuit.


674-770: LGTM! Excellent test coverage.

The new tests comprehensively validate the email address length validation and padding behavior:

  • Valid inputs and boundary cases (lines 674-702)
  • Error handling for overflow (lines 704-718)
  • Security properties preventing collisions (lines 720-730)
  • Integration with calculate_account_salt (lines 746-770)

The tests are well-structured and provide strong confidence in the new validation logic.


872-890: LGTM! Proper error propagation implemented.

The function signature change to return Result<String> correctly propagates errors from:

  • PaddedEmailAddr::from_email_addr (line 875)
  • hex_to_field conversions (lines 879, 881)
  • AccountSalt::new (line 886)

This provides better error visibility to callers and aligns with Rust best practices.


903-906: LGTM! Good robustness improvement.

Adding a fallback to try the lowercase "from" header improves compatibility with email systems that may use different header casing. The or_else pattern ensures no behavioral change when the standard "From" header is present.


38-38: New DKIM API endpoint verified and functional.

The endpoint change from /api/key to /api/key/domain has been confirmed. The new endpoint returns HTTP 200 with a valid JSON array format where each element contains the required "value" field with DKIM records. The existing parsing logic at lines 1038-1041 (which uses .as_array()?.iter().filter_map(|record| record.get("value"))) correctly handles this response format. Query parameters remain unchanged. No breaking changes detected.


494-557: Verify circuit handling of non-64-byte-aligned SHA padding with test evidence.

The Consensys Diligence audit of zkemail.nr (December 2024) identified that there are no tests for partial_sha256_var_end when the message size is not a multiple of block size (64 bytes). This directly validates your concern about removing the 64-byte alignment check in generate_partial_sha.

The code shifts SHA padding responsibility from Rust (generate_partial_sha_old checked alignment explicitly) to the circuit's partial_sha256_var_end, but this function is designed to hash additional parts of the message or complete the entire hash. Critical gap: no evidence in the current codebase that non-aligned inputs are tested or that recent audit findings have been addressed.

Before merging, verify:

  1. Tests exist covering remaining bodies with lengths not divisible by 64
  2. The circuit implementation has been updated to handle the documented audit gaps
  3. Hash output matches standard implementations for all edge cases
src/regex/types.rs (1)

6-28: Error type and result alias are well-structured

The ExtractionError variants and the ExtractionResult<T> alias give clear, future-proof error reporting for regex extraction; no changes needed here.

src/circuit/circom.rs (4)

61-220: Serde wrappers and legacy RegexPart handling look correct

The SerializableRegexPart, RegexPartInput, and regex_parts_serde module cleanly support both new enum-based parts and legacy configs while preserving JSON compatibility. The custom DecomposedRegex Debug/Clone implementations are also sound.


371-390: Claim circuit input generation is straightforward and correct

generate_claim_input cleanly uses PaddedEmailAddr::from_email_addr and propagates validation via ?, and the serialized shape of ClaimCircuitInput is simple and consistent.


397-451: Legacy regex circuit input generation matches old format

generate_legacy_circuit_inputs correctly reconstructs {name}RegexIdx and {name}RegexIdxN fields from extract_substr_idxes, preserving the legacy shape for old Circom circuits. No changes needed here.


581-604: NFAGraph path construction and regex_graph_json handling look correct

The NFAGraph-based path correctly guards with has_nfa_graph(), so the .unwrap() on regex_graph_json is safe, and the use of get_max_haystack_length/max_match_length cleanly wires through to gen_circuit_inputs.

Comment thread src/circuit/circom.rs
Comment on lines +503 to +509
let email_circuit_inputs;
if decomposed_regexes[0].has_nfa_graph() {
email_circuit_inputs = generate_circuit_inputs(circuit_input_params.clone())?;
} else {
// TODO: remove all this temperory code for supporting old one
email_circuit_inputs = generate_circuit_inputs_old(circuit_input_params.clone())?;
}

⚠️ Potential issue | 🟠 Major

Potential panic when decomposed_regexes is empty

generate_circuit_inputs_with_decomposed_regexes_and_external_inputs indexes decomposed_regexes[0] to decide between generate_circuit_inputs and generate_circuit_inputs_old. If the caller passes an empty vector (e.g., only external inputs), this will panic.

Using .first() avoids this and lets you define a safe default (e.g., new compiler by default):

-    let email_circuit_inputs;
-    if decomposed_regexes[0].has_nfa_graph() {
-        email_circuit_inputs = generate_circuit_inputs(circuit_input_params.clone())?;
-    } else {
-        // TODO: remove all this temperory code for supporting old one
-        email_circuit_inputs = generate_circuit_inputs_old(circuit_input_params.clone())?;
-    }
+    let email_circuit_inputs = if decomposed_regexes
+        .first()
+        .map(|r| r.has_nfa_graph())
+        .unwrap_or(true)
+    {
+        generate_circuit_inputs(circuit_input_params.clone())?
+    } else {
+        // TODO: remove all this temporary code for supporting old one
+        generate_circuit_inputs_old(circuit_input_params.clone())?
+    };

This keeps behavior for non-empty inputs while making the empty case safe.

🤖 Prompt for AI Agents
In src/circuit/circom.rs around lines 503 to 509, the code unsafely indexes
decomposed_regexes[0] and can panic when the vector is empty; change the logic
to use decomposed_regexes.first() (or pattern-match) and, if Some(first) call
first.has_nfa_graph() to choose between generate_circuit_inputs and
generate_circuit_inputs_old, otherwise treat the empty case as the safe default
(use generate_circuit_inputs) so the function no longer panics on empty input.
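
The `.first()` guard pattern can be demonstrated in isolation. The `Regex` struct below is a stand-in for the real DecomposedRegex type, kept minimal to show only the empty-vector behavior:

```rust
// Minimal stand-in for the DecomposedRegex type under review.
struct Regex {
    nfa_graph: Option<String>,
}

impl Regex {
    fn has_nfa_graph(&self) -> bool {
        self.nfa_graph.is_some()
    }
}

fn use_new_compiler(decomposed_regexes: &[Regex]) -> bool {
    // An empty slice falls through to the new compiler by default
    // instead of panicking on `decomposed_regexes[0]`.
    decomposed_regexes
        .first()
        .map(|r| r.has_nfa_graph())
        .unwrap_or(true)
}

fn main() {
    assert!(use_new_compiler(&[])); // safe default, no panic
    assert!(use_new_compiler(&[Regex { nfa_graph: Some("g".into()) }]));
    assert!(!use_new_compiler(&[Regex { nfa_graph: None }]));
    println!("ok");
}
```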

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
src/cryptos.rs (2)

559-559: Minor: Grammatical issue in comment.

The comment reads "Just supported for older circom compiler" but should be "Only supported for older circom compiler" or "Supported only for older circom compiler" for better clarity.

Apply this diff:

-// TODO : Just supported for older circom compiler, since it's circuit doesn't had the padding inside, we padded them externally
+// TODO: Only supported for older circom compiler, since its circuit doesn't have padding inside, so we pad externally

920-929: Consider adding a clarifying comment for the .last() usage.

The use of .captures_iter(from_domain).last() cleverly handles cases where the From header might contain multiple @ symbols (e.g., in display names like "Contact @ Company <user@domain.com>"). A brief comment explaining this would improve code clarity.

     // Extract domain from From header
     let from_domain = from_headers[0].as_str();
     let from_re = Regex::new(r"@([^>\s]+)").unwrap();
+    // Use .last() to get the domain from the actual email address, not from any @ in the display name
     let from_domain = from_re
         .captures_iter(from_domain)
         .last()
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9c0830 and aeeb433.

📒 Files selected for processing (2)
  • package.json (1 hunks)
  • src/cryptos.rs (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/cryptos.rs (2)
src/regex/padding.rs (1)
  • pad_string (18-22)
src/converters.rs (2)
  • hex_to_field (20-67)
  • field_to_hex (76-79)
🔇 Additional comments (8)
package.json (1)

13-13: Verify the version format is appropriate for npm publishing.

The version 0.4.66-16 uses a non-standard format for npm (typically semantic versioning or standard pre-release identifiers like 0.4.66-alpha.16 or 0.4.66-rc.16 are preferred). Given that this package has public npm publish scripts, confirm that this format is intentional and won't cause issues when published to the registry.

src/cryptos.rs (7)

111-128: LGTM! Excellent security improvement.

The email length validation prevents collision attacks where different email addresses exceeding 256 bytes could produce identical padded representations. The error handling is clear and comprehensive.


494-557: LGTM! Well-structured partial SHA generation.

The function properly handles regex compilation errors, UTF-8 conversion errors, and validates the remaining body length. The padding logic is correct, and the comment at line 549 helpfully explains why 64-byte alignment checking is unnecessary (circuit handles it).


873-890: LGTM! Proper error propagation throughout.

The function correctly propagates errors from PaddedEmailAddr::from_email_addr, hex_to_field, and AccountSalt::new using the ? operator. The Result return type is appropriate for these validation steps.


903-906: LGTM! Good defensive handling of header case variations.

The fallback to lowercase "from" header improves robustness across different mail servers that may use different header capitalization.


674-744: LGTM! Comprehensive test coverage for email validation.

The tests thoroughly cover valid cases, boundary conditions (exactly 256 bytes), error cases (exceeding max length), and security concerns (collision prevention). Well-structured test suite.


746-770: LGTM! Good test coverage for account salt calculation.

The tests verify both successful salt calculation with valid inputs and proper error propagation when the email exceeds maximum length. The use of a properly formatted 32-byte hex account code is appropriate.


38-38: Add integration test for the DKIM API endpoint in CI.

The fetch_public_keys function depends on https://archive.zk.email/api/key/domain for DKIM verification, but the existing test skips in CI environments. Either add an integration test with a mock endpoint/fixture, or document the runtime requirement and expected response schema.

Add rsa_key_bits parameter to allow specifying RSA key size (1024 or 2048).
When not specified, key size is auto-detected from the public key length.
Change RSA key size auto-detection to explicitly match supported sizes
(1024-bit and 2048-bit) and return an error for unsupported sizes.

Previously, any key larger than 128 bytes would silently default to
2048-bit parameters, which would cause incorrect behavior for 4096-bit
keys. Now unsupported key sizes return a clear error message.
…024-for-noir-in-our-registry

feat: add support for 1024-bit RSA keys in Noir circuits
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@src/circuit/noir/mod.rs`:
- Around line 37-54: When computing key_bits in src/circuit/noir/mod.rs,
explicitly validate params.rsa_key_bits when it is Some(bits) to only accept
supported sizes (1024 or 2048); if bits is not one of those values return the
same error used for auto-detect (mentioning the unsupported size in bits and
bytes). Update the match arm for params.rsa_key_bits to check the provided bits
and return an anyhow::anyhow error on invalid values so callers passing 4096
cannot proceed, while keeping the existing auto-detection using
parsed_email.public_key.len() unchanged.
- Around line 274-288: The loop processing external_inputs can produce
value_as_byte_strings longer than the computed signal_length, which will yield
invalid circuit input sizes; inside the loop after computing signal_length and
before padding, check if value_as_byte_strings.len() > signal_length and fail
fast by returning an Err with a clear message (include identifying info such as
external_input.name or external_input.max_length and actual length) from the
enclosing function (the block that calls string_to_circom_bigint_bytes and
compute_signal_length), or—if truncation is desired—explicitly truncate
value_as_byte_strings to signal_length using truncate(signal_length) so the JSON
array never exceeds the circuit's expected length; update callers/tests
accordingly.
♻️ Duplicate comments (5)
src/circuit/noir/mod.rs (3)

31-34: Default body-hash behavior is still insecure and inconsistent.
Line 33 defaults to skipping body-hash verification, and later is_some_and treats None as “don’t populate body”, even though circuit inputs were generated with defaults. Please unify defaults and reuse a resolved boolean.

🛠️ Suggested fix
-    let parsed_email =
-        ParsedEmail::new_from_raw_email(email, params.ignore_body_hash_check.unwrap_or(true))
-            .await?;
+    let ignore_body_hash_check = params.ignore_body_hash_check.unwrap_or(false);
+    let parsed_email =
+        ParsedEmail::new_from_raw_email(email, ignore_body_hash_check).await?;
-        if params.ignore_body_hash_check.is_some_and(|x| !x) {
+        if !ignore_body_hash_check {

179-185: Avoid unwrap() on precomputed_sha.
Line 183 can panic despite this function returning Result. Propagate an error instead.

🛠️ Suggested fix
-            let partial_hash = u8_to_u32(email_circuit_inputs.precomputed_sha.unwrap().as_slice())?;
+            let partial_hash = u8_to_u32(
+                email_circuit_inputs
+                    .precomputed_sha
+                    .as_ref()
+                    .ok_or_else(|| anyhow::anyhow!("Precomputed SHA is missing"))?
+                    .as_slice()
+            )?;

313-317: Handle missing body/decoded_body safely.
These unwrap() calls will panic if body generation was skipped (e.g., when ignore_body_hash_check is true).

🛠️ Suggested fix
             HaystackLocation::Body => {
                 let body = if params.remove_soft_line_breaks {
-                    noir_circuit_input.decoded_body.as_ref().unwrap()
+                    noir_circuit_input.decoded_body.as_ref()
+                        .ok_or_else(|| anyhow::anyhow!("Decoded body is missing when remove_soft_line_breaks is enabled"))?
                 } else {
-                    noir_circuit_input.body.as_ref().unwrap()
+                    noir_circuit_input.body.as_ref()
+                        .ok_or_else(|| anyhow::anyhow!("Body is missing when required for regex matching"))?
                 };
src/circuit/mod.rs (2)

171-184: Propagate partial-SHA errors instead of panicking.
This function returns Result, but Line 181-183 panics on failure.

🛠️ Suggested fix
-        let (precomputed_sha, body_remaining_padded, body_remaining_length) = match result {
-            Ok((sha, remaining, len)) => (sha, remaining, len),
-            Err(e) => panic!("Failed to generate partial SHA: {:?}", e),
-        };
+        let (precomputed_sha, body_remaining_padded, body_remaining_length) = result
+            .map_err(|e| anyhow::anyhow!("Failed to generate partial SHA: {:?}", e))?;

235-248: Same panic issue in legacy path.
generate_circuit_inputs_old should also propagate errors instead of panicking.

🛠️ Suggested fix
-        let (precomputed_sha, body_remaining_padded, body_remaining_length) = match result {
-            Ok((sha, remaining, len)) => (sha, remaining, len),
-            Err(e) => panic!("Failed to generate partial SHA: {:?}", e),
-        };
+        let (precomputed_sha, body_remaining_padded, body_remaining_length) = result
+            .map_err(|e| anyhow::anyhow!("Failed to generate partial SHA: {:?}", e))?;

Comment thread src/circuit/noir/mod.rs
Comment on lines +37 to +54
// Public key bytes: 128 bytes = 1024 bits, 256 bytes = 2048 bits
let key_bits = match params.rsa_key_bits {
Some(bits) => bits,
None => {
let key_bytes = parsed_email.public_key.len();
match key_bytes {
128 => 1024,
256 => 2048,
_ => {
return Err(anyhow::anyhow!(
"Unsupported RSA key size: {} bytes ({} bits). Only 1024-bit and 2048-bit keys are supported.",
key_bytes,
key_bytes * 8
));
}
}
}
};

⚠️ Potential issue | 🟠 Major

Validate explicitly provided rsa_key_bits.
Right now only auto-detect enforces supported sizes. If a caller passes 4096, it will slip through and likely produce incompatible limb sizes.

🛠️ Suggested fix
-    let key_bits = match params.rsa_key_bits {
-        Some(bits) => bits,
+    let key_bits = match params.rsa_key_bits {
+        Some(bits @ 1024 | bits @ 2048) => bits,
+        Some(bits) => {
+            return Err(anyhow::anyhow!(
+                "Unsupported RSA key size: {} bits. Only 1024-bit and 2048-bit keys are supported.",
+                bits
+            ));
+        }
         None => {
             let key_bytes = parsed_email.public_key.len();
             match key_bytes {
                 128 => 1024,
                 256 => 2048,
                 _ => {
🤖 Prompt for AI Agents
In `@src/circuit/noir/mod.rs` around lines 37 - 54, When computing key_bits in
src/circuit/noir/mod.rs, explicitly validate params.rsa_key_bits when it is
Some(bits) to only accept supported sizes (1024 or 2048); if bits is not one of
those values return the same error used for auto-detect (mentioning the
unsupported size in bits and bytes). Update the match arm for
params.rsa_key_bits to check the provided bits and return an anyhow::anyhow
error on invalid values so callers passing 4096 cannot proceed, while keeping
the existing auto-detection using parsed_email.public_key.len() unchanged.
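
Combining the explicit whitelist with the existing auto-detection yields a shape like the following sketch. The free function and String error type are simplifications of the real anyhow-based code:

```rust
// Sketch: explicit `rsa_key_bits` values are checked against the same
// whitelist as the byte-length auto-detection, so 4096 can no longer slip through.
fn resolve_key_bits(rsa_key_bits: Option<usize>, public_key_len: usize) -> Result<usize, String> {
    match rsa_key_bits {
        Some(bits @ (1024 | 2048)) => Ok(bits),
        Some(bits) => Err(format!(
            "Unsupported RSA key size: {} bits. Only 1024-bit and 2048-bit keys are supported.",
            bits
        )),
        None => match public_key_len {
            128 => Ok(1024),
            256 => Ok(2048),
            other => Err(format!(
                "Unsupported RSA key size: {} bytes ({} bits). Only 1024-bit and 2048-bit keys are supported.",
                other,
                other * 8
            )),
        },
    }
}

fn main() {
    assert_eq!(resolve_key_bits(Some(2048), 0), Ok(2048));
    assert!(resolve_key_bits(Some(4096), 0).is_err()); // now rejected
    assert_eq!(resolve_key_bits(None, 128), Ok(1024));
    assert!(resolve_key_bits(None, 512).is_err());
    println!("ok");
}
```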

Comment thread src/circuit/noir/mod.rs
Comment on lines +274 to +288
// Process external inputs and add them to the circuit inputs
for external_input in external_inputs {
// Use the existing utility function to convert the string to a Vec<String> of byte values
let mut value_as_byte_strings =
string_to_circom_bigint_bytes(&external_input.value.as_deref().unwrap_or(""))?;

let signal_length = compute_signal_length(external_input.max_length);

// Pad the Vec<String> with "0" strings if it's shorter than the target length
if value_as_byte_strings.len() < signal_length {
value_as_byte_strings.extend(
std::iter::repeat("0".to_string())
.take(signal_length - value_as_byte_strings.len()),
);
}

⚠️ Potential issue | 🟠 Major

Guard against oversized external inputs.
If value_as_byte_strings.len() > signal_length, the generated JSON array will exceed circuit expectations. Fail fast or truncate intentionally.

🛠️ Suggested fix
         let signal_length = compute_signal_length(external_input.max_length);

+        if value_as_byte_strings.len() > signal_length {
+            return Err(anyhow::anyhow!(
+                "External input '{}' exceeds max length ({} > {})",
+                external_input.name,
+                value_as_byte_strings.len(),
+                signal_length
+            ));
+        }
+
         // Pad the Vec<String> with "0" strings if it's shorter than the target length
         if value_as_byte_strings.len() < signal_length {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// Process external inputs and add them to the circuit inputs
for external_input in external_inputs {
    // Use the existing utility function to convert the string to a Vec<String> of byte values
    let mut value_as_byte_strings =
        string_to_circom_bigint_bytes(&external_input.value.as_deref().unwrap_or(""))?;

    let signal_length = compute_signal_length(external_input.max_length);

    // Pad the Vec<String> with "0" strings if it's shorter than the target length
    if value_as_byte_strings.len() < signal_length {
        value_as_byte_strings.extend(
            std::iter::repeat("0".to_string())
                .take(signal_length - value_as_byte_strings.len()),
        );
    }

// Process external inputs and add them to the circuit inputs
for external_input in external_inputs {
    // Use the existing utility function to convert the string to a Vec<String> of byte values
    let mut value_as_byte_strings =
        string_to_circom_bigint_bytes(&external_input.value.as_deref().unwrap_or(""))?;

    let signal_length = compute_signal_length(external_input.max_length);

    if value_as_byte_strings.len() > signal_length {
        return Err(anyhow::anyhow!(
            "External input '{}' exceeds max length ({} > {})",
            external_input.name,
            value_as_byte_strings.len(),
            signal_length
        ));
    }

    // Pad the Vec<String> with "0" strings if it's shorter than the target length
    if value_as_byte_strings.len() < signal_length {
        value_as_byte_strings.extend(
            std::iter::repeat("0".to_string())
                .take(signal_length - value_as_byte_strings.len()),
        );
    }
🤖 Prompt for AI Agents
In `@src/circuit/noir/mod.rs` around lines 274 - 288, The loop processing
external_inputs can produce value_as_byte_strings longer than the computed
signal_length, which will yield invalid circuit input sizes; inside the loop
after computing signal_length and before padding, check if
value_as_byte_strings.len() > signal_length and fail fast by returning an Err
with a clear message (include identifying info such as external_input.name or
external_input.max_length and actual length) from the enclosing function (the
block that calls string_to_circom_bigint_bytes and compute_signal_length), or—if
truncation is desired—explicitly truncate value_as_byte_strings to signal_length
using truncate(signal_length) so the JSON array never exceeds the circuit's
expected length; update callers/tests accordingly.
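The guard-then-pad pattern the review asks for can be sketched standalone. This is a minimal illustration, not the PR's code: `fit_to_signal_length` is a hypothetical helper name, a plain `String` error stands in for the `anyhow::Error` the real function returns, and `string_to_circom_bigint_bytes` / `compute_signal_length` are not reproduced.

```rust
// Hypothetical helper: reject oversized inputs, pad undersized ones.
fn fit_to_signal_length(
    mut values: Vec<String>,
    signal_length: usize,
    name: &str,
) -> Result<Vec<String>, String> {
    // Fail fast: an oversized input would emit a JSON array longer than
    // the circuit expects, so reject it with an identifying message.
    if values.len() > signal_length {
        return Err(format!(
            "External input '{}' exceeds max length ({} > {})",
            name,
            values.len(),
            signal_length
        ));
    }
    // Pad with "0" strings up to the target length (no-op if already exact).
    let missing = signal_length - values.len();
    values.extend(std::iter::repeat("0".to_string()).take(missing));
    Ok(values)
}

fn main() {
    // Short input is padded to the signal length.
    let padded = fit_to_signal_length(vec!["97".to_string()], 4, "email").unwrap();
    assert_eq!(padded, vec!["97", "0", "0", "0"]);
    // Oversized input is rejected instead of silently truncated.
    assert!(fit_to_signal_length(vec!["1".to_string(); 5], 4, "email").is_err());
}
```

If silent truncation were preferred over failing, the guard could instead call `values.truncate(signal_length)`, but an explicit error makes the size mismatch visible to callers.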
