Open
Labels: enhancement (New feature or request)
Description
RFC 9839 defines a set of problematic Unicode code points: surrogates, legacy control characters, and noncharacters. It is probably a good idea to filter them out or replace them on output.
Something like this:
fn is_problematic(c: char) -> bool {
    let cp = c as u32;

    // 1. Surrogates: U+D800 to U+DFFF.
    //    (Note: a Rust `char` is a Unicode scalar value and can never hold a
    //    surrogate, so this check is purely defensive. Surrogates only show up
    //    when decoding raw UTF-16 or unchecked byte sequences.)
    if (0xD800..=0xDFFF).contains(&cp) {
        return true;
    }

    // 2. Noncharacters: U+FDD0..U+FDEF and the last two code points of every
    //    plane (U+xxFFFE and U+xxFFFF).
    if (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE {
        return true;
    }

    // 3. Legacy controls: C0 (U+0000..U+001F) except tab, LF and CR, which
    //    RFC 9839 treats as useful; DEL (U+007F); and C1 (U+0080..U+009F).
    if (cp <= 0x1F && !matches!(c, '\t' | '\n' | '\r')) || (0x7F..=0x9F).contains(&cp) {
        return true;
    }

    false
}

fn sanitize_rfc9839(input: &str) -> String {
    input
        .chars()
        .map(|c| {
            if is_problematic(c) {
                '\u{FFFD}' // Unicode Replacement Character
            } else {
                c
            }
        })
        .collect()
}

fn main() {
    let raw = "User\u{0000}Name\u{FDD0}";
    let cleaned = sanitize_rfc9839(raw);
    println!("Original: {:?}", raw);
    println!("Cleaned: {:?}", cleaned);
}
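The description mentions filtering as an alternative to replacing, so here is a rough sketch of that variant. The names `is_excluded`, `strip_rfc9839` and `decode_and_strip` are hypothetical and not part of this crate. It assumes tab, LF and CR should be kept, as in RFC 9839's useful controls, and that raw bytes are decoded lossily with the standard library before filtering.

```rust
/// Hypothetical helper: true for code points that this sketch treats as
/// problematic (legacy controls and noncharacters). Surrogates are not
/// checked because a Rust `char` can never hold one.
fn is_excluded(c: char) -> bool {
    let cp = c as u32;
    // Legacy controls: C0 except tab (0x09), LF (0x0A) and CR (0x0D),
    // plus DEL (0x7F) and the C1 range (0x80..=0x9F).
    let legacy_control = matches!(
        cp,
        0x00..=0x08 | 0x0B | 0x0C | 0x0E..=0x1F | 0x7F..=0x9F
    );
    // Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane.
    let nonchar = (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE;
    legacy_control || nonchar
}

/// Drops problematic code points instead of replacing them with U+FFFD.
fn strip_rfc9839(input: &str) -> String {
    input.chars().filter(|&c| !is_excluded(c)).collect()
}

/// For untrusted bytes: invalid UTF-8 (including encoded surrogates) becomes
/// U+FFFD during lossy decoding, then the remaining code points are filtered.
fn decode_and_strip(bytes: &[u8]) -> String {
    strip_rfc9839(&String::from_utf8_lossy(bytes))
}
```

Whether to drop or replace is a design choice: dropping keeps the output shorter, while replacing with U+FFFD leaves a visible marker that something was removed.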