guard against problematic code points

RFC 9839 defines three classes of problematic code points: surrogates, legacy control characters, and noncharacters. It is probably a good idea to filter them or replace them on output.
Something like this:

``` rust
/// Returns true for the code points RFC 9839 classifies as problematic.
fn is_problematic(c: char) -> bool {
    let cp = c as u32;

    // 1. Surrogates: U+D800 to U+DFFF
    // (Note: a Rust `char` is always a Unicode scalar value, so this branch
    // can never fire for `&str` input. It only matters if the check is
    // reused on raw u32 or UTF-16 code units before conversion.)
    if (0xD800..=0xDFFF).contains(&cp) {
        return true;
    }

    // 2. Noncharacters: U+FDD0..U+FDEF and those ending in FFFE/FFFF
    if (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE {
        return true;
    }

    // 3. Legacy controls: C0 (00-1F) except the useful controls
    // tab, LF, and CR; DEL (7F); and C1 (80-9F)
    let useful = matches!(c, '\t' | '\n' | '\r');
    if !useful && ((0x00..=0x1F).contains(&cp) || (0x7F..=0x9F).contains(&cp)) {
        return true;
    }

    false
}

fn sanitize_rfc9839(input: &str) -> String {
    input.chars()
        .map(|c| if is_problematic(c) {
            '\u{FFFD}' // Unicode Replacement Character
        } else {
            c
        })
        .collect()
}

fn main() {
    let raw = "User\u{0000}Name\u{FDD0}";
    let cleaned = sanitize_rfc9839(raw);

    println!("Original: {:?}", raw);
    println!("Cleaned:  {:?}", cleaned);
}
```
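
The description mentions filtering as an alternative to replacing. A minimal sketch of that variant follows, assuming tab, LF, and CR are kept as the useful controls. `strip_problematic` and `first_problematic` are hypothetical helper names for illustration, not part of any existing API in this project, and the surrogate check is left out because a Rust `char` can never hold one.

``` rust
/// Returns true for legacy controls (except tab, LF, CR) and noncharacters.
/// Surrogates are not checked: a Rust `char` is always a scalar value.
fn is_problematic(c: char) -> bool {
    let cp = c as u32;
    // C0 controls, DEL, and C1 controls, minus the three useful controls.
    let legacy_control =
        matches!(cp, 0x00..=0x1F | 0x7F..=0x9F) && !matches!(c, '\t' | '\n' | '\r');
    // U+FDD0..U+FDEF plus the last two code points of every plane.
    let noncharacter = (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE;
    legacy_control || noncharacter
}

/// Hypothetical helper: drops problematic code points instead of replacing them.
fn strip_problematic(input: &str) -> String {
    input.chars().filter(|&c| !is_problematic(c)).collect()
}

/// Hypothetical helper: reports the byte offset and value of the first
/// problematic code point, for callers that prefer to reject the input.
fn first_problematic(input: &str) -> Option<(usize, char)> {
    input.char_indices().find(|&(_, c)| is_problematic(c))
}
```

Dropping code points silently can join two tokens that were previously separated, so replacement with U+FFFD or outright rejection may be the safer default for identifiers.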

guard against problematic code points #176