guard against problematic code points #176

@anewton1998

RFC 9839 defines three classes of problematic code points: surrogates, legacy controls, and noncharacters. It is probably a good idea to filter them out or replace them on output.
Something like this:

// Optional, for advanced PRECIS-like checks. Unused below, so left commented out:
// use unicode_skeleton::UnicodeSkeleton;

fn is_problematic(c: char) -> bool {
    let cp = c as u32;

    // 1. Surrogates: U+D800 to U+DFFF
    // (Note: a Rust `char` is a Unicode scalar value and can never hold a
    // surrogate, so this check is purely defensive. Surrogates have to be
    // caught earlier, when decoding raw bytes or UTF-16.)
    if (0xD800..=0xDFFF).contains(&cp) {
        return true;
    }

    // 2. Noncharacters: U+FDD0..U+FDEF and those ending in FFFE/FFFF
    if (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE {
        return true;
    }

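    // 3. Legacy controls: C0 (U+0000..U+001F), DEL (U+007F), and C1 (U+0080..U+009F).
    // RFC 9839 exempts the "useful controls" tab (U+0009), LF (U+000A), and CR (U+000D).
    if ((0x00..=0x1F).contains(&cp) && !matches!(cp, 0x09 | 0x0A | 0x0D))
        || cp == 0x7F
        || (0x80..=0x9F).contains(&cp)
    {
        return true;
    }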

    false
}

fn sanitize_rfc9839(input: &str) -> String {
    input.chars()
        .map(|c| if is_problematic(c) {
            '\u{FFFD}' // Unicode Replacement Character
        } else {
            c
        })
        .collect()
}

fn main() {
    let raw = "User\u{0000}Name\u{FDD0}"; 
    let cleaned = sanitize_rfc9839(raw);
    
    println!("Original: {:?}", raw);
    println!("Cleaned:  {:?}", cleaned);
}
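
The snippet above only replaces. A rough sketch of the other two cases mentioned here, filtering instead of replacing, and input that arrives as UTF-16 where surrogates can actually occur, might look like the following. The names `is_problematic_cp`, `strip_problematic`, and `decode_utf16_sanitized` are made up for illustration, and the tab/LF/CR exemption is my reading of RFC 9839, so treat this as a starting point rather than a finished implementation.

```rust
// Sketch only: all function names here are hypothetical.

/// True for RFC 9839 "problematic" code points. Takes a raw u32 so that
/// surrogates, which a Rust `char` cannot hold, can be tested as well.
fn is_problematic_cp(cp: u32) -> bool {
    let surrogate = (0xD800..=0xDFFF).contains(&cp);
    // U+FDD0..U+FDEF plus the last two code points of every plane.
    let nonchar = (0xFDD0..=0xFDEF).contains(&cp) || (cp & 0xFFFE) == 0xFFFE;
    // Assumption: tab, LF, and CR are "useful controls" and stay allowed.
    let useful_control = matches!(cp, 0x09 | 0x0A | 0x0D);
    let control = cp <= 0x1F || (0x7F..=0x9F).contains(&cp);
    surrogate || nonchar || (control && !useful_control)
}

/// Filtering variant: drop problematic code points instead of replacing them.
fn strip_problematic(input: &str) -> String {
    input
        .chars()
        .filter(|&c| !is_problematic_cp(c as u32))
        .collect()
}

/// Decode UTF-16 code units. Unpaired surrogates (decode errors) and any
/// other problematic code points are both replaced with U+FFFD.
fn decode_utf16_sanitized(units: &[u16]) -> String {
    std::char::decode_utf16(units.iter().copied())
        .map(|r| match r {
            Ok(c) if !is_problematic_cp(c as u32) => c,
            _ => '\u{FFFD}',
        })
        .collect()
}
```

With these, `strip_problematic("User\u{0000}Name\u{FDD0}")` yields `"UserName"`, and `decode_utf16_sanitized(&[0x0061, 0xD800, 0x0062])` yields `"a\u{FFFD}b"`. Whether to drop or replace is a policy choice: replacing keeps the string length visible to a reader, dropping gives cleaner output.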

Labels: enhancement (New feature or request)
