Unexpected Illegal surrogate character when parsing field names #1541

@AfzalivE

Reference: Discussion in #1494

Affects: 2.21.0/2.21.1 at least, possibly 3.0.x (but not 3.1.0+)

Seems like something is broken or missing in field name decoding with JSON escapes.

In the minimal repro unit tests below, acceptJsonEscapedSurrogatePairInFieldName fails but acceptJsonEscapedSurrogatePairInStringValue passes. I don't know nearly enough about Unicode, but it seems like either both should fail or both should pass (as evidenced by similar tests in UTF8SurrogateValidation363Test). It looks like the field name code path is doing something different.

    @Test
    void acceptJsonEscapedSurrogatePairInFieldName() throws Exception
    {
        // JSON: {"\ud83d\udc4d":"value"}
        byte[] doc = new byte[] {
            '{', '"',
            '\\', 'u', 'd', '8', '3', 'd',  // JSON escape: \ud83d (high surrogate)
            '\\', 'u', 'd', 'c', '4', 'd',  // JSON escape: \udc4d (low surrogate)
            '"', ':', '"', 'v', 'a', 'l', 'u', 'e', '"',
            '}'
        };

        try (JsonParser p = FACTORY.createParser(doc)) {
            assertToken(JsonToken.START_OBJECT, p.nextToken());
            assertToken(JsonToken.FIELD_NAME, p.nextToken());
            // The escaped surrogate pair should decode to U+1F44D (thumbs up emoji)
            assertEquals("\uD83D\uDC4D", p.currentName());
            assertToken(JsonToken.VALUE_STRING, p.nextToken());
            assertEquals("value", p.getText());
            assertToken(JsonToken.END_OBJECT, p.nextToken());
        }
    }

    /**
     * Test that JSON escape sequence \ud83d\udc4d in string value is accepted.
     *
     * JSON: {"key":"\ud83d\udc4d"}
     */
    @Test
    void acceptJsonEscapedSurrogatePairInStringValue() throws Exception
    {
        // JSON: {"key":"\ud83d\udc4d"}
        byte[] doc = new byte[] {
            '{', '"', 'k', 'e', 'y', '"', ':', '"',
            '\\', 'u', 'd', '8', '3', 'd',  // JSON escape: \ud83d (high surrogate)
            '\\', 'u', 'd', 'c', '4', 'd',  // JSON escape: \udc4d (low surrogate)
            '"',
            '}'
        };

        try (JsonParser p = FACTORY.createParser(doc)) {
            assertToken(JsonToken.START_OBJECT, p.nextToken());
            assertToken(JsonToken.FIELD_NAME, p.nextToken());
            assertEquals("key", p.currentName());
            assertToken(JsonToken.VALUE_STRING, p.nextToken());
            // The escaped surrogate pair should decode to U+1F44D (thumbs up emoji)
            assertEquals("\uD83D\uDC4D", p.getText());
            assertToken(JsonToken.END_OBJECT, p.nextToken());
        }
    }

Here's what Claude is saying about this fwiw:

When parsing field names, the code at lines 2025-2045 re-encodes the decoded escape sequence value (e.g., 0xD83D from \ud83d) as a 3-byte UTF-8 sequence (0xED 0xA0 0xBD) into the quads buffer, which addName() later rejects as an illegal surrogate. String value parsing avoids this entirely by storing the decoded value directly into a char[] buffer, where Java natively handles surrogate pairs.
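
To make the byte values in that explanation concrete, here is a minimal standalone sketch using only the JDK. It is not Jackson's code: the class `SurrogateEscapeSketch` and its methods are hypothetical names made up for illustration. It only shows what happens if each half of the pair is encoded on its own as a 3-byte UTF-8 sequence, versus combining the pair into one code point first.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; none of these names come from Jackson.
public class SurrogateEscapeSketch {

    /** Encodes one 16-bit value as a 3-byte UTF-8 sequence, with no surrogate check. */
    static byte[] encodeAs3ByteUtf8(int ch) {
        return new byte[] {
            (byte) (0xE0 | (ch >> 12)),
            (byte) (0x80 | ((ch >> 6) & 0x3F)),
            (byte) (0x80 | (ch & 0x3F))
        };
    }

    /** Combines a high and a low surrogate into one code point, then encodes it as UTF-8. */
    static byte[] encodePairAsUtf8(int high, int low) {
        int codePoint = Character.toCodePoint((char) high, (char) low);
        return new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int high = 0xD83D; // decoded from the first JSON escape
        int low = 0xDC4D;  // decoded from the second JSON escape

        // Each half encoded alone: byte sequences that fall in the surrogate range
        System.out.println(hex(encodeAs3ByteUtf8(high))); // ED A0 BD
        System.out.println(hex(encodeAs3ByteUtf8(low)));  // ED B1 8D

        // Pair combined first: the regular 4-byte encoding of U+1F44D
        System.out.println(hex(encodePairAsUtf8(high, low))); // F0 9F 91 8D
    }
}
```

If the explanation above is right, the first two outputs are the kind of bytes a surrogate check on the name bytes would flag, while the third is what a decoder would accept for the same character.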

Labels: 2.21, 3.1
