BUG: str.find returns byte offset instead of character offset with str dtype by Mr-Neutr0n · Pull Request #64133 · pandas-dev/pandas

Mr-Neutr0n · 2026-02-13T13:37:23Z

pc.find_substring returns byte positions rather than character positions for multi-byte UTF-8 encoded strings. This causes Series.str.find() to return incorrect results when using Arrow-backed StringDtype.

For example, find('a') in '永a' returns 3 (the byte offset of 'a' in the UTF-8 encoding of '永a', because '永' occupies three bytes) instead of the expected 1 (character offset).
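The discrepancy can be reproduced without pandas or pyarrow. The snippet below is only an illustration using plain Python: searching the encoded bytes stands in for a byte-oriented kernel and is not the actual pyarrow code path.

```python
text = "\u6c38a"  # '永a'

# Python's str.find counts Unicode code points.
char_offset = text.find("a")

# Searching the UTF-8 encoding counts bytes; '永' takes three bytes,
# so 'a' starts at byte 3.
byte_offset = text.encode("utf-8").find(b"a")

print(char_offset, byte_offset)  # 1 3
```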

The fix replaces the pc.find_substring call with an elementwise application of Python's str.find, which correctly returns character offsets. This matches the approach already used by _str_rfind in ArrowExtensionArray. As noted by @rhshadrach in the issue, there is no pyarrow.compute function that returns character offsets for this operation.
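As a rough sketch of what an elementwise fallback looks like, the function below applies str.find to each value and propagates missing values. The name find_elementwise and its signature are hypothetical and operate on a plain list; this is not the pandas implementation, which works on Arrow-backed arrays.

```python
from typing import List, Optional, Sequence


def find_elementwise(
    values: Sequence[Optional[str]],
    sub: str,
    start: int = 0,
    end: Optional[int] = None,
) -> List[Optional[int]]:
    """Hypothetical helper: character-offset find, one element at a time."""
    # str.find returns a code-point index, or -1 when sub is absent.
    # None is passed through to represent a missing value.
    return [None if v is None else v.find(sub, start, end) for v in values]


result = find_elementwise(["\u6c38a", "abc", None, "xyz"], "a")
print(result)  # [1, 0, None, -1]
```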

Test added: test_find_multibyte_chars covers 1-byte (ASCII), 2-byte (Á), 3-byte (永), and 4-byte (🐍) UTF-8 characters across all string dtypes.
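The four character widths named above can be checked in plain Python. This is a sketch of the idea behind such a test, not the body of test_find_multibyte_chars itself; the dictionary and variable names are assumptions made for illustration.

```python
# One prefix character per UTF-8 width: 1, 2, 3 and 4 bytes.
cases = {
    "a": 1,           # ASCII
    "\u00c1": 2,      # Á
    "\u6c38": 3,      # 永
    "\U0001F40D": 4,  # 🐍
}

observed = []
for prefix, width in cases.items():
    text = prefix + "x"
    char_offset = text.find("x")                   # code-point index
    byte_offset = text.encode("utf-8").find(b"x")  # byte index
    observed.append((width, char_offset, byte_offset))

# The character offset is always 1; the byte offset equals the prefix width.
print(observed)
```

Only the ASCII case gives the same answer for both kinds of offset, which is why a test limited to ASCII input would not catch the bug.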

…ti-byte UTF-8 chars (pandas-dev#64123) pc.find_substring returns byte positions rather than character positions for multi-byte UTF-8 encoded strings. This causes Series.str.find() to return incorrect results when using Arrow-backed StringDtype, e.g. find('a') in '永a' returns 3 instead of 1. Fix by falling back to elementwise Python str.find(), which correctly returns character offsets. This matches the approach already used by _str_rfind in ArrowExtensionArray.

jorisvandenbossche

Looks good, thanks for the PR!

One thing I am wondering is how much faster the pyarrow method would be than the Python fallback for ASCII-only data, once the cost of checking whether all elements are ASCII is included. If the difference is big enough, it might still be worth doing a pc.all(pc.string_is_ascii(..)) check first.

Mr-Neutr0n · 2026-02-13T18:42:17Z

Good point: for ASCII-only strings the pyarrow path would definitely be faster since it avoids the Python object overhead. A hybrid approach that uses pc.string_is_ascii to fast-path all-ASCII data and falls back to the elementwise path for mixed content could be worth it. Happy to add that if you think the tradeoff makes sense here, though I'd guess most real-world Series have at least some non-ASCII rows so the fallback would trigger often anyway.
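The hybrid idea discussed here can be sketched in plain Python. In this illustration, str.isascii stands in for pc.string_is_ascii and a search over the UTF-8 bytes stands in for pc.find_substring; the function name find_hybrid is hypothetical, and nothing here measures whether the fast path is actually worth the extra check.

```python
from typing import List, Optional, Sequence


def find_hybrid(values: Sequence[Optional[str]], sub: str) -> List[Optional[int]]:
    """Hypothetical sketch: byte-based fast path only when all data is ASCII."""
    all_ascii = all(v is None or v.isascii() for v in values)
    if all_ascii:
        # For ASCII-only data every character is one byte, so byte offsets
        # and character offsets coincide and the byte search is safe.
        needle = sub.encode("utf-8")
        return [None if v is None else v.encode("utf-8").find(needle) for v in values]
    # Otherwise fall back to the character-aware elementwise search.
    return [None if v is None else v.find(sub) for v in values]


ascii_result = find_hybrid(["abc", "xa", None], "a")
mixed_result = find_hybrid(["\u6c38a", "abc"], "a")
print(ascii_result, mixed_result)  # [0, 1, None] [1, 0]
```

Note that the check is all-or-nothing: a single non-ASCII row sends the whole array down the slower path, which is the tradeoff raised in the reply above.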

Mr-Neutr0n mentioned this pull request Feb 13, 2026

BUG: Different result from str.find depending on dtype #64123


jorisvandenbossche reviewed Feb 13, 2026


jorisvandenbossche added the Bug, Strings (String extension data type and string data), and Arrow (pyarrow functionality) labels Feb 13, 2026

jorisvandenbossche added this to the 3.0.1 milestone Feb 13, 2026

Mr-Neutr0n wants to merge 1 commit into pandas-dev:main from Mr-Neutr0n:fix-str-find-byte-offset