BUG: str.find returns byte offset instead of character offset with str dtype #64133

Open
Mr-Neutr0n wants to merge 1 commit into pandas-dev:main from Mr-Neutr0n:fix-str-find-byte-offset
@Mr-Neutr0n

Fixes #64123

pc.find_substring returns byte positions rather than character positions for multi-byte UTF-8 encoded strings. This causes Series.str.find() to return incorrect results when using Arrow-backed StringDtype.

For example, find('a') in '永a' returns 3 (the byte offset of 'a' in the UTF-8 encoding of '永a', since '永' occupies 3 bytes) instead of the expected 1 (the character offset).
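
To make the difference between the two offsets concrete, here is a minimal sketch using only plain Python strings and bytes. It does not call pyarrow; the byte-level search merely stands in for what a byte-oriented substring search reports.

```python
text = "\u6c38a"  # '永a'

# Character (code point) offset: the value str.find is expected to return.
char_offset = text.find("a")

# Byte offset in the UTF-8 encoding: '永' occupies 3 bytes,
# so 'a' starts at byte index 3.
byte_offset = text.encode("utf-8").find(b"a")

print(char_offset, byte_offset)  # 1 3
```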

The fix replaces the pc.find_substring call with an elementwise application of Python's str.find, which correctly returns character offsets. This matches the approach already used by _str_rfind in ArrowExtensionArray. As noted by @rhshadrach in the issue, there is no pyarrow.compute function that returns character offsets for this operation.
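
As a rough illustration of what "elementwise application of Python's str.find" means, the sketch below applies it to a plain list with missing values. The helper name `find_elementwise` and the use of `None` for missing values are assumptions made for this example; this is not the code in the PR.

```python
def find_elementwise(values, sub, start=0, end=None):
    """Apply str.find to each element, passing missing values through.

    Hypothetical helper for illustration only; offsets are character
    (code point) offsets because str.find operates on Python strings.
    """
    return [None if v is None else v.find(sub, start, end) for v in values]


result = find_elementwise(["\u6c38a", "abc", None, "xyz"], "a")
print(result)  # [1, 0, None, -1]
```

The trade-off is that every element goes through a Python-level call, which is the overhead discussed in the review comments.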

Test added: test_find_multibyte_chars covers 1-byte (ASCII), 2-byte (Á), 3-byte (永), and 4-byte (🐍) UTF-8 characters across all string dtypes.
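
The four character widths listed for the test can be checked in isolation with a small sketch. It is not the test from the PR: the helper `offsets_of_a` is hypothetical, and 'b' is an assumed stand-in for the 1-byte ASCII case.

```python
def offsets_of_a(prefix):
    """Return (character offset, UTF-8 byte offset) of 'a' in prefix + 'a'."""
    text = prefix + "a"
    return text.find("a"), text.encode("utf-8").find(b"a")


# One prefix character per UTF-8 width: 'b', 'Á', '永', '🐍'.
cases = {"b": 1, "\u00c1": 2, "\u6c38": 3, "\U0001f40d": 4}

for ch, width in cases.items():
    char_offset, byte_offset = offsets_of_a(ch)
    # The character offset is always 1; the byte offset grows with the width.
    assert char_offset == 1
    assert byte_offset == width
```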

…ti-byte UTF-8 chars (pandas-dev#64123)

pc.find_substring returns byte positions rather than character positions
for multi-byte UTF-8 encoded strings. This causes Series.str.find() to
return incorrect results when using Arrow-backed StringDtype, e.g.
find('a') in '永a' returns 3 instead of 1.

Fix by falling back to elementwise Python str.find(), which correctly
returns character offsets. This matches the approach already used by
_str_rfind in ArrowExtensionArray.
Member

@jorisvandenbossche jorisvandenbossche left a comment

Looks good, thanks for the PR!

One thing I am wondering is how much faster the pyarrow method is than the Python fallback for ASCII-only data, and how that gain compares with the cost of checking whether all elements are ASCII. If the difference is big enough, it might still be worth doing a pc.string_is_ascii(..).all() check first.

@jorisvandenbossche added the Bug, Strings (String extension data type and string data), and Arrow (pyarrow functionality) labels on Feb 13, 2026
@jorisvandenbossche added this to the 3.0.1 milestone on Feb 13, 2026
@Mr-Neutr0n
Author

Good point: for ASCII-only strings the pyarrow path would definitely be faster, since it avoids the Python object overhead. A hybrid approach that uses pc.string_is_ascii to fast-path ASCII-only data and falls back to the elementwise path for mixed content could be worth it. Happy to add that if you think the tradeoff makes sense here, though I'd guess most real-world Series have at least some non-ASCII rows, so the fallback would trigger often anyway.
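
For reference, the hybrid idea can be sketched in plain Python. This is an assumption-laden illustration, not pandas or pyarrow code: `str.isascii` stands in for the ASCII check, the byte-level search stands in for the pyarrow fast path, and `find_hybrid` is a hypothetical name. Missing values are not handled here.

```python
def find_hybrid(values, sub):
    """Hypothetical sketch: byte-based search when all data is ASCII,
    character-based str.find otherwise."""
    if all(v.isascii() for v in values):
        # In ASCII-only strings every character is exactly one byte,
        # so byte offsets and character offsets coincide.
        needle = sub.encode("utf-8")
        return [v.encode("utf-8").find(needle) for v in values]
    # Any non-ASCII element: fall back to character offsets.
    return [v.find(sub) for v in values]


print(find_hybrid(["abc", "cab"], "a"))      # [0, 1]
print(find_hybrid(["\u6c38a", "abc"], "a"))  # [1, 0]
```

Whether the extra pass over the data pays off depends on how often a Series is ASCII-only, which is the open question in this thread.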


Development

Successfully merging this pull request may close these issues.

BUG: Different result from str.find depending on dtype
