BUG: str.find returns byte offset instead of character offset with str dtype #64133
Mr-Neutr0n wants to merge 1 commit into pandas-dev:main
…ti-byte UTF-8 chars (pandas-dev#64123) pc.find_substring returns byte positions rather than character positions for multi-byte UTF-8 encoded strings. This causes Series.str.find() to return incorrect results when using Arrow-backed StringDtype, e.g. find('a') in '永a' returns 3 instead of 1. Fix by falling back to elementwise Python str.find(), which correctly returns character offsets. This matches the approach already used by _str_rfind in ArrowExtensionArray.
jorisvandenbossche left a comment:
Looks good, thanks for the PR!
One thing I am wondering is how much faster the pyarrow method would be than the Python fallback for ASCII-only data, and how that gain compares with the cost of checking whether all elements are ASCII. If the difference is big enough, it might still be worth doing a pc.string_is_ascii(..).all() check first.
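A rough sketch of the check being suggested, written in pure Python so it runs without pyarrow. The function name find_hybrid is hypothetical, str.isascii stands in for pc.string_is_ascii(..).all(), and a byte-level search on the UTF-8 encoding stands in for pc.find_substring. It is not the code in this PR.

```python
from typing import List, Optional


def find_hybrid(values: List[Optional[str]], sub: str) -> List[Optional[int]]:
    """Return character offsets of `sub`, using a byte-level search only when safe.

    Hypothetical helper: the byte-level branch is a stand-in for
    pc.find_substring, and the isascii() scan is a stand-in for
    pc.string_is_ascii(..).all().
    """
    present = [v for v in values if v is not None]

    if all(v.isascii() for v in present):
        # ASCII-only data: every character is one byte, so byte offsets
        # and character offsets coincide and the fast path is correct.
        needle = sub.encode("utf-8")
        return [
            None if v is None else v.encode("utf-8").find(needle)
            for v in values
        ]

    # At least one multi-byte character: fall back to str.find, which
    # counts code points rather than bytes.
    return [None if v is None else v.find(sub) for v in values]


print(find_hybrid(["abc", None, "cab"], "a"))
print(find_hybrid(["\u6c38a", "ba"], "a"))
```

Whether this is worthwhile depends on the measurement asked for above: the ASCII scan is an extra pass over the data, so it only pays off if the vectorized search is much faster than the elementwise fallback.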
Good point: for ASCII-only strings the pyarrow path would definitely be faster since it avoids the Python object overhead. A hybrid approach with …
Fixes #64123
pc.find_substring returns byte positions rather than character positions for multi-byte UTF-8 encoded strings. This causes Series.str.find() to return incorrect results when using Arrow-backed StringDtype.
For example, find('a') in '永a' returns 3 (the byte offset of 'a', which follows the 3-byte UTF-8 encoding of '永') instead of the expected 1 (the character offset).
The fix replaces the pc.find_substring call with an elementwise application of Python's str.find, which correctly returns character offsets. This matches the approach already used by _str_rfind in ArrowExtensionArray. As noted by @rhshadrach in the issue, there is no pyarrow.compute function that returns character offsets for this operation.
Test added: test_find_multibyte_chars covers 1-byte (ASCII), 2-byte (Á), 3-byte (永), and 4-byte (🐍) UTF-8 characters across all string dtypes.
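The byte-versus-character discrepancy can be reproduced without pandas or pyarrow. The sketch below is illustrative only: byte_offset and char_offset are hypothetical helper names, and searching the UTF-8 encoding is used here as a stand-in for what a byte-level search reports. It uses the same four character widths the test covers.

```python
def byte_offset(s: str, sub: str) -> int:
    """Offset of `sub` within the UTF-8 bytes of `s` (byte-level search)."""
    return s.encode("utf-8").find(sub.encode("utf-8"))


def char_offset(s: str, sub: str) -> int:
    """Offset of `sub` within `s`, counted in characters (code points)."""
    return s.find(sub)


# One leading character of each UTF-8 width, followed by 'a'.
samples = [
    "xa",           # 'x' is 1 byte
    "\u00c1a",      # 'Á' is 2 bytes
    "\u6c38a",      # '永' is 3 bytes
    "\U0001f40da",  # '🐍' is 4 bytes
]

offsets = [(byte_offset(s, "a"), char_offset(s, "a")) for s in samples]
print(offsets)  # [(1, 1), (2, 1), (3, 1), (4, 1)]
```

The two offsets agree only in the ASCII case. For every multi-byte leading character the byte offset equals that character's encoded width, while the character offset stays at 1.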