Port StringView to Base#60526
Conversation
|
Test and build failures appear unrelated. |
4cfff69 to
aef8a48
Compare
|
Did you not copy over the optimized |
|
The recent generic string hashing function is quite fast - a StringView hashes almost as fast as a String. So I don't think there is a need to specialize |
|
Why not make it exactly as fast, when all it requires is a 1-line change to broaden this method: Line 636 in c3e9456 Contrariwise, if there is no performance advantage to the specialized |
|
Looking into it more, I've found:
Therefore, I've added a special method when hashing |
|
Right, definitely only for |
|
The method for String is annotated total, and not valid for DenseUTF8String because stringviews break the promise of consistency, as they are mutable. |
|
Tagging triage because this is a new feature and should not be added without consideration. |
|
Triage thinks this abstraction bears its weight and likes how heavy the test coverage is. Still needs review, but triage approves. |
|
Thank you for the review - addressed in commit 81d094f |
| offs -= ncodeunits(c) | ||
| # Since StringView is generic over the wrapped array, we could invoke UB | ||
| # if we don't validate the array behaves as expected. | ||
| offs < 1 && error("Invalid implementation of vector length") |
There was a problem hiding this comment.
feels a bit inconsistent to distrust length here, but trust it everywhere else. I guess this comes back to the question in your summary
Can we use @inbounds with a StringView wrapping an unknown array type?
similar questions of course have arisen many times without a clear answer... see e.g. #40397
however, iterate(::AbstractArray) is definitely a more critical function to be never-wrong. I have to imagine users of StringView are going to tend to be already primed to be attentive to indexing bounds and safe memory usage etc. so the @inbounds in this PR seem more acceptable to me. although, just personally speaking, I'd probably have allowed iterate(::AbstractArray) to rely on correct length as well...
adienes
left a comment
There was a problem hiding this comment.
I think w.r.t. general comments and code cleanups I have nothing more to add at the moment, or at least certainly nothing that can't be left for a followup PR.
I am not particularly a String power-user though, so it might still be useful to get someone with more investment in the apis to take a look as well, maybe @tecosaur or @quinnj given the support expressed in #60037 ?
as a more vague comment I do wonder if it might be possible to clean up all the new type aliases into a trait or two? while reviewing I found myself repeatedly confused about which of the big unions was actually the appropriate choice for a certain dispatch. but that could also just be an issue on my side to a lack of familiarity
Absolutely, this is a problem - and a hobby horse of mine. It gets even worse once you realize that |
|
This has been approved by triage, and reviewed twice now, so I think it's ready to merge. If anyone else wants to contribute with a review, please do! |
This PR ports StringViews.jl to Base, as discussed in #60037. Closes #60037.
Tests have been written by a chat-bot and reviewed by me
Decisions for reviewers before reviewing
@inboundswith aStringViewwrapping an unknown array type? The code currently does, but I'm not against removing it. If we do remove it, we may need to re-jigger a bunch of string related code, since shared code paths of strings and string views would mean thatStringperformance could be degraded by removing inbounds annotations.Motivation
We generally want to avoid adding more types to Base when it can live in a package, so this PR requires some justification:
StringViews are the more fundamental string
StringandSubStringare an abstraction over an underlying byte array, and almost all its operations are defined in terms of loading bytes from that array. This abstraction of strings as "arrays in a trench coat" is made explicit by string views.Hence,
StringandSubStringcan be implemented in terms of string views, but not the other way around. And it is better if Base contains the foundations and packages provide implementations on top of those, instead of the other way around.One hint that the relationship is inverted is to look at the implementation of
StringViews.jl: It re-implements tonnes of internal Base string functions, instead of calling into generic, foundational methods in Base.We also have comments like this in Base:
# duck-type s so that external UTF-8 string packages like StringViews can hook in, where the sensible thing would be to define getindex for UTF8-encoded strings in terms of this method, instead of encouraging use of Base internals.Due to the strong intersection of strings as an abstraction and string views, this PR only has ~100 LOC in
base/strings/stringview.jl, whereassrc/inStringViews.jlhas ~800 LOC.Relatedly, I find it unfortunate that this relationship between strings as an abstraction, and arrays as the underlying implementation, is frequently inverted for
Stringin Base - code will often define operations onCodeUnits{UInt8, String}in terms of operations on the string, instead of the other way around. This makes the 'dispatch funnels' awkward because it does not match the fundamental truth that string operations are ultimately defined (even in Base) in terms of byte operations.StringViews could be used in Base itself
Base does a bunch of parsing, e.g. on TOML and Tar files. In my view, string views are very useful for zero-copy operations for efficient parsing. For example, to parse a
Float32from a byte buffer, we now jave to allocate an intermediateString. If we ever want to optimise Tar and TOML parsing, we would need something like string views at some point.Arguments against this PR:
@StefanKarpinski mentioned that he and @oscardssmith was discussing some string redesign that would lessen the need for string views. I don't know the details, but would like to know more. If string views are made obsolete in the future, it may be a bad idea to add this type.
This PR also does complicate some dispatch systems. Since we don't have an interface for expressing denseness, this PR worsens the situation where dispatch is controlled by "big unions" that attempt to encompass the same concept, such as e.g.
DenseString = Union{SubString{String}, String, StringView{<:Union{DenseVector{UInt8}, var"#s18"} where var"#s18"<:(SubArray{UInt8, 1, var"#s16", I, true} where {var"#s16"<:DenseVector{UInt8}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}})}, SubString{<:StringView{<:Union{DenseVector{UInt8}, var"#s18"} where var"#s18"<:(SubArray{UInt8, 1, var"#s16", I, true} where {var"#s16"<:DenseVector{UInt8}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}})}}}.By laying extra complexity on a not-so-consistent mess of dispatch and half-abstractions, this PR makes the situations somewhat worse. This problem is possibly temporary, and could be solved, but it does mean that the existence of string views in Base makes existing Base code more complex, and therefore increases the risk of bugs.
Finally, adding any new stuff to Base has obvious downsides: It makes it impossible to change without breaking Base Julia, makes Base CI take longer, contributes to 'feature overload' and so on.
Differences from StringViews.jl to this PR
The
StringViewtype in this PR is different from that in StringViews.jl in several respects, listed below. The purpose of the changes I've made is to make the implementation more 'conservative', such that the semantics of string views are more in line with existing Base code.I have also removed a bunch of auxiliary methods that are slightly off, semantically, mixing up strings, vectors and IOs.
Vector{UInt8}(s::StringView{Vector{UInt8}})now copies, instead of accessing the wrapped vector, because it is confusing that theVector{UInt8}(::StringView)constructor sometimes and sometimes not aliases the string. Thecodeunitsfunction can be used to unambiguously get a vector that aliases the string.StringView(::String)has been removed. String views are - by definition - built fromAbstractVector{UInt8}, not strings. If you for (some weird) reason wants a string view wrapping a string, doStringView(codeunits(::String))StringView(::StringView)is also removed. We construct string views from vectors, and nothing else.StringView(buf::IOBuffer)constructor has been removed. The purpose of this function was to wrap the buffer ofIOBuffer. However, this should be done properly with a function that gets the buffer ofIOBufferas anAbstractVector{UInt8}, then wrapping that in a string view.s[1:2]now copies, instead of returning a view. It has also been type-stabilized.SVRegexMatchhas been removed. Regex matching over a string view just returns aRegexMatch. This is objectively better, but was not done in StringViews.jl becauseRegexMatchused to not be generic over the target string.reverse(::StringView)now returns aStringView{Memory{UInt8}}. It would be nicer to return the same type as the input, but that is not possible generically.StringViewcan only be constructed with a one-based indexed vector. This restriction may be lifted in the future. This check has been added because some existing StringView code assumed that and I didn't want to re-invent StringViews.jl too much in this PR.StringView{T1}wrapping aT2whereT2 <: T1andT1 !== T2. This just complicates the code for no good reason.Future work
More work could be done to refactor the "dispatch funnels" for
AbstractString. In particular, routing more methods throughcodeunits, such that the functions operate onAbstractVectorcould simplify quite a bit. However, this is best left for a future PR.