Skip to content

Port StringView to Base#60526

Merged
oscardssmith merged 19 commits intoJuliaLang:masterfrom
jakobnissen:stringview
Apr 12, 2026
Merged

Port StringView to Base#60526
oscardssmith merged 19 commits intoJuliaLang:masterfrom
jakobnissen:stringview

Conversation

@jakobnissen
Copy link
Copy Markdown
Member

@jakobnissen jakobnissen commented Jan 1, 2026

This PR ports StringViews.jl to Base, as discussed in #60037. Closes #60037.

Tests have been written by a chat-bot and reviewed by me

Decisions for reviewers before reviewing

  • Can we use @inbounds with a StringView wrapping an unknown array type? The code currently does, but I'm not against removing it. If we do remove it, we may need to re-jigger a bunch of string related code, since shared code paths of strings and string views would mean that String performance could be degraded by removing inbounds annotations.

Motivation

We generally want to avoid adding more types to Base when it can live in a package, so this PR requires some justification:

StringViews are the more fundamental string

String and SubString are an abstraction over an underlying byte array, and almost all its operations are defined in terms of loading bytes from that array. This abstraction of strings as "arrays in a trench coat" is made explicit by string views.
Hence, String and SubString can be implemented in terms of string views, but not the other way around. And it is better if Base contains the foundations and packages provide implementations on top of those, instead of the other way around.

One hint that the relationship is inverted is to look at the implementation of StringViews.jl: It re-implements tonnes of internal Base string functions, instead of calling into generic, foundational methods in Base.
We also have comments like this in Base: # duck-type s so that external UTF-8 string packages like StringViews can hook in, where the sensible thing would be to define getindex for UTF8-encoded strings in terms of this method, instead of encouraging use of Base internals.

Due to the strong intersection of strings as an abstraction and string views, this PR only has ~100 LOC in base/strings/stringview.jl, whereas src/ in StringViews.jl has ~800 LOC.

Relatedly, I find it unfortunate that this relationship between strings as an abstraction, and arrays as the underlying implementation, is frequently inverted for String in Base - code will often define operations on CodeUnits{UInt8, String} in terms of operations on the string, instead of the other way around. This makes the 'dispatch funnels' awkward because it does not match the fundamental truth that string operations are ultimately defined (even in Base) in terms of byte operations.

StringViews could be used in Base itself

Base does a bunch of parsing, e.g. on TOML and Tar files. In my view, string views are very useful for zero-copy operations for efficient parsing. For example, to parse a Float32 from a byte buffer, we now jave to allocate an intermediate String. If we ever want to optimise Tar and TOML parsing, we would need something like string views at some point.

Arguments against this PR:

@StefanKarpinski mentioned that he and @oscardssmith was discussing some string redesign that would lessen the need for string views. I don't know the details, but would like to know more. If string views are made obsolete in the future, it may be a bad idea to add this type.

This PR also does complicate some dispatch systems. Since we don't have an interface for expressing denseness, this PR worsens the situation where dispatch is controlled by "big unions" that attempt to encompass the same concept, such as e.g. DenseString = Union{SubString{String}, String, StringView{<:Union{DenseVector{UInt8}, var"#s18"} where var"#s18"<:(SubArray{UInt8, 1, var"#s16", I, true} where {var"#s16"<:DenseVector{UInt8}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}})}, SubString{<:StringView{<:Union{DenseVector{UInt8}, var"#s18"} where var"#s18"<:(SubArray{UInt8, 1, var"#s16", I, true} where {var"#s16"<:DenseVector{UInt8}, I<:Union{Tuple{Vararg{Real}}, Tuple{AbstractUnitRange, Vararg{Any}}}})}}}.
By laying extra complexity on a not-so-consistent mess of dispatch and half-abstractions, this PR makes the situations somewhat worse. This problem is possibly temporary, and could be solved, but it does mean that the existence of string views in Base makes existing Base code more complex, and therefore increases the risk of bugs.

Finally, adding any new stuff to Base has obvious downsides: It makes it impossible to change without breaking Base Julia, makes Base CI take longer, contributes to 'feature overload' and so on.

Differences from StringViews.jl to this PR

The StringView type in this PR is different from that in StringViews.jl in several respects, listed below. The purpose of the changes I've made is to make the implementation more 'conservative', such that the semantics of string views are more in line with existing Base code.

I have also removed a bunch of auxiliary methods that are slightly off, semantically, mixing up strings, vectors and IOs.

  • Vector{UInt8}(s::StringView{Vector{UInt8}}) now copies, instead of accessing the wrapped vector, because it is confusing that the Vector{UInt8}(::StringView) constructor sometimes and sometimes not aliases the string. The codeunits function can be used to unambiguously get a vector that aliases the string.
  • The constructor StringView(::String) has been removed. String views are - by definition - built from AbstractVector{UInt8}, not strings. If you for (some weird) reason wants a string view wrapping a string, do StringView(codeunits(::String))
  • Similarly, StringView(::StringView) is also removed. We construct string views from vectors, and nothing else.
  • The StringView(buf::IOBuffer) constructor has been removed. The purpose of this function was to wrap the buffer of IOBuffer. However, this should be done properly with a function that gets the buffer of IOBuffer as an AbstractVector{UInt8}, then wrapping that in a string view.
  • Slicing into a string view, e.g. s[1:2] now copies, instead of returning a view. It has also been type-stabilized.
  • SVRegexMatch has been removed. Regex matching over a string view just returns a RegexMatch. This is objectively better, but was not done in StringViews.jl because RegexMatch used to not be generic over the target string.
  • reverse(::StringView) now returns a StringView{Memory{UInt8}}. It would be nicer to return the same type as the input, but that is not possible generically.
  • For now, StringView can only be constructed with a one-based indexed vector. This restriction may be lifted in the future. This check has been added because some existing StringView code assumed that and I didn't want to re-invent StringViews.jl too much in this PR.
  • It is not permitted to construct a StringView{T1} wrapping a T2 where T2 <: T1 and T1 !== T2. This just complicates the code for no good reason.

Future work

More work could be done to refactor the "dispatch funnels" for AbstractString. In particular, routing more methods through codeunits, such that the functions operate on AbstractVector could simplify quite a bit. However, this is best left for a future PR.

@jakobnissen jakobnissen marked this pull request as ready for review January 2, 2026 14:19
@jakobnissen jakobnissen changed the title WIP: Port StringView to Base Port StringView to Base Jan 2, 2026
@jakobnissen jakobnissen added strings "Strings!" feature Indicates new feature / enhancement requests status: waiting for PR reviewer labels Jan 2, 2026
@jakobnissen
Copy link
Copy Markdown
Member Author

Test and build failures appear unrelated.

@stevengj
Copy link
Copy Markdown
Member

stevengj commented Jan 6, 2026

Did you not copy over the optimized hash method that calls hash_bytes, or am I missing something? Seems like you could just broaden hash(::String, ::UInt) to hash(::DenseUTF8String, ::UInt).

@jakobnissen
Copy link
Copy Markdown
Member Author

The recent generic string hashing function is quite fast - a StringView hashes almost as fast as a String. So I don't think there is a need to specialize

@stevengj
Copy link
Copy Markdown
Member

stevengj commented Jan 7, 2026

Why not make it exactly as fast, when all it requires is a 1-line change to broaden this method:

@assume_effects :total hash(data::String, h::UInt) =

Contrariwise, if there is no performance advantage to the specialized hash(::String) method, why not delete that method?

@jakobnissen
Copy link
Copy Markdown
Member Author

Looking into it more, I've found:

  • Pointer-based hashing is faster than AbstractArray-based hashing, though I do not know why.
  • String hashing uses @assume_effects :total, which I'm not comfortable to do for generic StringView

Therefore, I've added a special method when hashing DenseStringViewAndSub, which calls into the pointer-based hashing backend. Other than the assume effects, the only difference now between String and DenseStringViewAndSub hashing are unimportant details, which should not affect speed.

@stevengj
Copy link
Copy Markdown
Member

stevengj commented Jan 8, 2026

Right, definitely only for DenseUTF8String. But why not widen the old method rather than adding a new one?

@jakobnissen
Copy link
Copy Markdown
Member Author

jakobnissen commented Jan 9, 2026

The method for String is annotated total, and not valid for DenseUTF8String because stringviews break the promise of consistency, as they are mutable.
I suppose I could widen it to also take substring of String (EDIT: No I see Jameson explicitly didn't annotate the SubString{String} method as total and so I don't dare do so, because these effects assumptions are nuanced and tricky)

@jakobnissen jakobnissen added the triage This should be discussed on a triage call label Jan 11, 2026
@jakobnissen
Copy link
Copy Markdown
Member Author

Tagging triage because this is a new feature and should not be added without consideration.

@oscardssmith oscardssmith requested a review from gbaraldi January 15, 2026 18:47
@oscardssmith
Copy link
Copy Markdown
Member

Triage thinks this abstraction bears its weight and likes how heavy the test coverage is. Still needs review, but triage approves.

@jakobnissen jakobnissen removed the triage This should be discussed on a triage call label Jan 15, 2026
@jakobnissen
Copy link
Copy Markdown
Member Author

Thank you for the review - addressed in commit 81d094f

offs -= ncodeunits(c)
# Since StringView is generic over the wrapped array, we could invoke UB
# if we don't validate the array behaves as expected.
offs < 1 && error("Invalid implementation of vector length")
Copy link
Copy Markdown
Member

@adienes adienes Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feels a bit inconsistent to distrust length here, but trust it everywhere else. I guess this comes back to the question in your summary

Can we use @inbounds with a StringView wrapping an unknown array type?

similar questions of course have arisen many times without a clear answer... see e.g. #40397

however, iterate(::AbstractArray) is definitely a more critical function to be never-wrong. I have to imagine users of StringView are going to tend to be already primed to be attentive to indexing bounds and safe memory usage etc. so the @inbounds in this PR seem more acceptable to me. although, just personally speaking, I'd probably have allowed iterate(::AbstractArray) to rely on correct length as well...

Copy link
Copy Markdown
Member

@adienes adienes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think w.r.t. general comments and code cleanups I have nothing more to add at the moment, or at least certainly nothing that can't be left for a followup PR.

I am not particularly a String power-user though, so it might still be useful to get someone with more investment in the apis to take a look as well, maybe @tecosaur or @quinnj given the support expressed in #60037 ?

as a more vague comment I do wonder if it might be possible to clean up all the new type aliases into a trait or two? while reviewing I found myself repeatedly confused about which of the big unions was actually the appropriate choice for a certain dispatch. but that could also just be an issue on my side to a lack of familiarity

@jakobnissen
Copy link
Copy Markdown
Member Author

jakobnissen commented Apr 11, 2026

I do wonder if it might be possible to clean up all the new type aliases into a trait or two? while reviewing I found myself repeatedly confused about which of the big unions was actually the appropriate choice for a certain dispatch.

Absolutely, this is a problem - and a hobby horse of mine. It gets even worse once you realize that SubArray, SubString and CodeUnits can be nested arbitrarily - e.g. consider the difference between a substring of a stringview and a stringview of a subarray. See #54581. But I think that is out of scope for this PR

@jakobnissen jakobnissen added merge me PR is reviewed. Merge when all tests are passing and removed status: waiting for PR reviewer labels Apr 12, 2026
@jakobnissen
Copy link
Copy Markdown
Member Author

This has been approved by triage, and reviewed twice now, so I think it's ready to merge. If anyone else wants to contribute with a review, please do!

@oscardssmith oscardssmith merged commit 66351eb into JuliaLang:master Apr 12, 2026
9 of 10 checks passed
@oscardssmith oscardssmith removed the merge me PR is reviewed. Merge when all tests are passing label Apr 12, 2026
@jakobnissen jakobnissen deleted the stringview branch April 13, 2026 05:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature Indicates new feature / enhancement requests strings "Strings!"

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Have a StringView in Base

7 participants