Rust has a nice solution here: the &str slice (the equivalent of C++'s std::string_view) is the lowest common denominator, and deref coercion (via the Deref trait) easily converts various string types to the slice.
This allows Rust to have all kinds of string flavors (fixed-length, NUL-terminated, or growable; on the stack, on the heap, or in ROM; with SSO; with CoW; interned or refcounted, atomically or not; and so on), but they all coerce to the basic &str, so they share the same basic methods and are compatible with most functions that just want a string without caring how it's allocated.
You can use your own weird string type if you want, but usually you're not forced to convert or copy it to use it with standard library functions or 3rd party dependencies.
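For illustration, a minimal sketch of that coercion (count_words is a hypothetical helper, not a std function):

```rust
use std::borrow::Cow;

// Hypothetical helper: it only asks for a &str, so any string flavor can be passed.
fn count_words(s: &str) -> usize {
    s.split_whitespace().count()
}

fn main() {
    let owned = String::from("one two three");
    let boxed: Box<str> = "four five".into();
    let cow: Cow<'_, str> = Cow::Borrowed("six");

    count_words(&owned);  // &String coerces to &str via Deref
    count_words(&boxed);  // &Box<str> coerces too
    count_words(&cow);    // so does &Cow<'_, str>
    count_words("plain"); // a literal is already a &str
}
```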
Yeah - although the cost of this is that it makes Rust harder to learn. (Try answering the common beginner question: “How is &String different from &str?”)
In Rust, strings are usually passed into functions as &str and returned from functions as String. It makes sense once you’re used to it, but I think philosophically Rust is much closer to C++ than C. Rust is missing the elegant minimalism of C or Zig.
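A quick sketch of that shape (normalize_name is a made-up example):

```rust
// The usual signature shape: borrow the input as &str, return an owned String.
fn normalize_name(raw: &str) -> String {
    raw.trim().to_lowercase()
}

fn main() {
    let cleaned = normalize_name("  Alice  "); // pass a borrowed &str in...
    assert_eq!(cleaned, "alice");              // ...get an owned String back
}
```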
It just illustrates the point: there is no elegance in that simplicity. Zig is not memory safe, and its functions "assume" UTF-8, with UB when they encounter invalid bytes.
C also has &str and &String, but they're "char* that crashes if you free() it" and "char* that leaks if you don't free() it".
Ownership is hard to learn, and it is a barrier to learning Rust. However, don't confuse it with C++: Rust has two-plus string types on purpose, not because it's carrying 1970s legacy.
> Try answering the common beginner question: “How is &String different from &str?”
This has nothing to do with strings though - how is &Vec<T> different from &[T] or &Box<[T]>? You need to understand ownership, references, and unsized types.
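A minimal sketch of the same coercion for sequences (the sum helper is just for illustration):

```rust
// Illustrative helper: borrows any contiguous sequence of i32s as a slice.
fn sum(xs: &[i32]) -> i32 {
    xs.iter().sum()
}

fn main() {
    let vec: Vec<i32> = vec![1, 2, 3];
    let boxed: Box<[i32]> = vec![4, 5].into_boxed_slice();

    sum(&vec);    // &Vec<i32> coerces to &[i32] via Deref
    sum(&boxed);  // &Box<[i32]> coerces the same way
    sum(&[6, 7]); // a borrowed array unsizes to a slice
}
```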
Sort of. It certainly affects strings, and you can't really use strings without understanding the difference.
My point is that it's yet another thing in the big bucket of junk you need to learn in order to use Rust effectively. Rust is the only language I know of that has separate types for owned strings and borrowed strings. It's a powerful concept - and it's fast and efficient for the computer. But it's not "free". The cost is that it makes Rust harder to learn.
While we're on the topic - it also really doesn't help that the names of these types in the standard library are super idiosyncratic. String is to Vec<u8> as &str is to &[u8]. You just have to memorise that. How about Path and PathBuf? One of them is like String, and one is like &str. I swear I have to look it up every single time.
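For the record, a quick cheat sheet of the common pairs (PathBuf is the owned, String-like one):

```rust
// Owned type on the left, borrowed view on the right:
//   String   <->  &str
//   Vec<u8>  <->  &[u8]
//   PathBuf  <->  &Path
//   OsString <->  &OsStr
use std::path::{Path, PathBuf};

fn print_path(p: &Path) {
    println!("{}", p.display());
}

fn main() {
    let owned: PathBuf = PathBuf::from("/tmp/example");
    print_path(&owned); // &PathBuf coerces to &Path, just like &String to &str
}
```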
I think the investment is worthwhile - I adore Rust. But as I said, Rust feels more like a better C++ than a better C. All the pain is front-loaded, with the assumption that you'll eventually write enough software in Rust to make the investment worthwhile.
Simpler languages like Python or Go get out of your way immediately and let you be productive. I prefer Rust, but there's definitely something to be said for that attitude.
I think I would sum this up with saying that strings are not simple, and there is a delicate balance between making simple things simple and complex things possible without shooting yourself in the foot - both as a user, and as a library designer.
Since Rust prefers verbosity and correctness over ease of use, it makes sense to have multiple types to reflect the different requirements of the underlying data.
I sympathize with some of the complaints about std library verbosity when it comes to strings (and paths), but I don't think it's really a big part of why Rust's learning curve is steep. You have the same issues in C++, and doubly so in C, because C doesn't help you at all, and when you screw up, the program crashes. When you use strings in an unmanaged language and need to manipulate them, you do need to understand the underlying consequences of how those strings are defined and stored. Comparing to managed languages like Python and Go is a bit unfair, because they're working in a different domain.
Python is a great example of what happens when you try to hide too much complexity from the programmer: the Python 3 ecosystem fracture took millions of dollars and over a decade to settle, in large part because of the inherent complexity of string representations.
> Rust is the only language I know of that has separate types for owned strings and borrowed strings.
C++ and Objective-C (kinda) have different types for owned and referenced strings. C++ in particular has the same requirements as Rust, where referenced strings need to be able to refer to literals and it's not acceptable to make the default allocate.
Agree that the naming is pretty bad, but ultimately there are like… three or four of these owned/deref pairs you actually have to remember in day-to-day usage? I do wish they had gone with a consistent naming scheme, but in the grand scheme of things this is one of the more minor things that Rust got wrong.
Yup, I can’t imagine switching to a language that doesn’t have something like that. In C# you have a similar feature in the form of Span<T> and ReadOnlySpan<T>, to which all manner of sources can be coerced: heap-allocated arrays, stack-allocated buffers, inline arrays, strings, native memory, etc.
> And then everybody proceeds to write their own String library anyway.
Is this true? It was (is!) certainly true for C, but C has an especially emaciated set of string-processing primitives. Any runtime developed after, like, 1995 that I can think of has fixed this by providing a sane string implementation people generally agree upon.
Rust and Go both lack a built-in package for grapheme iteration - and many people naively (and incorrectly) assume that a Go unicode rune == "character". I assume the same happens with Rust's `char` type.
If you care about Unicode-aware string sorting (you should), rather than the naive sorting the Go and Rust standard libraries provide out of the box... then you probably want a proper Unicode library.
I think the only language that gets Unicode 'right' out of the box is Swift, as it actually provides grapheme iterators, Locale awareness, etc. - but it comes at the cost of the language being tied to the (ever-moving) Unicode standard.
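To make the `char` point concrete, a sketch using the third-party unicode-segmentation crate (this is not in std):

```rust
// A Rust `char` is a Unicode scalar value, not a user-perceived character.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "e\u{301}"; // 'e' followed by a combining acute accent, rendered as "é"
    assert_eq!(s.chars().count(), 2);         // two code points...
    assert_eq!(s.graphemes(true).count(), 1); // ...but one grapheme cluster
}
```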
> I think the only language that gets Unicode 'right' out of the box is Swift, as it actually provides grapheme iterators, Locale awareness, etc. - but it comes at the cost of the language being tied to the (ever-moving) Unicode standard.
I think this might (at least partially) be why Rust's stdlib doesn't have this. If it did, support for it would be tied to Rust's release schedule and to which version of Rust you're using. Granted, releases come every six weeks and updating is usually trivial, but that's still a coupling that could be an issue.
Having this in a separate library means it can update as and when it needs to, without being inherently tied to a specific release of Rust.
I wrote a pessimistic reply about grapheme-based orientation towards text, deleted it to research more, and I've come to the conclusion that this is simply not a consensus opinion. Can you give me an example where grapheme-based sorting makes a critical difference from codepoint-oriented sorting on normalized text? Full Unicode composition certainly seems to provide a reasonable solution for western languages, CJK characters, and romanization of CJK characters, but that leaves a hell of a lot of scripts that I don't know about.
I mean, Unicode is incredibly complex, but there doesn't even seem to be a consensus, outside of Swift's string implementation, on what a grapheme even is.
(Granted, this might support the above point that people can't even agree on what a string is, but Unicode code points seem like a reasonable baseline to expect from a modern language. That said, Rust doesn't even include Unicode normalization in the standard library, although the common crate for it seems like a reasonable solution.)
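For reference, a sketch using the third-party unicode-normalization crate, which is what I mean by "a reasonable solution":

```rust
// NFC makes composed and decomposed spellings of the same text compare equal.
use unicode_normalization::UnicodeNormalization;

fn main() {
    let composed = "\u{e9}";     // "é" as a single code point
    let decomposed = "e\u{301}"; // "é" as 'e' + combining acute accent
    assert_ne!(composed, decomposed); // byte-wise, they differ

    let a: String = composed.nfc().collect();
    let b: String = decomposed.nfc().collect();
    assert_eq!(a, b); // equal after NFC normalization
}
```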
The issue I'm aware of is with the Thai language, which has zero-width Unicode codepoints that get superimposed on the preceding non-zero-width codepoint (or, if none is present, on an 'empty' non-zero-width placeholder). A non-zero-width codepoint can have multiple zero-width codepoints following it (in Thai, no more than 2 for morphemically correct words). For codepoint-based sorting to be correct, the order of these zero-width codepoints needs to be normalized; the standard practice in Thai is to put vowel signs before tone markers.
In recent years, application support for this has greatly improved.
> there doesn't even seem to be a consensus, outside of Swift's string implementation, on what a grapheme even is
Linguistically it's easy: graphemes are the squiggles people actually draw, as distinct from how a machine encodes them. Of course, since people aren't a single individual with one consistent opinion, there's room for nuance - maybe some people think a given mark is two separate squiggles.
Nope. It's not at all fixed because nobody can ever agree on what a "String" is and what performance guarantees the underlying data structure should provide.
Let's just assume a String is UTF-8 to make things "simple".
Is a String mutable or not? Should mutable and immutable Strings have the same underlying structure? If mutable, is a String extensible or not? Can a String be sliced into another String? Can those slices be shared? Should you walk across codepoints or characters (which could be multiple codepoints due to combining)? If you want to insert a codepoint in the middle of a String, what are the performance guarantees?
I can go on and on ...
"String" really has to be a library as there are simply far too many permutations once you step away from "Shove ASCII to tty".
Well sure, people may colloquially refer to a lot of things as "strings" (hell, you could call any sequence a string if you just wanted to argue with people), but trying to encapsulate all of this in a single standard library implementation seems semantically confusing and of questionable value. It seems a lot easier to work with one reasonable interpretation of a string and its associated tradeoffs, which, again, is what most standard libraries imply.
That said, in 2024 I personally would balk at willingly adopting any runtime that didn't let me iterate over a string of bytes as a sequence of Unicode code points, whether stored as UTF-8 or some 16-bit form, unless I were guaranteed never to deal with text processing of free-form human input.
That said, implementing it is not easy, especially ensuring it is at least as fast as the built-in string type (the project goal is to be faster than the built-in string and the Go implementation, and to match or outperform Rust's String where applicable).
I can't imagine something like that being possible in e.g. Java or most other high-level languages - they simply do not expose appropriate low-level primitives when you need a "proper" string type.
And then everybody proceeds to write their own String library anyway.
"The only winning move is not to play."