Rewriting a high performance vector database in Rust (pinecone.io)
126 points by gk1 on Oct 18, 2022 | 145 comments


> If you’re using a higher level language, you’re not going to have access to how the memory is laid out. A simple change, like removing indirection in our list, was an order of magnitude improvement in our latencies since there’s memory prefetching in the compiler and the CPU can anticipate which vectors are going to be loaded next in order to improve the memory footprint.

This is a common experience and I'm still surprised by the choice I constantly see to use managed-memory languages to build a database - one of a very small set of special cases where having full control over the memory layout might just be a reasonable thing to want. In this universe (absent doing something completely absurd) it's not algorithmic complexity but managing data locality in the cache hierarchy (e.g. reading things from L3 vs main memory vs disk) that makes things orders of magnitude faster, especially if you're in the realm of doing things like SIMD operations to speed things up.
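
To make the "removing indirection" point concrete, here is a minimal sketch, not Pinecone's actual code. The function names and the squared-norm kernel are assumptions chosen for illustration. It contrasts a list of separately allocated vectors with one flat buffer in which the vectors sit back to back.

```rust
/// Sum of squared components, a stand-in for a real distance kernel.
fn norm_sq(v: &[f32]) -> f32 {
    v.iter().map(|x| x * x).sum()
}

/// Indirect layout: every vector is its own heap allocation,
/// so each step of the scan follows a pointer to a new location.
fn total_boxed(vectors: &[Box<[f32]>]) -> f32 {
    vectors.iter().map(|v| norm_sq(v)).sum()
}

/// Flat layout: all vectors live back to back in one buffer,
/// so a scan walks memory sequentially.
fn total_flat(data: &[f32], dim: usize) -> f32 {
    data.chunks_exact(dim).map(norm_sq).sum()
}

fn main() {
    let dim = 4;
    let boxed: Vec<Box<[f32]>> = (0..3)
        .map(|i| vec![i as f32; dim].into_boxed_slice())
        .collect();
    // Copy the same values into one contiguous buffer.
    let flat: Vec<f32> = boxed.iter().flat_map(|v| v.iter().copied()).collect();
    println!("{} {}", total_boxed(&boxed), total_flat(&flat, dim));
}
```

Both functions compute the same result; only the memory layout differs. Whether the flat version is faster, and by how much, depends on the workload and would need measuring.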

Perhaps there's some level of suck we're willing to tolerate for all the other benefits you get, but I've been noticing a pattern of "align things just so at the higher level and hope they mostly turn out the way you want at the lower level" (e.g. also with the Apache java-y databases like hadoop / hbase / cassandra which I guess were mostly supposed to derive their total throughput from massive scale rather than per-node performance) which is a bit funny.

But also it seems like part of Rust's promise was "low level but make it high level" which seems to be succeeding (zero-cost abstractions and whatnot), so I imagine this will get better over time - having not attempted a project like this myself, I'm not sure what the limitations you'd run up against are in terms of laying things out in memory in a favorable way - I imagine the kind of massive manually managed arena allocations and ad-hoc pointers going everywhere that one normally does doesn't really fly.


Because it is a fake dichotomy.

D, Nim, C#, Swift, not to count all of those that existed since Xerox PARC days.


And Java hopefully soon :)


Noticing that the above stream of consciousness might have come across as crotchety - that was not my intention! A better way to phrase this might have been "the benefits of memory safety and human conveniences of a modern language can be worth giving up a bit of control, even in cases where it might be legitimately beneficial", which was supposed to be praise for the Rust model :-)


> First of all, Python is a garbage collected language, which means it can be extremely slow for writing anything high performance at scale.

I don't think garbage collection is in the top 3 causes of why Python is slow.


This might be a difference of semantics: there is a difference between garbage collection as a concept being slow and Python's GIL approach. My understanding is that the GIL would almost always make the top 3 reasons why Python is slow in practice - it works for a very specific single-threaded execution model but can't really take advantage of modern processors.


I think the GIL is not the reason for slowness, it just specifies a single-threaded interpreter execution model. You can always spin up more interpreters to take advantage of multiple cores.

The reason for slowness is the weak dynamic type system of Python.

Every single instruction needs to be type checked at runtime, which makes everything slow.

Compare to C#/Java, which have GC but are both amazingly fast because these languages have stricter type systems. If you add a JIT on top (which can selectively replace MSIL/Java opcodes with native machine instructions), it makes perf on par with natively compiled languages like C++.


I'd argue Python is strongly, dynamically typed, and that "weak" and "dynamic" are on different axes.


It's always interesting how often people confuse the dichotomy of "strong/weak typing" with "static/dynamic typing". They are orthogonal to each other indeed - you can have languages that are in each of the four possible combinations.


> I think GIL is not the reason for slowness, it is the weak dynamic types of Python that make it slow.

Python is structurally typed, that makes it dynamic, but it is not weak as there is no type coercion.

> Every single instruction needs to be type checked at runtime and thus making everything slow.

This is also wrong, python does not type check anything, not in the "regular" manner of typechecking. It relies on structural typing, if it quacks like a duck, then it is treated like a duck.

In fact, PyPy is an argument against your position as it still allows the same (more or less) behaviour that python has while operating a lot faster due to JIT.

Python doesn't have the luxury of compiling that C# and Java have, nor is the VM intended to be high performing.


PyPy is still slower than the competition and largely ignored by the Python community.


I believe the main reason Python is slow is because the cpython project has rejected performance enhancements in favour of keeping the code simple. JavaScript has similar semantics to Python, and its performance is not far behind Java/C# at all.


JS has vastly different semantics to Python, one that lends itself much more to the kind of JIT optimisations needed to make it more performant.


also, is py still interpreted or jit-compiled?


Well, the author went on to rule out Java and Go because they are garbage collected, so I don't think they meant the GIL.


> First of all, Python is a garbage, which means it can be extremely slow for writing anything high performance at scale.

Fixed it.


Thankfully Fintech and military weapon control systems aren't high performance.


The authors of this post rewrote _their own_ DBMS in Rust. Which is perfectly ok, but I'm not sure I would trust them to decide that theirs is a "high-performance" DBMS. They don't have any benchmark results except images of their own internal performance measures; they don't offer any way of comparing their performance with other DBMSes (e.g. Vectorwise/Actian Vector, ClickHouse, DuckDB etc. - not to mention Oracle, MS or SAP offerings); and they only have marketing blurb about their numbers: "Up to 10x performance" (with no baseline of course).

So, they took some DBMS (which is probably not so hot in terms of performance) and rewrote it in Rust. Surely possible, possibly useful, but not much to write home about if one is interested in DBMS performance.


I have no problem with people rewriting their projects in whatever language they see fit.

What stood out for me in the article is him saying that it's difficult getting developers with experience in both Python and C++.

So, I wonder: if his in-house devs could pick up Rust that they previously couldn't write, why does he think he cannot hire a good programmer and charge him to learn the stack the company uses? Why must they employ someone that already writes Python or C++?

Is Rust such a straightforward language that people new to the language can write a very performant programme?


Despite Rust's steep learning curve, it's also paradoxically easy to add novice Rust programmers to a project.

This is because inexperienced Rust programmers are relatively harmless. Noob mistakes won't compile, rather than running into dangerous gotchas. You can tell noobs not to use `unsafe` (and there are ways to enforce that), and mostly they'll just write inefficient or non-idiomatic code, but the code will be free from data races and memory corruption.

The strictness of the Rust compiler is quite the opposite of something like the C++ Core Guidelines where the majority of the rules aren't enforced by the compiler, and have to be in the programmers' head first.

Noobs make lifetime errors and fight the Rust compiler, but imagine working with a compiler that doesn't tell you when you have lifetime errors.
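
One of the ways to enforce "no `unsafe`" mentioned above is the built-in `unsafe_code` lint. A minimal sketch, assuming a hypothetical module `safe_only` and helper `first` that exist only for illustration:

```rust
// Outer attribute form: applies to this module only.
// The crate-wide form is `#![forbid(unsafe_code)]` at the top of main.rs or lib.rs.
#[forbid(unsafe_code)]
mod safe_only {
    /// Returns the first element, or `None` for an empty slice.
    pub fn first(values: &[i32]) -> Option<i32> {
        // Writing `unsafe { *values.as_ptr() }` here would be rejected
        // at compile time because of the lint above.
        values.first().copied()
    }
}

fn main() {
    println!("{:?}", safe_only::first(&[1, 2, 3]));
}
```

Unlike `deny`, a `forbid` level cannot be overridden by an `allow` further down in the code, which is what makes it suitable as a project-wide rule.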


I think it's great that Rust is safe in that way, where you can have novice programmers write code and you can be pretty sure it won't break anything else.

I recently spent some time fixing performance issues in some novice Rust code, and while the code was pretty clearly written by someone new to systems programming it still all worked fine - https://jackson.dev/post/rust-coreutils-dd/


Memory sanitizers, address sanitizers, leak sanitizers, threading sanitizers, undefined behaviour sanitizers. The visual studio core guidelines checker. The clang-tidy core guideline checker. I could go on but my point is, the landscape does not really look like how you've painted it.


I know about these, but there is a marked difference between Rust and these tools.

Static analysis tools have a much harder job analyzing C++ (aliasing and escape analysis are way harder, and static analysis of thread-safety is basically impossible due to lack of thread-safety info in the type system). The results are a trade-off between being sparse or having false positives.

The sanitizers only catch issues they can observe at run time, and that relies on having sufficient test and fuzz coverage. Some data races are incredibly hard to reproduce, and might depend on a timing difference that won't happen in your test harness.

OTOH Rust proves absence of these issues by construction, at compile time.

It's like a difference between dynamically-typed and statically-typed languages. Sure, you can fuzz type errors out of JS or Python, but in statically-typed languages such errors are eliminated entirely at compile time. Rust extends this experience to more classes of errors.
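
As a rough illustration of "by construction", here is a sketch of a shared counter; `parallel_count` is a made-up name. The type system only lets the counter cross thread boundaries once it is wrapped in something thread-safe, here `Arc<Mutex<_>>`.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Increments a shared counter from `threads` threads, `per_thread` times each.
fn parallel_count(threads: usize, per_thread: usize) -> usize {
    // Handing a plain `&mut usize` to several spawned threads would not
    // compile; the shared state has to be wrapped in thread-safe types.
    let counter = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}

fn main() {
    println!("{}", parallel_count(4, 1000));
}
```

This rules out data races specifically. Deadlocks and other logic errors remain possible in safe Rust.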


> The results are a trade-off between being sparse or having false positives.

Rust just takes the other side of the trade-off, and will reject valid programs. Hence why the unsafe keyword exists, and why tools like Miri (https://github.com/rust-lang/miri) exist specifically for rust.


Sure, that's the consequence of Rice's theorem. "Is this program correct?" is formally Undecidable, to dodge that we split the programs three ways: Correct, Not Correct, shrug emoji.

It's obvious that Correct programs compile, Not Correct programs result in a compiler error, but what do we do with shrug emoji ? Rust says those go in the "Not Correct" pile and get a compiler error. C++ says - really, I'm not making this up - that they go in the Correct pile.

There's an immediate short term consequence, but also, after decades, a long term consequence that's arguably worse. Short term, C++ programmers can't know if their non-trivial program is Correct. It might be complete horse shit, the compiler won't necessarily tell them.

Long term, C++ grows worse and worse because there is no reason to shrink that shrug emoji category. All the programs in that category act as though they're fine, so there's no pressure whatsoever to improve the language standard, the compilers, etc.


> Rust just takes the other side of the trade-off, and will reject valid programs

This has always been such a weird claim to me because it's not clear to me what's meant by "valid". My instinct is that it would be hard to define valid in a way without making most languages either accept some invalid programs or reject some valid ones (the exception being defining "valid" as "anything that language X accepts", but that wouldn't really say anything interesting about Rust). Sure, you could draw the line so that Rust is the only one that rejects valid programs, but is it worse to be the only language that rejects some valid programs than to be a language that accepts invalid ones? The alternative is that all languages reject some valid programs, at which point there's nothing specific about Rust that's worse until you start specifying which valid programs each language rejects. If there's a way to define "valid" so that Rust is the only one that rejects valid programs but other languages accept all valid and reject all invalid programs, I think it's non-obvious enough that it should be explicitly stated.


That definition of what is valid or not depends on the type of checker. When you're talking about Rust's borrow checker, then you can take any property that it checks for, such as "there are no aliasing mutable references" or "any variable that is read from has been initialized" and build an example for it that would be correct, but is rejected by the borrow checker.

Here is an example where a variable is definitely initialized, but it will be rejected.

  fn main() {
      let foo;
      if true {
          foo = 1;
      }
    
      println!("{}", foo);
  }
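
For contrast, a sketch of a variant the compiler does accept: once every path through the `if` assigns the variable, the initialization check passes. The helper `pick` and its `flag` parameter are hypothetical, standing in for the constant condition above.

```rust
/// Assigns `foo` on both branches, so the compiler can prove
/// it is initialized before it is read.
fn pick(flag: bool) -> i32 {
    let foo;
    if flag {
        foo = 1;
    } else {
        foo = 0;
    }
    foo
}

fn main() {
    println!("{}", pick(true));
}
```

The check is purely structural: it looks at which branches assign the variable, not at whether the condition happens to be a constant.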


> Rust just takes the other side of the trade-off, and will reject valid programs.

Are we still talking about the ease of adding novice Rust programmers to a project?


Over the decades, C started to reject more and more "valid" programs, and you have to use explicit casts to go from one pointer type to the other, or from an integer to a pointer. For obviously good reason.


> and that relies on having sufficient test and fuzz coverage

At the faang I worked at, some small portion of servers ran the sanitizers in prod, so you’re not reliant on test coverage nearly so much for catching rare issues.


All of these tools and we still have exploitable buffer overflows in 2022. So either the tools are not working, people are not using them or they are using them but can simply ignore critical warnings.

Your mileage may vary, but I think Rust offers a well considered step in the right direction. Stupid and dangerous code of the kind every developer will produce once in a while just won't compile in Rust. You cannot forget to run a check, you can't hide behind not knowing a tool. You can't ignore the warning if you want a running program.

That is not nothing.


Yeah, but Rust guarantees sanity by design. Sanitizers are a patch, and hence not comprehensive.


Assuming you have a CI suite dedicated to handling all of the above, sure. Meanwhile some of us are staring down the barrel of 30-60 minute compile times for a single configuration. Multiplying that out to compile with ubsan, asan, and running the static analysis tools over it would take probably 5-6x longer, _and_ we likely need to do it all twice to ensure we're checking our "shipping" code paths.

It's not feasible to have a developer compile for 3 different platforms in two configurations with and without sanitizers, and core guidelines checkers for every change, and these tools take so long it's a huge cost.


Would you trust someone new to C++ to use these properly? The great thing about Rust is that you get all this and more (Rust's static checks are a lot more watertight) right out of the box without any configuration.


They painted it like reality, though, no?

You seem to paint the landscape as full of tools and imply that they're used. Either they're insufficient or they're often under utilized, simply due to the number of bugs we see. No?


I've written quite a bit of production code in C++, Python and Rust, and currently work on a hybrid Rust/Python system. Here's my experience:

- C++ is an unusually large language. And it has many historic footguns, requiring a higher level of vigilance and code review. If I were starting a brand new project today, I wouldn't try to build a team of C++ programmers.

- Untyped Python becomes more difficult to refactor and maintain once you reach 50k to 80k lines on a group project. Typed Python, however, scales nicely beyond this size.

- Rust is a "medium-sized" language. It requires developers to learn more than Go or Python does, but less than C++. And Rust has far fewer traps for the unwary and the reckless than C++. Rust's tooling is also very good in many areas.

- It's tempting to split a project into a fast "core" language, and high-level "glue" language. There are real advantages to this. (Which is why I've done it on one recent project!) But this also comes with costs: everyone needs to be fairly good at two languages, and switch back and forth. And you pay a tax at the boundary.

If I were building a brand new database (and a team to maintain it), I'd actually be strongly tempted to use Rust exclusively. But this is partly because databases rarely have a "business logic" layer that changes constantly, so there's less need for a high level scripting language.

But with a different team or different constraints, C++ could also be the right choice.


Especially business logic should be taken in the firm grip of static compile time guarantees that the hand of a strong type system delivers. Even more so if it changes constantly! Refactoring without fear.

Only software that does not have to run correctly (prototypes, personal hobby projects) can get away with a non-static type system.

When I have to pick a tool and I see it is written in Python I will have a look for an alternative if possible. Because I know it will have many bugs: some known, lots hidden.


> Typed Python, however, scales nicely beyond this size.

Could you say more about what tools and practices make this possible, beyond simply adding type annotations in your code? Asking for a friend.

> It's tempting to split a project into a fast "core" language, and high-level "glue" language.

I did this with C++ and Boost Python back in the day and loved the experience. I wonder if Rust will someday get a high-level language for writing applications and scripts on top of a Rust codebase, like Boost Python for C++ or Tcl for C.


> Could you say more about what tools and practices make this possible, beyond simply adding type annotations in your code? Asking for a friend.

Python with type annotations works really well with type checkers like Mypy, along with LSP servers, and both of those integrate with most development environments.

Using a Python-oriented IDE like PyCharm with type-annotated Python also allows for better refactoring options. It reduces the uncertainty and guesswork an IDE's static analyzer must engage in for even basic features you'd take for granted with IDEs and statically typed languages.

In practice, developers don't have to keep what can be a massively complex application running in their heads to modify code accurately. A nicely typed project makes it easy to see exactly what types of data get passed around and modified. Before gradual typing, you'd have to backtrack to all of a function's call sites to understand exactly what kind of data it takes and returns. With gradual typing, you can just look at types and rely on Mypy to ensure the right data is passed around.

> I did this with C++ and Boost Python back in the day and loved the experience. I wonder if Rust will someday get a high-level language for writing applications and scripts on top of a Rust codebase, like Boost Python for C++ or Tcl for C.

I haven't used Boost Python, but there are some options for Rust and Python that work well and seem to suit this use case like PyO3.


"But this also comes with costs: everyone needs to be fairly good at two languages, and switch back and forth."

Why does everyone need to be good at both languages? You can separate and have the core people writing efficient low level code - and you have higher level scripting/gluing code.


You will have Conway's law in your codebase. Coordination between teams is hard, so teams will prefer to implement features entirely in their language, even where that is technically suboptimal.

You will get hot loops in Python, because a Rust programmer wasn't around, and Rust programmers implementing whole complex business logic in Rust behind a single `do_it()` Python call.


Communication and coordination are surely hard and things like that surely happen, but this is why project management exists.

If it is doing things right, then the Rust people don't do complex business logic, because it is not assigned to them and they would not even have the details.

And if the Python people were too eager and have core stuff implemented and it is affecting performance, then you can always reimplement it low level.

It all depends on the project of course, of what would be the best mix.


A brand new project doesn't need legacy C++ footguns. It can use modern C++.

The thing about Python (usually) is that you don't need to be "good at it" if you aren't trying to write a super polymorphic core that runs super efficient computations like scipy. If you have a fast core engine for the inner loop, a slow Python management layer is plenty fast.


It takes longer to learn how to use C++ to the same level of proficiency and correctness compared to Rust, in my experience. It's harder to write an incorrect program in Rust.


What are the main correctness risks in C++ if you just never use a raw pointer?


I write C++ every day. There's no such thing as "just never use a raw pointer", as existing code will use pointers, libraries expose pointers, and there's no "non-owning" equivalent of a unique pointer. Even if there was:

- lifetimes are hard. The core guidelines recommend using span and string view, but correct usage of those isn't straightforward:

    std::string_view make(const std::string& in)
    {
        return in;
    }

This is not safe at all: if `make` is called with a temporary, the returned view dangles. Span is also super dangerous:

    std::vector<int> vec = {1, 2};
    std::span<int> sp = vec;
    vec.push_back(1); // may reallocate, leaving sp dangling

Lambda capture semantics are another area where you might end up surprised by the behaviour too (and by surprised, I mean you'll have memory issues).

Then there's all the normal stuff that still exists like slicing, lack of bounds checking, resource leaks because of improper inheritance use, data races, use-after-free issues. Sure these can be caught by a sanitizer with enough effort, but they still exist.
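
For comparison with the span example above, a sketch of the same shape in Rust; `first_then_push` is a made-up helper. A slice borrows the vector much as `std::span` views a `std::vector`, but the borrow has to end before the vector can be mutated.

```rust
/// Reads the first element through a slice, then grows the vector.
fn first_then_push(mut vec: Vec<i32>, extra: i32) -> (i32, Vec<i32>) {
    let first = {
        let view: &[i32] = &vec;
        view[0] // panics on empty input instead of reading out of bounds
    }; // the borrow held by `view` ends here

    // Pushing is allowed now. If `view` were still used after this line,
    // the compiler would reject the program with error E0502.
    vec.push(extra);
    (first, vec)
}

fn main() {
    let (first, vec) = first_then_push(vec![1, 2], 3);
    println!("{first} {vec:?}");
}
```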


I don't know about "main", but like, you don't need raw pointers to have UB. unique_ptr is nullptr after you move it.

And even then, my understanding is that raw pointers are still intended to be used in Modern C++: they're there for when you don't want to transfer ownership.


One thing off the top of my head, from experience:

std::string s(s);

To be fair, compilers will warn you about this nowadays. But when I converted a C codebase to C++ 20 years ago they didn't.

IIRC, references can also refer to de-allocated memory. Also, if you don't pass-by-reference or pointer, you can literally "slice" the dynamic doohickies off your instance so your AlbinoCat behaves like a Cat because all that extra special stuff is gone as far as the function is concerned.

This is just off the top of my head after not working with C++ for 20 years. I'm sure with all the new features it's gained over the past 20 years there are whole new exciting ways to blow your leg off.


how do you observe data you don't own if you never use a raw pointer or reference?

if you use shared ptr for this:

1. shared ptrs aren't made for this use case, they're for shared ownership

2. why use C++ at all if it means reducing its performance to that of an (atomically) reference-counted language? Rust allows for much more performance with its safe borrowed references.

That's also not what the C++ Core Guidelines, which are supposed to define modern C++, prescribe: For general use, take T* or T& arguments rather than smart pointers[0]. This rule incurs the risk of use after free if the returned reference is kept for too long by the caller. The rules for reference validity are non-trivial (e.g. you returned a reference to an item of a vector; if you push into your vector you might invalidate the reference), so this is a significant source of bugs even in modern C++.

if you don't use shared ptr or references/pointers for this i'm curious as to which mechanism you're using to observe non owning data

[0]: https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines...


Well...

   UNSAFE {   
      // TODO: Verify all the lines, all the time, are ok
      // Just like you do testing, documentation, security and all that
      // ok?
      #include <iostream>
      using namespace std;
      int main() {

         // YOUR CODE
      }   
   }


What are you saying?


Well if the question is:

> What are the main correctness risks in C++ if you just never use a raw pointer?

All code in C/C++ IS a correctness "risk". Only constant, manual inspection could (maybe) say otherwise.

What Rust gives is significant reduction of the risks.


I agree, rust has a difficult learning curve. I’ve often heard at least a year is required to really feel confident.


People really vary, it felt natural to me almost immediately but Rust's surveys indicate that some noticeable proportion of respondents report not being comfortable after more than a year.

This is seen as an important thing to improve because of the slogan "A language empowering everyone to build reliable and efficient software". If Rust isn't empowering everyone because say, 10% of people who try to pick it up can't get anywhere that's not good enough.


It takes a relatively short time to be proficient enough to make useful contributions, maybe half a year to a year to be confident. You can give an experienced developer the Rust book and have them contributing to a Rust codebase quickly.


For me it was ~2 weeks to get out of "learning mode" and into actually writing production code. 3 months to feel proficient. Maybe a year to feel that I had reached parity of fluency with languages I already knew.


6 months is what I heard. I'm currently at ~2 and 6 sounds like a pretty good estimate.


> Is Rust such a straight-forward language that people new to the language can write a very performant programme

Yes. At least that was my experience coming to Rust from a primarily JavaScript background. My first Rust program wasn't as fast as it could have been (I know a lot more about optimising programs now than I did then), but it was still ~30x faster than the JavaScript and Python versions my company had attempted to write first. That's without putting all that much effort into optimising it.


>a very performant programme

You might need to define what is "very performant".

I come from the C++ performance work side in games. Typical C++ is ~10 times slower than optimized C++. And optimized C++ is sometimes ~2-10 times slower than what game hardware can do. What limits many projects is that the cost of going to the 'next level' of performance is also nonlinear. Where on this scale is Rust code written by someone new to the language?


The footgun-to-appropriate-feature ratio is higher with C++ than Rust. Rust also has some excellent Python integration options that are relatively easy to use.


I'm 3 weeks into Rust, and I've found it easy to get over an initial hump. I think where it becomes difficult is in thinking about smart pointers like Rc, RefCell, Arc, etc -- stuff I haven't even encountered yet. For things like certain popular leetcode questions, these concepts become more valuable.


> What stood out for me in the article is him saying that it's difficult getting developers with experience in both Python and C++.

More like they had difficulty finding cheap experienced c++/python devs.


rewriting everything in rust is not just a meme?


Memes typically have some basis in reality. If your project reaches a point where it could benefit from fearless concurrency or better memory control, Rust is probably your best bet at the moment.

I could see huge benefits from Kafka and Cassandra being re-written in Rust.


20 year JVM programmer and current Cassandra user.

A Rust rewrite of Cassandra, if it could reach feature parity with 4.0, would be a good thing. (A C++ port already sort of exists with ScyllaDB, but they haven't reached feature parity yet.)


Agreed. Java is great for general purpose programming, I just think everyone can benefit from their systems tools being a little more lean. I can’t tell you how often my teams deal with trying to tune the GC


They are rewritten in C++ already; Redpanda and ScyllaDB, respectively. Why waste the effort of rewriting it once again?


I tried in 2017 writing it in Rust and found some compiler bugs. I also found compiler bugs in C++ tho to be honest, but I felt more comfortable in C++ so decided to write the first version of it in C++. The huge advantage is that storage engines in particular need to be more conservative in many dimensions and having seen success with Scylla, Seastar was appealing to me as 'tried and tested' for storage systems.

Prior systems I had built with Facebook folly (C++ lib) and had also written my own eventing systems in the past, but the real value is having Seastar being battle tested since 2016. Largely it has been the right decision for us, as Redpanda for its young age has benefited from the stability of Seastar.


C++ isn't better in the ways outlined.

I'm not necessarily advocating it [1] but the parent's claim was that those programs could benefit from memory safety, thread safety and better concurrency and C++ does not deliver along that axis.

[1] https://www.joelonsoftware.com/2000/04/06/things-you-should-...


kafka clone redpanda is written in rust ?


Hmm? I'm pretty sure it's written in C++.

See also, their install dependencies script.

https://github.com/redpanda-data/redpanda/blob/dev/install-d...


No. C++ and the C* (seastar) framework.


It's like everyone is trying to do things in "Rust" because it's the new thing to do.


Rust is 12 years old. 1.0 was released in 2015.

We've been past the "it's the shiny new thing" phase for a while.


Why do you think the meme exists?


What is a vector database?

https://www.pinecone.io/learn/vector-database/

...was less than informative.


It's a database storing machine learning embeddings.

Example:

Let's suppose I've downloaded all the Project Gutenberg library books.

I can feed them to a transformer like BERT or GPT-3 to calculate the embeddings of these books.

These embeddings will represent in vector form (an array of numbers of fixed size) the meaning of these books.

I can save these vectors in this database and this database can then calculate the distance between these vectors, so basically how closely related they are in terms of topic.

If I query this database with the embedding of a sentence like "Love story between teenagers from 2 enemy families in Italy, they die at the end", hopefully the best result will be Romeo and Juliet.

I'm no ML person so take my comment with a grain of salt in terms of how well it works. But in theory, that's the goal.
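
The lookup described above can be sketched as a brute-force nearest-neighbour search. This is only an illustration: the three-dimensional "embeddings" are invented numbers, real embeddings have hundreds or thousands of dimensions, cosine similarity is just one common choice of metric, and a real vector database would use an index rather than scanning every item.

```rust
/// Cosine similarity: 1.0 means same direction, 0.0 means orthogonal.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Returns the title whose embedding is most similar to `query`.
fn nearest<'a>(query: &[f32], items: &'a [(&'a str, Vec<f32>)]) -> Option<&'a str> {
    items
        .iter()
        .map(|(title, emb)| (*title, cosine_similarity(query, emb)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(title, _)| title)
}

fn main() {
    // Made-up embeddings, purely for illustration.
    let books = vec![
        ("Romeo and Juliet", vec![0.9, 0.1, 0.0]),
        ("Moby-Dick", vec![0.1, 0.9, 0.2]),
    ];
    let query = [0.8, 0.2, 0.0]; // pretend embedding of the query sentence
    println!("{:?}", nearest(&query, &books));
}
```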



Standard row-oriented databases store columns on disk like so:

    ABCABCABCABC

Vector databases store them like this:

    AAAABBBBCCCC

This allows faster queries if you just need one (or a few) columns, because unrelated columns don’t have to be processed at all. Caches are more efficient, vector CPU instructions can be used, etc…

The downside is that random single row access is more expensive because a row has to be reassembled from many locations.
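
The two layouts above (row-oriented versus column-oriented) can be sketched as an array of structs versus a struct of arrays. The `Row` and `Columns` types and the helper names are assumptions for illustration only.

```rust
/// Row-oriented: one record holds all three fields (ABCABC...).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Row {
    a: i32,
    b: i32,
    c: i32,
}

/// Column-oriented: one contiguous array per field (AAA...BBB...CCC...).
#[derive(Debug, Default)]
struct Columns {
    a: Vec<i32>,
    b: Vec<i32>,
    c: Vec<i32>,
}

fn to_columns(rows: &[Row]) -> Columns {
    let mut cols = Columns::default();
    for r in rows {
        cols.a.push(r.a);
        cols.b.push(r.b);
        cols.c.push(r.c);
    }
    cols
}

/// Scans every record even though only `a` is needed.
fn sum_a_rows(rows: &[Row]) -> i32 {
    rows.iter().map(|r| r.a).sum()
}

/// Touches only the `a` column; `b` and `c` are never read.
fn sum_a_columns(cols: &Columns) -> i32 {
    cols.a.iter().sum()
}

/// Single-row access has to gather one value from each column.
fn row_at(cols: &Columns, i: usize) -> Row {
    Row { a: cols.a[i], b: cols.b[i], c: cols.c[i] }
}

fn main() {
    let rows = [Row { a: 1, b: 2, c: 3 }, Row { a: 4, b: 5, c: 6 }];
    let cols = to_columns(&rows);
    println!("{} {} {:?}", sum_a_rows(&rows), sum_a_columns(&cols), row_at(&cols, 1));
}
```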


You're describing a column store there, not a vector store.


I always thought a columnar store was a series of key:value dictionaries that could produce an RDBMS table result on demand.


That’s a specific physical implementation. If using a key-value storage system then the keys are:

    Row    store: “table id; row id; column id”
    Column store: “table id; column id; row id”

The difference is the order used to sort the data in storage.

In reality, most column store systems use a more complex partitioning scheme with row groups and the like…
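
The key-ordering idea can be sketched with a sorted map. This is a toy model with invented ids and values, ignoring the row groups mentioned above: the same cells come out in a different physical order depending only on how the key tuple is arranged.

```rust
use std::collections::BTreeMap;

/// One cell: (table id, row id, column id, value). All ids are made up.
type Cell = (u32, u32, u32, &'static str);

/// Row store: key = (table, row, column), so a row's cells are adjacent.
fn row_store_order(cells: &[Cell]) -> Vec<&'static str> {
    let map: BTreeMap<(u32, u32, u32), &'static str> =
        cells.iter().map(|&(t, r, c, v)| ((t, r, c), v)).collect();
    map.into_values().collect()
}

/// Column store: key = (table, column, row), so a column's cells are adjacent.
fn column_store_order(cells: &[Cell]) -> Vec<&'static str> {
    let map: BTreeMap<(u32, u32, u32), &'static str> =
        cells.iter().map(|&(t, r, c, v)| ((t, c, r), v)).collect();
    map.into_values().collect()
}

fn main() {
    let cells: [Cell; 6] = [
        (1, 0, 0, "A0"), (1, 0, 1, "B0"), (1, 0, 2, "C0"),
        (1, 1, 0, "A1"), (1, 1, 1, "B1"), (1, 1, 2, "C1"),
    ];
    println!("{:?}", row_store_order(&cells));
    println!("{:?}", column_store_order(&cells));
}
```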


I mean, sure. But even DNS is a key:value store, or would you say that's over-simplifying?


> As you can see in the above graph, a commit was merged that caused a huge spike. However, with Criterion, an open source benchmarking tool, we were easily able to identify it, mitigate it, and push a fix.

Wonder what the commit was that caused a more than 2x regression and got a fix instead of an undo.


>In addition, it’s challenging to find developers with experience in both Python and C++

So you decided on a language that makes it even harder to find experienced developers?


Anecdotally, a lot of rust-curious people seem to know python. Projects like pyo3 help a lot as they make it much easier (= safe) to build native modules compared to C, let alone C++.


Rust is seen as more approachable by Javascript and Python devs, so they tend to learn it more often than C or C++.

It is a lot more similar to JS than C++ is.


> It is a lot more similar to JS than C++ is.

That is strange as I have experienced the opposite. I've written all three languages and I've noticed that JS patterns don't translate well to Rust. Many C++ patterns translate well to Rust (albeit after a bit of borrow checker fighting).

Thoughts?


Rust iterators give you JS vibes, with gotchas mostly related to the lifetimes of lambda captures. Once you accept that sometimes a `collect` is the easiest way out, it feels okay at the end of the day.
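A small sketch of that trade-off, using two made-up helper functions. The lazy version has to spell out how long the captured reference lives; the eager version sidesteps it with `collect`.

```rust
/// Lazy version: the closure captures `suffix`, so the returned iterator
/// needs `move` plus a lifetime tying it to both borrowed inputs.
fn with_suffix_lazy<'a>(
    words: &'a [String],
    suffix: &'a str,
) -> impl Iterator<Item = String> + 'a {
    words.iter().map(move |w| format!("{}{}", w, suffix))
}

/// Eager version: `collect` materialises the result inside the function,
/// so no borrowed closure escapes and no lifetime annotations are needed.
fn with_suffix(words: &[String], suffix: &str) -> Vec<String> {
    words.iter().map(|w| format!("{}{}", w, suffix)).collect()
}
```

The eager version allocates a `Vec` up front, which is the price of the simpler signature.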


it is arguable that C++ in the modern days is no longer "one language" due to style, libraries, language features and code-base legacy; you have to find a coder that will fit your C++ world, not just C++


Just like it will happen to Rust when it achieves 30 years of history, getting features every six weeks.

How many epochs will exist in 30 years?


As a rough estimate? 10, but I don't know what you think that means.

If you look at what [changes] were introduced by the [2018] and [2021] editions, they weren't as earth shattering as some might think:

2018:

- Module system changes

- Mandatory associated fn argument names

- dyn, async, await and try are now keywords

- You can't write `let s = libc::getenv(k.as_ptr()) as *const _;` anymore, instead needing `let s = libc::getenv(k.as_ptr()) as *const libc::c_char;` (Method dispatch for raw pointers to inference variables)

2021:

- TryInto, TryFrom and IntoIterator added to the prelude

- cargo dependency resolver changes

- `[1, 2, 3].into_iter()` now iterates over the array by value

- `|| a.x + 1` now captures `a.x`, not `a`

- Small technically backwards incompatible change to the panic macro formatting string

- any_identifier#, any_identifier"...", and any_identifier'...' are now reserved syntax

- Some previously existing warnings are now errors

- a | b is now matched in pat macro rules

[changes]: https://doc.rust-lang.org/edition-guide

[2018]: https://doc.rust-lang.org/edition-guide/rust-2018/index.html

[2021]: https://doc.rust-lang.org/edition-guide/rust-2021/index.html

Edit: it just dawned on me that by "epoch" you might not have meant the edition mechanism, which was at some point referred to as epochs and has so far happened every three years, but rather "how many iterations of idiomatic Rust code will there be in 30 years". If that was the original intent, you would also consider things like the introduction of the `?` operator, or the upcoming `let pat = expr else {}`, or match ergonomics, or the likely deref patterns, or any number of features that in isolation might not be huge, but that can materially impact what idiomatic code looks like. I personally believe that that kind of iteration and evolution of a language is good and necessary. As long as forwards compatibility is maintained, and that forward compat doesn't hinder the future design space, making things better over time is a great thing!
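To illustrate how such features shift what idiomatic code looks like, here is the same kind of early-exit logic written three ways. The function names are invented for the example, and the `let ... else` version assumes a toolchain where that syntax is stable (Rust 1.65 or later).

```rust
use std::num::ParseIntError;

/// Older style: every fallible step is matched by hand.
fn sum_match(a: &str, b: &str) -> Result<i32, ParseIntError> {
    let x = match a.parse::<i32>() {
        Ok(v) => v,
        Err(e) => return Err(e),
    };
    let y = match b.parse::<i32>() {
        Ok(v) => v,
        Err(e) => return Err(e),
    };
    Ok(x + y)
}

/// Same logic with the `?` operator propagating the error.
fn sum_question(a: &str, b: &str) -> Result<i32, ParseIntError> {
    Ok(a.parse::<i32>()? + b.parse::<i32>()?)
}

/// `let ... else`: bind on success, leave the function when the pattern fails.
fn first_word_len(line: &str) -> usize {
    let Some(word) = line.split_whitespace().next() else {
        return 0;
    };
    word.len()
}
```

All three compile side by side, which is the point: older code keeps working while newer code reads differently.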


Yes, with every Rust shop using their own set of favourite features, of which in 30 years there will be plenty to choose from.

This isn't unique to C++ as people like to criticize, compare C# 11, Java 19, Python 3.10, C23... with everything in between down to their initial versions.


Glanced through the article, and I see no comparisons on how performance of the DB is in Rust versus their current C++ implementation, no mention of if maintaining the Rust code is easier than their C++ codebase, no stats on how devs are ramping up and how it's tackling their "hard to find a dev who knows both C++ and Python well" issue.


The next paragraph they state:

    We looked at and compared several languages - Go, Java, C++, and Rust. We knew that C++ was harder to scale and maintain high quality as you build a dev team; that Java doesn’t provide the flexibility and systems programming language we needed; and that Go is also a garbage collected language. This left us with Rust. With Rust, the pros around performance, memory management, and ease of use outweighed the cons of it not yet being a very established language.
In other words, they wanted to unify the programming languages and evaluated several. Rust won out of those for performance reasons.

The article is a short recap of a 40 minute video. The video has more context and explains the intentions much better than the web page.

They show a graph of performance over time as the rewrite progressed. There were some small optimisations and problems, a few big regressions, and then a huge improvement that was maintained. Looks like the rewrite process made the database perform significantly better. There's nothing on how much this was caused by the language switch itself, but that's functionally impossible: nobody is rewriting their application twice to see what rewrite is better.


> but that's functionally impossible: nobody is rewriting their application twice to see what rewrite is better.

Agreed, and hypothetically the 2nd rewrite should still be better than the first. So the language would have to make it significantly worse to outweigh the extra experience gained in improving things.

To be clear though i'm not stating that every rewrite is assured to be better. However a carefully considered rewrite has a much easier time making decisions learned from any warts discovered in previous implementations. God knows there's always some warts.

As a Rust fanatic, i wouldn't expect the performance gains to be due to Rust itself. It's not expected to be faster than C/C++ typically. Just comparable.


Article also states that the switch from C++ to Rust improves "low level optimized instruction sets, memory layout, and running async tasks."

The first two are also strengths of C++, and for the third the article says that "Rust is async, and Tokio is the one of the most popular async providers ... However, it’s not great for running CPU intensive workloads, like with Pinecone." Puzzling.


I've had more luck with async-std over Tokio for more CPU intensive workloads. But then again, I ran it on a kqueue platform so my experience is probably not representative.


My past experience with Rust async code is that both async-std and Tokio are fairly unimpressive on performance (as async code goes), particularly if you compare to ScyllaDB's runtime or other similar C++ async runtimes.


I liked the part where they said Python is too slow because it's garbage collected, and didn't show any metrics, and then built a new solution in Rust and didn't show metrics to compare to the original system.

Makes me think the eng lead just wanted to do Rust, and made up a rationalization.


Well, we already know Python is inherently slower than Rust or any compiled language really, so does one really need metrics to know that the Rust implementation was faster?


Yes. If your bottleneck is magnetic disc seeking times, no amount of language change is going to move the needle (hah!).


a perf change without perf numbers is a bug


Show me metrics where Python can compete with Java/Go, then we can bother with some discussion about Python vs C++/Rust. Actually, show me where Python is faster than Javascript.

But yes, it is a given that C++/Rust is faster than Python unless there is some fundamental algorithmic foolishness done by the C++/Rust programmer. People with industry experience know why that is a given.


Same, "We knew that C++ was harder to scale and maintain high quality as you build a dev team" just sounds arbitrary and like a weak excuse to use Rust. C++20 is as scalable as Rust, with a very rich ecosystem.


> ...with a very rich ecosystem.

As someone who was heavy in C++14/17 it's much nicer in Go/Rust-land.

As much as I love Bjarne, I'm not coming back.


This only proves some skilled people have too much free time and no creativity...


Rewriting in Rust is not a meme, it's a cycle.

Before Rust became viable, rewrites were done in Go.

From the archives:

- Rewriting a large production system in Go https://news.ycombinator.com/item?id=6234736 (2013)

- How We Moved Our API From Ruby to Go https://news.ycombinator.com/item?id=9693743 (2015)

- Matrix and Riot Confirmed as the Basis for France’s Secure Instant Messenger App https://news.ycombinator.com/item?id=16938545 (2018)

- Toward Vagrant 3.0 https://news.ycombinator.com/item?id=27476676 (2021)

- I’m porting the TypeScript type checker tsc to Go https://news.ycombinator.com/item?id=30074414 (2022)


Which is why old timers eventually learn to just deliver with boring technology.


I honestly feel like rust is boring technology in most senses of the word. It “just works” more than almost any other technology that I’ve used. The ownership system is new and different, but that’s really the only thing.


Honestly, so is Go. I used to use Go, and it's boring as hell. I hate it for a few choices they made, but they definitely achieved their goal. It is quite boring.

I agree with you though, so is Rust. The less boring areas imo these days aren't languages (at least none i see), as all the good languages are boring. Zig for example, is pretty mundane too.

The older i get the more i value confidence in a product. Confidence that it won't crash at runtime. Confidence that i won't be bugged over the weekend. etc


Rust is not boring technology. There's too much ecosystem churn, and new language features are deployed too often.

C++ isn't boring technology, either. If you just want to deliver value, I'd recommend Java.


I've led and been on teams that have written multiple production-grade Rust services that have together delivered 100MM+ USD of value. The number of production bugs has been in the single digits, with exactly one outage that lasted more than a few minutes in the last 3 years. How about yourself?

In my experience, Rust delivers by far the fewest number of bugs in production out of any mainstream language. It gets the fundamentals right like nothing before it. &, &mut, Send and Sync take care of many classes of bugs in the inner loop of productivity.
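A small sketch of what that looks like in practice, using only the standard library. `parallel_sum` is a made-up helper: several threads read shared `&[u64]` chunks, while the one piece of mutable state sits behind a `Mutex`. This assumes Rust 1.63 or later for `std::thread::scope`.

```rust
use std::sync::Mutex;
use std::thread;

/// Sum a slice on several threads. The chunks are shared `&[u64]` borrows,
/// which are safe to send to other threads; the running total is the only
/// mutable state and is reachable only through the mutex.
fn parallel_sum(data: &[u64], workers: usize) -> u64 {
    let total = Mutex::new(0u64);
    let workers = workers.max(1);
    // Ceiling division, never zero, because `chunks(0)` would panic.
    let chunk_len = ((data.len() + workers - 1) / workers).max(1);

    thread::scope(|s| {
        for chunk in data.chunks(chunk_len) {
            let total = &total;
            s.spawn(move || {
                let partial: u64 = chunk.iter().sum();
                *total.lock().unwrap() += partial;
            });
        }
    });

    // All scoped threads have finished here, so the mutex can be consumed.
    total.into_inner().unwrap()
}
```

Replacing the `Mutex` with a plain `&mut u64` handed to every thread does not compile, because a second mutable borrow is rejected; that is the class of bug being referred to.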


> There's too much ecosystem churn, and new language features are deployed too often.

That kinda feels like saying Linux is too crazy because new apps get made for Linux frequently.

You can use the same part of the language tomorrow that you used today. Nothing is changing out from under you. If you're afraid of libraries, don't use them. You'd have the same problem in any ecosystem that is new, no?


> That kinda feels like saying Linux is too crazy because new apps get made for Linux frequently.

Apps are okay, but other parts of userland that roll out breaking changes on a regular basis are definitely a problem [1] [2] [3]. Even if they aren't technically part of the kernel, they are usually used with it to provide a complete working system, and they break stuff all the time.

[1]: https://lwn.net/Articles/904892/

[2]: https://lwn.net/Articles/840430/

[3]: https://lwn.net/Articles/777595/


> There's too much ecosystem churn, and new language features are deployed too often.

Not much of an issue if you stick to the stable subset of the language, and libraries that work within that subset.


With my (admittedly limited) experience with the Hadoop ecosystem, I'd sincerely beg for people to stop writing databases in Java... Apart from the way bigger system requirements, dependency version hell, having to monitor GC pauses is just so, so annoying


> Which is why old timers eventually learn to just deliver with ~~boring~~ buggy technology.

There's a reason why folks take the time to rewrite things in Rust. No matter how good you are at C/C++ you will encounter bugs that you would not have if you had written it in Rust.


Assuming there is even a Rust library replacement to start with.

People keep forgetting C++ has 30 years of being deployed in production.

Rust in 2022 is like using C++ in the 1990s in terms of ecosystem.


The C++ IDEs available up until about 10 years ago were complete garbage. C++ still doesn't even have a good package manager. All the build systems are pure chaos. The largest C++ package manager has 1500 packages. In comparison, rust's package manager and build system are way easier to use and already have 94,000 packages available to users.

That's not exactly fair to C++ because entire categories of dev tools (like build systems, package managers, IDEs, debuggers, version control, static analyzers, etc.) have matured after C++ did. And let's not forget that when C++ was new, most libraries were proprietary licensed and paid for, whereas today almost all libs are open source. And those improvements (along with general size of the programming community) mean that a trendy language today is going to develop and mature a lot faster than a formerly trendy language did 30 years ago.

IMO Rust in 2022 is a lot closer to java in 2005 than C++ in the 1990s.


Another one that never used Borland, Apple, IBM IDEs.

Where is the Rust IDE that is half as capable as C++ Builder, MPW/Metrowerks, Visual Age, Zortech?

Considering all features they offered across the board in the box, not only code completion.


I've used Borland. I'm sure it was marvelous at the time, but it doesn't hold a candle to CLion or Visual Studio in the 2010s or later.


CLion now ships a cross platform C++ framework with it?

As for Visual Studio, yeah it is great in all aspects, except having nothing else beyond MFC to offer on the GUI department, WinUI is still a mess after UWP.

In any case your examples are for C++ IDEs, reinforcing my case of C++ tooling versus Rust.


This is just plain false. C++ in the 1990s had nothing like serde for example.


The minimum bar for a language has moved up significantly since the 1990s. It isn't enough to just have a neat new idea, you need to ship with nearly-best-of-breed JSON serialization, a web server, a huge standard library with not just strings but things like compression and a lot of networking, and a laundry list of other things (give or take a few things) just to make it to the "barely viable alternate choice" point.

Nice as a language consumer, but a bummer that building new languages and getting some attention is so much harder than it used to be.


Well, Rust checks all of those boxes.

Yes, it's a problem for that language you plan on creating (try specializing into a niche). But it's not something that should impact Rust's adoption.


With various levels of completeness.


Where is the production-grade, pure-Rust TLS library? Key-value store? LDAP client? SSH client?


> production grade and pure rust tls library

You mean rustls? https://github.com/rustls/rustls


Pure rust ? No.


You’re right, I see some *.md-files in the GitHub repo


“ ring exposes a Rust API and is written in a hybrid of Rust, C, and assembly language.”


Aren't the most commonly used libraries for all of those written in C, not C++?

Regardless, I'm surprised you haven't heard of rustls - https://github.com/rustls/rustls


You have great c++ libraries for those. Not in pure rust though.


I haven't used them much, but sled, ldap3 and thrussh do exist. As Rust gains further in popularity I'd expect more of these to become production ready. Meanwhile there's always C and C++ interop.


Sled and thrussh are not production grade. I don’t particularly want to delve into details as I think the effort is laudable. I can explain my position in private if need be.


OK. I mean they're probably less mature, yeah. But high-quality Rust bindings to libssh2 and RocksDB do exist so shrug


And? A drop in the ocean of libraries.


And virtually no one should be starting new projects in C++ and everyone should switch to Rust.


Start by removing C++ from Rust compiler.

Then go around for Khronos, NVidia, Microsoft, Sony, Nintendo, Unreal, Godot,.... to support Rust on their SDKs.


The stable release of Go was maybe four years or so before Rust. So what you’re saying seems to be that people like to rewrite their tech in young and hyped (for good or bad or neutral) languages. Because there is little connection between Rust and Go (other than chronology).


[flagged]


I care.

Programming languages give us different frameworks and guardrails to express computational tasks similarly to how written and spoken languages give us a different set of concepts with which to express ideas. New languages mean a potentially different way of thinking about a problem. Some ideas which are difficult to express in one language are trivial in another.

Discovering these differences is one of the joys of language learning. Language learning requires practice, and rewriting a known work (or translating it you might say) is a great way to deepen your understanding and test which ideas are easier or harder.


You seem to come from a point as if I don't care about the intricate knowledge of programming languages or as if I don't care about learning how different languages manage to solve the same problem but from a different angle or with a different approach. I really do.

What I don't see here in this article is any of that, just some very loose arguments around picking Rust vs some other systems programming language. Very typical of this type of article, regardless of whether it's coming from Rust, C++, Go etc. "I rewrote XYZ in Rust" or "I rewrote XYZ in C++" or "I rewrote XYZ in Go" is basically like saying "I rewrote XYZ in a Turing-complete language". Go figure. How relevant is that really?

Any problem can be solved in any programming language, but the important distinction, which is very often left unspoken and not demonstrated, is what it is that you managed to improve. Is it that you managed to improve the code quality, shorten the development cycle, decrease the number of bugs, or was it that you managed to improve the performance? In the absence of any demonstration of the pros and cons of the approach one took, I will keep finding such articles biased, lacking real content, and therefore annoying. Some will call it programming language circle*erking.


This isn't a technical article. This is an article about an organization making a business decision based on broad goals, not specific tasks. They explained the technical and human requirements they wanted to optimize for and explained the other options they had available to them and why they chose the way they did. Then they explained their experience. This article is for managers, team leads, directors, to help them navigate similar business decisions.

Your original comment was purely dismissive of the entire concept of advocating for and sharing the experience of different languages. You also seem very focused on the technicality that "You can program anything in any language" which, while true (you CAN write a Facebook clone in [brainfuck](https://en.wikipedia.org/wiki/Brainfuck)), is something you wouldn't want to do for many reasons. For real(tm) languages the decision is harder.


My comment would have been dismissive if I had not provided the rationale, which I did, and I still stand by my original point. The article provided no evidence whatsoever why rewriting a vectorized database engine in Rust would solve any problems that could not have been solved with the original implementation language (C++ with Python front-facing API). It reads like a wasted effort with no strong grounds for doing so.

> This isn't a technical article. This is an article about an organization making a business decision based on broad goals, not specific tasks. They explained the technical and human requirements they wanted to optimize for and explained the other options they had available to them and why they chose the way they did. Then they explained their experience. This article is for managers, team leads, directors, to help them navigate similar business decisions.

Wow. A text building upon loose arguments around the AVX-512 instruction sets, memory layouts, memory footprint, garbage collection, asynchronous execution, parallel processing and benchmarking is not a technical article but a guide for businesses?

> First of all, Python is a garbage collected language, which means it can be extremely slow for writing anything high performance at scale.

False. Time spent in the frontend API (be it a Python connector through JDBC or ODBC) is literally nonexistent compared to the time spent in the engine actually crunching the analytical workload. Python seems to work just fine for dozens of database engine companies around.

> In addition, it’s challenging to find developers with experience in both Python and C++.

Even if that had been true, which is highly unlikely given the popularity and age of both C++ and Python, finding seasoned Rust developers is easier? C++ and Python is a pretty much regular combination from my experience but even if Python is lacking, devs with C++ background will pick up Python in no time. Anyone basically will.

> we wanted to find some way to unify our code base

I failed to understand the reasons for that unification. However, now that the codebase will be unified, it will be interesting to see in what languages the ecosystem will be developed. Python? Rust?

> while achieving the performance predictability we needed.

If "code unification" had been the only requirement, for which I don't see any strong arguments, there's probably no better language than C or C++ at this point to fit such a task. Rust actually isn't that predictable when it comes to code generation.

> We knew that C++ was harder to scale and maintain high quality as you build a dev team

Again, there's no argument to support the claim that Rust will somehow surpass C++ on this point. In any case, I am pretty sure that things are quite the opposite. Building a high-performing team in C++ is still, at this point, a much better bet.

I could go further and further dissecting the article but I will stop here. Your impression is that they explained all their decisions thoroughly, providing a ground for other business decision-makers, and I have to disagree with that. Quality of the content is low, arguments are almost non-existent and very loose. All I see is a biased presentation. Or, one may say, a low-quality Rust pitch.

Not to mention how absurd it is to _rewrite_ the whole database engine in another language just because. Too bad that the engine isn't open-source so that we can actually see how it goes along the way.



