How to organize large Rust codebases (kerkour.com)
48 points by speckx on July 14, 2024 | 40 comments


I think one should aspire to not have to make use of a container to get to where you can build a project. I don't think it is productive to start off by telling people that it is okay to make a mess of projects and then try to hide that mess inside a container.

I think it is poor advice to tell people that it is okay to sweep their mess under the container carpet.

Of course, this also means that people who maintain tooling for a language need to be open to the possibility that they may have something to learn from developers. And likewise, developers have to understand that not all of the things they want from the standard tooling will necessarily improve it.


I will even add that cargo already provides local dependency isolation without external tools, unlike Python, which needs something like Poetry.

In my experience, a docker-compose file to set up required services like a database or a RabbitMQ broker is much more useful. Mandatory containers for development end up locking the project into them, making the environment hard to change.


I usually take it one step further: embed services that are typically external by default, so that your application can run standalone in a useful, minimal, zero-configuration mode.

It is a huge timesaver during development not having to orchestrate lots of moving parts and make sure they are in the correct state. And if you can embed the actual implementations you will be using in production instead of mocks or substitutes, even better.

For production you will usually run against external dependencies - but not always. For a surprising number of backends I've written in the past decade, embedded resources have turned out to be good enough. For instance, lots of backends I've written with very modest database requirements just use SQLite.

Back when I was developing in Java embedding dependencies was a bit easier since a lot of the infrastructure we used was also written in Java. What used to be slow integration tests became really fast integration tests (indistinguishable from regular unit tests) and you could build all-in-one-jars that people who need to integrate with you could just download and fire up without needing anything beyond a JVM.

If you are on the receiving end of someone else's server software this is lovely because it makes your job easier.

I always embed a default database with any application or server that needs a database. Now that I mostly develop in Go, I use the transpiled SQLite library (transpiled to Go so you don't need to use cgo. It is possibly a bit slower and I'm unsure about its suitability for production, but for development work it is brilliant). In Java there were a couple of embeddable database alternatives to choose from.

I usually plan to use PostgreSQL (or RDS or a similar cloud DB) in production. Even after I add PostgreSQL support I tend to leave the SQLite in as the default, "zero config" option. It is particularly useful that someone who needs to integrate with your server can just download and run it without having to know how to configure it. Usually people just need something to test against when integrating with APIs or similar.
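
The commenter's stack is Go, but since this is a Rust thread, here is a minimal Rust sketch of that "zero config by default" idea; the env var, paths and names are illustrative, not from the comment:

    use std::env;

    // Illustrative selection logic: use an external database when one is
    // configured, otherwise fall back to an embedded SQLite file so the
    // server runs with zero configuration.
    enum Database {
        Postgres { url: String },
        EmbeddedSqlite { path: String },
    }

    fn database_from_env() -> Database {
        match env::var("DATABASE_URL") {
            Ok(url) => Database::Postgres { url },
            Err(_) => Database::EmbeddedSqlite { path: "./data/app.db".to_string() },
        }
    }

    fn main() {
        match database_from_env() {
            Database::Postgres { url } => println!("using external PostgreSQL at {url}"),
            Database::EmbeddedSqlite { path } => println!("using embedded SQLite at {path}"),
        }
    }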

If you're really nice to other developers, you also provide some instructions on how to embed your server, so that people using it can instantiate it directly in their integration tests. This usually makes for faster and more convenient tests that developers will be more inclined to run on every rebuild.
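
As an illustration of that idea (not the commenter's actual code), here is a std-only Rust sketch of a server exposing a hypothetical `serve` entry point that binds to port 0, so integration tests can spin up a real instance and talk to it over a local socket:

    use std::io::{Read, Write};
    use std::net::{SocketAddr, TcpListener, TcpStream};
    use std::thread;

    // Hypothetical library entry point: start the server on any free port
    // and return the address it is listening on.
    pub fn serve() -> SocketAddr {
        let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
        let addr = listener.local_addr().expect("local addr");
        thread::spawn(move || {
            for stream in listener.incoming().flatten() {
                handle(stream);
            }
        });
        addr
    }

    fn handle(mut stream: TcpStream) {
        let mut buf = [0u8; 1024];
        let _ = stream.read(&mut buf);
        // Always answer 200 OK -- just enough for the sketch.
        let _ = stream.write_all(b"HTTP/1.1 200 OK\r\ncontent-length: 2\r\n\r\nok");
    }

    #[test]
    fn integration_test_against_embedded_server() {
        let addr = serve();
        let mut conn = TcpStream::connect(addr).expect("connect");
        conn.write_all(b"GET /health HTTP/1.1\r\nhost: localhost\r\n\r\n").unwrap();
        let mut response = String::new();
        conn.read_to_string(&mut response).unwrap();
        assert!(response.starts_with("HTTP/1.1 200 OK"));
    }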

(For speeding up integration tests I wrote https://github.com/borud/drydock a few years ago as an experiment. After you've run it once, it makes testing against PostgreSQL much faster, and it only requires that you have Docker running. It takes care of everything for you. I seem to remember that someone else ran with the same idea and wrote a better tool for this than mine, but I can't remember who. This takes the messing with Docker out of the build loop, but it may not be to everyone's liking if they prefer to manage Docker from some build tool.)


I guess that there would be a few incompatibilities, but you could even try directly embedding postgres https://supabase.com/blog/postgres-wasm


We live in exciting times :-)


While these are nice tips, I'm not sure that I learnt much about organising large Rust codebases. I'm not very familiar with Rust, but the tip of not putting things in mod/lib seems similar to not putting code (other than visibility/aggregation) in `__init__.py` – something I'd advocate for on any Python codebase of any size. The directory structure tips seem ok.

As for versioning dependencies together at the top level, again I'm unfamiliar with Rust, but this tip seems too specific. I've seen large codebases where having different dependency versions was fairly important or even a core use-case. It's nice if it's possible, but the bigger the codebase the more this will necessarily come into question.

I'd love to see more detail about things like the design of internal APIs that works well for large Rust codebases, approaches to ownership that help scale the codebase, approaches to testing that work or don't work in Rust specifically.


Not clear why Docker and "make" should be involved, except that he likes them. Cargo and git can do the job. A build usually produces a single executable, so "packaging" isn't that much of an issue.

There's much that could be said on this subject. How much should be made visible via "mod.rs" and "lib.rs" files? How much lock granularity is appropriate? Should projects be composed of loosely coupled crates? How much should be global? Anything? How do you deal with the problem that if A owns B, B can't reference A? How do you deal with breaking changes to low-level public crates?

The article doesn't say much about any of that. A good article on this subject would be helpful.


I thought it explained Docker and Make pretty well, but to reiterate:

You can use a Dockerfile to provide a Dev container. If you aren't familiar with those it basically lets IDEs open your project inside a Docker image without you having to faff around with Docker. This can give you a reproducible dev environment and make getting dependencies easier for new devs.

You don't need it if your only dependency is Rust, but that may not be the case for larger projects. E.g. I have one that needs OCaml, and that is an absolute nightmare to install (worse than Python!) so I made a Dev container that had it preinstalled.

He suggests Make so that you have common commands written down in a place where people actually run them (i.e. not a README where they are likely to rot). This helps new contributors discover how to do common tasks. This isn't using Make as a build system, just as a command runner. You could equally use Just or even shell scripts if you really hate Windows users... and yourself.

Hope that helps!


If you need containers to build, then you need containers to deploy. While containers are good for getting devs started with development, they should not be the only dev/testing environment.


> If you need containers to build, then you need containers to deploy

That's not true. I've seen plenty of projects that set up a complex build pipeline made up of scripting languages and wrappers that are a pain to install, but the final product doesn't need any of that to run.

For instance, NodeJS web projects can require all manner of transpilers, minimisers, and what have you, but the end result is often a few large text files that will run in any browser if just served as static files.

If you use containers to build, you may as well use containers to deploy, but you don't need to deploy using containers at all. Just copy the compiled end product out of your dev container.


Nobody is suggesting that.


A few years ago I bought the book written by the same author. A few chapters in, and I now filter out anything written by Kerkour.

This time I missed who the author was and started reading. A few paragraphs in, I'm scrolling to the bottom to see who the author is. Yep, they did it again.

I have a suspicion the author might be a recent graduate with only Rust under their belt and not much experience in the industry, trying to quickly climb the career ladder by pretending to be something they are not.


Did what again? This seems like solid advice to me. What specifically do you disagree with?


Well, the advice not to use lib and mod is a sound one; I realized that too. Workspaces: definitely, although that's of borderline usefulness, because it's hard for me to imagine someone growing their Rust codebase without knowing about this feature. The other tips continue the downward trend.

It's hard to disagree with these tips, because they argue for general good (although devcontainer is a controversial one, imo). But anyone with minimal hands-on experience likely already knows all of that.

Honestly, reading this article a second time, I think my initial comment might've been too harsh. Maybe I was salty after remembering that book by the author that I didn't like. But I still stand by my words: the article turned out somewhat superficial and not insightful.


> As for versioning dependencies together at the top level, again I'm unfamiliar with Rust, but this tip seems too specific. I've seen large codebases where having different dependency versions was fairly important or even a core use-case. It's nice if it's possible, but the bigger the codebase the more this will necessarily come into question.

What you can do in your toml files is give the "workspace version" of every dependency at the top level, and then in the individual projects say whether to use the workspace version or some other specific version. That way, most of your projects will use the same version, while the few that need a specific one can still pin it.
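
For anyone who hasn't seen the mechanism, a small sketch of what that looks like (crate and dependency names made up):

    # Root Cargo.toml of the workspace
    [workspace]
    members = ["crates/api", "crates/worker"]

    [workspace.dependencies]
    serde = { version = "1", features = ["derive"] }
    tokio = { version = "1", features = ["full"] }

    # crates/api/Cargo.toml
    [dependencies]
    serde = { workspace = true }   # uses the version pinned at the top level
    tokio = { workspace = true }
    # a crate that really needs its own version can still pin one explicitly:
    # rand = "0.7"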


The suggestions are purely about file paths, and don't affect the code. It's like advising "symlink __init__.py as __myproject_init__.py". The author opens files via search and doesn't want to have multiple files with the same name.

Declaring dependencies in the workspace ensures that the same version is used by every library you're writing. Versions are generally deduplicated anyway, but workspace makes it easier to bump versions and configure defaults in one place.


kerkour is basically a spam blog.


Misusing makefiles as a poor man's bash script with a switch statement.

I still don't get it.

If you write a bash or POSIX shell script then you need to learn bash/POSIX shell (or suffer the consequences of not learning it).

If you write a Makefile and ignore 99% of its features, you now also need to learn GNU or POSIX make (or suffer the consequences of not learning it).

Strictly a worse deal and an extra dependency.

And no, "everyone uses make this way so it's a common interface which everyone agrees on" is not the answer.

1. Nobody agrees on target names other than `all` and `clean`. So you need some documentation or some help target.

2. Nobody agrees on a help target. So you need documentation or force your developers to read the makefile.

3. Nobody remembers that parallel make does not guarantee the execution order of the targets. So if someone stumbles into your project directory and runs `make -j 8` they will likely be surprised when everything breaks and you will need to document this or force them to read the Makefile.

4. Not every project actually uses Makefiles: a bunch of them use the kind of script(s) I mentioned, some use purpose-built "task runners" (i.e. the bits of Make which people who misuse makefiles actually like, except more ergonomic), some rely on people knowing how to build the project using the normal build tools, and some just document the commands you need. Either way, you're going to have to document which approach you're using in a `^README(\..*)?$` file, or rely on your developers to ls and discover things for themselves, or to try random commands until something works or gives a useful error message.

So please, if all your project does is use the language default build system, don't bother misusing Make to allow people to run `make` to either a: have the project build itself, b: have the project build and run itself, c: have the makefile dump out a help output, or d: get an error because you used GNU make and I'm on BSD.

If your project needs some non-trivial commands to build it, shove them in a script. If there's more than one command, either use a collection of scripts at the root, or in a sub directory, or a single script with a case statement and a help output. Also, document this in the `^README(\..*)?$`.

Maybe I should be writing this kind of stuff in my blog rather than writing long-form HN comments. I am certainly


I use Makefiles to build Go applications for two reasons: it is available on the systems we develop on, and it provides a real alternative to writing bash scripts.

Build scripts are bad. They tend to lead to complexity over time. It is much easier to just add some ad-hoc magic to solve a problem than to sit down and think about how one should structure things better. This makes it harder for people unfamiliar with the codebase to understand what is going on.

Then you need to choose a language. Bash? Well, not a lot of people know Bash all that well. And it isn't a very nice language to express the kinds of things you need to deal with to solve build problems. People don't bother learning Bash because Bash isn't a very useful language.

Python? What version? What libraries? Now you need to choose what tooling to use for maintaining your tooling. The embedded world uses lots of Python. The short version: it only makes things worse. People waste time having to learn how to maintain the tooling. If they don't, they will struggle.

Makefiles were made for building C. They are tailored to how the C toolchain works. When you make "proper" use of Makefiles this can provide you with a lot of help, if you program C. If you are programming in other languages: not so much.

The good thing about using Makefiles with as little magic as possible is that it forces people to keep things simple. It is okay that you don't make use of 99% of its features. Not everything has to be a demonstration of your mad <tool> skillz.

That being said: it would be nice to have a more approachable alternative to Make that can be used across languages. Something that promotes declarative builds rather than allowing scripting. Because if you allow people to write code to build things, you are not going to be able to understand most people's builds without setting aside time to study them.


You seem to be describing the use of makefiles for the purpose they're intended for: Describing a DAG where each node is a file and each edge is a build step. This is not a misuse of Make and not what I'm complaining about.

That being said, you're under the impression that you're somehow not writing posix shell/bash when writing a makefile. You still are.


Sure, but we're often making sparing use of what make has to offer, like the pattern rules. I think that's okay, but it is also a reminder that perhaps there is room for a similar tool that can support patterns and built-in mechanisms that are more useful to today's languages. Rust and Go in particular.

I think there is a huge difference between implementing logic in bash and invoking programs using shell syntax.

I am aware of the fact that a lot of people used to put shell scripting beyond simple invocations in Makefiles, but I see that as a degenerate case. You shouldn't do that.


> I am aware of the fact that a lot of people used to put shell scripting beyond simple invocations in Makefiles, but I see that as a degenerate case. You shouldn't do that.

Okay, so I am not sure why you seemed to object to my initial comment specifically talking about this kind of use of Make.


I was guilty of this habit of misusing Makefiles for a few years. I see why it's appealing but you're totally right for all of the given reasons. Nowadays I use (and highly recommend) using standard-library Python for this sort of thing. It's much more readable than shell scripts and it's completely cross-platform so you can use the same script for Windows, Mac, Linux.


This is probably the worst advice on Rust project organisation I've ever read.

I absolutely loathe opening up a Rust project that adheres to zero defaults and uses makefiles where cargo would completely suffice. Unless you have a mixed codebase, cargo will work just fine. Also, those multiple-Cargo.toml shenanigans are much better solved with multiple projects and git dependencies, given that a project of that size will have different teams anyway, and if it doesn't have different teams then you don't need anything but an easy-to-maintain-and-debug monolith.


In a future post, I would love to see a discussion of organizing your software abstractions, rather than just organizing files. For example, I'd love to know some ideas for keeping prior versions of structs around when you are partially upgrading some schema or other. Or about when to use macros to automate in-repo cleanups.

Rust is definitely a beast when managing 200K+ LOC codebases and just managing compiler woes becomes a challenge. Not to mention dealing with issues with compiler targets, feature flags, etc. It definitely goes deeper than file structure.


We've done large schema migrations in a 1 million LoC Rust codebase. Most of the time, these are pretty straightforward. You mark things as deprecated and link a tracking ticket for the migration in the doc-comment of the field/struct.

What's a lot more difficult is making sure migrations don't break important forwards/backwards compatibility guarantees. Unfortunately, Rust won't help you much with these. One solution we've been using in our team to get confidence in the correctness of our schema evolution is to use proc-macros to generate a somewhat exhaustive set of test data, so that every field and enum variant is included at least once. In CI we test the branch version against master, and let each version serialize/deserialize the other version's test cases.
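
Not their actual setup, but a much-simplified sketch of the round-trip idea (without the proc-macro-generated test data), assuming serde with the derive feature and serde_json:

    use serde::{Deserialize, Serialize};

    // Hypothetical "old" schema as it exists on master.
    #[derive(Serialize, Deserialize)]
    struct UserV1 {
        id: u64,
        name: String,
    }

    // Hypothetical "new" schema on the branch: a field added with a default
    // so that old payloads still parse.
    #[derive(Serialize, Deserialize)]
    struct UserV2 {
        id: u64,
        name: String,
        #[serde(default)]
        email: Option<String>,
    }

    #[test]
    fn old_data_is_readable_by_new_schema() {
        // Backwards compatibility: new code can read data written by old code.
        let json = serde_json::to_string(&UserV1 { id: 1, name: "a".into() }).unwrap();
        let _new: UserV2 = serde_json::from_str(&json).unwrap();
    }

    #[test]
    fn new_data_is_readable_by_old_schema() {
        // Forwards compatibility: old code can read data written by new code
        // (UserV1 ignores the unknown `email` field by default).
        let json = serde_json::to_string(&UserV2 { id: 1, name: "a".into(), email: None }).unwrap();
        let _old: UserV1 = serde_json::from_str(&json).unwrap();
    }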


Very cool, I love that! I had a similar idea and it’s great to see that it’s worth exploring


Even though large projects might have different requirements than 'regular-sized' projects, some things don't make much sense to me, TBH:

What's the point of a development container when there's already Cargo which is supposed to manage dependencies? If there are any system-wide dependencies required which need to be installed outside Cargo I would consider the build system incomplete or broken?

Why is there a separate Makefile instead of using Cargo for all build tasks?

E.g. one of the great advantages of the Rust ecosystem (compared to the C/C++ or Javascript worlds) is that Rust has standard tooling. Why bring in additional tools from outside the Rust ecosystem?


It is very common to have dependencies outside of Cargo. Take cc, the most popular crate for calling the system C/C++ compiler. It currently has 1983 other crates that depend on it [1].

[1]: https://crates.io/crates/cc/reverse_dependencies


Cargo doesn’t handle non-Rust dependencies. For example, the protobuf compiler when called from build.rs.

People use devcontainers, Nix flakes or large build systems like Bazel to describe the build environment.
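
For example, a typical build.rs sketch, assuming the prost-build crate (recent versions of which expect a `protoc` binary on the system, which is exactly the kind of dependency cargo will not install for you):

    // build.rs
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Generates Rust code from the schema; fails if `protoc` is missing.
        prost_build::compile_protos(&["proto/service.proto"], &["proto/"])?;
        println!("cargo:rerun-if-changed=proto/service.proto");
        Ok(())
    }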


https://github.com/andrewhickman/protox is a pure-Rust protobuf compiler.

The argument works better for the likes of libclang.


Thank you for the recommendation.


Having multiple Cargo.toml files across workspaces wreaks absolute havoc with editors like VS Code, which perform builds automatically while you work. Makes the debug cycle waaay slower, which is a tradeoff for the faster compile times I suppose?

Unless someone has a way of handling workspaces in VS Code, in which case please let me know


I can answer this!

As suggested in the article, have your dependencies and their versions fixed in your top-level Cargo.toml. Every other crate in the workspace should only reference them via "dependency.workspace = true".

Set your VSCode workspace to the directory with your top level cargo.toml in it.

At my work, we have a pretty massive core library shared between 10-ish apps. I have it laid out as above and don't have major issues with compilation times.


Nah this doesn't work. Codebase is set up in this way already but VS Code just isn't having it :-\


There's a setting that disables checks on save or something. I found it somewhere and it speeds up everything so much. I'm too noob to tell you if it has any downsides, though.


Another tip I learned by looking at the Zed IDE codebase: for small, single-purpose workspace crates, rename `lib.rs` to something meaningful instead of polluting the codebase with a bunch of lib.rs files.
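
One way to do that is Cargo's `path` key for the library target, roughly like this (file name illustrative):

    # Cargo.toml of a small, single-purpose crate
    [lib]
    path = "src/telemetry.rs"   # instead of the default src/lib.rs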


Some good advice, some bad advice in here. This is necessarily going to be opinionated.

> Provide a development container

Generally unneeded. It is expected that a Rust project builds with a plain `cargo build`; don't deviate from that. People can `git clone` and `code .`.

Now, Docker might be needed for deployment. As much as I personally dislike Docker, at Meilisearch we provide a Docker image, because our users use it.

This is hard for me to understand as a Rust dev, since we provide a single executable binary, but I'm not in devops and I guess they have good reason to prefer docker images.

> Use workspaces

Yes, definitely.

> Declare your dependencies at the workspace level

Maybe, when it makes sense. Some deps have distinct versions by design.

> Don't use cargo's default folder structure

*Do* use cargo's default folder structure, because it is the default. Please, don't be a special snowflake that decides to do things differently, even with a good reason. The described hierarchy would be super confusing for me as an outsider discovering the codebase. Meanwhile, VS Code pretty much doesn't care that there's an intermediate `src` directory. Not naming the root of the crate `lib.rs` also makes it hard to actually find the root component of a crate. Please don't do this.

> Don't put any code in your mod.rs and lib.rs files

Not very useful. Modern IDEs like VS Code will let you define custom patterns so that you can match `<crate-name>/src/lib.rs` to `crate <crate-name>`. Even without doing this, a lot of the time your first interaction with a crate will be through docs.rs or a manual `cargo doc`, or even just the autocomplete of your IDE. Then, finding the definition of an item is just a matter of asking the IDE (or grepping for the definition, which is easy to do in Rust since all definitions have a prefix keyword such as `struct`, `enum`, `trait` or `fn`).

> Provide a Makefile

Please don't do this! In my experience, Makefiles are brittle, push people towards non-portable scripts (since the Makefile uses a non-portable shell by default), `make` is absent by default in certain systems, ...

Strongly prefer just working with `cargo` where possible. If not possible, Rust has a design pattern called `cargo xtask`[1] that allows adding cargo subcommands that are specific to your project, by compiling a Rust executable that has a much better chance of being portable and well documented. If you must, use `cargo xtask`.
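
For anyone who hasn't seen the pattern, a minimal sketch (task names illustrative): an `xtask` workspace member plus a cargo alias, so project-specific commands are written as ordinary Rust:

    // xtask/src/main.rs -- paired with a .cargo/config.toml containing:
    //   [alias]
    //   xtask = "run --package xtask --"
    use std::env;
    use std::process::Command;

    fn main() {
        let task = env::args().nth(1);
        match task.as_deref() {
            Some("ci") => {
                // Run the same checks CI runs, as plain cargo invocations.
                run("cargo", &["fmt", "--", "--check"]);
                run("cargo", &["clippy", "--", "-D", "warnings"]);
                run("cargo", &["test"]);
            }
            Some("dist") => {
                run("cargo", &["build", "--release"]);
                // ...then copy artifacts, build archives, etc.
            }
            _ => eprintln!("usage: cargo xtask <ci|dist>"),
        }
    }

    fn run(cmd: &str, args: &[&str]) {
        let status = Command::new(cmd).args(args).status().expect("failed to spawn");
        assert!(status.success(), "`{cmd}` failed");
    }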

> Closing words

I'm surprised not to find a word about CI workflows, which are in my opinion key to sanely growing a codebase (in Rust there's no reason not to have them even on smaller repos, but they quickly become a necessity as more code gets added).

They will ensure that the project:

- has no warning on `main` (allowed locally, error in CI)

- is correctly formatted (check format in CI)

- has passing tests (check tests in CI, + miri if you have unsafe code, +fuzzer tests)

- is linted (clippy)

[1]: https://github.com/matklad/cargo-xtask


> This is hard for me to understand as a Rust dev, since we provide a single executable binary, but I'm not in devops and I guess they have good reason to prefer docker images.

It's mainly about isolation, orchestration and to prevent supply chain attacks.


+1 to everything here. TL;DR: predictable > most things

The development container idea, however, is useful when you're dealing with any type of distributed system, as it allows you to develop against a known setup of those things (e.g. your database, website, API service(s), integration with external non-Rust software, etc.).

You missed pointing out the cmd folder. I suspect the author watched https://www.youtube.com/watch?v=LghqbUoXEI4 and wrote this post based on what they saw.



