You're doing the lords work. I often get pushback on doing this with some variat...

derefr · on July 20, 2023

“Why” isn’t a property of the code itself, though; it’s rather a property of the process of creating the code.

Because of this, my personal belief is that the justification for any line of code belongs in the SCM commit message that introduced the code. `git blame` should ultimately take you to a clear, stand-alone explanation of “why” — one that, as a bonus, can be much longer than would be tolerated inline in the code itself.

Of course, I’m probably pretty unusual, in that I frequently write paragraphs of commit message for single-line changes if they’re sufficiently non-obvious; and I also believe that the best way to get to know a codebase isn’t to look at it in its current gestalt form, but rather to read its commit history forward from the beginning. (And that if other programmers also did that, maybe we’d all write better commit messages, for our collective future selves.)

yuliyp · on July 21, 2023

I think there's room for the comments having a brief explanation and a longer one in the commit message. Sometimes people need the callout to go looking at the history because otherwise they might not realize that there's anything significant there at all.

derefr · on July 21, 2023

The "brief explanation" is what the commit message's subject line is for :) Just turn on "show `git blame` beside code" in your IDE, and you'll get all the brief explanations you could ever want!

...but no, to be serious for a moment: this isn't really a workable idea as-is, but it could be. It needs discoverability — mostly in not just throwing noise at you, so that you'll actually pay attention to the signal. If there was some way for your text editor or IDE to not show you all the `git blame` subject lines, but just the ones that were somehow "marked as non-obvious" at commit time, then we could really have something here.

Personally, I think commit messages aren't structured enough. Github had the great idea of enabling PR code-reviewers to select line ranges and annotate them to point out problems. But there's no equivalent mechanism (in Github, or in git, or in any IDE I know of) for annotating the code "at commit time" to explain what you did out-of-band of the code itself, in a way that ends up in the commit message.

In my imagination, text-editors and IDEs would work together with SCMs to establish a standard for "embeddable commit-message code annotations." Rather than the commit being one long block of text, `git commit -p` (or equivalent IDE porcelain) would take you through your staged hunks like `git reset -p` does; but for each hunk, it would ask you to populate a few form fields. You'd give the hunk a (log-scaled) rating of non-obviousness, an arbitrary set of /[A-Z]+/ tags (think "TODO" or "XXX"), an eye-catching one-liner start to the explanation, and then as much additional free-text explanation as you like. All the per-hunk annotations would then get baked into a markdown-like microformat that embeds into the commit message, that text-editors/IDEs could recognize and pull back out of the commit message for display.

And then, your text-editor or IDE would:

1. embed each hunk's annotation-block above the code it references (as long as the code still exists to be referenced — think of it as vertical, hunk-wise "show `git blame` beside code");

2. calculate a visibility score for each annotation-block based on a project-config-file-based, but user-overridable arbitrary formula involving the non-obviousness value, the tags, the file path, and the lexical identifier path from the syntax highlighter (the same lexical-identifier-path modern `git diff` gives as a context line for each diff-hunk);

3a. if the visibility score is > 1, then show the full annotation-block for the hunk by default;

3b. else, if the visibility score is > 0, then show the annotation-block folded to just its first line;

3c. else, hide the annotation-block (but you can still reveal the matching annotation with some hotkey when the annotated lines are focused.)

Of course, because this is just sourced from (let's-pretend-it's-immutable) git history, these annotation-block lines would be "virtual" — i.e. they'd be read-only, and wouldn't have line-numbers in the editor. If the text-editor wants to be fancy, they could even be rendered in a little sticky-note callout box, and could default to rendering in a proportional-width font with soft wrapping. Think of some hybrid of regular doc-comments, and the editing annotations in Office/Google Docs.

---

...though, that's still not going as far as I'd really like. My real wish (that I don't expect to ever really happen) is for us to all be writing code as a series of literate codebase migrations — where your editor shows you the migration you're editing on the left, and the gestalt codebase that's generated as a result of running all migrations up to that one on the right, with the changes from the migration highlighted. You never directly edit the gestalt codebase; you only edit the migrations. And the migrations are what get committed to source-control — meaning that any code comments are there to be the literate documentation for the changes themselves; while the commit messages exist only to document the meta-level "editing process" that justifies the inclusion of the change.

Why? Because the goal is to structure the codebase for reading. Such a codebase would have one definitive way to learn it: just read the migrations like a book, front to back; watching the corresponding generated code evolve with each migration. If you're only interested in one component, then filter for just the migrations that make changes to it (`git log -S` style) and then read those. And if you, during development, realize you've just made a simple typo, or that you wrote some confusing code and later came up with a simpler way to express the same semantics — i.e. if you come up with code that "you wish you could have written from the start" — then you should go back and modify the earlier migration so that it's introduced there, so that new dev reading the code never has to see you introduce the bad version and then correct it, but instead just sees the good version from the start.

In other words: don't think of it as "meticulously grooming a commit history"; instead, think of it as your actual job not being "developer", but rather, as you (and all your coworkers) being the writers and editors of a programming textbook about the process of building program X... which happens to compile to program X.

jimmaswell · on July 21, 2023

I don't understand the focus on commit messages. I never read the git log. You can't assume anyone reading the code has access to the commit history or the time to read it. The codebase itself should contain any important documentation.

derefr · on July 22, 2023

> I never read the git log.

Well, no, because 1. it's not useful, because 2. most people never write anything useful there (which is a two-part vicious cycle), and 3. editors don't usefully surface it.

If we fix #3; and then individual projects fix #2 for themselves with contribution policies that enforce writing good commit messages; then #1 will no longer be true.

> You can't assume anyone reading the code has access to the commit history or the time to read it.

You can if you're the project's core maintainer/architect/whoever decides how people contribute to your software (in a private IT company, the CTO, I suppose.) You get to decide how to onboard people onto your project. And you get to decide what balance there will be between the amount of time new devs waste learning an impenetrable codebase, vs. the amount of time existing devs "waste" making the codebase more lucid by explaining their changes.

> The codebase itself should contain any important documentation.

My entire point is that commit messages are part of "the codebase" — that "the codebase" is the SCM repo, not some particular denuded snapshot of an SCM checkout. And that both humans and software should take more advantage of — even rely upon! — that fact.

jimmaswell · on July 22, 2023

I've been in enough projects that changed version control systems, had to restart the version control from a snapshot for whatever reason (data loss, performance issues with tools due to the commit history, etc) that I wouldn't want to take this approach.

> amount of time new devs waste learning an impenetrable codebase, vs. the amount of time existing devs "waste" making the codebase more lucid by explaining their changes.

That's a false dichotomy. The codebase won't be impenetrable if there are appropriate comments in it. In my experience time would be better spent making the codebase more lucid in the source code than an external commit history. The commit messages should be good too but I only rely on them when something is impossible to understand without digging through and finding the associated ticket/motivation, which is a bad state to be in, so at that point a comment is added. Of course good commit messages are fine too, none of this precludes them.

funcDropShadow · on July 23, 2023

Agreed on most points, but a good SCM provides also the ability to bisect bugs and to show context that is hard to capture by explicit comments. E.g. what changed at the same time as some other piece of code. What was changed by the same person a few days before and after some line got introduced.

Regarding your first point:

> I've been in enough projects that changed version control systems

I have the impression that with the introduction of git, it became suddenly en-vogue to have tools to migrate history from one SCM to another. Therefore, I wouldn't settle on restarting from a snapshot anymore.

With git you can cut off history that is too old but weighs down the tools. You can simplify older history for example, while keeping newer history as it is. That is of course not easy but it can be done with git and some scripting.

lmm · on July 24, 2023

> I've been in enough projects that changed version control systems, had to restart the version control from a snapshot for whatever reason (data loss, performance issues with tools due to the commit history, etc) that I wouldn't want to take this approach.

When was that? I've never seen that in 15-20 years of software development; I've seen plenty of projects change VCS but they always had the option of preserving the history.

yuliyp · on July 22, 2023

Sure it's there, but having to wade through a large history of experiments tried and failed when trying to answer why this thing is here right now just feels subpar. Definitely sometimes it's helpful to read the commits which introduced a behavior, but that feels like a fallback when reading the code as it exists now. It works, but is much slower.

funcDropShadow · on July 23, 2023

> Sure it's there, but having to wade through a large history of experiments tried and failed when trying to answer why this thing is here right now just feels subpar

That is one the downsides of trunk-based developments. One keeps a history of all failed experiments, the usefulness of the commit history deteriorates. That is for reading commit messages as well as for bisecting bugs.

derefr · on July 22, 2023

Which is why I said this in my previous comment:

> In other words: don't think of it as "meticulously grooming a commit history"; instead, think of it as your actual job not being "developer", but rather, as you (and all your coworkers) being the writers and editors of a programming textbook about the process of building program X... which happens to compile to program X.

If you have to "wade through experiments" to read the commit history, that means that the commit history hasn't had a structural editing pass applied to it.

Again: think of your job as writing and editing a textbook on the process of writing your program. As such, the commit history is an entirely mutable object — and, in fact, the product.

Your job as editor of the commit-history is, like the job of an editor of a book, to rearrange the work (through rebasing) into a "narrative" that presents each new feature or aspect of the codebase as a single, well-documented, cohesive commit or sequence of commits.

(If you've ever read a programming book that presents itself as a Socratic dialogue — e.g. Uncle Bob's The Clean Coder — each feature should get its own chapter, and each commit its own discussion and reflected code change.)

Experiments? If they don't contribute to the "narrative" of the evolution of the codebase — helping you to understand what comes later — then get rid of them. If they do contribute, then keep them: you'll want to have read about them.

Features gradually introduced over hundreds of commits? Move the commits together so that the feature happens all at once; squash commits that can't be understood without one-another into single commits; break commits that can be understood as separate "steps" into separate commits.

After factoring hunks that should have been independent out into their own commits, squashing commits with their revert-commits, etc., your commit history, concatenated into a file, should literally be a readable literate-programming metaprogram that you read as a textbook, that when executed, generates your codebase. While also still serving as a commit history!

(Keeping in mind that you still have all the other things living immutably in your SCM — dead experiments in feature branches; a develop branch that immutably reflects the order things were merged in; etc. It's only the main branch that is groomed in this fashion. But this groomed main branch is also the base for new development branches. Which works because nobody is `git merge`ing to main. Like LKML, the output-artifact of a development branch should be a hand-groomed patchset.)

And, like I said, this is all strictly inferior to an approach that actually involves literate programming of a metaprogram of codebase migrations — because, by using git commit-history in this way, you're gaining a narrative view of your codebase, but you're losing the ability to use git commits to track the "process of editing the history of the process of developing the program." Whereas, if you are actually committing the narrative as the content of the commits, then the "process of editing the history" is tracked in the regular git commits of the repo — which themselves require no grooming for presentation.

But "literate programming of a metaprogram that generates the final codebase" can only work if you have editor support for live-generating+viewing the final codebase side-by-side with your edits to the metaprogram. Otherwise it's an impenetrably-thick layer of indirection — the same reason Aspect-Oriented Programming never took off as a paradigm. Whereas "grooming your commit history into a textbook" doesn't require any tooling that doesn't already exist, and can be done today, by any project willing to adopt contribution policies to make it tenable.

---

Or, to put this all another way:

Imagine there is an existing codebase in an SCM, and you're a technical writer trying to tell the story of the development of that codebase in textbook form.

Being technically-minded, you'd create a new git repo for the source code of your textbook — and then begin wading through the messy, un-groomed commit history of the original codebase, to refactor that "narrative" into one that can be clearly presented in textbook form. Your work on each chapter would become commits into your book's repo. When you find a new part of the story you want to tell, across several chapters, you'd make a feature branch in your book's repo to experiment with modifying the chapters to weave in mentions of this side-story. Etc.

Presuming you finish writing this textbook, and publish it, anyone being onboarded to the codebase itself would then be well-advised to first read your textbook, rather than trying to first read the codebase itself. (They wouldn't need to ever see the git history of your textbook, though; that's inside-baseball to them, only relevant to you and any co-editors.)

Now imagine that "writing the textbook that should be read to understand the code in place of reading the code itself" is part of the job of developing the program; that the same SCM repo is used to store both the codebase and this textbook; and that, in fact, the same textual content has to be developed by the same people under the constraints of solving both problems in a DRY fashion. How would you do it?

funcDropShadow · on July 23, 2023

> I never read the git log.

Imho, you are missing out on a great source of insight. When I want to understand some piece of code, I usually start navigating the git log from git blame. Even just one-line commit messages that refer to a ticket can help understanding tremendously. Even the output of git blame itself is helpful. You can see which lines changed together in which order. You see, which colleague to ask questions.

funcDropShadow · on July 23, 2023

One could accomplish a part of your vision, today, by splitting commits into smaller commits. Then you have the relation between hunks of changes and text in the commit message on a smaller level. Then you can use branches with a non-trivial commit message in the merge commit to document the whole set of commits.

As far as I know changes to the Linux kernel are usally submitted as a series of patches, i.e. a sequence of commits. I.e. a branch, although it is usually not represented as git branch while submitting.

jrs235 · on July 21, 2023

I disagree.

WHY: Using _______ sort because at time of writing code, the anticipated sets are ... and given this programming language and environment ... this tends to be more performant (or this code was easiest and quickest to write and understand).

This way when someone later come along and says WTF?! They know why or at least some of the original developers reasoning for choosing that code implementation.

ericbarrett · on July 20, 2023

Completely agree. I find this attitude toward (against?) comments is common in fields where code churn is the norm and time spent at a company rarely exceeds a year or two. When you have to work on something that lasts more than a few seasons, or are providing an API, comments are gold.

sedev · on July 20, 2023

Concurrence: the ability of the code to be self-documenting ends at the borders of the code itself. Factors outside of the code that impose requirements on the code must be documented more conventionally. The "sleep 10 seconds to (hopefully) ensure that a file has finished downloading" anecdote from upthread is a great example.

scubbo · on July 21, 2023

You may enjoy https://brooker.co.za/blog/2020/06/23/code.html

MrBuddyCasino · on July 21, 2023

I indeed enjoyed.

mostlylurks · on July 20, 2023

Self-documenting code is perfectly capable of expressing the "why" in addition to the "what". It's just that often the extra effort and/or complexity required to express the "why" through code is not worth it when a simple comment would suffice.

tetha · on July 20, 2023

> Self-documenting code is perfectly capable of expressing the "why" in addition to the "what".

I don't think anymore that's true, at least in a number of areas.

In another life, I've worked on concurrent data structures in java and/or lock-free java code I'd at this point call too arcane to ever write. The code ended up looking deceptively simple and it was the correct, minimal set of code to write. I don't see any way to express the correctness reasoning for these parts of code in code.

And now I'm dealing with configuration management. Code managing applications and systems. Some of these applications just resist any obvious or intuitive approach, and some others exhibit destructive or catastrophic behavior if approached with an intuitive approach. Again, how to do this in code? The working code is the smallest, most robust set of code I can setup to work around the madness. But why this is the least horrible way to approach this, I cannot express that in code. I can express this in comments.

scubbo · on July 21, 2023

This is a statement which is technically-true (so long as your language of choice has no length on names), but unhelpful since it does not apply in most practical cases.