We’re currently running two machines (master and standby) at M5 Hosting. All of HN runs on a single box, nothing exotic:
CPU: Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz (3500.07-MHz K8-class CPU)
FreeBSD/SMP: 2 package(s) x 4 core(s) x 2 hardware threads
Mirrored SSDs for data, mirrored magnetic for logs (UFS)
HN is a very simple application. Handling a high volume of traffic for a simple application is a very different problem from scaling a highly complex application.
HN is simple, yes. But it could be made more complicated. Personalized feed and data analytics are two complicated things that come to mind. Staying simple is often a choice, and it’s a choice not many companies make.
HN is a straightforward forum. Reddit is one level above that: generalized forums as a service.
Anything HN has had to implement, Reddit has to implement at a generalized, user-facing level, like mod tools.
Frankly, we underestimate how hard forums are, even simple ones. I learned this the hard way rebuilding a popular vBulletin forum into a bespoke forum system.
Every feature people expect from a forum turns into a fractal of smaller moving parts, considerations, and infinite polish. Letting users create and manage forums is an explosion of even more things that used to be simple private /admin tools.
Mod tools aren't accessed or used by all users, so the load they put on the servers is probably negligible.
I agree, most software is deceptively simple from the outside. Once you start building it, you become more humble about the effort required to build anything moderately complex.
Mod tools aren't used by the majority of users, correct. But the existence of mod tools does make the logic and assumptions of the application different. Now you've got a whole set of permissions and permissions checks, additional interfaces, more schema, etc.
It's not that the mod tools are constantly being used, it's that there's now potentially far more code complexity for those tools to even exist.
Is Reddit really a complex application (regardless of how they build, scale, or deploy it)? Although that makes me wonder: what makes an application complex?
Hacker News changes more often than people think, just not the layout because people here are weirdly fetishistic about it.
Since I've been here they've added vouching for banned users (and actually warning people beforehand), thread folding, Show HN, making the second chance pool public, thread hiding, the past page, various navigation links, and the API. They've also been trying to get a mobile stylesheet to work. They've also mentioned making various changes for spam detection and performance. And the URL now automatically loads a canonical version if it finds one, and the title is now automatically edited for brevity. And I've probably missed a few things.
And HN isn't a simple application by any means. Go look at the Arc Forum code - it isn't optimized for readability, scalability, or reliability, but for joy - for the vibe of experimental academic hacking on a Lisp. It's made of brain farts. Hacker News is probably significantly more complex than that, being attached to an SV startup company and running 'business code' and whatnot.
I mean, that’s not really that much, is it? And that’s the point: HN really doesn’t change much. Whereas Reddit, for better or for worse, has a much higher output of new user-facing features.
When a VC gives you a giant boatload of money, they insist you "scale up" the company overnight. So you go on a massive hiring spree and get a triple-digit team of engineers before having any market traction.
And they're tasked with building a product that can handle Google-levels of demand, though they currently only have two customers, neither of them paying.
It indeed is imperative, but not for technical reasons.
I would take the money and then do none of that. Now I've got five years of runway, enough time to build a product people like and use, and by then the investors won't be angry anymore.
HN is perhaps the most user friendly site I go to with regularity.
The idea that a website needs to be “rich” to be usable is one of the dumbest things the industry has convinced itself of in the last 20 years (following only ‘yaml is a smart way to encode infrastructure’).
To be fair, it's not as much user-friendly as it is simple, and simple tends to be easier to understand.
For example, if it was more user-friendly, it could have links to jump between root comments, because right now very popular top comments tend to accumulate most interactions, and scrolling down several pages to find the next root thread requires effort.
The people who push the other direction also bring few or no metrics. I.e. there is often no reason to add <bag of features>, except a customer (who didn't buy the product yet) mentioned them as nice to have during initial sales talks.
IMO the solution to YAML-as-config is a strict subset of YAML.
JSON is one strict subset, but one whose trade-offs favor strictness and machines: error detection and types encoded directly in the syntax.
We decided on a different subset of YAML for our users who were modifying config by hand (even more strict than StrictYAML). Some of the biggest features of YAML are that there is no syntax typing and that the collection syntax is simple (the latter is also true of JSON, but not of TOML).
For example, a string and a number look the same. This seems bad to us developers at first, but the user doesn't have to waste 20 minutes chasing down an unmatched quote when modifying config in a <textarea>. Beyond that, it's the same amount of work as making sure the JSON says `"age": 20` instead of `"age": "20"`; one format just has noisier syntax.
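To make the "no syntax typing" point concrete, here's a minimal sketch in Python using the strictyaml library (a real library, though the schema and values here are just illustrative): every scalar parses as a string, and the schema decides what it should become.

```python
# Illustrative only: strictyaml has no syntax-level typing. Every scalar is a
# string until a schema coerces it, so quoting mistakes can't change a type.
from strictyaml import load, Map, Str, Int

schema = Map({"name": Str(), "age": Int()})

config = """
name: Alice
age: 20
"""

data = load(config, schema).data
print(data)  # {'name': 'Alice', 'age': 20} -- the schema coerced age to an int
```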
>Stack Exchange which is way more rich and runs on small (relative) infra.
Yes, I've heard that SO runs on relatively simple and modest infra. And agree that would be a good example.
>HN is not user friendly
How so? I find the HN UX a refreshingly simple and effective experience. It might not have all the bells and whistles of newer discussion fora, but it doesn't obviously need them. I'd say it's a good example of form/function well suited to need. Not perfect perhaps, but very effective.
Try loading it on a 2G connection (2 bars, roughly 128 kbit/s; those are bits, not bytes). It loads almost instantly with no fuss. Now try loading virtually any other site on the same connection: if it ever loads at all without timing out, you'll be waiting over 10 minutes.
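Rough back-of-the-envelope numbers (the page sizes are assumptions for illustration, not measurements) show why:

```python
# 2 bars of 2G ~= 128 kbit/s = 16 kB/s of raw throughput, ignoring latency.
throughput_kBps = 128 / 8

hn_front_page_kB = 40             # assumed: HN's front page HTML is a few tens of kB
heavy_modern_page_kB = 10 * 1024  # assumed: a JS/image-heavy page weighing ~10 MB

print(f"HN front page: ~{hn_front_page_kB / throughput_kBps:.0f} s")             # a couple of seconds
print(f"heavy page:    ~{heavy_modern_page_kB / throughput_kBps / 60:.0f} min")  # 10+ minutes
```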
There was a YT preso from several years back where the Stack Exchange founder explained how it ran off just ~10 servers, and could run on half that many if needed. He stressed the simplicity of their architecture and that their problem space was massively cacheable, so the servers just had a few hundred GB of RAM and only had to do work to re-render pages; most of the time they could serve them from cache. It was a C#/.NET app.
So, I think there is a lot more in common than you think between HN and SO.
My pet peeves: no dark mode (sorely missed for reading in the dark), no indication at all that you've got replies (at least a tiny number next to threads, perhaps?), and up/downvote buttons that are too small to reliably tap on mobile. Oh, and enumeration support would be fantastic; the workarounds tend to be hard to read.
Other than that, I think it's delightfully ugly and lightweight.
I can't seem to find Harmonic in the iOS App Store, is it Android-only?
Also, HN apps tend to make it harder to send interesting things to Roam or the laptop or Safari's reading list, the website makes that really convenient.
I wouldn’t say it’s not user friendly but I understand where you are coming from. I also missed some more modern features/looks and decided to build my own open source client [0]. Feel free to give it a go to see if it’s more your taste!
I wonder if they use something like CARP[^1] for redundancy. Also, it strikes me as odd that they didn't go with ZFS for storage; it makes FS management _way_ easier for engineers who don't spend all their time on these kinds of operations.
You might ask what sort of filesystem maintenance they ever need to do. Replacing a disk is covered by the mirror. Backup is straightforward. The second system covers a lot more. If they need to increase hardware capacity, they can build new systems, copy in the background, and swap over with a few minutes of downtime.
(beginner question) How do they store the data? Is an SQL db overkill for such a use case? What would be the alternative, an ad-hoc filesystem based solution? Then how do the two servers share the db? And is there redundancy at the db level? Is it replicated somehow?
"ad-hoc filesystem based solution" is the closest of your definitions, I think. Last time I saw/heard, HN was built in Arc, a Lisp dialect, and use(s/d) a variant of this (mirrored) code: https://github.com/wting/hackernews
Check out around this area of the code to see how simple it is. All just files and directories: https://github.com/wting/hackernews/blob/master/news.arc#L16... .. the beauty of this simple approach is a lack of moving parts, and it's easy to slap Redis on top if you need caching or something.
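Not HN's actual code, but the general shape of that approach looks something like this toy Python sketch (the file layout and field names are made up):

```python
# Toy flat-file item store: one small file per item, an in-memory dict as the
# working set. Illustrative only, not how the Arc code actually does it.
import json
import os

DATA_DIR = "items"          # hypothetical data directory
os.makedirs(DATA_DIR, exist_ok=True)
items = {}                  # in-memory cache of everything loaded so far

def save_item(item_id: int, item: dict) -> None:
    items[item_id] = item
    with open(os.path.join(DATA_DIR, str(item_id)), "w") as f:
        json.dump(item, f)

def load_item(item_id: int) -> dict:
    if item_id not in items:
        with open(os.path.join(DATA_DIR, str(item_id))) as f:
            items[item_id] = json.load(f)
    return items[item_id]

save_item(1, {"type": "comment", "by": "pg", "text": "hello"})
print(load_item(1)["text"])
```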
File syncing between machines is pretty much an easily solved problem. I don't know how they do it, but it could be something like https://syncthing.net/ or even some scripting with `rsync`. Heck, a cronned `tar | gzip | scp` might even be enough for an app whose data isn't exactly mission critical.
Wow, I had no idea HN was built like that - I'm impressed. I really wish I could read the Arc code better though since I'd love to know more about the details of how data is represented on disk and when things move in and out of memory, etc.
Does anyone know of other open source applications with architectures like this?
A single bare metal server is more reliable than most people think it is. Complexity adds a lot of overhead and layer after layer that could possibly fail.
A single server is much faster than most people think, too!
In the microservice or serverless arrangements I've seen, data is scattered across the cloud.
It's common for the dominant factor in performance to be data locality. Most of the time that talk is about avoiding trips to RAM, or worse, disk. But in our "modern" distributed cloud setups, finding a bit of data frequently involves a trip over the network: what used to be invoking a method on an account object in the monolith world has become making an HTTP POST to the accounts microservice.
What might have been a microsecond operation in the single-server world can become hundreds of milliseconds in the distributed cloud world. While you can't horizontally scale a single server, your 1000x head start in performance might delay scaling issues for a very long time.
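Back-of-the-envelope, with ballpark numbers rather than measurements (the hop count is an assumption):

```python
# Rough orders of magnitude, not benchmarks.
in_process_call_s = 1e-6   # ~1 microsecond: calling a method on an object already in memory
http_hop_s = 1e-3          # ~1 millisecond: an HTTP round trip to another service in the same DC
hops_per_request = 5       # assumed: a request that chains through 5 services in series

print(f"monolith:      {in_process_call_s * hops_per_request * 1000:.3f} ms")  # 0.005 ms
print(f"microservices: {http_hop_s * hops_per_request * 1000:.3f} ms")         # 5.000 ms
```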
When you put in even half the effort to set things up properly, a single server can handle a lot of load and traffic and get a lot of things done.
If you know some details of the services you're going to host on that hardware, the things you can do while saving a lot of resources look like black magic to many people who only ever deploy microservices to K8s.
...and you don't need VMs, containers, K8s or anything else.
It generally boils down to three things: 1) how many resources a service needs to run well, 2) how much it wants to consume if left unchecked, 3) how performant you want the service to be.
Once you understand these parameters, you can limit your application's resources by running it under a cgroup. The service can't surpass the limits you've set, and the cgroup will apply pressure as it approaches them.
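A minimal sketch of that with cgroup v2 (assuming the unified hierarchy is mounted at /sys/fs/cgroup and you have root; the cgroup name, limits, and PID are made up for illustration):

```python
import os

# Create a cgroup for the service (cgroup v2 unified hierarchy assumed).
CGROUP = "/sys/fs/cgroup/myservice"   # hypothetical cgroup name
os.makedirs(CGROUP, exist_ok=True)

# Apply reclaim pressure above 448 MiB, hard-cap memory at 512 MiB.
with open(os.path.join(CGROUP, "memory.high"), "w") as f:
    f.write(str(448 * 1024 * 1024))
with open(os.path.join(CGROUP, "memory.max"), "w") as f:
    f.write(str(512 * 1024 * 1024))

# Cap CPU at half a core: 50 ms of CPU time per 100 ms period.
with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
    f.write("50000 100000")

# Move an already-running service into the cgroup by PID.
service_pid = 1234  # hypothetical PID of the service process
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(service_pid))
```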
Also, sharing resources is good. Instead of running 10 web server containers, you can host all of them under a single server with virtual hosts, most of the time. That gives good resource sharing and gets more done with fewer processes.
As an extreme case, I'm running a home server (DNS server, torrent client, Syncthing client, a small HTTP server, and an FTP server, plus some other services) on a 512MB OrangePi Zero. The little guy works well and never locks up. It has plenty of free RAM, and none of the services are choking.
I agree but at the same time: Inter-process communication is also faster when a process is allowed to write to or read from another process's memory. Doesn't make it a good idea, though.
Yeah, nowadays you could just scale-up and still have a lot of leeway. A high-end server (w/ redundancy maybe) is more than enough for 95% of all common startup use cases.
Distributed computing only makes sense when you're starting to deal with millions of daily users.
I think the main risk of the HN architecture is that (I believe) it's a single datacentre in a single location. Hopefully they have offsite backups.
The other risk I guess would be that all the NS are AWS/Route 53 servers, so if that went down they'd only stay up for a couple of minutes (looks like the DNS TTL is 2 minutes).
You could host your own NS servers in two different locations on two different providers, and you could have a third hot-spare server ready to go too. That would let the service survive an earthquake flattening San Diego (off-site server ready to go) and cope with the loss of AWS DNS. Whether the cost/benefit ratio is there is another matter: the serving side of Route 53 is fairly reliable (even when the interface to update records fails on a frequent basis), and the cost of HN being down isn't particularly terrible.
We should ask ourselves: as a user, if I have to manage the risk of US-East-1 going down, maybe we should question this entire Cloud thing. It was marketed as "100% bulletproof" in 2005 when we discarded on-prem.
To be fair to Amazon, I don't think they've had an outage that affected different data centres at the same time, but yes, I would far rather have my main and backup on two providers in two appropriately diverse locations.
For HN it probably doesn't matter, although what's the value
For Amazon too it probably doesn't matter -- if I occasionally can't buy something I'm almost certain to simply try again 30 minutes later
For some services a couple of minutes of downtime isn't good enough. Imagine if the superbowl went dark for 5 seconds just as the winning play happened.
Can't replace the kernel though, and updating the web server application itself is going to be hard too. Unless it's some Erlang like hotpluggable server I guess.
Obligatory Lamport quote: "A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
We managed to run a successful bootcamp school LMS on a single cheapest-tier 1 GB RAM Hetzner VPS with the Rails app, Redis cache, Postgres, backups, a staging env, and near-zero production issues.
Not high-load of course, but still hundreds of active users interacting with the app every single day.
Recently had to upgrade to the next tier because of growth.
Modern servers are super fast and reliable as long as you know what you’re doing and don’t waste it on unnecessary overheads like k8s etc.
Adding to this: if you use dedicated servers rather than a VPS, you can also squeeze a lot more performance out of the boxes, since you have CPUs dedicated just to you instead of shared ones (vCPUs) that others use too.
AWS and GCP vCPU are absolutely not shared, and any provider that’s still sharing cores or threads is leaving its users open to all manner of atrocious side channels.
A vCPU is a "virtual core"; it's even in the name. When you rent those, you don't get access to a dedicated core/CPU, but a share of a logical processor. The side-channel attacks you mention all have mitigations at this point.
Only the very smallest instances don't get a full core. And when you have multiple vCPUs, the scheduler _won't_ ever split you onto a shared thread. This is definitely true of AWS's Nitro; I don't know for certain about GCP. The attacks are not fully mitigated between hyperthreads.
Is there evidence cloud compute is more than a few percent slower than on-prem?
> Only the very smallest instances don’t get a full core.
Not sure this is true for AWS anymore.
Even a t3.nano gives you 2 vCPUs, which I've always interpreted as meaning they give you 1 core with two threads to ensure that you never share a core with another tenant.
Of course, t2 instances still exist which give 1 vCPU, but there's no reason to create one of those when a t3 gives 2 vCPUs at a lower price. The only reason to be running a t2 is if you created it before t3 existed and just haven't bothered migrating.
Remove Kubernetes, and "A few Hetzner dedicated servers running just your application" has even better reliability, as it removes more points of failure than "A few Hetzner dedicated servers running Kubernetes".
So much this. System failure is a statistics game. Which is why I like the Erlang approach of having isolated state that you assume will corrupt after a while and you can restart that part of your application with a fresh state.
K8s does this, kind of, on a higher level, where the downtime is often more noticeable because there's more to restart. But if your application is already highly fault-tolerant, this is just another point of failure.
Because people adore stable income with no risks, and because most programmers out there are hacks who will only ever learn one or two ancient languages that lost their competitive advantage 10 years ago.
Take a look at how insanely far you can go with Elixir + Phoenix on a server with 2 vCPUs and 2GB RAM. It's crazy. At least 80% of all web apps ever written will never get to the point where an upgrade would be needed.
So yeah, people love inertia and to parrot the same broken premise quotes like "But StackOverflow shows languages X and Y dominate the industry!".
Nah, it doesn't show that at all. It only shows that programmers are people like everyone else: going outside the comfort zone is scary, so our real industry is manufacturing bullshit post-hoc rationalizations to conceal the fact that we're scared of change. A mega-productive and efficient change at that.
Social reasons > objective technical reasons. Always has been true.
I completely agree. Most developers simply copy whatever is currently popular without taking their specific context and biz requirements into consideration. Almost all software solutions are massively overcomplicated, with many points of failure and high latencies for every single transaction, leading to a death spiral of even more complex and expensive “solutions” for “performance” reasons. Microservices are the poster child for this. In contrast, a single high-performance server can handle the full workload of 99.99% of web applications out there, with a separate warm standby server ready to take over if needed, and a linear event transaction log as the source of truth.
Because (despite the hype) Erlang embodies principles that can be implemented in other languages, and they solve only a narrow set of problems in distributed systems. That does not justify a dedicated programming language.
Is Erlang particularly hyped? I don't usually see it suggested as a solution, I see it almost exclusively in reference to something already written in it, a fait accompli, like WhatsApp.
While Kubernetes definitely has its own failure modes and can be an utter pain in the ass to set up, in my experience it also removes significant failure modes and makes your life way easier.
For me, the most important thing is that I don't have to worry in my deployment pipelines about how many servers I have and how they are named/reachable, about stopping Docker containers, re-creating Docker containers, dealing with logs, letsencrypt cert renewals... all I have to do is point kubectl at the cluster's master, submit the current specification of how the system should look, and I don't have to deal with anything else. If a server or a workload crashes, I don't have to set up monitoring to detect it and fail over; Kubernetes takes care of that for me automatically. When I add capacity to the cluster for whatever reason, I enroll the new server with kubeadm and it's online - no entering of new backend IPs in reverse proxies, no bullshit. Maintenance (e.g. OS upgrades, hardware work) becomes a breeze as well: drain the node, do the maintenance, start the node, and everything works automatically again.
As someone who does this day to day, I think k8s is worth it for anything non-trivial. If I didn't already have this understanding, then it might be a different trade off.
Actually I have experience working in high-load environments operated by k8s.
But for 99% of the projects I see, it's a waste of time and resources (mostly people resources, not just CPU). HN is a perfect example of a project that doesn't need it, no matter the traffic.
If you need some additional flexibility and scalability over a "bare metal" setup, you can go far with just Docker Compose or Swarm until you have no choice but to use k8s.
I worked in an academic library where the scale was on the order of hundreds of users per day, but many of those users were faculty/researchers spread out across the globe who got very grumpy when whatever online archive/exhibit was not available.
I migrated our services from a very “pet” oriented architecture to a 4-node Proxmox cluster with Docker Swarm deploying containers across four VMs and it worked great. Services we brought up on this infra still to this date have 100% uptime, through updates and server reboots and other events that formerly would have taken sites offline temporarily.
I looked at k8s briefly and it seemed like total overkill. I probably would have spent weeks or months learning k8s for no appreciable advantage over swarm.
And how many k8s believers ever reach significant scale? It's just like back when NoSQL was the trendy thing and people thought a couple gigabytes meant "big data". Mostly it’s simply cargo culting.
Many large organisations use Kubernetes (Google, Spotify, IBM, etc). Regardless, large scale and very large scale are different things. Kubernetes is well suited to controlling fleets of compute on the order of tens of thousands of CPU cores and terabytes of memory.
The compute overhead to orchestrate these clusters is well worth the simplicity/standardisation/auto-scaling that comes with Kubernetes. Many people have never had to operate VMs in the hundreds or thousands and do not understand the challenges that come with maintaining varied workloads, alongside capacity planning at that scale.
A million nodes running a single application is scale, but a thousand nodes running a thousand applications is also scale, and they are very different beasts.
The FAANGs operate the first kind; k8s is mostly aimed at the second kind of scale, so it's designed "for scale", for some definitions of scale.
K8s grew out of Google's Borg, the cluster-management software designed for high availability at FAANG scale. So essentially K8s is the "community edition." Go read the Google SRE Book for context.
We use it to serve Ruby at 50 million requests per minute just fine. And the best part is the Horizontal Pod Autoscaler, which saves our ass during seasonal spikes.
While serverless/lambda are great I do think K8s is the most flexible way to serve rapidly changing containerized workload at scale.
Kubernetes is fantastic, though I think of it more as a tool for managing organizational complexity than ops. You can vertically scale beyond the level needed by most commercial applications nowadays using ad-hoc approaches on a machine or two with 16+ cores, 256GB+ RAM, terabytes of NVM drives, etc. but many (even small) companies have policies, teams, and organizational challenges that demand a lot of the structure and standardization tools like Kubernetes bring to the table.
So I'm not disagreeing with your assertion, but I'd perhaps scope it to saying it's useful overhead at significant organizational scale, but you can certainly operate at a significant technical scale without such things.. and that can be quite fun if you have the team or app for it :)
I guess because of sanity and simplicity in its architecture?
I once wrote a piece of software in Rust, a simple API: one binary on one DigitalOcean instance started by systemd, and nothing else. The thing has been working nonstop for years, making it the most stable piece of long-running software I’ve ever written, and I think it all comes from it being simple, without any extra/unnecessary complexity added.
I’m not bragging btw, I actually had to contact the user years after I wrote that because I couldn’t believe that the thing was still working but I hadn’t heard from them in years!
An interesting note would be that multiple 9s of uptime isn't actually necessary to keep users happy on a site like HN. I used to use sites in the early 2000s that were down a lot more regularly than HN but it didn't put me off using them.
Not all "downs" are reflected there. The last time, I remember it having really bad performance or being outright unusable (I don't remember which), but opening it in incognito you would get cached results fast.
This might be true on the infrastructure layer, but HN definitely uses "fancy" technology: Paul developed his own Lisp, Arc, which powers HN :) http://arclanguage.org/
> Arc is designed for exploratory programming: the kind where you decide what to write by writing it. A good medium for exploratory programming is one that makes programs brief and malleable, so that's what we've aimed for. This is a medium for sketching software.
There are some environments where making sure your transaction is secure on non volatile storage is critically important. And then there is storing comments on a news aggregation site.
Hell, this is one environment where I bet you could get away with a full async mount.
Sure, but the default should be to err on the side of persistence and consistency. The throughput of actual writes for a site like HN is next to nil: short, text-only comments averaging a couple hundred bytes, and you only have to pay for the fsync once anyway. If you have to turn off fsync to make it perform, you’re doing something horribly wrong.
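For a sense of how cheap that is, a minimal sketch (illustrative only, not HN's actual storage code): append the comment, flush, and pay for the single fsync.

```python
# Append a comment to a flat file and fsync so it survives a crash.
import json
import os

def append_comment(path: str, comment: dict) -> None:
    with open(path, "a") as f:
        f.write(json.dumps(comment) + "\n")
        f.flush()
        os.fsync(f.fileno())   # one fsync per write; a few hundred bytes is cheap

append_comment("comments.log", {"by": "alice", "text": "nice post"})
```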
I would love to know tech stack and architecture of HN. And how it evolved (if it did). And what resources they spend (money, manhours etc) to maintain it.
I have seen an error message once, instead of content. I don't remember it, but the wording was amazingly perfect.
And there is one minor issue where the "reply" button for a specific comment hasn't loaded by the time the comment has rendered.
It depends on what you mean by "online" and on the service level:
- "online" meaning a HN server responds with something. In this more literal sense, HN always seems to be up.
- "online" meaning normal page load response times. In this sense, HN sometimes times out with "sorry we can't serve your request right now". That seems to happen once a week or once a month. Another example is a super popular thread (e.g. "Trump wins election") that hammers the server and threads take a minute or more to load. This prompts dang to write a comment in the thread asking people to "log out" if they're not posting. This reduces load on the server as rendering pages of signed-out users don't need to query individual user stats, voting counts, hidden comments, etc. This would be a form of adhoc community behavior cooperating to reduce workload on the server rather than spin up extra instances on AWS.
The occasional disruptions in the second sense of "online" are OK, since this is a discussion forum and nobody's revenue depends on it. Therefore, it doesn't "need" more uptime than it already has.
There have been times when we have been asked if we can log out if we won’t be commenting because this will then allow the servers to serve straight from cache. I have only seen that happen a few times though.
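The logic behind that ask is roughly this (a toy sketch of the idea, not HN's actual code): logged-out requests can all share one cached render, while logged-in requests need a per-user render.

```python
# Illustrative only: serve anonymous visitors from one shared cached page.
import time

CACHE_TTL = 30          # seconds; assumed value
_page_cache = {}        # path -> (rendered_html, timestamp)

def render_page(path: str, user) -> str:
    # Logged-in pages are personalized (votes, hidden comments, karma),
    # so they are rendered per request.
    if user is not None:
        return render_for_user(path, user)

    # Logged-out pages are identical for everyone, so one render can be
    # reused for the whole anonymous crowd until it expires.
    cached = _page_cache.get(path)
    if cached and time.time() - cached[1] < CACHE_TTL:
        return cached[0]
    html = render_for_user(path, None)
    _page_cache[path] = (html, time.time())
    return html

def render_for_user(path, user):
    # Stand-in for the expensive per-request rendering work.
    return f"<html>{path} rendered for {user or 'anonymous'}</html>"

print(render_page("/news", None))     # rendered once, then cached
print(render_page("/news", None))     # served from cache
print(render_page("/news", "alice"))  # always rendered fresh
```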