For all the talk of needing all the cloud infra to run even a simple website, Marginalia hits the frontpage of HN and we can't even bring a single PC sitting in some guy's living room to its knees.
More than the cores and RAM, the bigger issue with cloud providers is I/O, both throughput and latency, to disk and the network. Physical hardware, even when comparing cores and RAM 1:1, is outrageously faster than cloud services.
Problem is that those volumes are ephemeral and may not provide the reliability guarantees that the EBS volumes do, so they're only really good as cache and not for any persistent data you care about.
I would argue that a rack-mounted chassis with lots of disks is also ephemeral, just less so: most failures in a server can be fixed by swapping some parts, and your data is still there.
But AWS and its competitors don’t have an offering even close to comparable to what you can get in a commodity server. A 1U server with one or two CPU sockets and 8-12 hot-swap NVMe bays is easy to buy and not terribly expensive, and you can easily fill it with 100+ TiB of storage with several hundred Gbps of storage bandwidth and more IOPS than you are likely able to use. EC2 has no comparable offering at any price.
(A Micron 9400 drive supposedly has 7 GB/s (56 Gbit/s) of usable bandwidth. Ten of them gives 560 Gbit/s, and a modern machine with lots of PCIe 5.0 lanes may actually be able to use a lot of that. As far as I can tell, you literally cannot pay AWS for anywhere near this much bandwidth, but you can buy it for $20k or so.)
> I would argue that a rack-mounted chassis with lots of disks is also ephemeral, just less so: most failures in a server can be fixed by swapping some parts, and your data is still there.
True, but this also depends on the design decisions AWS made regarding those volumes.
Indeed it could be that the volume is internally (at the hypervisor level) redundant (maybe with something like ZFS or other proprietary RAID), but there's no way to know.
Furthermore, AWS doesn't really let you keep tabs on or reserve the physical machine your VM runs on: every time a VM is powered up, it gets assigned to a random host machine. If there is a hardware failure, they advise you to reboot the instance so it gets rescheduled onto another machine, so even though your data may technically still be on that physical host, you have no way to get it back.
AWS' intent seems to be for these to act as a transient cache/scratchpad, so they don't offer much in the way of durability or recovery for those volumes. Their hypervisor treats them as disposable, which is a fair design decision given the intended use case, but it means you can't/shouldn't use them for any persistent data.
Being in control of your own hardware (or at the very least, renting physical hardware from a provider as opposed to a VM like in AWS) will indeed allow you to get reliable direct-attach storage.
I can buy a rather nicer 1U machine with substantially better local storage for something like half the 1-year reserved annual cost of this thing.
If you buy your own servers, you can mix and match CPUs and storage, and you can get a lot of NVMe storage capacity and bandwidth; cloud providers don't seem to have comparable products.
Something like €200/mo if you factor in the need for disk space as well. This is also Hetzner we're talking about. They're sort of infamous for horror stories about arbitrarily removed servers and shitty support. They're the cheapest for a reason.
But with dedicated servers, are we really talking cloud?
I only rent a small server from them, but I've been happy with their support. Talked to a real human who could help me with tech questions, even though I pay them next to nothing.
Most of the problems I read about are during the initial signup stage. They ask for a copy of your passport etc., and even then some people can't sign up, presumably because their info is triggering something in Hetzner's anti-fraud checks. This sucks for those people, of course.
The other common cause of issues is things like crypto which they don't want in their network at all.
This will sound like I am downplaying what people have experienced and/or being apologetic on their behalf, but that is not my intention. I am just a small-time customer of theirs. I've had 1 or 2 dedicated servers with them for many, many years now, upgrading and migrating as necessary. (It used to be that if you waited a year or two and upgraded you'd get a better server for cheaper. Those days are gone.)
I've only dealt with support over email, where they have been both capable and helpful, but what I needed was just plugging in a hardware KVM switch (free for a few hours; I never had to pay) or replacing a failing hard drive (they do this with zero friction). Perhaps I am lenient on the tech support staff. After all, they are my people. I've been to a few datacenters and have huge respect for what they do.
On the presales side they seem to reply in a matter-of-fact tone with no flexibility. They are a German company, after all.
I'm a bit wary I'd get lumped in with the crypto gang. A lot of what I'm doing with the search engine is fairly out there in terms of pushing the hardware in unusual ways.
It would also suck if there ever was a problem. The full state of the search engine is about 1 TB of data. It's not easy to just start up somewhere else if it vanished.
In Azure that's roughly 5k per year if you pay for the whole year upfront.
I have the pleasure of playing with 64 cores, 256 GB RAM and 2x V100 GPUs for data science projects every now and then. That turns out to be roughly 32k per year.
I share your perspective on pricing. I had a discussion with my team lead about why we haven't taken on the task of running our own machines. The rationale behind it is that while server management may seem easy, ensuring its security can be complex. Even in a sizable company, it's worth considering whether you want to shoulder the responsibility or simply pay a higher cost to have Microsoft handle it. Personally, I love hosting my own infrastructure. It's fun, potentially saves me some cash, allows me to learn a lot, and gives me full control. However, I understand now that on a company scale, others may see it differently.
--edit--
I forgot to add the following: that's 32k if you run the system 24/7. Usually it's up for a few hours per month, so you end up paying maybe 2k for the whole year.
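Back-of-the-envelope, assuming those figures: 32k/year over 8,760 hours is roughly $3.65/hour, so something on the order of 40-50 hours of use per month works out to around 2k/year.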
I'm curious about your network bandwidth/load. You only serve text, right? [Edit: No, I see thumbnail images too!] Is the box in a datacenter? If not, what kind of Internet connection does it have?
Average load today has at worst been about 300 Kb/s TX, 200 Kb/s RX. I've got a 1000/100 Mbit/s down/up connection. Seems to be holding without much trouble.
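(Rough math: if that's 300 kilobits/s against a 100 Mbit/s uplink, that's about 0.3% utilization, or a few percent if it's kilobytes, so there's a lot of headroom either way before the connection becomes the bottleneck.)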
Most pages with images do lazy loading, so I'm not hit with 30 images all at once. They're also WebP and cached via Cloudflare, which softens the blow quite a lot.
IMO it's actually incredibly well-documented and thoughtfully organized for a one-person project! You should be proud of what you've put together here!
It's a Debian server running nginx in front of a bunch of custom Java services that use the Spark microframework [1]. I use a MariaDB server for link data, and I've built a bespoke index in Java.
[1] https://sparkjava.com/ I don't use Spring Boot or anything like that; besides Spark I'm not using frameworks.
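For a sense of what that looks like, a Spark service is basically just route handlers on an embedded Jetty. This is a minimal sketch, not code from the actual search engine; the port and the renderResults helper are made up:

    import static spark.Spark.*;

    public class SearchService {
        public static void main(String[] args) {
            port(8080); // hypothetical port; nginx would proxy requests to this

            // One route per endpoint; the lambda receives the request and
            // response objects and returns the body to send back.
            get("/search", (req, res) -> {
                String query = req.queryParams("query");
                res.type("text/html");
                return renderResults(query);
            });
        }

        // Stand-in for whatever actually queries the index and renders HTML.
        static String renderResults(String query) {
            return "<p>results for: " + query + "</p>";
        }
    }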
Really? Even with mmap'ed memory, won't the CPU still register user code waiting on pages being read from disk as iowait? I'm surprised enough by that that if it doesn't, it sounds like a bug.
Yeah, that's at least what I've been seeing. Although it could alternatively be that a lot of the I/O activity is predictive reads, and the threads don't actually stall on page faults all that often.
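For context, this is roughly the access pattern we're talking about, as a minimal Java sketch (not the actual index code; the file name and the byte-per-page scan are just illustrative):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class MmapScan {
        public static void main(String[] args) throws IOException {
            // "index.dat" is a stand-in name, not the real index file.
            try (FileChannel ch = FileChannel.open(Path.of("index.dat"),
                                                   StandardOpenOption.READ)) {
                // A single MappedByteBuffer tops out around 2 GB, so a real
                // index would be split across several mappings.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0,
                                              Math.min(ch.size(), Integer.MAX_VALUE));

                long sum = 0;
                // Touch one byte per 4 KiB page. Non-resident pages are pulled
                // in by the kernel on the page fault, and readahead will often
                // have prefetched them already, so the thread rarely stalls on
                // disk, which would explain seeing little iowait even though a
                // lot of I/O is happening underneath.
                for (int pos = 0; pos < buf.limit(); pos += 4096) {
                    sum += buf.get(pos);
                }
                System.out.println("checksum: " + sum);
            }
        }
    }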
I remember there is ksplice or something like that to upgrade even the kernel without complete downtime. Everything else can be upgraded piecemeal, provided that worker processes can be restarted without downtime.
If the hardware itself is the reason for the long startup time, kexec allows you to boot a new kernel from within an existing one and avoids the firmware/HW init.
People strive to have these problems! Hockey-stick growth, servers melting under signup requests, payment systems struggling under the stream of subscription payments! Scale up, up, up! And for that you might want to run your setup under k8s from day one, just in case, even though a single inexpensive server would run the whole thing with a 5x capacity reserve. But that would feel like a side project, not a startup!
I'd argue that a lot of modern web engineering pretends to be built for problems most people won't have. So much resume-driven development is being done on making select, easy parts super scalable while ignoring elephants in the room such as the datastore.
A good example is the obsession with "fast" web frameworks on your application servers, completely ignoring the fact that your database will be the first thing to give up, even with most "heavy" web frameworks in their default configuration and without any optimization effort.
I think HN's stack is the right choice for them and that it fulfills its purpose excellently, but I do seem to recall both of their hard drives failing more or less simultaneously & HN going down for about 8 hours not that long ago.
If that happened at the SaaS company I worked at previously, it would be a bloodbath. The churn would be huge. And our customers' customers would be churning from them. If it happened at a particularly inopportune time, like while we were raising money or something, it could potentially endanger the company.
(I'd like to stress again this is not a criticism of HN/dang, but just to illustrate a set of requirements where huge AWS spends do make sense.)
In my experience, simple systems perform better on average because there are fewer interconnected gears.
Much more complex systems do not perform as consistently as simple ones, and they are exponentially harder to debug, introspect and optimize at the end of the day.
Every time I deploy a service it goes down for anything between 30 seconds and 5 minutes. When I switch indices, the entire search engine is down for a day or more. Since the entire project is essentially non-commercial, I think this is fine. I don't need five nines.
If reliability were extremely important, the scales would tilt differently; maybe cloud would be a good option. A lot of it is for CYA's sake as well. If I mess up with my server, that's both my problem and my responsibility. If a cloud provider messes up, then that's an SLA violation and maybe damages are due.