
> yes, nodes have local disks, but any local filesystem the user can write to is often wiped between jobs as the machines are shared resources.

This is completely compatible with containerized systems. Immutable images live in a filesystem directory users have no access to, so there is no need to wipe them. Writability within a running container is entirely controlled by the admin who configures how the container executes.
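Concretely, this is the sort of thing the admin controls at launch time. A sketch with the docker-py SDK; the image name and bind path are made up, and Singularity/Podman expose roughly the same knobs under different flags:

    # Sketch (docker-py): read-only rootfs, writable scratch only in tmpfs,
    # bulk data bind-mounted from network storage. Image name and paths are
    # illustrative, not real.
    import docker

    client = docker.from_env()
    client.containers.run(
        "registry.example.org/lab/pytorch:2.3",     # hypothetical image
        command="python train.py",
        read_only=True,                              # the immutable image stays immutable
        tmpfs={"/tmp": "size=8g"},                   # scratch space, gone with the container
        volumes={"/shared/home/alice": {"bind": "/data", "mode": "rw"}},  # hypothetical path
        detach=True,
    )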

> you don't want to waste cluster time at the start of your job pulling down an entire image to every node, then extract the layers -- it is way faster to put a filesystem image in your home directory, then loop mount that image

This is actually less efficient over time, as there's a network-access tax every time you touch the network filesystem. On top of that, 1) you don't have to pull the images at execution time; you can pull them as soon as they're pushed to a remote registry, well before your job starts, and 2) container images are split into cached layers, so only changed layers need to be pulled; if only one layer changed in a new image, you only pull that layer, not the entire thing.
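For example, a pre-pull pass run on every node from cron (or triggered by a registry webhook) keeps the cache warm before any job lands on it. A sketch with docker-py; the image list is made up:

    # Sketch of a per-node pre-pull pass (docker-py). Layers already present
    # locally are skipped, so only changed layers actually transfer.
    import docker

    IMAGES = [
        ("registry.example.org/lab/base", "cuda12"),       # hypothetical images
        ("registry.example.org/lab/bio-tools", "2025.01"),
    ]

    client = docker.from_env()
    for repo, tag in IMAGES:
        client.images.pull(repo, tag=tag)   # no-op for layers already in the local cache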



there generally is no central shared immutable image store because every job is using its own collection of images.

what you're describing might work well for a small team, but when you have a few hundred to a thousand researchers sharing the cluster, very few of those layers are actually shared between jobs

even with a handful of users, most of these container images get fat at the python package installation layer, and that layer is one of the most frequently changed and is often only used for a single job


Just to review, here are the options:

1. Create an 8gb file on network storage which is loopback-mounted. Every file access inside it then turns into a block read over the network. According to your claim now, these giant blobs are rarely shared between jobs?

2. Create a Docker image in a remote registry. Layers are downloaded as necessary. According to your claim now, most of the containers will have a single layer which is both huge and changed every time python packages are changed, which you're saying is usually done for each job?

Both of these seem bad.

For the giant loopback file, why are there so many of these giant files which (it would seem) are almost identical except for the python differences? Why are they constantly changing? Why are they all so different? Why does every job have a different image?

For the container images, why do they end up with bloated layers when python packages change? Python files are not huge. The new layer should be somewhere between 5-100MB once the packages are installed. If the network is as fast as you say, transferring this once (even at job start) should take what, 2 seconds, if that? Do it before the job starts and it's instantaneous.
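Back-of-the-envelope on that claim (the link speed is my assumption, not a measurement):

    # Time to move one changed layer of the size estimated above.
    layer_mb = 100                      # worst case from the 5-100MB estimate
    link_gbps = 10                      # an assumed node uplink; IB/RoCE would be faster
    seconds = (layer_mb * 8) / (link_gbps * 1000)
    print(f"{seconds:.2f}s")            # ~0.08s at 10Gbit/s, ~0.8s at 1Gbit/s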

The whole thing sounds inefficient. If we can make kubernetes clusters run 10,000 microservices across 5,000 nodes and make it fast enough for the biggest sites in the world, we can make an HPC cluster (which has higher performance hardware) work too. The people setting this up need to optimize.


example tiny hpc cluster...

100 nodes. 500gb nvme disk per node. maybe 4 gpus per node. 64 cores? all other storage is network. could be nfs, beegfs, lustre.

100s of users that change over time. say 10 go away and 10 new ones come every 6 months. everyone has 50tb of data. tiny amount of code. cpu and/or gpu intensive.

all those users do different things and use different software. they run batch jobs that go for up to a month. and those users are first and foremost scientists. they happen to write python scripts too.

edit: that thing about optimization.. most of the folks who set up hpc clusters turn off hyperthreading.


Container orchestrators all have scheduled jobs that clean up old cached layers. The layers get cached on the local drive (only 500gb? you could easily upgrade to 1tb, they're dirt cheap, and don't need to be "enterprise-grade" for ephemeral storage on a lab rackmount. not that the layers should reach 500gb, because caching and cleanup...). The bulk data is still served over network storage and mounted into the container at runtime. GPU access works.
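The cleanup pass itself is tiny; something like this (docker-py, with an arbitrary 7-day retention window) run from cron on each node:

    # Sketch of the per-node cleanup an orchestrator (or a cron job) runs.
    # The 168h retention window is arbitrary; tune to your churn.
    import docker

    client = docker.from_env()
    client.containers.prune()                          # drop exited containers first
    client.images.prune(filters={"dangling": False,    # remove unreferenced images too,
                                 "until": "168h"})     # but only ones older than 7 days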

This is how systems like AWS ECS, or even modern CI/CD providers, work. It's essentially a fleet of machines running Docker, with ephemeral storage and cached layers. The CI/CD providers have millions of random jobs running all the time, submitted by tens of thousands of random people with random containers. Works fine. Requires tweaking, but it's an established pattern that scales well. They even route repeat jobs from a particular customer back to the previous VM for a "warm cache". Extremely fast, extremely large scale, all with containers.

It's made better by using hypervisors (or even better: micro-VMs) rather than bare metal. Abstracting the allocation of hosts, storage and network makes maintenance, upgrades, live migration, etc. easier. I know academia loves its bare metal, but it's 2025, not 2005.


https://slurm.schedmd.com/containers.html support for containers sort of exists. singularity is the friendliest.

https://modules.readthedocs.io/en/latest/ re-used libraries are packaged this way usually. not in container images.

there's no abstraction or live migration. there's only queues of jobs waiting to get cpu and/or gpu time.


> container images get fat at the python package installation layer, and that layer is one of the most frequently changed layers

This might be mitigated by installing a standard set of packages in a lower layer, and the frequently changing ones in a higher layer.
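A rough sketch of that split, built here via docker-py from an inline Dockerfile just to keep it to one snippet; the pinned package lists are only illustrative:

    # Sketch of the layer split. The fat, stable layer comes first and stays
    # cached; only the thin per-job layer gets rebuilt and re-pulled.
    import io
    import docker

    DOCKERFILE = b"""
    FROM python:3.12-slim
    # fat, rarely-changing layer: the lab's standard stack, pinned
    RUN pip install --no-cache-dir numpy==2.1.2 scipy==1.14.1 pandas==2.2.3
    # thin, frequently-changing layer: per-job extras
    RUN pip install --no-cache-dir biopython==1.84 pysam==0.22.1
    """

    client = docker.from_env()
    image, _logs = client.images.build(fileobj=io.BytesIO(DOCKERFILE), tag="lab/job:latest")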


Well, call them lazy, but once you have e.g. biocontainers, in which individual bioinformatics programs come prepackaged, hardly any scientist in that field would reinvent the wheel and waste time trying to install all the requirements and compile a program when a downloaded SIF already runs "good enough". Sure, with limited resources one can at times try to speed up some frequently used software by building a SIF from scratch on, say, a newer or more optimized Linux distro (if memory serves, containers using Alpine Linux/musl were a bit slower than containers using Ubuntu). But in the end, splitting the input into smaller chunks, running e.g. genome mapping on multiple nodes, then combining the results should be way faster than "turbo-charging" the genome mapping program on a single node, even with a big number of cores.
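The scatter step is the cheap part. A minimal sketch of just the chunking (file names and chunk count are placeholders, and it assumes plain 4-line FASTQ records), with the per-chunk mapping and the final merge left to the batch system:

    # Sketch: round-robin a big FASTQ into N chunks so each chunk can be mapped
    # on its own node. Records are independent, so the split order doesn't matter.
    import itertools

    def split_fastq(path, n_chunks, prefix="chunk"):
        outs = [open(f"{prefix}_{i:03d}.fastq", "w") for i in range(n_chunks)]
        with open(path) as fh:
            for i, record in enumerate(iter(lambda: list(itertools.islice(fh, 4)), [])):
                outs[i % n_chunks].writelines(record)   # one FASTQ record = 4 lines
        for f in outs:
            f.close()

    split_fastq("reads.fastq", n_chunks=100)   # then: one array-job task per chunk,
                                               # run the mapper, merge the BAMs at the end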


the "network tax" is not really a network tax. the network is generally a dedicated storage network using infiniband or roce if you cheap out. the storage network and network storage is generally going to be faster than local nvme.



