Why not use CephFS instead? It has been thoroughly tested in real-world scenarios and has demonstrated reliability even at petabyte scale. As an open-source solution, it can run on the fastest NVMe storage, achieving very high IOPS with 10 Gigabit or faster interconnect.
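For a rough sense of scale (illustrative numbers only, not from any benchmark of CephFS itself): a single 10 Gb link tops out around 1.25 GB/s, which caps 4 KiB random reads at roughly 300K IOPS per link before any protocol or replication overhead, so the interconnect matters as much as the drives. A quick sanity check:

    # Rough ceiling a single 10 Gb/s link puts on 4 KiB random reads,
    # ignoring protocol and replication overhead (illustrative only).
    link_gbps = 10
    bytes_per_second = link_gbps * 1e9 / 8      # ~1.25 GB/s
    io_size = 4096                              # 4 KiB per random read
    print(f"~{bytes_per_second / io_size:,.0f} IOPS per 10 Gb link")
    # ~305,176 IOPS per 10 Gb link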
I think their "Other distributed filesystem" section does not answer this question.
Among other things, the OSD was not designed with NVMe drives in mind - which is fair, given how old it is - so it's nowhere close to being able to handle modern NVMe I/O throughput and IOPS. For that you need zero-copy, RDMA, etc.
Note that there is a next-generation OSD project called Crimson [0]; however, it's been in development for a while, and I'm not sure how well it's going.
It's based on the awesome Seastar framework [1], backing ScyllaDB.
Achieving such performance would also require many changes to the client (RDMA, etc).
Something like Weka [2] has a much better design for this kind of performance.
I thought the current recommendation was to not have multiple OSDs per NVMe? Tbf I haven’t looked in a while.
I have 3x Samsung NVMe drives (something enterprise w/ PLP; I forget the model number) across 3 nodes, linked with an InfiniBand mesh network. IIRC when I benchmarked it, I could get somewhere around 2000 MB/s, bottlenecked by single-core CPU performance. Fast enough for homelab needs.
I don't know, man, I think 15M random read IOPS is actually quite fast. I've built multi-million-IOPS clusters in enterprise settings in the past, all on NVMe.
> I think 15M random read IOPS is actually quite fast
680 NVMe SSDs across 68 storage servers (so 68 CPUs) for just 15M (or 25M, tuned) random read IOPS is pretty underwhelming. The use cases where 3FS (or some other custom designs) shine are more like 200M random read IOPS from 64 servers, each with 8 PCIe Gen 4 NVMe SSDs (512 SSDs in total).
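To make the gap concrete, here's the per-drive math implied by the figures quoted above (a single modern PCIe Gen 4 NVMe SSD can do on the order of 1M 4K random read IOPS by itself):

    # Per-SSD random read IOPS implied by the two setups quoted above.
    ceph_iops, ceph_ssds = 15_000_000, 680          # 68 servers x 10 SSDs each
    ceph_tuned_iops = 25_000_000
    custom_iops, custom_ssds = 200_000_000, 512     # 64 servers x 8 SSDs each

    print(f"Ceph:          ~{ceph_iops / ceph_ssds:,.0f} IOPS per SSD")
    print(f"Ceph (tuned):  ~{ceph_tuned_iops / ceph_ssds:,.0f} IOPS per SSD")
    print(f"Custom design: ~{custom_iops / custom_ssds:,.0f} IOPS per SSD")
    # Ceph:          ~22,059 IOPS per SSD
    # Ceph (tuned):  ~36,765 IOPS per SSD
    # Custom design: ~390,625 IOPS per SSD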
DigitalOcean uses Ceph underneath their S3 and block volume products. When I was there, they had two teams just managing Ceph, and that's not counting any of the control plane stuff built on top.
It is a complete bear to manage and tune at scale. And DO never greenlit offering anything based on CephFS either, because it was going to be a whole other host of things to manage.
Then of course you have to fight with the maintainers (Red Hat devs) to get any improvements contributed, assuming you even have team members with the requisite C++ expertise.
Most of the legitimate datacenter-scale direct alternatives to Ceph are unfortunately proprietary, in part because it takes so much money and so many hours of human expertise to even prove out that scale that the vendors want to recoup their costs and stay ahead.
MinIO is absolutely not datacenter-scale, and I would not expect anything written in Go to really reach that point. Garbage collection is a rough thing at such enormous scale.
I bet we'll get one in Rust eventually. Maybe from Oxide Computer Company? Though despite doing so much OSS, they seem to be focused on their specific server rack OS, not general-purpose solutions.
It's not an area I personally work on, but yeah, there's a lot going on. And there will be more in the future. For example, I believe right now we ensure data integrity ourselves, but if you're running something (like Ceph) that does that on its own, you're paying for it twice. So giving people options like that is important. It's a pretty interesting part of the space!
IMO this is the problem with all storage clusters that you run yourself, not just Ceph. Ultimately, keeping data alive through instance failures is just a lot of maintenance that needs to happen (even with automation).
I thought they used Ceph too. But I started looking around, and it seems like they have switched to CernVM-FS and an in-house solution. I'm not sure what changed.
CERN is a heavy user of Ceph, with about 100PB of data across CephFS, object storage (used as a backend for S3), and block storage (mostly for VM storage). CVMFS (https://cernvm.cern.ch/fs/) is used to distribute the software stacks used by the LHC experiments across the WLCG (Worldwide LHC Computing Grid), and is backed by S3 on Ceph for its storage needs. Physics data, however, is stored on EOS (https://eos.web.cern.ch), and CERN just recently crossed the 1EB mark of raw disk storage managed by EOS. EOS is also used as the storage solution for CERNBox (https://cernbox.web.cern.ch/), which holds user data. Data analyses use ROOT and read the data remotely from EOS using XRootD (https://github.com/xrootd/xrootd), as EOS is itself based on XRootD. XRootD is very efficient at reading data across the network compared to other solutions. It is also used by experiments beyond high-energy physics, for example by LSST in its clustered database, Qserv (https://qserv.lsst.io).
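For anyone unfamiliar with that workflow, this is roughly what a remote read over XRootD looks like from the analysis side (a minimal sketch using PyROOT; the EOS path and tree name below are hypothetical, and it assumes the ROOT and XRootD client libraries are installed):

    # Minimal sketch: open a file on EOS remotely via an xrootd (root://) URL.
    # TFile::Open streams the data over the network instead of copying it locally.
    import ROOT

    # Hypothetical path; real analyses point at their experiment's EOS namespace.
    f = ROOT.TFile.Open("root://eospublic.cern.ch//eos/example/dataset/events.root")
    tree = f.Get("Events")  # hypothetical TTree name
    print("entries:", tree.GetEntries())
    f.Close()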
They didn't switch; they use both for different needs. EOS (CVMFS) is used mainly for physics data storage and user data. Ceph is used for many other things like infrastructure, self-hosted apps, etc.