Why not use CephFS instead? It has been thoroughly tested in real-world scenarios and has demonstrated reliability even at petabyte scale. As an open-source solution, it can run on the fastest NVMe storage, achieving very high IOPS with 10 Gigabit or faster interconnect.
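For a rough sense of scale (illustrative numbers only, not from any benchmark of CephFS itself): a single 10 Gb link tops out around 1.25 GB/s, which caps 4 KiB random reads at roughly 300K IOPS per link before any protocol or replication overhead, so the interconnect matters as much as the drives. A quick sanity check:

    # Rough ceiling a single 10 Gb/s link puts on 4 KiB random reads,
    # ignoring protocol and replication overhead (illustrative only).
    link_gbps = 10
    bytes_per_second = link_gbps * 1e9 / 8      # ~1.25 GB/s
    io_size = 4096                              # 4 KiB per random read
    print(f"~{bytes_per_second / io_size:,.0f} IOPS per 10 Gb link")
    # ~305,176 IOPS per 10 Gb link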
I think their "Other distributed filesystem" section does not answer this question.
Among other things, the OSD was not designed with NVMe drives in mind - which is fair, given how old it is - so it's nowhere close to being able to handle modern NVMe I/O throughput and IOPS. For that you need zero-copy, RDMA, etc.
Note that there is a next-generation OSD project called Crimson [0]; however, it's been in development for a while, and I'm not sure how well it's going.
It's based on the awesome Seastar framework [1], backing ScyllaDB.
Achieving such performance would also require many changes to the client (RDMA, etc).
Something like Weka [2] has a much better design for this kind of performance.
I thought the current recommendation was to not have multiple OSDs per NVMe? Tbf I haven’t looked in a while.
I have 3x Samsung NVMe drives (something enterprise w/ PLP; I forget the model number) across 3 nodes, linked with an InfiniBand mesh network. IIRC when I benchmarked it, I could get somewhere around 2000 MB/s, bottlenecked by single-core CPU performance. Fast enough for homelab needs.
I don't know, man, I think 15M random read IOPS is actually quite fast. I've built multi-million-IOPS clusters in enterprise settings in the past, all on NVMe.
> I think 15M random read IOPS is actually quite fast
680 NVMe SSDs across 68 storage servers (so 68 CPUs) for just 15M (or 25M, tuned) random read IOPS is pretty underwhelming. The use cases where 3FS (or some other custom designs) shine are more like 200M random read IOPS from 64 servers, each with 8 PCIe Gen 4 NVMe SSDs (512 SSDs in total).
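To make the gap concrete, here's the per-drive math implied by the figures quoted above (a single modern PCIe Gen 4 NVMe SSD can do on the order of 1M 4K random read IOPS by itself):

    # Per-SSD random read IOPS implied by the two setups quoted above.
    ceph_iops, ceph_ssds = 15_000_000, 680          # 68 servers x 10 SSDs each
    ceph_tuned_iops = 25_000_000
    custom_iops, custom_ssds = 200_000_000, 512     # 64 servers x 8 SSDs each

    print(f"Ceph:          ~{ceph_iops / ceph_ssds:,.0f} IOPS per SSD")
    print(f"Ceph (tuned):  ~{ceph_tuned_iops / ceph_ssds:,.0f} IOPS per SSD")
    print(f"Custom design: ~{custom_iops / custom_ssds:,.0f} IOPS per SSD")
    # Ceph:          ~22,059 IOPS per SSD
    # Ceph (tuned):  ~36,765 IOPS per SSD
    # Custom design: ~390,625 IOPS per SSD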
DigitalOcean uses Ceph underneath their S3 and block volume products. When I was there, they had two teams just managing Ceph, and that's not counting any of the control plane stuff built on top.
It is a complete bear to manage and tune at scale. And DO never greenlit offering anything based on CephFS either, because it was going to be a whole other host of things to manage.
Then of course you have to fight with the maintainers (Red Hat devs) to get any improvements contributed, assuming you even have team members with the requisite C++ expertise.
Most of the legitimate datacenter-scale direct alternatives to Ceph are unfortunately proprietary, in part because it takes so much money and so many hours of human expertise to even prove out that scale that the vendors want to recoup their costs and stay ahead.
MinIO is absolutely not datacenter-scale, and I would not expect anything written in Go to really reach that point. Garbage collection is a rough thing at such enormous scale.
I bet we'll get one in Rust eventually. Maybe from Oxide Computer Company? Though despite doing so much OSS, they seem to be focused on their specific server rack OS, not general-purpose solutions.
It's not an area I personally work on, but yeah, there's a lot going on. And there will be more in the future. For example, I believe right now we ensure data integrity ourselves, but if you're running something (like Ceph) that does that on its own, you're paying for it twice. So giving people options like that is important. It's a pretty interesting part of the space!
IMO this is the problem with all storage clusters that you run yourself, not just Ceph. Ultimately, keeping data alive through instance failures is just a lot of maintenance that needs to happen (even with automation).
I thought they used Ceph too. But I started looking around, and it seems like they have switched to CernVM-FS and an in-house solution. I'm not sure what changed.
CERN is a heavy user of Ceph, with about 100PB of data across CephFS, object storage (used as a backend for S3), and block storage (mostly for VM storage). CVMFS (https://cernvm.cern.ch/fs/) is used to distribute the software stacks used by the LHC experiments across the WLCG (Worldwide LHC Computing Grid), and is backed by S3 on Ceph for its storage needs. Physics data, however, is stored on EOS (https://eos.web.cern.ch), and CERN just recently crossed the 1EB mark of raw disk storage managed by EOS. EOS is also used as the storage solution for CERNBox (https://cernbox.web.cern.ch/), which holds user data. Data analyses use ROOT and read the data remotely from EOS using XRootD (https://github.com/xrootd/xrootd), as EOS is itself based on XRootD. XRootD is very efficient at reading data across the network compared to other solutions. It is also used by experiments beyond high-energy physics, for example by LSST in its clustered database, Qserv (https://qserv.lsst.io).
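For anyone unfamiliar with that workflow, this is roughly what a remote read over XRootD looks like from the analysis side (a minimal sketch using PyROOT; the EOS path and tree name below are hypothetical, and it assumes the ROOT and XRootD client libraries are installed):

    # Minimal sketch: open a file on EOS remotely via an xrootd (root://) URL.
    # TFile::Open streams the data over the network instead of copying it locally.
    import ROOT

    # Hypothetical path; real analyses point at their experiment's EOS namespace.
    f = ROOT.TFile.Open("root://eospublic.cern.ch//eos/example/dataset/events.root")
    tree = f.Get("Events")  # hypothetical TTree name
    print("entries:", tree.GetEntries())
    f.Close()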
They didn't switch; they use both for different needs. EOS (CVMFS) is used mainly for physics data storage and user data. Ceph is used for many other things like infrastructure, self-hosted apps, etc.