Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Hello, Daft developer here!

The network indeed becomes the bottleneck. In 2 main ways:

1. Reading data from cloud storage is very expensive. Here’s a blogpost where we talk about some of the optimizations we’ve done in that area: https://blog.getdaft.io/p/announcing-daft-02-10x-faster-io

2. During a global shuffle stage (e.g. sorts, joins, aggregations) network transfer of data between nodes becomes the bottleneck.

This is why the advice is often to stick with a local solution such as DuckDB, Polars or Pandas if you can keep vertically scaling!

However, horizontally scaling does have some advantages:

- Higher aggregate network bandwidth for performing I/O with storage

- Auto-scaling to your workload’s resource requirements

- Scaling to large workloads which may not fit on a single machine. This is more common in Daft usage because we also work with multimodal data such as images, tensors and more for ML data modalities.

Hope this helps!



Consider applying for YC's Winter 2026 batch! Applications are open till Nov 10

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: