Daft: Distributed DataFrame for Python (github.com/eventual-inc)
129 points by tosh on March 1, 2024 | 39 comments


They mention https://pola.rs/ and highlight how it has the exact same properties (out-of-core, etc.), but they don't include it in the benchmarks, and there's no mention of it on their detailed comparison page.


It's because they are comparing to distributed solutions only. Polars is one of the fastest solutions for single-machine workflows, but it doesn't support distributed workflows.


Ah of course, that makes sense, thanks. In their comparison, perhaps they should still mention that just to clear up any confusion.


Oh yes good point! We'll be sure to add more details about comparisons with local dataframe libraries such as Pandas/Polars/DuckDB.


I was told they use the Polars engine, so they "just" built a distributed system around it.


Hello! Daft developer here - we don’t directly use Polars as an execution engine, but parts of the codebase (e.g. the expressions API) are heavily influenced by Polars code and hence you may see references to Polars in those sections.

We do have a dependency on the Arrow2 crate like Polars does, but that has been deprecated recently so both projects are having to deal with that right now.


I don't see any direct dependencies on Polars. But in code comments, they write that some parts are "taken from", "influenced by", "adapted from" and "based on" Polars.

[0] -- https://github.com/search?q=repo%3AEventual-Inc%2FDaft%20pol...


yes exactly, they """took""" part of the Polars engine code.


Here's the comparison chart to other Dataframe libraries:

https://www.getdaft.io/projects/docs/en/latest/faq/dataframe...

Daft is distributed and vectorized with plenty of benefits from lazy evaluation.
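
As a rough flavor of the lazy-evaluation side, here's a minimal sketch (the bucket, file paths and column names are made up; check exact method signatures against the docs):

    import daft
    from daft import col

    # Building the query only constructs a logical plan; nothing executes yet.
    df = daft.read_parquet("s3://my-bucket/events/*.parquet")
    df = df.where(col("status") == "ok").select("user_id", "latency_ms")

    df.explain()   # inspect the optimized plan
    df.collect()   # execution (local or distributed) is only triggered here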


Super tiny nitpick, but Dask is listed as having a non-Arrow backend. Since it uses pandas as its underlying data structure, a fully Arrow-backed setup should be fine as long as the installed version of pandas is 2+.
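
For what it's worth, a minimal sketch of what that looks like (assumes pandas >= 2.0 and pyarrow are installed; the file name is made up). Since a Dask DataFrame is just a collection of pandas DataFrames, its partitions pick up the same dtypes:

    import pandas as pd

    # dtype_backend="pyarrow" stores columns as Arrow arrays rather than NumPy ones
    df = pd.read_csv("data.csv", dtype_backend="pyarrow")
    print(df.dtypes)  # e.g. int64[pyarrow], string[pyarrow]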


Interesting. So the main selling point is "PySpark, but better, because it's Python-native"?


We did extensive comparisons with Spark and a lot of other tools at work. Daft always came out on top. Highly recommend!


How does this compare with dask? (Polars was mentioned but isn’t distributed like dask is)


There are benchmarks here - https://github.com/Eventual-Inc/Daft?tab=readme-ov-file#benc.... Seems to outperform Dask by a fair bit.


These benchmarks seem strange; it looks like some sort of database test suite? And compared to the performance listed on the benchmark's own page[1], the Daft results look abysmal? Like, Daft takes several hours to complete ten queries, while the results the benchmark authors point to manage around a million queries per hour... oh, and the software they use for that is MS-SQL.

[1]: https://www.tpc.org/tpch/results/tpch_advanced_sort5.asp?PRI...


Hello! Daft developer here. The benchmarks we performed aren’t directly comparable to the benchmarks on TPC-H’s own page because of differences in hardware, storage etc.

For hardware, we were using AWS i3.2xlarge machines in a distributed cluster. And on the storage side we are reading Parquet files over the network from AWS S3. This is most representative of how users run query engines like Daft.

The TPC-H benchmarks are usually performed on databases which have pre-ingested the data into a single-node server-grade machine that’s running the database.

Note that Daft isn’t really a “database”, because we don’t have proprietary storage. Part of the appeal of using query engines like Daft and Spark is being able to read data “at rest” (as Parquet, CSV, JSON, etc.). However, this will definitely be slower than a database which has pre-ingested the data into indexed storage and proprietary formats!

Hope that helps explain the discrepancies!


I’m sure you mean slow = high latency, but I do hope you’re «high throughput», like spark/…?


The QphH metric reported for TPC-H[1] is _not_ queries-per-hour, it's a composite metric:

> The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries.

so it's not directly comparable to seconds/query reported by Daft.
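
(For reference, if I'm reading the spec right, QphH@Size is the geometric mean of the Power and Throughput test results, i.e. QphH@Size = sqrt(Power@Size * Throughput@Size), so it folds single-stream latency and multi-stream throughput into one headline number.)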

[1]: https://www.tpc.org/tpch/default5.asp


This looks really nice! It feels like in the last few years (probably with the increase in data science and analytics) dataframe engines have become really exciting; it's very cool to see so much innovation in the field.

One thing I haven't seen yet, and I'd be really interested to know, is whether column-level type hinting is on the roadmap at all for this (or any similar tools)?

I'm a data engineer and my team uses mypy pretty heavily, but that can only really tell us if something is a "dataframe" type. It always strikes me that 90% of data issues during development (incorrect joins, column misspellings, non-type-appropriate operations, not factoring in nulls) would be eliminated if mypy had information about the columns and their data types.
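
A contrived sketch of the kind of thing that slips through today (column names are made up):

    import pandas as pd

    def total_revenue(orders: pd.DataFrame) -> float:
        # mypy only knows `orders` is a DataFrame. Whether "price" exists,
        # is numeric, or contains nulls is invisible to it, so a typo like
        # orders["pricee"] or a string * float only blows up at runtime.
        return float((orders["price"] * orders["quantity"]).sum())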


Interesting. Daft currently does validation on types/names only at runtime. The flow looks like:

1. Construct a dataframe (performs schema inference)

2. Access (now well-typed) columns and operations on those columns in the dataframe, with associated validations.

Unfortunately step (1) can only happen at runtime and not at type-checking-time since it requires running some schema inference logic, and step (2) relies on step (1) because the expressions of computation are "resolved" against those inferred types.
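
Concretely, the flow looks something like this (toy data; treat exact method names as illustrative):

    import daft

    # (1) Constructing the dataframe runs schema inference at runtime.
    df = daft.from_pydict({"price": [1.0, 2.5, None], "qty": [3, 1, 2]})
    print(df.schema())  # inferred column names and types

    # (2) Expressions are resolved against that inferred schema, so a
    # misspelled column or a type-inappropriate operation raises when the
    # plan is built, rather than silently producing bad data.
    df = df.with_column("total", df["price"] * df["qty"])
    df.show()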

However, if we can fix (1) to happen at type-checking time using user-provided type-hints in place of the schema inference, we can maybe figure out a way to propagate this information through to mypy.

Would love to continue the discussion further as an Issue/Discussion on our Github!


Thanks for getting back, that's really interesting. For sure, I will do!


Many, many dataframes and dialects. Is this dataframe dialect similar to Polars or Pandas? Ease of porting and testing is a key factor for adoption.

Regex support for strings and more detail on how partitioning works would be interesting.


Hello! Daft developer here. We are most similar in API to Polars and PySpark.

And thanks for the feedback! We’ll add more capabilities for regex, as well as flesh out our documentation for partitioning.

Edit: added a new issue for regex support :) https://github.com/Eventual-Inc/Daft/issues/1962


Really impressive benchmarks - outperforms even Spark by a good margin. Great work!


One of the daft developers here - Thank you!


Is this comparable to Ibis or is it something that would be a backend for it?


This would be more comparable to being a backend for ibis. We're working on adding the remaining operations (like regex on strings or trigonometry functions) that ibis requires!


Wondering if network becomes a bottleneck. Anybody using this?


Hello, Daft developer here!

The network indeed becomes the bottleneck, in two main ways:

1. Reading data from cloud storage is very expensive. Here’s a blogpost where we talk about some of the optimizations we’ve done in that area: https://blog.getdaft.io/p/announcing-daft-02-10x-faster-io

2. During a global shuffle stage (e.g. sorts, joins, aggregations) network transfer of data between nodes becomes the bottleneck.

This is why the advice is often to stick with a local solution such as DuckDB, Polars or Pandas if you can keep vertically scaling!

However, horizontally scaling does have some advantages:

- Higher aggregate network bandwidth for performing I/O with storage

- Auto-scaling to your workload’s resource requirements

- Scaling to large workloads which may not fit on a single machine. This is more common in Daft usage because we also work with multimodal data such as images, tensors and other ML data modalities.

Hope this helps!


Great question!

It depends on your workload size: if the compute time is less than the time it takes to transfer the data back and forth, then it might indeed be a bad idea to use it.

Typically these solutions should only be evaluated once you've maxed out vertical scaling, before committing to horizontal scaling.

Network is one hell of a destroyer when it comes to advantages gained from distributed computing.


I would go even further and argue that for the vast majority of organizations vertical scaling is all you are ever going to need.


Horizontal scaling, however, provides you with more aggregate network bandwidth. Most enterprises run workloads that download data from a data lake (usually S3), and that download is often the bottleneck. Horizontal scaling here lets the query leverage much higher aggregate network bandwidth than a single large machine would have.


Looks like iceberg support is coming - dope!


Daft developer here!

We actually already have read support. Check out the pyiceberg docs' Daft section: https://py.iceberg.apache.org/api/#daft

It's also very easy to use from Daft itself: `daft.read_iceberg(pyiceberg_table)`. Give it a shot and let us know how it works for you!
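
Spelled out a bit more (catalog and table names here are made up; assumes a configured pyiceberg catalog):

    import daft
    from pyiceberg.catalog import load_catalog

    # Load the Iceberg table handle with pyiceberg, then hand it to Daft.
    catalog = load_catalog("my_catalog")
    table = catalog.load_table("analytics.events")

    df = daft.read_iceberg(table)
    df.show()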


I see Rust, I like


we love rust too!


> To help improve Daft, we collect non-identifiable data.

> To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0

> [0] In short, we collect the following:

> On import, we track system information such as the runner being used, version of Daft, OS, Python version, etc.

> On calls of public methods on the DataFrame object, we track metadata about the execution: the name of the method, the walltime for execution and the class of error raised (if any). Function parameters and stacktraces are not logged, ensuring that user data remains private.

Is this telemetry really necessary?

[0] https://www.getdaft.io/projects/docs/en/latest/faq/telemetry...


I don't think they'd call it "really necessary", but I assume it's handy for understanding the user base and focusing on the right use cases in the long run, considering not all users of these libraries are active on GitHub.


It has an Apache 2.0 license, which I assume allows it to be repackaged with the telemetry removed.



