They mention https://pola.rs/ and highlight how it has the exact same properties (out of core, etc.), but they don't include it in the benchmarks and there's no mention of it on their detailed comparison page.
It's because they are comparing to distributed solutions only. Polars is one of the fastest solutions for single-machine workflows, but it doesn't support distributed workflows.
Hello! Daft developer here - we don’t directly use Polars as an execution engine, but parts of the codebase (e.g. the expressions API) are heavily influenced by Polars code and hence you may see references to Polars in those sections.
We do have a dependency on the Arrow2 crate like Polars does, but that crate was recently deprecated, so both projects are having to deal with that right now.
I don't see any direct dependencies on Polars. But in the comments, they note that some parts are "taken from", "influenced by", "adapted from" and "based on" Polars.
Super tiny nitpick, but Dask is listed as a non-Arrow backend. Since it uses pandas as its underlying data structure, a fully Arrow-backed setup should be fine as long as the installed version of pandas is 2+.
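For example (a minimal sketch, assuming pandas >= 2.0 with pyarrow installed; the file path is just a placeholder), getting Arrow-backed dtypes in plain pandas looks like this, and Dask inherits whatever pandas produces:

    import pandas as pd

    # pandas 2.0+ can read straight into Arrow-backed dtypes.
    df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")

    # Existing frames can be converted too.
    df = df.convert_dtypes(dtype_backend="pyarrow")
    print(df.dtypes)  # e.g. int64[pyarrow], string[pyarrow]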
These benchmarks seem strange; it looks like some sort of database test suite? And compared to the performance listed on the benchmark's own page[1], the Daft results look abysmal? Daft takes several hours to complete ten queries, while the results the benchmark authors point to can do around a million queries per hour... oh, and the software they use for that is MS-SQL.
Hello! Daft developer here. The benchmarks we performed aren’t directly comparable to the benchmarks on TPC-H’s own page because of differences in hardware, storage etc.
For hardware, we were using AWS i3.2xlarge machines in a distributed cluster. And on the storage side we are reading Parquet files over the network from AWS S3. This is most representative of how users run query engines like Daft.
The TPC-H benchmarks are usually performed on databases which have pre-ingested the data into a single-node server-grade machine that’s running the database.
Note that Daft isn’t really a “database”, because we don’t have proprietary storage. Part of the appeal of using query engines like Daft and Spark is being able to read data “at rest” (as Parquet, CSV, JSON, etc.). However, this will definitely be slower than a database which has pre-ingested the data into indexed storage and proprietary formats!
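For a rough sense of what that looks like in practice (the bucket/path is a placeholder, and reader options may differ between Daft versions):

    import daft

    # Read Parquet "at rest" directly from object storage -- no ingestion step.
    df = daft.read_parquet("s3://my-bucket/tpch/lineitem.parquet")
    df.show()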
The QphH metric reported for TPC-H[1] is _not_ queries-per-hour, it's a composite metric:
> The performance metric reported by TPC-H is called the TPC-H Composite Query-per-Hour Performance Metric (QphH@Size), and reflects multiple aspects of the capability of the system to process queries.
so it's not directly comparable to seconds/query reported by Daft.
This looks really nice! It feels like in the last few years (probably with the increase in data science and analytics) dataframe engines have become really exciting; it's very cool to see so much innovation in the field.
One thing I haven't seen yet, and would be really interested to know: is column-level type hinting on the roadmap at all for this (or any similar tools)?
I'm a data engineer and my team uses mypy pretty heavily, but it can only really tell us if something is a "dataframe" type. It always strikes me that 90% of data issues during development (incorrect joins, column misspellings, non-type-appropriate operations, not factoring in nulls) would be eliminated if mypy had information about the columns and their data types.
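For context, the closest I can get today is a runtime check rather than a static one, something like this hand-rolled sketch (names are purely illustrative; libraries like pandera formalize the same idea):

    import pandas as pd

    # Declare the expected columns/dtypes once and assert against them early,
    # so a misspelled column or wrong dtype fails loudly at the boundary.
    ORDERS_SCHEMA = {"order_id": "int64", "customer_id": "int64", "amount": "float64"}

    def check_schema(df: pd.DataFrame, schema: dict) -> pd.DataFrame:
        missing = set(schema) - set(df.columns)
        if missing:
            raise TypeError(f"missing columns: {missing}")
        for name, dtype in schema.items():
            if str(df[name].dtype) != dtype:
                raise TypeError(f"{name}: expected {dtype}, got {df[name].dtype}")
        return df

    orders = check_schema(
        pd.DataFrame({"order_id": [1], "customer_id": [2], "amount": [3.5]}),
        ORDERS_SCHEMA,
    )

But mypy still sees both sides as a plain pd.DataFrame, which is exactly the gap.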
Interesting. Daft currently does validation on types/names only at runtime. The flow looks like:
1. Construct a dataframe (performs schema inference)
2. Access the (now well-typed) columns and run operations on those columns, with the associated validations.
Unfortunately step (1) can only happen at runtime and not at type-checking-time since it requires running some schema inference logic, and step (2) relies on step (1) because the expressions of computation are "resolved" against those inferred types.
However, if we can fix (1) to happen at type-checking time using user-provided type-hints in place of the schema inference, we can maybe figure out a way to propagate this information through to mypy.
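Roughly what the current runtime flow looks like (a sketch; exact method names may differ slightly between Daft versions):

    import daft

    # Step 1: schema inference happens at runtime, when the dataframe is built.
    df = daft.from_pydict({"name": ["a", "b"], "score": [1.0, 2.0]})
    print(df.schema())  # column names/dtypes are only known after this runs

    # Step 2: expressions are resolved against that inferred schema, so a typo
    # or a type-inappropriate operation raises here -- at runtime, not check time.
    df = df.with_column("score_doubled", daft.col("score") * 2)
    df.select(daft.col("name"), daft.col("score_doubled")).show()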
Would love to continue the discussion further as an Issue/Discussion on our Github!
This would be more comparable to being a backend for ibis. We're working on adding the remaining operations (like regex on strings or trigonometric functions) that ibis requires!
2. During a global shuffle stage (e.g. sorts, joins, aggregations), network transfer of data between nodes becomes the bottleneck.
This is why the advice is often to stick with a local solution such as DuckDB, Polars or Pandas if you can keep vertically scaling!
However, horizontally scaling does have some advantages:
- Higher aggregate network bandwidth for performing I/O with storage
- Auto-scaling to your workload’s resource requirements
- Scaling to large workloads which may not fit on a single machine. This is more common in Daft usage because we also work with multimodal data such as images, tensors, and other ML modalities.
Depends on your workload size: if the compute time is less than the time it takes to transfer the data back and forth, then it might indeed be a bad idea to use it.
Typically these solutions should only be tried after maxing out vertical scaling; only then is it worth reaching for horizontal scaling.
Network is one hell of a destroyer when it comes to advantages gained from distributed computing.
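A quick back-of-envelope check makes this concrete (all the numbers below are made-up assumptions; plug in your own):

    # Is it worth shipping the data to a cluster at all?
    data_gb = 100            # working-set size
    link_gbit_per_s = 10     # effective network bandwidth
    local_compute_s = 60     # time to just run the job on one machine

    transfer_s = data_gb * 8 / link_gbit_per_s   # ~80 s one way
    print(f"transfer alone: {transfer_s:.0f}s vs local compute: {local_compute_s}s")
    # If the transfer alone already exceeds the local compute time,
    # distributing the work is a net loss.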
Horizontal scaling, however, provides you with more aggregate network bandwidth. Most enterprises run workloads that download data from a data lake (usually S3), and that download is usually the bottleneck. Horizontal scaling here lets the query leverage much higher aggregate network bandwidth than a single large machine would have.
> To help improve Daft, we collect non-identifiable data.
> To disable this behavior, set the following environment variable: DAFT_ANALYTICS_ENABLED=0
> [0] In short, we collect the following:
> On import, we track system information such as the runner being used, version of Daft, OS, Python version, etc.
> On calls of public methods on the DataFrame object, we track metadata about the execution: the name of the method, the walltime for execution and the class of error raised (if any). Function parameters and stacktraces are not logged, ensuring that user data remains private.
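If you'd rather flip that switch from Python than from the shell, setting the variable before the import should be the safe order, since the docs above say tracking starts on import (a minimal sketch):

    import os

    # Opt out of Daft's analytics before the library is imported.
    os.environ["DAFT_ANALYTICS_ENABLED"] = "0"

    import daft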
I don't think they call it "really necessary", but I assume it's handy for understanding the user base and focusing on the right use cases in the long run, considering that not all users of these libraries are active on GitHub.