
For "medium data", my company has found a lot of success using dask [0], which mimics the pandas API, but can scale across multiple cores or machine.

The community around dask is quite active and there's solid documentation to help you learn the library. I cannot recommend dask enough for medium data projects if you want to stay in Python.

They have a great rundown of dask vs pyspark [1] to help you understand why you'd use it.

[0] http://dask.pydata.org/en/latest/

[1] http://dask.pydata.org/en/latest/spark.html

I've been converting all of my Luigi pipeline tasks from Pandas to Dask so that I can push a lot more data through. It's been an easy process so far, and I like how simply it gets me parallel computing.
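
Roughly, the swap looks like the sketch below: a Luigi task whose run() builds a lazy Dask dataframe instead of loading everything through pandas. Task, path, and column names here are hypothetical, not from any real pipeline.

    import luigi
    import dask.dataframe as dd

    class AggregateEvents(luigi.Task):
        """Aggregate a directory of CSVs with Dask instead of one big pandas load."""

        def output(self):
            return luigi.LocalTarget("output/aggregated.csv")

        def run(self):
            df = dd.read_csv("input/events-*.csv")       # lazy, partitioned read
            agg = df.groupby("category")["value"].sum()  # same API as pandas
            agg.compute().to_csv(self.output().path)     # runs in parallel, then writes

    if __name__ == "__main__":
        luigi.build([AggregateEvents()], local_scheduler=True)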
