
For "medium data", my company has found a lot of success using dask [0], which mimics the pandas API, but can scale across multiple cores or machine.

The community around dask is quite active and there's solid documentation to help you learn the library. I cannot recommend dask enough for medium data projects if you want to stay in Python.

They have a great rundown of dask vs pyspark [1] to help you understand why you'd use it.

[0] http://dask.pydata.org/en/latest/

[1] http://dask.pydata.org/en/latest/spark.html

I've been converting all of my Luigi pipeline tasks from Pandas to Dask so that I can push a lot more data through. It's been an easy process so far, and I like how simply it gets me parallel computing.
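
Roughly, the swap looks like the sketch below: a Luigi task whose run() builds a lazy Dask dataframe instead of loading everything through pandas. Task, path, and column names here are hypothetical, not from any real pipeline.

    import luigi
    import dask.dataframe as dd

    class AggregateEvents(luigi.Task):
        """Aggregate a directory of CSVs with Dask instead of one big pandas load."""

        def output(self):
            return luigi.LocalTarget("output/aggregated.csv")

        def run(self):
            df = dd.read_csv("input/events-*.csv")       # lazy, partitioned read
            agg = df.groupby("category")["value"].sum()  # same API as pandas
            agg.compute().to_csv(self.output().path)     # runs in parallel, then writes

    if __name__ == "__main__":
        luigi.build([AggregateEvents()], local_scheduler=True)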
