
This also isn't a strict either/or proposition. I build local command-line pipelines and do my testing and/or processing there. When the amount of data to be processed grows to the point where memory or network bandwidth makes the processing more efficient on a Hadoop cluster, I make some fairly minimal conversions and run the same stream processing on the cluster in streaming mode. It hasn't been uncommon for my jobs to be much faster than the same jobs run on the cluster with Hive or some other framework. Much of the speed difference boils down to the optimizer and the planner.

Overall I find it very efficient to use the same toolset locally and then scale it up to a cluster when and if I need to.



What toolset are you using that you can run both locally and on a Hadoop cluster?


Almost all of them?

The vocabulary of the grandparent comment implies they are using Hadoop's streaming mode, so one can use a map-reduce streaming abstraction such as MRJob or just plain stdin/stdout; both will work locally and in cluster mode.
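
To make the plain stdin/stdout option concrete, here is a minimal word-count sketch. It is not the grandparent's actual code; the file name wc.py, the function names, and the "map"/"reduce" arguments are all made up for illustration.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word (tab-separated key/value lines)."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(lines):
    """Sum the counts per key. Input must arrive sorted by key, which the
    Hadoop shuffle (or a local `sort`) is assumed to provide."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> sort -> reduce in one process, for quick local tests."""
    return list(reducer(sorted(mapper(lines))))


# Hypothetical command-line convention: `python wc.py map` or `python wc.py reduce`.
STAGES = {"map": mapper, "reduce": reducer}

if __name__ == "__main__" and len(sys.argv) == 2 and sys.argv[1] in STAGES:
    for record in STAGES[sys.argv[1]](sys.stdin):
        print(record)
```

Locally that would run as something like `cat input.txt | python wc.py map | sort | python wc.py reduce`. On a cluster the same two commands would be handed to Hadoop streaming as the mapper and the reducer, with the framework doing the sort between them.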

Or, if static typing is more agreeable to your development process, running Hadoop in "single machine cluster" mode is relatively painless. The same goes for other distributed processing frameworks like Spark.


I believe he mentioned it: Hadoop streaming mode.



