
What is the point? Who would want to use Hadoop for something below 10 GB? Hadoop is not good at doing what it is not designed for? How useful.


Kind of depends on what the 10 GB is. For example, on my project, we started with files that were about 10 GB a day. The old system took 9 hours to enhance the data (add columns from other sources based on simple joins). So we did it with Hadoop on two Solaris boxes (18 virtual cores between them). Same data; 45 minutes. But wait, there's more.
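If you're curious what that enhancement step looks like, it's basically a join. Here's a rough Cascading 2.x sketch, not our actual job; the field names, paths, and the single reference source are made up for illustration:

    import java.util.HashMap;
    import java.util.Map;

    import cascading.flow.Flow;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.CoGroup;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class EnhanceClaims {
      public static void main(String[] args) {
        // Daily claim file plus a reference file that supplies the extra columns.
        Tap claims = new Hfs(new TextDelimited(new Fields("claim_id", "provider_id", "amount"), "\t"), args[0]);
        Tap providers = new Hfs(new TextDelimited(new Fields("provider_id", "specialty", "region"), "\t"), args[1]);
        Tap enhanced = new Hfs(new TextDelimited(true, "\t"), args[2], SinkMode.REPLACE);

        Pipe claimPipe = new Pipe("claims");
        Pipe providerPipe = new Pipe("providers");

        // The "enhancement": add provider columns to every claim line via a simple join.
        Pipe joined = new CoGroup(claimPipe, new Fields("provider_id"),
                                  providerPipe, new Fields("provider_id"),
                                  new Fields("claim_id", "provider_id", "amount",
                                             "provider_id2", "specialty", "region"));

        Map<String, Tap> sources = new HashMap<String, Tap>();
        sources.put("claims", claims);
        sources.put("providers", providers);

        Flow flow = new HadoopFlowConnector().connect(sources, enhanced, joined);
        flow.complete();
      }
    }

The real job joins against several reference sources, but the shape is the same: taps in, a CoGroup, a tap out.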

We then created two fraud models that took that 10+ GB file (enhancement added about 10%) and executed within about 30 minutes apiece. But concurrently. All on Hadoop. All on arguably terrible hardware. Folks at Twitter and Facebook had never thought about using Solaris.

We've continued this pattern. We've switched tooling from Pig to Cascading because Cascading works in the small (your PC without a cluster) and in the large. It's testable with JUnit in a timely manner (looking at you, PigUnit). Now we have some 70 fraud models chewing over anywhere from that 10+ GB daily file set to 3 TB. All this on our little 50-node cluster. All within about 14 hours. Total processed data is about 50 TB a day.
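The "works in the small" bit is what sold us: the same kind of pipe assembly runs against plain local files inside an ordinary JUnit test, no cluster needed. A minimal sketch using Cascading 2.x local mode (the test data, field names, and filter rule are invented):

    import static org.junit.Assert.assertTrue;

    import java.io.File;

    import org.junit.Test;

    import cascading.flow.Flow;
    import cascading.flow.local.LocalFlowConnector;
    import cascading.operation.expression.ExpressionFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Pipe;
    import cascading.scheme.local.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.local.FileTap;
    import cascading.tuple.Fields;

    public class ClaimFilterTest {

      @Test
      public void dropsZeroDollarClaims() {
        Tap source = new FileTap(new TextDelimited(new Fields("claim_id", "amount"), "\t"),
                                 "src/test/data/claims.tsv");
        Tap sink = new FileTap(new TextDelimited(new Fields("claim_id", "amount"), "\t"),
                               "target/test-output/claims-filtered.tsv", SinkMode.REPLACE);

        Pipe pipe = new Pipe("claims");
        // ExpressionFilter removes tuples where the expression is true,
        // so only claims with a positive amount survive.
        pipe = new Each(pipe, new Fields("amount"),
                        new ExpressionFilter("amount <= 0.0", Double.class));

        Flow flow = new LocalFlowConnector().connect(source, sink, pipe);
        flow.complete();

        assertTrue(new File("target/test-output/claims-filtered.tsv").exists());
      }
    }

Swap LocalFlowConnector for HadoopFlowConnector and FileTap for Hfs and the same assembly runs on the cluster.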

As pointed out earlier, Hadoop provides an efficient, scalable, and easy-to-use platform for distributed application development. Cascading makes life very Unix-like (pipes and filters and aggregators). This, coupled with a fully asynchronous event pipeline for workflows built on RabbitMQ, makes for an infinitely expandable fraud detection system.
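The Unix analogy is pretty literal. Something like "grep | sort | uniq -c" becomes a filter, a GroupBy, and an aggregator. A toy sketch (the field names and regex are made up):

    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.tuple.Fields;

    public class ProviderCounts {
      // Roughly the moral equivalent of: grep DENIED claims | cut -f2 | sort | uniq -c
      public static Pipe assembly() {
        Pipe pipe = new Pipe("claims");
        // filter: keep only lines whose status matches
        pipe = new Each(pipe, new Fields("status"), new RegexFilter("DENIED"));
        // sort: group by provider
        pipe = new GroupBy(pipe, new Fields("provider_id"));
        // uniq -c: count claims per provider
        pipe = new Every(pipe, new Count(new Fields("denied_count")));
        return pipe;
      }
    }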

Since all processors communicate only through events and HDFS, we can add new fraud models without necessarily taking down the entire system. New models may arrive daily; they conform to a set structure and literally self-install from a zip file within about a minute.
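To give a flavor of the eventing side: each processor just listens on a queue and reacts when an upstream step announces its output. A hypothetical consumer using the RabbitMQ Java client (the queue name, message format, and host are invented, and the actual flow launch is omitted):

    import com.rabbitmq.client.Channel;
    import com.rabbitmq.client.Connection;
    import com.rabbitmq.client.ConnectionFactory;
    import com.rabbitmq.client.DeliverCallback;

    public class ModelTrigger {
      public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");  // broker host is environment-specific

        Connection connection = factory.newConnection();
        Channel channel = connection.createChannel();

        // Hypothetical queue: "enhancement done" events carry the HDFS path of the new file set.
        channel.queueDeclare("claims.enhanced", true, false, false, null);

        DeliverCallback onEvent = (consumerTag, delivery) -> {
          String hdfsPath = new String(delivery.getBody(), "UTF-8");
          // Kick off the fraud-model flow against this path; details omitted.
          System.out.println("launching models against " + hdfsPath);
        };

        channel.basicConsume("claims.enhanced", true, onEvent, consumerTag -> { });
      }
    }

Because nothing calls anything else directly, a new model only needs to know which queue to listen on and where its input lands in HDFS.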

We used the same event + Hadoop architecture to add claim line edits. These are different from fraud models in that fraud models calculate multiple complex attributes and then apply a heuristic to the claim lines. Edits operate on a much smaller scope. But in Cascading this is just: pipe from HDFS -> filter for interesting claim lines -> buffer for denials -> pipe to HDFS output location.
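In code, that whole edit is roughly the sketch below (Cascading 2.x; the field names, "interesting" regex, and denial rule are placeholders, not our real edit logic):

    import java.util.Iterator;

    import cascading.flow.Flow;
    import cascading.flow.FlowProcess;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.BaseOperation;
    import cascading.operation.Buffer;
    import cascading.operation.BufferCall;
    import cascading.operation.regex.RegexFilter;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;
    import cascading.tuple.Tuple;
    import cascading.tuple.TupleEntry;

    public class ClaimLineEdit {

      /** Emits one denial record per claim whose lines violate the (made-up) edit rule. */
      public static class DenialBuffer extends BaseOperation implements Buffer {
        public DenialBuffer() {
          super(new Fields("claim_id", "denial_reason"));
        }

        @Override
        public void operate(FlowProcess flowProcess, BufferCall bufferCall) {
          Iterator<TupleEntry> lines = bufferCall.getArgumentsIterator();
          while (lines.hasNext()) {
            TupleEntry line = lines.next();
            if (line.getDouble("units") > 50) {  // placeholder edit rule
              bufferCall.getOutputCollector().add(
                  new Tuple(line.getString("claim_id"), "EXCESSIVE_UNITS"));
              return;  // one denial per claim is enough
            }
          }
        }
      }

      public static void main(String[] args) {
        Fields claimFields = new Fields("claim_id", "procedure_code", "units");
        Tap source = new Hfs(new TextDelimited(claimFields, "\t"), args[0]);           // pipe from HDFS
        Tap sink = new Hfs(new TextDelimited(true, "\t"), args[1], SinkMode.REPLACE);  // pipe to HDFS output

        Pipe pipe = new Pipe("claim-lines");
        pipe = new Each(pipe, new Fields("procedure_code"), new RegexFilter("^97\\d{3}$")); // filter for interesting claim lines
        pipe = new GroupBy(pipe, new Fields("claim_id"));                                   // a buffer needs grouped input
        pipe = new Every(pipe, claimFields, new DenialBuffer(), Fields.RESULTS);            // buffer for denials

        Flow flow = new HadoopFlowConnector().connect(source, sink, pipe);
        flow.complete();
      }
    }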

Simple, scalable, logical, improvable, testable: I've seen all of these firsthand. As the community comes out with new tools, we get more options. My team is working on machine learning and graphs (Mahout and Giraph). Hard to do all of this easily with a home-grown data processing system.

As always, research your needs. Don't get caught up in the hype of a new trend. Don't be closed-minded either.
