We fell into this "trap" as well. While working on a marketing automation system, we were building a Google Analytics/Piwik clone. Our guesstimates indicated we would be storing around 100GB of events per month. We geared up and got to work. The team built complex Hadoop jobs, Pig & Sqoop scripts, lots of high-level abstractions to make writing jobs easier, lots of infrastructure, and so on. After about two months we scrapped the "big data" idea and redid everything in two weeks using PostgreSQL. As most of the queries were by date, partitioning made a huge difference.
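For anyone curious, a minimal sketch of the kind of date-range partitioning that makes "queries by date" cheap. The table and column names are made up for illustration, not our actual schema, and this uses modern declarative partitioning (PostgreSQL 10+); at the time we likely would have done it with the older inheritance/trigger approach.

    # Sketch only: hypothetical events table partitioned by month.
    # Queries constrained by date let the planner prune all other
    # partitions, so they only scan the relevant month(s).
    import psycopg2

    ddl = """
    CREATE TABLE events (
        occurred_at timestamptz NOT NULL,
        payload     jsonb
    ) PARTITION BY RANGE (occurred_at);

    CREATE TABLE events_2023_01 PARTITION OF events
        FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');
    CREATE TABLE events_2023_02 PARTITION OF events
        FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');
    """

    with psycopg2.connect("dbname=analytics") as conn:
        with conn.cursor() as cur:
            cur.execute(ddl)

A query like SELECT count(*) FROM events WHERE occurred_at >= '2023-02-01' AND occurred_at < '2023-03-01' then touches a single partition instead of the whole table.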
I recall one of the classes was named SimpleJobDescriptor. Toward the end it was 500+ lines long. Not so simple after all.