This seems to be a pretty common sentiment here. But here's my question: if your startup doesn't have big data now, should you assume it will never have big data?
I work with a startup which currently doesn't have "big data", but perhaps "medium data". I can probably manage without a big data stack in production. But if I look at the company/sales targets, then in the next 6-12 months we will be working with clients that guarantee "big data".
Now, here are my choices -
1. Stick to Python scripts and/or large AWS instances because they work for now. If the sales team closes deals in the next few months after working tirelessly, they will have sold the client a great solution, but in reality we can't scale, and we fail.
2. Plan for your startup to succeed, and plan according to the company targets. Try to strike a balance: a production stack that isn't huge overkill now, but isn't underbuilt either.
It's easy to say we shouldn't use a big data stack until we have big data, but it's too late (especially for a startup) to start building a big data stack after you already have big data.
Been there. Believe me: stick to Python scripts! Always. And when you finally land that first customer and you have trouble scaling, first scale vertically (buy a better machine) and work day and night to build a scalable solution. But no sooner.
Why? Because your problem is not technical, it is business related. You have no idea why your startup will fail or why you will need to pivot. Because if you did, it wouldn't be a startup. Or you would have had that client already.
You might need to throw away your solution because it is not solving the right problem. Actually, it is almost certain that it is solving a problem nobody is prepared to pay for. So stick to Python until people start throwing money at you - because you don't have a product-market fit yet. And your fancy Big Data solution will be worth nothing, because it will be so damn impossible to adapt it to new requirements.
I wish I could send this comment back in time to myself... :-/ But since I can't, how about at least you learn from my mistakes and not yours?
With tools like message queues and Docker making it so easy to scale horizontally, you often don't even have to scale vertically.
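The queue-plus-workers pattern behind that claim can be sketched in a few lines. This is a stdlib-only toy (in a real deployment the queue would be something like RabbitMQ or SQS, and each worker its own Docker container — those names are my assumption, not from the thread): jobs go into a shared queue, and you scale by adding workers.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# A shared work queue: producers put jobs in, workers pull them out.
work = queue.Queue()
for n in range(100):
    work.put(n)

results = []

def worker():
    # Each worker drains jobs until the queue is empty.
    while True:
        try:
            n = work.get_nowait()
        except queue.Empty:
            return
        results.append(n * n)  # stand-in for real processing

# "Scaling horizontally" here just means adding more workers.
with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(worker)

total = sum(results)
print(total)  # sum of squares 0..99
```

The point is that the application code never changes when you add capacity — you only turn the worker count (or container count) up.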
We just won an industry award at work for a multi-billion-data-point spatial analysis project that was all done with Python scripts + Docker on EC2 and PostgreSQL/PostGIS on RDS. A consultant was working in parallel with Hadoop etc. and we kept up just fine. Use what works, not what is "best".
> With tools like message queues and Docker making it so easy to scale horizontally you don't even have to go vertically.
That depends entirely on the workload. It's not always a good idea to move from one SQL instance to a cluster of them. Just buy the better machine; that buys you time to build a truly scalable solution.
I assume searching for [employer] could direct someone here, whereas searching for knz/me specifically would take you to my HN profile page.
I'm not ashamed of anything I've said on HN but would rather not have people just searching for my employer ending up here (especially since I work in an office that routinely deals with sensitive political and community issues). It's a minor amount of (perceived) anonymity vs stating my name/job title/employer here!
Well, they could start by using something faster than Python. I would tend to use Common Lisp, but Clojure would be the more modern choice.
But yes, scaling up is far easier than scaling out. A box with 72 cores and 1.5TB of DRAM can be had for around $50k these days. I think it would take a startup a while to outgrow that.
Python is plenty fast where it matters. You have heavily optimized numerical and scientific libraries (numpy and scipy) and can easily drop down to C if it matters that much to you. But in my experience, bad performance is usually a result of the wrong architecture and algorithms, sometimes even outright bugs, often introduced by "optimization" hacks which only make code less readable.
This holds for all languages, of course, not only Python. Forget raw speed, it is just the other end of the stick from Hadoop. Believe me, you don't need it. And even when you think you do, you don't. And when you have measured it and you still need it, ok, you can optimize that bottleneck. Everywhere else, choose proper architecture and write maintainable code and your app will leave others in the dust. Because it is never just about the speed anyway.
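A toy illustration of that point (my example, not from the thread): picking the right data structure dwarfs any constant-factor speedup a faster language would give you. Here the same membership query is answered with a list (O(n) per lookup) and a set (O(1) per lookup):

```python
import time

items = list(range(5000))
needles = list(range(0, 10000, 2))  # half present, half absent

# Wrong data structure: linear scan of the list per lookup.
start = time.perf_counter()
hits_list = sum(1 for n in needles if n in items)
list_time = time.perf_counter() - start

# Right data structure: hash lookup per query.
item_set = set(items)
start = time.perf_counter()
hits_set = sum(1 for n in needles if n in item_set)
set_time = time.perf_counter() - start

# Same answer, wildly different cost.
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")
```

No compiler or "fast language" rescues the list version; the algorithmic fix does, in any language.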
I agree that it depends on what you're doing, and the speed of the language often doesn't matter -- that has to be the case or Python would never have caught on to begin with.
But you can write code in Common Lisp or Clojure that's just as readable and maintainable (once you learn the language, obviously) as anything you can write in Python, and the development experience is just as good if not better.
Your choice of "big data" vs. python scripts sounds just like the classic trade-off of scope creep vs. good enough.
IMO the answer is almost always "good enough". This has been expressed in countless tropes/principles from many wise people, like KISS (Keep It Simple, Stupid), YAGNI (You Ain't Gonna Need It), "premature optimization is the root of all evil", etc.
If you go the YAGNI route, then when your lack of scale comes back to bite you (a happy problem to have), you'll have hard data about what exactly needs to be scaled, and you'll build a much better system. Otherwise, you'll dig deeper into the premature-optimization rabbit hole of hypotheticals, and in that case, it's turtles all the way down (to use another trope).