I had the task of replicating data from DynamoDB to Redshift on AWS on two large projects over the last few years. The primary factor in choosing DynamoDB is that it's cheap and scales like nothing else in the galaxy.
Then the other shoe dropped. The data was needed for reports. Reports require a static schema. NoSQL (and the developers who love it) despises static schemas. "I can add properties whenever I want!"
This process of analyzing and reporting on production data becomes a very time-consuming, costly, and brittle exercise.
So then you have to ask: did we save money on the design side (using NoSQL over SQL) only to piss it away on reporting?
I'd argue AWS and the other cloud providers need to create a SQL-capable relational database as a service. This would (I hope) solve the problem.
But in the meantime, let's build our microservices on relational databases so we can actually get aggregate data to stakeholders in real time.
>So then you have to ask: did we save money on the design side (using NoSQL over SQL) only to piss it away on reporting?
I spent two years early in my career as a report writer. It sucked horribly, but I learned how important reports are to the business. I remember a professor telling me that in any business application, reporting is 50% of the value. This is something many developers don't even consider.
Having said that, the solution to the reporting issue with NoSQL, as I understand it, is to move your NoSQL data to a proper warehouse for reporting/BI. You'll of course need an ETL layer to do that, and the ETL layer will need to be smart enough to handle very nullable data, data structures that often pick up extra, unknown properties, and so on.
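To make that concrete, here's a minimal sketch (in Python) of what one transform in that ETL layer might look like; the column list and item shape are invented for illustration, not taken from any real system:

```python
# Hypothetical sketch: normalize loosely-shaped NoSQL items into a fixed
# reporting schema. Column list and item shape are illustrative assumptions.
from typing import Any

REPORT_COLUMNS = ["order_id", "customer_id", "total", "created_at"]

def to_report_row(item: dict[str, Any]) -> dict[str, Any]:
    """Map one NoSQL document onto the fixed reporting schema.

    Missing properties become None (NULL downstream); properties the
    warehouse schema doesn't know about are captured rather than dropped,
    so they can be reviewed and promoted to real columns later.
    """
    row = {col: item.get(col) for col in REPORT_COLUMNS}       # nullable by default
    row["extra_json"] = {k: v for k, v in item.items()
                         if k not in REPORT_COLUMNS} or None   # unknown properties
    return row

# Example: a document written after someone "added a property whenever they wanted"
print(to_report_row({"order_id": "o-1", "total": 42, "gift_wrap": True}))
# {'order_id': 'o-1', 'customer_id': None, 'total': 42,
#  'created_at': None, 'extra_json': {'gift_wrap': True}}
```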
This solution continues to work even when you have multiple, loosely coupled NoSQL databases handling various parts of a larger system and various separate systems.
This may be true, but reports also normally require a different schema from the OLTP database. Data warehousing with star schemas is a thing, and it doesn't matter whether your OLTP database is an RDBMS or a NoSQL solution: you're going to have to push data to your warehouse to run those reports. The last thing you want to be doing is running reporting queries against a database that's required for the operation of your product, even if it is off hours. And reports will never run efficiently if they use a schema that's designed around the requirements of your business's application. For one thing, your update performance will tank, since every write requires a ton of index writes. You can mitigate that a bit by maintaining separate indexes on a reporting replica, but that kind of maintenance has its own set of headaches.
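For anyone who hasn't seen one, here's a toy star schema sketched with Python's sqlite3, just to make the shape concrete; every table and column name here is invented for illustration:

```python
# Toy star schema (names invented for illustration): one wide, denormalized
# dimension table per analysis axis, plus a narrow fact table of measures.
# The reporting schema deliberately differs from the OLTP schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date (
    date_key     INTEGER PRIMARY KEY,   -- e.g. 20240131
    full_date    TEXT,
    month_name   TEXT,
    year         INTEGER
);
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key, not the OLTP id
    source_id    TEXT,                  -- natural key from the OLTP system
    region       TEXT
);
CREATE TABLE fact_sales (
    date_key     INTEGER REFERENCES dim_date(date_key),
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    amount       REAL                   -- additive measure
);
""")
# Reporting queries aggregate the fact table, grouped by dimension attributes:
# SELECT d.year, c.region, SUM(f.amount) FROM fact_sales f
# JOIN dim_date d USING (date_key) JOIN dim_customer c USING (customer_key)
# GROUP BY d.year, c.region;
```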
So I'm not seeing how reporting is a good rationale for choosing one technology over another. Reporting, when done right, is a completely separate system with a completely distinct technology choice.
The level of effort to move data from DynamoDB to a data warehouse (which is what Redshift is) is not trivial. It may be that the Data Pipeline tools within AWS are still opaque and incomplete, but I've seen the same problem with MongoDB implementations.
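To give a sense of the moving parts, here's a hedged sketch of the common DynamoDB -> S3 -> Redshift COPY pattern using boto3; the table, bucket, and IAM role names are placeholders, and all error handling, batching, and incremental-load concerns are omitted:

```python
# Hedged sketch of the DynamoDB -> S3 -> Redshift COPY pattern.
# Table, bucket, and role names are placeholders; error handling omitted.
import json
import boto3

dynamo = boto3.resource("dynamodb").Table("orders")      # placeholder table
s3 = boto3.client("s3")

def export_to_s3(bucket: str, key: str) -> None:
    """Full scan of the DynamoDB table, staged to S3 as JSON lines."""
    lines, kwargs = [], {}
    while True:
        page = dynamo.scan(**kwargs)                      # paginated full scan
        lines += [json.dumps(item, default=str) for item in page["Items"]]
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    s3.put_object(Bucket=bucket, Key=key, Body="\n".join(lines).encode())

# Then, on the Redshift side (run via your SQL client of choice):
# COPY report.orders FROM 's3://my-staging-bucket/orders.jsonl'
#     IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'  -- placeholder
#     FORMAT AS JSON 'auto';
```

And that's before you've dealt with the schema mismatch at all; the COPY just gets the raw documents into the warehouse's staging area.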
Reporting is nearly always an afterthought in system architecture and it should be a first-class requirement.
The reason that moving data from a transactional platform to a warehouse platform is not trivial is that you actually have to have ETL processes in place to transform data from one architecture to the other.
A data warehouse has nothing to do with the mechanism that stores it. It is a design that comes with its own challenges, just like transactional systems do. Transactional systems are optimized for writing; warehouses are optimized for reading. The way you choose to execute those optimizations is up to you. But getting data from one of those systems to the other is never going to be trivial, and there is no general-purpose tool to get things from one format to another, because the warehouse structure is too entity-specific to generalize beyond a certain point (and reaching that point of generalization is really the process of creating the design; see the Kimball Group's books for more information).
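As a concrete (and deliberately entity-specific) example of why no general-purpose tool can write this for you, here's a hypothetical fact-load step in Python that resolves an OLTP natural key to a warehouse surrogate key; the record shape and column names are assumptions for illustration, not anything from a real system:

```python
# Illustrative (not general-purpose) ETL step: resolving OLTP natural keys
# to warehouse surrogate keys while loading a fact row. The entity-specific
# mapping logic is exactly the part no generic tool can generate for you.
def load_fact_row(txn: dict, customer_dim: dict[str, int]) -> dict:
    """Transform one transactional record into a fact-table row.

    `customer_dim` maps OLTP customer ids to dimension surrogate keys;
    both the input shape and the output columns are invented for the example.
    """
    surrogate = customer_dim.get(txn["customer_id"])
    if surrogate is None:
        surrogate = -1                       # "unknown" dimension member
    return {
        "customer_key": surrogate,
        "date_key": int(txn["created_at"].replace("-", "")),  # 2024-01-31 -> 20240131
        "amount": txn["amount"],
    }

print(load_fact_row({"customer_id": "c-9", "created_at": "2024-01-31", "amount": 10.5},
                    {"c-9": 42}))
# {'customer_key': 42, 'date_key': 20240131, 'amount': 10.5}
```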
Redshift is not a data warehouse any more than Postgres is a data warehouse. In fact, Redshift is a fork of Postgres that leaves out the functionality, data types, and indexes that make it difficult to scale horizontally. Postgres also isn't a data warehouse. Nor is it a transactional system. It's just a tool that you can use to build either one of those things with.
OP probably means not worrying about instances, scaling, or over/under-provisioning, etc., just service usage at a more fine-grained level... more like Spanner.
It's been a while since I looked at it, but isn't that how Aurora works? Isn't that (along with performance) their argument for why you should use it over MySQL or Postgres in RDS?
Exactly. I don't want to care about the server or the hardware or the OS. Just give me a database with tables and standard SQL querying capabilities and performance.
I wasn't suggesting that. I was saying there's no real reason transactional and reporting data can't start with a schema-for-schema replication of the transactional store, and that doing so is actually more efficient.