Slightly off topic, but just the other day we had a case where deleting ~100k entries from a table with a few small columns caused a huge delay, even though the table was indexed.
Context: MariaDB 10.2. In this table we stored traffic statistics, aggregated by IP address and day. Probably less than 200 bytes per row. Rows were written daily and cleaned up by a cron job (every month, an entire month's worth of data was discarded), so: "DELETE FROM traffic_stats WHERE day < ?"
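For illustration, the table was roughly shaped like this (a sketch from memory, the exact columns don't matter; what matters is the secondary index on "day" and the missing primary key):

    CREATE TABLE traffic_stats (
        ip_addr   VARBINARY(16)   NOT NULL,  -- aggregation key: IP address
        day       DATE            NOT NULL,  -- aggregation key: day
        bytes_in  BIGINT UNSIGNED NOT NULL,
        bytes_out BIGINT UNSIGNED NOT NULL,
        KEY idx_day (day)                    -- index used by the cleanup DELETE
    ) ENGINE=InnoDB;                         -- note: no PRIMARY KEY

    -- monthly cron job, roughly:
    DELETE FROM traffic_stats WHERE day < DATE_SUB(CURDATE(), INTERVAL 1 MONTH);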
This worked fine on the primary, but our monitoring told us that the replicas started to lag and showed that they spent inordinate amounts of time on the delete statement above, even though we had an index on the "day" column.
Even weirder, the problem didn't show up in our QA database cluster, which had far fewer resources (a VMware VM in QA vs. pretty solid bare metal in prod).
Even more puzzling, skipping the replication step and running the DELETE statement manually on a replica was quite fast (a second or so).
After some frantic searching, it turned out that, for reasons lost in time, the QA db cluster used "mixed" replication, while the prod cluster used row-based replication.
In row-based replication, the leader communicates to the replicas which rows were deleted. And since we didn't have a primary key on that table, the replicas did a full-table scan to find each record to delete, and repeated that for each of the 100k records.
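You can check which mode a server is using via the binlog_format variable:

    SHOW VARIABLES LIKE 'binlog_format';   -- was ROW on prod, MIXED in QA

    -- switching at runtime needs SUPER and only affects new sessions;
    -- persist it in my.cnf to survive restarts:
    SET GLOBAL binlog_format = 'MIXED';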
Adding a primary key to the table fixed the issue immediately, and we switched to mixed replication shortly after.
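The fix was essentially a one-liner (a sketch, using the hypothetical columns from above; the data was unique per IP and day anyway, so a natural key worked, though an AUTO_INCREMENT id would have done just as well):

    ALTER TABLE traffic_stats ADD PRIMARY KEY (ip_addr, day);

With a primary key in place, a replica applying a row-based DELETE can locate each row with a point lookup instead of a table scan.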
I'm very familiar with this sort of problem. I used to work in high-volume electronics manufacturing, and coercing OLTP databases with high and variable transaction volume into being 1) backupable and 2) replicable for analytics workloads often feels like a dark art. In a lot of cases, trial and error is a perfectly reasonable approach, for better or for worse.