Why is my CDC So Slow?
Whether it’s our new Netezza Performance Server (NPS) or your IPDA (Mako, etc) the need for change-data-capture from a source never goes away. Our objective is a Netezza table with the identical content and structure of the source database’s counterpart, a best practice in Netezza, to setup a database facing a source, filled with tables identical to the source, for purposes of data ingestion and assimilation.
Let’s pause for a moment. Because many Netezza users stop at this point. The data is in the box. It looks like the source. Let’s build some reports now.
Not so fast. What if we have multiple disparate sources? What if those source have table name collisions (two might have a CUSTOMER while two others have a VENDOR). Not to worry — we simply stand-up another intake database to face that source, likewise fill it with tables identical to the source, and we can cookie-cutter all our sources this way.
But — don’t presume when the data arrives, it’s consumption-ready. It looks like the source — a transactional database. And the tables are optimized for transactional activities. We couldn’t join these tables in the source because they took too long. Over here in Netezza, I can join them all in a fraction of the time. We love this machine!
Uhh — wait a second. Another pause. In another essay, I pointed out there’s no such thing as a general-purpose schema. We should optimize each database toward a purpose. We should not later leverage it for another purpose. Both purposes will suffer and their constituents will, too.
If we build these tables for ingestion and then deploy them for reporting, someone will rightly say — “Let’s make sure the distribution keys face the reporting users.” So they will modify the tables and the users will have a better time than they had before.
But then — something strange happens. The Change-Data-Capture (CDC) process grinds for hours even though it’s only trickling data. What’s going on?
Let’s look under the covers at the mechanics of a CDC operation and clarify.
For any given table, CDC caches records in memory over a timed interval and applies them to the Netezza table. This micro-batch scenario fits Netezza databases well. CDC uses a delete/insert, which will cover all deletes, inserts, and updates (because an update is a delete/insert in Netezza). Once complete, the contents of the Netezza table and source table are in sync.
The critical piece is this: the distribution key. I mention this in other essays, but this is an important application of the principle.
If we want to delete say, 1000 records from a table with say, 100 dataslices distributed on RANDOM, the operation doesn’t know where any data might be, so must ship the keys from all 1000 records to all 100 dataslices. The data is random and the necessary keys could be anywhere and everywhere. The delete operation will grind as it takes on this workload.
Contrast this to a table distributed on a key, and we use the key in the delete operation (such as a primary key). The 1000 keys to delete belong to a given dataslice. So this body is divided into 100 parts, by key, and each part shipped to the appropriate dataslice. The operation is lightning-fast, and is 100 times faster than its RANDOM version.
This is why we say — never, ever perform a delete or update on a random table, or if we won’t use the distribution key in the delete operation. Deletes and Updates must co-locate, or they will grind.
Back to the CDC delete operation. We know we will update or delete on a primary key (from the source tables). This means the distribution key must match the primary key. We don’t need the whole primary key (if the key is multiple columns) just the major key. If we put every CDC-facing table’s major primarykey as the distribution key, the delete operation will always co-locate. The data will fly into the machine and the CDC operations themselves will be largely quiet.
Here’s where it gets tricky. Your DBA will likely say — now the data’s in the box, run your reports and have fun.
Uhh – not so fast.
As mentioned above, you’ll want reports with fast performance. And you’ll want to co-locate the tables where you can. Except here, you can’t. The tables are tuned for the source. Not for reports. The keys make the source ingestion fast. If we change distribution keys for reports, our reports will be faster but it will break our CDC.
What we need is another copy of the source table(s) in another database, call it an operational store or a repository. This repository will also be third normal form — like the ingestion database(s) but it’s a place to integrate all sources to get a picture of the whole business. We will still likely use the same primary keys for distribution keys — and will not run reports from this database, either.
The repository has a purpose, too — business integration. It doesn’t do data intake. It doesn’t do reporting. It fulfills the role of integrating business data into a cohesive, easily-understood whole.
Now – we get to the reporting database. Here we will formulate a star-like schema with facts and dimensions. We have more freedom to hybridize these structures because we will consolidate tables into dimensions and structure our facts to fit the user needs. It is here we will co-locate distribution keys, optimize all our joins, and apply zone maps.
I’ll talk about this flow in another essay, and how to easily capture most-recent changes in a Netezza table.
Those who don’t take the steps to reformulate the data into these additional structures will witness another common problem in Netezza applications:
Think about our transactional database, now mirrored into the staging/ingestion database, with dozens upon dozens of tables. For a reporting tool to join these together, it’s common to see ten, fifteen, twenty tables in a single operation. So complex, people will build one or more views to make sure nobody gets confused on the logic.
But under the covers, this is what happens: Each user query will, on demand, re-integrate, reformulate, and re-calculate the very same information it just got finished re-formulating, re-integrating, and re-calculating for the last on-demand query.
One of the benefits of a data warehouse, particularly a star-schema, the data is pre-formulated, pre-integrated, and pre-calculated at the time it is stored. The on-demand user query simply consumes it, because it’s consumption-ready.
Data in “raw” transaction tables is not consumption-ready. So, it’s important to take this one-more-step and move the content out of staging and into a consumption database with tables optimized for this purpose. Then we build-once-consume many.
I’ll take a deep-dive on integration-on-demand in another essay, but for now think about what our CDC-based solution has done: We reconfigure the CDC-facing tables with a distribution key matching the source table’s primary key. This helps CDC go faster.
Then we pick up this data, even if just the changes, and apply them to one or more other databases, pre-integrating, formulating, calculating along the way. All those ugly left-outer-joins? Do them once, deliver the result to the reporting tables.
This very simple step will cause the average “purist” data modeler and DBA to retch their lunch when they hear of it. It will cause the admins of “traditional” engines to look askance at the Netezza machine and wonder what they could have been thinking when they purchased it.
But the ultimate users of the system, when they see reports with subsecond response, and lightning query turnaround, these same DBAs, admins and modelers will embrace the mystery.
This approach means we must regard our databases in terms of “roles”. Each role has a purpose. We deliberately embrace the practice of purpose-built schemas. The CDC-facing tables with support CDC – we won’t report from them. The reporting tables face the user – we won’t CDC to them.