A Dance of Physics and Logic
“Query tuning in Netezza is like using a steering wheel to make a car go faster.”
I watched with quiet concern as the DBAs, developers and query analyts threw queries at a machine like darts. They formulated the query one way, tuned it, reformulated it. Some slightly faster but they needed orders-of-magnitude more power. Nothing got it off the ground.
I took the original query as they had first written it, added two minor suggestions irrelevant to the query’s functional answer, and the operation boosted into the stratosphere.
What they (thought they) saw however, was how I manipulated the query logic, not how I accessed the physics. I can ask for a show of hands in any room of technologists and get mixed answers on the question “Where is the seat of power?” in your database. Some will say software, others hardware, others a mix of the two, while those who still adhere to the musings of Theodoric of York, Medieval Data Scientist, would say there’s a small toad or dwarf living in the heart of the machine.
The answer – physics. We inherently know that a database running on a laptop is weaker than on a two-hundred-core server, but somehow this message is lost inside a Netezza machine. People really believe changing the logic of a query will affect the physical universe.
Keep in mind, the Netezza SQL interface is a software facade on top of massively parallel hardware. When the SQL enters the machine, the host breaks the SQL apart into “snippets”, and each snippet has a role at a different level of the machine. Some snippets scan disk. Some aggregate results. Some broadcast data. But no matter what we do, the host will still break the query apart without our consent, and those snippets do the work where the machine says they should. This is why manipulating query logic rarely yields results.
To make it even more abstract, users sit in a chair in physical reality. They live by physical clocks in that same reality, and “speed of thought” analytics happens when we create a time-dilation illusion: where only seconds pass in the analyst’s world while many hours pass in reality.
After all, when an analyst immerses in the flow of a true speed-of-thought experience, they hit the submit-key as fast as their fingers can touch the keyboard, synthesize answers, rinse and repeat. And do this for many hours seemingly without blinking. If the machine hiccups, or is slow to return, the illusion-bubble is popped and they re-enter the second-per-second reality with the rest of us. Perhaps their hair is a little grayer, their eyes a little more dilated, but they will swear they unmasked the Will-O-The-Wisp and are about to announce the Next Big Breakthrough. Anticipation is high.
But those who don’t have this speed-of-thought experience, chained to a stodgy commodity-technology, will never know the true meaning of “wheels-up” in the analytics sense. And never achieve this time-dilated immersion experience. The clock on the wall marks true real-time for them, and it’s maddening.
Notice the allusions to physics rather than logic. We don’t doubt the analyst has the logic down-cold. But logic will not dilate time. Only physics can do that. Emmett Brown mastered it with a flux capacitor. We don’t need to actually jump time, but a little dilation never hurt anybody.
The chief factor in query-turnaround within the Netezza machine is the physical implementation of logical structures. We can have the same logical structure, same content, with wildly different physical implementations.
Why is this?
The machine “distribute on” : Co-location of multiple tables on a common distribution key and
The dataslice “organize on” : Co-location of commonly-keyed records on as few disk pages as possible (zone maps).
Each of these can and do re-landscape a table across Netezza physics.
Take for example a national retailer with customers, products, inventory, vendors, employees, each in their own way touches a point-of-sale transaction. If we distribute each on their master ID (e.g. customer_id, product_id, etc.) none will co-locate when we join the tables (their distribution keys don’t match).
A fact table should co-locate with the largest, most active table, so we distribute it on the customer_id. This makes every join operation on these two tables fast and efficient. The other tables, not so much.
Let’s say the product department wants to run analytics on the model. They will join the fact table with the product table, and because they are not co-located, data will broadcast all over the machine as it formulates an answer. The problem here, is that not-so-noisy queries running for the customer department now compete with the very noisy queries of the product department. Both departments suffer.
Your product people might be happy because they don’t know the difference. Your customer people will throw a fit.
One way to resolve this problem is to manufacture and maintain a separate fact table, distributed on product_id, for the product department’s use. What was that? Make a copy of the content?
We can predict the product manager’s conversation with the DBAs:
PM: “I need a version of the primary fact table in my reporting database.”
DB: “You mean a different version of the fact? Like different columns?”
PM: “No, the same columns, same content.”
DB: “Then use the one we have. We’re not copying a hundred billion rows just so you can keep your own version.”
PM: “Well, it’s logically the same but the physical implementation is completely different.”
DB: (Raises an eyebrow) “Oh really? You mean that instead of doing an insert-into-select-from, we’ll move the data over by carrier pigeon?”
PM: (staring) “No, the new table has a different distribution and organize, so it’s a completely different physical table than the original”
DB: (huffs) “You’re just splitting semantic hairs with me. Data is data.”
I watched this conversation from a healthy distance for nearly fourteen months before the DBA acquiesced and installed the necessary table. Prior to this, the PM had been required to manufacture summary tables and an assortment of other supporting tables in lieu of the necessary fact-table. The user experience suffered immensely during this interval, many of them openly questioning whether the Netezza acquisition had been a wise choice.
But once we applied new distribution/organize, 30-minute user-queries reduced to seconds, many sub-second. Where it could only handle five users at a time, now twenty or more enjoyed a stellar experience.
How does making a copy of the same data make such a difference? Because it’s not really a copy. When we think “copy” we think “photocopy”, that is “identical”. A DBA will rarely imagine that using a different distribution and organize will create a version of the table that is, in physical terms, radically different from its counterpart. They see the table logically, just a reference in the catalog.
We unleash Netezza’s massively parallel physics by configuring logical data to a physical form on the hardware. We don’t tune queries — we tune data. Moreover, the physical implementation must be in synergy (and harmony) with how the users intend to consume them. In the above case, the customer-centric consumer had a great experience because the tables were physically configured toward their queries. The product-centric manager however, had a less-than-stellar experience because that same table was not physically configured to match his queries.
The DBA had basically misunderstood that the Netezza high-performance experience rests on the synergy between the logical queries and physical data structures.
In short, each of these managers required a purpose-built form of the data. The DBA thinks in terms of general-purpose and reuse. To him, capacity-planning is about preserving disk storage. He would never imagine that sacrificing disk storage (to build the new table) would translate into an increase in the throughput of the machine.
So, while the DBA is already thinking in physical terms, he believes that users only think in logical terms. Physics has always been the DBA’s problem. Who do those “logical” users think they are, coming to the DBA to offer up a lecture on physics?
In this regard, what if the DBA had built-out the new table but the PM’s staff had not included the new distribution key in their query? Or did not use the optimized zone-maps as determined by the Organize-On? The result would be the same as before: a logical query that is not using the machine’s physics.
At this point, adding the distribution key to the query, or adding filter attributes, is not “tuning” but “debugging”. Once those are in place, we don’t have to “tune” anything else.
Or rather, if the data structures are right, no query tuning is necessary. If the data structures are wrong, no query tuning will matter.
And this is why the aforementioned aficionados were losing their hair. They truly believed that the tried-and-true methods for query-tuning in an Oracle/SQLServer machine would be similar in Netezza.
Alas – they are not.
What does all of this mean? When we submit a logical query, it cannot manufacture power. It can only activate the power that was pre-configured for its use. This is why “query-tuning” doesn’t work so well with Netezza. I once suggested “query tuning in Netezza is like using a steering wheel to make a car go faster.” The actual power is under the hood, not in the user’s hands. While the user can use it the wrong way, the user cannot, through business-logic, make the machine boost by orders-of-magnitude.
What should the developer/user/analyst focus on? They already know what their queries will look like, so they must formulate tables to match that utility. They must distribute and organize the data physically, to match how the query will access it logically.
In short – when forming queries, think logically but act physically. We cannot disregard the physical layout of the table (across the parallel architecture).
When formulating data, think physically but act logically. We cannot build tables as though they are general purpose.
All of this makes a case for purpose-built data models and purpose-built physical models, and the rejection of general-purpose data models in the Netezza machine. After all, it’s a purpose-built machine — unlike its general purpose, commodity counterparts in the marketplace. In those machines (e.g. Oracle, SQLServer) we must engineer a purpose-built model (such as a star-schema) to overcome the the platform’s physical limitations. Why then move away from the general-purpose machine into a purpose-built machine, and attempt to embrace a general-purpose data model?
In case it’s not clear – there is no such thing as a general purpose data model in Netezza.
Could it be that the average Netezza user believes that the machine’s power gives it magical ability to enable a general-purpose model in the way that the general-purpose machines could not? Ever see a third-normal-form model used for reporting in a general-purpose machine? It’s so ugly, they run-don’t-walk toward engineering a purpose-built model (star schema), photocopying data from the general-purpose contents into the purpose-built form. No, the power of the Netezza machine doesn’t give it magical abilities to overcome this problem.
A third-normal form model doesn’t work better in Netezza than a purpose-built model.
So, we make a case for another purpose-built model (different distribution and organize). One that faces the new user-base with tables physically optimized to support that user base. Will all tables conform? Of course not. Will all of the data of the existing fact table(s) conform? Usually not, which is silver lining of the approach. Only take the columns we need.
This is perhaps the one most-misunderstood tradeoffs of requiring multiple user bases to use the same tables even though their physical form only supports one of the user-bases. And that is simply this: When we kick off queries that don’t leverage the physics (mismatched distribution keys), we scan more data on the dataslices and we broadcast more data between the CPUs. This drags the queries down and saturates the machine.
The query drag is a regular thing for the one-off user base, but left to itself, the machine’s capacity will saturate and affect all users. Everyone suffers, and all for the preservation of some disk space. Trust me, if there is a question between happy and productive users versus burning some extra disk space, it’s not a hard decision. Preserving disk storage in the heat of unhappy users is a bad tradeoff. Really bad.
You’ll see me repeat this in various essays: The strength of Netezza is its power. The weakness of Netezza is its power. This is because we’re tempted to build sloppy data structures and inefficient queries, and expect the machine’s power to make up the difference. It will, for a while. But if this inefficiency continues, we’ll reach the four walls of the machine right-quick. It’s not hard to dig-out of a hole like this, but we must dig out.
All solutions start out with a certain available capacity (same number of CPUs, memory, disk storage) and we must optimize these factors for the best possible experience. Throttling CPUs or disk space, or refusing to give up disk space merely to preserve disk capacity, only forestalls the inevitable. We must align the solution’s structures with machine physics and build the queries to leverage that physics
The depiction(above) describes how the modeler’s world (a logical world) in no wise intersects with the physical world, yet the physical world drives the user’s performance experience. The high-intensity physics of Netezza is not just something we “get for free”, it is a resource we must deliberately leverage and fiercely protect.
In the above, the “Logical data structure” is in the machine catalog (using query logic to create it). But once created, it doesn’t have any content, so we use more logic to push data into the physical data structure. The true test of this pairing is when we apply a logical query (top) and it uses the data structure logic/physics to access the machine physics (bottom).
Can we now see why a logical query, all on its own, cannot manufacture or manipulate power? It is the physical data structure working in synergy with the logical query that unleashes Netezza’s power. And this is why some discussions with modelers may require deeper communication about the necessity to leverage the deep-metal physics, honor and protect the machine’s power, while still fulfilling the user’s functional goals.