Saturday, March 24, 2007

Myth : Normalization == Poor Performance

[this article was originally written in 1997] 

Have you ever read something over and over again you know is false but worry if people keep repeating it will become fact? To dictators its called propaganda. Marketing and sales people call it, "Believing your own bull." 

The comparisons may seem extreme (some suggest histrionic) but that's what's happening now with the mantra, "Normalized data can't perform." Repeat it a couple hundred times (maybe you already have) and you'll start believing it too. 

Like many other falsehoods in technology, this kind of thinking attempts to blame technology for people's own ignorance about how to use it. 

Russ Matika, one of the best system programmers I've been lucky to work with, insisted there was nothing wrong with various technologies (whichever one I was complaining about at the time). He suggested we just didn't know how to exploit them.

This is especially true with relational databases--one of the few computer-related technologies with mathematical proof for a 'correct' way to design systems. That proof is called 'set theory' or 'relational set theory.' Set theory is introduced to students as early as 6th grade these days. Although it's not presented to students in the context of database design, the math is the same. 

So what's the supposed problem with normalized (relationally correct) data? DBAs and programmers who are confused about the role of database engines complain normalization results in too many tables. Besides the administrative nightmare, more tables means more JOINs to accomplish simple queries. More JOINs means more CPU cycles (and logical IOs) which translates into longer running queries. So long in fact the wait renders the query useless by the time it leaves the database and appears in the application. 

They're right. They're absolutely right. But the fault lies not with the database engine, its query optimizer, or the design of the database. The fault lies with the programmers and DBAs who think its the database's responsibility to do the work for the application programmer. Its the fault of anyone who confuses the role of the database to do both the storing and organizing (disassembly) of data as well as aggregation and projection (assembly). 

Let me share my own Dickensonian, "Tale of Two Queries." 

A Credit Union has a report called "Consolidated Analysis" which prints, for every customer, totals for all their demand, checking, CDs, closed-end and open-end loans across a page. At the end of the report it displays totals. Sounds like a fairly straight-forward report, doesn't it? All the credit union's data was stored in a relational database with various attributes of the unique account types residing in their own tables. To create a single report with all the information all these tables had to be JOINed together. To make a long story short; using traditional SQL with a fairly typical report-writer program the report took nearly two hours for a measly 150,000 customers. Heck, the printer could print faster than that! 

Undaunted, Russ insisted there was nothing wrong with the database. Nothing wrong with the hardware. Nothing wrong with the network. Only something wrong with our query. He suggested another approach. One that required a bit more programming (pun intended) and additional thought on the part of the programmer--but Cliff was up to the challenge. 

Cliff knew the separate pieces of the report ran quickly. If you just ran the 'Deposit Analysis,' the 'Loan Analysis,' or the 'CD Analysis' reports they ran in just a few minutes. Cliff re-coded his consolidated analysis to basically run the three reports at once, but moving the responsibility of the final join from the database to his application. Since each report was sorted by customer, the application code was a breeze (make sure you finish all three reports for each customer before moving onto the next). 

The new, trivially more complex, consolidated analysis ran in six minutes. 

Isolated incident? 

This same credit union is trying to run an online transaction system with the same relational database. Imagine all teller, ATM, and batch transactions posting to an online system using its new relational database. No batch jobs. No nighttime processing. "No way!" you think? 

Well, in the beginning you were right. A typical savings deposit implemented using database stored procedures (that's what the vendors tell you to do) required 1500-2000 logical IOs to complete. Many of the IOs were duplicate reads of data already read, due mostly to the limitation of variable scoping in stored procedures, parameter limitations, and so on. Sure, the re-read pages were cached, but they still cost CPU cycles. 

Undaunted, Russ insisted there was nothing wrong with the database. Nothing wrong with the hardware. Nothing wrong with the network. Only something wrong with our transactions. 

Cliff analyzed the transactions and discovered there were probably only 100 or so unique IOs required for the transaction. The other 1400-1900 were duplicate reads of account data or reads of system (or persistent) data that changed infrequently. He suggested that by moving the transaction business logic out of the database and into another language with greater flexibility and reading 'fixed' data once we could create a deposit transaction that required only 100 logical IOs. 

Cliff was right. He helped design a transaction processor that initialized the persistent data in its own memory when it started, did only the minimal number of reads to get the account data, and when it was done with all the business logic submitted to the database only the minimum number of updates necessary to affect the transaction. Admittedly, this solution is incrementally more sophisticated than our first, but the results paid-off: a 20x improvement in transaction throughput. More than enough to meet the posting requirements of the system. 

And the database was still normalized. Nothing in the logical data model had been forfeit for performance in the physical. 

So, is there something wrong with relational databases and normalized data? Or is it only our ability to exploit them?
Follow @TomGagne