Tuesday, July 10, 2007

Bad Idea : Outsourcing Intellectual Property

A familiar echo

A colleague of mine has a theory about why Vista requires 2GB of RAM and a late-model CPU to run satisfactorily. He believes it is likely the first edition of Microsoft's flagship operating system primarily developed in India rather than Redmond, Washington.

Except for press reports of Microsoft's huge investments in China and India, and its outsourcing of development to those countries, I'm unaware of precisely what is being outsourced and what measures Microsoft has taken to ensure a quality product--quality measured not just in bugs and resilience to breakdown, but in the quality experienced programmers know can exist in the code itself: the economy of expression, elegant algorithms, brilliant structures and modularization. Unless Microsoft releases Vista's source code, which I think unlikely, we'll never know for sure whether Vista has the hidden qualities Paul Graham describes in his essay, Hackers and Painters.

The shot heard round the boardroom

Our development team was asked by one of our largest investors to visit another company he owned and analyze their software, development methodologies, and testing procedures. No greater compliment could have been paid us. The company in question was on the verge of signing a large contract that had the potential for significant revenue growth and pressure on the existing software platform. Company directors were anxious about the deal because the software was showing significant signs of stress. When we visited there were over 800 bugs listed as critical. Among them were reports that took too long to be usable, some of their customers were able to see other customers' data, and invoicing was broken.

We'll skip the messy details, but there were some red flags that predicted their problems. To protect the innocent and guilty alike, we'll call the company Newco.

The good

Newco had a great start. Their innovative web-delivered service was easy to learn and use. They didn't need the overhead of a sales staff because the service was self-enrolled. Membership included newsletters with helpful articles, both on using the system and advice from industry professionals. Additionally, because the service required only an internet connection, it could be priced competitively and easily won business from other providers.

The bad

Curiously, Newco's management had no previous experience in either their product's industry or software development. They created the service and attracted quality investors, but that was pretty much the end of their most valuable contributions.

Neither Newco nor its directors realized they were in the software business. True, the service they sold wasn't a software product, but the entirety of Newco's intellectual property was invested in the software. The danger of not knowing what business you're in is loss of focus. In this case the loss of focus wasn't a mere distraction; their attention was completely misdirected. Instead of jealously guarding and nurturing the thing that defined their company--the software--their attentions were elsewhere. From the beginning, software development was an expense to be minimized rather than aggressively invested in.

The ugly

Newco's management was filled with large-company escapees who approached a small company's software development the same way a large company might: as simple project management. All they had to do was find inexpensive labor, describe the requirements, agree on delivery dates, and hold the developer to them.

Their CTOs either had no experience developing software or weren't given the opportunity. The last CTO had no experience writing or designing software (or in Newco's industry) but instead had many years' experience managing projects at a large IT consulting firm.

They farmed their IP out for development to outside contractors across three countries and two continents--none of them domestic. This isn't an indictment of the quality available from overseas developers, but evidence of how far away, geographically and culturally, they dispatched their company's jewels. All the while, they had no in-house technical expertise to measure or critique the software's design or engineering.

Ultimately, Newco lost complete control of the software: its design, its host operating system, the database, development tools, infrastructure tools, language, and issue tracking. In short, they'd lost their ability to be self-determining and had become completely dependent on other parties for their survival. By the time we arrived their own intellectual property was completely foreign to them, both literally and figuratively.

The clever bookend

Which brings us back to Redmond. If my colleague's suspicions are true, what might that say about the business Microsoft is in? It may be they're perfectly capable of managing off-shore development with greater competence than Newco possessed. Or it may indicate a significant change of direction for Microsoft--a demonstration that it's no longer in the software development business as much as it is another business, perhaps the patent and property protection business?

Microsoft is certainly a large company--perhaps one of the largest. It has certainly exercised its marketing, legal, and acquisition might and expertise, with the financial resources to back them up. And now that its head is turned toward activities unrelated to the actual exercise of writing its own software, an opportunity has been created for other companies--companies focused on writing their own software and jealously guarding it--to establish a beach-head that wouldn't have been imaginable not too many years ago.

Can you say Google?

Newco was eventually sold at a discount to a competitor for the only thing it possessed worth paying for--its customer list.

Monday, June 18, 2007

Databases as Objects: My schema is a class

In my previous article I wrote that the database is the biggest object in my system. If that is the case, I should be able to test the concept against the Gang of Four's Design Patterns to see how the idea holds up. 

But before doing that I need to define, in database terms, what classes are and what their instances may look like. 

In OO terms, a class is a template that defines what its instances look like. Cincom's VisualWorks Smalltalk's Date class defines two instance variables, day and year. Given those two instance variables, any instance of the Date class can keep track of a date.

My database has a schema. That schema can be executed as a sequence of data definition language (DDL) statements to create a new instance. In addition to our production database we have multiple other instances created with the same schema our developers and quality analysts use to test the system. 
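
The idea is concrete enough to sketch. Here's a minimal, hedged example--the table and server names are mine, not our production schema. The same DDL, executed against different servers, stamps out identical instances of the schema the way a class stamps out instances:

    -- schema.sql -- hypothetical DDL defining part of the "class"
    create table account (
        account_id  int      not null,
        customer_id int      not null,
        status      char(1)  not null,  -- 'A'ctive, 'C'losed, ...
        balance     money    not null,
        primary key (account_id)
    )

    -- Each run instantiates the schema; server names are illustrative:
    --   isql -S PRODSERVER -i schema.sql   (production instance)
    --   isql -S QASERVER   -i schema.sql   (quality-analyst instance)
    --   isql -S DEVSERVER  -i schema.sql   (developer instance)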

Part of a class' template defines its instances' methods: which operations it supports, and what behaviors a user of any of the class' instances can expect to be available. Inside a class hierarchy, classes inherit the behavior of their superclasses--the classes from which they derive their base behavior. A class can add new behavior or override inherited behavior to create an object with unique capabilities not available in any of its ancestors.

Even before I extend any of my database's behaviors it, too, has default behaviors. At the lowest level I can use SQL statements to introspect and interact with my database in all kinds of low-level ways. On their own, these low-level behaviors know nothing of my application or its unique abilities and requirements. Like a class, though, my database lets me add new behavior, or even override default behavior, using stored procedures and views to provide unique capabilities that would otherwise be unavailable or impractical.
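
To make the parallel concrete, here's a hedged sketch in Sybase-flavored SQL (the names are hypothetical). A view and a stored procedure layer application-specific behavior over the database's low-level defaults, the way a subclass adds or overrides methods:

    -- "Override" the raw table with an application-level representation:
    create view active_accounts as
        select account_id, customer_id, balance
        from account
        where status = 'A'

    -- "Add" a behavior the base database doesn't have:
    create procedure customer_balance @customer_id int
    as
        select sum(balance) as balance
        from account
        where customer_id = @customer_id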

In the world of Sybase, every database inherits the attributes and behavior of a database named Model. 

Model
  |
Efinnet

By itself, this is beginning to look like a class tree--though a very shallow one. But whether something belongs in a tree isn't made more or less probable by the tree's depth (or lack of it). In fact, many OO designers advocate for shallower hierarchies. Either way, our database fits right in.

We already talked about instance variables and methods, but what are some of the other OO-ish things my database can do? 

Persistence - One of its most important features is its ability to persist itself on disk and maintain its integrity. The entire state of my system is preserved and maintained inside my database object.

Introspection - My database can tell me things about itself, its variables, and its methods (see the sketch following this list).

Composition - My database is composed of other objects called tables. Some of the tables were inherited from its superclass, others were added to extend its functionality.

Singleton - Instances of my database exist as singletons. For each instance of my system one, and exactly one, instance of my database exists to preserve and protect the state of my system. 

Messages - The only way I can communicate with it is by sending messages to it. I cannot (and care not to) manipulate its data directly at a low level (disk) because that would risk its integrity--not referential integrity, but disk-level consistency.

Extendability - I can extend my database's schema to define new variables (tables) and behaviors (procedures). Even better, I can apply the new schema to its instances.
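
Here's what the introspection item might look like in practice--a hedged, Sybase-style sketch using the system catalogs:

    -- "What are my variables?" -- user tables and their columns:
    select o.name as table_name, c.name as column_name
    from sysobjects o, syscolumns c
    where o.id = c.id
      and o.type = 'U'        -- user tables
    order by o.name, c.colid

    -- "What are my methods?" -- the stored procedures:
    select name
    from sysobjects
    where type = 'P'          -- procedures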

It's amazing it took me 20+ years to recognize the similarities between objects and databases. But now that I'm confident my database is an instance of my schema, and in other important respects is in fact an object (a singleton) in its own right, I can start visiting the GoF's patterns to see how well they apply.

Monday, June 11, 2007

I remember my first time...

A recent ACM Queue Advice Column by Kode Vicious, called Advice to a Newbie, asked:
Do you remember the first time? The first time when, after struggling with a piece of code, you felt not only "I can do this!" but also "I love doing this!"
I still remember that rush. It was addictive. When it happened I decided what I wanted to be when I grew up: a computer programmer.

In 1983 I was a senior at Troy High School, in Michigan. I was taking a computer programming elective at the same time I was taking trigonometry. We were learning BASIC on Apple IIe computers. Our final assignment was to write a graphic animation of something. Anything. Mine was influenced both by being a high-school student and by America's 1981 return to space with NASA's shuttle program.

Using BASIC and the IIe's low-resolution graphics (pixels that seemed the size of a Tic-Tac), I simulated the launch of the Space Shuttle Columbian (did I mention I was in high school?). My rendering of the shuttle was as good as it could have been, considering the resolution, and included a 10-second count-down, smoke, flames, and a lift-off off the top of the screen. After that, the shuttle flew right-to-left, with the appearance of motion provided by stars in the background moving left-to-right. The loops were gigantic. Inside the loops the program made sure the stars disappeared behind the shuttle and reappeared at the appropriate time.

Then the pièce de résistance, a PacMan moved across the screen and gobbled the shuttle into nothingness.

I got an A.

But better than that, I triumphed over the task using BASIC and geometry. The loops moving the stars non-destructively behind the shuttle were nothing compared to the routines to open and close the PacMan's mouth as it moved across the screen. I remember how impressed my parents pretended to be when I showed them the print-out of the code.

I also remember how slowly the program ran. It seemed everything was happening underwater. I could almost make out each line of the PacMan's mouth being drawn yellow as it closed, then black again as it opened to devour the space ship.

But then something amazing happened.

Our teacher, Mr. Ralph Nutter, who was my older brother's math teacher and swim coach a few years earlier, demonstrated all our projects in front of the entire class--but now they were compiled into machine language. The lift-off was smooth and the screen almost looked as though it were on fire. Most importantly, my PacMan moved across the screen so smoothly and cleanly the jagged resolution was invisible, and it seemed to race over the shuttle so gloriously I could hear the game's music playing inside my head.

And I was hooked.

That was 24 years ago, and to this day it is one of the biggest life-changing events of my life. Almost everything that's happened to me since turned on what happened that last May in 1983, 4th hour, in the closing days of my last year of school.

Wednesday, June 6, 2007

The database is the biggest object in my system

After I posted a link to my The TOA of Tom, a couple of interesting discussions occurred inside comp.object. While I was responding to Bryce Jacobs, a better way of describing what we're doing came to me. It's buried in:

In fact, after inspecting multiple C APIs for LDAP, iCal, and other libraries it appears it's not even foreign to C programmers. Often a structure is used to hold state information for a connection to a resource, but the format of that information isn't exposed to API users except through an API. Even when a method may only be returning a value from the structure, API programmers know that level of indirection affords them flexibility inside the structure they may use without negatively impacting all the API's users.

So a common pattern is used by both C and OO programmers. What my paper is promoting (and I'll try to do a better job explaining) is that the same pattern be applied to how OO programs access the DB.


Essentially, my paper on transaction processing encourages thinking of the database as one big object with all the rules for data hiding and interfaces OO programmers are already acquainted with. 

Why shouldn't applications have embedded SQL? Because it's the same as accessing the private data members of an object. It shouldn't be done. OO programmers know the correct way to interface with an object is to use its method interface--not to attempt direct manipulation of the object's data. OO programmers' attempts to violate that rule are what cause so much frustration mapping the application's data graph into a relational database's tables, rows, and columns. Those things belong to the DB--not to the application.

Now, OO programmers and system designers can return to their favorite Patterns books and reevaluate the lessons from a new perspective. Should make some interesting reading. 

Monday, June 4, 2007

Transaction Oriented Architecture (aka The TOA of Tom)

I know, I know. It's spelled T-A-O. 

I'm willing to go out on a limb and say most new programming in the last 15 years has used object-oriented languages and relational databases. I hear complaints already. True, there are exceptions, but however big you think they are, they're only rounding errors compared to the military-industrial complex that the object-relational mega-market--featuring Java, C#, PHP, Python, and other OO programming languages--has become.

I'll go out on another limb (while I'm here) and claim most of those systems are horribly designed. I know that because a) they didn't ask for my help and b) everyone's complaining the offspring of OO/RDB shotgun weddings have contorted features. Design patterns for persistence and object hibernation have become more most-popular kludges than best-practice how-tos.

What is a transaction? 

For us, a transaction is anything that might happen in our system for which we want to provide security and an audit trail. Everything that changes our system is recorded with a user name, a program name, a post date, an effective date, and a transaction type. Financial transactions include amounts and account numbers. Data changes include the object and field changed, and the old and new values. Right or wrong, whatever happens in our system leaves a trail.
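
As a hedged illustration (not our actual schema), a transaction table carrying that trail might look like this:

    create table transaction_log (
        tran_id        int          not null,
        tran_type      char(8)      not null,  -- e.g. 'DEPOSIT', 'DATACHG'
        user_name      varchar(30)  not null,
        program_name   varchar(30)  not null,
        post_date      datetime     not null,
        effective_date datetime     not null,
        account_no     int          null,      -- financial transactions
        amount         money        null,
        old_value      varchar(255) null,      -- data changes
        new_value      varchar(255) null,
        primary key (tran_id)
    )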

That trail provides two important features. First, no matter what happens to the system we know how it got that way. If an error is made we know how it happened and how to correct it. It's important to understand we don't back out transactions. We add new transactions reversing the negative (incorrect) effects of the errors. Pretending bad things didn't happen by backing out transactions and resetting the data to pre-mistake values doesn't make them go away. Erasing data only creates an opportunity for the database to get out of sync with the audit trail, and a system's integrity is weakened if its popular mechanism for correcting mistakes is erasing the evidence.

In the movie Clear and Present Danger, the President of the United States is worried the press will discover one of his best friends was laundering money for a Colombian drug cartel. Presidential advisers recommend putting distance between the president and his murdered friend. Responding to suggestions that the press won't find out, the president says, "They will. They always do." CIA analyst Jack Ryan (Harrison Ford) recommends the opposite approach, saying, "There's no sense defusing a bomb after it's already gone off."

Always forward, never backward. 

The second major feature is the ability to tell users (and auditors) what the system looked like at any point in time. It's easy to tell users what it looks like now, but what about last month? Or the second quarter of last year? Or how about comparing this year-to-date with last year-to-date? Transactions make that possible.

Related to that second feature is the fact that however the system appears today, it is merely the end result of all the transactions posted since our current system was bootstrapped October 1, 2002. Most of the database can be recreated by reposting transaction history. For example, today we could lose the entire account table's contents and derive the correct ending balances from history.
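
Using the hypothetical transaction_log sketched earlier, re-deriving ending balances is a single aggregate over history:

    select account_no, sum(amount) as ending_balance
    from transaction_log
    where tran_type in ('DEPOSIT', 'WITHDRAW', 'FEE')  -- financial types
    group by account_no

Add a post_date or effective_date predicate and the same query answers "what did it look like last month?"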

Database integrity is paramount and transaction history is an important ingredient. If the database is incorrect or its integrity lost, everything else is cosmetic. 

More on transactions later. 

Wagging the database 

If the database and its integrity are so important, when should its design be influenced by the programming language? In a word: never. To believe otherwise suggests a database's design should change whenever the application programming language changes, and that applications written in other languages cannot be properly accommodated by a database designed for one. Conceiving of that kind of dependence is counter-axiomatic to the tenets OO designers strive for: high cohesion within a module and low coupling between modules. Making database persistence an innate responsibility of every business object destroys each object's cohesion and tightly couples those same objects to something they shouldn't be co-dependent with--a database. The result is objects with low cohesion and tight coupling. To make matters worse, the contagion isn't another object in the OOPL that can be easily coded away, but a remote object that throws exceptions, is often network-remote, and is subject to many more external influences on its design and performance than the objects attempting to integrate its utility into their persistence models.

".. like another hole in the head" 

In our own system, the database is just as easily accessed from PHP and Smalltalk as it is from Sybase's isql or the open-source is (written in C). Even were the database inclined to favor object-oriented languages like PHP and Smalltalk, it must still treat other paradigms equitably. I can think of few languages as far apart in paradigm as Smalltalk and C.

Think of it as a due-process clause protecting the rights of applications no matter their language, paradigm, compilation or interpretation, generation, or OS of origin.

It's difficult to imagine all these products, frameworks, and seminars exist for something we want to pretend doesn't exist--the database. We don't need consultants--we need therapy.

The first step of whatever 12-step program launches the next beverage revolution and meets Wednesday nights is to admit our denial. It's not that we don't know the database exists--it's that we want it to disappear. We want it to go away. When we rub our eyes hard, in between the kaleidoscope colors we imagine pure object databases undetectable to the untrained pointy-haired manager or offshore programmer. But when our eyes open we're reminded how horribly awkward, non-standard, and even more difficult-to-share-than-query object-oriented databases are (and wait 'til you see the billing rates those consultants get).

What does it look like? 

Everything InStream does, from production support all the way down to development, follows a paradigm, if you will, of transaction-oriented processing. It's a way of looking at the purpose of your database, and how programs interact with it, that nullifies any impulse programmers may have to make the database reflect OO models, or OO models reflect the DB design.

After the database has been designed--independently of whatever language will predominate application programming--stored procedures are created for adding, changing, and querying the database. With those procedures in place, scripts can easily be written to populate the database with data so that tests can prove the database is complete and properly designed. It is important to begin exercising the database early in a development process to discover any relational awkwardness and to establish performance baselines for both data changes and queries.

Using database stored procedures also insulates application code from database design changes, from the trivial (renaming a column) to the extreme (redesigned tables). Procedures form the first and lowest-level Application Programming Interface (API) of a layered system. Additionally, most relational databases provide mechanisms for finding inter-dependencies within the database: which views depend on which tables, which tables contain which columns, and which procedures depend on which database objects, as well as dependencies between the procedures themselves.
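
A hedged sketch of what that lowest-level API might look like (the names are hypothetical): applications call the procedure, never the table, so a renamed column or a redesigned table changes only the procedure's body.

    create procedure add_customer
        @name    varchar(60),
        @address varchar(120)
    as
        insert into customer (cust_name, cust_addr)
        values (@name, @address)

    -- Sybase's sp_depends reports the inter-dependencies mentioned above:
    --   sp_depends customer   -- views and procedures that use the table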

HINT: Don't embed SQL. Once in production you want to minimize the impact of any post-production changes. You only release a system once; everything after that is an update. The more difficult updates are to deploy, the more risk is involved in attempting them, the more reluctance there is to tackle them, and the more traumatic they become. Keep 'em simple and your system will be able to grow without stretching your customers' patience.

Friday, June 1, 2007

Programming Rules

In the seven years our company has been around, the technological landscape has changed dramatically. Additionally, the way our system is assembled and the components it's assembled from have also changed. Can a system go through that much change and still be consistent with an architecture described seven years earlier?

Yes.

Rules, as I've said in earlier postings, are more durable than technology. That is why rules are important. Design rules persist across technological smokestacks. Good rules are good rules regardless of which database you use, regardless of which language you develop in, regardless of whether your application exists on the web, is client-server, or stands alone in the back office.

I won't give-up all the secret ingredients, but here are some of our favorites at InStream Services:

It's better to be explicit than implicit

The code systems are built with, regardless of programming language, isn't susceptible to the fragile memories humans suffer. Code never forgets, but humans do. Has that ever happened to you? It's happened to me. Considering how complicated software systems are, and how many distractions humans juggle day-to-day, code that distributes its implementation across classes, procedures, and triggers doesn't help programmers understand how something works--it actually erects barriers to its enhancement. This is one of the reasons our system has only a single database trigger. Want to know what a procedure does? Everything it does is right inside the procedure's code.

If something is broken I don't want to be hunting around forever tracking it down.

No Polling

Besides wasting CPU, polling is a cop-out to a work-flow problem. There's no excuse for not knowing when the next thing needs to be done.

No Parsing

We don't write compilers or invent languages. We're surrounded by technology that already knows how to parse grammars. Find the right tool and use it. There are better ways to know what's coming next.

Once and only once

I hate solving problems twice. Keeping both sections of code in-sync with each other and the business is like making sure twins both get the same amount of ice cream. By the time I've measured it perfectly it's already melting. If it seems like it needs to be solved in two different places chances are it should be moved somewhere else and solved once.

No harm running twice

Humans are fallible. Everyone knows it. Why write programs, scripts, fixes, or patches that depend on being run under perfect conditions? Fix scripts should make sure things are still broken before they try to fix something that isn't. Programs should know not to do anything if there isn't anything to do. Nearly everything we've written can be run as many times as people feel like running it without negative consequences. There are enough things that can go wrong that aren't under our control--let's not add to them with things we do control.
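
A hedged example of the rule, with hypothetical table names: a fix script that verifies things are still broken before fixing them, so a second (or tenth) run changes nothing.

    -- Re-open invoices incorrectly marked orphaned by a since-fixed bug:
    update invoice
    set    status = 'OPEN'
    where  status = 'ORPHANED'    -- only rows still exhibiting the bug
      and  invoice_id in (select invoice_id from invoice_detail)

    -- Run it twice and the second run matches zero rows--no harm done.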

Do one thing and do it well

Structured programming, refactoring, and the Unix shells have demonstrated how powerful a concept it is to do one thing and do it well. Narrowing the utility of a program reduces side effects and increases its utility. When all a system's programs do only one thing there is less chance of overlap and duplication. It also means that when something goes wrong there are fewer programs that require fixing--and fewer places to look in the first place.

Desirable undefined behavior

What happens when an unidentified transaction arrives? What happens when something occurs that wasn't planned for and stopping the system is an unacceptable option? You make sure there's default behavior that does something harmless, can be audited, and alerts the authorities.
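
A hedged sketch of what that default behavior might look like (the procedure and table names are illustrative, not our system's):

    create procedure post_transaction @tran_id int, @tran_type char(8)
    as
        if @tran_type = 'DEPOSIT'
            exec post_deposit @tran_id
        else if @tran_type = 'WITHDRAW'
            exec post_withdrawal @tran_id
        else
        begin
            -- harmless, auditable default: park it and alert someone
            insert into suspense_queue (tran_id, reason)
            values (@tran_id, 'unidentified type: ' + @tran_type)
            print 'ALERT: unidentified transaction parked in suspense'
        end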

One of the constraints of our system is to be able to import supply chain data from multiple external systems we don't control (see Database Prenuptials). Because 3rd-party data is sometimes incomplete, our systems need to do something reasonable with imperfect data, especially since the alternative risks the database's integrity.

In production it's important to know that when bugs appear, the likelihood of something bad happening is minimized because appropriate default behaviors are in place. We know where our system is most at risk and write our software with harmless fail-safes.

These are some of our rules. At times we've been tempted to break them, but found ways to stick to them. As I wrote in the first Rules to Live By article, when multiple programmers and designers are working with little supervision it's best if they all know what the rules are so no one, especially me, is surprised with the implementation--except in a good way.

Thursday, May 31, 2007

Database Prenuptials

This is the third installment of my "Rules to Live By" series. The first entry was Everything you needed to know about software architecture you learned in kindergarten. It discusses the importance of sharing a vision with coworkers about where the system is going, the metaphor you're using to model after, and the rules it must follow.

The second installment, Production rules easy as 1, 2, 3--but without 2 and 3, talks about our production priorities, and how those priorities are reflected throughout development, testing, and assurance. It also talks about how knowing our priorities and what's important allows us to upgrade our system more aggressively and frequently than is possible for other software teams. The result is to deliver more features and correct problems as quickly as possible for both our users and ourselves.

This article is about our database design. It's intended for software designers, not DBAs. There's nothing here about relational theory, 3rd normal form, or performance tuning. The only thing a DBA may find interesting about our system is that the logical and physical designs are the same. No compromises were made translating our logical design's entities or relationships into real tables. Nothing was denormalized. It was a myth in the 90s, and it remains one today on even more powerful equipment, that normalized data doesn't perform.

The basics

It should go without saying, but won't, that the first step to good database design is understanding your business. Sometimes that understanding comes from interviewing "experts." Our development staff has made it their goal to become experts in our niche of commercial finance, so our interviews feel more like consultations than interrogations.

Fundamentally, we assume solid database design skills. Consistent table naming makes it easy to predict where data may be stored. Strict attribute naming requirements remove confusion about what a field's meaning is: there are never two meanings--fields are never reused.

Some of our rules may be specific to our industry, but I'm wary of provincialism. Believing we're unique would discourage us from both looking outside ourselves for ideas and from publishing what we've discovered improves our processes.

What's yours is yours

Our system tracks documents exchanged between trading partners. To maintain the system's integrity we never add to, change, or remove these documents if they exist "in the real world" and have been recorded in ours.

Furthermore, we never create documents we don't know exist, even if it would make sense that they did. Suppliers don't normally ship without a purchase order, and it's easy enough to create a PO if one is obviously missing. But if we created the purchase order without evidence it exists then we've compromised the system's integrity. Better to be missing a document than to fabricate one.

Real world integrity is more important to our business than relational integrity. Besides, there are other ways to synthesize relational integrity to support these situations.

What's ours is ours

Rather than pollute real-world documents with our data, we keep them separate. Our system maintains proxies for customers' documents, to which we add our own fields. This separation-of-stuff provides our system with a kind of flexibility otherwise unavailable. We're able to mold and bend our system any number of ways behind the scenes without impacting our customers' view of the world.
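
A hedged sketch of the proxy idea (the tables are hypothetical): the customer's document is stored untouched, and our fields live alongside it.

    create table po_document (        -- the real-world purchase order
        po_id     int          not null,
        po_number varchar(20)  not null,
        primary key (po_id)
    )

    create table po_proxy (           -- our view of it
        po_id          int     not null references po_document (po_id),
        funding_status char(1) null,  -- our data, never mixed with theirs
        risk_score     int     null,
        primary key (po_id)
    )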

Other duties as assigned

Never reduce resolution. Once we've created or captured detail information we won't delete or summarize it away. We violated this rule once and haven't forgiven ourselves since (and yes, we knew it was a rule then).

Our code may be object oriented (Smalltalk and PHP) but our database is not. Forcing one to look like the other is a disaster waiting to happen--at least it is in financial systems. Resist the urge to treat your relational database as if it were an object database or to impose your code's object model onto your database. We use an approach we call Transaction Oriented Processing to marry the two. RDBs are from Mars. OO is from Venus. Transaction Oriented Processing is the medium between the two. There is no object-relational impedance mismatch if we let both systems do what they're best at. Using Transaction Oriented Processing to negotiate between OO and RDB technologies is straightforward but requires a separate article to explain. I promise to post one here.

It is better to be explicit than implicit. Nothing in our system is subtle. There are no hints, innuendo, or clues. Intuition is not necessary to find out what's going on. Everything is exactly what it says it is.

For instance, there is only a single trigger in the entire system. If a programmer wants to know what happens when a row is inserted or changed they need look no further than the stored procedure that inserts or changes the row.
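
In other words, the insert procedure itself writes the audit trail, in plain sight. A hedged sketch with hypothetical names:

    create procedure ins_invoice @invoice_id int, @amount money,
                                 @user_name varchar(30)
    as
    begin
        insert into invoice (invoice_id, amount)
        values (@invoice_id, @amount)

        -- the audit row is written here, not by a hidden trigger
        insert into audit_trail (table_name, action, user_name, post_date)
        values ('invoice', 'INSERT', @user_name, getdate())
    end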

You're going to be married to your database for a long time. It's best you learn how to get along with it.

Wednesday, May 30, 2007

Production rules as easy as 1, 2, 3--but without 2 and 3

There used to be a sign in our office listing our company's top priorities. It's been replaced with a much larger sign with vision and mission statements and a list of guiding principles. I miss the old sign because it listed only three easy-to-remember items, ordered by importance. The first item was "protect the cash."

I forget the other two.

I'm sure they were important but not as important to a finance company as protecting the cash. That's Rule #1. If you don't protect the cash it doesn't matter what the other two items are--they're cosmetic. They were not second and third because they were alphabetized. They were second and third because they were less important than Rule #1: Protect the Cash.

But Rule #1 doesn't hold the same imperative for programmers and system administrators as it does accountants, controllers, and credit analysts, so the technology group issued its own.

Rule #1: Protect the database

Like the original list, it doesn't much matter what rules two and three would have been, because without a database everything else is cosmetic. But protecting the database goes way beyond security and backups. The database's integrity must be protected: everything must be entered correctly, everything must balance, and audit trails must be complete. If the data is corrupted, what good is a backup of corrupted data?

Having Rule #1 helps prioritize all other activities and allows us to add features and fix bugs more quickly. A select few programs and stored procedures modify the database; collectively, these are our posting programs. Any change to them, for any purpose, brings with it commensurate scrutiny and testing, as though we were attempting gene therapy on our own children. It's that important. It's Rule #1, and everything else is cosmetic.

The best part of knowing your top priority is how much time it frees up for other purposes, like fixing bugs, adding features, automating the system, and improving metrics. Basically, more time is available to respond more quickly to customers because we know what is and is not critical.

Some managers believe everything is critical and deserves the same attention to detail as posting programs. That platitude is either propaganda for other departments' consumption or an inability or unwillingness to prioritize.

If the purpose of everything-is-critical management is to make other departments feel their issues are given equal weight with all other issues then the practice is either dishonest or a ruse to avoid conflict. Telling someone their issue is a top priority when it isn't demonstrates both parties lack respect for each other.

Any other reason for a lack of prioritization doesn't fool programmers and quality analysts. They know priorities need to exist and that a failure to prioritize indicates a failure to lead. Everything can't be an emergency. Every issue isn't a crisis. They can't be--otherwise we wouldn't be able to recognize something even more important than the very important thing we're already working on.

And that's important.

Thursday, May 24, 2007

Everything you need to know about software architecture you learned in kindergarten



Designing and building software can be easier than you think. What makes it easy is having a set of rules any decision can be tested against so the rules may design the system for you.

The idea isn't so crazy. In 1986 Craig Reynolds simulated the behavior of a flock of birds using only three rules. What was previously thought too complex to program without great effort was suddenly not just easier for programmers to construct, but easier for non-programmers to understand.

In InStream's case rules don't process transactions, but they do provide the system design's conceptual integrity. Without enumerated rules the decisions a software designer makes can seem capricious, arbitrary, or simply black magic. As hard as it is to articulate gut feelings, it's even harder to debate them or to distribute decision-making throughout a staff if a gut feeling is their only design guide.

When all a system's code is the result of a small set of rules its behavior is more easily predicted and its complexity more easily comprehended. Knowing what a system does and what it will do, and knowing the rules that created it, make extending and adapting it in unanticipated ways much easier, even by new programmers.

To really pick up development momentum the staff must also share a common metaphor. Few systems are truly ex nihilo, without precedent either man-made or natural. Inspiration for working software models can come from physics, biology, dams, post offices, or dry cleaners. When everyone understands the metaphor, everyone knows how to marshal their efforts toward the same goal without consulting a single oracle they would otherwise depend on.

So the first rule we learned in kindergarten: sharing. Share your rules and share your inspiration.

Next we'll look at InStream's rules for production, database design, and programming.

Saturday, March 24, 2007

Myth : Normalization == Poor Performance

[this article was originally written in 1997] 

Have you ever read something over and over again that you know is false, but worry that if people keep repeating it, it will become fact? To dictators it's called propaganda. Marketing and sales people call it "believing your own bull."

The comparison may seem extreme (some might say histrionic), but that's what's happening now with the mantra, "Normalized data can't perform." Repeat it a couple hundred times (maybe you already have) and you'll start believing it too.

Like many other falsehoods in technology, this kind of thinking attempts to blame technology for people's own ignorance about how to use it. 

Russ Matika, one of the best system programmers I've been lucky enough to work with, insisted there was nothing wrong with various technologies (whichever one I was complaining about at the time). He suggested we just didn't know how to exploit them.

This is especially true of relational databases--one of the few computer-related technologies with a mathematical foundation for a "correct" way to design systems. That foundation is called set theory, or relational set theory. Set theory is introduced to students as early as 6th grade these days. Although it's not presented to students in the context of database design, the math is the same.

So what's the supposed problem with normalized (relationally correct) data? DBAs and programmers who are confused about the role of database engines complain that normalization results in too many tables. Besides the administrative nightmare, more tables mean more JOINs to accomplish simple queries. More JOINs mean more CPU cycles (and logical IOs), which translate into longer-running queries--so long, in fact, that the wait renders the query useless by the time it leaves the database and appears in the application.

They're right. They're absolutely right. But the fault lies not with the database engine, its query optimizer, or the design of the database. The fault lies with the programmers and DBAs who think it's the database's responsibility to do the work for the application programmer. It's the fault of anyone who confuses the database's role of storing and organizing (disassembly) data with aggregation and projection (assembly).

Let me share my own Dickensian "Tale of Two Queries."

A credit union has a report called "Consolidated Analysis" which prints, for every customer, totals for all their demand, checking, CD, closed-end, and open-end loan accounts across a page. At the end of the report it displays totals. Sounds like a fairly straightforward report, doesn't it? All the credit union's data was stored in a relational database, with the various attributes of each account type residing in its own tables. To create a single report with all the information, all these tables had to be JOINed together. To make a long story short: using traditional SQL with a fairly typical report-writer program, the report took nearly two hours for a measly 150,000 customers. Heck, the printer could print faster than that!

Undaunted, Russ insisted there was nothing wrong with the database. Nothing wrong with the hardware. Nothing wrong with the network. Only something wrong with our query. He suggested another approach--one that required a bit more programming (pun intended) and additional thought on the part of the programmer--but Cliff was up to the challenge.

Cliff knew the separate pieces of the report ran quickly. If you ran just the "Deposit Analysis," the "Loan Analysis," or the "CD Analysis" report, it finished in just a few minutes. Cliff re-coded his consolidated analysis to basically run the three reports at once, moving the responsibility for the final join from the database to his application. Since each report was sorted by customer, the application code was a breeze (just make sure you finish all three reports for each customer before moving on to the next).

The new, trivially more complex, consolidated analysis ran in six minutes. 
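
A hedged reconstruction of the approach (not Cliff's actual code): each piece is a simple, fast query sorted by customer, and the application merges the sorted streams instead of asking the database for one giant JOIN. The table names are hypothetical.

    select customer_id, sum(balance) as deposit_total
    from   deposit
    group by customer_id
    order by customer_id

    select customer_id, sum(balance) as loan_total
    from   loan
    group by customer_id
    order by customer_id

    select customer_id, sum(balance) as cd_total
    from   cd
    group by customer_id
    order by customer_id

    -- The report walks all three result sets in step, emitting one line
    -- per customer--a merge of sorted streams, not a database JOIN.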

Isolated incident? 

This same credit union is trying to run an online transaction system with the same relational database. Imagine all teller, ATM, and batch transactions posting to an online system using its new relational database. No batch jobs. No nighttime processing. "No way!" you think? 

Well, in the beginning you were right. A typical savings deposit implemented using database stored procedures (that's what the vendors tell you to do) required 1500-2000 logical IOs to complete. Many of the IOs were duplicate reads of data already read, due mostly to the limitations of variable scoping in stored procedures, parameter limitations, and so on. Sure, the re-read pages were cached, but they still cost CPU cycles.

Undaunted, Russ insisted there was nothing wrong with the database. Nothing wrong with the hardware. Nothing wrong with the network. Only something wrong with our transactions. 

Cliff analyzed the transactions and discovered there were probably only 100 or so unique IOs required for the transaction. The other 1400-1900 were duplicate reads of account data, or reads of system (or persistent) data that changed infrequently. He suggested that by moving the transaction business logic out of the database and into another language with greater flexibility, and by reading "fixed" data once, we could create a deposit transaction that required only 100 logical IOs.

Cliff was right. He helped design a transaction processor that initialized the persistent data in its own memory when it started, did only the minimal number of reads to get the account data, and, when it was done with all the business logic, submitted to the database only the minimum number of updates necessary to effect the transaction. Admittedly, this solution is incrementally more sophisticated than our first, but the results paid off: a 20x improvement in transaction throughput--more than enough to meet the posting requirements of the system.

And the database was still normalized. Nothing in the logical data model had been forfeit for performance in the physical. 

So, is there something wrong with relational databases and normalized data? Or is it only our ability to exploit them?