One of the things I love about computer science is the mathematical certainty of the process. The logical constructs of an algorithm define a structure that is rigidly adhered to. If assembled correctly, the result of a function is one hundred percent predictable within the bounds of the system. I realize that is a bit simplistic, but there is a beauty in that idea that I find inspiring. For me the comfortable confines of a predefined world help balance out the worrisome chaos of the real world.
Unfortunately, complex software systems don’t adhere to this lovely fantasy-land of predictable outcomes I maintain for myself. The ones that survive multiple iterations tend to build up quirks in the trail of data they leave behind. Time will batter and bruise even the most perfectly conceived schema. Unless the software is frozen, somewhere along the line something will trigger a change in the meaning of some data. Maybe a bug, maybe a feature, but either way it leaves a little turd in the record: a small crack in the data’s integrity.
It’s probably no surprise that I’m a big fan of the relational database model. Done properly, it provides a structure to data without which powerful and diverse queries and manipulations would not be possible. Who wouldn’t love that? Once again the enemy of all that is good, “time”, doesn’t love it. Nor does his annoying sidekick “complexity”. Bring in the whole crew with “large scale” and you have achieved the trifecta of data turd production. To top it off, the constant improvements of a nimble development team unintentionally make this problem worse.
Don’t get me wrong, I’m not suggesting these are badly designed systems; I’m thinking specifically about ones built by engineers far more experienced and knowledgeable than yours truly. Sadly it’s impossible to plan for every future eventuality, and even if it were, the result would undoubtedly be sub-optimal. So just like in real life, turds are unavoidable. One could argue that different data models like NoSQL or multi-value databases don’t suffer from the same types of issues as relational designs. I would argue that’s just silly because nobody cares about those.
Over the last ten years I have been involved in a number of projects for different employers that directly or indirectly deal with dirty data. On the other side of the fence I have been responsible for depositing a bit of fecal matter in the data stream now and then (I mean, shit happens amiright?). Anyway I’m not bragging when I say I have a decent amount of experience with this. Really I’m not, since what I am actually doing is equating myself with the digital equivalent of a pooper-scooper.
Trying to ferret out cause and effect for years-old records in a constantly changing system is often a lost cause. This is one of the few times when understanding why an inconsistency occurs is less important than just fixing it and moving on. If I am right (with regard to my propensity for being right I will state that in no way is that statistically relevant to the above conclusions, and in fact revealing that metric might negatively bias the reader), and I’m pretty sure I am, dirty data is something that will need to be dealt with sooner or later. Common sense and patience are your best tools. Lots and lots of patience. I do have some tips from the trenches that might help tilt the scales in your favor:
- Before modifying any data, have a plan in place to revert the changes. Not just a vague idea in the back of your head, but a real plan that you actually test beforehand.
- Review your SQL until it’s solid enough to crush diamonds. Remember the first time you accidentally took down a production server because you forgot a “WHERE” clause on an “UPDATE” statement? Everybody else does too, so cut it out.
- Transactions are manna from heaven, use them wisely:
Know when to commit them.
Know when to roll them back.
Know when to walk away.
Know when to run.
You never count your updates,
when you’re sitting at the command prompt.
There will be time enough for counting,
when the transaction’s done.
- Be ready to be wrong. The comforts of logical outcomes are long gone; now it’s time to juggle probability nightmares in a recursive cycle of alternating head-desks and face-palms. If you are a drinker, establish physical obstacles between alcohol sources and your work area.
- Beware of the law of diminishing returns. The closer you get to wiping out all of the anomalies in your data, the longer it will take to make progress on the remaining problems.
- Are you replicating data? If so, throttle for it. Most of the replication technologies I have worked with are like delicate flowers that completely fall apart at what appears to be the slightest whim.
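To make the first tip concrete, here is a minimal sketch of a tested revert plan using Python’s built-in sqlite3 module. The `orders` table, its `status` column, and the cleanup itself are all invented for illustration; the point is that snapshotting the rows you are about to touch turns “revert” into a mechanical step instead of a vague idea:

```python
import sqlite3

# Hypothetical schema: legacy rows in "orders" have a NULL status
# that should become 'closed'. Names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders (status) VALUES (?)",
                 [("open",), (None,), (None,)])

# Step 1: snapshot exactly the rows the fix will touch.
conn.execute(
    "CREATE TABLE orders_backup AS SELECT * FROM orders WHERE status IS NULL")

# Step 2: apply the fix.
conn.execute("UPDATE orders SET status = 'closed' WHERE status IS NULL")
conn.commit()

# Reverting later is a single statement against the backup, e.g.:
#   UPDATE orders SET status =
#     (SELECT b.status FROM orders_backup b WHERE b.id = orders.id)
#   WHERE id IN (SELECT id FROM orders_backup);
```

Test the restore path on a copy before you need it in anger; an untested backup is just a superstition.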
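The WHERE-clause and transaction tips above combine into one defensive habit: dry-run the predicate with a SELECT COUNT first, run the UPDATE inside a transaction, and only commit if the row count matches what you expected. A sketch with an invented `users` table (again Python and SQLite, purely for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users (email) VALUES (?)",
                 [("a@old.example",), ("b@old.example",), ("c@new.example",)])

where = "email LIKE '%@old.example'"

# Dry run: count what the WHERE clause matches before touching anything.
expected = conn.execute(f"SELECT COUNT(*) FROM users WHERE {where}").fetchone()[0]

cur = conn.execute(
    f"UPDATE users SET email = replace(email, '@old.example', '@new.example') "
    f"WHERE {where}")

if cur.rowcount == expected:
    conn.commit()      # know when to commit them
else:
    conn.rollback()    # know when to roll them back
```

On a real system you would also paste the WHERE clause from the SELECT into the UPDATE rather than retyping it, which is exactly the mistake the missing-WHERE war story is about.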
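For the replication tip, the usual defensive move is to break one giant UPDATE into small batches with a pause between them, so the replication stream never has to swallow a massive transaction in one gulp. A sketch, with an invented `events` table; the batch size and pause are placeholders you would tune for whatever your replica actually tolerates:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, flag TEXT)")
conn.executemany("INSERT INTO events (flag) VALUES (?)", [("dirty",)] * 10)

BATCH = 3    # rows per transaction; tune to your replication setup
PAUSE = 0.0  # seconds between batches; raise this on a real system

while True:
    cur = conn.execute(
        "UPDATE events SET flag = 'clean' WHERE id IN "
        "(SELECT id FROM events WHERE flag = 'dirty' LIMIT ?)", (BATCH,))
    conn.commit()        # small commits keep the replication log digestible
    if cur.rowcount == 0:
        break            # nothing left to clean
    time.sleep(PAUSE)    # give the replicas a chance to catch up
```

Small transactions also keep lock times short, which your application traffic will appreciate as much as your replicas do.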
If we work together we can make the world of data a cleaner place. So roll up your sleeves, plug your nose, and get to it. And remember, a node in a B-Tree index is a terrible thing to waste.