This post was extensively updated and incorporated into my new book, The Data Conversion Cycle: A guide to migrating transactions and other records for system implementation teams. Now available on Amazon.com in both Kindle format for $4.49 and paperback for $6.99.
Sometime back in the ’80s, a few college students with far too much time on their hands noticed that cartoon characters generally obeyed the laws of physics, except when it was funnier not to. Among the Cartoon Laws of Physics that they documented was the observation, “Any violent rearrangement of feline matter is impermanent.” This was evidenced by various cartoon episodes in which cats were sliced, flattened, stuffed into bottles and violin cases, or otherwise mistreated; yet, “After a few moments of blinking self pity, they re-inflate, elongate, snap back, or solidify.” The corollary to this law is, “A cat will assume the shape of its container.” Until acted on by an outside force, of course, such as a barking dog. I bring this up because data, like cartoon cats, assumes the shape of its container.
All transaction processing software applications create, read, update, and delete data records. This is true whether the application under consideration stores those records in relational tables, object databases, fixed-length records in mainframe master files, comma-separated values in text files, or some other format. Generally, each data record is designed around a particular data model that is meaningful to the users of the application, based on the application designer’s understanding of the objects being represented. And over time, as our understanding of those objects has grown more sophisticated, so have our data models. These days, I doubt many applications will force you to store your postal code in a five-digit field labeled “Zip Code.” But once upon a time, that was a common limitation. Of course, people of mixed racial heritage have only recently started seeing a choice that matches their identity, and transgender people still commonly have to choose between “Male” and “Female.” But that growing sophistication only exacerbates the challenge of converting information stored in one application into something useful to another.
As we improve our ability to precisely encode the information we collect and maintain to reflect an ever more detailed and nuanced world view, we also have to decide what to do with the information we collected in our “old” containers, based on our “old” models. How do we identify and correct the mangled addresses? How do we map from a model that featured six mutually exclusive values for a required field, to a model that features seven, or five? Who decides, and when and where do we make the update? Do we record the change as part of the audit trail? If a field value is calculated or inferred as part of an integration or interface with another application, system, or service, how do we handle exceptions and errors?
Years ago, I helped a London-based financial institution convert from their old HR system – an MS-DOS application, 25 lines by 80 characters, built on shared data files hosted on a file server – to an ERP. It allowed a rather broad range of alternative values for an emergency contact’s “Relationship,” including Husband, Wife, Spouse, Boyfriend, Girlfriend, Domestic Partner, Brother, Sister, Step-Sister, Step-Brother, Sibling, and so on. I asked if we could apply a little algebra and “combine like terms.” This led to a little hesitation, so I asked if they could describe the difference in business process between, say, Sister, Step-Sister, and Sibling. That wasn’t the point, I was told – what mattered was what the relationship meant to the employee. I then presented an estimate of the cost to modify the ERP to accommodate such a broad range of less-than-mutually-exclusive alternatives, and after some discussion, we agreed on a mapping of the existing data to a smaller, non-overlapping list of values. Apparently, cost mattered even more than specificity.
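Consolidations like this usually boil down to a lookup table that fails loudly on anything unmapped, so the exceptions come back to the data owners instead of slipping through. A minimal sketch in Python – the groupings shown are illustrative, not the actual list we agreed on:

```python
# Illustrative consolidation of legacy "Relationship" values onto a
# smaller, non-overlapping target list. The specific groupings here are
# assumptions for the sake of example, not the client's actual mapping.
RELATIONSHIP_MAP = {
    "Husband": "Spouse",
    "Wife": "Spouse",
    "Spouse": "Spouse",
    "Brother": "Sibling",
    "Sister": "Sibling",
    "Step-Brother": "Sibling",
    "Step-Sister": "Sibling",
    "Sibling": "Sibling",
    "Domestic Partner": "Domestic Partner",
}

def convert_relationship(legacy_value: str) -> str:
    """Map a legacy relationship value onto the agreed target list.

    Unmapped values raise an error rather than passing through silently,
    so each exception gets a deliberate decision from the data owners.
    """
    try:
        return RELATIONSHIP_MAP[legacy_value.strip()]
    except KeyError:
        raise ValueError(f"No agreed mapping for relationship {legacy_value!r}")
```

The important design choice is the loud failure: a conversion that quietly dropped or passed through an unexpected value would hide exactly the kind of anomaly these conversations exist to surface.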
I bring this up because there may be similar anomalies, inexplicit (or inexplicable) choices, and weirdnesses lurking in the data stored in the system of record you’re planning to replace, or to source interface data from. Don’t assume that moving data records will be a simple matter of export, translate, and load – you might have to engage in some interesting conversations along the way. There may be different degrees of precision in numerical values between the legacy system and its replacement. There may be unstated assumptions about time zones in time stamps. You’ll need to find these data model differences, assess their impact, and get the owners of the data to agree on a course of action. In short, you’ll need to act as an “outside force,” like the barking dog in those cartoons, driving that misshapen data back into a more natural state.
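Two of those differences – numeric precision and naive time stamps – can be made visible with small conversion helpers that report what they changed. A sketch in Python; the target precision and the legacy system’s fixed UTC offset are assumptions for illustration:

```python
from datetime import datetime, timedelta, timezone
from decimal import ROUND_HALF_UP, Decimal

def convert_amount(legacy_amount: str, places: str = "0.01"):
    """Round a legacy value to the target system's precision.

    Returns the converted value plus a flag indicating whether any
    information was lost, so lossy rows can be logged for review.
    The two-decimal target precision is an assumption for this example.
    """
    original = Decimal(legacy_amount)
    converted = original.quantize(Decimal(places), rounding=ROUND_HALF_UP)
    return converted, converted != original

def convert_timestamp(naive_local: datetime, utc_offset_hours: int) -> datetime:
    """Attach the legacy system's assumed fixed UTC offset (hypothetical)
    to a naive time stamp, then normalize to UTC for the target system."""
    legacy_tz = timezone(timedelta(hours=utc_offset_hours))
    return naive_local.replace(tzinfo=legacy_tz).astimezone(timezone.utc)
```

Flagging lossy rounding, rather than just doing it, is what turns a conversion script into evidence you can bring to the data owners when you need them to agree on a course of action.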