The Data Conversion Cycle – Now Available on Amazon.com

My new book, “The Data Conversion Cycle: A guide to migrating transactions and other records, for system implementation teams,” is now available on Amazon.com in both Kindle format for $4.49 and paperback for $6.99. If you buy the paperback version, you can also buy the Kindle version for 99 cents in what Amazon calls “MatchBook” pricing.

When asked for the most common sources of problems for software system implementation projects, experienced system implementers and consultants always list data conversion among their top three. Converting from one production record-keeping system to another is a challenge because you not only have a moving target; you also have a moving origin, as records are created and updated each day while the project is in progress. This book expands on a series of blog posts on The Practicing IT Project Manager website. Although the posts were originally written for my project manager following, I extensively revised the content for a general business audience.

This book was designed to be a resource for project teams made up of not just project managers and IT specialists, but also the people working in the business areas who own and maintain the data records and will use the new systems. The goal was to provide a clear model expressed in a common language for a cross-functional team.

The first six chapters explain data conversion as an iterative process, from defining the scope, to mapping source system records to the target system, to extraction and loading, to validation. This methodology works well with Agile methods, especially those involving iterative prototyping. However, it can also be used with more traditional planning-intensive approaches.

I also include a chapter on incorporating data conversion into the project planning process and a chapter on risk management. The risk management chapter starts with the basics and goes into considerable detail in identifying risks applicable to data conversion. The book includes an Appendix with example output from a risk identification meeting and the types of information to include in a risk register. There is also a chapter on measuring progress when using this iterative approach, and a Glossary.

As always, thanks for reading my stuff.

Of Normalization, Transformation, and Soup Spoons

Over the last few months, Nick Pisano and I have been exchanging observations on the technical challenges of extracting actionable information from the massive volumes of project historical records. Nick’s latest post at his Life, Project Management, and Everything blog continues his ruminations on finding ways to make this mountain of data more accessible to data warehousing approaches, via normalization. Here are my thoughts on the matter:

Normalization, as defined by Edgar F. Codd and popularized by Chris Date, refers to relational databases. It is a design approach that seeks to reduce the occurrence of duplicate data. Relational databases managed using a single application, or a group of applications that respect the referential integrity rules of the system of record, benefit from this rigorous approach – it reduces the risk of errors creeping into the data as records are added or maintained, and makes management reporting queries more efficient.
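
To make that concrete, here is a minimal sketch in Python of the kind of duplication normalization removes; the table and field names are my own invented example, not drawn from Codd or Date.

```python
# A denormalized order table repeats the customer's name and city on every row.
orders_denormalized = [
    {"order_id": 1, "customer": "Acme Corp", "city": "Dayton", "amount": 150.00},
    {"order_id": 2, "customer": "Acme Corp", "city": "Dayton", "amount": 75.50},
    {"order_id": 3, "customer": "Bravo LLC", "city": "Toledo", "amount": 220.00},
]

# Normalized design: customer attributes live in one place, keyed by an ID,
# and each order carries only that key.
customers = {}  # customer name -> customer record
orders = []     # orders referencing customers by customer_id

for row in orders_denormalized:
    name = row["customer"]
    if name not in customers:
        customers[name] = {
            "customer_id": len(customers) + 1,
            "name": name,
            "city": row["city"],
        }
    orders.append({
        "order_id": row["order_id"],
        "customer_id": customers[name]["customer_id"],
        "amount": row["amount"],
    })

# Correcting a customer's city now touches one record instead of every order,
# which is where the reduced risk of maintenance errors comes from.
customers["Acme Corp"]["city"] = "Columbus"
```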

We Don’t Maintain History

Historical records, on the other hand, are not maintained; they are queried. Careful controls are used to ensure that historical records are not updated, corrected, tweaked, or otherwise dinked with. That said, not all historical records are consumable in their native state. Many require some form of rationalization before they can be used. For example, two databases containing shoe sizes may have incompatible values if the data was collected in different countries. But there is a more difficult conundrum: unaccounted-for states.

In an earlier time, a field called “Sex” would have two mutually exclusive values. Then someone noted that transgender people might not self-identify with either value. In jurisdictions where privacy matters are given more weight, and the entity collecting the data has to be able to justify the request, a “None of your damned business” value is required. And as more records are prepared by software actors, we’ve found the need for “We don’t know, exactly.” Consequently, commingling history with new-age records requires more than simple translation – it requires an understanding of the original request that resulted in the recorded response, and the nuances implied in the recorded values. That level of understanding can usually be provided only through transformation at the source, rather than at the destination.
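
As a hypothetical sketch of what transformation at the source might look like, here is a small Python example; the legacy codes, field names, and target vocabulary are invented for illustration and carry no claim about any particular system.

```python
# The source team, which knows how the original question was asked, owns the
# translation of its legacy codes into the richer vocabulary the new system
# expects. Codes and values here are illustrative only.
LEGACY_CODES = {
    "M": "male",
    "F": "female",
}

def transform_at_source(legacy_record):
    """Translate one legacy record into the target vocabulary.

    Anything the old system could not represent is mapped explicitly to
    "unknown" rather than being silently forced into one of the old values.
    """
    code = legacy_record.get("sex", "")
    return {
        "person_id": legacy_record["person_id"],
        "gender": LEGACY_CODES.get(code, "unknown"),
    }

history = [
    {"person_id": 101, "sex": "M"},
    {"person_id": 102, "sex": ""},    # never answered: becomes "unknown"
    {"person_id": 103, "sex": "F"},
]
converted = [transform_at_source(record) for record in history]
```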

Knowledge Merits Design

All that said: if knowledge is an artifact, then it merits design. But just as there is more than one valid, effective design for a soup spoon, so will there be multiple valid designs for the knowledge accumulated during a project. I would suggest that, while normalization focuses on standardizing design, it does not address history, nor does it account for data nuances. I would also point out that current query technology and natural-language processing have made the analysis of history much less dependent on the relational models used to collect and maintain it. In order to support management reporting across lots of databases, you need to be able to make queries that are meaningful in the context of the question at hand, without regard to the underlying values in the database, or the primary and foreign keys. Aggregating the transformed responses requires a certain amount of faith in the sources, but just as relational designs make records less subject to error by eliminating redundant values, so does localized transformation reduce errors by eliminating redundant rules.
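
Here is a minimal sketch of that aggregation step, assuming each source has already applied its own local transformation into a shared vocabulary; the source names and values are invented for illustration.

```python
from collections import Counter

# Each source ships records already expressed in the shared vocabulary, so the
# reporting query never needs the sources' native codes or primary/foreign keys.
source_a = [{"gender": "female"}, {"gender": "male"}, {"gender": "unknown"}]
source_b = [{"gender": "female"}, {"gender": "declined"}]

report = Counter(record["gender"] for record in source_a + source_b)
print(report)
# e.g. Counter({'female': 2, 'male': 1, 'unknown': 1, 'declined': 1})
```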

One day, our brilliant designs will be seen as quaint, or at least insufficient. With luck, we’ll live long enough to be embarrassed by them. Peace be with you.

Measuring Progress in the Iterative Data Conversion Cycle

This post was extensively updated and incorporated into my new book, The Data Conversion Cycle: A guide to migrating transactions and other records for system implementation teams. Now available on Amazon.com in both Kindle format for $4.49 and paperback for $6.99.

This is the last article in the series on the data conversion cycle.  As is frequently noted, you can’t manage what you don’t measure.  However, as both the Hawthorne experiments and Werner Heisenberg found in the 1920s, the act of measuring a phenomenon influences the object under observation.  The trick, then, is to measure carefully, so that any influence your measurement has is at least neutral, and preferably desirable.  Consequently, I’m going to close out this (admittedly interminable) series on the data conversion cycle with considerations for assessing data conversion process quality, as the team “learns how to move data.”

As previously noted, a beneficial side effect of an iterative approach to data conversion is that the team eventually gets good at it.  But what constitutes “goodness”?  For most projects, “good” would be defined as error-free, fast, and predictable.  The trick is expressing those attributes in such a way as to make them measurable, without driving one at the expense of the others.  To that end:

  • Error rate: the number of corrections to be made in the target system subsequent to the load, divided by the number of records loaded.  This ignores the “learning” errors in mapping or extraction processes in order to concentrate on outcomes.
  • Extraction time: total time (as opposed to work hours) from the copy of the source system, through the extraction and formatting of records, to the transfer of the extracted data to the load team.
  • Load time: total time to load the formatted records to the target system.
  • Validation time: total time required for validation of the load.
  • Predictability: sum of Extraction time, Load time, and Validation time, divided by the predicted time required for them.  A value of 1.0 means that the process is absolutely predictable, whereas variances from 1.0 indicate the degree of uncertainty (see the sketch following this list).
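
To make the arithmetic concrete, here is a minimal sketch in Python of computing these metrics for a single conversion cycle; the function and the sample numbers are mine, not from the book.

```python
def cycle_metrics(corrections, records_loaded,
                  extraction_hours, load_hours, validation_hours,
                  predicted_hours):
    """Metrics for one conversion cycle; times are elapsed hours, not work hours."""
    actual_hours = extraction_hours + load_hours + validation_hours
    return {
        "error_rate": corrections / records_loaded,
        "extraction_time": extraction_hours,
        "load_time": load_hours,
        "validation_time": validation_hours,
        "predictability": actual_hours / predicted_hours,  # 1.0 = exactly as predicted
    }

# Illustrative numbers only: 38 corrections after loading 12,500 records, against
# a 40-hour prediction for the whole extract/load/validate window.
print(cycle_metrics(corrections=38, records_loaded=12_500,
                    extraction_hours=20, load_hours=6, validation_hours=16,
                    predicted_hours=40))
# error_rate is about 0.003; predictability is 1.05 (ran 5% longer than predicted)
```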

Plainly, the error rate is critical to the users of the system, as they will have to make any needed corrections.  Also, the more time it takes for the extraction, load and validation, the longer the users will be unable to enter transactions, and the more transactions will accumulate for entry once the target system is finally available to the users.  But predictability is vital to both the users and the conversion team, as a tight, accurate cutover schedule is in everyone’s best interest.  The ability to minimize the unknowns (read: risks) in the cutover to production is largely a function of the predictability of the process.

Tracking these metrics in each cycle will not only give the project team the ability to measure improvements, but also guide decision making on where to expend resources.  On most projects, improvements in the validation processes will reduce validation time, with the side benefit of improving predictability.  Driving automation of the extraction processes will usually produce the same benefits, frequently with the added benefit of a reduced error rate.  But in order to get the best return on investment, it is useful to analyze the metrics from each conversion, so that efforts to reduce the error rate don’t increase the extraction and load times more than necessary.  Measurements allow for trade-offs, so you don’t go past the point of diminishing returns on any one metric.

Thanks for reading through all of these posts over the last two months.  As previously mentioned, I plan to consolidate these posts into a Kindle book.  Special thanks to Samad Aidane for the “blog a book” idea!