Of Normalization, Transformation, and Soup Spoons

Over the last few months, Nick Pisano and I have been exchanging observations on the technical challenges of extracting actionable information from the massive volumes of project historical records. Nick’s latest post at his Life, Project Management, and Everything blog continues his ruminations on finding ways to make this mountain of data more accessible to data warehousing approaches, via normalization. Here are my thoughts on the matter:

Normalization, as defined by Edgar F. Codd and popularized by Chris Date, applies to relational databases. It is a design approach that seeks to reduce the occurrence of duplicate data. Relational databases managed using a single application, or a group of applications that respect the referential integrity rules of the system of record, benefit from this rigorous approach – it reduces the risk of errors creeping into the data as records are added or maintained, and makes management reporting queries more efficient.

We Don’t Maintain History

Historical records, on the other hand, are not maintained; they are queried. Careful controls are used to ensure that historical records are not updated, corrected, tweaked, or otherwise dinked with. That said, not all historical records are consumable in their native state. Many require some form of rationalization before they can be used. For example, two databases containing shoe sizes may have incompatible values, if the data was collected in different countries. But there is a more difficult conundrum: unaccounted-for states.

In an earlier time, a field called “Sex” would have two mutually exclusive values. Then someone noted that the transgendered might not self-identify with either value. In other jurisdictions, where privacy matters are given more weight, and the entity collecting the data has to be able to justify the request, a “None of your damned business” value is required. And as more records are prepared by software actors, we’ve found the need for “We don’t know, exactly.” Consequently, commingling history with new-age records requires more than simple translation – it requires an understanding of the original request that resulted in the recorded response, and the nuances implied in the recorded values. That level of understanding can usually be provided only through transformation at the source, rather than at the destination.
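To make “transformation at the source” concrete, here is a minimal sketch in Python. The legacy codes, the target vocabulary, and the rule tables are all hypothetical; the point is simply that each source system owns the rules for translating its own recorded values, so the nuances of the original question travel with the data:

```python
# A minimal sketch of transformation at the source: each legacy system owns the
# rules for translating its recorded values into a richer target vocabulary.
# The legacy codes, target values, and rule tables here are hypothetical.

TARGET_VALUES = {"female", "male", "nonbinary", "declined", "unknown"}

# Source A collected a strict two-value field; nothing else was ever recorded.
SOURCE_A_RULES = {"F": "female", "M": "male"}

# Source B, collected under stricter privacy rules, allowed a refusal code.
SOURCE_B_RULES = {"1": "female", "2": "male", "9": "declined"}

def transform(raw_value, rules):
    """Translate one recorded value using the rules owned by its source system."""
    value = rules.get(raw_value, "unknown")   # unaccounted-for states become "unknown"
    assert value in TARGET_VALUES
    return value

# Each source transforms before its records are commingled at the destination.
history = [transform(v, SOURCE_A_RULES) for v in ("F", "M", "X")]
recent  = [transform(v, SOURCE_B_RULES) for v in ("1", "9", "")]
print(history + recent)  # ['female', 'male', 'unknown', 'female', 'declined', 'unknown']
```

Because each source applies its own rules before the records are commingled, the destination never has to guess what an unfamiliar or retired code once meant.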

Knowledge Merits Design

All that said: if knowledge is an artifact, then it merits design. But just as there is more than one valid, effective design for a soup spoon, so will there be multiple valid designs for the knowledge accumulated during a project. I would suggest that, while normalization focuses on standardizing design, it does not address history, nor does it account for data nuances. I would also point out that current query technology and natural-language processing have made the analysis of history much less dependent on the relational models used to collect and maintain it. In order to support management reporting across lots of databases, you need to be able to make queries that are meaningful in the context of the question at hand, without regard to the underlying values in the database, or the primary and foreign keys. Aggregating the transformed responses requires a certain amount of faith in the sources, but just as relational designs make records less subject to error by eliminating redundant values, so does localized transformation reduce errors by eliminating redundant rules.

One day, our brilliant designs will be seen as quaint, or at least insufficient. With luck, we’ll live long enough to be embarrassed by them. Peace be with you.

Measuring Progress in the Iterative Data Conversion Cycle

This is the last article in the series on the data conversion cycle.  As is frequently noted, you can’t manage what you don’t measure.  However, as both the Hawthorne experiments and Werner Heisenberg found in the 1920s, the act of measuring a phenomenon influences the object under observation.  The trick, then, is to measure carefully, so that whatever influence your measurement has is at least neutral, and preferably desirable.  Consequently, I’m going to close out this (admittedly interminable) series on the data conversion cycle with considerations for assessing data conversion process quality, as the team “learns how to move data.”

As previously noted, a beneficial side effect of an iterative approach to data conversion is that the team eventually gets good at it.  But what constitutes “goodness?”  For most projects, “good” would be defined as error-free, fast, and predictable.  The trick is expressing those attributes in such a way as to make them measurable, without driving one at the expense of the others.  To that end:

  • Error rate: the number of corrections to be made in the target system subsequent to the load, divided by the number of records loaded.  This ignores the “learning” errors in mapping or extraction processes in order to concentrate on outcomes.
  • Extraction time: total elapsed time (as opposed to work hours) from the copy of the source system, through the extraction and formatting of records, to the transfer of the extracted data to the load team.
  • Load time: total time to load the formatted records to the target system.
  • Validation time: total time required for validation of the load.
  • Predictability: sum of Extraction time, Load time, and Validation time, divided by the predicted time required for them.  A value of 1.0 means that the process is absolutely predictable, whereas variances from 1.0 indicate the degree of uncertainty.  (A sketch of these calculations follows this list.)
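
For teams that want to track these figures in a script or spreadsheet, here is a minimal sketch of the arithmetic in Python.  The field names and sample figures are hypothetical, and the times are elapsed (calendar) time rather than work hours; this is just the formulas above made executable, not a prescribed tool:

```python
# A minimal sketch of the cycle metrics described above.  The field names and
# sample figures are hypothetical; times are elapsed hours, not work hours.
from dataclasses import dataclass

@dataclass
class ConversionCycle:
    records_loaded: int
    corrections_after_load: int   # fixes made in the target system after the load
    extraction_hours: float       # copy of source through handoff to the load team
    load_hours: float             # loading formatted records into the target
    validation_hours: float       # validating the loaded records
    predicted_hours: float        # predicted total for extract + load + validate

    @property
    def error_rate(self) -> float:
        return self.corrections_after_load / self.records_loaded

    @property
    def total_hours(self) -> float:
        return self.extraction_hours + self.load_hours + self.validation_hours

    @property
    def predictability(self) -> float:
        # 1.0 means the cycle took exactly as long as predicted; the further
        # from 1.0 in either direction, the less predictable the process.
        return self.total_hours / self.predicted_hours

cycle = ConversionCycle(records_loaded=120_000, corrections_after_load=300,
                        extraction_hours=10.0, load_hours=6.0,
                        validation_hours=8.0, predicted_hours=20.0)
print(f"error rate: {cycle.error_rate:.4f}")          # 0.0025
print(f"predictability: {cycle.predictability:.2f}")  # 1.20
```

Keeping one such record per conversion cycle makes it easy to see whether a change that drove the error rate down also pushed the elapsed times or the predictability in the wrong direction.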

Plainly, the error rate is critical to the users of the system, as they will have to make any needed corrections.  Also, the more time it takes for the extraction, load and validation, the longer the users will be unable to enter transactions, and the more transactions will accumulate for entry once the target system is finally available to the users.  But predictability is vital to both the users and the conversion team, as a tight, accurate cutover schedule is in everyone’s best interest.  The ability to minimize the unknowns (read: risks) in the cutover to production is largely a function of the predictability of the process.

Tracking these metrics in each cycle will not only let the project team measure improvements, but also guide decision making on where to expend resources.  On most projects, improvements in the validation processes will reduce validation time, with the side benefit of improving predictability.  Driving automation of the extraction processes will usually produce the same benefits, frequently with the added benefit of a reduced error rate.  But in order to get the best return on investment, it is useful to analyze the metrics from each conversion, so that efforts to reduce the error rate don’t increase the extraction and load times more than necessary.  Measurements allow for trade-offs, so you don’t go past the point of diminishing returns on any one metric.

Thanks for reading through all of these posts over the last two months.  As previously mentioned, I plan to consolidate these posts into a Kindle book.  Special thanks to Samad Aidane for the “blog a book” idea!

A Risk Taxonomy for Data Conversion

Continuing the series on the data conversion cycle.  In order to conduct a meaningful risk assessment of a planned data conversion effort, we need to consider not just the legacy source system and the target, but also the remainder of the project.  A risk taxonomy identifies the relevant activities and resources and defines the relationships between them, organized as classes, activities, and attributes.  The team conducting the risk assessment can then use the taxonomy as a source of questions, to determine what is unknown.  In this installment, I’ll outline a risk taxonomy for data conversion that includes sources of risk from the larger project and operating environment, as well as related activities.  This holistic view of risk will allow the project team to better identify the unknowns that matter.

For example, the team members charged with defining the requirements for the conversion can use the taxonomy to identify and anticipate unknowns in each of the activities in the Requirements Class.  A leading practice is to phrase a question, using the attributes of each activity.  “Is the project scope stable enough to let us accurately define what records should be converted?”  “Is the target system configuration complete and stable enough for us to map existing records to it?”
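
Before walking through the classes, here is a minimal sketch in Python of how such a taxonomy might be encoded and used to generate review questions.  The class/activity/attribute structure and the attributes shown are taken from the Requirements Class below; the question template is only one possible phrasing, not part of the taxonomy itself:

```python
# A minimal sketch of a risk taxonomy encoded as data, and of turning attributes
# into assessment questions.  The structure (class -> activity -> attributes)
# and the attributes come from the Requirements Class below; the question
# wording is an illustration only.

TAXONOMY = {
    "Requirements": {
        "Define scope": [
            "stability of the project scope",
            "completeness of the analysis",
            "accuracy of the detailed scope description",
        ],
        "Map data elements": [
            "stability of the target system configuration",
            "correctness of the mappings",
            "completeness of the requirements",
        ],
        # ... remaining activities, and the other classes, would be added the same way
    },
}

def questions(taxonomy):
    """Yield one review question per attribute of each activity."""
    for cls, activities in taxonomy.items():
        for activity, attributes in activities.items():
            for attribute in attributes:
                yield (f"{cls} / {activity}: do we know enough about the "
                       f"{attribute} to treat it as a managed risk?")

for q in questions(TAXONOMY):
    print(q)
```

Encoding the taxonomy as data keeps the question-generating pattern consistent as classes, activities, and attributes are added or tailored to the project.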

The Requirements Class

This class includes the activities that define the requirements for the data to be converted and loaded to the target system.

  • Define scope. This is the analysis done to determine what records will be migrated.  Attributes include stability of the project scope, completeness of the analysis, and accuracy of the detailed scope description.
  • Map data elements.  This is the analysis done to determine where each data element in scope will be placed in the target system, and what changes or transformations must be performed on the source data in order for it to be meaningful.  Attributes include stability of the target system configuration, correctness of the mappings, and completeness of the requirements.
  • Correct invalid mappings.  This is the corrective action taken, based on the validation of a load, to update the mappings.  Attributes include version control, quality control, and updates to the validation plan for the next load.
  • Staff the activities.  This is the people management process that assigns team members to perform the tasks.  Attributes include availability of functional subject matter experts, source system experts, and target system experts.

The Source Extraction Class

This class includes the activities that produce data records, formatted in a way that will allow them to be loaded to the target system.

  • Develop legacy system extraction processes.  This is the development of queries, reports, and related artifacts that will select specific transactions, format them appropriately, and make them available for loading to the target system.  Attributes include software quality, availability of hardware and network resources, and performance.
  • Extract records from the legacy system.  This activity consists of executing the extraction processes from a snapshot of production and making the resulting data available to the team members who will load the target system.  Attributes include access to the system containing the snapshot, sequencing, performance, defect handling, and secure file transfer.
  • Correct invalid processes.  This is corrective action taken, based on the validation of the load, to update the extraction, re-formatting, and file transfer processes that produce the data and make it available to the Load team.  Attributes include version control, quality control, and updates to the validation plan for the next load.
  • Provide resources.  This is the management activity that assigns team members and provides them with access to server resources required to perform their tasks.  Attributes include availability of operations team members and developers, availability of disk space, program access to the source system, and availability of secure file storage.

The Load and Validate Class

This class includes the activities required to load the records extracted from the source to the target system, validate that they were loaded correctly, and initiate corrective actions.

  • Load data records to the target system.  This activity consists of executing the target system load processes on the data records provided by the Source extraction team.  Attributes include software quality, extracted data quality, availability of hardware and network resources, and performance.
  • Validate loaded records.  This is an inspection activity, designed to ensure that the data stored in the source system has been properly moved to the target system, and is ready to support planned operations in production.  Attributes include coverage, accuracy, tools, and access to the target system.
  • Correct invalid records in production.  This is a records management activity.  Attributes include accuracy of historical records, quality of the validation process, and access to the production system.
  • Provide resources.  This is the management activity that assigns team members cleared to access the data, and provides them with access to the loaded system required to perform their tasks.  Attributes include availability of qualified team members, security controls, and access to the loaded target system.

The Project and Environmental Factors Class

This class includes those project activities outside of the data conversion cycle processes that influence or drive the activities listed above.

  • Manage project scope.  This is a project management activity.  Attributes include change control, configuration management, and communication.
  • Manage system design.  This is a target system development activity.  The design of the system is a primary input to the conversion scope and mapping activities.  Attributes include completeness, validity, and maintainability.
  • Manage implementation.  This is a target system development activity.  The configuration of the system greatly influences the mapping and transformation of the source data to make it usable by the target system.  Attributes include completeness, stability, modularity, and traceability of changes.
  • Manage Operations.  This is a data center activity.  Attributes include production system management, operating schedule, and network access.

Plainly, this list is just a starting point.  The classes above name a number of potential risks, but depending on the environment, the scope of the project, and the state of development, there may be many more.  The project team should review the list, asking questions as they go, and identifying the unknowns as issues or potential risks requiring qualitative analysis.  Next week, I’ll conclude this series with a look at how we can measure progress in the iterative data conversion cycle.