Validating Data Conversion

This post was extensively updated and incorporated into my new book, The Data Conversion Cycle: A guide to migrating transactions and other records for system implementation teams. Now available on Amazon.com in both Kindle format for $4.49 and paperback for $6.99.

Continuing the series on the data conversion cycle: subsequent to any bulk load of data from another system, the accuracy and completeness of the data in the target system must be assessed.  This process is generally referred to as validation, and it has three objectives:

  1. Identify errors in mapping data elements between systems
  2. Identify errors in the source system extraction processes
  3. Identify errors in the source system records

As you can see from the diagram below, the validation process is the key feedback point in the data conversion cycle.

[Figure: Data Conversion Cycle]

Validation typically uses several techniques, selected based on the nature of the data being inspected.  They include:

  • Application level comparison – A user logs in to each system and compares the information in the source system with the information in the target system.  This is the most labor-intensive technique, but invaluable for confirming that the data elements were mapped to the correct fields in the target system (Objective 1).
  • Report level comparison – A user or analyst compares a report from the source system with a report containing the same records, prepared from the target system.  The reviewer looks for differences in the number of records, and incorrect or truncated values.  Once identified, a member of the conversion team should investigate whether the difference was caused by the extraction process (Objective 2) or resulted from issues in the source system (Objective 3).  This technique doesn’t require access to the system, or even knowledge of the data or subject matter.  However, it can be fairly labor intensive.  This technique is best used when reviewing data with a limited number of values, such as assignment to an organization, as arbitrary values are difficult to compare.
  • Excel VLOOKUP automated comparison – An analyst prepares an Excel workbook using the data extracted from the source system in one worksheet, and the corresponding data extracted from the target system in another worksheet.  An automated comparison in a third worksheet is then possible using VLOOKUP and other analytical functions within Excel.  This approach requires more preparation, but is usually fastest in execution, especially when inspecting very large numbers of records containing arbitrary values, such as strings (names and addresses), dates, and numerical values.  As with report level comparison, differences are investigated to determine whether the root cause was the extraction process (Objective 2) or an error in the source system (Objective 3).  A scripted sketch of this kind of comparison follows the list.
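
As an alternative to building the workbook by hand, the same comparison can be scripted.  Here's a minimal sketch in Python with pandas (my tool choice for the example, not a requirement); the workbook name, sheet names, key column, and field names are purely illustrative.

```python
import pandas as pd

# Illustrative workbook, sheet, and column names; substitute your own extracts.
source = pd.read_excel("conversion_check.xlsx", sheet_name="source_extract")
target = pd.read_excel("conversion_check.xlsx", sheet_name="target_extract")

# Join on the record key, keeping rows from both sides so missing records surface.
merged = source.merge(target, on="record_id", how="outer",
                      suffixes=("_src", "_tgt"), indicator=True)

# Records present in only one system point to extraction errors (Objective 2)
# or issues in the source system (Objective 3).
missing = merged[merged["_merge"] != "both"]

# Field-level mismatches for records present in both systems.
both = merged[merged["_merge"] == "both"]
fields_to_check = ["last_name", "hire_date", "annual_salary"]  # illustrative
mismatches = {
    field: both[both[f"{field}_src"] != both[f"{field}_tgt"]]
    for field in fields_to_check
}

print(f"{len(missing)} records found in only one system")
for field, rows in mismatches.items():
    print(f"{field}: {len(rows)} mismatched values")
```

The follow-up is the same as with the workbook approach: each difference is traced back to either the extraction process or the source system.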

Much like Chinese cooking, most of the labor in a data validation exercise is in the preparation.  To that end, validation planning should include an analysis of the data being loaded into the target system, to determine the following:

  • What user-accessible field mappings have been made?  It may be possible to identify one or two users with the access rights to check all of the mappings, using application level comparison.  Note that it is not necessary to check multiple records; the purpose is to ensure that the data is “landing on the correct runway.”
  • Based on the record types being loaded, are there delivered audit reports built into the target system that can facilitate a report level comparison with the source?  If no existing report corresponds exactly between the two systems, a custom report can generally be derived from an existing one, on either the source or target system, to facilitate the comparison.  (A simple count-reconciliation sketch follows this list.)
  • Which data elements would be difficult to compare visually, in a report level comparison?  It is useful to identify the specific fields requiring an automated comparison, so that custom reports can be created in advance to extract the records for loading into Excel.  It is also common practice to load the source records into the worksheets and prepare the comparison worksheet while the data is being loaded into the target system, to save time.
  • What dependencies exist?  Usually, there are no dependencies in the inspection process, because the data, once loaded, is static.  However, if there are issues that will impact the availability of conversion resources, they should be identified in advance.
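
For the report level comparison mentioned above, a quick record-count reconciliation is often the first artifact worth preparing.  Here's a minimal sketch, assuming both reports can be exported to CSV with a record_type column; the file and column names are illustrative.

```python
import pandas as pd

# Illustrative file names; in practice these are the agreed report exports.
source_counts = (pd.read_csv("source_report.csv")
                   .groupby("record_type").size().rename("source"))
target_counts = (pd.read_csv("target_report.csv")
                   .groupby("record_type").size().rename("target"))

recon = pd.concat([source_counts, target_counts], axis=1).fillna(0).astype(int)
recon["difference"] = recon["source"] - recon["target"]

# Any nonzero difference goes to the conversion team to determine whether the
# cause was the extraction process (Objective 2) or the source system (Objective 3).
print(recon[recon["difference"] != 0])
```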

A well-planned and coordinated validation process can proceed on a broad front, with a number of workers inspecting specific record types, using specified tools and techniques.  The team should have the goal of minimizing the time required to conduct validation, as it is the final step before the move to production.  This is critical because, between the final extract of data from the legacy source system and the move of the new system to production, transactions will continue to be processed.  Those transactions must then be re-entered into the new system.

The validation process falls between loading the data into the target system and actually using it, in every cycle.  Whether that initial use is development, testing, or production, the process needs to be followed.  Note that validation findings from the final load before the move to production may have to be triaged, because any corrections will have to be made in the production system.  Consequently, the principal measure of success of validation in earlier cycles should be the elimination of sources of error.  A solid validation and correction process should reduce the number of corrections in production to nearly zero.

Next week, I’ll address creating a data conversion plan, based on the data conversion cycle, and integrating it into the overall project plan.

Extracting Records and Loading Them Into The Target System

Last week, I continued this series on the data conversion cycle with a look at how to develop legacy system extraction processes.  This week, we’ll look at actually executing those processes, and then loading the resulting files into the target system.

[Figure: Data Conversion Cycle]

Once the extraction and translation processes have been developed and unit tested, and the sequence of execution scripted, you are ready to create the load files.  However, first determine whether this load will require a refresh of the source from production.  Remember: always extract data from a static system, copied from production at the correct point in the production cycle.  In most cases, it will matter whether certain activities are in process.  You don't want to extract payroll data from a copy taken in the middle of a payroll calculation, for example.

During execution of the extraction script, be sure to note any exceptions, error conditions, or failures.  It may be necessary to halt the process, correct the problem, and start over.  If so, be sure to alert the developers of the extraction and conversion processes and, if corrections to the data are required, the people responsible for maintaining the data in the production system.  This is part of the feedback process needed to carry the lessons learned from one cycle into the next.  As previously noted, it is important to maintain an appropriate level of security for the data records, in transit and at rest.  Don't just email files to the next person in the chain!  This includes any exception or error reports.  Just because the information is no longer in the system of record doesn't mean it isn't subject to the same controls.

Once the extraction and conversion process is complete, transfer control of the resulting load files to the team members responsible for running the loading tool.  Note that the loading process will typically require a certain sequence to be followed, and it won't necessarily be the same sequence in which the load files were created.  Consequently, while it might be tempting to commence loading as soon as files become available, keep in mind that rework by the extraction team might prevent the load from being completed.  Especially in the first conversion cycle, be alert to differences between the extraction sequence and the loading dependencies.

Typically, the loading tool will validate the records to be loaded against other records and configurations already in the system.  You should have a protocol in place for handling these exceptions, updated for each conversion cycle.  Any manual corrections to the load files need to be communicated back to the appropriate people – source production system owners, extraction process developers, or the configuration team – responsible for preventing the error in the next cycle.  If some records can't be loaded, a determination should be made on the impact of the failure, and how to proceed with subsequent load files.  You should also track loading progress continuously, noting execution / load time and the number of records loaded, to facilitate planning of subsequent loads.
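
The tracking itself can be very lightweight.  Here's a minimal sketch of a load log in Python; run_load() is a hypothetical stand-in for however your loading tool is actually invoked.

```python
import csv
import time
from datetime import datetime

def run_load(load_file):
    # Placeholder: invoke the target system's loading tool here and return
    # the number of records loaded and a list of exception messages.
    return 0, []

def load_with_metrics(load_files, log_path="load_log.csv"):
    """Run each load file in sequence, recording timing, counts, and exceptions."""
    with open(log_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["load_file", "started", "seconds", "records_loaded", "exceptions"])
        for load_file in load_files:
            started = datetime.now().isoformat(timespec="seconds")
            t0 = time.perf_counter()
            records_loaded, exceptions = run_load(load_file)
            writer.writerow([load_file, started,
                             round(time.perf_counter() - t0, 1),
                             records_loaded, len(exceptions)])

# Example: load_with_metrics(["departments.csv", "employees.csv", "jobs.csv"])
```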

At the conclusion of the loading script, ensure appropriate audit reports are run to identify problems such as orphaned records, missing or incomplete records, or other anomalies.  Note that this is different from validating the load, as the goal here is assessing readiness for processing.  The load validation process addresses the completeness and accuracy of the records loaded into the system, as compared to their representation in the source system.  I'll address the validation process next week.  In the meantime, if you have any comments or lessons learned from past data conversion projects to share, please leave a comment below.
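
Where the target system doesn't deliver an audit report for a particular relationship, a direct query can stand in.  A minimal sketch of an orphaned-record check follows; the table and column names are hypothetical, and sqlite3 is only a stand-in for whatever database access the target system permits.

```python
import sqlite3  # stand-in for the target system's actual database interface

# Hypothetical tables: job records that reference an employee record which was
# never loaded are "orphaned."
ORPHAN_CHECK = """
SELECT j.employee_id, COUNT(*) AS orphaned_rows
FROM job_records AS j
LEFT JOIN employees AS e ON e.employee_id = j.employee_id
WHERE e.employee_id IS NULL
GROUP BY j.employee_id
"""

with sqlite3.connect("target_copy.db") as conn:
    for employee_id, orphaned_rows in conn.execute(ORPHAN_CHECK):
        print(f"employee {employee_id}: {orphaned_rows} job rows with no parent record")
```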

Developing Legacy System Data Extraction Processes

Last week, I continued this series on the data conversion cycle with an examination of mapping data records for conversion.  This week, I’ll address considerations in developing processes for the extraction and translation of legacy system records, in preparation for loading into the new system.  This all assumes that the new system has known and well-documented tools for loading records in bulk, so that they don’t have to be re-entered manually.  So with that proviso, our conversion scope, and the results of our mapping exercise, let’s consider what it takes to develop a reliable and reproducible data extraction process.

First, it is necessary to use a static copy of the production system.  You can't have records being created and updated while you are in the process of extracting them, or you risk losing referential integrity.  So schedule a copy to be taken at an appropriate point in the operating cycle, and shut off access to the copy for everyone but the team members who need it for the conversion process.  If possible, set their access to read-only.  This will ensure a stable source for conversion.

Next, review the mapping documents for dependencies, to identify the logical sequence required to prepare the load files.  It might be that certain keys need to be established and re-used for related records.  Or it might be that the business rules for certain translations require examination of multiple records.  In any case, it is likely that there is an optimum order for execution of the processes you’ll develop.  Once you have the logical sequence, you can then decompose the overall conversion map into a series of extractions and format conversions.
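
Once the dependencies are captured, the execution order can be derived mechanically rather than by hand.  Here's a minimal sketch using Python's standard-library graphlib; the record types and dependencies shown are illustrative.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical dependencies taken from the mapping documents:
# each extract lists the extracts whose keys must exist first.
dependencies = {
    "departments": set(),
    "employees": {"departments"},          # employees reference department keys
    "jobs": {"employees"},                 # jobs reference employee keys
    "payroll_balances": {"employees", "jobs"},
}

# One valid execution order for the extraction and format-conversion steps.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)  # e.g. ['departments', 'employees', 'jobs', 'payroll_balances']
```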

At this point, you can define the record extraction queries.  Depending on the system and tools available, this might involve SQL queries, system reports, or any number of other methods.  But in all cases, you will specify the fields to retrieve as defined in the mapping document, and select records (the “where” clause in a SQL query) as defined in the scope document.  The combination of the two will determine what information is in the extracted data records.  Once you have tested these queries, and ensured that you are able to extract exactly what you need in order to prepare the load files, you can then create the translation and format conversion logic, as specified in the mapping document.  Finally, you can assemble the records into the format required for the loading tool.
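
Putting those pieces together, a single extraction step typically looks something like the sketch below: the field list comes from the mapping document, the selection criteria from the scope document, and a translation table handles the format conversion.  The table, column, and code values are hypothetical, and sqlite3 is only a stand-in for the legacy system's actual database interface.

```python
import csv
import sqlite3  # stand-in for the legacy system's actual database interface

# Fields from the mapping document; WHERE clause from the scope document.
EXTRACT_QUERY = """
SELECT emp_no, last_name, first_name, hire_dt, status_cd
FROM employee_master
WHERE status_cd IN ('A', 'L')        -- in-scope records only
  AND hire_dt >= '2000-01-01'
"""

STATUS_MAP = {"A": "ACTIVE", "L": "LEAVE"}  # translation rule from the mapping document

with sqlite3.connect("legacy_copy.db") as conn, \
     open("employee_load.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["EmployeeID", "LastName", "FirstName", "HireDate", "Status"])
    for emp_no, last_name, first_name, hire_dt, status_cd in conn.execute(EXTRACT_QUERY):
        # Apply the translation and format conversions, then write a row in the
        # layout the loading tool expects.
        writer.writerow([emp_no, last_name, first_name, hire_dt,
                         STATUS_MAP.get(status_cd, status_cd)])
```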

Note that you will need to maintain an appropriate level of security for the data records, in transit and at rest.  This includes the copy of the production system, extracts and other work files, and the finished load files.  Also note that this overall process needs to be reproducible, from one conversion cycle to the next.  Based on the tools in use, develop scripts to ensure execution is consistent, so that lessons learned can be properly applied in the next iteration.

Next week, I’ll address actually extracting records from the legacy system, and loading the records to the target system.  In the meantime, if you have some lessons learned from past data conversion projects to share, please leave a comment.