Helping subject-matter experts analyze more than 100 million rows of legacy data in preparation for cleansing and migration.

Project Details

Idaho Department of Transportation
New Hampshire Department of Motor Vehicles


The state of Idaho's Department of Transportation and the state of New Hampshire's Department of Motor Vehicles (NHDMV) wanted to modernize their data and transfer operations off of their mainframe and onto modern platforms. The dataset sizes for Idaho and New Hampshire were each more than 100 million individual records and each record had a few dozen to a few hundred field values. To add to the challenge, a majority of the tens of billions of individual field values were kept as free-form string fields with sometimes little or no edit validation in place to prevent typographical errors, data-entry errors, or misuse or multiple use of fields.

For these projects, RESPEC staff provided custom-built, data-cleansing, staging, and warehousing infrastructures to help automate and manage the states’ modernization efforts in preparation for migrating to the new systems. The SME Tools Analysis Toolset (STAT), which RESPEC staff designed and built, was an essential part of these operations.

STAT provides an interactive user interface that enables subject-matter experts (SMEs) to quickly and easily browse through entire tables, individual columns, distinct value listings, validation exceptions, and data rules violation reports while optionally applying custom filtering and sorting, drilling down into the underlying source records, and exporting results to files for further analysis. A custom-built analytical data store on the back-end allows interactive real-time results, even for very large datasets. Two sets of tools were developed to support two distinct workflows: (1) tools for “undirected data exploration,” to be used early in the modernization process to “preview” the data as well as build up a starting catalog of rules, validations, and issues, and (2) “directed exploration” tools for reviewing the results of applying these rules and validations. In practice, the combination of undirected and directed analyses; real-time query response; and a clean, organized, simplified view of the data has allowed customers to understand their data better.

The ITD and NHDMV projects are ongoing and, with the cleansing, the systems are achieving high levels of accuracy. Both the ITD and NHDMV have expressed great satisfaction with the STAT toolset, which allows them to already see the results of the data-cleansing efforts.