Introduction to the TDQM Research Program

May 1992 TDQM-92-01

Stuart E. Madnick

Richard Y. Wang










Total Data Quality Management (TDQM) Research Program

Room E53-320

Sloan School of Management

Massachusetts Institute of Technology

Cambridge, MA 02139 USA

Tel: 617-253-2656

Fax: 617-253-3321



Acknowledgments: Work reported herein has been supported, in part, by MIT's Total Data Quality Management (TDQM) Research Program, MIT's International Financial Services Research Center (IFSRC), Fujitsu Personal Systems, Inc. and Bull-HN.



Introduction to the TDQM Research Program

Motivation

In recent years most corporations, large and small, have initiated Total Quality Management (TQM) programs with goals that include 100% customer satisfaction and zero product defects. Quality management programs have been a key factor in the success of companies in many industries.

Often TQM programs and other strategic corporate initiatives are not entirely successful, or even fail, because the data used to monitor and support organizational processes are incorrect, incomplete, or otherwise faulty or inappropriate for a given application. Anecdotal evidence and a growing literature point to data being defective at rates of 10% or more in a variety of applications and industrial contexts, including sales-force automation, direct-mail programs, and productivity improvement programs.

MIT's Total Data Quality Management (TDQM) research effort grew out of industry's need for high-quality data. The overall objective of this program is to establish a solid theoretical foundation in this embryonic field and, from this work, to devise practical methods for business and industry to improve data quality. We will develop the tools and other capabilities necessary for data quality management in the technical, economic, and organizational phases of business operations.

Research Agenda

Research Objective

The TDQM project has both long-term and short-term goals. The long-term goal of this research program is to create a theory of data quality based on reference disciplines such as computer science, organizational behavior, statistics, accounting, and total quality management. This theory of data quality, in turn, may serve as a foundation for other research contexts in which the quality of information is an important component. In the short term, the goal is to create a center of excellence among practitioners of data quality techniques and to act as a clearinghouse for effective methods and project experiences.

Research Scope

There are three major components of the TDQM research program: data quality definition, analysis, and improvement (Figure 1). The definition component focuses on defining and measuring data quality. The analysis component identifies and calculates the impact of poor quality data, and the benefits of high quality data, on an organization's effectiveness. Finally, the improvement component involves redesigning business practices and implementing new technologies in order to significantly improve the quality of corporate information. Each of these components is briefly described below, along with an example and an outline of key research directions.



Figure 1: The Three Components of the TDQM Research Program

Definition of Data Quality. Although the notion of "data quality" may seem intuitively obvious, data quality is not well defined in current practice. Our studies have revealed that data quality has a number of dimensions for data users, including accuracy, believability, relevancy, and timeliness. A clear and uniform articulation of data quality metrics is needed. In fact, even a relatively obvious dimension such as accuracy lacks a definition robust enough to suggest how the accuracy of data should be measured. This component of the research addresses issues of data quality definition, measurement, and derivation.

The research issues we are addressing are: (a) identification of the key dimensions of data quality, (b) precise and meaningful definitions of each dimension, (c) methods of measuring each dimension for base data, and (d) a data quality algebra (DQA) for computing the quality of derived data.
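As a hedged illustration of issue (c), one simple way to estimate the accuracy dimension of a base data set is to audit a random sample of records against an independently verified reference. The sampling scheme, function name, and field layout below are our own assumptions for illustration, not the program's prescribed measurement method.

```python
import random

def estimate_accuracy(records, reference, fields, sample_size=200, seed=0):
    """Estimate the accuracy dimension of a base data set by auditing a
    random sample of records against an independently verified reference.

    `records` and `reference` map record ids to dicts of field values;
    `fields` names the attributes to audit.  Returns the fraction of
    sampled field values that agree with the reference.  (Illustrative
    sketch only; the choice of reference source and sample size would
    depend on the application.)
    """
    random.seed(seed)
    ids = random.sample(list(records), min(sample_size, len(records)))
    checked = correct = 0
    for rid in ids:
        for field in fields:
            checked += 1
            if rid in reference and records[rid].get(field) == reference[rid].get(field):
                correct += 1
    return correct / checked if checked else 0.0
```

A sampled estimate of this kind is only a starting point; issues (a), (b), and (d) concern which dimensions to measure, how to define them precisely, and how quality carries over to derived data.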

Analysis of Data Quality Impact on a Business. This component addresses the value chain relationship between high quality data and the successful operation of a business (the flip side being how low quality data negatively affects a business). Our analysis techniques relate data quality to key business parameters, such as sales, customer satisfaction, and profitability. To illustrate the importance of this kind of analysis, we describe the case of a transportation company. In this company, poor data quality and usage caused 77% of missed deliveries, resulting in significant operating costs due to repeated work and rerouted shipments. Even more significant was the finding that the use of poor quality information was the major reason for an estimated loss of market share valued at about $1 billion in sales.

The research issues we are focusing on are: (a) quantification of the business impact of data quality on firms through a collection of case studies, (b) development of Data Quality Value Chain Analysis (DQVCA) techniques to relate data quality to key business parameters, such as sales, customer satisfaction, and profitability, and (c) development of an economic model of the value of quality data.

Improvement of Data Quality. This component addresses various methods for improving data quality. These methods can be grouped into four interrelated categories: (i) business redesign, (ii) data quality motivation, (iii) use of new technologies, and (iv) data interpretation technology. Business redesign attempts to simplify and streamline operations so as to minimize the opportunity for data errors to occur. Data quality motivation deals with employee rewards, benefits, and perceptions, encouraging the appropriate members of the organization to pay more careful attention to the quality of the data they handle. New data capture technologies can significantly improve quality through techniques such as automated entry and direct inter-computer communication. Data interpretation technologies help the user understand the meaning of the data so that it is not used incorrectly. In the transportation company described above, for example, radio-frequency data entry devices for capturing equipment and cargo inventory were mounted on mobile vehicles that scanned up and down the container yards to maintain a real-time inventory. This introduced both a new technology and a business redesign, resulting in more accurate and timely data.

The research issues we are working on are: (a) analyzing direct-entry technologies, such as mobile computing, neural network techniques for handwriting analysis, and portable communicating terminals, (b) studying connectivity among information systems, (c) representing and automatically using knowledge about the semantics of the data, and (d) creating new paradigms for system design that incorporate data quality tags, such as for time and source.
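As a hedged sketch of what issue (d), data quality tags, might look like in a system design, the structure below attaches source and time tags (plus an estimated accuracy) to an individual attribute value. The particular tag set and names are illustrative assumptions on our part, not a design produced by the program.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TaggedValue:
    """An attribute value carried together with data quality tags.

    The tag set chosen here (source, collected_at, accuracy) is only an
    illustrative assumption; which tags a system should carry, and how
    they propagate through processing, is part of the research agenda.
    """
    value: object
    source: str              # originating system, device, or organization
    collected_at: datetime   # when the value was captured
    accuracy: float          # estimated probability the value is correct

# Example: a container weight captured by a (hypothetical) yard-scanner feed.
weight = TaggedValue(value=12400, source="yard-scanner-07",
                     collected_at=datetime(1992, 5, 1, 14, 30),
                     accuracy=0.98)
```

Tags of this kind give downstream users the context they need to judge whether a value is current and trustworthy enough for their application.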

Research Accomplishments

The TDQM team has to date produced more than ten papers on topics such as defining, measuring, analyzing, and improving data quality.

One of these articles involved a survey of data consumers to identify the most important dimensions of high-quality data. The survey showed that, when using data to make key strategic decisions, data users consider the value-added dimensions and the relevancy of the data to be even more important than accuracy (within certain limits, we assume). This is in contrast to the prior expectation that accuracy would be rated as the most important factor.

Another article involved the development of quality measures for reports derived from defective raw data. Quality here refers to a number of data characteristics, including completeness, accuracy, appropriateness, and consistency. Based on this work, a data quality calculus is being developed to assess the quality of such derived data. Given a database query and a measurement of the quality of the raw data, the calculus computes the accuracy of the query result. Provided the quality of the raw data can be measured, the user may then judge the risk involved in using the result. This calculus also enables data quality analysts to study the impact of alternative quality control strategies. Examining these quality measures for different application queries will identify likely candidates for quality enhancement.
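The calculus itself is still under development; the sketch below only illustrates the flavor of such a computation under a strong simplifying assumption that we introduce for the example, namely that each source tuple is correct with a known probability and that errors in different sources are independent.

```python
def derived_accuracy(source_accuracies):
    """Rough estimate of the accuracy of a derived (e.g., joined) tuple,
    assuming each contributing source tuple is correct with the given
    probability and that errors in different sources are independent.
    This is an illustrative simplification, not the calculus itself."""
    result = 1.0
    for accuracy in source_accuracies:
        result *= accuracy
    return result

# A report row that joins a 97%-accurate customer record with a
# 92%-accurate shipment record is correct only about 89% of the time.
print(derived_accuracy([0.97, 0.92]))  # approximately 0.8924
```

Even modest per-source defect rates compound quickly under this model, which is one reason a report assembled from several imperfect sources can be far less reliable than any one of them.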

Specific Activities for the Current Academic Year

The current research agenda includes five primary efforts. First, we are continuing our case studies of existing organizational efforts to enhance or control data quality. Many organizations are implementing explicit data quality programs that include automatic data acquisition, integrated information systems, and process control for data. In order to develop and disseminate management expertise in these areas we must learn from ongoing efforts by looking at objectives, approaches and outcomes for both successes and failures.

Second, we are conducting a data quality literature survey. The survey will include research efforts in accounting, management science, management information systems, and databases. We will examine literature on the full range of organizational, technical and economic aspects of data quality management. A related effort is our development of a framework for data quality management based on the ISO 9000 international quality standard. This standard can be applied to aspects of data quality at many organizational levels, including management, production, marketing and accounting.

Third, we will study how business processes can be designed to improve the quality of information that is acquired, processed, and disseminated throughout an organization or activity. Business process design in this context refers to using information technology to provide the right information to the right people at the right time. The objective of this research is to develop design principles for businesses and to evaluate these designs in terms of information requirements. We will develop formal models for process representation and a means for their analysis.

Fourth, we will develop a process perspective of the dimensions of data quality. We believe that the term data quality, though used in a variety of contexts, has been inadequately conceptualized and defined. To improve and manage data quality, we must define and determine boundaries for the concept of data quality. A process view of data is necessary for several reasons: (1) data quality defects, in general, are difficult to detect by simple inspection of the data product; (2) definitions of data quality dimensions and defects, while useful intuitively, tend to be ambiguous and interdependent; and (3) in line with the cornerstone of TQM philosophy, emphasis should be placed on process management to improve product quality. The process view enables clear determination of specific data needs and the contexts in which data are to be used. We believe that this work will help to link process redesign efforts with data quality management.

Fifth, we are researching how to measure and estimate the quality of derived data. Once the quality of an organization's base data sets has been assessed, those data sets will still be processed through a variety of operations, being combined and filtered to produce reports, forecasts, and other more specialized databases. The research question here is how to determine analytically the quality metrics of these derived data sets, given information about the quality of the original data sets and the operations performed. We focus initially on the relational database operations and their arithmetic and aggregate extensions, such as addition, subtraction, maximum, and count.
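As a hedged illustration of the kind of analytical result we are after for aggregate operations, consider a COUNT over a selection whose predicate attribute is imperfectly recorded. Under the simple error model assumed below (our assumption for the example, not the program's derived-quality metrics), the expected observed count has a closed form.

```python
def expected_count(true_matches, true_nonmatches, p_keep, p_leak):
    """Expected value of a COUNT over a selection when the attribute
    tested by the predicate is imperfectly recorded.  A true match
    survives the predicate with probability p_keep; a true non-match
    leaks in with probability p_leak.  The error model and parameter
    names are illustrative assumptions.
    """
    return true_matches * p_keep + true_nonmatches * p_leak

# 1,000 containers actually in the yard and 9,000 elsewhere; 2% of the
# in-yard records are mis-keyed out and 0.5% of the others mis-keyed in.
print(expected_count(1000, 9000, p_keep=0.98, p_leak=0.005))  # 1025.0
```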

Sponsorship

Research Sponsors

The TDQM program is a joint effort among members of the MIT Information Technology group, industry partners, and related industry-specific research programs at MIT (including the Leaders for Manufacturing program, the International Financial Services Research Center, and the Center for Transportation Studies). Ultimately, TDQM expects to draw sponsors from a wide range of industries, including finance, transportation, manufacturing, and telecommunications. Fujitsu Personal Systems, Inc. and Bull HN Information Systems, Inc. have joined as founding sponsors.

Mr. Daryll Wartluft, Vice President of Bull Information Systems, explained why Bull is sponsoring our work as follows: "The TDQM effort is an important part of our relationship with MIT and ties in well with Bull's worldwide commitment both to our own internal quality and to assisting our customers in better understanding and improving their information quality."

Mr. Lou Panetta, Executive Vice President of Marketing and Sales at Fujitsu Personal Systems, put it this way: "Total data quality management of today's corporations requires 100% accurate data input from the field sales, service, and support organizations. Erroneous data entering the organization compounds itself as decisions are made based on inaccurate information. Accurate and up-to-date information can help companies service their customers better, thereby giving them a competitive edge."

Dr. Robb Wilmot, Chairman of Fujitsu Personal Systems, agrees: "The prototypical chairman's statement in an annual report mentions 100% customer satisfaction by the third paragraph -- if not sooner -- with no explicit recognition in the enterprise that the data with which this goal is to be achieved is typically between 5% and 10% defective. If you analyze the supply chain costs in a large company, it is not uncommon to find that half of the total cost is rework caused by defective data -- with untold competitive costs."

Cooperation with Sponsors

In order to establish a center of excellence on TDQM, we are interested in collecting cases of data quality projects, including completed projects, so that we can document the lessons learned and quantify the business impact of these projects. In addition, we are interested in analyzing how various companies approach their respective data quality projects, in both data quality enhancement and data quality control.

We are interested in pursuing various other joint activities with sponsors on this project. Organizations with an interest in our research agenda are invited to become sponsoring members of the TDQM Project. Project members can: (a) serve as study or test sites for the TDQM research activities, (b) attend special TDQM symposiums at MIT, (c) involve personnel in TDQM research efforts, (d) receive TDQM working papers, (e) contribute in general, through participation and sponsorship, to the development and advancement of TDQM.

TDQM Research Program Membership

In order to pursue meaningful research and maintain close interaction with industry, we are seeking distinguished corporations that have an interest in total data quality management to become sponsoring members of the TDQM Project. For more information, please contact either Professor Stuart Madnick (617/253-6671) or Professor Richard Wang (617/253-0442).

Concluding Remarks

Data are used to support most activities in modern organizations, be they operational, managerial, or strategic in nature. If these data are defective, organizational effectiveness and efficiency can suffer in many ways. Without a systematic and comprehensive way to conceptualize and address the data quality issue, organizations are left to grapple with this problem in an ad hoc and piecemeal manner. The TDQM effort aims to construct a paradigm for data quality management, to serve as a center of excellence in managerial and technology practice, and to develop a rigorous foundation and discipline for data quality that will extend into the future.