Toward Total Data Quality Management (TDQM)

June 1992 TDQM-92-02

Richard Y. Wang

Henry B. Kon










Total Data Quality Management (TDQM) Research Program

Room E53-320

Sloan School of Management

Massachusetts Institute of Technology

Cambridge, MA 02139 USA

617-253-2656

Fax: 617-253-3321



Acknowledgments: Work reported herein has been supported, in part, by MIT's Total Data Quality Management (TDQM) Research Program, MIT's International Financial Services Research Center (IFSRC), Fujitsu Personal Systems, Inc., and Bull-HN.


Towards Total Data Quality Management (TDQM)

1 Social and Managerial Impacts of Data Quality

Organizations in industries such as banking, insurance, retail, consumer marketing, and health care are increasingly integrating their business processes across functional, product, and geographic lines. The integration of these business processes, in turn, accelerates demand for more effective application systems for product development, product delivery, and customer service [Rockart, 1989 #86]. As a result, many applications today require access to corporate functional and product databases. Unfortunately, most databases are not error-free, and some contain a surprisingly large number of errors [Johnson, 1981 #401]. In a recent industry executive report, Computerworld surveyed 500 medium-sized corporations (with annual sales of more than $20 million) and reported that more than 60% of the firms had problems with data quality. The Wall Street Journal also reported that:

Thanks to computers, huge databases brimming with information are at our fingertips, just waiting to be tapped. They can be mined to find sales prospects among existing customers; they can be analyzed to unearth costly corporate habits; they can be manipulated to divine future trends. Just one problem: Those huge databases may be full of junk. ... In a world where people are moving to total quality management, one of the critical areas is data.

In general, inaccurate, out-of-date, or incomplete data can have significant impacts both socially and economically [Laudon, 1986 #359; Liepins, 1989 #537; Liepins, 1990 #509; Wang, 1992 #564; Zarkovich, 1966 #505]. Below we illustrate the social ramifications and discuss three managerial perspectives of data quality in three key areas: customer service, managerial support, and productivity.

1.1 Social Impacts: Privacy and Security

Errors in credit reporting are one of the most striking examples of the social consequences of poor quality data. The credit industry not only collects financial data on individuals, but also compiles employment records. In addition to the denial of credit, an error on a personal report can cause employment problems. One congressional witness testified that "he lost his job when he was reported as having a criminal record ... a record that really belonged to a man with a similar name." In light of such testimonies, Congress is pushing for legislation to require that the credit industry retain accurate data no matter where the data originated.

More generally, data on a variety of potentially sensitive topics is generated and used about individuals and organizations, including data on medical, financial, employment, consumption, and legal activities. The organizations creating and using this data may include government agencies, insurance companies, banks, marketing and financial organizations, and countless others. The collectors, users, and subjects of this data may all have an interest in its quality.

1.2 Managerial Impacts: Customer Service, Managerial Support, Productivity

In our effort to assess the managerial and economic impacts of data quality in corporate environments, we interviewed employees from many organizations, including three that we will call Worldwide Shipping, Bullish Securities, and Comfortable Flight. Three areas in which data quality affects corporate profits for these organizations are illustrated below.

Customer Service. When higher data quality results in better customer service, there can be a direct positive impact on the bottom line. For example, Worldwide Shipping is one of the largest providers of international ocean freight services. Under the old way of doing business, collection methods for data on cargo and equipment inventories were highly labor-intensive and error-prone. Inaccuracies in the data were commonplace, often causing shipments to be sent to the wrong destination and sometimes to be lost altogether, resulting in unhappy customers, investigative efforts to locate lost goods, and re-routing.

In the late 1980s, among other quality control programs, Worldwide began installing radio frequency-based tracking mechanisms in their shipping ports to keep track of their containers, chassis, and trucks. The tracking mechanisms are much like bar-code scanners in that each container (i.e., the "box" that sits on a truck's trailer) can be uniquely and reliably identified as it moves through checkpoints, with real-time transaction database updates.

The end result is that Worldwide can now provide up-to-the-second, exact data to their customers about the location, and thus the delivery schedules, of their goods. Many customers consider this a critical factor in choosing Worldwide as their shipping vendor.

Managerial Support. Because it is strategic to an organization's success, managerial decision making is another area where data quality can affect the bottom line. With the proliferation of Management Support Systems, more data will originate from databases both within and across organizational boundaries.

Bullish Securities, a major New York investment bank, illustrates the value of data in such systems. Recently, the bank implemented a risk management system to gather information documenting its securities positions. With complete and timely data, the system serves as a tool which executives use to monitor the firm's exposure to various market risks. However, when critical data was mismanaged, the system left the bank vulnerable to major losses. For example, during a recent incident, data availability and timeliness problems caused the risk management system to fail to alert management of a large interest rate exposure. When interest rates changed dramatically, Bullish was caught unaware and absorbed a net loss totaling more than $250 million.

Productivity. Productivity suffers when low quality data causes lost revenues, unproductive re-work, downtime, redundant data entry, and the cost of data inspection; improving data quality recovers these losses. Comfortable Flight, a large U.S. airline company, inadvertently corrupted its database of passenger reservations while installing some new software that turned out to have bugs. Programmers fixed the software, but they didn't correct the false reservations it had made. As a result, for several months planes were taking off partly empty because of phantom bookings, impacting the bottom line significantly. More attention to the handling of the database could have prevented such a problem.

The above examples illustrate that data quality can affect both our personal and organizational lives. We now move on to describe the analogy between product manufacturing and data manufacturing.

2 From Product Manufacturing to Data Manufacturing

Organizations have learned that in order to deliver a quality product or service, they need to implement quality programs. Many corporations have devoted significant time and energy to a variety of quality initiatives such as inter-functional teams, reliability engineering, and statistical quality control. Much work on quality management for corporate productivity has been conducted in the field of manufacturing. Few organizations, however, have the processes, skills, or systems in place for managing the quality of their data.

It is interesting to note that fundamental analogies exist between quality issues in a manufacturing environment and those in an information systems environment. Manufacturing can be viewed as a processing system that acts on input material to produce output material. Analogously, an information system can be viewed as a processing system acting on input data to produce output data. Figure 1 illustrates this analogy.

Figure 1: The analogy between product and data manufacturing

From this, we can apply principles of quality management established for product manufacturing to data manufacturing. For example, product manufacturing concepts such as customer satisfaction, conformance to specification, and zero-defect products can each be applied to data manufacturing.

In general, manufacturing-related activities consist of two parts. One is the design and implementation of the manufacturing line, including engineering analysis, engineering specifications, and the implementation and deployment of hardware and software. This is typically performed by a manufacturing design and engineering group. The second is the production and distribution of the product. Typically, the people responsible for production inherit the line from, or work with, the designers and engineers, and put the line to use. Product quality is a function of both of these: the manufacturing machinery as well as the methods and skills applied at production time.

These two concepts have exact analogs in the data manufacturing domain, discussed next.

2.1 The Data Systems Life Cycle

Before we begin to think about the improvement of data quality, we need to develop a basic conceptual framework. We use our manufacturing analogy to consider two domains of activity. The first is the design and implementation of the information system; the second is the production and distribution of the data. We refer to these as the Data Systems Life Cycle and the Data Product Value Chain.

The Data Systems Life Cycle (Figure 2) focuses on the activities that go into the design, development, testing, and deployment of the information system to the user community. Because the data that results from an information system is a function of all of these activities, each one must be considered a potential target area for data quality improvement. Consider the following three examples of quality issues related to the Data Systems Life Cycle: (1) the data designed into the system is not the data required by the users (requirements analysis), (2) the testing of software and database functionality is incomplete and corrupted data results (software QA), and (3) the user community is inadequately trained in the input and retrieval of data from the system (training).

Figure 2: The Data Systems Life Cycle (The Data Manufacturing System)

2.2 The Data Product Value Chain

The Data Product Value Chain (Figure 3) highlights different aspects of data production. This value chain represents a three-way division of labor involving the handling of data.

Figure 3: The Data Product Value Chain (Data Production)

Data originators generate data having value to others; for example, supermarkets collect and resell point-of-sale data. Data distributors purchase data from the originators and resell it to consuming organizations; for example, Information Resources, Inc. (IRI) purchases point-of-sale data from supermarkets, analyzes and processes it, and then resells it to consumer marketing firms such as General Mills. Finally, consuming organizations are those which acquire data generated externally. For example, banks buy credit data from distributors such as TRW and Dun & Bradstreet.

With the exception of distributors, most companies do not belong solely to one group or another. In fact, most organizations are vertically integrated with respect to data flows, with different departments each performing different functions. For example, a marketing organization may consume data (e.g., on customer buying) generated by the finance organization. As a result of vertical integration, most IS organizations have responsibility for each function: data origination, internal distribution, and consumption of both internally and externally generated data.

In summary, Data Manufacturing consists of two components: the data manufacturing system and the data production process. The data manufacturing system corresponds to the Data Systems Life Cycle and the data production process corresponds to the Data Product Value Chain.

In the next section we introduce a framework for Total Data Quality Management (TDQM). This framework outlines the scope of issues and concepts related to (a) quality data as a product and (b) data quality improvement as an organizational process.

3 A Framework for Total Data Quality Management

Quality control and management have become competitive necessities for most businesses today and there is a rich experience on the topic of quality that dates back several decades. Approaches range from technical, such as statistical process control, to managerial, such as quality circles. An analogous experience base is needed for data quality. But how do we reduce the myriad of complex issues around data quality into manageable concepts and tractable solutions? We introduce a TDQM framework, shown in Figure 4, to deal with these issues.

Figure 4: The TDQM Schematic - a perspective of Data Quality

This framework includes two primary inter-related dimensions:

• three components of the continuous data quality enhancement process: Measurement, Analysis, and Improvement, and

• three perspectives on which to base solutions: Economics, Technology, and Organizations.

It suggests that at the highest level, data quality efforts can be motivated by needs for quality business operations. Different steps of the business value chain may have different types of data requirements. The next two sections describe these dimensions in more detail.

3.1 Continuous Measurement, Analysis, and Improvement

The three components of the quality enhancement process are: measurement and definition of data quality, analysis of the economic impact on the business based on data quality, and improvement of data quality through both technical and managerial solutions. They are diagrammed in Figure 5, and are discussed in more detail below.

Figure 5: The MAI perspective of Data Quality

Measurement of Data Quality. Although the notion of "data quality" sounds intuitive, most notions of data quality in current practice are not well defined. Most people, when asked about data quality, respond with such terms as Accuracy and Timeliness, but when asked for definitions of these terms, they come to realize the subtlety of the concepts involved. A consensus on a clear definition of data quality is needed.

Our studies have revealed that data consumers use multiple dimensions when they think about data quality, such as Accuracy, Believability, Relevance, Timeliness, and Completeness. In fact, even the most obvious dimension, Accuracy, does not have a sufficiently robust definition. Thus any attempt to create quality data is limited by our lack of understanding of its basic ingredients. The measurement component addresses issues of data quality definition, measurement, and derivations.
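
To make the measurement component concrete, the sketch below (in Python) computes two simple dimension scores, Completeness and Timeliness, over a batch of records. The field names, thresholds, and scoring rules are illustrative assumptions, not definitions drawn from our studies.

    from datetime import datetime, timedelta

    def completeness(records, required_fields):
        """Fraction of required field values that are actually populated."""
        total = len(records) * len(required_fields)
        filled = sum(1 for r in records for f in required_fields
                     if r.get(f) not in (None, ""))
        return filled / total if total else 1.0

    def timeliness(records, timestamp_field, max_age):
        """Fraction of records updated within an acceptable age window."""
        now = datetime.now()
        fresh = sum(1 for r in records if now - r[timestamp_field] <= max_age)
        return fresh / len(records) if records else 1.0

    # Hypothetical customer records; field names are illustrative only.
    records = [
        {"name": "Acme Corp.", "address": "1 Main St",
         "updated": datetime.now() - timedelta(days=2)},
        {"name": "Beta Inc.", "address": "",
         "updated": datetime.now() - timedelta(days=400)},
    ]
    print("Completeness:", completeness(records, ["name", "address"]))
    print("Timeliness:  ", timeliness(records, "updated", max_age=timedelta(days=180)))

Scores of this kind are only as good as the dimension definitions behind them, which is precisely why the measurement component treats definition and measurement together.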

Analysis of Data Quality Impact on Business. This component addresses the value chain relationship between high quality data and the successful operation of the business and, conversely, how low quality data harms the business.

For example, in a transportation company it was determined that poor data quality and usage caused 77% of delivery misses, which in turn were the major reason for an estimated loss of about $1 billion in sales based on lost market share. Analysis techniques relate data quality to key business parameters such as sales, customer satisfaction, and profitability.
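
A back-of-the-envelope model along the following lines can relate an error rate to an estimated business loss. The sketch is illustrative; the parameter values are hypothetical and are not the transportation company's actual figures.

    def estimated_annual_loss(shipments_per_year, miss_rate, share_due_to_data, loss_per_miss):
        """Rough estimate of sales lost to data-related delivery misses."""
        data_related_misses = shipments_per_year * miss_rate * share_due_to_data
        return data_related_misses * loss_per_miss

    # Hypothetical inputs for illustration only.
    loss = estimated_annual_loss(
        shipments_per_year=2_000_000,
        miss_rate=0.05,            # 5% of deliveries missed
        share_due_to_data=0.77,    # share of misses attributable to data problems
        loss_per_miss=12_000,      # revenue at risk per missed delivery
    )
    print(f"Estimated annual loss: ${loss:,.0f}")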

Improvement of Data Quality. This component addresses various methods for improving data quality. These methods can be grouped into three interrelated categories:

• business redesign,

• data quality motivation, and

• the use of new technologies.

Business redesign attempts to simplify and streamline the operation to minimize the opportunity for data errors to occur. Data quality motivation focuses on how rewards, benefits, and perceptions may encourage improved data handling by members of the organization. New technologies focus on improving procedures for data capture and processing through techniques such as data entry in remote or mobile situations, direct inter-computer communications, and computer-assisted quality control.

For example, in the transportation company mentioned above, radio frequency-based data entry devices for capturing equipment and cargo inventory data were introduced on mobile vehicles that scanned up and down the container yards. This introduced both a new technology and a business redesign, resulting in more accurate and timely data.

In the next section we describe the second dimension, which focuses on perspectives rather than activities related to data quality enhancement.

3.2 Economic, Technical, and Organizational Perspectives

The second dimension focuses less on how to implement a data quality enhancement process and more on defining perspectives relevant to its analysis. Next we describe how each of these perspectives (economic, technical, and organizational) is related to data quality.

The economic perspective deals with issues of valuation: how much would it cost to "achieve" quality data, how much is it worth, and how should resources for data quality be allocated? Under constrained resources, clearly not all data quality problems or opportunities can be addressed simultaneously.

The technology perspective deals with defining and measuring data quality as well as systems and methods for the enhancement of data quality. This is a fairly broad component in that there are applications for technologies at virtually every point on the Data Systems Life Cycle and the Data Product Value Chain.

For example, tools that support data quality requirements analysis, statistical process control for data quality, and advanced data acquisition devices can be used in various capacities. These tools will facilitate the Data Quality Administration function.

The organizational perspective deals with the implications of data quality for the social fabric of the firm. It considers issues related to the motivation and responses of both individuals and groups toward organizational change and the proper handling of data. This may include developing organizational commitment to data quality, developing the technology infrastructure, modifying incentives for data handling, and the overall institutionalization of methods for enhancing data quality.

We have discussed the analogy between product manufacturing and data manufacturing, outlined relevant perspectives (economics, technology, and organization) and components (measurement, analysis, and improvement) of data quality enhancement that we believe are fundamental to TDQM.

The concepts we have developed so far in this chapter do not address the improvement of data quality in an organizational setting. After going through the exercise of defining relevant aspects of data quality, likely next questions include: Where do we go from here? What should change in my organization today? Who will carry out these changes? Below we provide a managerial perspective to address these questions.

4 Managing Data Quality

Implementing a data quality improvement program requires significant organizational change as well as the adoption of new management techniques and technologies. Following Tribus, an authority on the implementation of Deming's quality management principles, we group the required organizational changes into five categories:

(1). Clearly articulate a data quality vision in business terms.

(2). Establish central responsibility for data quality within IS groups and functional groups interacting with IS.

(3). Educate project and systems managers.

(4). Teach new data quality skills to the entire IS organization.

(5). Institutionalize continuous data quality improvement.

Each of these is discussed individually in the next five sub-sections.

4.1 Clearly Articulate a Data Quality Vision

In order to improve quality, one must first set quality standards. Such standards should be expressed by users in business terms. For example, Mayflower Bank, a large U.S. bank, states in a Data Administration Task Force Report that "customer service and decision making will be unconstrained by the availability, accessibility, or accuracy of data held in automated form on any strategic platform."

Since leadership is crucial in the early stages of any quality improvement program, the data quality vision must be clearly identified with top level management in IS. The chief information officer (CIO) must make data quality a priority for the entire organization.

4.2 Establish Central Responsibility for Data Quality Within IS

Once a vision has been articulated, the organization needs to establish responsibility for data quality. Ultimately, this responsibility rests with the CIO, but another person, reporting directly to the CIO, needs to be given day-to-day responsibility for data quality. Some organizations proclaim that quality is "everybody's responsibility," but in practice this often leads to confusion and inaction. For these reasons, a Data Administrator (DA) must be given explicit responsibility and authority for assuring data quality.

Whereas the Database Administrator (DBA) tends to be systems-oriented, the DA should be more managerial and analytical. The DA is responsible for making sure that data resources are managed to meet business needs. In larger organizations, the DA should head a data administration staff which serves as a center of expertise on the application of quality management within the IS organization.

In most organizations today, data administration is restricted to a fairly low level function concerned primarily with the development of data models. In the future, organizations will need to enhance the power and prestige of data administration in order to provide a credible and effective center of responsibility for data quality management.

We broadly consider three aspects of the data quality improvement process: breakthroughs, iterative improvement, and maintenance. These represent the complexity of the data quality-related innovation as well as the level of effort and change entailed by the improvement effort. Figure 6 indicates that the data administrator has responsibilities spread across breakthroughs and iterative improvements.

Figure 6: Allocation of Responsibility for Data Quality Improvements

In the area of breakthroughs, the data administrator coordinates work with the CIO and senior level management to identify systems redesign projects and new technologies which could have a large impact on the organization's management of data quality. In terms of less radical, iterative improvements, the data administrator serves as a central source of information and guidance which project and systems managers can access regarding data quality matters. Maintenance involves more traditional activities associated with data applications, where attention to data quality would be applied to the installed base of systems.

Our case studies illustrate the variety of approaches organizations are taking to assigning responsibility for data quality. For example, Mayflower Bank outlined a breakthrough technological initiative centered around the creation of a data administrator position. The data administrator will be responsible for the development and installation of a data delivery utility architecture for corporate data. As the corporation's official source of data, this system's primary function is to serve as a regulated, central repository for data storage and standards enforcement. Updating and accessing stored data will occur via a set of coordinated technologies designed to ensure data quality.

In a contrasting example, Materials Manufacturing, a multi-national corporation, has chosen not to centralize data quality administration, and is instead pushing responsibility back to the sources of the data. This is in line with their corporate goals of ensuring quality at the source and minimizing inspection of products and processes, as well as their corporate requirements for distributed systems.

Thus we see centralization and decentralization as two fundamentally different approaches to data quality improvement. Each may carry its own rewards and pitfalls, and must be considered in the context of a given data quality improvement effort. A hybrid approach may be the most reasonable in the short term given the immaturity of data warehouse technology as well as rapid change in the areas of client/server and distributed computing.

4.3 Educate Project and Systems Managers

Once central responsibility for data quality management has been established, the stage is set to begin educating those people in the organization who will take charge of improvements in data quality. Within IS, these are the project and systems managers. These managers must learn the relationship between quality and productivity so that they consider investing the time and resources appropriate to improving data quality. Beyond this, they must learn specific methods of data quality improvement that are relevant to their projects or systems. For project development managers, this means learning to view data quality as a fundamental objective. For systems managers, it means learning to apply quality principles to monitor and improve data-handling systems.

4.4 Teach New Data Quality Skills to the Entire IS Organization

Responsibility for the successful implementation and maintenance of quality programs belongs to the entire organization. Hence, the entire IS organization may need new skills in order to put new programs into place.

In general, data quality responsibilities will fall into one or more of the following three categories: inspection and data entry, process control, and systems design. Below we discuss the three categories of data quality responsibility and the relevant skills required for each.

Inspection and Data Entry. Inspection and data entry involve responsibility for the accuracy of data as it is transcribed into a system or at interim points during processing of the data by the system. Current practice for data inspection remains mostly manual, although interactive and forms-based user interfaces can be used to filter out or detect errors. Mayflower Bank has established corporate policies on data entry urging that (see the sketch following this list):

• Data should be entered into machine form only once (e.g., not copied from computer to paper to computer),

• Data should be obtained as close as possible to the point of data origin, and

• Newly entered data should be subject to automated edits, consistency checks, and audits as appropriate.
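
A minimal sketch of the third policy, automated edits and consistency checks applied to newly entered data, follows. The field names and validation rules are hypothetical and are not Mayflower's actual edits.

    import re

    def validate_new_account(record):
        """Return a list of data quality problems found in a newly entered record."""
        problems = []
        if not record.get("customer_name"):
            problems.append("customer_name is missing")
        if not re.fullmatch(r"\d{5}(-\d{4})?", record.get("zip", "")):
            problems.append("zip is not a valid U.S. ZIP code")
        if record.get("open_date", "") > record.get("last_activity_date", ""):
            problems.append("open_date is later than last_activity_date")
        return problems

    # Reject, or route for correction, any record that fails the edits.
    print(validate_new_account({"customer_name": "J. Smith", "zip": "0213",
                                "open_date": "1992-06-01",
                                "last_activity_date": "1992-05-15"}))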

Process Control. Process control involves maintaining and monitoring the performance of systems with respect to data quality management. In addition to statistical quality control, the training required here involves the use of auditability tools for tracking down the source of data quality problems. In one data quality survey, over 50% of respondents expressed difficulty in tracking down the sources of data quality problems. In addition, people with process control responsibilities frequently need training in procedures for the uploading and downloading of data. Mayflower determined that any uploading of data to the mainframe should require the same editing and consistency checks required of newly entered data.
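
Statistical quality control can be applied to data itself, for example by charting the proportion of records found defective in periodic samples and flagging periods whose error rate falls outside control limits. The sketch below illustrates a simple p-chart; the sample sizes and error counts are assumptions for illustration only.

    import math

    def p_chart_limits(p_bar, sample_size):
        """Upper and lower 3-sigma control limits for a proportion-defective chart."""
        sigma = math.sqrt(p_bar * (1 - p_bar) / sample_size)
        return max(0.0, p_bar - 3 * sigma), min(1.0, p_bar + 3 * sigma)

    # Hypothetical daily samples: (records inspected, records with at least one error).
    daily_samples = [(200, 6), (200, 4), (200, 9), (200, 21), (200, 5)]

    p_bar = sum(errs for _, errs in daily_samples) / sum(n for n, _ in daily_samples)
    lcl, ucl = p_chart_limits(p_bar, 200)
    for day, (n, errs) in enumerate(daily_samples, start=1):
        rate = errs / n
        status = "OUT OF CONTROL" if not (lcl <= rate <= ucl) else "in control"
        print(f"Day {day}: error rate {rate:.3f} ({status})")

An out-of-control signal of this kind is the trigger for the auditability tools mentioned above: it tells the process owner when to trace a data quality problem back to its source.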

Systems Design. Finally, systems design involves building new systems or upgrading existing applications with data quality management as a primary design goal. In this area there are a host of tools and techniques which professional IS developers should learn in order to design systems which are compatible with data quality goals (e.g., CASE tools, data modeling, intelligent user interface design, data warehouses, and auditing tools).

For example, with respect to systems design, Mayflower is in the process of developing and installing a data warehouse. Achieving this will require corporate IS to define which data is needed from the divisions, how often to upload it, and where it should reside. In this manner, the data warehouse addresses Interpretability, Availability, and Timeliness as well as Accuracy.

(Note that Inspection and Data Entry would fall under the Data Product Value Chain; Systems Design would fall under the Data Systems Life Cycle; and Process Control may fall under both.)

4.5 Institutionalize Continuous Data Quality Improvement

Once the entire organization has received the necessary training, and data quality improvement plans have been put into action, it is necessary for top management to ensure that the data quality improvement process becomes institutionalized. For example, regular meetings, presentations, and reporting structures should be established to track the organization's progress in meeting data quality goals. Additionally, data quality improvement projects need to become part of the budgetary process.

4.6 Operationalizing Data Quality Management

In order to define continuous improvement projects, organizations should focus on critical success factors in order to identify operational objectives critical to the successful management of data quality. Based on interviews and surveys, five critical success factors have been identified: (1) certify existing corporate data, (2) standardize data definitions, (3) certify external data sources, (4) control internal data, and (5) provide data auditability.

Figure 7 illustrates the systems and data sources these five critical success factors impact. Certifying existing corporate data implies providing a guarantee that the corporate data, depicted in the center, satisfies the quality requirements of existing applications. Standardizing data definitions ensures that all data flows (indicated by arrows) among systems adhering to the standard can be implemented in a straightforward manner. Certifying external data sources involves ensuring that the sources depicted in the outer ring have acceptably low error rates. Similarly, controlling internal data implies certifying all of the applications depicted in the inner circle, as well as their interfaces with the corporate data. Finally, providing data auditability implies that when data quality problems are detected in the corporate data, they can be traced to the source, whether it be internal or external.
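
One way to provide data auditability is to carry source and timestamp metadata with each value so that a problem detected in the corporate data can be traced back through internal and external sources. The sketch below shows one possible record layout, offered as an assumption rather than a prescribed design.

    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List

    @dataclass
    class AuditedValue:
        """A data value carrying the lineage needed to trace quality problems to their source."""
        value: object
        source: str                      # e.g., "external:credit-bureau" or "internal:order-entry"
        recorded_at: datetime
        derived_from: List["AuditedValue"] = field(default_factory=list)

    def trace_sources(v: AuditedValue) -> List[str]:
        """Walk the lineage back to the originating sources."""
        if not v.derived_from:
            return [v.source]
        return [s for parent in v.derived_from for s in trace_sources(parent)]

    raw = AuditedValue(100_000, "external:credit-bureau", datetime(1992, 6, 1))
    adjusted = AuditedValue(95_000, "internal:risk-system", datetime(1992, 6, 2),
                            derived_from=[raw])
    print(trace_sources(adjusted))   # -> ['external:credit-bureau']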

Figure 7: External and Internal Sources for Data Quality Management

5 Concluding Remarks

In this chapter, we have discussed several issues. First we introduced the concept of data quality, providing broad definitions of the term and discussing its impacts and the motivation for addressing it.

Next we presented a fundamental analogy between product manufacturing and data manufacturing that leads to the Data Systems Life Cycle and the Data Product Value Chain as a means of describing the range of activity involved in data production. We provided a framework for Total Data Quality Management describing activities related to continuous data quality measurement, analysis, and improvement from economic, technical, and organizational perspectives. Finally, we discussed the improvement of data quality in an organizational setting.

Following the analogy between manufacturing and information systems, we have argued that there is a significant amount of economic benefit to be gained if data quality can be managed effectively. The issues involved range from technical to managerial and may depend on (a) the nature of the data itself, (b) the application of the data, (c) the information systems involved, and (d) the related organizational and economic superstructures around the information systems.

We are actively conducting research along the following directions: How do we define and measure data quality? What kinds of information technologies can be developed to certify data and to provide data auditability? What kinds of operations management techniques can be applied to develop a foundation for data quality management? Should data originators, data distributors, and data consumers manage data quality problems differently, or is there a single underlying issue? What is the relationship between data quality and corresponding data characteristics? These inquiries will help develop a body of knowledge for data quality management -- a critical issue for the decade to come.