A Framework for Analysis of Data Quality Research

September 1994 TDQM-94-12

Richard Wang

Veda Storey*

Chris Firth**

Richard Wang

Total Data Quality Management (TDQM) Research Program

Room E53-320, Sloan School of Management

Massachusetts Institute of Technology

30 Wadsworth Street, Cambridge, MA 02139

Tel: 617-253-2656, Fax: 617-253-3321

* Veda Storey

Computers and Information Systems

William E. Simon Graduate School of Business Administration

Dewey Hall 238, Rochester, NY 14627

** Chris Firth

Citibank, N.A.

Robina House, 1 Shenton Way, Basement Unit 3

Singapore 0106

© 1994 Richard Wang, Veda Storey, and Chris Firth

Acknowledgments: Work reported herein has been supported, in part, by MIT's Total Data Quality Management (TDQM) Research Program, University of Rochester's Simon School, MIT's International Financial Services Research Center (IFSRC), Fujitsu Personal Systems, Inc., and Bull-HN. Thanks are due to Nancy Chin and Terry Wong for their assistance in preparing this manuscript.

A Framework for Analysis of Data Quality Research

1. Introduction

For organizations to be best served by their information systems, a high degree of data quality is needed. The need to ensure data quality in computer systems has been addressed by both researchers and practitioners for some time. In order to further advance research on data quality, however, one must first understand the work that has been conducted to date and identify specific topics that require further investigation.

The objective of this paper is to develop a framework that can be used to analyze existing research on data quality, identify research issues, and provide directions for further research. In selecting the appropriate body of research to examine, two primary criteria were used:

(1) The researchers, themselves, specifically recognize a data quality problem, which they attempt to address in their work; that is, the research is motivated by a data quality issue.

(2) The researchers address a problem that, although not specifically described as a data quality issue (e.g., user satisfaction with information systems), comprises components related to data quality management. Many of these authors' papers are referenced by research that falls into the first category.

We deliberately exclude research efforts that come from well-established research areas (which may have hundreds of research contributions) and that are not focused on issues directly related to data quality. Another reason for the exclusion is that many of these research areas have already been surveyed. For example, concurrency control has been surveyed by Bernstein & Goodman [1981], database schema integration by Batini, Lenzerini, & Navathe [1986], logical database design by Teorey, Yang, & Fry [1986], and temporal databases by Tansel et al. [1993]. Thus, this paper strives to serve as a focal point for understanding the state of the art in data quality research, and to bridge the gap between research that directly addresses data quality issues and research that is primarily focused on another subject area.

This paper is organized as follows. Section 2 presents a framework for Total Data Quality Management. The elements of this framework are presented in Section 3. Section 4 analyzes existing research and identifies unresolved research issues for each element of the framework. Section 5 summarizes and concludes the paper.

2. A Framework for Data Quality Analysis

2.1 Data Quality Practices

To obtain an overall understanding of the issues involved in managing data quality, it is useful to understand current data quality practices. A pragmatic approach to data quality management has been developed that is based on AT&T's process management and improvement guidelines [AT&T, 1988] . The approach involves the following seven steps: (1) establishing a process owner, (2) describing the current process and identifying customer requirements, (3) establishing a system of measurement, (4) establishing statistical control, (5) identifying improvement opportunities, (6) selecting improvement opportunities to pursue and setting objectives for doing so, and (7) making changes and sustaining gains. This approach has been successfully applied both within and outside AT&T [Huh et al., 1990; Pautke & Redman, 1990; Redman, 1992] .

Another approach to data quality management, taken in many organizations, entails two phases [Hansen, 1990; McGee & Wang, 1993; Wang & Kon, 1993]. In the first phase, the data quality proponent initiates a data quality project by identifying an area in which the effectiveness of the organization is critically impacted by poor data quality. Successfully completing this phase enables the proponent to gain experience and establish credibility in the organization. In the second phase, the proponent strives to become the leader for data quality management in the organization.

Other approaches have also been developed by various firms. Although the implementation details may vary across approaches and organizational settings, most practitioners have benefited from the accumulated body of knowledge on Total Quality Management (TQM) [Crosby, 1984; Deming, 1986; Juran, 1992; Taguchi, 1979]. Managing data quality in this manner has been referred to as Total Data Quality Management (TDQM) [Madnick & Wang, 1992].

2.2 From Product Manufacturing to Data Manufacturing

The major reason that data quality practitioners have benefited from the TQM literature is that an analogy exists between quality issues in a manufacturing environment and those in an information systems environment. Information systems have been compared to production systems in which data are the raw materials and information is the output [Cooper, 1983; Emery, 1969]. This analogy has also been suggested in discussions of data assembly [Huh et al., 1990], managing corporate data [Hansen & Wang, 1990], information as inventory [Ronen & Spiegler, 1991], information systems quality [Delen & Rijsenbrij, 1992], and information manufacturing [Arnold, 1992].

As shown in Figure 1, a product manufacturing system acts on input raw material to produce output material, or physical products. Similarly, an information system can be viewed as a data manufacturing system acting on input raw data (e.g., a single number, a record, a file, a spreadsheet, or a report) to produce output data, or data products (e.g., a sorted file or a corrected mailing list). This data product, in turn, can be treated as raw data in another data manufacturing system. Use of the term data manufacturing encourages researchers and practitioners alike to seek out cross-disciplinary analogies that can facilitate the transfer of knowledge from the field of product quality to the field of data quality. Use of the term data product emphasizes the fact that the data output has value that is transferred to the customer.

Figure 1: An analogy between physical products and data products

2.3 Total Data Quality Management (TDQM)

Since the analogy between physical product manufacturing and data product manufacturing can easily be seen, the literature in the physical product manufacturing area was reviewed to determine its applicability to data product manufacturing. A careful examination of the literature found the International Organization for Standardization's ISO 9000 to be the most appropriate guideline for the purposes of this paper [Firth, 1993] . The objectives of the ISO 9000 are:

(1) to clarify the distinctions and interrelationships among the principal quality concepts, and

(2) to provide guidelines for the selection and use of a series of International Standards on quality systems that can be used for internal quality management purposes (ISO 9004) and for external quality assurance purposes (ISO 9001, ISO 9002 and ISO 9003).

For convenience, the term ISO 9000 is used hereafter to refer to the 9000 series (ISO 9000 to ISO 9004 inclusive). The main strength of the ISO approach is that it is a set of well-established standards and guidelines that has been widely adopted by the international community. It provides guidance to all organizations for quality management purposes, with a focus on the technical, administrative, and human factors affecting the quality of products or services at all stages of the quality loop, from detection of need to customer satisfaction. An emphasis is placed on the satisfaction of the customer's needs, the establishment of functional responsibilities, and the importance of assessing (as far as possible) the potential risks and benefits. All of these aspects are considered in establishing and maintaining an effective quality system.

Terms and definitions used in ISO 9000 are described in ISO 8402. By rephrasing the five key terms given for product quality in ISO 8402, we define the following terms, which are needed to develop the TDQM framework:

A data quality policy refers to the overall intention and direction of an organization with respect to issues related to the quality of data products. This policy is formally expressed by top management.

Data quality management is the management function that determines and implements the data quality policy.

A data quality system encompasses the organizational structure, responsibilities, procedures, processes, and resources for implementing data quality management.

Data quality control is the set of operational techniques and activities that are used to attain the quality required for a data product.

Data quality assurance includes all those planned and systematic actions necessary to provide adequate confidence that a data product will satisfy a given set of quality requirements.

By adapting ISO 9000, a framework for Total Data Quality Management (TDQM) is developed that consists of seven elements: management responsibilities, operation and assurance costs, research and development, production, distribution, personnel management, and legal function, as shown in Figure 2.

Element and Description

1. Management Responsibilities
   Development of a corporate data quality policy
   Establishment of a data quality system

2. Operation and Assurance Costs
   Operating costs include prevention, appraisal, and failure costs
   Assurance costs relate to the demonstration and proof of quality as required by customers and management

3. Research and Development
   Definition of the dimensions of data quality and measurement of their values
   Analysis and design of the quality aspects of data products
   Design of data manufacturing systems that incorporate data quality aspects

4. Production
   Quality requirements in the procurement of raw data, components, and assemblies needed for the production of data products
   Quality verification of raw data, work-in-progress, and final data products
   Identification of non-conforming data items and specification of corrective actions

5. Distribution
   Storage, identification, packaging, installation, delivery, and after-sales servicing of data products
   Quality documentation and records for data products

6. Personnel Management
   Employee awareness of issues related to data quality
   Motivation of employees to produce high-quality data products
   Measurement of employees' data quality achievement

7. Legal Function
   Data product safety and liability
Figure 2: A Framework for Total Data Quality Management (TDQM)

3. Elements of the TDQM Framework

This section describes each of the elements in the Total Data Quality Management framework.

3.1 Management Responsibilities

Top management must be responsible for developing a corporate data quality policy. This policy should be implemented and maintained consistently with other policies. Management should also identify critical data quality requirements and establish a data quality system that applies to all phases of the production of data products.

3.2 Operation and Assurance Costs

Unlike most raw materials, data are not consumed when processed and, therefore, may be reused repeatedly. Although the cost of data "waste" may be zero, the cost of using inaccurate data certainly may be large [Ballou et al., 1993; Paradice & Fuerst, 1991]. The impact of data product quality can be highly significant, particularly in the long term. It is, therefore, important that the costs of a data quality system be regularly reported to and monitored by management, and related to other cost measures. These costs can broadly be divided into operating costs and assurance costs. Operating costs include prevention, appraisal, and failure costs. Assurance costs relate to the demonstration and proof of quality required by customers and management.
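
To make the cost categories concrete, the following minimal sketch tallies operating costs (prevention, appraisal, failure) and assurance costs for periodic reporting to management. The category names and figures are our own illustrative assumptions, not taken from any cited cost model.

    # Illustrative only: category names and figures are assumptions, not from a cited model.
    operating_costs = {
        "prevention": 12000.0,   # e.g., training data-entry staff, edit checks at capture
        "appraisal": 8000.0,     # e.g., sampling and auditing stored records
        "failure": 25000.0,      # e.g., rework, remailing to corrected addresses
    }
    assurance_costs = {
        "customer_demonstration": 3000.0,  # proof of quality required by customers
        "management_reporting": 1500.0,    # proof of quality required by management
    }

    def total_cost_of_data_quality(operating, assurance):
        """Sum operating and assurance costs for periodic reporting to management."""
        return sum(operating.values()) + sum(assurance.values())

    print(total_cost_of_data_quality(operating_costs, assurance_costs))  # 49500.0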

3.3 Research and Development

The research and development department should work with the marketing department in establishing quality requirements for data products. Together, they should determine the need for a data product, define the market demand, and determine customer requirements regarding the quality of a data product. A customer's data quality requirements include various dimensions such as accuracy, timeliness, and completeness.

The specification and design function translates the quality requirements of data products into technical specifications for the raw data, data manufacturing process, and data products. This function should be such that the data product can be produced, verified, and controlled under the proposed processing conditions. The quality aspects of the design should be unambiguous and adequately define characteristics important to data product quality, such as acceptance and rejection criteria. Tests and reviews should be conducted to verify the design.

3.4 Production

In terms of production, data quality begins with the procurement of raw data. A data quality system for procurement must include the selection of qualified raw data suppliers, an agreement on quality assurance, and verification methods. All raw data must conform to data quality standards before being introduced into the data manufacturing system. When traceability is important, appropriate identification should be maintained throughout the data manufacturing process so that the original identification of the raw data and its quality status can be ascertained. The identification, in the form of tags, should distinguish between verified and unverified raw data. Verification of the quality status of data should be carried out at important points during the data manufacturing process, and these verifications should relate directly to the data product specifications.
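
A minimal sketch of the tagging idea just described, assuming a simple record structure of our own devising: each raw data item carries its supplier, a verification status that distinguishes verified from unverified data, and a history for traceability through the data manufacturing process.

    from dataclasses import dataclass, field

    @dataclass
    class RawDataItem:
        value: object
        supplier: str                 # qualified raw data supplier
        verified: bool = False        # distinguishes verified from unverified data
        history: list = field(default_factory=list)  # traceability of quality checks

    def verify(item: RawDataItem, conforms) -> RawDataItem:
        """Record the outcome of a verification point against the product specification."""
        ok = conforms(item.value)
        item.verified = ok
        item.history.append("passed" if ok else "failed")
        return item

    item = RawDataItem(value="02139", supplier="vendor_a")
    verify(item, conforms=lambda v: v.isdigit() and len(v) == 5)
    print(item.verified, item.history)   # True ['passed']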

Suspected non-conforming data items (including raw data, data products in progress, and final data products) should be identified, reviewed, and recorded. Corrective action begins when a non-conforming data item is detected. The significance of a problem affecting quality should be evaluated in terms of its potential impact on production costs, quality costs, customer satisfaction, and so forth. Cause-and-effect relationships should be identified and preventive action initiated to prevent a recurrence of non-conformity. It is well accepted by practitioners in the data quality management area that it is more beneficial to check and, if appropriate, modify the process that caused a data quality problem than to correct the non-conforming data items.

3.5 Distribution

The handling of data products requires proper planning and control. The marking and labeling of data products should be legible and durable, and should remain intact from initial receipt to final delivery. The data quality system should establish requirements for the maintenance and after-sales servicing of data products. In addition, it should provide a means of identification, collection, indexing, storage, retrieval, and disposition of documentation and records regarding the data products produced by a data manufacturing system.

3.6 Personnel Management

Personnel management falls into three categories: training, qualification (formal qualification, experience, and skill) and motivation. An effort should be made to raise employees' awareness of issues related to data quality. Management should measure and reward data quality achievement.

3.7 Legal Function

The safety aspect of data products should be identified to enhance product safety and minimize product liability. Doing so includes identifying relevant safety standards, carrying out design evaluation and prototype testing, providing data quality indicators to the user, and developing a traceability system to facilitate data product recall.

4. Analysis of Data Quality Research

The TDQM framework is employed in this section to analyze the articles we have identified as being relevant to data quality research. The section is divided into seven subsections, reflecting the seven elements of the framework. To serve as a reference for the reader, related research efforts are listed at the beginning of each subsection. The list also reflects whether these research efforts address issues related to other elements of the framework. For example, the list at the beginning of Section 4.1 (Table 1) shows that Huh et al. [1990] also address issues related to Sections 4.3.3 and 4.4, as indicated by the symbol '✓'. This convention is used throughout Section 4.

4.1 Management Responsibilities

The importance of top management's involvement in attaining high quality data has been recognized; see, for example, Bailey & Pearson [1983] and Halloran et al. [1978] . However, as can be seen from Table 1, very little research has been conducted to investigate what constitutes a data quality policy or how to establish a data quality system.

Table 1: Data Quality Literature Related to Management Responsibilities
Research | 4.1 | 4.2 (4.2.1, 4.2.2, 4.2.3) | 4.3 (4.3.1, 4.3.2, 4.3.3) | 4.4 | 4.5 | 4.6 | 4.7
[Huh et al., 1990]  ✓ ✓ ✓
[McGee & Wang, 1993]  ✓
[Oman & Ayers, 1988]  ✓ ✓ ✓
[Pautke & Redman, 1990]  ✓ ✓ ✓
[Redman, 1992]  ✓ ✓ ✓
[Wang & Kon, 1993]  ✓

Oman & Ayers [1988] focus on the organizational side of implementing data quality systems. They use data quality survey results as a tool to improve the quality of data in the U.S. government. The survey results were sent back to the people directly responsible for data quality on a monthly basis over a period of several years.

AT&T has established a data quality system to control and improve the quality of the data it uses to operate its worldwide intelligent network [Huh et al., 1990; Pautke & Redman, 1990; Redman, 1992] . This data quality system includes seven steps based on AT&T's process management and improvement guidelines, as mentioned earlier.

Finally, a high level management perspective of data quality has been proposed [McGee & Wang, 1993; Wang & Kon, 1993] in which fundamental problems such as the need to define, measure, analyze, and improve data quality are identified. Guidelines are provided to show how total data quality management may be attained. In addition, challenging research issues for each problem area are identified.

Analysis of Research

The lack of research on what constitutes a data quality policy and how to establish a quality system for data contrasts sharply with the growing anecdotal evidence that organizations are increasingly aware of the need to develop a corporate policy for data quality management [Bulkeley, 1992; Cronin, 1993; Gartner, 1993; Liepins, 1989; Liepins & Uppuluri, 1990; Sparhawk, 1993] . Researchers and practitioners alike need to demonstrate convincingly to top management that data quality is critical to the survival of an organization in the rapidly changing global environment. Research is also needed to develop methodologies that will assist management in identifying data quality factors that affect a company's position. Such a methodology will allow management to analyze the impact of data quality more directly. In addition, case studies are needed to document successful endeavors in developing a corporate data quality policy and in establishing a data quality system. From this, hypotheses for a corporate data quality policy and system guidelines could be developed. As more and more companies implement a corporate data quality policy and system, empirical analysis could be performed to test the hypotheses formulated above and to identify the most successful approaches. The results would form a basis for the development of a body of knowledge on this topic.

4.2 Operation and Assurance Costs

The costs of a data quality system are divided into: (1) operating costs consisting of prevention, appraisal, and failure costs, and (2) assurance costs dealing with the demonstration and proof of quality as required by both management and customers. The research that addresses cost issues is identified in Table 2. Although the literature does not specifically distinguish operating and assurance costs, it is possible to classify the work that has been carried out by the areas in which it occurs, namely, information systems, databases, and accounting.

Table 2: Data Quality Literature Related to Operation and Assurance Costs
Research | 4.1 | 4.2 (4.2.1, 4.2.2, 4.2.3) | 4.3 (4.3.1, 4.3.2, 4.3.3) | 4.4 | 4.5 | 4.6 | 4.7
[Amer et al., 1987]  ✓
[Ballou & Pazer, 1987]  ✓ ✓ ✓
[Ballou & Tayi, 1989]  ✓ ✓ ✓
[Bodnar, 1975]  ✓
[Bowen, 1993]  ✓ ✓
[Burns & Loebbecke, 1975]  ✓
[Cushing, 1974]  ✓
[Feltham, 1968]  ✓ ✓
[Fields et al., 1986]  ✓
[Groomer & Murthy, 1989]  ✓
[Hamlen, 1980]  ✓
[Hansen, 1983]  ✓
[Johnson et al., 1981]  ✓
[Mendelson & Saharia, 1986]  ✓
[Nichols, 1987]  ✓
[Stratton, 1981]  ✓
[Wand & Weber, 1989]  ✓
[Yu & Neter, 1973]  ✓

The information systems literature contains work that analyzes cost/quality tradeoffs of control procedures in information systems and develops a methodology for allocating resources for data quality enhancement. In the database area, research has been conducted that investigates the cost and impact of incomplete information on database design. Research articles in these areas are examined below from the perspective of data quality research. Finally, a body of research has been established in the accounting literature that places a specific emphasis on internal control systems and audits. An implicit assumption underlying this research is that, by incorporating a set of internal controls in a financial information system to enhance the system's reliability, a high probability of preventing, detecting, and eliminating data errors, irregularities, and fraud in the financial information system can be maintained. The demonstrated reliability of the system can provide evidence of the quality of the data products produced by the system.

4.2.1. Information Systems

Ballou & Pazer [1987] present a framework for studying the cost/quality tradeoffs of internal control scenarios that are designed to ensure the quality of the output from an information system. The model used includes parameters such as the cost and quality of processing activities, the cost and quality of corrective procedures, and a penalty cost that is incurred by failure to detect and correct errors. Ballou & Tayi [1989] use parameters including the cost of undetected errors, the stored data error rate, and the effectiveness of data repair procedures in an integer program designed to allocate available resources in an optimal manner. The researchers note in particular the need to obtain better estimates of the penalty costs of poor information quality. A heuristic is developed that assesses whether it is better to identify and correct a few serious errors in one database or more widespread, but less severe, quality problems in another.
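
The flavor of such an allocation decision can be conveyed with the sketch below. This is our own greedy simplification, not the integer program of Ballou & Tayi [1989], and the database names, costs, and benefit figures are hypothetical.

    # A greedy sketch in the spirit of allocating resources for data quality enhancement.
    # Not the actual model of Ballou & Tayi [1989]; all figures are hypothetical.
    candidates = [
        # (database, cost of cleanup effort, expected penalty cost avoided)
        ("orders",    40.0, 90.0),   # a few serious errors
        ("customers", 25.0, 35.0),   # widespread but less severe errors
        ("inventory", 30.0, 20.0),
    ]
    budget = 60.0

    # Rank candidates by expected benefit per unit of cost and fund them until the budget runs out.
    chosen = []
    for name, cost, benefit in sorted(candidates, key=lambda c: c[2] / c[1], reverse=True):
        if cost <= budget:
            chosen.append(name)
            budget -= cost
    print(chosen)   # ['orders'] -- the remaining budget (20.0) cannot fund another project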

4.2.2. Database

In analyzing the impact of incomplete information on database design, Mendelson & Saharia [1986] present a decision-theoretic approach. An implicit assumption in this work is that the database will reflect the real world as far as the relevant schemas are concerned; in other words, for each schema, the database will contain the complete set of instances at any given point in time, and all instance values will be accurate. The premise of this research is that a database designer must make a tradeoff between the cost of incomplete information (i.e., the exclusion of relevant entities or attributes) and data-related costs (i.e., the cost of data collection, manipulation, storage, and retrieval). That is, a database designer has two choices. The first is to design a system that contains a complete set of schemas and attributes for the intended application domain (so the data consumer obtains information with certainty and without a penalty cost) but incurs higher data-related costs. The second is a system that contains an incomplete set of schemas (e.g., with certain attributes omitted) but incurs lower data-related costs; in this case, the data consumer may pay a penalty due to incomplete information. From the data quality perspective, this work is important because it addresses the relationship between prevention costs and failure costs within the context of the key data quality dimension of completeness. Here, prevention costs refer to the additional data-related costs incurred by maintaining complete schemas, whereas failure costs refer to the costs of erroneous decisions made because of incomplete information.
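
The tradeoff can be illustrated with a small decision-style sketch under our own simplifying assumptions; the actual model of Mendelson & Saharia [1986] is considerably richer, and the schema names and cost figures below are hypothetical. Each candidate schema carries a data-related cost and an expected penalty from the information it omits, and the designer picks the schema minimizing their sum.

    # Hypothetical schema alternatives: data-related cost vs. expected penalty of incompleteness.
    # A complete schema has no penalty but higher collection/storage/retrieval cost.
    schemas = {
        "complete":          {"data_cost": 100.0, "incompleteness_penalty": 0.0},
        "omit_minor_attrs":  {"data_cost": 70.0,  "incompleteness_penalty": 15.0},
        "omit_entity_class": {"data_cost": 40.0,  "incompleteness_penalty": 80.0},
    }

    def total_cost(s):
        return s["data_cost"] + s["incompleteness_penalty"]

    best = min(schemas, key=lambda name: total_cost(schemas[name]))
    print(best)   # 'omit_minor_attrs' (85.0 beats 100.0 and 120.0)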

An accumulated body of research in the database area aims to reduce CPU time and query response time. Other research aims at minimizing communications costs with a goal of increasing data availability; see, for example, [Kumar & Segev, 1993]. Although this type of research can clearly be associated with operation costs, for reasons mentioned in Section 1, it is not discussed further.

4.2.3. Accounting

Feltham [1968] identifies relevance, timeliness, and accuracy as the three dimensions of data quality, and analyzes their relationship with the value-in-excess-of-cost criterion within the context of the data consumer. A payoff function is developed in terms of future events and prediction of these events. Yu & Neter [1973] propose one of the earliest stochastic models of an internal control system that could serve as the basis for an objective, quantitative evaluation of the reliability of an internal control system. They also discuss implementation problems for the proposed model and possible approaches to tying the output from the proposed model to substantive tests of an account balance. Cushing [1974] independently develops a simple stochastic model for the analysis and design of an internal control system by adapting concepts from the field of reliability engineering. The model uses reliability and cost parameters to provide a means to compute the reliability of a process, that is, the probability of completion with no errors. The cost parameters include the cost of performing the control procedure, the average cost of searching and correcting for signaled errors, and the average cost of an undetected error.
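
A toy version of such a reliability-and-cost calculation is sketched below. It follows the spirit of Cushing's [1974] model, but the structure and all parameter values are our own simplifications for a single control procedure.

    # Expected cost of processing one transaction through a single control procedure.
    # All parameters are illustrative, not Cushing's.
    p_error         = 0.02    # probability the transaction is in error
    p_detect        = 0.90    # reliability of the control: probability an error is signaled
    cost_control    = 0.05    # cost of performing the control procedure
    cost_correct    = 1.00    # average cost of searching for and correcting a signaled error
    cost_undetected = 20.00   # average cost of an error that slips through

    expected_cost = (cost_control
                     + p_error * p_detect * cost_correct
                     + p_error * (1 - p_detect) * cost_undetected)
    process_reliability = 1 - p_error * (1 - p_detect)   # probability of no undetected error

    print(round(expected_cost, 4), round(process_reliability, 4))   # 0.108 0.998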

Burns & Loebbecke [1975] focus on internal control evaluation in terms of a tolerable data error compliance level, but do not address issues related to cost. Bodnar [1975] expands Cushing's discussion of the applicability of reliability engineering techniques to internal control systems analysis. Hamlen [1980] proposes a model for the design of an internal control system that minimizes system costs subject to management-set error reduction probability goals for certain types of errors. Stratton [1981] demonstrates how reliability models can be used by management, or independent auditors, to analyze accounting internal control systems.

Other research investigates error characteristics [Bowen, 1993; Johnson et al., 1981], audit concerns in distributed processing systems [Hansen, 1983], models of the internal control process [Fields et al., 1986; Nichols, 1987], information systems research as it applies to accounting and auditing [Amer et al., 1987], data quality as it relates to audit populations [Groomer & Murthy, 1989], and how auditors need to examine only those parts of a system where structural changes made at the system level induce structural changes at the subsystem level [Wand & Weber, 1989].

Analysis of Research

A significant body of research exists that addresses issues associated with costs and data quality, mostly data errors. At a theoretical level, this body of research provides a basis for studying the relationship between prevention costs and failure costs. At a more pragmatic level, however, there is still a critical need for the development of methodologies and tools that help practitioners to determine operation and assurance costs when implementing a data quality system. Such methodologies and tools should allow practitioners to determine prevention, appraisal, and failure costs along data quality dimensions such as accuracy and timeliness.

There does not appear to be any research that deals with the economics of external failure, although anecdotes abound; see, for example, Liepins & Uppuluri [1990]. Furthermore, there does not seem to be any work that attempts to estimate the costs of data quality assurance. Applying quality control techniques to data has been largely overlooked because the economic consequences of data errors are less apparent than those of manufacturing non-conformities. But data errors are costly; the cost is simply harder to quantify [Liepins, 1989]. In some cases, it can be catastrophically high.

In addition, persuasive case studies of the impact of data quality upon the profit-and-loss statement are needed, similar to those that have been conducted in the physical product manufacturing area. For example, when examining product or service quality, Crosby [1979] estimates that manufacturing companies spend over 25% of sales doing things wrong and that service companies spend more than 40% of their operating costs on wasteful practices. As another example, Garvin [1983] demonstrates that among businesses with less than 12% of the market, those with inferior product quality averaged a return on investment (ROI) of 4.5%, those with average product quality an ROI of 10.4%, and those with superior product quality an ROI of 17.4%. Those businesses that improved in quality during the 1970s increased their market share five to six times faster than those that declined, and three times faster than those whose quality remained unchanged. Research in the data quality area analogous to the examples illustrated above will not only contribute to the development of tools and methodologies for estimating operation and assurance costs, but will also help convince top management to implement a corporate data quality policy.

4.3 Research and Development

A significant amount of work can be classified under research and development, although the original work might not have been identified as such. From the TDQM perspective, there are three main issues involved in research and development: (1) analysis and design of the data quality aspects of data products, (2) design of data manufacturing systems that incorporate data quality aspects, and (3) definition of data quality dimensions and measurement of their values. The following three subsections examine the literature, as summarized in Table 3, that addresses these issues.

Table 3: Data Quality Literature Related to Research and Development
Research | 4.1 | 4.2 (4.2.1, 4.2.2, 4.2.3) | 4.3 (4.3.1, 4.3.2, 4.3.3) | 4.4 | 4.5 | 4.6 | 4.7
[Agmon & Ahituv, 1987]  ✓
[Ahituv, 1980]  ✓
[Bailey & Pearson, 1983]  ✓ ✓
[Ballou & Pazer, 1982]  ✓
[Ballou & Pazer, 1985]  ✓ ✓ ✓
[Ballou & Pazer, 1987]  ✓ ✓ ✓
[Ballou & Tayi, 1989]  ✓ ✓ ✓
[Ballou et al., 1993]  ✓ ✓
[Baroudi & Orlikowski, 1988]  ✓
[Blaylock & Rees, 1984]  ✓
[Bowen, 1993]  ✓ ✓
[Brodie, 1980]  ✓ ✓
[Chen, 1993]  ✓
[Delone & McLean, 1992]  ✓
[Feltham, 1968]  ✓ ✓
[Halloran et al., 1978]  ✓ ✓
[Hamilton & Chervany, 1981]  ✓
[Huh et al., 1990]  ✓ ✓ ✓
[Ives et al., 1983]  ✓
[Ives & Olson, 1984]  ✓
[Jang et al., 1992]  ✓ ✓
[Janson, 1988]  ✓ ✓
[Jones & McLeod, 1986]  ✓
[Kim, 1989]  ✓
[King & Epstein, 1983]  ✓
[Kriebel, 1979]  ✓
[Larcker & Lessig, 1980]  ✓
[Iivari & Koskela, 1987]  ✓
[Madnick & Wang, 1992]  ✓
[Melone, 1990]  ✓
[Miller & Doyle, 1987]  ✓
[Morey, 1982]  ✓ ✓
[Paradice & Fuerst, 1991]  ✓ ✓
[Pautke & Redman, 1990]  ✓ ✓ ✓
[Redman, 1992]  ✓ ✓ ✓
[Svanks, 1984]  ✓ ✓
[Wang & Madnick, 1990]  ✓
[Wang et al., 1993]  ✓ ✓
[Wang et al., 1992]  ✓
[Zmud, 1978]  ✓

4.3.1. Analysis and design of the quality aspects of data products

Brodie [1980] places the role of data quality within the life cycle framework with an emphasis on database constraints. Data quality is defined as a measure of the extent to which a database accurately represents the essential properties of the intended application and has three distinct properties: (1) data reliability, (2) logical or semantic integrity, and (3) physical integrity (the correctness of implementation details). A semantic integrity subsystem to improve data quality is proposed that consists of five parts: (1) a constraint language to express the constraints, (2) a constraint verifier, (3) a constraint database management system, (4) a constraint validation system, and (5) a violation-action processor.

Svanks [1984] reports on the actual development of an integrity analysis system that has been tested on a case study. His approach consists of seven steps: (1) defining database constraints, (2) selecting statistical techniques for sampling the database, (3) selecting integrity analysis to be performed, (4) defining suitable quality measures, (5) specifying outputs to be produced from a defect file, (6) developing and testing program code, and (7) executing the integrity analysis system.

In the conceptual modeling area, most data models, including the Entity-Relationship (ER) model [Chen, 1976], are aimed at capturing the content of data (such as which entities or attributes are to be included for the intended application domain) and do not deal explicitly with the data quality aspect. Chen, who first proposed the ER model, recommends that a methodology be developed to incorporate quality aspects into the ER model in order to enable database designers to deal systematically with data quality issues at an early design stage [Chen, 1993]. To extend the ER model, a methodology for data quality requirements collection and documentation is proposed to include data quality specifications as an integral component of the database design process [Wang et al., 1993]. The methodology includes a step-by-step procedure for defining and documenting the data quality parameters (e.g., timeliness or credibility) that are important to users. These subjective data quality parameters are then translated into more objective data quality indicators (e.g., data source, creation time, and collection method) that should be tagged to the data items.

Analysis of Research

In the past, researchers have focused mainly on the accuracy requirements for data products which are represented by semantic integrity constraints. However, since data quality is a multi-faceted concept that includes not only accuracy, but also other dimensions such as timeliness and completeness, much more research is needed on the other dimensions as well. In short, research on this topic is still in its formative stage.

Much work is needed to develop a formal design methodology that can be used by database designers to gather systematically and translate "soft" customer data quality requirements into "hard" design specifications. Research issues such as the following need to be addressed: (1) What differentiates a data quality attribute from a regular entity attribute? (2) How does one relate quality attributes to entities, relationships, and their attributes? (3) How does one determine which quality attributes are appropriate for a given application domain? (4) Under what circumstances should an attribute be categorized as a quality attribute as opposed to an entity attribute? (5) What are the criteria for trading off the quality aspects versus other design factors such as cost when determining which quality attributes to incorporate into a database system?

4.3.2. Design of data manufacturing systems that incorporate data quality

Similar to the quality-by-design concept that is advocated by leaders in the TQM area [Juran, 1992; Taguchi, 1979] , the quality aspects of data products should be designed into data manufacturing systems in order to attain quality-data-product-by-design. The literature that addresses this topic can be classified into two categories: (1) analytical models that study how data manufacturing systems can be developed so that data quality requirements can be met (e.g., acceptable error rate) subject to certain constraints (e.g., minimal cost), and (2) system technologies that can be designed into data manufacturing systems to ensure that data products will meet the specified quality.

4.3.2.1. Analytical Models

Most of the research that addresses issues related to operation and assurance costs falls into this category. This is because the research approaches are analytic, and their primary concern is how to enhance a data manufacturing system's quality in such a way that a high probability of preventing, detecting, and eliminating data quality problems in the system can be maintained.

In addition, Ballou and Pazer [1985; 1987] describe a model that produces expressions for the magnitude of errors for selected, terminal outputs. The model is intended for analysts to compare alternative quality control strategies and is useful in assessing the impact of errors in existing systems. The researchers also develop an operations research model for analyzing the effect and efficacy of using data quality control blocks in managing production systems. Ballou et al. [1993] further propose a data manufacturing model in order to determine data product quality. A set of ideas, concepts, models, and procedures appropriate to data manufacturing systems that can be used to determine the quality impacts of data products delivered, or transferred, to data customers is presented. To measure the timeliness, quality, and cost of data products, the model systematically tracks relevant parameters. This is facilitated through a data manufacturing analysis matrix that relates data units to various system components.

4.3.2.2. Systems Technologies

A data tracking technique that employs a combination of statistical control and manual identification of errors and their sources has been developed [Huh et al., 1990; Pautke & Redman, 1990; Redman, 1992] . Underlying the data tracking technique is the idea that processes which create data are often highly redundant. Data tracking uses the redundancy to determine pairs of steps in the overall process that yield inconsistent data. Changes that arise during data tracking are classified as normalization, translation, or spurious-operational. Spurious-operational changes occur when fields are changed during one sub-process; they indicate an error somewhere in the process. This allows the cause of errors to be systematically located.
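
The core of the idea, comparing the values recorded for the same field at successive redundant steps of the process and flagging inconsistent pairs, can be sketched as follows. The step names, field values, and the adjacent-step comparison rule are our own illustration of the idea, not the AT&T procedure.

    # Values recorded for the same fields at successive, redundant process steps (hypothetical).
    observations = {
        "customer_name": {"order_entry": "J. Smith", "billing": "J. Smith", "provisioning": "J. Smth"},
        "circuit_id":    {"order_entry": "NY-0042",  "billing": "NY-0042",  "provisioning": "NY-0042"},
    }
    step_order = ["order_entry", "billing", "provisioning"]

    def inconsistent_pairs(field_values):
        """Return the pairs of adjacent steps whose recorded values disagree."""
        pairs = []
        for a, b in zip(step_order, step_order[1:]):
            if field_values[a] != field_values[b]:
                pairs.append((a, b))
        return pairs

    for field, values in observations.items():
        print(field, inconsistent_pairs(values))
    # customer_name [('billing', 'provisioning')]  -> a change within one sub-process: investigate
    # circuit_id []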

An attribute-based model that can be used to incorporate quality aspects of data products has also been developed [Jang et al., 1992; Madnick & Wang, 1992; Wang et al., 1993; Wang et al., 1992]. Underlying this model is the idea that objective data quality indicators (such as source, time, and collection method) can be designed into a system in such a way that data products will be delivered along with these indicators. As a result, data consumers can judge the quality of the data product according to their own chosen criteria. These data quality indicators also help to trace the supplier of raw data to ensure that the supplier meets the quality requirements. Also introduced in the attribute-based model is the notion of a quality database management system; that is, one that supports data quality related capabilities (for example, a version of SQL that includes a facility for data quality indicators). Much like the ER model [Chen, 1976], which is widely adopted by industry as a tool for database design, the attribute-based model can be used as a guideline for database designers to incorporate data quality aspects into their systems.
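
A minimal sketch of the underlying idea: data values travel together with objective quality indicators, and the data consumer applies his or her own acceptance criteria to them. The indicator names, values, and filtering criterion below are our own illustration; the cited papers define the model, not this code.

    from datetime import date

    # Each data value is tagged with objective quality indicators (source, creation time, collection method).
    tagged_data = [
        {"value": 104.2, "source": "exchange_feed", "created": date(1994, 9, 1), "method": "automated"},
        {"value": 103.9, "source": "manual_entry",  "created": date(1994, 6, 1), "method": "keyed"},
    ]

    # A consumer-chosen criterion: accept only automated data created after mid-1994.
    def acceptable(item):
        return item["method"] == "automated" and item["created"] >= date(1994, 7, 1)

    accepted = [d["value"] for d in tagged_data if acceptable(d)]
    print(accepted)   # [104.2]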

Finally, research by Brodie [1980] and Svanks [1984] on database constraints could be further extended to help design and produce data products that will meet the specified quality.

Analysis of Research

A significant number of analytic models have been developed for the design of data manufacturing systems, most of which focus primarily on data accuracy. Therefore, future research in this area should be directed toward other data quality dimensions. Moreover, many of these mathematical models make assumptions that require further work in order to be applicable in practice. The assumptions underlying these models typically include inputs for the mathematical models, the topology of the system, the cost and data quality information, and the utility function of the data consumers. In practice, obtaining this information can be very challenging. There is also a gap between understanding these models and applying them to the design of industrial-strength data manufacturing systems.

To exploit fully the potential of these mathematical models, computer-aided tools and methodologies based on extensions of these mathematical models need to be developed. They would allow designers to explore more systematically the design alternatives for data manufacturing systems, much like the Computer-Aided Software Engineering tools that have been developed, based on variants of the ER model, for database designers to explore design alternatives.

The system technologies research represents one of the promising areas that can have short-term as well as long-term benefits to organizations. Some synergy might be generated among the data tracking technique, the attribute-based approach, and the database constraints work. Future data manufacturing systems should incorporate the data quality indicators that are critical to the application domain, specify all database constraints so as to protect the data from integrity problems, and build a data tracking mechanism into the system to detect possible areas of data quality problems.

4.3.3. Dimensions of data quality and measurement of their values

The three primary groups of researchers who have attempted to identify appropriate dimensions of data quality are those in the areas of: (1) data quality, (2) information systems success and user satisfaction, and (3) accounting and auditing.

In the data quality area, a method is proposed by Morey [1982] that estimates the "true" stored error rate. Ballou et al. [1982; 1985; 1987; 1989; 1993] define: (1) accuracy, which occurs when the recorded value is in conformity with the actual value; (2) timeliness, which occurs when the recorded value is not out of date; (3) completeness, which occurs when all values for a certain variable are recorded; and (4) consistency, which occurs when the representation of the data value is the same in all cases. Other dimensions that have been identified include data validation, availability, traceability, and credibility [Janson, 1988; Loebl, 1990]. Redman et al. [1990; 1990; 1992] identify more than twenty dimensions of data quality, including accuracy, completeness, consistency, and cycle time. Finally, Paradice & Fuerst [1991] develop a quantitative measure of data quality by formulating the error rate of MIS records, which are classified as being either "correct" or "erroneous".
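
Three of these dimensions can be given simple operational readings, as in the sketch below. The formulas are our own naive operationalizations for a small flat file checked against a known reference, not definitions taken from the cited work, and the records are hypothetical; consistency is omitted for brevity.

    from datetime import date

    # A small file of recorded values and a reference ("actual") version of the same records.
    recorded  = [{"id": 1, "balance": 100.0, "as_of": date(1994, 9, 1)},
                 {"id": 2, "balance": None,  "as_of": date(1994, 3, 1)}]
    reference = {1: 100.0, 2: 250.0}

    # Fraction of recorded values that match the actual value, are non-missing, and are recent enough.
    accuracy     = sum(r["balance"] == reference[r["id"]] for r in recorded) / len(recorded)
    completeness = sum(r["balance"] is not None for r in recorded) / len(recorded)
    timeliness   = sum(r["as_of"] >= date(1994, 7, 1) for r in recorded) / len(recorded)

    print(accuracy, completeness, timeliness)   # 0.5 0.5 0.5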

An accumulated body of research on evaluating information systems success from the user's point of view has appeared in the information systems field. Halloran et al. [1978] propose various factors such as usability, reliability, and independence. Zmud [1978] conducts a survey to establish important user requirements of data quality; the results reveal some of the users' intuitions about the dimensions of data quality. In evaluating the quality of information systems, Kriebel [1979] identifies attributes such as accuracy, timeliness, precision, reliability, completeness, and relevancy. In assessing the value of an information system, Ahituv [1980] proposes a multi-attribute utility function, and suggests relevant attributes such as timeliness, accuracy, and reliability.

User satisfaction studies have identified as important dimensions: accuracy, timeliness, precision, reliability and completeness [Bailey & Pearson, 1983] . Other work on user satisfaction and user involvement that identified data quality attributes can be found in Ives, Olson, & Baroudi [1983] , Ives & Olson [1984] , Baroudi & Orlikowski [1988] , Kim [1989] , and Melone [1990] . Work also has been carried out on information systems value [Delone & McLean, 1992; King & Epstein, 1983] .

Agmon & Ahituv [1987] apply reliability concepts from the field of quality control to information systems. Three measures for data reliability are developed: (1) internal reliability (the "commonly accepted" characteristics of data items), (2) relative reliability (the compliance of data with user requirements), and (3) absolute reliability (the degree to which data items resemble reality). Jang, Kon, & Wang [1992] propose a data quality reasoner that provides both an automated form of judging data quality and an objective measure of overall data quality.

Other, perhaps less directly related, research includes: development of an instrument for the perceived usefulness of information [Larcker & Lessig, 1980], analysis of approaches for evaluating system effectiveness [Hamilton & Chervany, 1981], examination of the "usefulness" of information in relation to cognitive style [Blaylock & Rees, 1984], evaluation of the structure of executive information systems and their relationship to decision making [Jones & McLeod, 1986], measurement of the quality of information systems [Iivari & Koskela, 1987], and measurement of information systems effectiveness as applied to financial services [Miller & Doyle, 1987].

In accounting and auditing where internal control systems require maximum reliability with minimum cost, the key data quality dimension used is accuracy, which is defined in terms of the frequency, size, and distribution of errors in data. In assessing the value of information, Feltham [1968] further identifies relevance and timeliness as desirable attributes of information.

Analysis of Research

A number of data quality dimensions have been identified, although there is a lack of consensus both on what constitutes a set of "good" data quality dimensions and on what an appropriate definition is for each. In fact, even a relatively obvious dimension, such as accuracy, does not have a well-established definition. Most research efforts have assumed that a record is accurate if it conforms with the actual value. This, however, leads one to ask several questions. First, what is the "actual value"? Do values in all fields of a record need to conform with the "actual value" in order for the record to be considered "accurate"? Should a record with one inaccurate field value be considered more accurate than a record with two inaccurate field values? Finally, would a file with all of its records 99% accurate (1% off from the actual value) be more accurate than a file with 99% of its records 100% accurate but 1% of its records off by an order of magnitude?
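
The last question can be made concrete with a small worked example; the figures are hypothetical. File A has every record 1% off the actual value, while file B has 99% of its records exact and 1% off by an order of magnitude; an exact-match measure and a mean relative error rank the two files in opposite orders, which is precisely why an agreed definition is needed.

    actual = 100.0

    # File A: 100 records, each recorded as 101 (1% off). File B: 99 exact records and 1 recorded as 1000.
    file_a = [101.0] * 100
    file_b = [100.0] * 99 + [1000.0]

    def percent_exact(values):
        return sum(v == actual for v in values) / len(values)

    def mean_relative_error(values):
        return sum(abs(v - actual) / actual for v in values) / len(values)

    print(percent_exact(file_a), percent_exact(file_b))               # 0.0 0.99  -> B looks better
    print(mean_relative_error(file_a), mean_relative_error(file_b))   # 0.01 0.09 -> A looks better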

Two avenues could be pursued in the establishment of data quality dimensions: (1) Develop a scientifically grounded theory to rigorously define the dimensions of data quality, and separate the dimensions intrinsic to an information system from those external to the system. An ontology-based approach [Bunge, 1977; Bunge, 1979; Wand, 1989; Wand & Weber, 1990], for example, would identify the data deficiencies that exist when mapping the real world to an information system, and therefore could potentially offer a rigorous basis for defining the dimensions of data quality. (2) Establish a pragmatic approach to operationalize the definition of data quality through the formation of a data quality standards technical committee, consisting of key participants from government, industry, and research institutions. The responsibility of this committee would be to recommend a set of operational definitions for data quality dimensions.

4.4 Production

The previous section dealt with issues related to the research and development of data manufacturing systems that will enable the data producer to manufacture data products with the specified quality demanded by the data consumer. This section focuses on how to ensure that a data product is manufactured according to its given data quality specifications. In producing data products, three main issues are involved: (1) quality requirements in the procurement of raw data, components, and assemblies needed for the production of data products, (2) quality verification of raw data, work-in-progress, and final data products, and (3) nonconformity and corrective action for data products that do not conform to their specifications.

Table 4 summarizes the data quality research related to production. Most of the work on the research and development of data products can be employed at production time to help understand how to address the inter-related issues of procurement, verification, nonconformity, and corrective action. In addition, many research efforts address these issues directly at production time, as discussed below.

Table 4: Data Quality Literature Related to Production
Research | 4.1 | 4.2 (4.2.1, 4.2.2, 4.2.3) | 4.3 (4.3.1, 4.3.2, 4.3.3) | 4.4 | 4.5 | 4.6 | 4.7
[Ballou & Tayi, 1989]  ✓ ✓ ✓
[Brodie, 1980]  ✓ ✓
[Fellegi & Holt, 1976]  ✓
[Garfinkel et al., 1986]  ✓
[Huh et al., 1990]  ✓ ✓ ✓
[Jang et al., 1992]  ✓ ✓ ✓
[Janson, 1988]  ✓ ✓
[Jaro, 1985]  ✓
[Liepens et al., 1982]  ✓
[Little & Smith, 1987]  ✓
[McKeown, 1984]  ✓
[Morey, 1982]  ✓ ✓
[Oman & Ayers, 1988]  ✓ ✓ ✓
[Paradice & Fuerst, 1991]  ✓ ✓
[Pautke & Redman, 1990]  ✓ ✓ ✓
[Redman, 1992]  ✓ ✓ ✓
[Svanks, 1984]  ✓ ✓
[Strong, 1988]  ✓
[Strong, 1993]  ✓
[Strong & Miller, 1994]  ✓
[Wang et al., 1993]  ✓ ✓

Morey [1982] focuses on applications that are updated periodically or whenever changes to the record are reported, and examines: (1) the portion of incoming transactions that fail, (2) the portion of incoming transactions that are truly in error, and (3) the probability that the stored MIS record is in error for any reason. It is implicitly assumed that a piece of data (e.g., birth date, mother's maiden name, or rank) is accurate if it reflects the truth in the real world. A key result of this research is a mathematical relationship for the stored MIS record nonconformity rate as a function of the quality of the incoming data, the various processing times, the likelihood of Type I and II errors, and the probability distribution for the inter-transaction time. It is shown how the mathematical result can be used to forecast the level of improvement in the accuracy of the MIS record if corrective actions are taken.

Focusing on verification processes and building upon Morey's work, Paradice & Fuerst [1991] develop a verification mechanism based on statistical classification theory [Johnson & Wichern, 1988] whereby MIS records are classified as being either "correct" or "erroneous." Acceptance as a "correct" record by the verification mechanism occurs when the mechanism determines that the record is more similar to correct records, whereas rejection as an "erroneous" record occurs when the mechanism determines that the record is more similar to erroneous records. They use this as a benchmark for comparing actual error rates with the theoretically smallest attainable error rates and suggest a method for assessing an organization's data quality. They also offer guidelines for deriving values for parameters of their data quality model.
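
The classification idea can be sketched with a nearest-centroid rule, which is a deliberate simplification of the statistical classification machinery the authors draw on; the feature names and values below are hypothetical.

    # Each record is summarized by numeric features (hypothetical: field length, digit count, age in days).
    correct_examples   = [[8.0, 5.0, 10.0], [9.0, 5.0, 12.0]]
    erroneous_examples = [[3.0, 1.0, 400.0], [2.0, 0.0, 380.0]]

    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    c_correct, c_erroneous = centroid(correct_examples), centroid(erroneous_examples)

    def classify(record):
        """Accept the record as 'correct' if it is more similar to correct records."""
        return "correct" if distance(record, c_correct) <= distance(record, c_erroneous) else "erroneous"

    print(classify([8.5, 4.0, 20.0]))    # correct
    print(classify([2.5, 1.0, 390.0]))   # erroneous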

Janson [1988] demonstrates that exploratory statistical techniques [Tukey, 1977] can significantly improve data quality during all phases of data validation. It is argued that data validation, if it is to be done successfully, requires knowledge of the underlying data structures. This is particularly crucial when the data are collected without any prior involvement by the analyst. Exploratory statistical techniques are well suited to the data validation effort because they can aid in (1) identifying cases with data items that are suspect and likely to be erroneous, and (2) exploring the existence of functional relationships or patterns that can provide a basis for data cleaning.
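
One such exploratory technique, Tukey's fences for flagging suspect values, is sketched below; the data values are hypothetical and the fence multiplier of 1.5 is the conventional choice.

    import statistics

    # Flag data items that fall outside Tukey's fences (Q1 - 1.5*IQR, Q3 + 1.5*IQR).
    values = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2, 48.7, 12.3]   # one suspect entry

    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    suspect = [v for v in values if v < lower or v > upper]
    print(suspect)   # [48.7] -- a candidate for review before the data are accepted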

Strong [1988] examines how expert systems can be used to enhance data quality by inspecting and correcting non-conforming data items. A model is developed to evaluate the quality of the data products produced from a data production process [Strong, 1993]. It is also used to evaluate the process itself and thus provides information to the research and development role for improving the data production process [Strong & Miller, 1994]. The process analysis indicated that it can be difficult to determine the quality of raw data inputs, data processing, and data products because what may be viewed as data and process irregularities by management may be viewed as necessary data production process flexibility by producers and consumers of data.

With some similarities to the above work, there are a number of studies that focus on the data editing function for survey data and the changing of data when errors are detected [Fellegi & Holt, 1976; Garfinkel et al., 1986; Jaro, 1985; Liepens et al., 1982; Little & Smith, 1987; McKeown, 1984]. These studies address data verification and correction through anomaly resolution and missing-value imputation in questionnaire data prior to processing. McKeown [1984], for example, establishes probabilities that selected data fields were correct and contends that "data editing and imputation were separate and distinct." Garfinkel et al. [1986] use experts to establish feasibility constraints, and develop algorithms that are "particularly relevant to problems in which data of every record is indispensable."

Oman and Ayers [1988] provide a feedback loop for the correction of non-conforming data items. The method used tabulates the volume of data reported, counts the number of errors, and divides the number of correct data by the total volume of data to produce a statistic on "percent correct" which is the "bottom line" statistic to be used in scoring data quality and in providing feedback to reporting organizations. The analysis of the measurement shows marked improvement in the first half year of the effort and slow but steady progress overall.

The data tracking technique [Huh et al., 1990; Pautke & Redman, 1990; Redman, 1992] is another mechanism for verifying the quality of data in the data product manufacturing process. A combination of statistical control and manual identification of errors and their sources is used, which allows the cause of errors to be systematically located and therefore offers a framework for production verification. The data tracking technique, in its present form, focuses primarily on certain data items and their consistency at different stages of their life cycle, and involves a significant amount of man-machine interaction.

The attribute-based approach [Jang et al., 1992; Wang et al., 1993; Wang et al., 1992; Wang & Madnick, 1990] can also be applied to verify the quality of data. By examining data quality indicators (such as source, time, and procurement method) or data quality parameters (such as timeliness, credibility, and completeness), the producer can verify, at different stages of the data product manufacturing process, whether the data conform with the specified requirements. When non-conforming data are identified, their data quality indicators can be used to trace back to the source of the problem so that appropriate corrective actions may be taken.

Finally, as mentioned earlier, Svanks [1984] reports on the actual development of an integrity analysis system that consists of seven steps. These steps, in some sense, cover the overall quality aspects in data production. Work on auditing can also be applied to the verification process.

Analysis of Research

A body of research has been developed that can be related to the production of data products. Still, much more research is needed because there is a significant gap between the current state-of-the-art of data product production and the level of maturity required to establish a set of guidelines and standards in the data quality area, similar to those established in the ISO 9000. Possible future research directions include: (1) developing a "standard" data quality metric or acceptance test that could be used at the design stage or during the production process, (2) establishing criteria for the selection of qualified data suppliers, (3) developing a mechanism or tool to manage data quality shelf-life (out-of-date data) and deterioration control (data corruption), (4) studying the process, data flow and human or automated procedures, (5) developing a method for positive identification and control of all non-conforming material (poor quality data), and (6) investigating the link between poor data quality problems and revision of the procedures used to detect and eliminate problems.

4.5 Distribution

No research has been identified that directly addresses issues related to the distribution of data products. This is somewhat surprising when one considers the number of information providers in the marketplace selling data products ranging from real-time data to historical data. Researchers could examine the literature on physical product distribution to determine whether and how that body of knowledge can be adapted to data product distribution; establishing such an analogy could itself be a research goal. However, data products are not as tangible as physical products, and additional copies of data products can be produced at almost negligible cost compared with physical products. Thus, it may not be straightforward to adapt the knowledge of physical product distribution to data product distribution. A thorough examination of the need, role, and implementation strategies of data quality documentation is also required. In addition, researchers need to establish a customer feedback system for data quality and to define customer requirements accurately. The ultimate research goal in this area is to ensure that a data product delivered to data consumers will meet the quality requirements specified for it.

We note in passing that an important concept in the object-oriented field is encapsulation, in which data and procedures are "packaged" together. It might be interesting to apply the encapsulation concept to the packaging of data products. In addition, self-describing data files and metadata management have been proposed at the schema level to annotate the data contained in a file [McCarthy, 1982; McCarthy, 1984; McCarthy, 1988]. A similar annotation could be used to package data products, indicating, for example, what the data product is (identification), how it should be installed, and so on. A direct analogy to physical products can easily be seen: for physical products, labels are required to describe product features such as expiration date (for food and drugs) or safety information (for electronic components). Finally, research could be pursued to extend current database management system capabilities to handle identification, packaging, installation, delivery, and after-sales servicing of data products, as well as documentation and records for data products.
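
A minimal sketch of such a packaged, self-describing data product (ours; the label fields are hypothetical, chosen by analogy with physical product labels) might look as follows:

    import json
    from datetime import date

    # A data product "packaged" with a descriptive label that travels with it.
    data_product = {
        "label": {
            "identification": "quarterly_customer_summary_v3",
            "producer": "marketing data group",
            "produced_on": str(date(1994, 6, 30)),
            "expires_on": str(date(1994, 9, 30)),   # shelf-life of the data
            "installation": "load into the DSS staging table CUST_SUMMARY",
            "quality_notes": "addresses verified against the postal file",
        },
        "payload": [
            {"region": "NE", "active_customers": 10421},
            {"region": "SW", "active_customers": 8312},
        ],
    }

    # A consumer can check fitness for use (e.g., expiration) before installing the payload.
    print(json.dumps(data_product["label"], indent=2))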

4.6 Personnel Management

There have only been a few attempts by researchers to either address or analyze personnel issues within the context of data quality. These are summarized below in Table 6.

Table 6: Data Quality Literature Related to Personnel Management
(Each work is cross-referenced against the framework elements of Sections 4.1 through 4.7, with Sections 4.2 and 4.3 subdivided into 4.2.1-4.2.3 and 4.3.1-4.3.3.)
Research: [Maxwell, 1989]; [Oman & Ayers, 1988]; [Spirig, 1987]; [Te'eni, 1993]

Te'eni [1993] proposes a general framework for understanding data production that incorporates the person-environment fit and the effect of an employee's ability and motivation. The research concentrates on the people who produce the data, examining the processes they employ and how those processes lead to the production of high quality data. The premise is that problems in producing effective data are more likely to arise when one worker creates data and another uses them; that is, when data production is separated from data use. Data production problems are postulated to arise when there is a poor fit between: (1) the data producer's needs and resources, and (2) the organization's demands and resources.

Maxwell [1989] recognizes the need to improve the quality of data in human resource information systems (HRIS) databases as an important personnel issue. Well-known examples of poor quality data in personnel databases are often cited; for example, a person's name appears as "F. Ross" in one entry of a database and as "Forea K. Ross" in another, or "base salary" includes overtime premiums for one set of employees but not for another. Maxwell proposes that, to improve human resource data quality, three issues need to be examined: (1) data ownership and origination, (2) accuracy-level requirements of specific data, and (3) procedural problems correctable through training and improved communications. A similar observation on data ownership has been made by Spirig [1987]. He addresses some of the issues involved in interfacing payroll and personnel systems, and suggests that, when data ownership becomes separated from the data originator, no system can retain data quality for very long.
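
As a simple illustration of the kind of inconsistency described above (our sketch; the records and the matching rule are hypothetical), potentially duplicate personnel entries can be flagged by normalizing and comparing identifying fields:

    # Flag personnel records that may refer to the same person under different names.
    records = [
        {"emp_id": "4471", "name": "F. Ross",       "dept": "Finance"},
        {"emp_id": "8820", "name": "Forea K. Ross", "dept": "Finance"},
    ]

    def surname(name):
        return name.split()[-1].lower()

    def possible_duplicates(records):
        """Pair records that share a surname and department but have different IDs."""
        pairs = []
        for i, a in enumerate(records):
            for b in records[i + 1:]:
                if surname(a["name"]) == surname(b["name"]) and a["dept"] == b["dept"]:
                    pairs.append((a["emp_id"], b["emp_id"]))
        return pairs

    print(possible_duplicates(records))   # prints: [('4471', '8820')]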

Finally, Oman & Ayers [1988] report on a case study of how a company's employees identified the need to improve data quality and brought it to management's attention. This organizational awareness resulted in action by both the employees involved and top management to raise the level of data quality in a large MIS database that obtained its raw data from over twenty subordinate reporting organizations. Data quality was improved by users searching for other sources of information and by using competing systems.

Analysis of Research

There is some recognition of the awareness and motivational issues involved in obtaining high quality data. We do not see an immediate need for research in this area, as existing TQM techniques for personnel could likely be applied. Furthermore, it is more important to address management responsibilities first: after top management develops a data quality policy, personnel management issues related to data quality can be addressed. In the long term, research on personnel could include the following: (1) the development of an incentive plan to motivate employees to strive for high data quality, with case studies used to gain insights into what kinds of incentive structures are appropriate; (2) the development of a set of measures that an organization could use to monitor the quality of the data obtained and generated by employees, together with a mechanism for feeding the results back to them (for example, if one department is evaluated by the number of transactions processed and another by daily balances, such discrepancies need to be understood); and (3) the analysis of existing, successful compensation and reward structures in various types of companies to understand how they work, so that successful approaches could be adapted to data quality.

4.7 Legal Function

Within the context of the TDQM framework, the legal issues surrounding data products include enhancing data product safety and minimizing product liability. The research efforts in this area are summarized in Table 7.

Table 7: Data Quality Literature Related to Legal Function
(Each work is cross-referenced against the framework elements of Sections 4.1 through 4.7, with Sections 4.2 and 4.3 subdivided into 4.2.1-4.2.3 and 4.3.1-4.3.3.)
Research: [Laudon, 1986]; [Maxwell, 1989]; [Wright, 1992]

Laudon [1986] examines the quality of data in the criminal-record system of the United States (a large inter-organizational system). Samples of criminal-history and warrant records are compared to hard-copy original documents of known quality. The results show that a large number of people could be at risk of being falsely detained, thus affecting their constitutional rights. Maxwell [1989] identifies legal requirements as an important reason for assuring the accuracy of employee-originated data. For example, Section 89 of the Internal Revenue Code requires that, unless 80% of a company's non-highly compensated employees are covered by a plan, the plan must pass a series of eligibility tests; this, in turn, places a priority on the accuracy of employee data. From the legal viewpoint, the data products produced by human resources information systems must be accurate; otherwise the company would be breaking the law. Wright [1992], working in the Electronic Data Interchange (EDI) area, examines the need to legally prove the origin and content of an electronic message.

Analysis of Research

No work is known to us that addresses the safety of data products or limitations on liability for them. It is evident that data products are increasingly available in the (electronic) marketplace. Some are produced by a single information provider (e.g., mailing lists, value-added information services), and others are produced through inter-organizational information systems that involve more than one legal entity. The legal ramifications of data products will become increasingly important given the trend toward information highways. The Commission of the European Community has published a proposal for a Council Directive concerning the protection of individuals in relation to the processing of personal data [Chalton, 1991]; the Directive contains chapters on the rights of data subjects, data quality, and provisions relating to liability and sanctions. From the data product liability viewpoint, research needs to investigate the ramifications, for the parties involved, of failed data products produced by an inter-organizational system. These parties range from the suppliers of the raw data to those who either use the data product directly or are affected by it. In addition, methods are needed for avoiding or limiting liability.

5. Concluding Remarks

The traditional database literature refers to data quality management as ensuring: (1) syntactic correctness (e.g., constraint enforcement that prevents "garbage data" from being entered into the database) and (2) semantic correctness (data in the database truthfully reflect the real-world situation). This approach to data quality management leads to techniques such as integrity constraints, schema integration, and concurrency control. Although critical to data quality management, this traditional approach fails to address issues that are important from the user's perspective. Consequently, many organizational databases are plagued with erroneous data or contain data that do not meet the needs of the user's task at hand. It also fails to provide a basis for categorizing much of the literature that we have reviewed.

To overcome these problems, we took a practitioner's perspective, and developed a Total Data Quality Management (TDQM) framework for identifying and studying organizational data quality problems. This framework consists of seven elements: management responsibilities, operation and assurance costs, research and development, production, distribution, personnel management, and legal function.

The core findings of our analysis of the existing data quality literature, based on this framework, can be distilled as follows. First, within the context of what constitutes a data quality policy or how to establish a data quality system, there is a clear need to develop techniques that help management to deliver quality data products. Second, the economics of external data quality failure and the complementary costs of quality assurance need to be evaluated. Finally, there is a need to study the link between poor data quality problems and revision of the procedures used to detect and eliminate problems. A fundamental technical need is a data quality metric and a way to express the quality aspects of a data product design rigorously.

This framework has proven effective in recognizing and organizing the literature on data quality management. It provides a vocabulary for discussing the various data quality issues that organizations increasingly experience, and it points to directions in which research should be conducted if the ultimate goal of organizational information systems is to serve the needs of their users.

6. References

[1] Agmon, N. & Ahituv, N. (1987). Assessing Data Reliability in an Information System. Journal of Management Information Systems, 4(2), 34-44.

[2] Ahituv, N. (1980). A Systematic Approach Toward Assessing the Value of an Information System. MIS Quarterly, 4(4), 61-75.

[3] Amer, T., et al. (1987). A review of the computer information systems research related to accounting and auditing. The Journal of Information Systems, 2(1), 3-28.

[4] Arnold, S. E. (1992). Information manufacturing: the road to database quality. Database, 15(5), 32.

[5] AT&T (1988). Process Management & Improvement Guidelines, Issue 1.1. (No. Select Code 500-049). AT&T.

[6] Bailey, J. E. & Pearson, S. W. (1983). Development of a Tool for Measuring and Analyzing Computer User Satisfaction. Management Science, 29(5), 530-545.

[7] Bailey, R. (1983). Human Error in Computer Systems. Englewood Cliffs: Prentice-Hall, Inc.

[8] Ballou, D. P. & Pazer, H. L. (1982). The Impact of Inspector Fallibility on the Inspection Policy in Serial Production System. Management Science, 28(4), 387-399.

[9] Ballou, D. P. & Pazer, H. L. (1985). Modeling Data and Process Quality in Multi-input, Multi-output Information Systems. Management Science, 31(2), 150-162.

[10] Ballou, D. P. & Pazer, H. L. (1987). Cost/Quality Tradeoffs for Control Procedures in Information Systems. OMEGA: International Journal of Management Science, 15(6), 509-521.

[11] Ballou, D. P. & Tayi, K. G. (1989). Methodology for Allocating Resources for Data Quality Enhancement. Communications of the ACM, 32(3), 320-329.

[12] Ballou, D. P., et al. (1993). Modeling Data Manufacturing Systems to Determine Data Product Quality. Submitted for publication.

[13] Baroudi, J. J. & Orlikowski, W. J. (1988). A Short-Form Measure of User Information Satisfaction: A Psychometric Evaluation and Notes on Use. Journal of Management Information Systems, 4(4).

[14] Batini, C., et al. (1986). A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 323-364.

[15] Bernstein, P. A. & Goodman, N. (1981). Concurrency Control in Distributed Database Systems. Computing Surveys, 13(2), 185-221.

[16] Blaylock, B. & Rees, L. (1984). Cognitive Style and the Usefulness of Information. Decision Sciences, 15(1), 74-91.

[17] Bodnar, G. (1975). Reliability Modeling of Internal Control Systems. The Accounting Review, 50(4), 747-757.

[18] Bowen, P. (1993). Managing Data Quality in Accounting Information Systems: A Stochastic Clearing System Approach. Auburn University.

[19] Brodie, M. L. (1980). Data Quality in Information Systems. Information and Management, 3, 245-258.

[20] Bulkeley, W. (1992, May 26). Databases are plagued by reign of error. Wall Street Journal, p. B6.

[21] Bunge, M. (1977). Ontology I: The Furniture of the World. Boston: D. Reidel Publishing Company.

[22] Bunge, M. (1979). Ontology II: A World of Systems. Boston: D. Reidel Publishing Company.

[23] Burns, D. & Loebbecke, J. (1975). Internal Control Evaluation: How the Computer Can Help. Journal of Accountancy, 140(2), 60-70.

[24] Chalton, S. (1991). The Draft Directive on data protection: an overview and progress to date. International Computer Law Adviser, 6(1), 6-12.

[25] Chen, P. P. (1976). The Entity-Relationship Model - Toward a Unified View of Data. ACM Transactions on Database Systems, 1, 166-193.

[26] Chen, P. S. (1993). The Entity-Relationship Approach. In Information Technology in Action: Trends and Perspectives. (pp. 13-36). Englewood Cliffs: Prentice Hall.

[27] Cooper, R. B. (1983). Decision production - a step toward a theory of managerial information requirements. In the Proceedings of the Fourth International Conference on Information Systems, (pp. 215-268) Houston, TX.

[28] Cronin, P. (1993). Close the Data Quality Gap through Total Data Quality Management (TDQM). MIT Management (June).

[29] Crosby, P. B. (1979). Quality is Free. New York: McGraw-Hill.

[30] Crosby, P. B. (1984). Quality Without Tears. New York: McGraw-Hill Book Company.

[31] Cushing, B. E. (1974). A Mathematical Approach to the Analysis and Design of Internal Control Systems. Accounting Review, 49(1), 24-41.

[32] Delen, G. P. A. & Rijsenbrij, B. B. (1992). The Specification, Engineering, and Measurement of Information Systems Quality. Journal of Systems Software, 17(3), 205-217.

[33] Delone, W. H. & McLean, E. R. (1992). Information Systems Success: The Quest for the Dependent Variable. Information Systems Research, 3(1), 60-95.

[34] Deming, E. W. (1986). Out of the Crisis. Cambridge: Center for Advanced Engineering Study, Massachusetts Institute of Technology.

[35] Emery, J. C. (1969). Organizational Planning and Control Systems: Theory and Technology. New York: Macmillan.

[36] Fellegi, I. P. & Holt, D. (1976). A systematic approach to automatic edit and imputation. Journal of the American Statistical Association, 71(353), 17-35.

[37] Feltham, G. (1968). The value of information. The Accounting Review, 43(4), 684-696.

[38] Fields, K. T., et al. (1986). Quantification of the auditor's evaluation of internal control in data base systems. The Journal of Information Systems, 1(1), 24-77.

[39] Firth, C. (1993). Management of the Information Product. Masters Thesis, Management of Technology Program, MIT Sloan School of Management.

[40] Garfinkel, R. S., et al. (1986). Optimal imputation of erroneous data: Categorical data, general edits. Operations Research, 34(5), 744-751.

[41] Gartner (1993). Data Pollution Can Choke Business Process Re-engineering. Gartner Group Inside Industry Services, 1.

[42] Garvin, D. A. (1983). Quality on the line. Harvard Business Review, 61(5), 65-75.

[43] Groomer, S. M. & Murthy, U. S. (1989). Continuous auditing of database applications: An embedded audit module approach. The Journal of Information Systems, 3(2), 53-69.

[44] Halloran, D., et al. (1978). Systems Development Quality Control. MIS Quarterly, 2(4), 1-12.

[45] Hamilton, S. & Chervany, N. (1981). Evaluating Information System Effectiveness -- Part I: Comparing Evaluation Approaches. MIS Quarterly, 5(3), 55-69.

[46] Hamlen, S. S. (1980). A Chance Constrained Mixed Integer Programming Model for Internal Control Design. Accounting Review, 55(4), 578-593.

[47] Hansen, J. V. (1983). Audit Considerations in Distributed Processing Systems. Communications of the ACM, 26(5), 562-569.

[48] Hansen, M. & Wang, Y. R. (1990). Managing Data Quality: A Critical Issue for the Decade to Come. (No. CISL-91-05). Composite Information Systems Laboratory, MIT Sloan School of Management.

[49] Hansen, M. D. (1990). Zero Defect Data: Tackling the Corporate Data Quality Problem. Masters Thesis, MIT Sloan School of Management.

[50] Huh, Y. U., et al. (1990). Data Quality. Information and Software Technology, 32(8), 559-565.

[51] Iivari, J. & Koskela, E. (1987). The PIOCO Model for Information Systems Design. MIS Quarterly, 11(3), 401-419.

[52] Ives, B. & Olson, M. (1984). User involvement and MIS success: a review of research. Management Science, 30(5), 586-603.

[53] Ives, B., et al. (1983). The Measurement of User Information Satisfaction. Communications of the ACM, 26(10), 785-793.

[54] Jang, Y., et al. (1992). A Data Consumer-Based Approach to Data Quality Judgment. V. Storey & A. Whinston (Eds.), In the Second Annual Workshop on Information Technologies and Systems (WITS-92), (pp. 179-188) Dallas, Texas.

[55] Janson, M. (1988). Data Quality: The Achilles Heel of End-User Computing. Omega Journal of Management Science, 16(5), 491-502.

[56] Jaro, M. A. (1985). Current record linkage research. Proceedings of the American Statistical Association, 140-143.

[57] Johnson, J. R., et al. (1981). Characteristics of Errors in Accounts Receivable and Inventory Audits. Accounting Review, 56(2), 270-293.

[58] Johnson, R. A. & Wichern, D. W. (1988). Applied Multivariate Statistical Analysis. Englewood Cliffs: Prentice Hall.

[59] Jones, J. W. & McLeod, R., Jr. (1986). The Structure of Executive Information Systems: An Exploratory Analysis. Decision Sciences, 17, 220-249.

[60] Juran, J. M. (1992). Juran on Quality by Design: The New Steps for Planning Quality into Goods and Services. New York: Free Press.

[61] Kim, K. K. (1989). User satisfaction: a synthesis of three different perspectives. Journal of Information Systems, 4(1), 1-12.

[62] King, W. & Epstein, B. J. (1983). Assessing Information System Value: An Experiment Study. Decision Sciences, 14(1), 34-45.

[63] Kriebel, C. H. (1979). Evaluating the Quality of Information Systems. In Design and Implementation of Computer Based Information Systems. (pp. 29-43). Germantown: Sijthoff & Noordhoff.

[64] Kumar, A. & Segev, A. (1993). Cost and Availability Tradeoffs in Replicated Data Concurrency Control. ACM Transactions on Database Systems, 18(1), 102-131.

[65] Larcker, D. F. & Lessig, V. P. (1980). Perceived Usefulness of Information: A Psychological Examination. Decision Sciences, 11(1), 121-134.

[66] Laudon, K. C. (1986). Data Quality and Due Process in Large Interorganizational Record Systems. Communications of the ACM, 29(1), 4-11.

[67] Liepins, G. E., et al. (1982). Error localization for erroneous data: A survey. TIMS Studies in the Management Sciences, 19, 205-219.

[68] Liepins, G. E. (1989). Sound Data Are a Sound Investment. Quality Progress, 22(9), 61-64.

[69] Liepins, G. E. & Uppuluri, V. R. R. (Ed.). (1990). Data Quality Control: Theory and Pragmatics. New York: Marcel Dekker, Inc.

[70] Little, R. J. A. & Smith, P. J. (1987). Editing and imputation for quantitative survey data. Journal of the American Statistical Association, 82(397), 56-68.

[71] Loebl, A. S. (Ed.). (1990). Accuracy and Relevance and the Quality of Data. New York: Marcel Dekker, Inc.

[72] Madnick, S. & Wang, R. Y. (1992). Introduction to Total Data Quality Management (TDQM) Research Program. TDQM-92-01, The Total Data Quality Management (TDQM) Research Program, MIT Sloan School of Management.

[73] Maxwell, B. S. (1989). Beyond "Data Validity": Improving the Quality of HRIS Data. Personnel, 66(4), 48-58.

[74] McCarthy, J. L. (1982). Metadata Management for Large Statistical Databases. In the Proceedings of the 8th International Conference on Very Large Data bases (VLDB), (pp. 234-243) Mexico City.

[75] McCarthy, J. L. (1984). Scientific Information = Data + Meta-data. In the Proceedings of a Workshop on Database Management, California.

[76] McCarthy, J. L. (1988). The Automated Data Thesaurus: A New Tool for Scientific Information. In the Proceedings of the 11th International Codata Conference, Karlsruhe, Germany.

[77] McGee, A. M. & Wang, R. Y. (1993). Total Data Quality Management (TDQM): Zero Defect Data Capture. In The Chief Information Officer (CIO) Perspectives Conference, Tucson, Arizona: The CIO Publication.

[78] McKeown, P. G. (1984). Editing of continuous survey data. SIAM Journal on Scientific and Statistical Computing, 784-797.

[79] Melone, N. (1990). A Theoretical Assessment of the User-Satisfaction Construct in Information Systems Research. Management Science, 36(1), 598-613.

[80] Mendelson, H. & Saharia, A. (1986). Incomplete Information Costs and Database Design. ACM Transactions on Database Systems, 11(2), June.

[81] Miller, J. & Doyle, B. A. (1987). Measuring the Effectiveness of Computer-Based Information Systems in the Financial Services Sector. MIS Quarterly, 11(1), 107-124.

[82] Morey, R. C. (1982). Estimating and Improving the Quality of Information in the MIS. Communications of the ACM, 25(5), 337-342.

[83] Nichols, D. R. (1987). A Model of auditor's preliminary evaluations of internal control from audit data. The Accounting Review, 62(1), 183-190.

[84] Oman, R. C. & Ayers, T. B. (1988). Improving Data Quality. Journal of Systems Management, 39(5), 31-35.

[85] Paradice, D. B. & Fuerst, W. L. (1991). An MIS data quality methodology based on optimal error detection. Journal of Information Systems, 1(1), 48-66.

[86] Pautke, R. W. & Redman, T. C. (1990). Techniques to control and improve quality of data in large databases. In the Proceedings of Statistics Canada Symposium 90, (pp. 319-333) Canada.

[87] Redman, T. C. (1992). Data Quality: Management and Technology. New York: Bantam Books.

[88] Ronen, B. & Spiegler, I. (1991). Information as inventory: A new conceptual view. Information & Management, 21, 239-247.

[89] Sparhawk, J. C., Jr. (1993). How does the Fed data garden grow? By deeply sowing the seeds of TQM. Government Computer News (January 18).

[90] Spirig, J. (1987). Compensation: The Up-Front Issues of Payroll and HRIS Interface. Personnel Journal, 66(10), 124-129.

[91] Stratton, W. O. (1981). Accounting systems: The reliability approach to internal control evaluation. Decision Sciences, 12(1), 51-67.

[92] Strong, D. M. (1988). Design and Evaluation of Information Handling Processes. Carnegie Mellon University.

[93] Strong, D. M. (1993). Modeling Exception Handling and Quality Control in Information Processes. (No. WP 92-36). School of Management, Boston University.

[94] Strong, D. M. & Miller, S. M. (1994). Exceptions and Exception Handling in Computerized Information Processes. To appear in the ACM Transactions on Information Systems.

[95] Svanks, M. I. (1984). Integrity analysis: methods for automating data quality assurance. EDP Auditors Foundation, Inc., 30(10), 595-605.

[96] Taguchi, G. (1979). Introduction to Off-line Quality Control. Nagoya, Japan: Central Japan Quality Control Association.

[97] Tansel, A. U., et al. (1993). Temporal Databases: Theory, Design, and Implementation. Redwood City: The Benjamin/Cummings Publishing Company, Inc.

[98] Te'eni, D. (1993). Behavioral Aspects of Data Production and Their Impact on Data Quality. Journal of Database Management, 4(2), 30-38.

[99] Teorey, T. J., et al. (1986). A logical design methodology for relational databases using the extended entity-relationship model. ACM Computing Surveys, 18(2), 197-222.

[100] Tukey, J. W. (1977). Exploratory Data Analysis. Reading: Addison-Wesley.

[101] Wand, Y. (1989). A Proposal for a Formal Model of Objects. In Object-Oriented Concepts, Databases, and Applications. (pp. 602). New York: ACM Press.

[102] Wand, Y. & Weber, R. (1989). A Model of Control and Audit Procedure Changes in Evolving Data Processing Systems. The Accounting Review, 64(1), 87-107.

[103] Wand, Y. & Weber, R. (1990). An Ontological Model of an Information System. IEEE Transactions on Software Engineering, 16(11), 1282-1292.

[104] Wang, R. Y. & Kon, H. B. (1993). Towards Total Data Quality Management (TDQM). In Information Technology in Action: Trends and Perspectives. (pp. 179-197). Englewood Cliffs, NJ: Prentice Hall.

[105] Wang, R. Y., et al. (1993). Data Quality Requirements Analysis and Modeling. In the Proceedings of the 9th International Conference on Data Engineering, (pp. 670-677) Vienna: IEEE Computer Society Press.

[106] Wang, R. Y., et al. (1992). Toward Quality Data: An Attribute-based Approach. To appear in the Journal of Decision Support Systems (DSS).

[107] Wang, Y. R. & Madnick, S. E. (1990). A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. In the Proceedings of the 16th International Conference on Very Large Data bases (VLDB), (pp. 519-538) Brisbane, Australia.

[108] Wright, B. (1992). Authenticating EDI: the case for internal record keeping. EDI Forum, 82-4.

[109] Yu, S. & Neter, J. (1973). A Stochastic Model of the Internal Control System. Journal of Accounting Research, 1(3), 273-295.

[110] Zmud, R. (1978). Concepts, Theories and Techniques: An Empirical Investigation of the Dimensionality of the Concept of Information. Decision Sciences, 9(2), 187-195.