Data Quality

 

These are personal notes. You are welcome to read them.


Definition of Data Quality

Data has quality if "it satisfies the requirement of its intended use" [1]. The main aspect of data quality is data accuracy. The other aspects, such as relevance, completeness, and accessibility, are secondary to accuracy: if the primary aspect of accuracy is not addressed, then any discussion about the secondary aspects is useless.

Issues of data quality can be raised at first by users. This leads to investigations about the data.
Issues of data quality can also be raised as a result of data profiling.

Accuracy issues:

Data profiling only finds inaccurate data that violates rules. It is not possible to find all inaccurate data, but it is possible to reduce the number of inaccuracies.
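
As an illustration, here is a minimal Python sketch of a rule-based profiling check (the records, column names, and rules are made up for the example): rows that violate a declared rule are flagged, while rows that pass every rule may still be inaccurate.

    rows = [
        {"id": 1, "age": 34, "country": "US"},
        {"id": 2, "age": -5, "country": "US"},     # violates the age range rule
        {"id": 3, "age": 28, "country": "XX"},     # violates the country code rule
    ]

    # Declared rules; passing them does not guarantee accuracy, it only means
    # no rule was violated.
    rules = {
        "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
        "country": lambda v: v in {"US", "CA", "FR", "DE"},
    }

    violations = [
        (row["id"], column)
        for row in rows
        for column, is_valid in rules.items()
        if not is_valid(row[column])
    ]
    print(violations)   # [(2, 'age'), (3, 'country')]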

 

Quality model (ISO/IEC 9126-1)

Three types of quality: internal, external and in use

Internal and external quality

Quality In Use

 

Data quality is improved by validating, enhancing and cleansing the data:

Working with addresses involves four processes:

 

Some more definitions

 

Information
Data with context and definition. There is a lot of data available. But people want information.
Knowledge could be seen as information from the human mind. The difficulty is how to express and structure the knowledge. [Michael Scofield]
Data Quality Reporting
Data consistency measures (across several sources) and data quality measures (within a source, in comparison to business needs). Reports should be understandable and actionable for management.
Data Migration and Movement
Moving data from one system to another.
Data Transformation
Field level requirements and record level requirements. Technical requirements to conform to a receiving system. Business requirements in response to business needs.
Data Policies and Procedures
Define standards for
Master Data Management
Defines and manages the reference data that is commonly used across applications of the enterprise,
establishes how to distribute this data to applications (syndication)
ensures consistency for commonly used information.

 

Semantic compatibility has two levels: the "database" level and the "logical" level. At the database level, semantic compatibility refers to equivalent field formats, table structures, same case (upper, mixed, ...), units of measure, etc. At the logical level, semantic compatibility refers to the equivalence of the meaning behind the entities: given two tables labeled "employees", do they cover the same thing? What about temporary employees, contractors, part-time employees, or past employees?

Though names can generally be parsed out into 5 fields and addresses into 8 fields, other definitions are needed, in particular about case (upper, mixed, ...) and about abbreviations of street types, states and countries.
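
As a small illustration, here is a sketch of one standardization step for addresses; the abbreviation table and the upper-case convention are illustrative assumptions, not a complete postal standard.

    # Illustrative abbreviation table; a real implementation would use a full
    # postal standard for street types, states and countries.
    STREET_TYPES = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD", "BOULEVARD": "BLVD"}

    def standardize_street(line: str) -> str:
        words = line.upper().split()                  # pick one case convention
        words = [STREET_TYPES.get(w, w) for w in words]
        return " ".join(words)

    print(standardize_street("123 Main Street"))      # 123 MAIN ST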

 

Dimensions of Data Quality

Various authors have differing ways of classifying data quality dimensions. Here is an attempt to at least define the terms:

Completeness
The proportion of data stored against the potential for 100%.
Consistency
The absence of difference, when comparing two or more representations of a thing against a definition.
Validity
Data is valid if it conforms to the syntax (format, type, range) of its definition.
Accuracy
The degree to which data correctly describes the ‘real world’ object or event being described.
Uniqueness
No entity instance (thing) will be recorded more than once based upon how that thing is identified.
Timeliness
The degree to which data represent reality from the required point in time.
Usability
Is the data understandable, simple, relevant, accessible?
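
As a rough sketch, a few of these dimensions can be measured on a record set as simple proportions; the field names and the 0-120 age rule below are illustrative assumptions.

    records = [
        {"id": 1, "email": "a@example.com", "age": 30},
        {"id": 2, "email": None,            "age": 45},
        {"id": 2, "email": "c@example.com", "age": 200},  # duplicate id, invalid age
    ]
    total = len(records)

    # Completeness: proportion of non-missing values for a field.
    completeness_email = sum(r["email"] is not None for r in records) / total

    # Validity: proportion of values conforming to their definition (0-120 here).
    validity_age = sum(0 <= r["age"] <= 120 for r in records) / total

    # Uniqueness: each entity recorded only once, based on its identifier.
    uniqueness_id = len({r["id"] for r in records}) / total

    print(completeness_email, validity_age, uniqueness_id)   # all 2/3 in this sample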

 

Small Steps for Data Quality

A lot of the data necessary for analytics is not part of normal data capture. Collecting good data for analytics involves modifying the business process, which can take a long time. However, a few short steps can improve data quality in the meantime:

Some elements taken from Kimball Design Tip #117 "Dealing with Data Quality: Don't Just Sit There, Do Something!" [4]

 

One of the roles of Data Governance is to allow people to trust the data.

 


Data Quality and Roles

This section draws from an article by Yang W. Lee and Diane M. Strong (see [2] in bibliography below).

Knowing-what is knowing the data.
Knowing-how is knowing the processes for data collection and manipulation.
Knowing-why is knowing about the "business", which helps IT better understand the data being stored.

The three key processes within a data production process:
collection of raw data
storage and maintenance of data in computer systems,
and user retrieval and manipulation of data

The five dimensions of data quality in this research: accessibility (available, retrievable), relevancy (applicable, useful), timeliness (up-to-date), completeness (detailed, not missing), and accuracy (correct, free of error). [2]

The discovery that data custodians' knowing-why is not associated with producing high-quality data disconfirms the conventional notion that data custodians working as IS professionals must understand the contextual mode of knowledge to perform their task. Specifically, data custodians' knowing-why is not significantly associated with achieving high data quality. Data consumers' knowing-why is highly associated with selected dimensions of data quality (relevancy).

Data collectors ask why people need these data; they collect relevant, accurate, and complete data.

Data custodians ask what data they should be storing; they are involved in storing complete data and in maintaining them to be timely and accurate.

Data consumers ask how to use these data. They are involved in identifying relevant data and they identify issues in accessibility of data.

                  Accessibility   Relevancy   Timeliness   Completeness   Accuracy
Data Collector                        x                         x             x
Data Custodian                                     x            x             x
Data Consumer          x              x

Notice that custodians and consumers are linked to separate dimensions of data quality: custodians are associated with timeliness, completeness and accuracy, whereas consumers are associated with accessibility and relevancy. Data collectors, being associated with dimensions from both "sides", are therefore the people best placed to be the link between IT and data consumers.

                  Knowing What   Knowing How   Knowing Why
Data Collector                                      x
Data Custodian         x
Data Consumer                         x

At minimum, data collectors must know what and how to collect data, but mostly why to collect the data and why people need the data.
Data custodians must know how and why to store the data, but mostly they must know what to store and what people need. In fact, the impact of knowing why is small for IT staff.
Data consumers must know what and why to use the data, but mostly they must know how to use the data and how others use the data.

Data custodians' knowledge about other work processes does not contribute to producing high-quality data, but data collectors' does. The custodians mostly influence completeness, in particular by instituting controls for the data collection process to improve how completely forms are filled out. Data quality is mostly a result of the data collectors' role. Hence the importance of not hiring college students for data entry. Consumers are important for determining the relevancy of data.

 


The Main Steps

Jack Olson [1] defines steps to detect inaccuracies and to address resulting issues.

The "inside-out" approach involves data profiling so as to detection inaccuracies.

These steps are generally accomplished in the order shown, in a "bottom-up" approach.

The core of the data analysis is data profiling (see section below).

In each of the steps, Jack Olson describes how to use the existing meta-data and observation of the existing data not only to find inaccuracies but also to improve the meta-data. Each step involves gathering the meta-data and analyzing the data, which includes data profiling, verification, and validation.

1) Gather information about the data: this is the meta-data. Sources are:

The analyst cannot assume that the meta-data is accurate or complete. The data profiling process will produce accurate meta-data through reverse-engineering.
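
A minimal sketch of what this reverse-engineering could look like for a single column, assuming the data is available as a list of Python values; the observed facts are then compared with the documented meta-data during verification.

    from collections import Counter

    def profile_column(values):
        non_null = [v for v in values if v is not None]
        return {
            "inferred_types": sorted({type(v).__name__ for v in non_null}),
            "null_count": len(values) - len(non_null),
            "distinct_count": len(set(non_null)),
            "min": min(non_null) if non_null else None,
            "max": max(non_null) if non_null else None,
            "most_common": Counter(non_null).most_common(3),
        }

    print(profile_column([10, 25, None, 25, 7]))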

2) Discovery. The analyst extracts facts about the data.

Generally, the data is extracted from the source so as not to interfere with operational databases. The extraction is done in 3rd normal form. Issues such as field overloading (multiple facts stored in one column), redefined fields, and repeating arrays (OCCURS) are handled during extraction. The discovery process can be done on a sample of the data if the source database is large.
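
As an illustration of one such check, here is a small sketch that flags possible field overloading when most values of a column contain a delimiter; the delimiter set and the 80% threshold are assumptions for the example.

    def looks_overloaded(values, delimiters="|;/"):
        # Count values that contain at least one of the delimiters.
        hits = sum(any(d in str(v) for d in delimiters) for v in values if v is not None)
        return hits / max(len(values), 1) > 0.8   # 80% threshold is an assumption

    # A column packing a name and a birth date into one field.
    print(looks_overloaded(["SMITH|1972-03-01", "DOE|1985-11-23", "LEE|1990-07-15"]))  # True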

3) Verification. The results of the discovery process are compared to the meta-data to determine what is inaccurate data. Because the meta-data could be wrong, any differences should be addressed as issues and raised with the data quality group. Accurate meta-data should be the outcome of this process.

4) Validation: Once the meta-data is determined to be accurate, the data is validated. The output is a list of inaccuracies, i.e. data that does not correspond to the meta-data. Here, the validation is performed against the entire data source, as opposed to the discovery process, which sometimes uses samples.

Next Steps in Improvement of Data Quality

This concludes the data profiling process. The resulting "quality facts" are grouped into issues. However, the job of the data quality analyst does not stop here. The issues must be moved into action so as to improve the data quality. In the following steps, the data profiling analyst works in collaboration with the data quality group.

1) Track issues in a tracking system. An issue groups the "quality facts" found in the data profiling processes. Ideally, the tracking system should be able to record:

2) Assess the impact on the corporation. Generally, a precise impact cannot be determined. Estimations are based on:

3) Investigate the causes. Causes are one or more of the following:

4) Develop a remedy

5) Implement the remedy and monitor after implementation so as to:

 


Element/Column Property Analysis

In this step, the analyst concentrates on the columns in isolation.

1) Get the meta-data from the following sources:

2) Discovery

See notes on data profiling below.

3) Verification: the analyst compares the meta-data and the discovery results. Any differences are raised as issues with the data profiling group.

4) Validation: the list of column properties from the improved meta-data is used to validate all of the existing data. The output is a list of row IDs for which the data does not match the column properties as described in the meta-data. Store the list in a repository.
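
A minimal sketch of this validation step, assuming the column properties are held as an in-memory dictionary and the "repository" is just another dictionary; both are simplifications for the example.

    # Column properties as recorded in the corrected meta-data (illustrative).
    column_properties = {"status": {"allowed": {"OPEN", "CLOSED", "PENDING"}}}

    def validate_column(rows, column, props):
        bad_row_ids = []
        for row in rows:
            if row[column] not in props["allowed"]:
                bad_row_ids.append(row["id"])
        return bad_row_ids

    rows = [{"id": 1, "status": "OPEN"}, {"id": 2, "status": "CLSD"}]
    # The "repository" here is just a dictionary keyed by column name.
    repository = {"status": validate_column(rows, "status", column_properties["status"])}
    print(repository)   # {'status': [2]}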

 


Structural Analysis

 

Analyze business objects. One or more tables that describe an entity constitute a business object.

Structural analysis concentrates on looking for functional dependencies and synonyms. Basically, functional dependencies are candidates for normalization, i.e. they are data dependencies for which it would make sense to move the columns into separate tables. Synonyms are columns that repeat the same information (exact equivalents or transformed data).

1) Gather metadata

Sources:

2) Discovery

2a) Functional dependencies:

List the candidate keys (columns that could be identifiers). Natural keys are not as easy to find; finding them requires knowledge of the business objects.

2b) Synonyms

Look for and list the candidates in the repository.

3) Test

Test using samples. If a dependency exists, it will also exist in the samples; in fact, dependencies must exist in all samples. Using samples will result in too many candidates (false positives), but this number is reduced as more samples are taken. Store the candidate keys in the repository and keep track of the results of the discovery and testing.
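
Here is a sketch of testing one candidate functional dependency A -> B on a sample: the dependency holds in the sample only if no value of A maps to more than one value of B (the zip/city columns are an illustrative example). A dependency confirmed on samples remains a candidate until it is validated against the full data.

    from collections import defaultdict

    def holds_in_sample(rows, a, b):
        # The dependency a -> b holds if each value of a maps to exactly one b.
        seen = defaultdict(set)
        for row in rows:
            seen[row[a]].add(row[b])
        return all(len(values) == 1 for values in seen.values())

    sample = [
        {"zip": "10001", "city": "New York"},
        {"zip": "10001", "city": "New York"},
        {"zip": "94105", "city": "San Francisco"},
    ]
    print(holds_in_sample(sample, "zip", "city"))   # True in this sample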

 


Data Rule Analysis

Data rules are business rules that apply to data. Business rules also include process rules, which are not the subject of data quality. Some rules are "soft" rules in the sense that valid exceptions to the rules exist.


1) Gather metadata
Possible sources: business procedures, application source code, stored procedures in database
Also: speculation (with input from data profiling group)

2) No discovery is reasonably possible for data rules

3) Testing
Test the rules gathered in step 1.
Note that a rule violation does not identify which value is inaccurate, because it is not possible to know which of the values is wrong. For example, if the rule is that the entry date is before the closed date, then a violation indicates that an inaccuracy exists, but we do not know if it is the entry date or the closed date that is wrong.
Store the row IDs of violations.
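
A minimal sketch of testing the entry-date/closed-date rule mentioned above and storing only the row IDs of the violations; the field names and dates are made up.

    from datetime import date

    rows = [
        {"id": 1, "entry_date": date(2023, 1, 5), "closed_date": date(2023, 2, 1)},
        {"id": 2, "entry_date": date(2023, 3, 9), "closed_date": date(2023, 1, 4)},  # violation
    ]

    # Keep only the row IDs: the rule tells us something is wrong in row 2,
    # but not whether the entry date or the closed date is the inaccurate value.
    violation_ids = [r["id"] for r in rows if not (r["entry_date"] < r["closed_date"])]
    print(violation_ids)   # [2]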

4) Validation
Run validation for all of the source. Store row IDs of the violations.
Because some validations take a lot of computer resources, not all tests can be done. Classify rules according to importance so as to be able to decide which rules to validate.

Note that it may be more useful to check and therefore correct rules directly in the source systems. It is less useful to check the rules when loading the data warehouse because the data can only be corrected in the source.

A system of record originates and validates data. A system of reference is a validated reference for other systems, but not a system for origination.

 

 


Complex Data Rule Analysis

 

 

 


Value Inspection

1) Gather metadata
Rules come from the business analysts or the user community. Generally, they are simple rules that check that the "data makes sense".
Often, these rules involve aggregation, counts (cardinality), distinct values or frequency of values (most frequent or least frequent).
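
A small sketch of such a value inspection, computing the row count, the number of distinct values, and the most and least frequent values of one column (the column content is illustrative); the rare values are often the ones worth showing to the business analysts.

    from collections import Counter

    values = ["FR", "FR", "FR", "DE", "DE", "US", "ZZ"]
    freq = Counter(values)

    print("row count:", len(values))
    print("distinct values:", len(freq))
    print("most frequent:", freq.most_common(2))       # [('FR', 3), ('DE', 2)]
    print("least frequent:", freq.most_common()[-2:])  # rare values are often suspect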

 

 

 


Data Profiling

 

Who? 
The analyst performs the data profiling, and raises the resulting issues with a group composed of:

The potential results of the data profiling exercise are:

 

 


Metadata

 

From [3] Won Kim

Types of metadata

Existing standards:

 

Within the data warehouse world, two aspects of metadata are often distinguished: the "back-room metadata", linked to the ETL processes, and the "front-room metadata", linked to the reporting tools. The back-room metadata defines the data types, the source of each piece of data, the moment when it was loaded and possibly some statistics about loading (errors, number of rows, ...). The front-room metadata is a kind of dictionary defining the meaning of each piece of data.
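
As an illustration only, here is a sketch of what one back-room metadata record might hold for a single load of a single table; the field names are assumptions, not a standard.

    from dataclasses import dataclass
    from datetime import datetime

    @dataclass
    class BackRoomMetadata:
        table_name: str
        source_system: str
        column_types: dict        # column name -> declared data type
        loaded_at: datetime       # when this load happened
        rows_loaded: int
        rows_rejected: int

    meta = BackRoomMetadata(
        table_name="customer",
        source_system="crm",
        column_types={"customer_id": "integer", "name": "varchar(100)"},
        loaded_at=datetime(2024, 1, 15, 2, 30),
        rows_loaded=120450,
        rows_rejected=12,
    )
    print(meta.table_name, meta.rows_rejected)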

 

Single Version of the Truth

Central metadata management generally exists in parallel with a staging area or a common data warehouse. Separate stove-pipe data marts may lead to inconsistent reporting due to the lack of a single version of the truth. It is easier to build separate data marts, sometimes in a matter of weeks, than to coordinate definitions of terms across the company.

 

Notes

Note that some define a data warehouse as collection of data (of course), data model (this should be published to be useful), and metadata (often sorely missing).

 


Quality Assurance for Business Analysts

Quality is the degree to which the requirements are met. These requirements may be stated, unknown or implicit.

Quality may be attained by:

Verification and validation:

Three principles of quality:

Quality assurance includes internal and external activities:

An important part in overall data quality is defining the "Single Source of the Truth."

 

Inspections / Peer Reviews

These inspections are done during all phases of a project, from requirements analysis to design to development. Only defect detection is discussed in the inspection; the repairs are done later, off-line, by the author (this has been shown to be the most efficient approach). Use constructive criticism: the work is being judged, not the person. The discussion is among peers, without issues of hierarchy.

 

Metrics

Metrics are useful for trending over time or for comparing to an industry standard. Gather metrics to use them, and not as busy-work.

Examples of metrics:

 

Testing

Test for:

Types of tests:

 

 

My Red Flag List

  • Incoherence between environments (production differs from QA)
  • Inconsistent processes for similar tasks. Similar tasks should be done by similar processes.
  • Documentation exists, but it is unclear. It may be badly written, or it may indicate that the underlying process is not thought out well
  • People admit it is complicated when it should not be
  • Lack of a coherent architecture. Complex process to do what appears to be a simple task
  • Each module should do a coherent sub-task. Take out anything that does not fit logically
  • Watch for overlapping unique IDs from two different systems
  • "World", "Public" or individual user IDs granted access. Instead, implement a group-based security policy: privileges should be granted only to groups and an individual's access should be inherited through the respective security group.

 


Miscellaneous

Bibliography