Data quality is not binary; it sits on a grayscale, and data engineers can continually improve it. Continual quality improvement is the process by which a team works toward data quality excellence.
Dirty data may refer to several things: Redundant, Incomplete, Inaccurate, Inconsistent, Missing Lineage, Non-analyzable, and Insecure.
- Redundant: A Person’s address data may be duplicated across data sources, so collecting data from these multiple sources results in duplicates.
- Incomplete: A Person’s address record may not have Pin Code (Zip Code) information. There could also be cases where the data may be structurally complete but semantically incomplete.
- Inaccurate: A Person’s address record may have the wrong city and state combination (e.g., [City: Mumbai, State: Karnataka] or [City: Salt Lake City, State: California]).
- Inconsistent: A Person’s middle name in one record may differ from the middle name in another record. Inconsistency arises from redundancy.
- Missing Lineage (and Provenance): A Person’s address record may not reflect the current address because the user never updated it; without provenance metadata (when and where the record was captured), there is no way to judge its freshness.
- Non-analyzable: A Person’s email record may be encrypted and therefore cannot be analyzed directly.
- Insecure: A Person’s bank account number is available but not accessible due to privacy regulations.
The opposite of Dirty is Clean. Cleansing is the art of correcting data after it is collected. Commonly used techniques are enrichment, de-duplication, validation, lineage and provenance (meta-information) capture, and imputation; a code sketch of these steps follows the list below.
- Enrichment is a mitigation technique for incomplete data. A data engineer enriches a Person’s address record by mapping the (city, state) tuple to a country and adding the country information.
- De-Duplication is a mitigation technique for redundant data. The data system identifies and drops duplicates using data identities. Inconsistencies caused by redundancies require use-case-specific mitigations.
- Validation is a mitigation technique that applies domain rules to verify correctness. An email address can be checked for syntactic correctness with a regular expression such as `` \A[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}\Z `` (matched case-insensitively). Data may be accepted or rejected based on validations.
- Lineage and Provenance capture is a mitigation technique for data whose source or freshness is critical. An image-grouping application will require metadata about a collected image series (video), such as the phone type and capture date.
- Imputation is a mitigation technique for incomplete data (data with information gaps due to poor collection techniques). A heart-rate time series may be dirty, with values missing at minutes 1 and 12. Using data with holes can cause downstream failures, so imputation fills each gap with the previous or next value.
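A minimal Python sketch of these cleansing steps, using hypothetical address, email, and heart-rate records (field names such as `person_id`, the `CITY_STATE_TO_COUNTRY` lookup, and the simplified email pattern are illustrative assumptions, not part of any specific system):

```python
import re

# Illustrative lookup (assumption): maps a (city, state) tuple to a country for enrichment.
CITY_STATE_TO_COUNTRY = {
    ("Mumbai", "Maharashtra"): "India",
    ("Salt Lake City", "Utah"): "USA",
}

# Simplified email pattern for validation; a pragmatic subset of the fuller expression above.
EMAIL_RE = re.compile(r"\A[\w.+-]+@(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,6}\Z")

def enrich(record):
    """Enrichment: fill in a missing country from the (city, state) tuple."""
    if not record.get("country"):
        record["country"] = CITY_STATE_TO_COUNTRY.get((record.get("city"), record.get("state")))
    return record

def deduplicate(records, key=lambda r: r["person_id"]):
    """De-duplication: keep the first record seen for each data identity (person_id is assumed)."""
    seen, unique = set(), []
    for record in records:
        identity = key(record)
        if identity not in seen:
            seen.add(identity)
            unique.append(record)
    return unique

def is_valid_email(record):
    """Validation: accept or reject a record based on a domain rule (email syntax)."""
    return bool(EMAIL_RE.match(record.get("email", "")))

def forward_fill(series):
    """Imputation: fill gaps (None) in a time series with the previous observed value."""
    filled, last = [], None
    for value in series:
        last = value if value is not None else last
        filled.append(last)
    return filled
```

For example, `forward_fill([72, None, 75])` returns `[72, 72, 75]`; backward fill or interpolation could be substituted depending on the use case.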
These cleansing techniques reduce data dirtiness after the data is collected. However, dirtiness originates at creation time, collection time, and correction time, so a cleansing process may not always result in clean data.
A great way to start with data quality is to describe the attributes of good-quality data and their related measures. With that description in hand, apply techniques like CAPA (corrective action, preventive action) incrementally and iteratively as part of a continual quality improvement process. Once the current measures show confidence in data quality, the data engineer can introduce new KPIs or set tighter targets for existing ones.
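A minimal sketch of one CAPA review step, assuming hypothetical KPI names, directions, and thresholds (none of them come from the text):

```python
# Hypothetical KPIs and targets for the current improvement cycle (assumptions).
KPI_TARGETS = {
    "completeness_pct": ("min", 95.0),   # at least 95% of records complete
    "duplicate_rate_pct": ("max", 1.0),  # at most 1% duplicate records
}

def capa_review(measured):
    """Return KPIs that miss their targets; each gap feeds a corrective or preventive action."""
    gaps = []
    for kpi, (direction, target) in KPI_TARGETS.items():
        value = measured.get(kpi)
        missed = (
            value is None
            or (direction == "min" and value < target)
            or (direction == "max" and value > target)
        )
        if missed:
            gaps.append(f"{kpi}: measured {value}, target {direction} {target}")
    return gaps

print(capa_review({"completeness_pct": 91.0, "duplicate_rate_pct": 0.4}))
# -> ['completeness_pct: measured 91.0, target min 95.0']

# Once current targets are met consistently, tighten them or add a new KPI for the next cycle,
# e.g. KPI_TARGETS["freshness_hours"] = ("max", 24.0)
```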
Example: A research study requires collecting stroke imaging data. A description of quality attributes would be:
| Data Quality Attribute | Description |
| --- | --- |
| Data Lineage & Provenance | Countries: {India} <br> Imaging Types: {CT} <br> Source: {Stroke Centers, Emergency} <br> Method – Patient Position: supine <br> Method – Scan extent: C2 to vertex <br> Method – Scan direction: caudocranial <br> Method – Respiration: suspended <br> Method – Acquisition type: volumetric <br> Method – Contrast: {Non-contrast CT, PCT with contrast} |
| Redundancy | Multiple scans of the same patient are acceptable but need to be separated by at least one week. |
| Completeness | Each imaging scan should be accompanied by a radiology report that describes these features of the stroke: <br> Time from onset: {early hyperacute (0-6 h), late hyperacute (6-24 h), acute (1-7 d), sub-acute (1-3 w), chronic (3 w+)} <br> CBV (Cerebral Blood Volume) in ml/100 g of brain tissue <br> CBF (Cerebral Blood Flow) in ml/min/100 g of brain tissue <br> Type of Stroke: {Hemorrhagic-Intracerebral, Hemorrhagic-Subarachnoid, Ischemic-Embolic, Ischemic-Thrombotic} |
| Accuracy | Each image is read by three separate radiologists to reduce human error and bias. Anonymized patient history is sent to each radiologist. |
| Security and Privacy | Patient PII is not exposed to the radiologist interpreting the result or the researcher analyzing the data. |
As you can see from the table of attributes for CT Stroke imaging data, the quality description is data-specific and use-specific.
Data engineers use these attribute descriptions to compute attribute-specific metrics on a data sample and so measure overall data quality. The attribute descriptions are the North Star for pursuing excellence in data quality.
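For instance, here is a minimal sketch of two attribute-specific metrics from the stroke-imaging table, assuming scan records are dictionaries with hypothetical `patient_id`, `scan_date`, and `report` fields:

```python
from datetime import date, timedelta

# Required report features from the completeness row above; the record layout is a hypothetical sketch.
REQUIRED_REPORT_FIELDS = {"time_from_onset", "cbv", "cbf", "stroke_type"}

def completeness(scans):
    """Fraction of scans whose radiology report contains every required feature."""
    if not scans:
        return 0.0
    complete = sum(REQUIRED_REPORT_FIELDS <= set(s.get("report", {})) for s in scans)
    return complete / len(scans)

def redundancy_violations(scans):
    """Fraction of consecutive same-patient scans taken less than one week apart."""
    by_patient, pairs, violations = {}, 0, 0
    for s in scans:
        by_patient.setdefault(s["patient_id"], []).append(s["scan_date"])
    for dates in by_patient.values():
        dates.sort()
        for earlier, later in zip(dates, dates[1:]):
            pairs += 1
            violations += (later - earlier) < timedelta(weeks=1)
    return violations / pairs if pairs else 0.0

sample = [
    {"patient_id": "p1", "scan_date": date(2023, 1, 1),
     "report": {"time_from_onset": "acute", "cbv": 3.8, "cbf": 45.0, "stroke_type": "Ischemic-Embolic"}},
    {"patient_id": "p1", "scan_date": date(2023, 1, 3),
     "report": {"stroke_type": "Ischemic-Embolic"}},
]

# The grayscale quality report is a set of attribute-specific metrics, not one pass/fail flag.
print({"completeness": completeness(sample), "redundancy_violations": redundancy_violations(sample)})
```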
Summary: Data creation, collection, and correction improve over time when measured against such criteria. There will always be data quality blind spots and leakages. Hence, data engineers report data quality on a grayscale with multiple attribute-specific metrics.









