In my first job after graduate school I wrote a white paper of sorts detailing data quality as I understood it then. When I moved to GitLab I added it to our Data Team handbook via this merge request. It was then dramatically changed via this merge request (and probably some others). 🙃
Mainly because I don’t want to go digging through merge requests again, I’ve reproduced it here without changing it.
What is data quality?
Data quality is a measurement of the value of data to the business, meaning it is dependent on the task at hand. Data that are high quality in one context can be low quality in another. However, there are common dimensions that span contexts, and these are detailed here. Also discussed are the relation of Data Integrity to Data Quality, the kinds of Data Errors, and how to improve Data Quality by decreasing the number of errors.
Dimensions
Data Accuracy
Data accuracy is a measure of how well the values in the database match the real-world values they represent. An example of inaccurate data would be an Opportunity listing the wrong value for a contract [1].
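As a rough illustration, an accuracy check needs a trusted reference to compare against. The sketch below assumes hypothetical field names (opportunity_id, recorded_amount) and a reference mapping built from, say, signed contracts; it simply reports the fraction of checkable records whose recorded value matches the reference.

```python
# A minimal sketch of an accuracy check: compare recorded values against a
# trusted reference source. The field names are hypothetical, not from any
# specific schema.

def accuracy_rate(records, reference):
    """Fraction of checkable records whose recorded amount matches the reference value."""
    matches = 0
    checked = 0
    for record in records:
        ref_value = reference.get(record["opportunity_id"])
        if ref_value is None:
            continue  # no reference to compare against; skip rather than guess
        checked += 1
        if record["recorded_amount"] == ref_value:
            matches += 1
    return matches / checked if checked else None


records = [
    {"opportunity_id": "opp-1", "recorded_amount": 50_000},
    {"opportunity_id": "opp-2", "recorded_amount": 12_000},
]
reference = {"opp-1": 50_000, "opp-2": 15_000}  # e.g. values taken from signed contracts
print(accuracy_rate(records, reference))  # 0.5
```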
Information Completeness
Information completeness refers to the extent to which data are of sufficient breadth, depth, and scope for the task at hand. Absolute measures of breadth, depth, and scope will yield different relative assessments of completeness depending on how the data are used.
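To make the context dependence concrete, here is a minimal sketch in which completeness is the fraction of a task's required fields that are actually populated. The tasks and field names are invented for illustration.

```python
# A small sketch of task-dependent completeness: the same record can be
# complete for one task and incomplete for another, depending on which
# fields that task requires.

REQUIRED_FIELDS = {
    "billing": {"account_id", "billing_address", "contract_amount"},
    "marketing": {"account_id", "industry", "contact_email"},
}

def completeness(record, task):
    """Fraction of the task's required fields that are present and non-null."""
    required = REQUIRED_FIELDS[task]
    populated = {k for k, v in record.items() if v not in (None, "")}
    return len(required & populated) / len(required)

record = {"account_id": "a-42", "billing_address": "10 Main St",
          "contract_amount": 50_000, "industry": None, "contact_email": None}
print(completeness(record, "billing"))    # 1.0 -> complete for billing
print(completeness(record, "marketing"))  # ~0.33 -> incomplete for marketing
```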
Time Dimensions
The time-related dimensions of data quality have three components: currency, volatility, and timeliness.
Data Currency [2] refers to how promptly the data are updated when an external value changes. A common and useful measure of data currency is a last-updated and/or last-reviewed metadata tag.
Data Volatility refers to how frequently the real world values are updated. Company names are generally of low volatility while product offerings are highly volatile.
Data Timeliness is a measure of how current the data are for a given task. This is highly context dependent but can usually be expressed as a combination of currency and volatility along with the specific data usage.
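A rough sketch of how currency and volatility might combine into a timeliness check: a field is stale when the time since its last update exceeds how often its real-world value tends to change. The volatility figures below are assumptions, not measured values.

```python
from datetime import datetime, timedelta

# Rough expected change frequency (volatility) per field; invented for illustration.
VOLATILITY = {
    "company_name": timedelta(days=365 * 5),   # low volatility
    "product_offerings": timedelta(days=30),   # high volatility
}

def is_timely(field, last_updated, now=None):
    """True if the time since the last update is within the field's typical rate of change."""
    now = now or datetime.utcnow()
    return (now - last_updated) <= VOLATILITY[field]

checkpoint = datetime(2024, 6, 1)
print(is_timely("product_offerings", datetime(2024, 1, 1), now=checkpoint))  # False: stale
print(is_timely("company_name", datetime(2024, 1, 1), now=checkpoint))       # True
```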
Data Consistency
Data consistency, also known as coherence and validity, is a measure of whether the data adhere to or violate semantic rules. That is, given a set of allowed values or ranges for a particular field, do the data adhere to those rules? An example of this would be Turn Around Time [3]. Common values would be 0 to 365 days. That is, we would expect some amount of time to pass before a result is returned (> 0), but more than a year seems unreasonable. Zero is an indication of no value for that field. With more experience we might realize that 0 to 180 days is more reasonable.
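A minimal sketch of this kind of validity check, using the Turn Around Time rule described above (the 0 to 365 day range comes straight from the text; tightening it to 180 is just a change to one parameter):

```python
# Flag turn-around-time values that fall outside the agreed range.

def check_turn_around_time(days, max_days=365):
    """Return a list of rule violations for a turn-around-time value."""
    violations = []
    if days <= 0:
        violations.append("expected some time to pass before a result is returned")
    if days > max_days:
        violations.append(f"more than {max_days} days seems unreasonable")
    return violations

print(check_turn_around_time(14))   # []
print(check_turn_around_time(0))    # flags the zero, which the text treats as "no value"
print(check_turn_around_time(400))  # flags a value over a year
```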
Data Uniqueness
Data uniqueness, also known as data deduplication, is a measure of whether or not real-world entities are represented multiple times in the same dataset. The same idea can likewise apply to data features within a single database entry. This also has implications for reconciling data across databases (e.g., a curated products database vs. an ordered products database).
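A small sketch of a uniqueness check: group records by the attributes that are supposed to identify a real-world entity and report any group with more than one row. The identifying fields (name, registered address) are assumptions made for the example.

```python
from collections import defaultdict

def find_duplicates(records, key_fields=("name", "registered_address")):
    """Return groups of records that appear to describe the same entity."""
    groups = defaultdict(list)
    for record in records:
        # Normalize lightly so trivial differences don't hide duplicates.
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        groups[key].append(record)
    return {key: rows for key, rows in groups.items() if len(rows) > 1}

records = [
    {"id": 1, "name": "Acme Corp", "registered_address": "10 Main St"},
    {"id": 2, "name": "acme corp", "registered_address": "10 Main St "},
    {"id": 3, "name": "Globex", "registered_address": "5 Side Ave"},
]
print(find_duplicates(records))  # ids 1 and 2 collapse to one entity
```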
Other dimensions may include precision, accessibility, credibility, traceability, and confidentiality. I have chosen to exclude these dimensions because they are either sufficiently covered by the other dimensions (precision and credibility) or not particularly relevant to the broader quality conversation (accessibility, traceability, and confidentiality).
Data Integrity
Data Integrity is the trustworthiness of the data, typically built upon Consistency and Accuracy. Therefore, Data Integrity is a result of Data Quality. Data with a high degree of integrity can be of low quality if it's not suited for the task at hand and does not provide value to the business [4].
Data Errors
Errors occur, by definition, when data are inaccurate, incomplete, out of date, inconsistent, or duplicated.
Data Quality Improvement
Data Quality improvement is separated into three categories: Prevention, Detection, and Repair of errors. Data errors look different for each of the data quality dimensions above, but in every case they can be addressed using these three techniques.
Prevention - Data error prevention means slightly different things depending on the dimension. In general, however, it means putting rules in place during data input and curation so that common semantic errors are avoided.
Detection - Data error detection means using a combination of automated and manual checks to find issues within specific dimensions. Automated processes include dimension-specific scripts that find common errors; manual checks include audits of particular data by an expert curator (a sketch applying all three techniques follows this list).
Repair - Data error repair is simply correcting an error once it has been identified via the prevention or detection methodologies.
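Here is the combined sketch referenced above, applying all three techniques to the Turn Around Time rule: prevention rejects bad input before it is stored, detection scans rows that are already stored, and repair corrects flagged rows (clearing them for re-review is just one possible policy).

```python
def prevent(value):
    """Prevention: refuse to store a value that violates the rule."""
    if not 0 < value <= 365:
        raise ValueError(f"turn-around time {value} violates the 0-365 day rule")
    return value

def detect(rows):
    """Detection: find already-stored rows that violate the rule."""
    return [row for row in rows
            if row["turn_around_days"] is not None
            and not 0 < row["turn_around_days"] <= 365]

def repair(rows):
    """Repair: correct flagged rows; here we clear the value for manual review."""
    for row in detect(rows):
        row["turn_around_days"] = None

stored = [{"id": 1, "turn_around_days": 14}, {"id": 2, "turn_around_days": 400}]
print(detect(stored))  # flags id 2
repair(stored)         # id 2 is cleared pending a corrected value
```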
[1] Clearly this line was added while at GitLab as it was my first adventure with Salesforce data.
[2] I like this definition because it explicitly highlights the gap between the real world and its representation in the database.
[3] This is a fun artifact of my time at Concert Genetics, née NextGxDx. Turn around time was the expected time for a laboratory to return a result.
[4] Note that referential integrity isn't discussed here.
I was so ready to dunk on yet another data quality definition, but I think this is the first one I have seen that talks about fitness for purpose — and doesn’t drone on about the number of nulls.
Fun pet peeves with data quality:
- masked missing values, where instead of a null you have some plausible default value, are so much worse than nulls (see the sketch after this list)
- imputation of missing values aimed at one use case can make the data unusable for other purposes
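Since masked missing values are easy to overlook, here is a rough sketch of one way to hunt for them: look for suspiciously frequent "plausible default" values such as epoch dates or zero amounts. The sentinel list is an example, not authoritative.

```python
from collections import Counter

# Common "plausible default" values that often hide missing data; adjust per dataset.
SUSPECT_SENTINELS = {0, "", "N/A", "unknown", "1970-01-01", "1900-01-01"}

def suspected_masked_missing(values, threshold=0.05):
    """Report sentinel-looking values that make up more than `threshold` of the column."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()
            if v in SUSPECT_SENTINELS and c / total > threshold}

signup_dates = ["2023-04-02", "1970-01-01", "2022-11-19", "1970-01-01", "2024-01-05"]
print(suspected_masked_missing(signup_dates))  # {'1970-01-01': 0.4}
```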
I think it is weird that discussions of data quality rarely mention the original purpose of the data or the importance of understanding the process that generates it.