An introduction to data quality
According to a survey from Information Week Analytics, data quality is still the « No. 1 “barrier to success” cited by both Business Intelligence and analytics types and information management professionals».
The data contained in a company’s databases describe facts from the real world as they were when entered in the information system. But how long do they stay up to date, given that reality keeps moving on? As time goes by, a « distance » is created between the data stored in databases and the reality they represent.
This distance between « reality » and the data that describe it constitutes a business risk whose importance depends on how significant the data are to the « business » needs. Take customers’ mailing addresses as an example. How does the company manage address changes? What consequences does an « out of date » address have for the company?
This risk grows when the data feed decision processes as part of Business Intelligence, or are exchanged with partners outside the company (customers, suppliers, administrations…). How do you measure the impact of « wrong » data transmitted to the tax administration, for instance?
For Rever Data Engineers, as far as “data quality” is concerned, you must distinguish two essential aspects:
- Data timeliness: the objective is to guarantee that the data match the real-world facts they represent (e.g. the customers’ addresses in a CRM). This requires organisational procedures within the company to guarantee correctness (the role of the data steward, for example).
- Data accuracy within information systems: the objective is to guarantee that the recorded data respect the rules defined by the « business » and do not contradict one another. This aspect of quality can be controlled by tools. A couple of examples of inconsistent or contradictory data: a nonexistent date (31 June, or 29 February in a year that is not a leap year) or a customer’s number of children that differs from one database to another (does Mrs. X have 2 or 3 children?).
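The nonexistent-date case is the kind of rule a tool can check automatically. The following is a minimal sketch in Python, not a description of any particular product: the record layout, the `birth` field and the helper name `is_valid_date` are assumptions made for illustration only.

```python
from datetime import date


def is_valid_date(year: int, month: int, day: int) -> bool:
    """Return True if (year, month, day) is a date that exists in the calendar."""
    try:
        date(year, month, day)  # raises ValueError for impossible dates
    except ValueError:
        return False
    return True


# Hypothetical records; the field names are illustrative assumptions.
records = [
    {"id": 1, "birth": (1980, 6, 30)},
    {"id": 2, "birth": (1975, 6, 31)},  # 31 June does not exist
    {"id": 3, "birth": (2023, 2, 29)},  # 2023 is not a leap year
    {"id": 4, "birth": (2024, 2, 29)},  # 2024 is a leap year, so this is valid
]

# Collect the identifiers of records whose date breaks the rule.
invalid_ids = [r["id"] for r in records if not is_valid_date(*r["birth"])]
print(invalid_ids)
```

Relying on the standard library's own calendar logic, rather than hand-written month and leap-year rules, keeps the check short and avoids reimplementing the rule incorrectly.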
Data timeliness results from the work of the company and its employees, while data accuracy can be managed by applications and belongs to the IT field:
Matching between databases and the real world
A company has to measure data quality from a « risk » perspective in line with its business goals. The efforts required to reach an acceptable level of data quality must be proportional to the risks incurred. For example, it is useless, and possibly quite costly, to keep mailing addresses up to date if the company does not use them.
It is up to the « business » to define and enforce a required level of data quality according to the importance of each of its missions, on the understanding that data quality can never be certified at 100%.
- Separating the technical part from the organisational part
- Detecting data inconsistencies in databases, inside an application or among several applications
- Identifying the data related to the critical activity of the « business » and focusing all efforts on constantly improving data quality for that particular activity
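Detecting inconsistencies among several applications can be sketched as a comparison of the same field across two data stores. This is only an illustration under stated assumptions: the two dictionaries stand in for real databases, and the names `crm`, `billing`, `children` and `find_conflicts` are hypothetical.

```python
def find_conflicts(db_a: dict, db_b: dict, field: str) -> dict:
    """Return {customer_id: (value_in_a, value_in_b)} for every customer
    present in both stores whose `field` values differ."""
    conflicts = {}
    # Only customers known to both stores can contradict each other.
    for customer_id in db_a.keys() & db_b.keys():
        value_a = db_a[customer_id].get(field)
        value_b = db_b[customer_id].get(field)
        if value_a != value_b:
            conflicts[customer_id] = (value_a, value_b)
    return conflicts


# Hypothetical extracts from two applications, keyed by customer identifier.
crm = {
    "C001": {"name": "Mrs. X", "children": 2},
    "C002": {"name": "Mr. Y", "children": 1},
}
billing = {
    "C001": {"children": 3},  # contradicts the CRM
    "C002": {"children": 1},
    "C003": {"children": 0},  # unknown to the CRM, so not compared
}

conflicts = find_conflicts(crm, billing, "children")
print(conflicts)
```

Such a check can only report that the two values disagree. Deciding whether Mrs. X has 2 or 3 children remains an organisational question.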
Source: Analytics & BI Survey