An important and often underrated skill for data professionals is ‘investigation’. Every data professional needs a Sherlock-esque ability to look at issues in their data and solve that ‘crime’, because getting data or insights wrong is about as criminal as it gets in the data profession.
Clean data is the holy grail for all things data. It is what all data teams strive towards, and it is the foundation on which analysis and algorithms can add value. The ability to obtain clean data will make or break any data-related project you undertake. As a data professional, when you encounter bad data, it is not enough to just treat the symptoms; you have to find the cure as well. Only when the cure for bad data is found will you have a continuous supply of clean data.
Treating these symptoms and finding a cure is no less than an investigation, and that investigation is an integral part of any data project. Here, based on my experience, I’ve tried to create a blueprint for it.
Data Lifecycle
To understand this blueprint, it is important to identify the data lifecycle, i.e., the life of data from its origin to its consumption. The contexts of data generation, processing and consumption all play a role in the investigation. Issues in each step of the data lifecycle can lead to bad data, and they exhibit unique characteristics that give us clues for improving data quality.
A simple Instagram post goes through multiple stages of Data Generation (front or rear camera), Data Processing (filters applied), Data Storage (first on the phone and then on Instagram’s servers) and Data Consumption (the photo or video being viewed). It is extremely important to maintain data integrity through this cycle: you want the picture you captured, along with the filters you applied, to appear on Instagram. Instagram would have no value without this integrity.
The following sections help in better understanding each stage of the data lifecycle.
Data Generation
Individuals and institutions combined generate obscene amounts of data today; the data captured records the interactions between and among these parties. Digital technologies (computers and the internet) have caused a paradigm shift in data collection. Earlier, data was captured in offline ledgers and analog devices; now electronic devices not only capture external data but also generate their own system data.
Data generated by offline systems – ledgers, documents, etc. – is digitised through data entry software, and there is usually a user engaged in digitising that data. Because of this human involvement, the process is prone to mistakes made by the users of the software. These usually manifest as random spelling mistakes, missing fields, and similar errors. Such issues usually have no pattern to them; they are as unique as the user entering the data, and they affect each element of the data differently.
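As an illustration, here is a minimal sketch of the kind of checks that surface these human-entry issues – missing fields and inconsistent spellings of the same value. It assumes a pandas DataFrame with hypothetical columns (customer_name, city); adapt the names to your own data.

```python
import pandas as pd

# Hypothetical manually entered records; column names and values are made up.
entries = pd.DataFrame({
    "customer_name": ["Acme Corp", "ACME Corp.", "Acme  Corp", None, "Globex"],
    "city": ["Mumbai", "mumbai", "Mumbay", "Pune", None],
})

# Missing fields: count absent values per column.
print(entries.isna().sum())

# Inconsistent spellings: normalise case and whitespace, then see how many
# distinct raw spellings collapse to (almost) the same normalised value.
normalised = (
    entries["customer_name"]
    .dropna()
    .str.strip()
    .str.lower()
    .str.replace(r"\s+", " ", regex=True)
)
print(normalised.value_counts())
```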
Data generated and collected from electronic systems usually exhibits a pattern. These systems are programmed to behave in a particular way based on rules and algorithms, so the causes of bad data also follow a rule-based or algorithmic pattern. In a way, electronic systems lack the ability to generate or capture bad data in a random manner.
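One way to see this in practice: group the rate of bad values by the system (or software version) that produced them. Systematic issues tend to cluster under one source rather than spread randomly. The sketch below assumes a hypothetical event table with a source_system column and a sentinel bad value; it is an illustration, not a prescription.

```python
import pandas as pd

# Hypothetical event log; 'source_system' and 'amount' are assumed columns.
events = pd.DataFrame({
    "source_system": ["pos_v1", "pos_v1", "pos_v2", "pos_v2", "pos_v2"],
    "amount": [120.0, 85.5, -1.0, -1.0, -1.0],  # -1.0 stands in for a bad sentinel value
})

# If bad values cluster under one system or version, the cause is almost
# certainly a rule or bug in that system, not random human error.
error_rate = (events["amount"] < 0).groupby(events["source_system"]).mean()
print(error_rate)
```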
Data Processing
Data processing comes at a cost. It is an investment made in the context of a larger institutional goal, e.g. revenue planning for a business. It is important to understand the context in which data processing is employed by institutions; this helps in understanding both the motivation for the processing and the interpretation of the processed data. Data issues arising from data processing are as much a function of incorrect interpretation as of bad data. Capturing the details of how data is processed, as well as how the processed data should be interpreted, is a very important step for institutions.
Data processing includes all the business rules, heuristics and analytical operations applied to data. The issues at this stage are similar to the systemic issues we see in data generation. A simple and often overlooked cure is extensive documentation; it is a good idea to overcommunicate the data processing steps and their interpretation to the consumers of the data.
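As a small illustration of what this documentation can look like, here is a hypothetical business rule written so that both the processing step and its interpretation live next to the code. The rule, threshold and column names are all assumptions made for the example.

```python
import pandas as pd

def flag_high_value_orders(orders: pd.DataFrame, threshold: float = 10_000) -> pd.DataFrame:
    """Hypothetical business rule: mark orders above `threshold` as high value.

    Notes for consumers of this data:
    - `threshold` is in the order's local currency, before tax.
    - Cancelled orders are assumed to have been removed upstream.
    """
    out = orders.copy()
    out["is_high_value"] = out["order_amount"] > threshold
    return out

# Tiny usage example with made-up numbers.
orders = pd.DataFrame({"order_id": [1, 2], "order_amount": [2_500.0, 18_000.0]})
print(flag_high_value_orders(orders))
```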
Data Storage
Digitization has caused an obscene amount of data generation and consumption. Disruptive technologies enable us to capture, process and consume almost any event happening in the world. Data storage systems need to support billions of interactions for companies like Google, Facebook and YouTube, and also support phone calls, SMS / MMS, apps and software on mobile devices, all while maintaining data security and integrity.
It is not surprising, then, that data storage technologies have proliferated, each suited to specific purposes. Along with storing data, we need these systems to integrate well with each other. Many data quality problems arise from an incomplete understanding of the interaction between different storage systems; for example, data type definitions can differ between systems, which leads to data issues. Loss of data is another common problem with data storage, so it is always a good idea to back up your data, and to have a backup for the backup.
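For example, a quick way to catch type-definition mismatches is to compare the column types of the same table as extracted from two systems. The sketch below uses hypothetical extracts and pandas dtypes as a stand-in; your systems may expose their schemas differently.

```python
import pandas as pd

# Hypothetical extracts of the same table from two storage systems.
source_a = pd.DataFrame({
    "order_id": ["1001", "1002"],                      # stored as text in system A
    "created_at": ["2024-01-05", "2024-01-06"],        # stored as text in system A
})
source_b = pd.DataFrame({
    "order_id": [1001, 1002],                          # stored as integers in system B
    "created_at": pd.to_datetime(["2024-01-05", "2024-01-06"]),
})

# Compare column types side by side; mismatches (string vs int,
# string vs timestamp) are a common source of silent data issues.
comparison = pd.DataFrame({"system_a": source_a.dtypes, "system_b": source_b.dtypes})
print(comparison)
print(comparison["system_a"].astype(str) != comparison["system_b"].astype(str))
```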
Data Consumption
The data consumption phase is the most important part of any data project; this is where data becomes value for all parties involved. Data quality issues at this stage (assuming no issues in the previous steps) most frequently happen due to miscommunication. Individuals consume data in the form of content, and little can go wrong here if the previous phases are covered. Institutions, however, consume data in the form of insights and statistics, and the context for those stats and insights is extremely important. It is paramount to define, document and communicate analysis very clearly, and it is a good idea to overcommunicate the processes followed in this phase.
The Investigation
With the different issues at play, it is often tricky to identify and diagnose the exact cause of a problem. More often than not, there are multiple problems affecting our data, and the effort required to diagnose them grows quickly with each additional issue. However, there are some techniques that help. They can be applied universally, and they always provide a clue.
- Find an alternative source of truth and check for consistency – Whenever your data does not make sense, look for an alternative source for the same data. This could be a report from a different team, similar data captured by a different process, etc. There is almost always another source you can validate against (a minimal reconciliation sketch follows this list).
- Check the data lineage – Trace the data issue back to its source. Where in the data lifecycle does the issue first appear?
- Form an opinion about the data and look for evidence that proves you right or wrong – This step is especially useful when you are not sure what the correct behaviour of the data is. Having opinions about data is crucial to becoming a data detective.
- If you can’t find the source, look for symptoms – Make an assumption about the cause of the data issue and check other related data elements. More often than not, the cause of a data issue affects more than one behaviour, and looking for other symptoms will help you zero in on the issue.
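To make the first technique concrete, here is a minimal reconciliation sketch: compare the same metric (hypothetically, daily revenue) from two independent sources and flag the days where they disagree. All column names and numbers are made up for illustration; the days that disagree become the entry point for the lineage check described above.

```python
import pandas as pd

# Hypothetical daily revenue from two independent sources: the analytics
# pipeline and the finance team's report.
pipeline = pd.DataFrame({"date": ["2024-03-01", "2024-03-02"], "revenue": [10500.0, 9800.0]})
finance = pd.DataFrame({"date": ["2024-03-01", "2024-03-02"], "revenue": [10500.0, 9100.0]})

merged = pipeline.merge(finance, on="date", suffixes=("_pipeline", "_finance"))
merged["diff"] = merged["revenue_pipeline"] - merged["revenue_finance"]

# Days where the two sources disagree beyond a small tolerance are the
# starting point for tracing the issue back through the data lineage.
print(merged[merged["diff"].abs() > 1.0])
```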
A combination of these steps has almost always helped me identify data issues and improve data quality. Hopefully, they are useful to you as well.
A few quotes about data that have always been true for me 🙂
- Anything that can go wrong with your data will go wrong.
- Don’t ever assume that your data is correct. Trust, but verify.
- If you think it’s too good to be true, you’re right.