In my previous post about data quality in the Big Data era, we saw some of the challenges raised by the recently born data operating system that came with Hadoop 2.0 and YARN. In Part 2 of this series, I'd like to explore how this new framework changes the traditional landscape of the data quality dimensions.
Let's first review two more changes brought by Big Data systems compared with traditional databases, then look at their consequences:
- The lambda architecture and the immutability of data
- How data quality dimensions are impacted by these changes
To resolve an important latency issue with the Hadoop system, a new architecture appeared that handles large amounts of data arriving at high velocity. It is called the lambda architecture, and was developed by Nathan Marz while at Twitter.
Although there are many details and benefits to the lambda architecture (check out this book for full detail), the 3 main benefits are as follows:
- The tolerance to human errors
- The tolerance to hardware crashes
- Scalability and quick response time
The lambda architecture itself is composed of 3 layers:
- Batch layer which stores all data in HDFS
- Serving layer which contains "batch views" on the data
- Speed layer which provides low latency access to the data
The Lambda Architecture (cred. http://lambda-architecture.net/)
Digging into the Lambda Architecture Layers
The batch layer stores all the data with no constraint on the schema. The schema-on-read is built in the batch views in the serving layer. Creating schema-on-read views requires algorithms that parse the data from the batch layer and convert it into a readable form. This allows input data to evolve freely, as there is no constraint on its structure. In return, the algorithm that builds the view is responsible for managing structural changes so that it still delivers the view as expected.
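The idea can be illustrated with a minimal sketch. The record layout, the renamed field ("name" becoming "full_name") and the function names below are assumptions made up for this example, not part of any Hadoop API. The raw lines are stored as-is, and the schema is only applied when the view is built:

```python
import json

# Hypothetical raw records kept untouched in the batch layer.
# The source renamed "name" to "full_name" at some point, and one line is corrupt.
RAW_RECORDS = [
    '{"id": 1, "name": "Ada", "ts": 1}',
    '{"id": 2, "full_name": "Grace", "ts": 2}',
    'not json at all',
]

def parse_record(line):
    """Apply the schema at read time; return None for unreadable records."""
    try:
        raw = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(raw, dict) or "id" not in raw:
        return None
    # The view algorithm absorbs the structural change of the input.
    name = raw.get("full_name", raw.get("name"))
    return {"id": raw["id"], "name": name}

def build_batch_view(lines):
    """Build a view keyed by id, skipping records that cannot be parsed."""
    view = {}
    for line in lines:
        record = parse_record(line)
        if record is not None:
            view[record["id"]] = record
    return view

BATCH_VIEW = build_batch_view(RAW_RECORDS)
```

Both record layouts end up with the same shape in the view, which is exactly the responsibility that moves from the storage layer to the algorithm.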
This shows a coupling between the data and the algorithms used for serving the data. Focusing on data quality alone is therefore not enough, and we must also ask about the quality of the algorithms. As the system lives and evolves, the algorithms may become more and more complex. They must not be regarded as black boxes: a clear understanding of what they do is essential for good data governance. Moreover, data quality transformations can be applied during batch view creation so as to provide data of better quality to the consumers of the views.
The speed layer handles the data that has not yet been delivered in the batch views because of the latency of the batch layer. It processes the most recent data and creates real-time views, so that the user gets a complete view of the data. The combination of the batch views and real-time views allows the user to get up-to-date information from the system.
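As a rough sketch of how a query could combine the two kinds of views, assume both views hold page-view counts keyed by URL, and that the real-time view only covers events not yet absorbed by a batch run. The names and numbers are hypothetical:

```python
def merge_views(batch_view, realtime_view):
    """Return batch totals plus the counts seen since the last batch run."""
    merged = dict(batch_view)  # copy, so the batch view itself is never modified
    for key, count in realtime_view.items():
        merged[key] = merged.get(key, 0) + count
    return merged

# Hypothetical views: counts computed by the batch layer and by the speed layer.
batch_view = {"/home": 100, "/about": 10}
realtime_view = {"/home": 3, "/new": 1}

merged = merge_views(batch_view, realtime_view)
```

The result is only correct if the two views do not overlap in time, which is one reason the coherence of the merged answer needs to be monitored.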
That's a good point for the data quality dimensions called data freshness and availability. But this architecture implies the use of some probabilistic algorithms when computing the real-time views, so the accuracy dimension is traded off for better availability (see the reminder of the data quality dimensions below).
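To make the trade-off concrete, here is a small Bloom filter, one well-known probabilistic structure. It is only an illustration of the kind of algorithm meant here, not necessarily what a given speed layer uses, and the sizes chosen are arbitrary assumptions. It answers "have we seen this item?" in constant memory, at the cost of occasional false positives:

```python
import hashlib

class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive several bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True may be wrong (collision); False is always right.
        return all(self.bits[pos] for pos in self._positions(item))

seen_users = BloomFilter()
for user in ["user-1", "user-2", "user-3"]:
    seen_users.add(user)
```

A `False` answer is exact, while a `True` answer only means "probably": this is the loss of accuracy accepted in exchange for a fast response.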
Immutability of Data
In the lambda architecture, data is immutable. Therefore, a change of a value does not update the data in the batch layer; instead, a new event is stored with the most recent value.
Every piece of data must therefore carry a timestamp, and the algorithms that create the batch views must manage the multiple versions of the data correctly. This principle again makes the algorithms more complex than what we're used to with traditional databases. But keeping a timestamp for all data is a good thing, as it makes it possible to track changes and improves data governance. It may also help to meet data privacy constraints, such as deleting the data after some time.
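A minimal sketch of this principle, assuming a simple event format (key, value, timestamp) invented for the example: updates are appended as new events, and the view algorithm picks the most recent version of each key.

```python
from collections import namedtuple

# Hypothetical event format: nothing is ever updated in place.
Event = namedtuple("Event", ["key", "value", "timestamp"])

EVENT_LOG = [
    Event("user:1:city", "Paris", 10),
    Event("user:2:city", "Lyon", 12),
    Event("user:1:city", "Berlin", 20),  # a "change" is just a newer event
]

def latest_values(events, as_of=None):
    """Return the most recent value per key, optionally as of a past timestamp."""
    latest = {}
    for event in events:
        if as_of is not None and event.timestamp > as_of:
            continue  # ignore events that happened after the requested time
        current = latest.get(event.key)
        if current is None or event.timestamp >= current.timestamp:
            latest[event.key] = event
    return {key: event.value for key, event in latest.items()}

current_view = latest_values(EVENT_LOG)
past_view = latest_values(EVENT_LOG, as_of=15)
```

Because the old events are still there, the same log can answer both "what is the value now?" and "what was the value at time 15?", which is what makes change tracking possible.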
Retention policies must be defined, not only because it's meaningless to keep all data, but also for legal reasons concerning data privacy. Again, data governance programs must be implemented in order to apply these retention policies.
Changes in Data Quality Dimensions Significance
In traditional data quality, six dimensions are often used:
- Accuracy: the degree to which the data correctly describes real world objects.
- Completeness: the degree to which the data is not missing.
- Consistency: the degree to which the data is presented in the same format.
- Timeliness: the degree to which the data is up-to-date.
- Accessibility: the degree to which the data is available, or easily and quickly retrievable.
- Validity: the degree to which the data conforms to some syntax constraints.
With the lambda architecture, some data quality dimensions become less important and new dimensions arise.
Accuracy is often hard to achieve with a big data architecture, as we've seen. Related to this dimension, credibility takes on particular importance because the data lake often aggregates data from outside the company that may not be reliable.
Completeness loses weight in some cases: there is so much data available that, even after removing a subset of incomplete data, statistical analysis or machine learning can still remain meaningful.
Consistency is still important to maintain in the batch views, but the raw data stored in the batch layer need not be so consistent, as long as the algorithms that create the batch views are able to fix the inconsistencies, i.e. to manage the different versions of the raw data. It is worth noting that the quality of the algorithms used in Big Data systems is entangled with the data quality.
The temporal dimensions, such as timeliness, are becoming more important, as every piece of data should be given a timestamp. The speed layer of the lambda architecture is expected to partially address this data quality dimension, as it works with recent and updated data.
Accessibility is probably the dimension that benefits most from big data systems, as all data is now available in a single system and can be accessed through a diverse set of tools (HDFS, Hive, Spark, ...).
Validity is still an important dimension that needs to be checked. The difference between traditional information systems and big data systems is that validity is checked at a different time in the data life cycle. Indeed, raw data in a data lake does not need to be validated when it enters the system.
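A possible way to picture this late validation, assuming records with an "email" field and a deliberately simple syntax rule (both are assumptions for the example): the raw data is accepted as-is, and the check runs when the view is built.

```python
import re

# Deliberately simple syntax constraint, for illustration only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def split_by_validity(raw_records):
    """Separate records that satisfy the syntax constraint from those that do not."""
    valid, rejected = [], []
    for record in raw_records:
        email = record.get("email")
        if isinstance(email, str) and EMAIL_PATTERN.match(email):
            valid.append(record)
        else:
            rejected.append(record)
    return valid, rejected

# Hypothetical raw records, ingested without any check at entry.
raw_records = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "not-an-email"},
    {"id": 3},
]

valid_records, rejected_records = split_by_validity(raw_records)
```

Keeping the rejected records instead of dropping them leaves room to measure validity as a ratio and to revisit the rule later without losing raw data.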
The volatility of data, i.e. the time during which the data is valid, is a dimension that requires care and that has great value for big data governance.
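One simple way to represent volatility, assuming each record carries its own "timestamp" and a "valid_for" duration in the same time unit (both field names are hypothetical), is to test whether a given instant falls inside the validity window:

```python
def is_still_valid(record, now):
    """True while `now` falls inside the record's validity window."""
    start = record["timestamp"]
    end = start + record["valid_for"]
    return start <= now < end

# Hypothetical record: a price quote observed at t=100 and valid for 60 time units.
quote = {"key": "price:EURUSD", "value": 1.1, "timestamp": 100, "valid_for": 60}

valid_at_130 = is_still_valid(quote, now=130)
valid_at_160 = is_still_valid(quote, now=160)
```

A governance process could use such a check to decide which values may still be served and which ones need a refresh.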
The freshness of the data is mostly a concern in the batch views, but given the lambda architecture, the up-to-date information is provided by a combination of the batch views and the real-time views. So, the significance of this dimension will depend on where (on which layer of the lambda architecture) a data quality requirement is placed.
Of course, data quality always depends on what you want to do with the data. The best definition of data quality is "fit for purpose". The above considerations highlight some data quality dimensions that require special care given the architecture of Big Data environments, but in all cases, the data quality dimensions must be defined with respect to the final usage of the data.
Where Data Quality Fits in the Lambda Architecture
Big data systems collect data from various sources, which can be internal to the company or external, like social data. In the lambda architecture, data quality dimensions can be measured at different stages. At the time data enters the system, the origin of the data is often a criterion for deciding whether the data is credible or not. In the batch layer and the batch views, there is a clear issue with the freshness of the data. This issue is compensated by the speed layer, which provides the latest data.
The system must ensure that the data is coherent when it results from aggregating information from the batch views and the real-time views.
The real-time views may use probabilistic algorithms to return fast results. This may come at the price of accuracy.
And of course, all usual data quality dimensions still apply to the results of the user queries.
The Need for Big Data Governance
Success in traditional data governance is already difficult to achieve with the usual data systems because of the existence of data silos. Today, big data systems provide a partial solution to this problem because all data can be found in one place.
However, the data retrieved by the users may not be exactly the data stored in the system, because of the schema-on-read principle. The variety of data available in the data reservoir and the numerous transformations applied before the data is rendered to the user make good data governance more complex. Big data governance must provide an overview of the data life cycle and monitor the transformations on the data from its input until its restitution to the user. A lot of work remains to be done in this domain in order to track the lineage of a piece of data and assess its different data quality dimensions, such as credibility, freshness and consistency. The Apache community is working on a data governance framework called Apache Falcon. This project will be a big first step toward achieving success in Big Data governance.
Credits to Exqi working group. Part of this work has been done with the "Data Quality and Big Data" working group of the French data quality association Exqi.