I wrote in a previous post about the fallacy of the word “big” in the phrase “Big Data”. This catchphrase, which has come to mean “everything having to do with the analysis of data”, is a poisonous co-opting of the analytics space to fit vendors’ needs. Its reliance on capturing massive amounts of data creates a major barrier to adoption for analytics and confusion over exactly what data analytics can and can’t do. If you accept the premise that “Big Data” is a lie, and that anyone can benefit from analyzing data, then the next logical question is, “now what?”
You already have a LOT of data. We’ve all seen the “how much data is created every second/minute” examples floating around (here is one if you have not), but those describe web-scale, massively consumed platforms (and are a huge part of the Big Data Fallacy, I might add). Even so, any reasonably sized organization is likely curating anywhere from 10 to 100 TB of just “data”, with most organizations growing that by about 50% year over year. That data comes in many forms and formats: RDBMS (databases), files, logs, images, video, audio, source code, web pages, emails, forums and Intranet sites, and a bunch of others I am not thinking of so early in the morning. The value in analytics comes from linking data together, so in a world full of data and, most critically, data TYPES, how do you actually get started?
There are two “big bucket” schools of thought for this:
1) The “Big Data Fallacy” school: Put all the data in one place. The claim is that this is the only way to capture all the information you have in a usable format. This is not just the storage vendor view (think of the term “Data Lake”); it is also the view of common analytics tools like Hadoop: “Dump everything in here and then you can analyze it”.
2) The Data Integration school: Leave the data where it is, catalog it, and connect to it on demand when you want to pull some of it in for an actual analysis “job”. This is a newer space, not yet saturated with noise, but it also exposes the underlying challenge of delivering data value through analytics in a much more impactful way.
The underlying challenge that both of these schools of data collection and connection ignore is that if you don’t know what data you have in the first place, it is REALLY hard to analyze it. Data Integration tools have tended to focus on the mechanical question of HOW to connect to the data, and the “Big Data” tools focus on the “put it all in one place and it does not matter what it is” approach. Both overlook the fact that most of the time spent in analytics goes into identification and “cleansing” of data, not actual analysis. So what I really need is a way to classify and catalog data, up front, before I start my analysis, so that I know what I have to work with.
This is the biggest problem in the data analytics world, and it is solved through a concept called Data Governance, which combines three key facets:
1) A Data Catalog (what it is, where it is, and how do I get to it, if I can)
2) Security and Permissions (because some data shouldn’t be exposed)
3) Data Quality (is my data suitable for the purposes of my analysis?)
Taken together, these three facets form the core of a comprehensive Data Governance strategy. Fortunately, new tools are emerging that combine some or all of these functions to speed the collection and curation of data within even the most complex and tangled data sets.
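As a rough illustration of how the three facets fit together (every name and field here is hypothetical, not taken from any particular governance tool), a minimal catalog record for a single data set might look like this:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a catalog record covering the three facets:
# 1) catalog metadata, 2) security/permissions, 3) a simple quality measure.
@dataclass
class CatalogEntry:
    name: str                       # what it is
    location: str                   # where it is (e.g. a connection string or file path)
    format: str                     # RDBMS, file, log, image, ...
    allowed_roles: set = field(default_factory=set)  # who may access it
    completeness: float = 0.0       # fraction of non-null fields, 0.0 to 1.0

    def accessible_by(self, role: str) -> bool:
        """Facet 2: is this role permitted to see the data?"""
        return role in self.allowed_roles

    def fit_for_analysis(self, min_completeness: float = 0.9) -> bool:
        """Facet 3: is the data good enough for my purposes?"""
        return self.completeness >= min_completeness

# Register a data set once, then check it before pulling it into a job.
sales = CatalogEntry(
    name="sales_orders",
    location="jdbc:postgresql://db.internal/sales",
    format="RDBMS",
    allowed_roles={"analyst", "finance"},
    completeness=0.97,
)
print(sales.accessible_by("analyst"))  # True
print(sales.fit_for_analysis())        # True
```

The point of the sketch is the shape, not the fields: an analyst consults the catalog entry, and only then connects to the data, which is exactly the “leave it where it is, catalog it, connect on demand” model described above.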
In this series of posts, “Governing Analytics – Making Data Valuable, Viable, and Virtuous” I will dig into the “3 Other V’s” of analytics challenges more deeply and look at ways to architect a comprehensive system that handles them without compromising the ability of your analysts to get the data they need.