The Interwebs are, ahem, flooded with information and excitement about data lakes* (and the less hydrological-sounding data hubs).
MapR already has quite a bit of good information on the subject and I've provided some links below. All good stuff.
So, what can I add to the discussion? Perhaps a bit of perspective based on what our customers are doing and thinking. So, here are some things to think about:
It's the most common use case
Almost 25% of our customers cite data lake/data hub as one of their primary use cases. Its value lies in providing the critical substrate on which many, if not most, other use cases and analytics depend.
It's the first use case
For a majority of our customers, the data lake is their first use case, and its construction is a key part of the process of establishing big data technologies and practices in an IT organization. By its nature, it is meant to span organizational or technological silos, which introduces some interesting new possibilities. The data lake is often an essential first step to a customer/citizen/patient 360 use case.
Once the data lake is established, we've seen our customers gradually change their perspective on it, often renaming it their "analytics platform" or "data platform" or even their "fraud detection platform". This is due to their corresponding rise in experience and comfort with "big data thinking". Which brings me to my main point.
It's not the lake, it's the water
While traditional technology vendors and press breathlessly promote the data lake as an end unto itself, most customers who actually implement a data lake soon realize that it really is just the beginning. I don't mean to demean or devalue the data lake; to many (especially larger) companies it is an essential step. But there is a noticeable trend among our more advanced customers that suggests that there is a significant knee in the big data maturity curve that follows the successful implementation of a data lake.
Once the lake is deployed, greater visibility into all available data sources begins to shine a light on the many possibilities relevant to the business. New correlations can be made now that developers and data scientists have access to data on customer behavior, market activity, supply chain, service desk, marketing campaigns and so on.
This is the point where the focus changes from the lake to the water.
Here are some relevant links on MapR.com:
- A page on MapR.com devoted to data lakes, with case studies, videos and papers including...
- The Definitive Guide to Data Lakes by Radiant Advisors
- and a shorter Solution Brief
- Check out what Cisco, comScore and HP are doing around data lakes
* "A data lake is a collection of storage instances of various data assets additional to the originating data sources. These assets are stored in a near-exact, or even exact, copy of the source format."
Nick Heudecker (@nheudecker) from Gartner speaks well on the subject.