As a follow-up to my recent post on data lakes, I thought people might find it helpful to look at the data pipeline that underlies some data lakes in production today. For many big data projects, the data pipeline is an essential strategic piece to understand and get right. At the highest level, most data lake pipelines look like this:
So: an ever-expanding list of internal and external data sources; a set of loading and (perhaps optional) transformation processes; a collection of processing engines; a set of connections and integrations with data warehouses and/or data marts; and the visualizations, dashboards and reporting that inform the attending humans.
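The stages above can be sketched schematically. This is a toy illustration of the extract-transform-load flow, not any particular product's API; the function names, the two in-memory "sources," and the dict-based "lake" are all made up for the example.

```python
# Schematic sketch of the pipeline stages: sources -> load/transform -> lake.
# Every name here (extract, transform, load, the sources) is illustrative.

def extract(sources):
    """Pull raw records from each (hypothetical) source."""
    return [record for source in sources for record in source]

def transform(records):
    """The 'perhaps optional' transformation stage: normalize field names."""
    return [{k.lower(): v for k, v in r.items()} for r in records]

def load(records, lake):
    """Land the transformed records in the lake's raw zone."""
    lake.setdefault("raw", []).extend(records)
    return lake

# Two toy sources standing in for an RDBMS table and a log feed.
rdbms_rows = [{"ID": 1, "Name": "acme"}]
log_events = [{"ID": 2, "Name": "globex"}]

lake = load(transform(extract([rdbms_rows, log_events])), {})
```

Downstream processing engines, warehouse integrations, and dashboards would then all read from that landed raw zone.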
Now we can take a look at a data lake architecture that is in production today (slightly disguised, as I'd like to keep my job).
This example is a financial services company in Asia that started - like many data lakes - with a data warehouse offload use case. The customer began its big data journey with a proof of value around cost savings versus a legacy DW platform.
As they expanded the potential list of data sources, they ended up fashioning a pretty sophisticated and complex ingestion stage. Data sources include RDBMS favorites from Oracle and MSFT, as well as SFDC and social data. Data "acquisition" occurs using Apache Kafka, Sqoop (from SQL sources), Pig scripts for extraction from Salesforce, and Flume for log data and social data ingest.
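To make the RDBMS side of that ingestion concrete, here is a small sketch of how a Sqoop import invocation might be assembled. The JDBC URL, table name, and target directory are placeholders I've invented, not the customer's actual configuration; the flags themselves (`--connect`, `--table`, `--target-dir`, `--num-mappers`) are standard `sqoop import` options.

```python
# Hypothetical helper that builds the argv for a `sqoop import` pull from
# an Oracle source into the lake's raw zone. Connection string, table, and
# paths are made-up placeholders.

def sqoop_import_cmd(jdbc_url, table, target_dir, mappers=4):
    """Assemble the command-line argument list for a Sqoop import."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(mappers),
    ]

cmd = sqoop_import_cmd(
    "jdbc:oracle:thin:@dbhost:1521/ORCL",   # placeholder Oracle source
    "CUSTOMERS",                            # placeholder table
    "/datalake/raw/customers",              # placeholder landing path
)
```

In practice a list like this would be handed to a scheduler or shelled out via `subprocess`, with one such job per source table.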
The storage layer consists of data stored in Hive (to provide them with a DW paradigm on Hadoop) and on the MapR Filesystem. Their query engine, powered by Apache Drill and Apache Spark, feeds a fairly rich set of analytics services that gives them the flexibility of using familiar BI and analytics tools (Tableau, R, SAS) as well as REST APIs for developers and ODBC/JDBC for SQL users.
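For the SQL users on the ODBC/JDBC path, access boils down to issuing ordinary SQL against the lake's tables. Since standing up a Drill cluster is out of scope here, this sketch uses Python's built-in sqlite3 as a stand-in for the real connection; the `trades` table and its rows are invented, but the access pattern - connect, query, fetch - is what a BI tool or SQL client would do against Drill or Hive.

```python
import sqlite3

# sqlite3 standing in for an ODBC/JDBC connection to the lake's query
# engine. The `trades` table is a made-up example, not the customer's data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (symbol TEXT, qty INTEGER)")
conn.executemany(
    "INSERT INTO trades VALUES (?, ?)",
    [("AAA", 100), ("BBB", 250), ("AAA", 50)],
)

# The kind of aggregate a BI tool (Tableau, etc.) would push down as SQL.
rows = conn.execute(
    "SELECT symbol, SUM(qty) FROM trades GROUP BY symbol ORDER BY symbol"
).fetchall()
```

The point is that the same familiar SQL skill set carries over - only the connection string changes between the legacy DW and the lake.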
Add to that an elegant (and disguised) "control cockpit" and that rounds out their vision for their data lake. Rather than resembling a data dumping ground, this example illustrates a massively flexible and inclusive analytics platform.
Not bad for a data warehouse offload.