Data warehouse on big data

Document created by maprcommunity Employee on May 23, 2018
Version 1



I would like to know whether any of these solutions have been deployed with MapR:


1. SpliceMachine

2. Apache Kylin

3. AtScale


Thanks in advance





Hi Teik Hooi Beh,

I know Kylin has definitely been deployed by community members, but I am not sure about the other two.


Splice requires HBase and relies on coprocessors. So you can run it on MapR; however, I wouldn't recommend it.


Really, Big Data is not a good place to run a relational warehouse, and I wouldn't recommend it. Joins are very expensive in a distributed environment.


A number of customers that I'm aware of are using AtScale. Others can comment on the other two options you listed.


I disagree somewhat with the above statement regarding joins in distributed environments, as some of the largest legacy DWs are deployed in distributed compute environments. Joins and query planning just need more care in distributed environments.


It may be good to review the actual use cases you have in mind to see what technologies and options would be feasible.


Andries Engelbrecht,

In a hierarchical model, you don't have joins (unless you're joining two different sets of data), so it's cheaper to pull the data. When you pull data from your legacy systems, you really shouldn't just drop the model in place. Try doing a Hive query with 29 table joins. If you don't like Hive, then try Spark. Or Tez. Now compare that with storing the data in a hierarchical or record-based model, like MapR-DB JSON tables. While there's more overhead in transforming the JSON structure into a tree and vice versa, it's still going to be more efficient when it comes to querying the data. (Note: If your query isn't based on your primary key, you'll have to join against your index unless the system supports secondary indexing, which will do this for you.)
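
As a rough illustration of the point above, here is a minimal Python sketch that builds the same customer view twice: once by matching rows across three relational-style tables, and once by a single primary-key lookup on a nested document. The data, field names, and function names are all made up for the example, and plain Python structures stand in for real tables; this is not MapR-DB code.

```python
# Hypothetical data: table and field names are illustrative only.

# Relational layout: three "tables" that must be joined at read time.
customers = [{"cust_id": 1, "name": "Acme"}]
orders = [{"order_id": 10, "cust_id": 1}, {"order_id": 11, "cust_id": 1}]
order_lines = [
    {"order_id": 10, "sku": "A", "qty": 2},
    {"order_id": 10, "sku": "B", "qty": 1},
    {"order_id": 11, "sku": "A", "qty": 5},
]


def customer_view_via_joins(cust_id):
    """Rebuild one customer's orders by scanning and matching all three tables."""
    customer = next(c for c in customers if c["cust_id"] == cust_id)
    result = {"name": customer["name"], "orders": []}
    for order in orders:
        if order["cust_id"] != cust_id:
            continue
        lines = [
            {"sku": line["sku"], "qty": line["qty"]}
            for line in order_lines
            if line["order_id"] == order["order_id"]
        ]
        result["orders"].append({"order_id": order["order_id"], "lines": lines})
    return result


# Hierarchical layout: one nested document per customer, keyed by primary key.
customer_docs = {
    1: {
        "name": "Acme",
        "orders": [
            {"order_id": 10, "lines": [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]},
            {"order_id": 11, "lines": [{"sku": "A", "qty": 5}]},
        ],
    }
}


def customer_view_via_document(cust_id):
    """Single lookup by primary key; the related data is already nested."""
    return customer_docs[cust_id]
```

Both functions return the same structure for customer 1. The difference is where the work happens: the first does the matching on every read, the second pays for it once when the document is written.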


There's a bit more to it, like the primary keys being clustered indexes, and there are other tricks like secondary indexing, but table joins are expensive. Remember that in Big Data, disk is cheap, and it was designed around using primary keys and filters.

I'm not suggesting that you will never have to do a join; you will. However, you do walk away from a relational model.


There are many failed Hadoop projects where the expectation was that they could just drop in a relational model on Hadoop and run queries against it and it would perform as well as their existing Oracle or DB2 distributed systems.

Even Tez has issues. (Note: AFAIK MapR doesn't support Tez with LLAP.)


There are a lot of considerations that you need to think about when choosing a tool set, including how you plan to do compaction as well as how you combine fast and slow data.


I agree, you need to review your use case; however, it's also important not to fall into the trap of thinking you can just plop in a DW tool and it will replace your legacy system. Those projects never end well.






Thanks Michael Segel and Andries Engelbrecht for the explanations and concerns.


I have seen similar issues myself during my days working on PivotalHD and HAWQ, and promised myself never to propose a like-for-like move from the old state (DWH) to the new state (data lake). But some recent tools like Dremio and SnappyData, which provide query optimizations for OLAP modelling, make me wonder whether they provide efficient SQL join queries (not those that span 3 to 4 pages, of course), or whether they could actually provide a 'virtual data warehouse' environment in a data lake, fit for a specific use case.




The interesting thing about Dremio is that you have the ability to simultaneously query multiple sources within the same query. So as a tool, you're able to democratize the data without building a single monolithic data lake.


The point I was trying to make is that you need to rethink your data models when you build a data lake.  You have to consider your primary access pattern and then you need to store the data using the appropriate rowkey.  Use secondary indexing to access the data against attributes and again, keep joins to a minimum.
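
To make the rowkey and secondary-index idea concrete, here is a minimal Python sketch using a toy in-memory store. Everything in it is hypothetical: `KeyedStore`, `make_rowkey`, the `customer#date#order` key layout, and the `status` attribute are invented for illustration and are not the MapR-DB API. The assumed primary access pattern is "all orders for one customer, in date order".

```python
import bisect


def make_rowkey(customer_id, order_date, order_id):
    """Composite rowkey: rows for one customer sort together, ordered by date."""
    return f"{customer_id}#{order_date}#{order_id}"


class KeyedStore:
    """Toy stand-in for a table whose rows are kept sorted by rowkey."""

    def __init__(self):
        self._keys = []       # rowkeys in sorted order
        self._rows = {}       # rowkey -> row
        self._by_status = {}  # secondary index: status -> set of rowkeys

    def put(self, rowkey, row):
        old = self._rows.get(rowkey)
        if old is None:
            bisect.insort(self._keys, rowkey)
        else:
            # Keep the secondary index in step when a row is overwritten.
            self._by_status[old["status"]].discard(rowkey)
        self._rows[rowkey] = row
        self._by_status.setdefault(row["status"], set()).add(rowkey)

    def scan_prefix(self, prefix):
        """Primary access path: range scan over rowkeys sharing a prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        rows = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            rows.append(self._rows[key])
        return rows

    def find_by_status(self, status):
        """Attribute access path: go through the secondary index, no full scan."""
        return [self._rows[k] for k in sorted(self._by_status.get(status, ()))]
```

With this layout, `store.scan_prefix("c001#")` answers the primary access pattern with one contiguous range scan, while `store.find_by_status("open")` covers the attribute lookup without scanning every row.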


Michael Segel 

I follow what you are saying and agree that a direct port of a relational model is very challenging, and there are various ways to address the use cases in the Big Data space. It was just the blanket statement about joins in distributed systems that I don't agree with in a general context. RDBMSs typically keep a lot of statistics that help the optimizer plan a query in various situations; this gets much more challenging in the flexible big data space, and the approaches you suggest are very valuable.
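
As a small sketch of why statistics matter to join planning, the Python below picks the build side of a hash join from row-count estimates. The function name, the `stats` dictionary, and its keys are assumptions made for the example; real optimizers use far richer statistics and cost models than this.

```python
def hash_join(left, right, key, stats):
    """Join two lists of dicts on `key`.

    The hash table is built on whichever side the (hypothetical) statistics
    report as smaller, and the other side is streamed past it.
    Returns the joined rows and the name of the side chosen for the build.
    """
    build_is_left = stats["left_rows"] <= stats["right_rows"]
    build, probe = (left, right) if build_is_left else (right, left)

    # Build phase: hash the smaller input by join key.
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)

    # Probe phase: stream the larger input and look up matches.
    joined = []
    for row in probe:
        for match in table.get(row[key], []):
            left_row, right_row = (match, row) if build_is_left else (row, match)
            joined.append({**left_row, **right_row})
    return joined, ("left" if build_is_left else "right")
```

If the row counts are missing or stale, a planner working this way can end up hashing the large side instead, which is the kind of problem that becomes harder to avoid when the platform does not maintain statistics for you.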


Companies like AtScale and Arcadia Data are using different ways of building out OLAP data structures to deal with these challenges as well (utilizing some of the advantages available on large flexible platforms).


Using MapR-DB JSON, however, you can gain a lot of flexibility in the data structure, especially by leveraging secondary indexes for certain uses, but you have to revisit the use case, structures, and design; as Michael stated, it is not a direct replacement model. It has the inherent flexibility to deal with schema changes, which can become a huge pain with legacy DWs.


This document was generated from the following discussion: Data warehouse on big data