I'd like your thoughts on a potential client. This is actually a pitch from one of our sales reps, and we brought up MapR along the way since we think it is in line with the solution they need.
Basically, they want to make the ETL process that currently runs on their data warehouse (MS SQL) drastically faster. Because their database keeps growing, the ETL process takes longer and longer. The current size is 4TB+ across 100+ stores, and a run takes 6 to 12 hours. Here is a simple diagram I made based on what I understood from the meeting.
From here, they dump all the CSV files from their transactional system to the file server. Once all the data has been dumped, the files are loaded into MS SQL on the data warehouse server, and finally the data is processed (generating new views, transformations, stores, KPI generation, etc.) so it can be viewed in the BI tools they use.
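To make the flow above concrete, here is a minimal Python sketch of the two stages (load the CSV, then transform into a KPI table) with a timer around each one, which is roughly the per-stage number an ETL benchmark would need to report. Everything in it is an assumption for illustration: the `sales` and `store_kpi` tables, the `store_id` and `amount` columns, and the sample data are made up, and an in-memory SQLite database only stands in for the real target (MS SQL today, or Drill on MapR), so the timings say nothing about either system.

```python
import csv
import io
import sqlite3
import time

# Hypothetical sample data; the client's real CSV layout is unknown.
SAMPLE_CSV = "store_id,amount\n1,10.5\n1,4.5\n2,20.0\n"


def load_csv(conn, csv_text):
    """Load stage: insert the CSV rows into a staging table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.execute("CREATE TABLE sales (store_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:store_id, :amount)", rows)
    return len(rows)


def build_kpis(conn):
    """Transform stage: aggregate the staging table into a per-store KPI table."""
    conn.execute(
        "CREATE TABLE store_kpi AS "
        "SELECT store_id, SUM(amount) AS total_sales, COUNT(*) AS n_txn "
        "FROM sales GROUP BY store_id"
    )
    return conn.execute(
        "SELECT store_id, total_sales, n_txn FROM store_kpi ORDER BY store_id"
    ).fetchall()


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def run_pipeline(csv_text):
    """Run both stages against an in-memory database and time each one."""
    conn = sqlite3.connect(":memory:")
    try:
        n_rows, load_s = timed(load_csv, conn, csv_text)
        kpis, transform_s = timed(build_kpis, conn)
    finally:
        conn.close()
    return {
        "rows": n_rows,
        "kpis": kpis,
        "load_s": load_s,
        "transform_s": transform_s,
    }


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_CSV))
```

Timing the load and the transform separately matters here because it shows which stage dominates the 6 to 12 hours, and that is the stage a benchmark against MapR would have to beat.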
They haven't mentioned yet which BI tool they are using, so let's leave that blank for now.
What we proposed is either to replace the data warehouse itself or to offload some of the data to a MapR cluster, and to connect all the BI tools to MapR directly.
Their priority is to speed up the whole process, so they are asking us to present some benchmarks on ETL processes. All I have provided so far are IOzone benchmarks, which only show random read/write and IOPS (raw/NFS).
*Can MapR provide some benchmarks on data processing (ETL)?
*Can MapR increase BI tool performance (e.g. when using filters)?
They are also a bit hesitant about using MapR, since they have a lot of SQL scripts that would need to be converted. I'm not sure whether all of those scripts will work with Drill. My biggest questions would be:
*Is this doable?
*If so, what would be the best approach/practice for executing this?
I hope to get some input. I really want to get back to them on this idea since it's a pretty big client, and I'd like to communicate with them on Friday.