
Drill + MapR-DB vs Drill + MapR-FS

Question asked by john.humphreys on May 30, 2017
Latest reply on Jun 5, 2017 by MichaelSegel



I have another thread going here, "Will Phoenix be able to run on MapR-DB?", which has basically evolved into discussing OpenTSDB vs Drill (or using both, which looks like the likely solution).


In this case, I'm interested in knowing the optimal data layout and backing store for querying large amounts of time series data with Drill, assuming I move forward with it.


Data Set Information

  • Gathering metrics data at minute level (1,440 points per metric per server per day) - can do streaming.
  • 40,000 servers being tracked.
  • Each reports up to hundreds of metrics a day.
  • Initial requirement is ~20 billion data points a day (the current system, which is being replaced, already handles this).  This will grow.
  • 99% of data is numeric, but there is some string data.
  • Storage duration is up for debate, but let's say 3-6 months of history at minute level for fun.
  • E.g. timestamp = 123456789, metric = user.cpu, host = myhostname, value = 12345 [+ maybe a few tags/additional info]
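
For a rough sense of scale, the figures above can be checked with a quick back-of-the-envelope calculation. This is only a sketch: the 30-day month is an assumption for illustration, and the per-server metric count is an average implied by the stated totals, not a measured number.

```python
# Back-of-the-envelope sizing from the figures in the list above.
SERVERS = 40_000
POINTS_PER_METRIC_PER_DAY = 24 * 60      # one sample per minute = 1,440
TARGET_POINTS_PER_DAY = 20_000_000_000   # ~20 billion, as stated

# Average number of metrics per server implied by the stated daily volume.
metrics_per_server = TARGET_POINTS_PER_DAY / (SERVERS * POINTS_PER_METRIC_PER_DAY)

# Total points retained at minute level, assuming 30-day months (an assumption).
DAYS_PER_MONTH = 30
retained = {
    months: TARGET_POINTS_PER_DAY * months * DAYS_PER_MONTH
    for months in (3, 6)
}

print(f"~{metrics_per_server:.0f} metrics per server on average")
for months, points in retained.items():
    print(f"{months} months of history: {points / 1e12:.1f} trillion points")
```

So "hundreds of metrics" per server is consistent with ~20 billion points a day, and 3-6 months of minute-level history means a few trillion points before any growth.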


My Questions:

  1. Is it better to store the data in MapR-DB, in MapR-FS as Parquet files, or in MapR-FS as JSON files?  Is there another, better option I failed to mention?
  2. Why is this the better option?
  3. What data layout should I use?  E.g. if MapR-DB, how should my JSON look? Does how I write my Parquet file matter (sort order, timestamp in the first column or host name in the first column, etc.)?