Sorry for the long post, and thanks in advance for any answers and suggestions.
We are planning to ingest and provide data in geographically remote locations (Europe, Asia, America).
Our primary requirements are:
1. Part of the data is "geographically locked" and cannot be stored or processed outside the location where it was received.
2. Part of the data, in particular aggregated data, should be available globally.
3. Aggregated reports should cover the data from all locations.
I do not have much experience with Hadoop, but I understand that creating a single cluster over a WAN is a bad idea, as clarified in this post. Is that correct?
The current plan is to create multiple clusters, each running HBase/Phoenix on MaprFS. The geographical lock should cover data in both HBase and MaprFS.
Our first idea is to create Mapr storage volumes on each cluster:
a. HBase volume
b. Geographically locked file volume
c. Geographically free file volume
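To make the intended layout concrete, here is a minimal sketch of the three volumes and what is allowed to leave a cluster, following the replication plan described in the rest of this post. The volume names (`hbase`, `files_locked`, `files_free`) and the helper `may_leave_cluster` are hypothetical and only illustrate the policy; they are not MapR or HBase identifiers.

```python
from enum import Enum


class Replication(Enum):
    """How data in a volume is allowed to leave its home cluster."""
    NONE = "never leaves the cluster"
    PER_TABLE = "only selected (geographically free) HBase tables"
    WHOLE_VOLUME = "entire volume is replicated"


# Hypothetical volume names; the same layout is created on every cluster.
PLAN = {
    "hbase": Replication.PER_TABLE,          # a. HBase volume
    "files_locked": Replication.NONE,        # b. geographically locked files
    "files_free": Replication.WHOLE_VOLUME,  # c. geographically free files
}


def may_leave_cluster(volume: str, table_is_free: bool = False) -> bool:
    """Return True if data from `volume` may be copied to another region.

    `table_is_free` only matters for the HBase volume, where the decision
    is made per table rather than per volume.
    """
    mode = PLAN[volume]
    if mode is Replication.WHOLE_VOLUME:
        return True
    if mode is Replication.PER_TABLE:
        return table_is_free
    return False
```

For example, `may_leave_cluster("hbase", table_is_free=True)` is the only case in which HBase data would be replicated, while nothing in `files_locked` ever is.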
On replication of the above data:
a. HBase data will be replicated using HBase's own replication functionality, and only separate "geographically free" tables will be replicated. Additionally, HBase can be replicated in a master-master fashion, which allows computation on all clusters and writing the aggregated data into HBase to satisfy requirement 3.
b. This obviously will not be replicated.
c. This raises the question of how best to perform continuous replication of a MaprFS volume. The best solution I found is remote volume mirroring, but that is a scheduled (not continuous) operation...
1. Is there an easy way to start the next mirror immediately after the previous one ends?
2. Do I understand correctly that mirroring is a delta operation and should not overload the network? What is the network overhead in our typical scenario, where files are not deleted or moved and data is only appended to existing files (or new files are created)?
3. What is the atomic operation on MaprFS with respect to the mirror-related snapshot? Scenario: an open Parquet file is undergoing a larger modification when the scheduled mirror starts and the snapshot is created. Is it possible that a corrupted file (caught in the middle of a write) will be mirrored?
4. How big will the performance and storage penalty from constant mirroring (and the related snapshots) be?
5. Is there a better way to do this type of filesystem replication? Maybe it is better to put files (hopefully they are not huge) into HBase and drop filesystem replication? What would be the best solution in our case?
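To clarify what question 1 is asking for, the behaviour we are after is roughly the loop below: start a mirror, wait until it finishes, and start the next one right away. This is only a sketch of the desired control flow. `start_mirror` and `is_mirror_running` are hypothetical placeholders for whatever mechanism MapR actually provides; I do not know the right commands or API calls, which is part of the question.

```python
import time
from typing import Callable


def mirror_back_to_back(
    start_mirror: Callable[[], None],
    is_mirror_running: Callable[[], bool],
    rounds: int,
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run `rounds` mirror operations one after another.

    start_mirror      -- placeholder: triggers one mirror of the volume
    is_mirror_running -- placeholder: True while that mirror is in progress
    sleep             -- injectable so the loop can be tested without waiting
    Returns the number of completed mirror runs.
    """
    completed = 0
    for _ in range(rounds):
        start_mirror()
        # Poll until the current mirror finishes, then start the next one.
        while is_mirror_running():
            sleep(poll_seconds)
        completed += 1
    return completed
```

If something equivalent already exists as a built-in option, that would of course be preferable to running a loop like this ourselves.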
We would be grateful for any additional comments and suggestions.