Some background info:
For my job, I have been placed on a team standing up a Data Lake on MapR. I have been tasked with handling the data loads into MapR-FS and the Hive Metastore from various sources such as Oracle, MySQL, and SQL Server. For this I have been using Sqoop jobs. I had zero prior experience in this space, so everything is new to me.
So far, I have learned quite a bit about Sqoop and have jobs for: the initial load of a table, incremental loads based on a lastmodified column, codegen for the classes and jars, and merging the newly imported incremental data.
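For context, here is roughly the shape of the initial-load and codegen jobs I mean (connection details, table name, and paths are made up for illustration):

```shell
# Hypothetical connection string, credentials, and table name
CONNECT="jdbc:oracle:thin:@dbhost:1521/ORCL"

# Initial full import of the table into the /src directory
sqoop import \
  --connect "$CONNECT" \
  --username etl --password-file /user/etl/.dbpass \
  --table ORDERS \
  --target-dir /mapr/development/sandbox/sqooptest/src

# Codegen to produce the class and jar that sqoop-merge needs later
sqoop codegen \
  --connect "$CONNECT" \
  --username etl --password-file /user/etl/.dbpass \
  --table ORDERS \
  --bindir /tmp/sqoop-orders \
  --outdir /tmp/sqoop-orders
```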
What I would like to discuss and get feedback from the community on is the best practice for performing and managing incremental imports into MapR-FS, as a start.
My initial thoughts (for a given table) are as follows:
- Run the first job to import the table data into a MapR directory, e.g. /mapr/development/sandbox/sqooptest/. This initial import places its output in a /src subdirectory.
- Schedule the incremental job with Atomic. It appends its output to a /latest directory.
- After each incremental import, run the sqoop-merge job to merge the contents of /latest onto /src, writing the output to a /merged directory.
- From there, overwrite the /src contents with the updated /merged contents and delete the /merged directory, ready for the next incremental load.
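To make the cycle concrete, this is roughly how I picture one incremental run looking as a script (the JDBC URL, column names, merge key, and jar/class names are all placeholders; the hadoop fs commands at the end implement the "overwrite /src and delete /merged" step):

```shell
BASE=/mapr/development/sandbox/sqooptest

# 1) Incremental import: pull rows changed since the last run into /latest
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl --password-file /user/etl/.dbpass \
  --table orders \
  --target-dir $BASE/latest \
  --incremental lastmodified \
  --check-column last_modified \
  --last-value "$LAST_RUN_TS" \
  --append

# 2) Merge /latest onto /src, writing the reconciled result to /merged
#    (jar and class come from the earlier codegen job)
sqoop merge \
  --new-data $BASE/latest \
  --onto $BASE/src \
  --target-dir $BASE/merged \
  --jar-file orders.jar \
  --class-name orders \
  --merge-key order_id

# 3) Promote /merged to /src and clear /latest for the next cycle
hadoop fs -rm -r $BASE/src $BASE/latest
hadoop fs -mv $BASE/merged $BASE/src
```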
I have been searching online for some sort of "best practices" guide on this subject, but I have not found anything. I did download and read through the Apache Sqoop Cookbook, but it did not have what I was searching for.
So, are my thoughts on track with what might be done in a production system, or am I way off base?
Looking for any advice from those more experienced than I am.