Sqoop Practices for Managing Incremental Imports

Discussion created by cpritchard on May 29, 2018
Latest reply on May 29, 2018 by cpritchard

Hello All,


Some background info:

      For my job, I have been placed on a team working on getting a Data Lake up and running on MapR. I have been tasked with handling the data loads into the MapR FS and Hive metastore from various sources such as Oracle, MySQL, and SQL Server. For this I have been using Sqoop jobs. I had zero prior experience in this space, so everything is new to me.

      So far, I have learned quite a bit about Sqoop. I now have a job for the initial data load of tables, a job for incremental data loads based on a lastmodified date, a job running codegen to generate the classes and jars, and a job for merging in the new incremental data being pulled in.
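For concreteness, here is roughly the shape of those jobs. The connection string, table name, check column, and paths below are placeholders for illustration, not my real values:

```shell
# 1) Initial full import of the table into the /src baseline directory
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.dbpass \
  --table customers \
  --target-dir /mapr/development/sandbox/sqooptest/src \
  -m 4

# 2) Saved incremental job: appends rows changed since the stored
#    last-value into /latest (Sqoop tracks --last-value in the job metastore)
sqoop job --create customers_incr -- import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.dbpass \
  --table customers \
  --target-dir /mapr/development/sandbox/sqooptest/latest \
  --incremental lastmodified \
  --check-column updated_at \
  --append \
  -m 4

# 3) Codegen: produces the class and jar that sqoop-merge needs later
sqoop codegen \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user --password-file /user/etl/.dbpass \
  --table customers \
  --outdir /tmp/sqoop-codegen/src \
  --bindir /tmp/sqoop-codegen/bin
```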

      What I would like to discuss and get feedback on from the community is the best practice for performing and managing incremental imports into the MapR FS, as a starting point.


My initial thoughts (for a given table) are as follows:

  • Run the first job to import the table data into a MapR directory, e.g. /mapr/development/sandbox/sqooptest/.
    This first import places its partitions in a /src directory.
  • Schedule the incremental job with Atomic. It will append its partitions to a /latest directory.
  • After each incremental import, run the sqoop-merge job to merge the contents of /latest onto /src, with the output placed in a /merged directory.
  • From there, overwrite the /src contents with the updated /merged contents and delete the /merged directory before the next incremental load.
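The merge-and-rotate steps above could be sketched as a post-import script. The base path matches my example directory; the jar, class name, and "id" merge key are assumptions standing in for whatever codegen produced for the table:

```shell
#!/bin/bash
set -e

# Base directory for this table's data (from my example layout)
BASE=/mapr/development/sandbox/sqooptest

# Merge the newly appended rows in /latest onto the /src baseline.
# The jar and class come from the codegen job; "id" is the primary key.
sqoop merge \
  --new-data "$BASE/latest" \
  --onto "$BASE/src" \
  --target-dir "$BASE/merged" \
  --jar-file /tmp/sqoop-codegen/bin/customers.jar \
  --class-name customers \
  --merge-key id

# Promote the merged output to be the new baseline, then clear
# /latest and /merged so the next incremental cycle starts clean.
hadoop fs -rm -r "$BASE/src"
hadoop fs -mv "$BASE/merged" "$BASE/src"
hadoop fs -rm -r "$BASE/latest"
```

One thing I am unsure about is whether the delete-then-rename at the end is safe if a reader hits /src mid-rotation, which is part of why I am asking about production practice.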


I have been trying to search online for some sort of "best practices" guide on this subject, but I have not found anything. I did download and read through the "apache-sqoop-cookbook", but it did not have what I was searching for.


So, are my thoughts on track with what might be done in a production system, or am I way off base?

Looking for any advice from those more experienced than I am.