I want to migrate historical data from an RDBMS to MapR-FS as a one-time operation.
What would be the best practice, and what are the steps to do so?
Hi Shankar Mukherjee,
You can use Sqoop to transfer data from a relational database management system (RDBMS) such as MySQL or Oracle into MapR-FS, then use MapReduce on the transferred data. Please review Sqoop 1.
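For concreteness, a minimal Sqoop 1 import looks something like the sketch below; the host, database, table, and target path are placeholders, not anything from this thread. The `run` helper just prints each command as a dry run; delete the `run` prefix to execute for real.

```shell
# Dry-run helper: prints the command instead of executing it.
run() { echo "+ $*"; }

# Placeholder connection details -- substitute your own.
# Sqoop opens one JDBC connection per map task, so --num-mappers
# also caps the load you put on the source database.
run sqoop import \
  --connect jdbc:mysql://dbhost:3306/salesdb \
  --username etl_user -P \
  --table orders \
  --target-dir /user/etl/landing/orders \
  --num-mappers 4
```

`-P` prompts for the password interactively so it doesn't end up in your shell history.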
Thanks for your reply.
What are the other possible methods of ingestion into MapR-FS from an RDBMS, say MS SQL Server?
Can I export the RDBMS tables in JSON format and move them to MapR-FS?
I recommend you check out https://community.mapr.com/thread/22037-from-mssql-to-mapr and feel free to ask any additional questions you may have.
Don't use Sqoop.
Sqoop runs as a map/reduce job that opens multiple connections to your live RDBMS, so if it's a production data warehouse, you may impact performance.
Also, for a one-time dump this isn't the fastest or the best way to do it. E.g., what happens if your JDBC connection is dropped? And there's a lot more overhead and network traffic.
The better way for a one-time dump is to unload the data and transfer the files over to MapR. You can NFS-mount the cluster and do a cp, or you could use an scp command.
Then you just have to work the data from your landing zone into however you want to store it for use.
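The unload-and-copy route can be sketched as follows, assuming an MS SQL Server source (bcp for the unload) and an NFS gateway on the cluster; every host, cluster, and path name here is a placeholder. The `run` helper prints each command as a dry run; drop it to execute.

```shell
# Dry-run helper: prints each command instead of executing it.
run() { echo "+ $*"; }

# 1. Unload the table on the database side (bcp in character mode).
run bcp salesdb.dbo.orders out /tmp/orders.csv -c -S dbhost -U etl_user

# 2. NFS-mount the MapR cluster on this edge node.
run sudo mount -o hard,nolock nfsnode:/mapr /mapr

# 3. Plain cp into the landing zone (or scp from a remote box).
run cp /tmp/orders.csv /mapr/my.cluster.com/user/etl/landing/
```

Because the cluster looks like an ordinary filesystem over NFS, an interrupted copy can simply be rerun, with none of the JDBC fragility mentioned above.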
Thanks for the reply.
Extending my question: if I were to migrate the data from MapR-FS to MapR-DB, what would be the best practice?
Also, can I load directly into MapR-DB from an external data source (RDBMS)?
First off, I work for MapR, but I would consider using StreamSets here, as it can take data directly from the RDBMS into MapR-FS and MapR-DB. Easy to use and it just works. How to Install StreamSets Data Collector on the MapR Sandbox
Data Drift? Is that like Tokyo Drift ? (The fast and the furbiest: Tokyo Drift) ;-P
Sorry. I kid. I just hate when people create terms which aren't really well defined.
It's an interesting product, but let's go back to basics first. ;-)
What I am about to say applies mainly to larger enterprise shops. YMMV applies.
If you have capable developers on hand, this really isn't a hard problem to solve. It just takes some thought, and you will have done most of that thinking already because you have to deal with your data governance team. (Assuming you have thought about data security and data governance because of some data protection law or another...)
In terms of ingestion, you already have CDC and GoldenGate for moving data from IBM and Oracle ('Change Data Capture' is IBM's; 'GoldenGate' is Oracle's).
So if you have these tools or similar ones already in place, you will want to stream the data; or, if it's your first dump from the source, you're going to want to unload the table as a file or set of files. (This is more efficient.)
Here you will want to do any data transformations and data cleansing, then perform a bulk load of the data.
You can then turn on the stream and apply the ongoing changes to MapR-DB.
Note: while you may not want to run SQL over MapR-DB for performance reasons, you will want to stage the data in MapR-DB as an easy way to dedupe it. If you need to capture the change history, you're also going to want to store the CDC/GoldenGate updates.
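One way to do the bulk-load step: MapR-DB binary tables speak the HBase API, so the stock ImportTsv loader can push the cleaned files into a table. The table path, column family, column names, and input directory below are placeholders. The `run` helper prints the command as a dry run; drop it to execute.

```shell
# Dry-run helper: prints the command instead of executing it.
run() { echo "+ $*"; }

# Load comma-separated files into a MapR-DB binary table
# (addressed by filesystem path rather than an HBase table name).
run hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:customer,cf:amount \
  /user/etl/tables/orders \
  /user/etl/landing/orders
```

The first field of each input line becomes the row key, which is also what makes the staging table a natural dedupe mechanism: reloading the same key simply overwrites the cell.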
This would be best practice.
Now, you can use a tool like StreamSets; however, you would need to weigh the value against the cost of licensing and use. YMMV.
You received a lot of excellent recommendations. Please let us know if you need further assistance before I close this discussion. Also don't forget to thank the members who took the time to help you. Check out: How to Show Your Appreciation to Members