I am new to MapR and AWS and would like to know how to extract/copy files (stored in JSON format) from AWS S3 and load them into MapR FS/Hive tables on a local MapR cluster. Any help is appreciated.
Thanks in advance.
Viral, you could easily explore the JSON on S3 using the S3 Storage Plugin for Apache Drill.
Do you have the ability to SCP data from the AWS instances to your MapR cluster? Is your MapR cluster on premises or in the cloud as well?
As for getting the JSON into MapR-FS, there are plenty of tools that can work with JSON data (Hive, Spark, Drill, etc.). What are you looking to do with the data?
We have no ability to SCP. The MapR cluster is on premises. We want to download JSON files from AWS S3 to a local MapR FS node and then use those files with Hive. So we are looking for a tool that can connect to AWS S3 and download files to on-premises nodes.
As Mufeed Usman mentioned, you can use Drill to work with the files directly on S3 without moving them into MapR FS. In this case, the data stays out of MapR FS: you continue to use S3 as your persistent layer and use Drill for interactive analysis.
If you do want to bring it into MapR FS, presumably for further processing or other use cases not covered by Drill, here is one way to do it.
Use the AWS Command Line Interface (CLI) to copy the data from your AWS S3 bucket onto one of the nodes (or clients) that can talk to the cluster. Once you install the AWS CLI tools and set up your access keys for S3, the command to copy data is as simple as "aws s3 cp s3://my-bucket/path/MyFile.txt MyFile.txt".
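For more than one file, something along these lines might work. This is only a rough sketch: the bucket URI and staging directory are placeholders, and the helper names (run, fetch_json_from_s3) and the DRY_RUN switch are made up for illustration. It uses "aws s3 sync" with include/exclude filters so that only .json objects are pulled, and objects already downloaded are skipped on later runs.

```shell
#!/bin/sh
# Sketch: pull JSON objects from an S3 prefix into a local staging directory.
# Assumes the AWS CLI is installed and credentials are already configured.

# Print the command instead of running it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = S3 URI (placeholder, e.g. s3://my-bucket/path/), $2 = local directory
fetch_json_from_s3() {
  src_uri="$1"
  dest_dir="$2"
  run mkdir -p "$dest_dir" &&
    # Exclude everything, then re-include only *.json (later filters win).
    run aws s3 sync "$src_uri" "$dest_dir" --exclude '*' --include '*.json'
}

# Example call (placeholder values):
#   fetch_json_from_s3 s3://my-bucket/path/ /tmp/s3-staging
```

Setting DRY_RUN=1 before the call is a cheap way to check what would be executed before touching the bucket.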
Once you have the data locally, you can just run hadoop fs -put (or -copyFromLocal) to copy the data into MapR FS. You can then create Hive tables on top of those files, based on the schema of the files.
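As a rough illustration of that step, here is a sketch. Everything specific in it is an assumption: the directories, the table name s3_events, and the two columns are hypothetical, and the real column list has to match your JSON. It also assumes one JSON object per line and that the HCatalog JsonSerDe is available to Hive (on some installs the hive-hcatalog-core jar has to be added first).

```shell
#!/bin/sh
# Sketch: copy staged JSON files into MapR FS and define a Hive table over them.
# Paths, table name, and columns are hypothetical placeholders.

# Print the command instead of running it when DRY_RUN=1 (hypothetical helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# $1 = local staging directory, $2 = target directory in MapR FS
load_into_maprfs() {
  local_dir="$1"
  target_dir="$2"
  run hadoop fs -mkdir -p "$target_dir" &&
    # -f overwrites files that already exist in the target directory.
    run hadoop fs -put -f "$local_dir"/*.json "$target_dir"/
}

# $1 = directory in MapR FS that holds the JSON files
create_hive_table() {
  target_dir="$1"
  # External table: Hive reads the files in place and does not own them.
  run hive -e "CREATE EXTERNAL TABLE IF NOT EXISTS s3_events (id STRING, payload STRING)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION '$target_dir';"
}

# Example calls (placeholder values):
#   load_into_maprfs /tmp/s3-staging /data/s3_json
#   create_hive_table /data/s3_json
```

An external table is used here so that dropping the table later would not delete the underlying files; a managed table would also work if Hive should own the data.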
The requirement is to bring the files from S3 to local, so the AWS CLI approach sounds better. Can these AWS CLI commands be executed from a shell script or a Java program? This process needs to be automated and will run several times a day. The other thing I am looking into is Amazon Data Pipeline.
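On the automation question: the AWS CLI is an ordinary command-line program, so it can be called from a shell script (or from Java through something like ProcessBuilder). Below is a minimal sketch of one way to schedule it, assuming cron is available on the node. The lock path, the script path in the crontab line, and the run_exclusive helper are all hypothetical. The lock is there so that a run that takes longer than the interval does not overlap with the next one.

```shell
#!/bin/sh
# Sketch: run a command under a simple lock so overlapping scheduled runs are skipped.
# LOCK_DIR is a hypothetical path; override it as needed.
LOCK_DIR="${LOCK_DIR:-/tmp/s3_to_maprfs.lock}"

run_exclusive() {
  # mkdir is atomic: it fails if the lock directory already exists.
  if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    echo "previous run still active, skipping" >&2
    return 1
  fi
  "$@"
  status=$?
  # Release the lock and report the wrapped command's exit status.
  rmdir "$LOCK_DIR"
  return "$status"
}

# Example use inside the script (placeholder bucket and directory):
#   run_exclusive aws s3 sync s3://my-bucket/path/ /tmp/s3-staging
#
# Example crontab entry (hypothetical script path), every 4 hours:
#   0 */4 * * * /opt/scripts/s3_to_maprfs.sh >> /var/log/s3_to_maprfs.log 2>&1
```

If the script is killed mid-run, the lock directory is left behind and has to be removed by hand; a trap on EXIT would be a reasonable addition.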