Best Practice - Loading a Cluster
Question asked on Jul 5, 2012; latest reply on Jul 20, 2012 by chriscurtin
I have flat files that get loaded into a directory 24 times a day. I want to be able to take all the files, combine them, and append the data to the cluster under a single file name. What is the best-practice way of doing this? Should I use Flume?
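For reference, the combine-and-append step being asked about can be sketched with the standard hadoop fs commands plus MapR's NFS mount; the directory names, cluster mount path, and target file below are only illustrative, not something given in the thread.

    # merge the current batch of flat files on the cluster into one local file
    hadoop fs -getmerge /incoming/batch /tmp/batch.merged
    # append the merged data to the single target file through the MapR NFS mount
    cat /tmp/batch.merged >> /mapr/my.cluster.com/data/combined/daily.dat
    rm /tmp/batch.merged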
chriscurtin - Jul 20, 2012 12:40 AM
Ours isn't 24 times a day; rather, we get a file per day that we combine monthly. We ended up writing our own merge process that copies the files out of HDFS, combines them, and puts the monthly file back. Not at all ideal, but it works.
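A minimal sketch of that kind of copy-out, combine, copy-back merge, using only standard hadoop fs commands (all paths and file names here are placeholders, not the actual process):

    # pull the month's daily files out of the cluster to local scratch space
    mkdir -p /tmp/merge-work
    hadoop fs -get /data/daily/2012-06/* /tmp/merge-work/
    # combine them into a single monthly file
    cat /tmp/merge-work/*.dat > /tmp/monthly-2012-06.dat
    # put the combined file back on the cluster and clean up
    hadoop fs -put /tmp/monthly-2012-06.dat /data/monthly/2012-06.dat
    rm -rf /tmp/merge-work /tmp/monthly-2012-06.dat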
I just added a question about append via NFS or Hadoop ( ); if the answer turns out to be "yes, you can", then we'll probably start doing direct appends via Hadoop, since our data is the result of a processing step.
Thinking about your needs, and assuming the data is from a log file or something else that rotates hourly, I'd look at holding the data in a MapR volume and then just cat'ing it all together on the Unix side, with the result going to the final destination volume. (If the answer to my question is yes for NFS but no for Hadoop, this is probably what we'll do via a cron job.)
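If the NFS route works out, the cron job could be as simple as something like this (the mount points, schedule, and file patterns are made up for illustration):

    # hypothetical crontab entry: five past every hour, cat the staged hourly files
    # into the destination volume over the MapR NFS mount, then clear the staging area
    5 * * * * cat /mapr/my.cluster.com/staging/hourly/*.log >> /mapr/my.cluster.com/final/combined.log && rm -f /mapr/my.cluster.com/staging/hourly/*.log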
Hope this helps,