How to Install StreamSets Data Collector on the MapR Sandbox

Document created by rupal on Jun 27, 2017. Last modified by rupal on Jun 28, 2017. Version 10.

StreamSets Data Collector is an open source platform for building continuous ingestion pipelines. It connects to a wide variety of filesystems, databases, web services, Hadoop ecosystems, NoSQL platforms, and more (the full list is in the StreamSets documentation). StreamSets also provides MapR-specific connectors that leverage the performance and security advantages offered by the converged platform. This document provides step-by-step instructions for installing Data Collector on the MapR Sandbox for Hadoop so that you can try it out in a VM on your local machine.

Two notes before you begin:

  • I'm assuming that you're using the MapR 5.2 Sandbox for Hadoop.  You'll want to use this version - or an even more recent one, if available - to ensure compatibility with the MapR-specific connectors provided with StreamSets.  Any virtualization environment (VMware, VirtualBox, etc.) should work fine.
  • The instructions here are based on the Installation Guide and MapR Prerequisites in StreamSets Documentation and assume that you are connected to the MapR Sandbox via ssh as user mapr (the default password for that user is mapr, in case you weren't aware).

If all goes well, you should be done with these instructions in roughly 10-15 minutes, depending on how long it takes to download the StreamSets software to your MapR Sandbox.

 

Let's get started!

 

  • Once you've ssh'd into your Sandbox as mapr you'll need to change to root:

# su -

Password: mapr

  • Now, check to make sure that StreamSets hasn't already been installed:

# yum list installed | grep streamsets

The above command shouldn't print anything. Empty output means that StreamSets hasn't been installed yet.

  • Next, check the free space available using the df -h command, like so:

# df -h

Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/vg_maprdemo-lv_root     8.4G  8.0G     0 100% /
tmpfs                               2.9G     0  2.9G   0% /dev/shm
/dev/sda1                           477M   41M  411M  10% /boot
localhost:/mapr                     100G     0  100G   0% /mapr
localhost:/mapr/demo.mapr.com/user   15G  5.2G  9.8G  35% /user

 

The first entry above (/dev/mapper/vg_maprdemo-lv_root, mounted on /) is the root volume on my MapR Sandbox.  It's full, so I can't install anything else.  If you see the same thing, you'll need to extend the storage space of this volume: complete the steps in the How To Extend The MapR Sandbox VM's Storage Space document before proceeding with the instructions below.  You'll need at least 3GB of available space on the root volume for a successful installation.
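If you'd rather not read the df output by eye, the 3GB check can be scripted. This is only a sketch: the function names has_enough_space and root_avail_kb are made up for illustration, and it assumes a df that accepts the POSIX -P and -k options.

```shell
# Minimum free space on / suggested in this guide: 3GB, expressed in KB.
REQUIRED_KB=$((3 * 1024 * 1024))

# has_enough_space AVAILABLE_KB REQUIRED_KB  (hypothetical helper)
# Succeeds when the available space meets the requirement.
has_enough_space() {
    [ "$1" -ge "$2" ]
}

# root_avail_kb  (hypothetical helper)
# Prints the available space on / in KB.
# -P keeps each filesystem on one line, -k reports 1K blocks.
root_avail_kb() {
    df -Pk / | awk 'NR == 2 { print $4 }'
}

avail_kb=$(root_avail_kb)
if has_enough_space "$avail_kb" "$REQUIRED_KB"; then
    echo "OK: ${avail_kb} KB free on /"
else
    echo "Not enough space: ${avail_kb} KB free on /, need ${REQUIRED_KB} KB"
fi
```

On the full root volume shown above, this would report that there is not enough space.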

  • Download the StreamSets Data Collector RPM tarball and extract it:

   # wget https://archives.streamsets.com/datacollector/2.6.0.1/rpm/streamsets-datacollector-2.6.0.1-all-rpms.tgz

   # tar -xf streamsets-datacollector-2.6.0.1-all-rpms.tgz
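If tar complains during extraction, an interrupted download is a likely cause. The sketch below tests whether the archive can be read to the end before you go any further. The helper name tarball_ok is hypothetical, and the file name assumes you downloaded version 2.6.0.1 into the current directory.

```shell
# tarball_ok FILE  (hypothetical helper)
# Succeeds when FILE can be listed to the end as a gzip-compressed tar archive.
tarball_ok() {
    tar -tzf "$1" > /dev/null 2>&1
}

archive=streamsets-datacollector-2.6.0.1-all-rpms.tgz
if tarball_ok "$archive"; then
    echo "$archive looks complete"
else
    echo "$archive is missing or damaged; download it again"
fi
```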

  • You can install all of the packages if you'd like, but I doubt you'll need them all in your environment. Install only the following two for now; you can always add more packages later as required:

   # yum localinstall streamsets-datacollector-2.6.0.1-1.noarch.rpm streamsets-datacollector-mapr_5_2-lib-2.6.0.1-1.noarch.rpm
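yum expects the package file names to be separated by spaces. If you install a different release, the two file names can be built from variables, as in the sketch below. The variable names and the rpm_names helper are made up for illustration. The naming pattern is copied from the 2.6.0.1 files above, and I'm assuming other releases follow the same pattern. The script only prints the command so that you can review it before running it as root.

```shell
SDC_VERSION=2.6.0.1   # Data Collector version used in this guide
MAPR_LIB=mapr_5_2     # stage library matching the MapR 5.2 Sandbox

# rpm_names VERSION MAPR_LIB  (hypothetical helper)
# Prints the core RPM and the MapR stage library RPM, one file name per line.
rpm_names() {
    echo "streamsets-datacollector-$1-1.noarch.rpm"
    echo "streamsets-datacollector-$2-lib-$1-1.noarch.rpm"
}

# Join the two names with spaces and show the resulting install command.
echo "yum localinstall $(rpm_names "$SDC_VERSION" "$MAPR_LIB" | tr '\n' ' ')"
```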

  • Clean up the temporary file that was created:

   # rm /var/tmp/yum-*/streamsets-datacollector-*.rpm

  • Configure Data Collector to work with the MapR client:

   # export SDC_HOME=/opt/streamsets-datacollector

   # export SDC_CONF=/etc/sdc

   # $SDC_HOME/bin/streamsets setup-mapr

 

  • Add the StreamSets Data Collector as a system service so that it starts automatically when the MapR Sandbox boots:

# chkconfig --add sdc

  • Finally, start the StreamSets Data Collector system service:

# service sdc start

INFO: sdc started successfully.

  • You should now be able to access the StreamSets Data Collector console using your host system's browser on port 18630:

http://<your-sandbox-ip-addr>:18630

   For example, your URL might look something like this:

http://192.168.223.156:18630
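You can also check from a terminal whether the console answers before opening the browser. This is a sketch that assumes curl is available on the machine you run it from. The helper names sdc_url and sdc_reachable are hypothetical, and the IP address is just the example from above.

```shell
# sdc_url HOST [PORT]  (hypothetical helper)
# Prints the console URL; the port defaults to 18630.
sdc_url() {
    echo "http://$1:${2:-18630}"
}

# sdc_reachable HOST [PORT]  (hypothetical helper)
# Succeeds when any HTTP response arrives within 5 seconds.
sdc_reachable() {
    curl -s -o /dev/null --max-time 5 "$(sdc_url "$@")"
}

# Example address from this guide; replace it with your Sandbox's IP.
echo "Console URL: $(sdc_url 192.168.223.156)"
```

To run the actual check, call something like `sdc_reachable 192.168.223.156 && echo "console is up"`.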

  • A login window should be displayed, where you can enter admin for both the username and password fields.

Note: If you can't access the UI, have a look at "Why I am not able to access StreamSets Data Collector console?"

  • Create a new pipeline, provide a name, and click Save. Then verify that the Stage Library list on the far right side shows MapR stages among the Destinations.

 

 

If you see the MapR stages, then that's it - you're finished with a successful installation!

Where to next?  Here are some suggestions:

Enjoy!
