How To Install StreamSets On The MapR Sandbox

Document created by cwarman (Employee) on Jun 23, 2016. Last modified by aalvarez on Jan 26, 2017.

StreamSets is used for building continuous ingestion pipelines.  It's an open source platform that connects to a wide variety of filesystems, databases, and web services (the full list is here).  Plus, StreamSets provides MapR-specific connectors that leverage the performance and security advantages offered by our converged platform.  This document walks you step by step through installing it on the MapR Sandbox for Hadoop so that you can try it out in a VM on your local machine.

 

Two notes before you begin:

  • I'm assuming that you're using the MapR 5.1 Sandbox for Hadoop.  You'll want to use this version - or an even more recent one, if available - to ensure compatibility with the MapR-specific connectors provided with StreamSets.  Any virtualization environment (VMware, VirtualBox, etc.) should work fine.
  • My instructions are based on these and these instructions, and assume that you are connected to the MapR Sandbox via ssh as user mapr (the default password for that user is mapr, in case you weren't aware).

 

If all goes well, you should be done with these instructions in roughly 10-15 minutes, depending on how long it takes to download the StreamSets software to your MapR Sandbox.  Let's get started!

 

  • Once you've ssh'd into your Sandbox as mapr you'll need to change to root:

# su -
Password: mapr

 

  • Now, check to make sure that StreamSets hasn't already been installed:

# yum list installed | grep streamsets

The above command should print nothing, which indicates that StreamSets hasn't been installed yet.

 

  • Next, check the free space available using the df -h command, like so:

# df -h
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/vg_maprdemo-lv_root    8.4G  8.0G     0 100%  /
tmpfs                              2.9G     0  2.9G   0%  /dev/shm
/dev/sda1                          477M   41M  411M  10%  /boot
localhost:/mapr                    100G     0  100G   0%  /mapr
localhost:/mapr/demo.mapr.com/user  15G  5.2G  9.8G  35%  /user

 

The first row above (mounted on /) is the root volume on my MapR Sandbox.  It's full, so I can't install anything else.  If you see the same thing, you'll need to extend this volume's storage space: complete the steps in my How To Extend The MapR Sandbox VM's Storage Space document before proceeding with the instructions below.  You'll probably want at least 2-3GB of available space on the root volume for a successful installation.
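
If you'd rather not eyeball the df output, here is a minimal sketch of the same check as a script.  The function names (avail_kb, has_free_gb) are my own, and the 3GB threshold is just the upper end of the estimate above, so adjust it to taste.

```shell
#!/bin/sh
# Print the space available on a mount point, in kilobytes.
# df -P forces the one-line-per-filesystem POSIX format; -k reports 1K blocks.
avail_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeed if the mount point has at least the requested number of gigabytes free.
has_free_gb() {
    mount_point=$1
    needed_gb=$2
    [ "$(avail_kb "$mount_point")" -ge $((needed_gb * 1024 * 1024)) ]
}

if has_free_gb / 3; then
    echo "root volume has enough free space"
else
    echo "root volume is short on space; extend it before installing"
fi
```

On a Sandbox in the state shown above (0 available on /), this should take the second branch.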

 

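  • Determine the URL of the StreamSets Data Collector RPM that you want to install (version 1.4.0.0 is used in the next step).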

 

  • Install the RPM using the URL obtained above, then clean up the temporary file that was created:

# yum install https://archives.streamsets.com/datacollector/1.4.0.0/rpm/streamsets-datacollector-1.4.0.0-1.noarch.rpm
# rm /var/tmp/yum-*/streamsets-datacollector-*.rpm

 

  • Create symbolic links to the MapR libraries (you can ignore any "File exists" errors here, and note the MapR 5.1-specific paths):

# ln -s /opt/mapr/lib/* /opt/streamsets-datacollector/streamsets-libs/streamsets-datacollector-mapr_5_1-lib/lib
# ln -s /opt/mapr/hbase/hbase-1.1.1/lib/* /opt/streamsets-datacollector/streamsets-libs/streamsets-datacollector-mapr_5_1-lib/lib
# ln -s /opt/mapr/hive/hive-1.2/lib/* /opt/streamsets-datacollector/streamsets-libs/streamsets-datacollector-mapr_5_1-lib/lib
# ln -s /opt/mapr/hive/hive-1.2/hcatalog/share/hcatalog/* /opt/streamsets-datacollector/streamsets-libs/streamsets-datacollector-mapr_5_1-lib/lib
# ln -s /opt/mapr/lib/maprfs-5.1.0-mapr.jar /opt/streamsets-datacollector/root-lib
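
If the "File exists" noise bothers you, the same linking can be sketched as a small helper that skips names already present in the destination.  The function name link_missing is hypothetical (it is not part of StreamSets or MapR), and it only covers the directory-to-directory links above.

```shell
#!/bin/sh
# Link every file in a source directory into a destination directory,
# skipping any name that already exists there (regular file or symlink).
link_missing() {
    src_dir=$1
    dest_dir=$2
    for src in "$src_dir"/*; do
        [ -e "$src" ] || continue              # the glob matched nothing
        dest="$dest_dir/$(basename "$src")"
        if [ -e "$dest" ] || [ -L "$dest" ]; then
            continue                           # already present; leave it alone
        fi
        ln -s "$src" "$dest"
    done
}
```

As root, you would call it once per source directory, for example: link_missing /opt/mapr/lib /opt/streamsets-datacollector/streamsets-libs/streamsets-datacollector-mapr_5_1-lib/lib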

 

  • Edit the StreamSets Data Collector (SDC) configuration file:

# vi /etc/sdc/sdc.properties

 

Remove the MapR 5.1 stage library from the system.stagelibs.blacklist property by deleting the trailing ,streamsets-datacollector-mapr_5_1-lib entry from the line below:

system.stagelibs.blacklist=streamsets-datacollector-mapr_5_0-lib,streamsets-datacollector-mapr_5_1-lib

 

The resulting system.stagelibs.blacklist should look like this:

system.stagelibs.blacklist=streamsets-datacollector-mapr_5_0-lib

 

Save and exit the file once finished.
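
If you prefer a non-interactive edit, the change can be sketched with sed.  This assumes the property sits on a single line exactly as shown above; if your sdc.properties splits the list across continuation lines, stick with the manual edit.  The function name unblacklist is my own, and a backup of the file is kept with a .bak suffix.

```shell
#!/bin/sh
# Remove one library name from the comma-separated system.stagelibs.blacklist
# property, wherever it appears in the list. Assumes a single-line property.
unblacklist() {
    props_file=$1
    lib=$2
    sed -i.bak -e "/^system\.stagelibs\.blacklist=/ {
        s/,${lib}\$//
        s/,${lib},/,/
        s/=${lib},/=/
        s/=${lib}\$/=/
    }" "$props_file"
}
```

As root: unblacklist /etc/sdc/sdc.properties streamsets-datacollector-mapr_5_1-lib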

 

  • Edit the StreamSets Data Collector security policy file:

# vi /etc/sdc/sdc-security.policy

 

Add a new permissions block for MapR at the end of the file:

// MapR home directory
grant codebase "file:///opt/mapr/-" {
  permission java.security.AllPermission;
};

 

Save and exit the file once finished.
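
The same block can also be appended from the shell.  Here is a minimal sketch that skips the append when the MapR codebase is already granted, so running it twice doesn't duplicate the block.  The function name add_mapr_grant is hypothetical.

```shell
#!/bin/sh
# Append the MapR permissions block to a Java security policy file,
# unless a grant for the MapR codebase is already there.
add_mapr_grant() {
    policy_file=$1
    # -F treats the codebase string as literal text, not a regex.
    if grep -qF 'file:///opt/mapr/-' "$policy_file"; then
        return 0
    fi
    cat >> "$policy_file" <<'EOF'

// MapR home directory
grant codebase "file:///opt/mapr/-" {
  permission java.security.AllPermission;
};
EOF
}
```

As root: add_mapr_grant /etc/sdc/sdc-security.policy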

 

  • Add the StreamSets Data Collector as a system service so that it starts automatically when the MapR Sandbox boots:

# chkconfig --add sdc

 

  • Finally, start the StreamSets Data Collector system service:

# service sdc start
INFO: sdc started successfully.

 

You should now be able to access the StreamSets Data Collector console using your host system's browser on port 18630:

http://<your-sandbox-ip-addr>:18630/

For example, your URL might look something like this:

http://192.168.223.156:18630/

A login window should be displayed, where you can enter admin for both the username and password fields.
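
If the page doesn't load, a quick check from the command line can tell you whether anything is listening on that port.  This is only a sketch: the function names are mine, and it assumes curl is available on the machine you run it from.

```shell
#!/bin/sh
# Port the StreamSets Data Collector console listens on, per the step above.
SDC_PORT=18630

# Build the console URL for a given Sandbox host or IP address.
sdc_url() {
    printf 'http://%s:%s/\n' "$1" "$SDC_PORT"
}

# Print the HTTP status code returned by the console (requires curl).
# curl reports a connection error instead if nothing is listening.
sdc_http_status() {
    curl -sS -o /dev/null --max-time 10 -w '%{http_code}\n' "$(sdc_url "$1")"
}
```

For example: sdc_http_status 192.168.223.156.  Any HTTP status code in the output means the service is up and answering; a connection error suggests the sdc service isn't running or the port isn't reachable from your host.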

 

That's it - You're finished!  Where to next?  Here are some suggestions:

Enjoy!
