The Exchange

Prerequisites

  1. Install the MapR 6.0 Sandbox:
  2. Ensure you have enough free space on the Sandbox to install StreamSets Data Collector and StreamSets Data Collector Edge; keep at least 5GB available. To check how much space is available and/or to add more, follow this guide:
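A quick way to check from a shell on the sandbox is a one-liner against the root filesystem; a minimal sketch (the 5GB threshold matches the requirement above):

```shell
# Minimal sketch: verify at least 5 GB (5242880 KB) is free on /.
avail_kb=$(df -Pk / | awk 'NR==2 {print $4}')
if [ "$avail_kb" -ge 5242880 ]; then
    echo "OK: ${avail_kb} KB available"
else
    echo "Low space: only ${avail_kb} KB available"
fi
```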


Install StreamSets Data Collector

  1. SSH into the Sandbox and log in as root


$ ssh mapr@localhost -p 2222


Last login: Wed Jan 31 21:30:50 2018

Welcome to your Mapr Demo virtual machine.

[mapr@maprdemo ~]$ su -


Last login: Wed Jan 31 21:30:54 PST 2018 on pts/0

[root@maprdemo ~]#
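Before going further, it is worth a quick sanity check that you really are root and that a Java runtime is present, since Data Collector is a Java application. A hedged sketch (this assumes java is already on the sandbox PATH):

```shell
# Hedged sanity check: confirm we are root and a JDK is reachable.
[ "$(id -u)" -eq 0 ] && echo "running as root" || echo "not root; run 'su -' first"
java -version 2>&1 | head -1
```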


  2. Download the tarball and extract the RPM packages

Get the latest version to install from

Note: We’ll be using StreamSets Data Collector version





[root@maprdemo ~]# wget


Note: If the download link does not work, use the fully qualified download link:


--2018-02-01 05:37:42--

Resolving (

Connecting to (||:80... connected.

HTTP request sent, awaiting response... 200 OK

Length: 3914629120 (3.6G) [application/x-tar]

Saving to: ‘streamsets-datacollector-’


[root@maprdemo ~]# tar -xf streamsets-datacollector-

[root@maprdemo ~]# ls

anaconda-ks.cfg  config.sandbox original-ks.cfg  streamsets-datacollector-  streamsets-datacollector-

[root@maprdemo ~]#
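To confirm the extraction produced the RPM packages that yum will install later, you can count them; a hedged sketch (the directory name depends on the exact SDC version you downloaded):

```shell
# Hedged sketch: count extracted RPMs; prints 0 if the directory is absent.
rpm_count=$(find streamsets-datacollector-* -maxdepth 1 -name '*.rpm' 2>/dev/null | wc -l)
echo "found ${rpm_count} RPM package(s)"
```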


  3. Remove unneeded stage libraries

StreamSets packages each connector as a stage library. You can do a full install with every stage library or selectively install only what you need. A full install takes roughly 3.5GB, and about half of the stage libraries are not needed for MapR, so remove the unwanted ones as follows:
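Before deleting anything, it can help to see where the space actually goes; a quick sketch, assuming you have already extracted the tarball and are in the directory that contains it:

```shell
# Hedged sketch: list the largest extracted packages first, to see
# where the ~3.5 GB of a full install would go.
du -sh streamsets-datacollector-* 2>/dev/null | sort -rh | head
```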


[root@maprdemo ~]# cd streamsets-datacollector-

[root@maprdemo streamsets-datacollector-]# rm -rf streamsets-datacollector-cdh* && rm -rf streamsets-datacollector-hdp* && rm -rf streamsets-datacollector-apache-kudu* && rm -rf streamsets-datacollector-mapr_5*


  4. Install the RPM packages

[root@maprdemo streamsets-datacollector-]# pwd


[root@maprdemo streamsets-datacollector-]# yum localinstall streamsets*.rpm

Loaded plugins: fastestmirror, langpacks

Examining streamsets-datacollector- streamsets-datacollector-

Marking streamsets-datacollector- to be installed

Examining streamsets-datacollector-apache-kafka_0_10-lib- streamsets-datacollector-apache-kafka_0_10-lib-

Marking streamsets-datacollector-apache-kafka_0_10-lib- to be installed

[... yum examines and marks all 34 packages; the repeated Examining/Marking lines are omitted here for brevity ...]

Resolving Dependencies

--> Running transaction check

---> Package streamsets-datacollector.noarch 0: will be installed

---> Package streamsets-datacollector-apache-kafka_0_10-lib.noarch 0: will be installed

[... one "will be installed" line per package; the remaining 32 are omitted for brevity ...]

--> Finished Dependency Resolution

MapR_Core                                                                                                                                                                            | 1.4 kB 00:00:00

MapR_Core/primary                                                                                                                                                                    | 4.7 kB 00:00:00

MapR_Ecosystem                                                                                                                                                                       | 1.4 kB 00:00:00

MapR_Ecosystem/primary                                                                                                                                                               | 14 kB 00:00:00

base/7/x86_64                                                                                                                                                                        | 3.6 kB 00:00:00

base/7/x86_64/group_gz                                                                                                                                                               | 156 kB 00:00:00

base/7/x86_64/primary_db                                                                                                                                                             | 5.7 MB 00:00:02

epel/x86_64/metalink                                                                                                                                                                 | 13 kB 00:00:00

epel/x86_64                                                                                                                                                                          | 4.7 kB 00:00:00

epel/x86_64/group_gz                                                                                                                                                                 | 266 kB 00:00:00

epel/x86_64/updateinfo                                                                                                                                                               | 880 kB 00:00:00

epel/x86_64/primary_db                                                                                                                                                               | 6.2 MB 00:00:01

extras/7/x86_64                                                                                                                                                                      | 3.4 kB 00:00:00

extras/7/x86_64/primary_db                                                                                                                                                           | 166 kB 00:00:00

updates/7/x86_64                                                                                                                                                                     | 3.4 kB 00:00:00

updates/7/x86_64/primary_db                                                                                                                                                          | 6.0 MB 00:00:01


Dependencies Resolved



Package                                                            Arch Version Repository                                                               Size



streamsets-datacollector                                           noarch /streamsets-datacollector-                                           162 M

streamsets-datacollector-apache-kafka_0_10-lib                     noarch /streamsets-datacollector-apache-kafka_0_10-lib-                      38 M

streamsets-datacollector-apache-kafka_0_11-lib                     noarch /streamsets-datacollector-apache-kafka_0_11-lib-                      40 M

streamsets-datacollector-apache-kafka_0_9-lib                      noarch /streamsets-datacollector-apache-kafka_0_9-lib-                       38 M

streamsets-datacollector-apache-kafka_1_0-lib                      noarch /streamsets-datacollector-apache-kafka_1_0-lib-                       40 M

streamsets-datacollector-apache-solr_6_1_0-lib                     noarch /streamsets-datacollector-apache-solr_6_1_0-lib-                      17 M

streamsets-datacollector-aws-lib                                   noarch /streamsets-datacollector-aws-lib-                                    46 M

streamsets-datacollector-azure-lib                                 noarch /streamsets-datacollector-azure-lib-                                  18 M

streamsets-datacollector-basic-lib                                 noarch /streamsets-datacollector-basic-lib-                                  36 M

streamsets-datacollector-bigtable-lib                              noarch /streamsets-datacollector-bigtable-lib-                               55 M

streamsets-datacollector-cassandra_3-lib                           noarch /streamsets-datacollector-cassandra_3-lib-                            17 M

streamsets-datacollector-cyberark-credentialstore-lib              noarch /streamsets-datacollector-cyberark-credentialstore-lib-              5.2 M

streamsets-datacollector-dev-lib                                   noarch /streamsets-datacollector-dev-lib-                                    14 M

streamsets-datacollector-elasticsearch_5-lib                       noarch /streamsets-datacollector-elasticsearch_5-lib-                        18 M

streamsets-datacollector-google-cloud-lib                          noarch /streamsets-datacollector-google-cloud-lib-                           28 M

streamsets-datacollector-groovy_2_4-lib                            noarch /streamsets-datacollector-groovy_2_4-lib-                             19 M

streamsets-datacollector-influxdb_0_9-lib                          noarch /streamsets-datacollector-influxdb_0_9-lib-                           14 M

streamsets-datacollector-jdbc-lib                                  noarch /streamsets-datacollector-jdbc-lib-                                   27 M

streamsets-datacollector-jks-credentialstore-lib                   noarch /streamsets-datacollector-jks-credentialstore-lib-                   2.6 M

streamsets-datacollector-jms-lib                                   noarch /streamsets-datacollector-jms-lib-                                    17 M

streamsets-datacollector-jython_2_7-lib                            noarch /streamsets-datacollector-jython_2_7-lib-                             53 M

streamsets-datacollector-kinetica_6_0-lib                          noarch /streamsets-datacollector-kinetica_6_0-lib-                           32 M

streamsets-datacollector-mapr_6_0-lib                              noarch /streamsets-datacollector-mapr_6_0-lib-                               43 M

streamsets-datacollector-mapr_6_0-mep4-lib                         noarch /streamsets-datacollector-mapr_6_0-mep4-lib-                          94 M

streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib                noarch /streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib-                152 M

streamsets-datacollector-mongodb_3-lib                             noarch /streamsets-datacollector-mongodb_3-lib-                              16 M

streamsets-datacollector-mysql-binlog-lib                          noarch /streamsets-datacollector-mysql-binlog-lib-                           16 M

streamsets-datacollector-omniture-lib                              noarch /streamsets-datacollector-omniture-lib-                               15 M

streamsets-datacollector-rabbitmq-lib                              noarch /streamsets-datacollector-rabbitmq-lib-                               16 M

streamsets-datacollector-redis-lib                                 noarch /streamsets-datacollector-redis-lib-                                  14 M

streamsets-datacollector-salesforce-lib                            noarch /streamsets-datacollector-salesforce-lib-                             20 M

streamsets-datacollector-stats-lib                                 noarch /streamsets-datacollector-stats-lib-                                  32 M

streamsets-datacollector-vault-credentialstore-lib                 noarch /streamsets-datacollector-vault-credentialstore-lib-                 3.8 M

streamsets-datacollector-windows-lib                               noarch /streamsets-datacollector-windows-lib-                                14 M


Transaction Summary


Install  34 Packages


Total size: 1.1 G

Installed size: 1.1 G

Is this ok [y/d/N]: y

Downloading packages:

Running transaction check

Running transaction test

Transaction test succeeded

Running transaction

 Installing : streamsets-datacollector-                                     1/34

 Installing : streamsets-datacollector-salesforce-lib-                      2/34

 [... the remaining 32 packages install in the same way, through 34/34 ...]

 Verifying  : streamsets-datacollector-salesforce-lib-                      1/34

 Verifying  : streamsets-datacollector-groovy_2_4-lib-                      2/34

 [... verification continues for all 34 packages ...]



 streamsets-datacollector.noarch 0:                                                         streamsets-datacollector-apache-kafka_0_10-lib.noarch 0:

 streamsets-datacollector-apache-kafka_0_11-lib.noarch 0:                                   streamsets-datacollector-apache-kafka_0_9-lib.noarch 0:

 streamsets-datacollector-apache-kafka_1_0-lib.noarch 0:                                    streamsets-datacollector-apache-solr_6_1_0-lib.noarch 0:

 streamsets-datacollector-aws-lib.noarch 0:                                                 streamsets-datacollector-azure-lib.noarch 0:

 streamsets-datacollector-basic-lib.noarch 0:                                               streamsets-datacollector-bigtable-lib.noarch 0:

 streamsets-datacollector-cassandra_3-lib.noarch 0:                                         streamsets-datacollector-cyberark-credentialstore-lib.noarch 0:

 streamsets-datacollector-dev-lib.noarch 0:                                                 streamsets-datacollector-elasticsearch_5-lib.noarch 0:

 streamsets-datacollector-google-cloud-lib.noarch 0:                                        streamsets-datacollector-groovy_2_4-lib.noarch 0:

 streamsets-datacollector-influxdb_0_9-lib.noarch 0:                                        streamsets-datacollector-jdbc-lib.noarch 0:

 streamsets-datacollector-jks-credentialstore-lib.noarch 0:                                 streamsets-datacollector-jms-lib.noarch 0:

 streamsets-datacollector-jython_2_7-lib.noarch 0:                                          streamsets-datacollector-kinetica_6_0-lib.noarch 0:

 streamsets-datacollector-mapr_6_0-lib.noarch 0:                                            streamsets-datacollector-mapr_6_0-mep4-lib.noarch 0:

 streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib.noarch 0:                              streamsets-datacollector-mongodb_3-lib.noarch 0:

 streamsets-datacollector-mysql-binlog-lib.noarch 0:                                        streamsets-datacollector-omniture-lib.noarch 0:

 streamsets-datacollector-rabbitmq-lib.noarch 0:                                            streamsets-datacollector-redis-lib.noarch 0:

 streamsets-datacollector-salesforce-lib.noarch 0:                                          streamsets-datacollector-stats-lib.noarch 0:

 streamsets-datacollector-vault-credentialstore-lib.noarch 0:                               streamsets-datacollector-windows-lib.noarch 0:



[root@maprdemo streamsets-datacollector-]#


  1. Set up connectivity to MapR

The setup-mapr command, run below, modifies configuration files, creates the required symbolic links, and installs the appropriate MapR stage libraries.

[root@maprdemo streamsets-datacollector-]# cd /opt/streamsets-datacollector/

[root@maprdemo streamsets-datacollector]# ls

api-lib  bin cli-lib  container-lib libexec  libs-common-lib root-lib  sdc-static-web streamsets-libs  user-libs

[root@maprdemo streamsets-datacollector]# export SDC_HOME=/opt/streamsets-datacollector

[root@maprdemo streamsets-datacollector]# export SDC_CONF=/etc/sdc

[root@maprdemo streamsets-datacollector]# export MAPR_MEP_VERSION=4

[root@maprdemo streamsets-datacollector]# $SDC_HOME/bin/streamsets setup-mapr


+ printf 'Done\n'


+ echo Succeeded



  1. Start the service

[root@maprdemo streamsets-datacollector-]# systemctl start sdc


  1. Check Service Status

[root@maprdemo streamsets-datacollector-]# systemctl status sdc

  • sdc.service - StreamSets Data Collector (SDC)

  Loaded: loaded (/usr/lib/systemd/system/sdc.service; static; vendor preset: disabled)

  Active: active (running) since Thu 2018-02-01 06:19:20 PST; 26s ago

Main PID: 31899 (_sdc)

  CGroup: /system.slice/sdc.service

          ├─31899 /bin/bash /opt/streamsets-datacollector/libexec/_sdc -verbose

          └─31939 /usr/bin/java -classpath /opt/streamsets-datacollector/libexec/bootstrap-libs/main/streamsets-datacollector-bootstrap-* -Djava.secu...


Feb 01 06:19:20 maprdemo.local streamsets[31899]: API_CLASSPATH                  : /opt/streamsets-datacollector/api-lib/*.jar

Feb 01 06:19:20 maprdemo.local streamsets[31899]: CONTAINER_CLASSPATH            : /etc/sdc:/opt/streamsets-datacollector/container-lib/*.jar

Feb 01 06:19:20 maprdemo.local streamsets[31899]: LIBS_COMMON_LIB_DIR            : /opt/streamsets-datacollector/libs-common-lib/

Feb 01 06:19:20 maprdemo.local streamsets[31899]: STREAMSETS_LIBRARIES_DIR       : /opt/streamsets-datacollector/streamsets-libs

Feb 01 06:19:20 maprdemo.local streamsets[31899]: STREAMSETS_LIBRARIES_EXTRA_DIR : /opt/streamsets-datacollector/streamsets-libs-extras/

Feb 01 06:19:20 maprdemo.local streamsets[31899]: USER_LIBRARIES_DIR             : /opt/streamsets-datacollector/user-libs/

Feb 01 06:19:20 maprdemo.local streamsets[31899]: JAVA OPTS                      : -Xmx1024m -Xms1024m -s...amsets-dataco

Feb 01 06:19:20 maprdemo.local streamsets[31899]: MAIN CLASS                     : com.streamsets.datacollector.main.DataCollectorMain

Feb 01 06:19:21 maprdemo.local streamsets[31899]: Logging initialized @945ms to org.eclipse.jetty.util.log.Slf4jLog

Feb 01 06:19:34 maprdemo.local streamsets[31899]: Running on URI : 'http://maprdemo:18630'

Hint: Some lines were ellipsized, use -l to show in full.


  1. Enable port forwarding

To access the StreamSets Data Collector UI from your host machine, make port 18630 accessible.

Here's how to do that if you use VirtualBox.


Select Settings for the Sandbox and then click on Network settings



Add an entry for host port 18630


Select OK.
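If you prefer the command line, the same forwarding rule can be added with VBoxManage while the VM is powered off. This is a hedged sketch: the VM name "MapR-Sandbox" is an assumption; substitute the name shown by `VBoxManage list vms`.

```shell
# Forward host port 18630 to guest port 18630 on the sandbox VM's NAT adapter
VBoxManage modifyvm "MapR-Sandbox" --natpf1 "sdc,tcp,,18630,,18630"
```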


  1. Log into SDC & verify MapR stages are visible

Log into the SDC UI at the following URL: http://localhost:18630

The default login is admin/admin.

Verify that you see MapR stages in the UI by first creating a pipeline.


Create a new pipeline


If all goes well, you should be able to see all the MapR stages as shown above.

I came across this series of podcasts about Kubernetes, hosted by Jim Scott, and I thought you might like them!



Happy listening!

What is happening now in machine learning is very much like the homebrew computer movement from a half-century ago.


Article by Ted Dunning  published on February 2nd 2018 at TDWI.


Can you name a technology that almost all of us have been using for 30 years that is paradoxically now considered to be the Next Big Thing?


That would be machine learning. It has been powering things such as credit card fraud detection since the late 1980s, about the same time that banks started widespread use of neural networks for signature and amount verification on checks.


Machine learning has been around a long time, even in very widely used applications. That said, there has been a massive technological revolution over the last 15 years.


This upheaval is normally described in recent news and marketing copy as revolutionary because what appeared to be impossibly hard tasks (such as playing Go, recognizing a wide range of images, or translating text in video on the fly) have suddenly yielded to modern methods tantalizing us with the promise that stunning new products are just around the corner.


The Real Change Isn't What You Think

In some sense, however, the important thing that has changed is a shift, taking machine learning from something that can be used in a few niche applications supported by an army of Ph.D.-qualified mathematicians and programmers into something that can turn a few weekends of effort by an enthusiastic developer into an impressive project. If you happen to have that army of savants, that is all well and good, but the real news is not what you can do with such an army. Instead, the real news is about what you can do without such an army.


Just recently, academic and industrial researchers have started to accompany the publication of their results with working models and the code used to build them. Interestingly, it is commonly possible to start with these models and tweak them just a bit to perform some new task, often taking just a fraction of a percent as much data and compute time to do this retuning. You can take an image-classification program originally trained to recognize images in any of 1,000 categories using tens of millions of images and thousands of hours of high-performance computer time for training and rejigger it to distinguish chickens from blue jays with a few thousand sample images and a few minutes to hours of time on your laptop. Deep learning has, in a few important areas, turned into cheap learning.


Over the last year or two, this change has resulted in an explosion of hobby-level projects where deep learning was used for all kinds of fantastically fun -- but practically pretty much useless -- projects. As fanciful and even as downright silly as these projects have been, they have had a very practical and important effect of building a reservoir of machine learning expertise among developers who have a wild array of different kinds of domain knowledge.


Coming Soon


Those developers who have been building machines to kick bluejays out of a chicken coop, play video games, sort Legos, or track their cat's activities will inevitably be branching out soon to solve problems they see in their everyday life at work. It will be a short walk from building a system for the joy of messing about to building systems that solve real problems.


What is happening now in machine learning is very much like the homebrew computer movement from a half-century ago. The first efforts resulted in systems that only a hacker could love, but before long we had the Apple II and then the Macintosh. What started as a burst of creative energy changed the world.

We stand on the verge of the same level of change.


About the Author

Ted Dunning is chief applications architect at MapR Technologies and a board member for the Apache Software Foundation. He is a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects and a mentor for several incubator projects. He was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud detection systems for ID Analytics (LifeLock). He has a Ph.D. in computing science from the University of Sheffield and 24 issued patents to date. He has co-authored a number of books on big data topics including several published by O’Reilly related to machine learning. Find him on Twitter as @ted_dunning. 

The MapR Music Catalog application by Tug Grall explains the key MapR-DB features, and how to use them to build a complete Web application. Here are the steps to develop, build and run the application:

  1. Introduction
  2. MapR Music Architecture
  3. Setup your environment
  4. Import the Data Set
  5. Discover MapR-DB Shell and Apache Drill
  6. Work with MapR-DB and Java
  7. Add Indexes
  8. Create a REST API
  9. Deploy to Wildfly
  10. Build the Web Application with Angular
  11. Work with JSON Arrays
  12. Change Data Capture
  13. Add Full Text Search to the Application
  14. Build a Recommendation Engine

The source code of the MapR Music Catalog application is available in this GitHub Repository.

Tüpras is a Turkish oil refining company, the largest industrial company in Turkey and the seventh-largest oil refinery in Europe. The Big Data & Analytics Team from Tüpras was the 2017 Gold Stevie Winner in the category 'IT Team of the Year'.

"In bullet-list form, briefly summarize up to ten (10) accomplishments of the nominated team since the beginning of 2016 (up to 150 words).

  • Establishing a new “Big Data platform based on Hadoop, MapR”
  • Integrating daily 300 billion raw process data depended on 200K sensors of 4 refineries into Big Data
  • Historical export of 10 years data integrated into Big Data
  • Reducing the data frequencies of 30s and 60s with the new platform to 1s frequency
  • Developing “Tüpras Historian Database-THD” for access and analysis of all refinery data with a single web-based application
  • “Management Information System-MIS” platform developed for visual analysis of approximately 50K metric/KPI calculated from process data. The platform supports self service reporting and Decision Making Support tools for “proactive monitoring & analysis”.
  • Developing “Engineering Platform” to run fast What-IF scenarios
  • Developing “Alarm Management” system to centralize and analyze DCS (distributed control system) alarms
  • Implementing of “Predictive Maintenance” scenarios based on Machine Learning
  • Developing IOS based mobile applications of MIS and THD"
In the attached PowerPoint presentation about the project, you may be particularly interested in slide 4, which shows the platform architecture, as well as the slides on why they selected MapR.
There are also a couple of interesting videos about the company and the project (in Turkish):
  • Tupra Big Data and Analytics video: ( 1'58 duration)
  • Tupras Intro Video


During execution, Gateway and Apex generate event log records that provide an audit trail. These can be used to understand the activity of the system and to diagnose problems. Usually, the event log records are stored in a local file system and can later be used for analysis and diagnostics.

Gateway also provides a universal ability to pass and store Gateway and Apex event log records in third-party sources. You can use external tools to store the log events and also to query and report on them. To do this, you must configure the logger appender in the Gateway configuration files.

Configuring Logger Appenders

Gateway and Apex Client processes run on the node where the Gateway instance is installed. Therefore, you can configure the logger appenders using the regular log4j properties file (datatorrent/releases/3.9.0/conf/…)

Following is an example of log4j properties configuration for Socket Appender:
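The properties example did not survive in this copy of the post; the following is a hedged sketch that mirrors the Socket Appender settings used in the attribute example later in this post (the hostname logstashnode1, port 5400, and the other appender properties come from that example; the root logger level is an assumption):

```
log4j.rootLogger=INFO, tcp
log4j.appender.tcp=org.apache.log4j.net.SocketAppender
log4j.appender.tcp.RemoteHost=logstashnode1
log4j.appender.tcp.Port=5400
log4j.appender.tcp.ReconnectionDelay=10000
log4j.appender.tcp.LocationInfo=true
```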


You can use the regular attribute property “apex.attr.LOGGER_APPENDER” to configure the logger appenders for Apex Application Master and Containers. This can be defined in the configuration file dt-site.xml (global, local, and user) or in the static and runtime application properties.

Use the following syntax to enter the logger appender attribute value:


Following is an example of logger appender attribute configuration for Socket Appender:

  <property>
    <name>apex.attr.LOGGER_APPENDER</name>
    <value>tcp;,
           log4j.appender.tcp.RemoteHost=logstashnode1,
           log4j.appender.tcp.Port=5400,
           log4j.appender.tcp.ReconnectionDelay=10000,
           log4j.appender.tcp.LocationInfo=true
    </value>
  </property>

Integrating with ElasticSearch and Splunk

You can use different methods to store event log records in an external data source. However, we recommend the following method:

Gateway and Apex can be configured to use the Socket Appender to send logger events to Logstash, and Logstash can deliver event log records to any output data source. For instance, the following picture shows the integration workflow with ElasticSearch and Splunk.

Following is an example of Logstash configuration:

input {
  # receive logger events from the Socket Appender
  log4j {
    mode => "server"
    port => 5400
    type => "log4j"
  }
}

filter {
  # transform logger events into event log records
  mutate {
    remove_field => [ "@version","path","tags","host","type","logger_name" ]
    rename => { "apex.user" => "user" }
    rename => { "apex.application" => "application" }
    rename => { "apex.containerId" => "containerId" }
    rename => { "apex.applicationId" => "applicationId" }
    rename => { "apex.node" => "node" }
    rename => { "apex.service" => "service" }
    rename => { "dt.node" => "node" }
    rename => { "dt.service" => "service" }
    rename => { "priority" => "level" }
    rename => { "timestamp" => "recordTime" }
  }
  date {
    match => [ "recordTime", "UNIX" ]
    target => "recordTime"
  }
}

output {
  # send event log records to the ElasticSearch cluster
  elasticsearch {
    hosts => ["esnode1:9200","esnode2:9200","esnode3:9200"]
    index => "apexlogs-%{+YYYY-MM-dd}"
    manage_template => false
  }
  # send event log records to Splunk
  tcp {
    host => "splunknode"
    mode => "client"
    port => 15000
    codec => "json_lines"
  }
}

ElasticSearch users can use the Kibana reporting tool for analysis and diagnostics. Splunk users can use Splunk Web.

Links to 3rd party tools:



MapR Persistent Application Client Containers (PACCs) support containerization of existing and new applications by providing containers with persistent data access from anywhere. PACCs are purposely built for connecting to MapR services. They offer secure authentication and connection at the container level, extensible support for the application layer, and can be customized and published in Docker Hub.


Microsoft SQL Server 2017 for Linux offers the flexibility of running MSSQL in a Linux environment. Like all RDBMSs, it needs a robust storage platform to persist its databases, where data is managed and protected securely.


By containerizing MSSQL with MapR PACCs, customers get the benefits of MSSQL, MapR, and Docker combined. Here, MSSQL offers robust RDBMS services that persist data into MapR for disaster recovery and data protection, while leveraging Docker technologies for scalability and agility.


The diagram below shows the architecture for our demonstration:


A MapR Cluster

Before you can deploy the container, you need a MapR cluster for persisting data to. There are multiple ways to deploy a MapR cluster. You can use a sandbox, or you can use MapR Installer for on-premises or cloud deployment. The easiest way to deploy MapR on Azure is through the MapR Azure Marketplace. Once you sign up for Azure, purchase a subscription that has enough quotas, such as CPU cores and storage, and fill out a form to answer some basic questions for the infrastructure and MapR, then off you go at the click of a button. A fully deployed MapR cluster should be at your fingertips within 20 minutes.


A VM with Docker CE/EE Running

Second, you need to spin up a VM in the same VNet or subnet where your MapR cluster is located. Docker CE/EE is required; for information on how to install it, see the Docker documentation. Docker supports a wide variety of OS platforms. We used CentOS for our demo.

Deploying the MSSQL Container

Once you have the MapR cluster and VM running, you can kick off your container deployment.


Step 1 - Build a Docker Image


Log in to your VM as root and run the following command:


curl -L | bash


In a few minutes, you should see a similar message to the one below, indicating a successful build:


Execute the following command to verify the image (mapr-azure/pacc-mssql:latest) is indeed stored in the local Docker repository:
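The verification command itself did not survive in this copy of the post; listing the local images is presumably what was intended:

```shell
# List local images; the mapr-azure/pacc-mssql repository should appear
docker images mapr-azure/pacc-mssql
```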

Step 2 – Create a Volume for MSSQL

Before starting up the container, you need to create a volume on the MapR cluster to persist the database into. Log in to the MapR cluster as user 'mapr' and run the following command to create a volume (e.g., vol1) mounted on path /vol1 in the filesystem:


maprcli volume create -path /vol1 -name vol1


You can get the cluster name by executing this command:


maprcli dashboard info -json | grep name


Step 3 – Start Up the Container

Run the following command to spin up the container with the image we just built in Step 1 above:


# docker run --rm --name pacc-mssql -it \
  --cap-add SYS_ADMIN \
  --cap-add SYS_RESOURCE \
  --device /dev/fuse \
  --security-opt apparmor:unconfined \
  --memory 0 \
  --network=bridge \
  -e SA_PASSWORD=m@prr0cks \
  -e MAPR_CLUSTER=mapr522 \
  -e MSSQL_BASE_DIR=/mapr/mapr522/vol1 \
  -e MAPR_MOUNT_PATH=/mapr \
  -e MAPR_TZ=Etc/UTC \
  -e MAPR_CLDB_HOSTS=<cldb-host-ip> \
  -p 1433:1433 \
  mapr-azure/pacc-mssql:latest
Note: you can replace -it with -d in the first line to run the startup process in the background.

You can customize the environment variables above to fit your environment. The variable SA_PASSWORD is for the MSSQL admin user. MAPR_CLUSTER is the cluster name. MSSQL_BASE_DIR is the path on MapR-XD where MSSQL will persist its data; the path usually takes the form /mapr/<cluster name>/<volume name>. MAPR_CLDB_HOSTS is the IP address of the CLDB hosts in the MapR cluster; in our case, we have a single-node cluster, so only one IP is used. Finally, the default MSSQL port is 1433. You can use the -p option in Docker to expose it on a port of your choice on the VM host. We selected the same port, 1433, in the demo.


There are other environment variables you can pass into the MapR PACC. For more information, please refer to the MapR PACC documentation.


In a few minutes, you should see a message like the one below that indicates the MSSQL server is ready:


2017-11-16 22:54:30.49 spid19s     SQL Server is now ready for client connections. This is an informational message; no user action is required.

Step 4 – Create a Table in MSSQL, and Insert Some Data

Now you are ready to insert some sample data into a test MSSQL database. To do so, find the container ID of the running MSSQL container by issuing this command:

Then use the docker exec command to log in to the container:

Then, issue the command below to get to an MSSQL prompt, providing the admin password you set when starting the container in step 3 above:

Issue the following MSSQL statements to populate an inventory table in a test database, then query the table:
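The commands for this step (finding the container, opening a shell, and the SQL itself) appeared as screenshots in the original post. Below is a hedged sketch: the database name testdb and the inventory rows are illustrative assumptions, and the sqlcmd path is the standard location in Microsoft's SQL Server Linux images.

```shell
# Find the container ID of the running MSSQL container
docker ps --filter name=pacc-mssql

# Open a shell inside the container (substitute the actual container ID)
docker exec -it <container-id> /bin/bash

# Inside the container, connect with the admin password set in step 3
/opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P 'm@prr0cks'

# At the sqlcmd prompt, create, populate, and query a sample inventory table:
#   CREATE DATABASE testdb;
#   GO
#   USE testdb;
#   CREATE TABLE inventory (id INT, name NVARCHAR(50), quantity INT);
#   INSERT INTO inventory VALUES (1, 'banana', 150), (2, 'orange', 154);
#   GO
#   SELECT * FROM inventory;
#   GO
```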

Success! This means the database has been persisted into the MapR volume and is now managed and protected by MapR-XD storage. You can verify by issuing the "ls" command in the container: the MSSQL log, secret, and data directories show up in vol1:

Step 5 – Destroy Current Container, Relaunch a New Container, and Access the Existing Table


Now let’s destroy the current container to simulate a server outage by issuing this command:


# docker rm -f c2e69e75b181


Repeat step 3 above to launch a new container. Once the new container is up and running, log in to it and query the same inventory table right away:

With a huge sense of relief, you see the data previously entered is still there, thanks to MapR!


Step 6 – Scale It Up and Beyond


With the container technology know-how in place, it is extremely easy to spin up multiple containers all at once. Simply repeat steps 2 and 3 to assign each MSSQL container a new volume in MapR, and off you go.


In this blog, we demonstrated how to containerize MSSQL with the MapR PACC and persist its database into MapR for data protection and disaster recovery. MapR PACCs are a great fit for many other applications that require a scalable and robust storage layer to have their data managed and distributed. The MapR PACCs can also be deployed at scale with an orchestrator, like Kubernetes, Mesos, or Docker, to achieve true scalability and high availability.

To learn how to create an HDInsight Spark cluster in the Microsoft Azure Portal, please refer to part one of my article. After creating the Spark cluster, I have highlighted the URL of my cluster below.

Microsoft Azure


Microsoft Azure


A total of 4 nodes are created -- 2 Head Nodes and 2 Name Nodes -- for a total of 16 cores. Out of the available capacity of 60 cores, 16 are used and 44 remain for scaling up. You can also click through to the Cluster Dashboard and Ambari View, and you can scale the size of the cluster.

Apache Ambari provides management and monitoring of Hadoop clusters through a web UI and REST services. Ambari is used to monitor clusters and make configuration changes, and it makes provisioning, monitoring, and managing clusters easier. Using Ambari, you can manage centralized security setup and get full visibility into cluster health. The Ambari Dashboard looks like this:


Microsoft Azure


Using the Ambari Dashboard, you can manage and configure services, hosts, and alerts for critical conditions. Many services are also integrated into the Ambari web UI. Below is the Hive Query Editor in Ambari:


Microsoft Azure


You can write and run Hive queries in the Ambari web UI, convert the results into charts, save queries, and manage your query history.


Microsoft Azure


The snapshot above shows the list of services available in Ambari, and below is the HDInsight SuketuSpark clients list.


Microsoft Azure


You can type the notebook URL in a new browser tab, or click directly on the Jupyter logo in the Azure portal, to open a Jupyter notebook. The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more. Jupyter and Zeppelin are the two notebooks integrated with HDInsight.


Microsoft Azure

You can use Jupyter notebook to run Spark SQL queries against the Spark cluster. HDInsight Spark clusters provide two kernels that you can use with the Jupyter notebook.

  • PySpark (for applications written in Python)
  • Spark (for applications written in Scala)

PySpark is the Python binding for the Spark platform and API and is not much different from the Java/Scala versions. Learning Scala is a better choice than Python, as Scala, being a functional language, makes it easier to parallelize code, which is a great feature when working with big data.

Like Java, Scala is object-oriented, and uses a curly-brace syntax reminiscent of the C programming language. Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching.

When you type the URL, or click the Zeppelin icon in the Azure portal, the Zeppelin notebook opens in a new browser tab. Below is a snapshot of that.


Microsoft Azure


Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful, data-driven, interactive, and collaborative documents with SQL, Scala, and more.

Flexible searching and indexing for web applications and sites is almost always useful and sometimes absolutely essential. While there are many complex solutions that manage data and allow you to retrieve and interact with it through HTTP methods, ElasticSearch has gained popularity due to its easy configuration and incredible malleability.

Elasticsearch is an open-source search engine built on top of Apache Lucene, a full-text search-engine library.


Basic Crud
CRUD stands for create, read, update, and delete. These are all operations that are needed to effectively administer persistent data storage. Luckily, these also have logical equivalents in HTTP methods, which makes it easy to interact using standard methods. The CRUD methods are implemented by the HTTP methods POST, GET, PUT, and DELETE respectively.

In order to use ElasticSearch for anything useful, such as searching, the first step is to populate an index with some data. This process is known as indexing.


Documents are indexed—stored and made searchable—by using the index API.

In ElasticSearch, indexing corresponds to both "Create" and "Update" in CRUD: if we index a document with a given type and ID that doesn't already exist, it's inserted. If a document with the same type and ID already exists, it's overwritten.


From our perspective as users of ElasticSearch, a document is a JSON object. As such, a document can have fields in the form of JSON properties. Such properties can be values such as strings or numbers, but they can also be other JSON objects.


In order to create a document, we make a PUT request to the REST API at a URL made up of the index name, type name, and ID. That is: http://localhost:9200/<index>/<type>/[<id>], and include a JSON object as the PUT data.

Index and type are required, while the ID part is optional. If we don't specify an ID, ElasticSearch will generate one for us; however, in that case we should use POST instead of PUT. The index name is arbitrary. If there isn't an index with that name on the server already, one will be created using the default configuration.


As for the type name, it too is arbitrary. It serves several purposes, including:

  • Each type has its own ID space.
  • Different types can have different mappings (a "schema" that defines how properties/fields should be indexed).
  • Although it's possible, and common, to search over multiple types, it's easy to search only for one or more specific type(s).



Let’s index something! We can put just about anything into our index as long as it can be represented as a single JSON object. For the sake of having something to work with we’ll be indexing, and later searching for, movies. Here’s a classic one:

Sample JSON object


To index the above JSON object we decide on an index name (“movies”), a type name (“movie”) and an ID (“1”) and make a request following the pattern described above with the JSON object in the body.

A request that indexes the sample JSON object as a document of type ‘movie’ in an index named ‘movies’
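The request was shown as an image in the original post; a hedged reconstruction follows. The index "movies", type "movie", and ID "1" come from the text above, while the movie's fields are illustrative assumptions:

```shell
curl -XPUT "http://localhost:9200/movies/movie/1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972
}'
```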

Execute the above request using cURL, or paste it into Sense and hit the green arrow to run it. After doing so, given that ElasticSearch is running, you should see a response looking like this:

Response from ElasticSearch to the indexing request.
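The response body was an image in the original post; based on the description that follows (the first three properties echo the URL, _version is 1, and created is true), it would look like this:

```
{
  "_index": "movies",
  "_type": "movie",
  "_id": "1",
  "_version": 1,
  "created": true
}
```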



The request for, and result of, indexing the movie in Sense.

As you see, the response from ElasticSearch is also a JSON object. Its properties describe the result of the operation. The first three properties simply echo the information that we specified in the URL we made the request to. While this can be convenient in some cases, it may seem redundant. However, remember that the ID part of the URL is optional; if we don't specify an ID, the _id property will be generated for us, and its value may then be of great interest to us.


Related Article: Sort the Results Using a Sort Property


The fourth property, _version, tells us that this is the first version of this document (the document with type “movie” with ID “1”) in the index. This is also confirmed by the fifth property, “created”, whose value is true.

Now that we’ve got a movie in our index let’s look at how we can update it, adding a list of genres to it. In order to do that we simply index it again using the same ID. In other words, we make the exact same indexing request as as before but with an extended JSON object containing genres.

Indexing request with the same URL as before but with an updated JSON payload.
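A hedged reconstruction of the updated request (same URL as before; the document fields and genre values are illustrative assumptions):

```shell
curl -XPUT "http://localhost:9200/movies/movie/1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972,
  "genres": ["Crime", "Drama"]
}'
```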


This time the response from ElasticSearch looks like this:

The response after performing the updated indexing request.
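The response body was an image in the original post; based on the description that follows (_version is now 2 and created is false), it would look like this:

```
{
  "_index": "movies",
  "_type": "movie",
  "_id": "1",
  "_version": 2,
  "created": false
}
```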



Not surprisingly, the first three properties are the same as before. However, the _version property now reflects that the document has been updated, as it now has a version number of 2. The created property is also different, now having the value false. This tells us that the document already existed and therefore wasn't created from scratch.


It may seem that the created property is redundant. Wouldn't it be enough to inspect the _version property to see if its value is greater than one? In many cases that would work. However, if we were to delete the document, the version number wouldn't be reset, meaning that if we later indexed a document with the same ID the version number would be greater than one.


So, what’s the purpose of the _version property then? While it can be used to track how many
times a document has been modified it’s primary purpose is to allow for optimistic concurrency


If we supply a version in indexing requests ElasticSearch will then only overwrite the document
if the supplied version is the same as for the document in the index. To try this out add a version
query string parameter to the URL of the request with “1” as value, making it look like this:

Indexing request with a ‘version’ query string parameter.
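A hedged reconstruction (the version=1 parameter is from the text; the document body is an illustrative assumption):

```shell
curl -XPUT "http://localhost:9200/movies/movie/1?version=1" -d '
{
  "title": "The Godfather",
  "director": "Francis Ford Coppola",
  "year": 1972,
  "genres": ["Crime", "Drama"]
}'
```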


Now the response from ElasticSearch is different. This time it contains an error property with a message explaining that the indexing didn’t happen due to a version conflict.


Response from ElasticSearch indicating a version conflict.


Getting by ID
We’ve seen how to index documents, both new ones and existing ones, and have looked at how ElasticSearch responds to such requests. However, we haven’t actually confirmed that the documents exist, only that ES tells us so.

So, how do we retrieve a document from an ElasticSearch index? Of course, we could search for it. However, that’s overkill if we only want to retrieve a single document with a known ID. A simpler and faster approach is to retrieve it by ID.


In order to do that we make a GET request to the same URL as when we indexed it, only this time the ID part of the URL is mandatory. In other words, in order to retrieve a document by ID from ElasticSearch we make a GET request to http://localhost:9200/&lt;index&gt;/&lt;type&gt;/&lt;id&gt;. Let’s try it with our movie using the following request:

curl -XGET "http://localhost:9200/movies/movie/1"
As you can see, the result object contains similar metadata to what we saw when indexing, such as _index, _type and _version. Last but not least, it has a property named _source which contains the actual document body. There’s not much more to say about GET as it’s pretty straightforward. Let’s move on to the final CRUD operation.


Deleting documents
In order to remove a single document from the index by ID we again use the same URL as for indexing and retrieving it, only this time we change the HTTP verb to DELETE.


Request for deleting the movie with ID 1.


curl -XDELETE "http://localhost:9200/movies/movie/1"


The response object contains some of the usual suspects in terms of metadata, along with a property named "found" indicating that the document was indeed found and that the operation was successful.

Response to the DELETE request.

{
    "found": true,
    "_index": "movies",
    "_type": "movie",
    "_id": "1",
    "_version": 3
}
If we, after executing the DELETE request, switch back to GET we can verify that the document has indeed been deleted:

Response when making a GET request for the deleted document.

{
    "_index": "movies",
    "_type": "movie",
    "_id": "1",
    "found": false
}


Simple Spark Tips #1

Posted by MichaelSegel Oct 19, 2017

Many developers are switching over to using Spark and Spark SQL as a way to ingest and use data.  As an example, you could be asked to take a .csv file and convert it into a Parquet file, or even a Hive or MapR-DB table.


With Spark, it's very easy to do this... you just load the file into a DataFrame/Dataset and then write it out as a Parquet file and you're done. The code to create the DataFrame:

val used_car_databaseDF =
        .option("header", "true")        // reading the headers
        .option("mode", "DROPMALFORMED") // drop rows that fail to parse
        .csv(used_cars_databaseURL)

where used_cars_databaseURL is a String containing the path to the file that I had created earlier in my code.


But suppose you want to work with the data as a SQL table? Spark allows you to create temporary tables/views of the data, registering the DataFrame (DF) under a table 'alias'.


Here I've created a table/view used_cars, which I can now use in a Spark SQL command:

used_car_databaseDF.createOrReplaceTempView("used_cars")

spark.sql("SELECT COUNT(*)  FROM used_cars ").show();

Obviously this is just a simple example, just to show that you can run a query and see its output.

If you're working with a lot of different tables, it's easy to lose track of the tables/views that you've created.


But Spark does have a couple of commands which will allow you to view the list of tables that you have already set up for use: the spark.catalog API. Below is some sample code I pulled from my notebook where I have been experimenting with using Spark and MapR.

import org.apache.spark.sql.catalog
import org.apache.spark.sql.SparkSession

//Note: We need to look at listing columns for each table...
spark.catalog.listColumns("creditcard").show // Another test table

// Now lets try to run thru catalog
println("Testing walking thru the table catalog...")
val tableDS = spark.catalog.listTables
println("There are "+ tableDS.count + " rows in the catalog...")

tableDS.printSchema // Prints the structure of the objects in the dataSet

tableDS.collect.foreach{ e => println( }

// Now trying a different way...

tableDS.collect.foreach{ e =>
    val n =
    println("Table Name: " + n)
    spark.catalog.listColumns(n).collect.foreach{ c => println("\t" + + "\t" + c.dataType) }
}

Note: While I am not sure if I needed to pull in org.apache.spark.sql.SparkSession, it doesn't seem to hurt.


In the sample code, I use the method show(), which formats the output and displays it. However, show() is limited to only the first 20 rows of output, regardless of the source. This can be problematic, especially if you have more than 20 temp tables in your session, or if a table has more than 20 columns.

For more information on the spark catalog, please see: 

listTables() is a method that returns a list of Table objects describing the temporary views that I have created.


As I said earlier, while show() will give you a nice formatted output, it limits you to the first 20 rows of output.

So let's instead print out our own list of tables. I used the printSchema() method to identify the elements of the Table object. The output looks like this:

|-- name: string (nullable = true)
|-- database: string (nullable = true)
|-- description: string (nullable = true)
|-- tableType: string (nullable = true)
|-- isTemporary: boolean (nullable = false)

For my first example, I'm walking through the list of tables and printing out the name of the table.


In my second example, for each table I want to print out the table's schema (in this example, only the column name and its data type). This works well when you have more than 20 tables or more than 20 columns in a table.


If we put this all together, we can load a file, apply filters, and then store the data. Without knowing the schema in advance, it's still possible to determine the data set's schema and then use that information to dynamically create a Hive table or to put the data into a MapR-DB table.


Securing Zeppelin

Posted by MichaelSegel Oct 5, 2017

Zeppelin is an Apache open source notebook that supports multiple interpreters and seems to be one of the favorites for working with Spark.


Zeppelin is capable of running on a wide variety of platforms, so it's possible to run tests and perform code development either away from a cluster or on one.


As always, in today’s world it is no longer paranoia to think about securing your environment. This article is meant as a supplement to the standard setup instructions, focusing on securing Zeppelin.


Zeppelin itself is easy to install. You can ‘wget’ or download the pre-built binaries from the Apache Zeppelin site, follow the instructions and start using it right away.


The benefits are that you can build a small test environment on your laptop without having to work on a cluster. This reduces the impact on a shared environment.


However, it's important to understand how to also set up and secure your environment.


Locking down the environment


Running Local

Zeppelin can run locally on your desktop platform. In most instances your desktop doesn’t have a static IP address. In securing your environment, you will want to force Zeppelin to only be available on the localhost/ interface.


In order to set this up, two entries in the $ZEPPELIN_HOME/conf/zeppelin-site.xml file have to be set.





  <description>Server address</description>

  <description>Server port.</description>



Setting zeppelin.server.addr to listen only on the localhost address means that no one outside the desktop will be able to access the Zeppelin service.


Setting zeppelin.server.port to a value other than the default 8080 is done because that port is the default for many services. By going to a different and unique port you can keep this consistent between the instances on your desktop and on a server. While this isn’t necessary, it does make life easier.


Beyond those two settings, there isn’t much else that you need to change.

Notice that there are properties that allow you to set up SSL certificates. While the documentation contains directions on how to set up SSL directly, there is a bug where trying to run with PKCS#12 certificates causes an error. In trying to follow up, no resolution could be found. The recommendation is to use a proxy server (nginx) for managing the secure connection. (More on this later.)


Since the only interface is the interface, SSL really isn’t required.


The next issue is that by default, there is no user authentication. Zeppelin provides this through Shiro. From the Apache Shiro website:

Apache Shiro™ is a powerful and easy-to-use Java security framework that performs authentication, authorization, cryptography, and session management. With Shiro’s easy-to-understand API, you can quickly and easily secure any application – from the smallest mobile applications to the largest web and enterprise applications.


While it may be OK to run local code as an anonymous user, it's also possible for Zeppelin to run locally yet access a cluster that is maintained remotely, which may not accept anonymous users.


In order to set up Shiro, just copy shiro.ini.template to shiro.ini.

Since my SOHO environment is rather small and limited to a handful of people, I am not yet running LDAP. (Maybe one day...) So I practice the K.I.S.S. principle. The only thing I need from Shiro is to set up local users.


# List of users with their password allowed to access Zeppelin.

# To use a different strategy (LDAP / Database / ...) check the shiro doc at

#admin = password1, admin

#user1 = password2, role1, role2

#user2 = password3, role3

#user3 = password4, role2

If you find the [users] section in the template, you’ll notice that the list of admin and userX entries are not commented out. Any entry in the form of <user> = <password>, <role> [, <role2>, <role3> ...] will be active, so you really don’t want to leave an entry here that will give someone access.


So you will need to create an entry for yourself and other users.


Note: There are other entries below this section. One section details where you can use LDAP for authenticating users. If you are running LDAP, it would be a good idea to set this up to use LDAP.


Once you’ve made and saved the changes, that’s pretty much it. You can start/restart the service and you will authenticate against the entries in shiro.ini.


Note the following:

While attempting to follow the SSL setup, I was directed to a Stack Overflow conversation on how to accomplish this. IMHO, it’s a major red flag when the documentation references a Stack Overflow article on how to set up and configure anything.

Running on a Server

Suppose you want to run Zeppelin on your cluster? Zeppelin then becomes a shared resource. You can set Zeppelin to run Spark contexts per user or per notebook instead of one instance for the entire service.

The larger issue is that you will need to use an external interface to gain access, and even if you’re behind a firewall, you will need to have SSL turned on. Because I want to be able to run notes from outside of my network, I have to have SSL in place. As I alluded to earlier, the ability to configure SSL from within Zeppelin wasn’t working, and the only guidance was to instead set up a proxy using nginx. (This came from two reliable sources.)


With nginx, you should use the same configuration that we have already set up. Zeppelin will only listen on localhost and rely on the proxy server to handle external connections. Since my Linux server is sitting next to me, I have a monitor set up so I can easily test connections to localhost, ensuring that my Zeppelin instance is up and running. I followed the same steps that I used to set up my desktop and it ran without a hitch.


Unlike the instructions for trying to set up SSL directly, the information found on the Zeppelin site was very helpful. You can find a link to it here:


There are various ways of obtaining nginx. Since I run CentOS, I could have pulled down a version via yum, and if you run a different flavor of Linux, you can use its similar tool. Of course, downloading it from the official site will get you the latest stable release.

While the documentation for setting up nginx with Zeppelin is better, there are still gaps... yet it's still pretty straightforward.


Nginx installs in the /etc/nginx directory. Under this directory, all of the configuration files are located in the ./conf.d directory. I should have taken better notes, but going from memory, there was one file: the default configuration file. I suggest that you move it aside by copying it to another file name; I chose default.conf.ini. Based on the documentation, I was under the impression that nginx will look at all *.conf files for various setup data.


I then created a zeppelin.conf file and cut and pasted the section from the zeppelin documents.


upstream zeppelin {
    server [YOUR-ZEPPELIN-SERVER-IP]:[YOUR-ZEPPELIN-SERVER-PORT];   # For security, it is highly recommended to make this address/port non-public accessible
}

# Zeppelin Website
server {
    listen 443 ssl;                                     # optional, to serve HTTPS connection
    server_name [YOUR-ZEPPELIN-SERVER-HOST];            # for example:

    ssl_certificate [PATH-TO-YOUR-CERT-FILE];           # optional, to serve HTTPS connection
    ssl_certificate_key [PATH-TO-YOUR-CERT-KEY-FILE];   # optional, to serve HTTPS connection

    if ($ssl_protocol = "") {
        rewrite ^ https://$host$request_uri? permanent; # optional, to force use of HTTPS
    }

    location / {    # For regular webserver support
        proxy_pass http://zeppelin;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_set_header X-NginX-Proxy true;
        proxy_redirect off;
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }

    location /ws {  # For websocket support
        proxy_pass http://zeppelin/ws;
        proxy_http_version 1.1;
        proxy_set_header Upgrade websocket;
        proxy_set_header Connection upgrade;
        proxy_read_timeout 86400;
    }
}

As you can see, the configuration is pretty straightforward. Note the comment in the upstream zeppelin section. This is why using the loopback/localhost interface is a good idea.


Since the goal of using nginx is to create a secure (SSL) interface to Zeppelin, we need to create a public/private key pair. A quick Google search will turn up a lot of options. Note: If you don’t have OpenSSL already installed on your server, you should set it up ASAP. Using OpenSSL, the following command works:


openssl req -x509 -newkey rsa:4096 -keyout key.pem -out cert.pem -days 365


Note that this will create a 4096-bit key. That’s a bit of overkill for this implementation; however, the minimum key length for an SSL connection these days is 2048 bits, so it’s not unreasonably long.


Note: If you use this command as is, you will be required to provide a simple password, which is used to encrypt the key. The downside is that each time you want to start/stop the web service, you will be required to manually enter the passphrase. Using the -nodes option removes this requirement; however, the key is then stored unencrypted. You can change the permissions on the key file to control access.
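For example, a passphrase-free variant might look like the following; the rsa:2048 size and the /CN=localhost subject are placeholder choices, and the chmod is the access-control step just mentioned:

```shell
# Generate a self-signed cert and an unencrypted (-nodes) private key,
# then restrict the key file so only its owner can read it.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=localhost" \
        -keyout key.pem -out cert.pem -days 365 2>/dev/null
chmod 600 key.pem
openssl x509 -noout -subject -in cert.pem   # sanity-check the certificate
```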


For nginx, I created the key pair in the ./conf.d directory and set their paths in the zeppelin.conf file.


After the edits, if you start the service, you’re up and running.

Well almost….


Further Tweaking


If you try to use the service, nginx asks your for a user name and password.

    location / {    # For regular webserver support
        proxy_pass http://zeppelin;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_set_header X-NginX-Proxy true;
        proxy_redirect off;
        auth_basic "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }


In this section, the auth_basic directive is set to the realm string "Restricted," indicating that a password check has to be performed.

The auth_basic_user_file setting points at the password file.
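One hedged way to create that password file without Apache's htpasswd utility is OpenSSL's passwd helper; the user name, password, and file name below are placeholders:

```shell
# Write a 'user:apr1-hash' line in the format nginx's auth_basic module
# expects. 'zeppelinuser'/'changeme' are placeholder credentials; the
# real file belongs at the auth_basic_user_file path configured above.
printf '%s:%s\n' "zeppelinuser" "$(openssl passwd -apr1 'changeme')" > htpasswd.demo
cat htpasswd.demo
```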


The instructions on how to set this up are also found on Zeppelin’s setup page. While this is secure, you enter a password at the proxy before being able to reach the protected web service. Does this make sense? You need a password to access a website that then asks you to log in again before you can use it? Our goal in using nginx was to set up SSL so that any traffic between the client and the server is encrypted and not out in the open. For our use case, it makes more sense that connecting to the service establishes an SSL socket and then takes you to your Zeppelin service, where you then authenticate.


The simple fix is to set auth_basic to off. This allows you to still authenticate the user in Zeppelin without having to log in twice, and your notebooks do not run as ‘anonymous’.


In Summary


Running Zeppelin out of the box with no security is not a good idea. This article helps to demonstrate some simple tweaks that help lock down your environment so that you can run Zeppelin on a desktop or connect to a service running on the edge of your cluster.

I am sure that there are other things one could do to further lock down Zeppelin or tie it in to your network. At a minimum you will want to authenticate your users via Shiro as well as offer an SSL connection.


With Zeppelin up and running, now the real fun can begin.

Multi-cloud environments can be an effective way to hedge risk and enable future flexibility for applications. With a secure VPN in between two or more sites, you can leverage the global namespace and high availability features in MapR (like mirroring and replication) to drive business continuity across a wide variety of use cases.


In this tutorial, I will walk through the steps of how to connect Microsoft Azure and Amazon AWS cloud using an IPsec VPN, which will enable secure IP connectivity between the two clouds and serve as a Layer-3 connection you can then use to connect two or more MapR clusters.


Multi-Cloud Architecture with Site-to-Site VPN


Let's first take a look at the end result for which we're aiming.



On the left side of the figure below is an Azure setup with a single Resource Group (MapR_Azure). We'll set this up in the 'US West' zone. On the right side is an Amazon EC2 network in a VPC, which we will deploy in the Northern Virginia zone. This is an example of using geographically dispersed regions to lessen the risk of disruptions to operations. After the VPN is completed, we can use MapR replication to ensure that data lives in the right place and applications can read/write to both sites seamlessly.


This "site-to-site" VPN will be encrypted and run over the open Internet, with a separate IP subnet in each cloud, using the VPN gateways to route traffic through a tunnel. Note that this is a "Layer 3" VPN, in that traffic is routed between two subnets using IP forwarding. It's also possible to do this at Layer 2, bridging Ethernet frames between the two networks, but I'll leave that for another post (or an exercise for the curious reader).


Setting up the Azure Side


First, prepare the Resource Group; in our example, we called the group 'MapR_Azure.'

Select it, and find the 'Virtual Network' resource by selecting the '+' icon, then type 'virtual network' in the search box.




Select 'Resource Manager' as the deployment model and press 'Create.' We use the name 'clusternet' for the network.


We will create two subnets, one on each cloud side. On Azure, we create the address range and a single subnet of within that range. We'll make the Azure and EC2 address prefixes 10.11 and 10.10, respectively, to make for easy identification and troubleshooting.



Create a public IP for the VPN connections. Select '+' to add a resource, then type 'public IP address' in the search box. Press 'Create,' then set up the IP as follows. Select 'Use existing' for the Resource Group, and keep 'Dynamic' for the IP address assignment. The name is 'AzureGatewayIP.'



Take note of this address; we will use it later.


Next, create a 'Virtual Network Gateway' the same way. This entity will serve as the VPN software endpoint.



Reference the public IP you created in the previous step (in our case, 'AzureGatewayIP').


Note the concept of a 'GatewaySubnet' here. This is a subnet that is to be used exclusively by the Azure VPN software. It must be within the configured address range, and you can't connect any other machines to it. Microsoft says "some configurations require more IP addresses to be allocated to the gateway services than do others."  It sounds a little mysterious, but allocating a /24 network seems to work fine for most scenarios.


Select 'clusternet' as the virtual network (what you created in the earlier step), use a Gateway subnet of, and use the 'AzureGatewayIP' address. This will create a new subnet entry called 'GatewaySubnet' for the network.


For testing purposes, select the VpnGw1 SKU. This allows up to 650 Mbps of network throughput, which is more than enough to connect a couple of small clusters, but you can go up to 1.25 Gbps with the VpnGw3 SKU.


This may take up to 45 minutes (according to Microsoft) but it usually completes in a few minutes.


Setting up the AWS Side


We need to pause here to set up a few things on AWS. First, create a VPC in the AWS Console VPC Dashboard. Here we set the IPv4 address range as



Navigate to 'Subnets,' and create a subnet in the VPC. Here we use



Next, create an Internet gateway to connect our VPC to the internet. This step is important (and easily overlooked); otherwise, traffic cannot be routed between the Elastic IP and the subnet we just created.




Select 'Attach to VPC,' and use the new VPC 'clusterVPC'.




Go back to the EC2 Dashboard and select 'Launch Instance.' We will create an Amazon Linux instance to maintain the VPN. Select the 'Amazon Linux' AMI, and configure the instance details as follows:





Be sure to select the 'clusterVPC' we just created and 'clustersubnet' for the subnet. Select 'Disable' for 'Auto-assign Public IP' because we want to use an Elastic IP that we will associate later.


Under the last step, select 'Edit Security Groups,' and then select 'Create a new security group.' Open the group to all traffic coming from the AzureGatewayIP we configured previously. Also (optionally), add any rules that you need to connect to the instance via ssh.



Click on 'Launch,' and optionally create a new key pair or use an existing one for the instance.


While the instance launches, let's create an Elastic IP for the VPN endpoint. In the 'Network & Security' menu of the AWS Console, select 'Elastic IPs,' and then allocate an address.



Note this address for later.


Associate the address with the instance we just created.



Finalizing the connection on the Azure side


Let's return to the Azure setup, and use the information from AWS to complete the connection. Add a Local Network Gateway.



The term 'local' is a bit of a misnomer here because the other network is not a local one; it's another cloud network. Microsoft uses the term 'local' to refer to an on-premises network that you might want to connect to Azure. For 'IP address,' use the Elastic IP you created in the previous section. For 'Address space,' use the range (the AWS subnet).


Next, add a Connection. Select 'Site-to-site' VPN, and fill in the remaining details for your Resource Group.



Select the AzureGateway we configured as well as the Local Network Gateway (AWSVPN). Enter a key that will be used for the session.



Now is a good time to launch an instance for testing. Type 'Ubuntu Server' into the search box, and select Ubuntu Server 14.04 LTS. Configure the instance details, size, and settings.



Under the last Settings window, configure the virtual network, subnet, and 'None' for a public IP address (we don't need one because the VPN will handle outbound/inbound connectivity). Select a new or existing network security group.



Finalizing the connection on the AWS side


It's a good time to make sure the subnet you created has a default route to the internet gateway. From the AWS Console, navigate to 'Route Tables' and find the subnet associated with your VPC, select the subnet and the 'Routes' tab, and add a default route:



Returning to AWS, ssh into the instance we just created and download/install strongswan along with some dependency packages.


sudo yum install gcc gmp-devel


bzip2 -d strongswan-5.6.0.tar.bz2

tar xvf strongswan-5.6.0.tar

cd strongswan-5.6.0

./configure && make && sudo make install


This should install strongswan in /usr/local, where we will edit the configuration files.


Edit the file /usr/local/etc/ipsec.conf and add the following entry:

conn azure


The 'left' and 'leftsubnet' options refer to the Amazon (local) side. Use the local private IP address and subnet. For the right side, use the AzureGatewayIP we configured and the 'clusternet' subnet.
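Filling in those options, the 'conn azure' entry might look like the following sketch. The bracketed value is a placeholder for your AzureGatewayIP public address, the keyexchange setting is an assumption (Azure route-based VPN gateways generally negotiate IKEv2), and the subnets follow the 10.10 (AWS) / 10.11 (Azure) prefixes used in this walkthrough:

```
conn azure
    type=tunnel
    authby=secret
    keyexchange=ikev2          # assumption: IKEv2 for Azure route-based gateways
    left=%defaultroute         # this instance's private address
    right=[AZURE-GATEWAY-IP]   # the AzureGatewayIP public address
    auto=start
```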


Finally, edit the file /usr/local/etc/ipsec.secrets and add your shared secret key (a bare ": PSK" entry applies the key to all connections):

: PSK "testing123"


Start the VPN with:


sudo /usr/local/sbin/ipsec start


You can run sudo tail -f /var/log/messages to check the status of the connection.


You should now be able to ping the Azure machine using the address of its single interface. You can check it on the Azure side by viewing the Connection:



If you see 'Connected' as above, congratulations: you have a working two-cloud environment!


Additional Notes


Here are a few other concerns to watch when embarking on a multi-cloud adventure.


Ingress and Egress Data Transfers


Inter-site bandwidth is something to consider in your plan. At the time of this writing, most use cases of data transfer into EC2 are free, with some exceptions. Data transfer out is free to most other Amazon services, like S3 and Glacier, and also free up to 1GB/month to other internet-connected sites, but costs a small amount per GB after that.


Data transfer in Azure is similar: all inbound data transfers are free, and there is a schedule of costs for outgoing transfers to the internet and other Azure zones.


Bandwidth and Latency


Amazon has a page on how to check bandwidth. Doing some quick tests with iperf between the two sites, here are some typical results:


Accepted connection from, port 35688
[ 5] local port 5201 connected to port 35690
[ ID] Interval Transfer Bitrate
[ 5] 0.00-1.00 sec 36.7 MBytes 308 Mbits/sec
[ 5] 1.00-2.00 sec 39.8 MBytes 334 Mbits/sec
[ 5] 2.00-3.00 sec 42.1 MBytes 353 Mbits/sec
[ 5] 3.00-4.00 sec 39.7 MBytes 333 Mbits/sec
[ 5] 4.00-5.00 sec 30.5 MBytes 256 Mbits/sec
[ 5] 5.00-6.00 sec 30.0 MBytes 252 Mbits/sec
[ 5] 6.00-7.00 sec 30.9 MBytes 259 Mbits/sec
[ 5] 7.00-8.00 sec 36.7 MBytes 308 Mbits/sec
[ 5] 8.00-9.00 sec 41.5 MBytes 348 Mbits/sec
[ 5] 9.00-10.00 sec 37.0 MBytes 311 Mbits/sec
[ 5] 10.00-10.03 sec 977 KBytes 245 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval Transfer Bitrate
[ 5] 0.00-10.03 sec 366 MBytes 306 Mbits/sec receiver


That's some pretty hefty bandwidth (306 Mbits/sec) between the two sites.


Ready to set up a MapR cluster?  


By Ronald van Loon


Editor's Note: This post was originally published on LinkedIn on June 12, 2017.

The world is long past the Industrial Revolution, and now we are experiencing an era of Digital Revolution. Machine Learning, Artificial Intelligence, and Big Data Analysis are the reality of today’s world.

I recently had a chance to talk to Ciaran Dynes, Senior Vice President of Products at Talend and Justin Mullen, Managing Director at Datalytyx. Talend is a software integration vendor that provides Big Data solutions to enterprises, and Datalytyx is a leading provider of big data engineering, data analytics, and cloud solutions, enabling faster, more effective, and more profitable decision-making throughout an enterprise.

The Evolution of Big Data Operations

To understand more about the evolution of big data operations, I asked Justin Mullen about the challenges his company faced five years ago and why they were looking for modern integration platforms. He responded with, “We faced similar challenges to what our customers were facing. Before Big Data analytics, it was what I call ‘Difficult Data analytics.’ There was a lot of manual aggregation and crunching of data from largely on premise systems. And then the biggest challenge that we probably faced was centralizing and trusting the data before applying the different analytical algorithms available to analyze the raw data and visualize the results in meaningful ways for the business to understand.”


He further added that, “Our clients not only wanted this analysis once, but they wanted continuous refreshes of updates on KPI performance across months and years. With manual data engineering practices, it was very difficult for us to meet the requirements of our clients, and that is when we decided we needed a robust and trustworthy data management platform that solves these challenges.”

Automation and Data Science

Most economists and social scientists are concerned about the automation that is taking over manufacturing and commercial processes. If digitalization and automation continue to grow at their current pace, there is a high probability of machines partly replacing humans in the workforce. We are seeing some examples of this phenomenon in our world today, but it is predicted to be far more prominent in the future.

However, Dynes says, “Data scientists are providing solutions to intricate and complex problems confronted by various sectors today. They are utilizing useful information from data analysis to understand and fix things. Data science is an input and the output is yielded in the form of automation. Machines automate, but humans provide the necessary input to get the desired output.”

This creates a balance in the demand for human and machine services. Automation and data science proceed in parallel; one process is incomplete without the other. Raw data is worth nothing if it cannot be manipulated to produce meaningful results, and similarly, machine learning cannot happen without sufficient and relevant data.

Start Incorporating Big Data and Machine Learning Solutions into Business Models

Dynes says, “Enterprises are realizing the importance of data, and are incorporating Big Data and Machine Learning solutions into their business models.” He further adds that, “We see automation happening all around us. It is evident in the ecommerce and manufacturing sectors, and has vast applications in mobile banking and finance.”

When I asked him about his opinion regarding the transformation in the demand of machine learning processes and platforms, he added that, “The demand has always been there. Data analysis was equally useful five years ago as it is now. The only difference is that five years ago there was entrepreneurial monopoly and data was stored secretively. Whoever had the data, had the power, and there were only a few prominent market players who had the access to data.”

Justin has worked with different companies. Some of his most prominent clients were Calor Gas, Jaeger and Wejo. When talking about the challenges those companies faced before implementing advanced analytics or machine learning he said, “The biggest challenges most of my clients face was the accumulation of the essential data at one place so that the complex algorithms can be run simultaneously but the results can be viewed in one place for better analysis. The data plumbing and data pipelines were critical to enable data insights to become continuous rather than one-off.”

The Reasons for Rapid Digitalization

Dynes says, “We are experiencing rapid digitalization for two major reasons: technology has evolved at an exponential rate in the last couple of years, and organizational culture has evolved massively.” He adds, “With the advent of open source technologies and cloud platforms, data is now more accessible. More people now have access to information, and they are using it to their benefit.”

In addition to the advancements in technology, “the new generation entering the workforce is also tech dependent. They rely heavily on technology for their everyday, mundane tasks, and they are more open to transparent communication. Therefore, it is easier to gather data from this generation, because they are ready to talk about their opinions and preferences. They are ready to ask and answer impossible questions,” says Dynes.

Integrating a New World with the Old World

When talking about the challenges companies face when adopting Big Data analytics solutions, Mullen adds, “The challenges currently faced by industry while employing machine learning are twofold. The first is related to data collection, data ingestion, data curation (quality), and then data aggregation. The second is combating the lack of human skills in data engineering, advanced analytics, and machine learning.”

“You need to integrate a new world with the old world.

The old world relied heavily on data collection in big batches

while the new world focuses mainly on the real-time data solutions”

Dynes says, “You need to integrate a new world with the old world. The old world relied heavily on data collection while the new world focuses mainly on data solutions. Few solutions in the industry today deliver on both of these requirements at once.”


He concludes by saying, “The importance of data engineering cannot be neglected, and machine learning is like Pandora’s Box. Its applications are widely seen in many sectors, and once you establish yourself as a quality provider, businesses will come to you for your services, which is a good thing.”

Follow Ciaran Dynes, Justin Mullen, and Ronald van Loon on Twitter and LinkedIn for more interesting updates on Big Data solutions and machine learning.

Additional Resources

By Rachel Silver

Are you a data scientist, engineer, or researcher just getting into distributed processing with PySpark, who wants to run some of the fancy new Python libraries you've heard about, like Matplotlib?

If so, you may have noticed that it's not as simple as installing it on your local machine and submitting jobs to the cluster. In order for the Spark executors to access these libraries, they have to live on each of the Spark worker nodes.

You could go through and manually install each of these environments using pip, but maybe you also want the ability to use multiple versions of Python or other libraries like pandas? Maybe you also want to allow other colleagues to specify their own environments and combinations?

If this is the case, then you should look toward using conda environments to provide specialized, personalized Python configurations that are accessible to your Python programs. Conda is a tool that tracks conda packages and tarball files containing Python (or other) libraries, and that maintains the dependencies between packages and the platform.

Continuum Analytics provides an installer for conda called Miniconda, which contains only conda and its dependencies, and this installer is what we’ll be using today.

For this blog, we’ll focus on submitting jobs from spark-submit. In a later iteration of this blog, we’ll cover how to use these environments in notebooks like Apache Zeppelin and Jupyter.

Installing Miniconda and Python Libraries to All Nodes

If you have a larger cluster, I recommend using a tool like pssh (parallel SSH) to automate these steps across all nodes.

To begin, we’ll download and install the Miniconda installer for Linux (64-bit) on each node where Apache Spark is running. Please make sure, before beginning the install, that you have the bzip2 library installed on all hosts:
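The commands below are a sketch of that install, assuming a CentOS-based sandbox (hence yum) and the installer URL Continuum published for Miniconda at the time:

```shell
# Install bzip2, which the Miniconda installer requires (run as root)
yum install -y bzip2

# Download and run the Miniconda installer for 64-bit Linux
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
```

Repeat on every Spark worker node (or push the same commands out with pssh).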

I recommend choosing /opt/miniconda3/ as the install directory, and, when the install completes, you need to close and reopen your terminal session.

If your install is successful, you should be able to run ‘conda list’ and see the following packages:


Miniconda installs an initial default environment running Python 3.6.1. To make sure the installation worked, run a version command:

python -V
Python 3.6.1 :: Continuum Analytics, Inc.

To explain what’s going on here: we haven’t removed the previous default version of Python, which can still be found at the default path, /bin/python. We’ve simply added new Python environments, much like Java alternatives, that we can point to when submitting jobs, without disrupting our cluster environment. See:

/bin/python -V
Python 2.7.5

Now, let’s go ahead and create a test environment with access to Python 3.5 and NumPy libraries.

First, we create the conda environment and specify the Python version (do this as your cluster user):

conda create --name mapr_numpy python=3.5

Next, let’s go ahead and install NumPy to this environment:

conda install --name mapr_numpy numpy

Then, let’s activate this environment, and check the Python version:
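A sketch of what that looks like (the `source activate` form matches conda releases of that era; newer conda versions use `conda activate` instead):

```shell
# Activate the new environment and confirm what it provides
source activate mapr_numpy
python -V                                    # expect a 3.5.x build
python -c "import numpy; print(numpy.__version__)"
```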


Please complete these steps for all nodes that will run PySpark code.

Using Spark-Submit with Conda

Let’s begin with something very simple, referencing environments and checking the Python version to make sure it’s being set correctly. Here, I’ve made a tiny script that prints the Python version:
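The script itself can be as small as this (the filename version_check.py is my own choice, not from the original post):

```python
# version_check.py: print the version of the interpreter that actually
# runs this code, so we can confirm which environment spark-submit used
import sys

print("Python version: %s" % sys.version.split()[0])
```

Submitting it with PYSPARK_PYTHON pointed at a conda environment, e.g. `PYSPARK_PYTHON=/opt/miniconda3/envs/mapr_numpy/bin/python spark-submit version_check.py`, should print the environment's version rather than the system default.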


Testing NumPy

Now, let’s make sure this worked!

I’m creating a little test script containing the following:
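The original script isn't reproduced here, but a minimal stand-in (filename and contents are my own) only needs to import NumPy and do a little array math:

```python
# numpy_test.py: a tiny NumPy smoke test; if the active environment
# lacks NumPy, the import below fails immediately
import numpy as np

a = np.arange(100).reshape(10, 10)
print("sum:", np.sum(a))    # 4950
print("mean:", np.mean(a))  # 49.5
```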


If I were to run this script without activating or pointing to my conda environment with NumPy installed, I would see this error:


In order to get around this error, we’ll specify the Python environment in our submit statement:
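One way to do that is to point PYSPARK_PYTHON at the environment's interpreter. The paths below assume the /opt/miniconda3 install location recommended earlier, and your_script.py stands in for whatever you are submitting:

```shell
# Tell Spark which Python the driver and executors should use
export PYSPARK_PYTHON=/opt/miniconda3/envs/mapr_numpy/bin/python

spark-submit \
  --master yarn \
  --deploy-mode client \
  --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=$PYSPARK_PYTHON \
  your_script.py
```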


Now for Something a Little More Advanced...

This example of PySpark, using the NLTK Library for Natural Language Processing, has been adapted from Continuum Analytics.

We’re going to run through a quick example of word tokenization to identify parts of speech that demonstrates the use of Python environments with Spark on YARN.

First, we’ll create a new conda environment and add NLTK to it on all cluster nodes:

conda create --name mapr_nltk nltk python=3.5
source activate mapr_nltk

Note that some builds of PySpark are not compatible with Python 3.6, so we’ve specified an older version.

Next, we have to download the demo data from the NLTK repository:
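NLTK ships a downloader module for this. A sketch follows; the target directory is an assumption (any MapR-FS path visible from all nodes will do):

```shell
# Run inside the mapr_nltk environment; -d sets the download directory
source activate mapr_nltk
python -m nltk.downloader -d /mapr/demo.mapr.com/user/mapr/nltk_data all
```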


This step will download all of the data to the directory that you specify; in this case, the default MapR-FS directory for the cluster user, accessible by all nodes in the cluster.

Next, create the following Python script:
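The post's script isn't reproduced here; the sketch below (the filename and the NLTK data path are my own, following the shape of Continuum's published example) tokenizes lines and tags parts of speech inside the Spark executors:

```python
# nltk_pos_demo.py: word tokenization and part-of-speech tagging
# with NLTK running on Spark executors
from pyspark import SparkConf, SparkContext

def pos_tag_lines(lines):
    # Import on the executor so nltk resolves from the mapr_nltk
    # environment; point NLTK at the shared MapR-FS data directory
    import nltk
    nltk.data.path.append("/mapr/demo.mapr.com/user/mapr/nltk_data")
    for line in lines:
        yield nltk.pos_tag(nltk.word_tokenize(line))

sc = SparkContext(conf=SparkConf().setAppName("nltk-pos-demo"))
text = sc.parallelize(["All work and no play makes Jack a dull boy."])
print(text.mapPartitions(pos_tag_lines).collect())
sc.stop()
```

To test it as the cluster user, submit with the interpreter override, e.g. `PYSPARK_PYTHON=/opt/miniconda3/envs/mapr_nltk/bin/python spark-submit nltk_pos_demo.py`.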


Then, run the following as the cluster user to test:


Additional Resources

Editor's note: this blog post was originally published in the Converge Blog on June 29, 2017

By Carol McDonald


This post is the second in a series covering how MapR data scientist Joe Blue helped MapR customers, in this case a regional bank, identify new data sources and apply machine learning algorithms in order to better understand their customers. If you have not already read the first part of this customer 360° series, it is worth reading first. In this second part, we will cover a bank customer profitability 360° example, presenting the before, during, and after.

Bank Customer Profitability - Initial State

The back story: a regional bank wanted to gain insights about what’s important to their customers based on their activity with the bank. They wanted to establish a digital profile via a customer 360 solution in order to enhance the customer experience, to tailor products, and to make sure customers have the right product for their banking style.

As you probably know, profit is equal to revenue minus cost. Customer profitability is the profit the firm makes from serving a customer or customer group, which is the difference between the revenues and the costs associated with the customer in a specified period.

Banks have a combination of fixed and variable costs. For example, a building is a fixed cost, but how often a person uses an ATM is a variable cost. The bank wanted to understand the link between their product offerings, customer behavior, and customer attitudes toward the bank in order to identify growth opportunities.

Bank Customer Profitability - During

The bank had a lot of different sources of customer data:

  • Data warehouse account information.
  • Debit card purchase information.
  • Loan information such as what kind of loan and how long it has been open.
  • Online transaction information such as who they’re paying, how often they’re online, or if they go online at all.
  • Survey data.

Analyzing data across multiple data sources

This data was isolated in silos, making it difficult to understand the relationships between the bank’s products and the customer’s activities. The first part of the solution workflow is to get the data into the MapR Platform, which is easy since MapR enables you to mount the cluster itself via NFS. Once all of the data is on the MapR Platform, the data relationships can be explored interactively using the Apache Drill schema-free SQL query engine.

A key data source was a survey that the bank had conducted in order to segment their customers based on their attitudes toward the bank. The survey asked questions like, “Are you embracing new technologies?” and “Are you trying to save?” The responses were then analyzed by a third party in order to define four customer personas. But the usefulness of this survey was limited because it was only performed on 2% of the customers. The question for the data scientist was, “How do I take this survey data and segment the other 98% of the customers, in order to make the customer experience with the bank more profitable?”

Feature extraction

Feature engineering is the process of transforming raw data into inputs for a machine learning algorithm. With data science you often hear about the algorithms that are used, but actually a bigger part—consuming about 80% of a data scientist’s time—is taking the raw data and combining it in a way that is most predictive.

The goal was to find interesting properties in the bank’s data that could be used to segment the customers into groups based on their activities. A key part of finding the interesting properties in the data was working with the bank’s domain experts, because they know their customers better than anyone.

Apache Drill was used to extract features such as:

  • What kind of loans, mix of accounts, and mix of loans does the customer have?
  • How often does the customer use a debit card?
  • How often does the customer go online?
  • What does the customer buy? How much do they spend? Where do they shop?
  • How often does the customer go into the branches?

Accentuate business expertise with new learning

After the behavior of the customers was extracted, it was possible to link these features by customer ID with the labeled survey data, in order to perform machine learning.

The statistical computing language R was used to build segmentation models using many machine learning classification techniques. The result was four independent ensembles, each predicting the likelihood of belonging to one persona, based on their banking activity.

The customer segments were merged and tested against the labeled survey data, making it possible to link the survey’s “customer attitude” personas with customers’ banking actions and provide insights.

Banking Customer 360° - After

The solution results of modeling with the customer data are displayed in the Customer Products heat map below. Each column is an “attitude”-based persona, and each row is a type of bank account or loan. Green indicates the persona is more likely to have this product, and red indicates less likely.

This graph helps define these personas by what kinds of products they like or don’t like.

This can give insight into:

  • How to price some products
  • How to generate fees
  • Gateway products, which can move a customer from a less profitable segment to a more profitable one.

In the Customer Payees heat map below, the rows are electronic payees. This heat map shows where customer personas are spending their money, which suggests channels for targeting and attracting a certain persona to grow your business.

In this graph, the bright green blocks show that Fitness club A, Credit card/Bank A, and Credit card/Bank C are really strong for Persona D. Persona A is almost the opposite of the other personas. This “customer payees” heat map gives a strong signal about persona behavior. It’s hard to find signals like this, but they provide an additional way to look at customer data in a way that the bank could not conceive of before.

Bank Customer 360° Summary

In this example, we discussed how data science can link customer behavioral data with a small sample survey to find customer groups who share behavioral affinities, which can be used to better identify a customer base for growth. The bank now has the ability to project growth rates based on transition between personas over time and find direct channels that allow them to target personas through marketing channels. 


Editor's Note: this post was originally published in the Converge Blog on December 08, 2016. 


Related Resources:

How to Use Data Science and Machine Learning to Revolutionize 360° Customer Views 

Customer 360: Knowing Your Customer is Step One | MapR 

Applying Deep Learning to Time Series Forecasting with TensorFlow 

Churn Prediction with Apache Spark Machine Learning 
