
The Exchange


Introduction to Artificial Intelligence

"It is the science and engineering of making intelligent machines, especially intelligent computer programs." – John McCarthy

Intelligence is what distinguishes us from everything else in the world: the ability to understand, apply knowledge, and improve skills has played a significant role in our evolution. We can define Artificial Intelligence as the area of computer science that deals with the ways in which computers can be made to perform cognitive functions ordinarily ascribed to humans.

Benefits of Artificial Intelligence

a. Error Reduction
We use artificial intelligence in many cases because it helps us reduce risk and increases the chance of achieving accuracy with a greater degree of precision.
b. Difficult Exploration
In mining and other fuel-exploration processes, we use artificial intelligence and robotics. We also use complex machines to explore the ocean, overcoming limitations that humans face.

Risks of Artificial Intelligence

a. High Cost
Creating artificial intelligence requires huge costs, as these are very complex machines; repair and maintenance are expensive as well.
b. No Replicating Humans
Intelligence is believed to be a gift of nature, and the ethical argument over whether human intelligence should be replicated will continue.

Artificial Intelligence Applications and Examples

a. Virtual Personal Assistants
A huge amount of data is collected from a variety of sources to learn about users and become more effective in helping them organize and track their information.
b. Video Games
We have used AI in video games since the very first titles.
Machine learning technology is also what lets assistants like Siri understand natural-language questions and requests.
If you are a car geek, Tesla is something you should not miss; it is one of the best automobiles available so far.

Educational Requirements for Career in Artificial Intelligence

  • Various levels of math, including probability, statistics, algebra, calculus, logic, and algorithms.
  • Bayesian networking or graphical modeling, including neural nets.
  • Physics, engineering, and robotics.
  • Computer science, programming languages, and coding.
  • Cognitive science theory.

Artificial Intelligence Career Domains

A career in AI can be pursued within a variety of settings, including:

  • private companies
  • public organizations
  • education
  • the arts
  • healthcare facilities
  • government agencies and
  • the military.

Roles in AI Career

  • Software analysts and developers.
  • Computer scientists and computer engineers.
  • Algorithm specialists.
  • Research scientists and engineering consultants.
  • Mechanical engineers and maintenance technicians.
  • Manufacturing and electrical engineers.
  • Surgical technicians working with robotic tools.
  • Military and aviation electricians working with flight simulators, drones, and armaments.

Future of Artificial Intelligence

Company after company is adopting Artificial Intelligence for its benefits, and it is a fact that artificial intelligence has reached into our day-to-day life at breakneck speed.
On the basis of this information, a new question arises:
Is it possible for Artificial Intelligence to outperform human performance?
If yes, when will it happen, and how long will it take?
It can only happen once Artificial Intelligence is able to do a job better than humans.
According to survey results:
  • machines are predicted to be better than humans at translating languages;
  • driving a truck;
  • working in the retail sector, and could completely outperform humans by 2060.
As a result, ML researchers believe that AI will become better than humans within the next 40 years.
  • To build smarter AI, companies acquired around 34 AI startups in the first quarter of 2017 alone. These companies are reinforcing their leads in the world of Artificial Intelligence.
  • AI is present in every sphere of life. We use AI to organize big data into patterns and structures, which feed neural networks, machine learning, and data analytics.
  • It is hard to believe that artificial intelligence, which dates back to the 1980s and earlier, is now part of our everyday lives. Moreover, it is becoming more intelligent and more widely accepted every day, with plenty of opportunities for business.
A few steps to ensure your business stays relevant in the AI revolution:
a. A finger on the pulse
Maybe the time is not yet right for your business to harness the value of AI, but that doesn't mean you should stop keeping up with how others are using it. Reading IT trade journals is a good place to start; focus on how businesses are leveraging AI.
b. Piggyback on the innovators
To implement AI, there are many resources from industry that can help you.
For example:
Google has developed a machine learning system, TensorFlow, which it has released as open source software.
c. Brainstorm potential uses with your team
Engage your team in identifying areas of the business where AI could be deployed. Data-heavy, inefficient processes are the ones most likely to benefit, so find where these exist and how artificial intelligence could be used to improve them.
d. Start small and focus on creating real value
It's not necessary to adopt AI just for the sake of it. Rather, focus on your objectives and start finding the best solution for them. That means identifying a specific process to run an AI pilot, seeing how it goes, learning, and building from there.
e. Prepare the ground
Before trying to maximize the value of AI, it's good to ensure your current processes are working in the best possible way.
f. Collaborate
Consider collaborating with a non-competing business that is further down the road in programming and enabling AI. AI has the potential to transform how a business moves and handles its ups and downs.
For example, as in the movies, machines can take over where humans stop, though this requires steps and trials.
g. Cyborg Technology
Our biggest limitation as human beings is our own bodies and brains. Cyborg technology is being developed for our convenience: it reduces the limitations we deal with on a daily basis.
h. Taking over dangerous jobs
In bomb defusing, robots are used to save thousands of lives. Technically, most of these are drones that still require a human to control them. As technology improves over the years, AI integration will help these machines do more on their own.
i. Solving climate change
This might seem like a tall order for a robot, but machines have access to more data than any one person ever could, storing a mind-boggling number of statistics. Using big data, AI could one day identify trends and use that information to come up with solutions to the world's biggest problems.

Jobs in Artificial Intelligence

  • Computational philosopher: To ensure human-aligned ethics are embedded in AI algorithms
  • Robot personality designer
  • Robot obedience trainer
  • Autonomous vehicle infrastructure designer: New road and traffic signs to be read by computer
  • Algorithm trainers, including the growing army of so-called "click workers" who help algorithms learn to recognize images or analyze sentiment, for instance.

Python Applications

So we've been learning Python programming over the last two months, and we've picked up quite a lot of useful material. But it's when you see what you can do with something that it feels powerful; it lends you some actual motivation to keep going. So let's discuss the applications Python can accomplish in the world. In this applications-of-Python tutorial, you will learn about 9 applications of Python. Let's go through these Python applications one by one.


2. Web and Internet Development

Python lets you develop a web application without too much trouble. It has libraries for internet protocols and formats such as HTML, XML, JSON, e-mail processing, FTP, and IMAP, plus an easy-to-use socket interface. And the package index has more libraries still:

  • Requests – An HTTP client library
  • BeautifulSoup – An HTML parser
  • Feedparser – For parsing RSS/Atom feeds
  • Paramiko – For implementing the SSH2 protocol
  • Twisted Python – For asynchronous network programming
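
To make the parsing side of that list concrete, here is a minimal sketch using only the standard library's `HTMLParser` to pull links out of an HTML snippet; BeautifulSoup wraps this kind of work in a far friendlier API, and Requests would fetch the page for you (the snippet and class name here are invented for illustration):

```python
from html.parser import HTMLParser

# Collect the href attribute of every <a> tag encountered while parsing.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<html><body><a href="/docs">Docs</a> <a href="/blog">Blog</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/docs', '/blog']
```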

We also have a gamut of frameworks available, such as Django and Pyramid, as well as microframeworks like Flask and Bottle. We've discussed these in our write-up on an Introduction to Python Programming.

We can also write CGI scripts, and we get advanced content management systems like Plone and Django CMS.

3. Applications of Python Programming in Desktop GUI

Most binary distributions of Python ship with Tk, a standard GUI library. It lets you draft a user interface for an application. Apart from that, some toolkits are available:

  • wxWidgets
  • Kivy – for writing multitouch applications
  • Qt via pyqt or pyside

And then we have some platform-specific toolkits:

  • GTK+
  • Microsoft Foundation Classes through the win32 extensions
  • Delphi

4. Scientific and Numeric Applications

This is one of the most common applications of Python programming. With its power, it comes as no surprise that Python finds its place in the scientific community. For this, we have:

  • SciPy – A collection of packages for mathematics, science, and engineering.
  • Pandas – A data-analysis and -modeling library
  • IPython – A powerful shell for easy editing and recording of work sessions. It also supports visualizations and parallel computing.
  • Software Carpentry Course – It teaches basic skills for scientific computing and running bootcamps. It also provides open-access teaching materials.
  • Also, NumPy lets us deal with complex numerical calculations.
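
As a small taste of what NumPy's numerical handling looks like, here is a sketch of vectorized array arithmetic (the input numbers are made up for illustration):

```python
import numpy as np

# Vectorized arithmetic: operations apply element-wise to whole arrays,
# with no explicit Python loop needed.
a = np.array([1, 2, 3, 4])
b = a ** 2          # squares each element
mean = b.mean()     # aggregate over the whole array

print(b.tolist())   # [1, 4, 9, 16]
print(mean)         # 7.5
```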

5. Software Development Application

Software developers use Python as a support language: for build control and management, testing, and a lot of other things:

  • SCons – for build-control
  • Buildbot, Apache Gump – for automated and continuous compilation and testing
  • Roundup, Trac – for project management and bug-tracking.
  • Roster of Integrated Development Environments

6. Python Applications in Education

Thanks to its simplicity, brevity, and large community, Python makes for a great introductory programming language. Applications of Python in education have huge scope, as it is a great language to teach in schools or even to learn on your own.

7. Python Applications in Business

Python is also a great choice to develop ERP and e-commerce systems:

  • Tryton – A three-tier, high-level general-purpose application platform.
  • Odoo – A management software with a range of business applications. With that, it's an all-rounder and, in effect, forms a complete suite of enterprise-management applications.

8. Database Access

This is one of the hottest Python Applications.

With Python, you have:

  • Custom and ODBC interfaces to MySQL, Oracle, PostgreSQL, MS SQL Server, and others. These are freely available for download.
  • Object databases like Durus and ZODB
  • Standard Database API
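
The Standard Database API is easy to see in action with SQLite, which ships in the standard library; a minimal sketch (the table and rows are invented for illustration):

```python
import sqlite3

# sqlite3 follows the DB-API 2.0 interface shared by the custom
# MySQL, PostgreSQL, and Oracle drivers mentioned above.
conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Ada", 36), ("Grace", 45)])

rows = conn.execute("SELECT name FROM users WHERE age > 40").fetchall()
print(rows)  # [('Grace',)]
conn.close()
```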

9. Network Programming

With all those possibilities, how would Python slack in network programming? It does provide support for lower-level network programming:

  • Twisted Python – A framework for asynchronous network programming. We mentioned it in section 2.
  • An easy-to-use socket interface
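
That easy-to-use socket interface can be sketched without touching the network at all, using a connected socket pair and a background thread (the uppercasing "echo" service is invented for illustration):

```python
import socket
import threading

def echo_server(sock):
    # Read one message, reply with its uppercase form, and hang up.
    data = sock.recv(1024)
    sock.sendall(data.upper())
    sock.close()

# socketpair() gives two already-connected sockets, handy for demos/tests.
server_sock, client_sock = socket.socketpair()
t = threading.Thread(target=echo_server, args=(server_sock,))
t.start()

client_sock.sendall(b"hello")
reply = client_sock.recv(1024)
t.join()
client_sock.close()
print(reply)  # b'HELLO'
```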

10. Games and 3D Graphics

Safe to say, this one is the most interesting. When people hear someone say they’re learning Python, the first thing they get asked is – ‘So, did you make a game yet?’

PyGame, PyKyra are two frameworks for game-development with Python. Apart from these, we also get a variety of 3D-rendering libraries.

If you’re one of those game-developers, you can check out PyWeek, a semi-annual game programming contest.

11. Other Python Applications

These are some of the major Python Applications. Apart from what we just discussed, it still finds use in more places:

  • Console-based Applications
  • Audio- or Video-based Applications
  • Applications for Images
  • Enterprise Applications
  • 3D CAD Applications
  • Computer Vision (Facilities like face-detection and color-detection)
  • Machine Learning
  • Robotics
  • Web Scraping (Harvesting data from websites)
  • Scripting
  • Artificial Intelligence
  • Data Analysis (The Hottest of Python Applications)

This was all about the Python applications tutorial. If you liked this tutorial on applications of Python programming, comment below.

12. Conclusion – Python Applications

Python is everywhere, and now that we know its applications and what we can do with it, we feel more powerful than ever. If there's a unique project you've made in Python, share your experience with us in the comments. You can also share your queries regarding this Python applications tutorial.

1. Tensorflow Tutorial

Objective: Today in this TensorFlow tutorial, we'll learn what TensorFlow is, where it is used, its different features and applications, its latest release, its advantages and disadvantages, and how to use it in your project.

Tensorflow Tutorial | What is Tensorflow


2. History of Tensorflow

DistBelief, as TensorFlow was called before the upgrade, was built in 2011 as a proprietary system based on deep learning neural networks. The source code of DistBelief was modified and made into a much better application-based library, and in 2015 it came to be known as TensorFlow.

3. What is Tensorflow?

TensorFlow is a powerful dataflow-oriented machine learning library created by the Google Brain Team and made open source in 2015. It is designed to be easy to use and widely applicable to both numeric and neural-network-oriented problems as well as other domains.

Basically, TensorFlow is a low-level toolkit for doing complicated math, and it targets researchers who know what they're doing: building experimental learning architectures, playing around with them, and turning them into running software.

It can be thought of as a programming system in which you represent computations as graphs. Nodes in the graph represent math operations, and the edges represent multidimensional data arrays (tensors) communicated between them.
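
The graph idea can be sketched in plain Python, independent of TensorFlow itself; the tiny `Node` class below is invented purely to illustrate "nodes are operations, edges carry values":

```python
import operator

# A toy computation graph. TensorFlow builds and runs far richer graphs
# (with tensors on the edges), but the principle is the same.
class Node:
    def __init__(self, op, *inputs):
        self.op = op          # the math operation this node applies
        self.inputs = inputs  # edges: upstream nodes or plain constants

    def run(self):
        # Evaluate upstream nodes first, then apply this node's operation.
        args = [i.run() if isinstance(i, Node) else i for i in self.inputs]
        return self.op(*args)

# Graph for (2 + 3) * 4
add = Node(operator.add, 2, 3)
mul = Node(operator.mul, add, 4)
print(mul.run())  # 20
```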

4. Latest Release

The latest release of TensorFlow is 1.7.0. It has been designed with deep learning in mind, but it is applicable to a much wider range of problems.

Next, let us learn more about tensors in this TensorFlow tutorial.

5. About Tensors

Now, as the name suggests, it provides primitives for defining functions on tensors and automatically computing their derivatives.

Tensors are higher-dimensional arrays used in computer programming to represent a multitude of data in the form of numbers. There are other n-d array libraries available, such as NumPy, but TensorFlow stands apart from them by offering methods to create tensor functions and automatically compute derivatives.

Tensorflow Tutorial



6. Other Uses

You can build other machine learning algorithms on it as well, such as decision trees or k-nearest neighbors. Given below is the TensorFlow ecosystem:

Tensorflow Tutorial - Tensorflow Ecosystem


As can be seen from the representation above, TensorFlow integrates well and has dependencies that include GPU processing, Python, and C++, and you can use it integrated with container software like Docker as well.

Next in this TensorFlow tutorial, I am introducing you to an important concept: TensorBoard.

7. Tensorboard

TensorBoard, a suite of visualization tools offered by TensorFlow's creators, lets you visualize graphs and plot quantitative metrics about a graph, along with additional data such as the images that pass through it.

mnist_tensorboard - Tensorflow Tutorial


8. Operation

TensorFlow runs on a variety of platforms. The GPU installation is Linux-only and more tedious than the CPU-only installation; either can be done using pip or a conda environment. The applications go beyond deep learning to support other forms of machine learning, like reinforcement learning, which takes you into goal-oriented tasks such as winning video games or helping a robot navigate an uneven landscape.
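
The two install routes just mentioned look roughly like this; treat it as a sketch, since exact commands vary by platform and TensorFlow version:

```shell
# Install the CPU build with pip
pip install tensorflow

# Or create and use a conda environment first
conda create -n tf python=3.6
conda activate tf
pip install tensorflow
```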

9. Tensorflow Applications

There are umpteen applications of machine learning, and TensorFlow lets you explore the majority of them, including sentiment analysis, Google Translate, text summarization, and the one for which it is most famous: image recognition, which is used by major companies all over the world, including Airbnb, eBay, Dropbox, Snapchat, Twitter, Uber, SAP, Qualcomm, IBM, Intel, and of course Google, Facebook, Instagram, and even Amazon, for various purposes.

10. Tensorflow Features

TensorFlow has APIs for Python and C++ and wide language support. Researchers are working to make it better with each passing day; at the latest TensorFlow Summit, tensorflow.js, a JavaScript library for training and deploying machine learning models, was introduced, and an open-source browser-integrated platform is available where you can see in real time the changes that occur as you tune the hyperparameters.

11. Tensorflow Advantages

  • Tensorflow has a responsive construct as you can easily visualize each and every part of the graph.
  • It has platform flexibility, meaning it is modular and some parts of it can be standalone while the others coalesced.
  • It is easily trainable on CPU as well as GPU for distributed computing.
  • It has auto-differentiation capabilities, which benefit gradient-based machine learning algorithms: you can compute derivatives of values with respect to other values, resulting in a graph extension.
  • It has advanced support for threads, asynchronous computation, and queues.
  • It is customizable and open source.

12. Tensorflow Limitations

  • Has GPU memory conflicts with Theano if imported in the same scope.
  • No support for OpenCL
  • Requires prior knowledge of advanced calculus and linear algebra along with a pretty good understanding of machine learning.

This was all on Tensorflow Tutorial.

13. Conclusion

TensorFlow is a great library for the numerical and graphical computation of data when creating deep learning networks, and it is the most widely used library for applications like Google Search, Google Translate, Google Photos, and many more. People have done numerous amazing things using machine learning, including applications in health care, recommendation engines for movies and music, personalized ads, and social-media sentiment mining, to name a few. With these mind-boggling advancements in machine learning and artificial intelligence, TensorFlow is a tool that is helping to achieve these goals.


This post was originally published on Tensorflow Tutorial | DataFlair.


Data Analytics for Newbies

Posted by Harshali May 22, 2018

1. What is Big Data Analytics?

Data is information in raw format. With increasing data sizes, there is a need for inspecting, cleaning, transforming, and modeling data with the goal of finding useful information, drawing conclusions, and supporting decision making. This process is known as Big Data analysis.

Data mining is a particular data analysis technique where modeling and knowledge discovery for predictive rather than purely descriptive purposes is focused. Business intelligence covers data analysis that relies heavily on aggregation, focusing on business information. In statistical applications, some people divide business analytics into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). EDA focuses on discovering new features in the data and CDA focuses on confirming or falsifying existing hypotheses. Predictive analytics does forecasting or classification by focusing on statistical or structural models while in text analytics, statistical, linguistic and structural techniques are applied to extract and classify information from textual sources, a species of unstructured data. All are varieties of data analysis.

The Big Data wave has changed ways in which industries function. With Big Data has emerged the requirement to implement advanced analytics to it. Now experts can make more accurate and profitable decisions.

In this session of Big Data Analytics tutorial for beginners, we are going to see characteristics and need of data analysis.

2. Analysis versus Reporting

An analysis is an interactive process of a person tackling a problem, finding the data required to get an answer, analyzing that data, and interpreting the results in order to provide a recommendation for action.

A reporting environment or a business intelligence (BI) environment involves calling and execution of reports. The outputs are then printed in the desired form. Reporting refers to the process of organizing and summarizing data in an easily readable format to communicate important information. Reports help organizations in monitoring different areas of a performance and improving customer satisfaction. In other words, you can consider reporting as the process of converting raw data into useful information, while analysis transforms information into insights.

Let us understand difference between data analysis and data reporting in this Big Data Analytics Tutorial:

  • Reporting provides data. A report shows the user what happened in the past, avoiding inferences and helping to get a feel for the data, while analysis provides answers to any question or issue. An analysis process takes whatever steps are needed to get those answers.
  • Reporting just provides the data that is asked for while analysis provides the information or the answer that is actually needed.
  • Reporting is done in a standardized manner, while analysis can be customized. There are fixed standard formats for reporting, while analysis is done as per the requirement and is customizable as needed.
  • Reporting can be done using a tool and generally does not involve any person, while analysis requires a person to carry out and lead the process; he or she guides the complete analysis.
  • Reporting is inflexible while analysis is flexible. Reporting provides no or limited context about what’s happening in the data and hence is inflexible while analysis emphasizes data points that are significant, unique, or special, and it explains why they are important to the business.

Any doubt yet in the Big Data Analytics tutorial for beginners? Please Comment.

3. Data Analytics Process

Now in this Big Data Analytics tutorial, we are going to look at the analytics process: how is data analysis actually done?

Big Data Analytics Tutorial for beginners - Process


a. Business Understanding

The very first step is business understanding. Whenever a requirement occurs, we first need to determine the business objective, assess the situation, determine the data mining goals, and then produce a project plan as per the requirement. Business objectives are defined in this phase.

b. Data Exploration

The second step is data understanding. For the further process, we need to gather the initial data, describe and explore it, and verify data quality to ensure it contains the data we require. In this phase, data collected from the various sources is described in terms of its application and its need for the project. This is also known as data exploration, and it is necessary to verify the quality of the data collected.

c. Data Preparation

Next comes data preparation. From the data collected in the last step, we need to select the data we require, clean it, construct it to get useful information, and then integrate it all. Finally, we format the data appropriately. In this phase, data is selected, cleaned, and integrated in the format finalized for the analysis.
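
A toy sketch of that selection-and-cleaning step in plain Python; the records and fields are invented for illustration:

```python
# Raw records as collected: some fields missing, inconsistent formatting.
raw = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob",     "age": None},   # missing value: drop this record
    {"name": "carol",   "age": "29"},
]

# Select, clean, and format: the essence of the data-preparation phase.
clean = [
    {"name": r["name"].strip().title(), "age": int(r["age"])}
    for r in raw
    if r["age"] is not None
]
print(clean)  # [{'name': 'Alice', 'age': 34}, {'name': 'Carol', 'age': 29}]
```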

d. Data Modeling

Once data is gathered, we need to do data modeling. For this, we select a modeling technique, generate a test design, build the model, and assess it. In this phase, a data model is built to analyze the relationships between the various selected objects in the data; test cases are built for assessing the model, and the model is tested and implemented on the data.

e. Data Evaluation

Next comes data evaluation, where we evaluate the results generated in the last step, review the scope of error, and determine the next steps to perform. In this phase, the results of the test cases are evaluated and reviewed for the scope of error.

f. Deployment

The final step in the analytic process is deployment. Here we plan the deployment, monitoring, and maintenance, produce a final report, and review the project. The results of the analysis are deployed in this phase; this is also known as the review of the project.

The complete process above is known as the business analytics process.

4. Introduction to Data Mining

Data mining, also called data or knowledge discovery, means analyzing data from different perspectives and summarizing it into useful information that can be used to make important decisions, which is why we are discussing it in this Big Data Analytics tutorial. It is the technique of exploring, analyzing, and detecting patterns in large amounts of data. The goal of data mining is either data classification or data prediction. In classification, data is sorted into groups, while in prediction, the value of a continuous variable is predicted.

In today's world, data mining is used in several sectors such as retail, sales analytics, finance, communication, and marketing organizations. For example, in classification, a marketer may want to find who did and did not respond to a promotion. In prediction, the idea is to predict the value of a continuous (i.e., non-discrete) variable; for example, a marketer may be interested in finding who will respond to a promotion.

Some examples of Data Mining are:

a. Classification trees

These are tree-shaped structures that represent sets of decisions.

b. Logistic regression

It predicts the probability of an outcome that can only have two values.
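
The core of logistic regression is the logistic (sigmoid) function, which squashes any real-valued score into a probability between 0 and 1. A minimal sketch, with made-up coefficients for the promotion example above:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical fitted model: probability of responding to a promotion
# given a customer's number of past purchases x.
intercept, weight = -2.0, 0.5

def p_respond(x):
    return sigmoid(intercept + weight * x)

print(round(p_respond(4), 3))  # score 0 maps to probability 0.5
print(p_respond(10) > 0.9)     # heavy purchasers are likely responders
```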

c. Neural networks

These are non-linear predictive models that resemble biological neural networks in structure and learn through training.

d. Clustering techniques like k-nearest neighbors

This technique classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). It is sometimes called the k-nearest-neighbor technique.
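
A minimal k-nearest-neighbor classifier in plain Python, with an invented two-feature dataset (libraries like scikit-learn provide production-grade versions of this):

```python
from collections import Counter

def knn_classify(point, dataset, k=3):
    """Label `point` by majority vote of its k closest historical records."""
    by_distance = sorted(
        dataset,
        key=lambda rec: sum((a - b) ** 2 for a, b in zip(rec[0], point)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Historical records: ((feature1, feature2), class label)
history = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
           ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]

print(knn_classify((2, 2), history))  # 'A'
print(knn_classify((9, 9), history))  # 'B'
```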

e. Anomaly detection

It is the identification of items, events or observations which do not conform to an expected pattern or other items in a dataset.
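
A simple statistical take on anomaly detection flags values that lie far from the mean; the threshold and readings below are invented for illustration:

```python
import statistics

def anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

readings = [10, 11, 9, 10, 12, 10, 11, 45]  # 45 does not fit the pattern
print(anomalies(readings))  # [45]
```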

After this Big Data Analytics tutorial, you can read our detailed tutorial on Data Mining.

5. Characteristics of Big Data Analysis

We have already seen characteristics of Big Data like volume, velocity and variety. Let us now see in this Big Data Analytics Tutorial, characteristics of Big Data Analytics which make it different from traditional kind of analysis.

Big Data Analytics Tutorial - Characteristics


Big Data analysis has the following characteristics:

a. Programmatic

Because of the scale of the data, there may be a need to write programs for data analysis, using code to manipulate the data or perform any kind of exploration.

b. Data driven

It means progress in an activity is driven by data: program statements describe the data to be matched and the processing required, rather than defining a sequence of steps to be taken. While many analysts use a hypothesis-driven approach to data analysis, Big Data can use the massive amount of data itself to drive the analysis.

c. Attributes usage

For proper and accurate analysis of data, a lot of attributes can be used. In the past, analysts dealt with hundreds of attributes or characteristics of the data source; with Big Data there are now thousands of attributes and millions of observations.

d. Iterative

Data analytics can be iterative in nature, as the whole data set is broken into samples and the samples are then analyzed. Greater compute power enables iteration on the models until Big Data analysts are satisfied. This has led to the development of new applications designed to address such analysis requirements and time frames.

6. Great Analysis Through Framing the Problem Correctly

In order to have a great analysis, it is necessary to ask the right question, gather the right data to address it, and design the right analysis to answer the question. Only then can the analysis be called correct and successful. Let's discuss this in detail in this Big Data Analytics tutorial for beginners.

Framing of problem means ensuring that important questions have been asked and critical assumptions have been laid out. For example, is the goal of a new initiative to drive more revenue or more profit? The choice leads to a huge difference in the analysis and actions that follow. Is all the data required available, or is it necessary to collect some more data? Without framing the problem, the rest of the work is useless.

For great analysis, problem should be framed correctly. This includes assessing the data correctly, developing a solid analysis plan, and taking into account the various technical and practical considerations in play.

Any business problem can be analyzed along two dimensions:

a. Statistical Significance

This is how statistically important the problem is for decision making. Statistical significance testing takes some assumptions and determines the probability of the results occurring if those assumptions are correct.

b. Business Importance

This means how the problem relates to the business and how important it is. Always put the results in a business context as part of the final validation process.

7. Skills required to be a Data Analyst

In today’s world, there is an increasing demand for analytical professionals. It is taking time for academic programs to adapt and scale to develop more talent.

All the data collected and the models created are of no use if the organization lacks skilled Big Data analysts. A Big Data analyst requires both skill and knowledge for getting good data analytics jobs.

To be a successful analyst, a professional requires expertise on the various Big data analytical tools like R & SAS. He should be able to use these business analytics tools properly and gather required details. He should also be able to take decisions which are both statistically significant and important to the business.

Even if you know how to use a data analysis tool, you also need the right skills, experience, and perspective to use it well. An analytics tool may save a user some programming, but he or she still needs to understand the analytics being generated. Only then can a person be called a successful data analyst.

Business people with no analytical expertise may want to leverage analytics, but they do not need to do the actual heavy lifting. The job of the analytics team is to enable business people to drive analytics through the organization. Let business people spend their time selling the power of analytics upstream and changing the business processes they manage to make use of analytics. If analytics teams do what they do best and business teams do what they do best, it will be a winning combination.

This was all on Big Data Analytics Tutorial.

1. Hadoop Ecosystem Components

The objective of this Apache Hadoop ecosystem components tutorial is to give an overview of the different components of the Hadoop ecosystem that make Hadoop so powerful, and thanks to which several Hadoop job roles are now available. We will also learn about Hadoop ecosystem components such as HDFS and its components, MapReduce, YARN, Hive, Apache Pig, Apache HBase and its components, HCatalog, Avro, Thrift, Drill, Apache Mahout, Sqoop, Apache Flume, Ambari, ZooKeeper, and Apache Oozie, to dive deep into Big Data Hadoop and acquire master-level knowledge of the Hadoop ecosystem.

Introduction to Apache Hadoop Ecosystem Components and their roles.

                                                          Hadoop Ecosystem Components Diagram

2. Introduction to Hadoop Ecosystem

The figure above shows the different components of the Hadoop ecosystem. We will now discuss each of these Hadoop components one by one in detail.

2.1. Hadoop Distributed File System

It is the most important component of the Hadoop ecosystem and Hadoop's primary storage system. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable, fault-tolerant, reliable, and cost-efficient data storage for Big Data. HDFS is a distributed file system that runs on commodity hardware. The default configuration works for many installations, though large clusters usually need additional configuration. Users interact with HDFS directly through shell-like commands.

HDFS Components:

There are two major components of Hadoop HDFS- NameNode and DataNode. Let’s now discuss these Hadoop HDFS Components-

i. NameNode

It is also known as the Master node. The NameNode does not store the actual data or dataset. It stores metadata, i.e. the number of blocks, their locations, the rack and DataNode on which the data is stored, and other details. The file system namespace it manages consists of files and directories.

Tasks of HDFS NameNode

  • Manages the file system namespace.
  • Regulates clients’ access to files.
  • Executes file system operations such as naming, opening, and closing files and directories.

ii. DataNode

It is also known as the Slave node. The HDFS DataNode is responsible for storing the actual data in HDFS, and it performs read and write operations as requested by clients. Each replica block on a DataNode consists of two files on the local file system: the first holds the data and the second records the block’s metadata, which includes checksums for the data. At startup, each DataNode connects to its corresponding NameNode and performs a handshake, which verifies the namespace ID and the DataNode’s software version. If a mismatch is found, the DataNode shuts down automatically.

Tasks of HDFS DataNode

  • DataNode performs operations like block replica creation, deletion, and replication according to the instruction of NameNode.
  • DataNode manages data storage of the system.

This was all about HDFS as a Hadoop Ecosystem component.

Refer HDFS Comprehensive Guide to read Hadoop HDFS in detail and then proceed with the Hadoop Ecosystem tutorial.

2.2. MapReduce

Hadoop MapReduce is the core Hadoop ecosystem component which provides data processing. MapReduce is a software framework for easily writing applications that process the vast amount of structured and unstructured data stored in the Hadoop Distributed File system.

MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster. This parallel processing improves the speed and reliability of the cluster.


Hadoop Ecosystem Overview - Hadoop MapReduce

                                               Hadoop Ecosystem Overview – Hadoop MapReduce

Working of MapReduce

Hadoop Ecosystem component ‘MapReduce’ works by breaking the processing into two phases:

  • Map phase
  • Reduce phase

Each phase has key-value pairs as input and output. In addition, the programmer specifies two functions: the map function and the reduce function.

The map function takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs). Read Mapper in detail.

The reduce function takes the output from the map as its input and combines those data tuples based on the key, modifying the value for each key accordingly. Read Reducer in detail.
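The two functions above can be sketched with the canonical word-count example. This is a minimal, in-memory illustration in plain Python; the real framework distributes the same logic across the cluster.

```python
from collections import defaultdict

# Word count: the map function emits (word, 1) tuples;
# the reduce function combines all values grouped under a key.
def map_fn(line):
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    return (key, sum(values))

lines = ["big data is big", "data is data"]

# Map phase: every input line produces intermediate key/value pairs,
# which the framework groups by key (the shuffle).
intermediate = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        intermediate[key].append(value)

# Reduce phase: one call per key, over all values grouped for that key.
result = dict(reduce_fn(k, v) for k, v in intermediate.items())
print(result)  # {'big': 2, 'data': 3, 'is': 2}
```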

Features of MapReduce

  • Simplicity – MapReduce jobs are easy to run. Applications can be written in any language, such as Java, C++, or Python.
  • Scalability – MapReduce can process petabytes of data.
  • Speed – Through parallel processing, problems that take days to solve are solved in hours or minutes by MapReduce.
  • Fault Tolerance – MapReduce takes care of failures. If one copy of the data is unavailable, another machine has a copy of the same key/value pair, which can be used to solve the same subtask.

Refer MapReduce Comprehensive Guide for more details.

Hope the Hadoop Ecosystem explained is helpful to you. The next component we take is YARN.

2.3. YARN

Hadoop YARN (Yet Another Resource Negotiator) is the Hadoop ecosystem component that provides resource management, and it is also one of the most important components of the Hadoop ecosystem. YARN is called the operating system of Hadoop, as it is responsible for managing and monitoring workloads. It allows multiple data processing engines, such as real-time streaming and batch processing, to handle data stored on a single platform.

Apache Hadoop Ecosystem - Hadoop Yarn Diagram

                                         Apache Hadoop Ecosystem – Hadoop Yarn Diagram

YARN has been projected as the data operating system for Hadoop 2. The main features of YARN are:

  • Flexibility – Enables other purpose-built data processing models beyond MapReduce (batch), such as interactive and streaming. Due to this feature of YARN, other applications can also be run along with Map Reduce programs in Hadoop2.
  • Efficiency – Many applications run on the same cluster, so the efficiency of Hadoop increases without much effect on quality of service.
  • Shared – Provides a stable, reliable, secure foundation and shared operational services across multiple workloads. Additional programming models such as graph processing and iterative modeling are now possible for data processing.

Refer YARN Comprehensive Guide for more details.

2.4. Hive

The Hadoop ecosystem component, Apache Hive, is an open source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hive performs three main functions: data summarization, query, and analysis.

Hive uses a language called HiveQL (HQL), which is similar to SQL. HiveQL automatically translates SQL-like queries into MapReduce jobs that execute on Hadoop.
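To illustrate, the HiveQL statement in the comment below (table and column names are hypothetical) expresses a grouped aggregation; Hive compiles such a query into a MapReduce job that groups rows by key and sums the values. The same logic, sketched in plain Python:

```python
from collections import defaultdict

# A HiveQL query such as (hypothetical table and columns):
#   SELECT dept, SUM(salary) FROM employees GROUP BY dept;
# becomes a MapReduce job: group rows by key, then aggregate.
employees = [("sales", 3000), ("hr", 2500), ("sales", 3500), ("hr", 2000)]

totals = defaultdict(int)
for dept, salary in employees:   # map step: emit (dept, salary)
    totals[dept] += salary       # reduce step: sum salaries per dept

print(dict(totals))  # {'sales': 6500, 'hr': 4500}
```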

Components of Hadoop Ecosystem - Hive Diagram

                                              Components of Hadoop Ecosystem – Hive Diagram

Main parts of Hive are:

  • Metastore – Stores the metadata.
  • Driver – Manages the lifecycle of a HiveQL statement.
  • Query compiler – Compiles HiveQL into a Directed Acyclic Graph (DAG).
  • Hive server – Provides a Thrift interface and a JDBC/ODBC server.

Refer Hive Comprehensive Guide for more details.

2.5. Pig

Apache Pig is a high-level language platform for analyzing and querying huge datasets stored in HDFS. As a component of the Hadoop ecosystem, Pig uses the Pig Latin language, which is very similar to SQL. It loads the data, applies the required filters, and dumps the data in the required format. To execute programs, Pig requires a Java runtime environment.

Hadoop Ecosystem Tutorial - Pig Diagram

                                                            Hadoop Ecosystem Tutorial – Pig Diagram

Features of Apache Pig:

  • Extensibility – Users can create their own functions to carry out special-purpose processing.
  • Optimization opportunities – Pig allows the system to optimize execution automatically, so the user can pay attention to semantics instead of efficiency.
  • Handles all kinds of data – Pig analyzes both structured and unstructured data.

Refer Pig – A Complete guide for more details.

2.6. HBase

Apache HBase is a Hadoop ecosystem component: a distributed database designed to store structured data in tables that can have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS, and it provides real-time access to read or write data in HDFS.

Hadoop Ecosystem Components - HBase Diagram

                                              Hadoop Ecosystem Components – HBase Diagram

Components of Hbase

There are two HBase Components namely- HBase Master and RegionServer.

i. HBase Master

It is not part of the actual data storage, but it negotiates load balancing across all RegionServers.

  • Maintains and monitors the Hadoop cluster.
  • Performs administration (an interface for creating, updating, and deleting tables).
  • Controls failover.
  • HMaster handles DDL operations.

ii. RegionServer

It is the worker node that handles read, write, update, and delete requests from clients. The RegionServer process runs on every node in the Hadoop cluster, on the HDFS DataNode.
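To make the HBase table model described above concrete, here is a toy sketch (not the real HBase API) of how a row key maps to column families, whose cells are versioned by timestamp. All names and values are hypothetical.

```python
# Toy model of HBase storage: row key -> column family -> qualifier
# -> list of (timestamp, value) versions.
table = {
    "row-001": {                                 # row key
        "info":  {"name":   [(1700000000, "alice")]},   # column family "info"
        "stats": {"visits": [(1700000000, "7")]},       # column family "stats"
    },
}

def get(table, row, family, qualifier):
    """Return the latest value for a cell, like an HBase Get."""
    versions = table[row][family][qualifier]
    return max(versions)[1]  # the highest timestamp wins

print(get(table, "row-001", "info", "name"))  # alice
```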

Refer HBase Tutorial for more details.

2.7. HCatalog

It is a table and storage management layer for Hadoop. HCatalog supports different components available in Hadoop ecosystem like MapReduce, Hive, and Pig to easily read and write data from the cluster. HCatalog is a key component of Hive that enables the user to store their data in any format and structure.

By default, HCatalog supports the RCFile, CSV, JSON, SequenceFile, and ORC file formats.

Benefits of HCatalog:

  • Enables notifications of data availability.
  • With its table abstraction, HCatalog frees the user from the overhead of data storage.
  • Provides visibility for data cleaning and archiving tools.

2.8. Avro

Avro is a part of the Hadoop ecosystem and one of the most popular data serialization systems. Avro is an open source project that provides data serialization and data exchange services for Hadoop. These services can be used together or independently. Using Avro, Big Data applications can exchange data between programs written in different languages.

Using its serialization service, programs can serialize data into files or messages. Avro stores the data definition and the data together in one message or file, making it easy for programs to dynamically understand the information stored in an Avro file or message.

Avro schema – Avro relies on schemas for serialization and deserialization; it requires a schema to write or read data. When Avro data is stored in a file, its schema is stored with it, so the file can be processed later by any program.
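The key idea — that the schema travels with the data, so any program can read the file later without generated code — can be illustrated with a toy sketch. This uses plain JSON via the standard library, not Avro's real binary container format, purely to show the self-describing principle.

```python
import io
import json

# A record schema in Avro's JSON schema style (toy illustration).
schema = {"type": "record", "name": "User",
          "fields": [{"name": "id", "type": "int"},
                     {"name": "name", "type": "string"}]}
records = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

# Write schema + data together into one container, as Avro files do.
buf = io.StringIO()
json.dump({"schema": schema, "records": records}, buf)

# A later reader discovers the structure from the embedded schema,
# with no generated code needed.
buf.seek(0)
container = json.load(buf)
field_names = [f["name"] for f in container["schema"]["fields"]]
print(field_names)  # ['id', 'name']
```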

Dynamic typing – This refers to serialization and deserialization without code generation. It complements the code generation that Avro provides for statically typed languages as an optional optimization.

Features provided by Avro:

  • Rich data structures.
  • Remote procedure call.
  • Compact, fast, binary data format.
  • Container file, to store persistent data.

2.9. Thrift

It is a software framework for scalable cross-language service development. Thrift is an interface definition language for RPC (remote procedure call) communication. Hadoop makes a lot of RPC calls, so there is a possibility of using the Hadoop ecosystem component Apache Thrift for performance or other reasons.

Hadoop Ecosystem - Thrift Diagram

Hadoop Ecosystem – Thrift Diagram

2.10. Apache Drill

The main purpose of this Hadoop ecosystem component is large-scale data processing, including structured and semi-structured data. Apache Drill is a low-latency distributed query engine designed to scale to several thousands of nodes and query petabytes of data. Drill is the first distributed SQL query engine with a schema-free model.

Application of Apache drill

Drill has become an invaluable tool at Cardlytics, a company that provides consumer purchase data for mobile and internet banking. Cardlytics uses Drill to quickly process trillions of records and execute queries.

Features of Apache Drill:

Drill has a specialized memory management system that eliminates garbage collection and optimizes memory allocation and usage. Drill plays well with Hive, allowing developers to reuse their existing Hive deployments.

  • Extensibility – Drill provides an extensible architecture at all layers, including the query layer, query optimization, and the client API. Any layer can be extended for the specific needs of an organization.
  • Flexibility – Drill provides a hierarchical columnar data model that can represent complex, highly dynamic data and allows efficient processing.
  • Dynamic schema discovery – Apache Drill does not require a schema or type specification for the data in order to start the query execution process. Instead, Drill starts processing the data in units called record batches and discovers the schema on the fly during processing.
  • Decentralized metadata – Unlike other SQL-on-Hadoop technologies, Drill does not have a centralized metadata requirement. Drill users do not need to create and manage tables in metadata in order to query data.

2.11. Apache Mahout

Mahout is an open source framework for creating scalable machine learning algorithms and a data mining library. Once data is stored in Hadoop HDFS, Mahout provides the data science tools to automatically find meaningful patterns in those Big Data sets.

Algorithms of Mahout are:

  • Clustering – Takes items in a particular class and organizes them into naturally occurring groups, such that items belonging to the same group are similar to each other.
  • Collaborative filtering – Mines user behavior and makes product recommendations (e.g. Amazon recommendations).
  • Classification – Learns from existing categorizations and then assigns unclassified items to the best category.
  • Frequent pattern mining – Analyzes items in a group (e.g. items in a shopping cart or terms in a query session) and then identifies which items typically appear together.
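As a minimal sketch of the frequent-pattern-mining idea in the last bullet, the snippet below counts which item pairs appear together across shopping baskets (the baskets are made up; Mahout runs this kind of counting at cluster scale):

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets.
baskets = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "bread"},
]

# Count co-occurrences of every item pair within each basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))  # [(('bread', 'milk'), 3)]
```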

2.12. Apache Sqoop

Sqoop imports data from external sources into related Hadoop ecosystem components like HDFS, HBase, or Hive. It also exports data from Hadoop to other external sources. Sqoop works with relational databases such as Teradata, Netezza, Oracle, and MySQL.

Explain Hadoop Ecosystem - Apache Sqoop Diagram

                                             Explain Hadoop Ecosystem – Apache Sqoop Diagram

Features of Apache Sqoop:

  • Import sequential datasets from mainframe – Sqoop satisfies the growing need to move data from the mainframe to HDFS.
  • Import directly to ORC files – Improves compression, provides lightweight indexing, and improves query performance.
  • Parallel data transfer – For faster performance and optimal system utilization.
  • Efficient data analysis – Improves the efficiency of data analysis by combining structured and unstructured data in a schema-on-read data lake.
  • Fast data copies – from an external system into Hadoop.
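The "parallel data transfer" feature above can be sketched as follows: a parallel importer like Sqoop divides a table's primary-key range into even slices so that each map task copies one slice. This is an illustrative sketch, not Sqoop's actual split logic.

```python
def key_splits(min_key, max_key, num_mappers):
    """Divide the key range [min_key, max_key] into even (low, high)
    slices, one per parallel map task."""
    step = (max_key - min_key + 1) / num_mappers
    return [(round(min_key + i * step), round(min_key + (i + 1) * step) - 1)
            for i in range(num_mappers)]

# E.g. a table with primary keys 1..1000, imported by 4 mappers:
print(key_splits(1, 1000, 4))  # [(1, 250), (251, 500), (501, 750), (751, 1000)]
```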

2.13. Apache Flume

Flume efficiently collects, aggregates, and moves large amounts of data from their origin into HDFS. It is a fault-tolerant and reliable mechanism. This Hadoop ecosystem component allows data to flow from the source into the Hadoop environment. It uses a simple extensible data model that allows for online analytic applications. Using Flume, we can get data from multiple servers into Hadoop immediately.

Hadoop Ecosystem Components - Apache Flume

                                                 Hadoop Ecosystem Components – Apache Flume

Refer Flume Comprehensive Guide for more details

2.14. Ambari

Ambari, another Hadoop ecosystem component, is a management platform for provisioning, managing, monitoring, and securing an Apache Hadoop cluster. Hadoop management gets simpler, as Ambari provides a consistent, secure platform for operational control.

Hadoop Ecosystem Tutorial - Ambari Diagram

                                                        Hadoop Ecosystem Tutorial – Ambari Diagram

Features of Ambari:

  • Simplified installation, configuration, and management – Ambari easily and efficiently creates and manages clusters at scale.
  • Centralized security setup – Ambari reduces the complexity of administering and configuring cluster security across the entire platform.
  • Highly extensible and customizable – Ambari is highly extensible for bringing custom services under management.
  • Full visibility into cluster health – Ambari ensures that the cluster is healthy and available with a holistic approach to monitoring.

2.15. Zookeeper

Apache Zookeeper is a centralized service and a Hadoop Ecosystem component for maintaining configuration information, naming, providing distributed synchronization, and providing group services. Zookeeper manages and coordinates a large cluster of machines.

Hadoop Ecosystem Explained - ZooKeeper Diagram

                                              Hadoop Ecosystem Explained – ZooKeeper Diagram

Features of Zookeeper:

  • Fast – ZooKeeper is fast with workloads where reads are more common than writes; the ideal read/write ratio is 10:1.
  • Ordered – Zookeeper maintains a record of all transactions.

2.16. Oozie

It is a workflow scheduler system for managing Apache Hadoop jobs. Oozie combines multiple jobs sequentially into one logical unit of work. The Oozie framework is fully integrated with the Apache Hadoop stack, with YARN as its architectural center, and supports Hadoop jobs for Apache MapReduce, Pig, Hive, and Sqoop.

Apache Hadoop Ecosystem - Oozie Diagram

                                                    Apache Hadoop Ecosystem – Oozie Diagram

In Oozie, users can create a Directed Acyclic Graph of workflow actions, which can run in parallel and sequentially in Hadoop. Oozie is scalable and can manage the timely execution of thousands of workflows in a Hadoop cluster. Oozie is also very flexible: one can easily start, stop, suspend, and rerun jobs. It is even possible to skip a specific failed node or rerun it in Oozie.

There are two basic types of Oozie jobs:

  • Oozie workflow – Stores and runs workflows composed of Hadoop jobs, e.g. MapReduce, Pig, and Hive.
  • Oozie Coordinator – It runs workflow jobs based on predefined schedules and availability of data.
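The Directed Acyclic Graph idea behind an Oozie workflow can be sketched with the standard library's topological sorter. The action names below are hypothetical; a real Oozie workflow is defined in XML and run by the Oozie server.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Toy Oozie-style workflow: each action lists the actions it depends on.
# Independent branches of such a DAG could run in parallel.
workflow = {
    "import-data":  set(),              # e.g. a Sqoop import, no dependencies
    "clean-data":   {"import-data"},    # e.g. a Pig script after the import
    "build-report": {"clean-data"},     # e.g. a Hive query after cleaning
}

# A valid execution order that respects every dependency:
order = list(TopologicalSorter(workflow).static_order())
print(order)  # ['import-data', 'clean-data', 'build-report']
```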

This was all about the components of the Hadoop ecosystem.

3. Conclusion: Components of Hadoop Ecosystem

We have covered all the Hadoop ecosystem components in detail; these components empower Hadoop's functionality. Now that you have learned the components of the Hadoop ecosystem, refer to the Hadoop installation guide to start using Hadoop. If you like this blog or have any query, please feel free to share it with us.


See Also-

1. Hadoop MapReduce Tutorial


This Hadoop MapReduce tutorial describes all the concepts of Hadoop MapReduce in great detail. In this tutorial, we will understand what MapReduce is and how it works, what a Mapper and a Reducer are, shuffling and sorting, etc. This Hadoop MapReduce tutorial also covers the internals of MapReduce, data flow, architecture, and data locality. So let's get started with the Hadoop MapReduce tutorial.


Apache Hadoop MapReduce Tutorial for beginners.

                                                                         Hadoop MapReduce Tutorial

2. What is MapReduce?


MapReduce is the processing layer of Hadoop. The MapReduce programming model is designed for processing large volumes of data in parallel by dividing the work into a set of independent tasks. You need to put the business logic in the way MapReduce works, and the rest will be taken care of by the framework. The work (the complete job) submitted by the user to the master is divided into small units (tasks) and assigned to slaves.

MapReduce programs are written in a particular style influenced by functional programming constructs, specifically idioms for processing lists of data. In MapReduce, we get input as a list, and it is converted into an output that is again a list. MapReduce is the heart of Hadoop: Hadoop is so powerful and efficient because of the parallel processing MapReduce performs.

This is what MapReduce is in Big Data. In the next steps of this MapReduce tutorial, we cover the MapReduce process and data flow: how MapReduce divides the work into sub-tasks, and why MapReduce is one of the best paradigms for processing data.



2.1. High-level Understanding of Hadoop MapReduce Tutorial

Now, in this Hadoop MapReduce tutorial, let's understand the MapReduce basics: at a high level, what MapReduce looks like, and what, why, and how MapReduce works.

Map-Reduce divides the work into small parts, each of which can be done in parallel on the cluster of servers. A problem is divided into a large number of smaller problems each of which is processed to give individual outputs. These individual outputs are further processed to give final output.

Hadoop Map-Reduce is scalable and can be used across many computers. Many small machines can be used to process jobs that could not be processed by a single large machine. Next in the MapReduce tutorial, we will see some important MapReduce terminologies.


2.2. Apache MapReduce Terminologies

Let’s now understand different terminologies and concepts of MapReduce, what is Map and Reduce, what is a job, task, task attempt, etc.

Map-Reduce is the data processing component of Hadoop. Map-Reduce programs transform lists of input data elements into lists of output data elements. A Map-Reduce program will do this twice, using two different list processing idioms-

  • Map
  • Reduce

In between Map and Reduce, there is a small phase called Shuffle and Sort.

Let’s understand basic terminologies used in Map Reduce.


  • What is a MapReduce Job?


A MapReduce job, or a “full program”, is an execution of a Mapper and Reducer across a data set; that is, an execution of the two processing layers, mapper and reducer. A MapReduce job is the work that the client wants to be performed. It consists of the input data, the MapReduce program, and configuration information. So the client needs to submit the input data, write the MapReduce program, and set the configuration information. (Some of this is provided during Hadoop setup in the configuration files, and some configurations specific to our MapReduce job are specified in the program itself.)


  • What is Task in Map Reduce?

A task in MapReduce is an execution of a Mapper or a Reducer on a slice of data. It is also called a Task-In-Progress (TIP), meaning that processing of the data is in progress on either a mapper or a reducer.


  • What is Task Attempt?

A task attempt is a particular instance of an attempt to execute a task on a node. Any machine can go down at any time; for example, while processing data, if a node goes down, the framework reschedules the task on some other node. This rescheduling of the task cannot be infinite: there is an upper limit, and the default number of task attempts is 4. If a task (mapper or reducer) fails 4 times, the job is considered a failed job. For a high-priority or huge job, the number of task attempts can be increased.
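The retry policy just described can be sketched as follows. This is an illustrative simulation, not the framework's actual scheduler code; the flaky task here fails twice and then succeeds on its third attempt.

```python
# Re-run a failed task up to a limit (4 attempts by default in Hadoop)
# before giving up and declaring the job failed.
MAX_ATTEMPTS = 4

def run_with_retries(task, max_attempts=MAX_ATTEMPTS):
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except RuntimeError:          # simulated node/task failure
            if attempt == max_attempts:
                raise                  # the job is considered failed

failures = iter([True, True, False])   # fails twice, then succeeds

def flaky_task():
    if next(failures):
        raise RuntimeError("node went down")
    return "done"

result = run_with_retries(flaky_task)
print(result)  # done
```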

Install Hadoop and play with MapReduce. Next topic in the Hadoop MapReduce tutorial is the Map Abstraction in MapReduce.


2.3. Map Abstraction


Let us understand the abstract form of Map, the first phase of the MapReduce paradigm: what a map/mapper is, what the input to the mapper is, how it processes the data, and what the output from the mapper is.

The map takes a key/value pair as input. Whether the data is in a structured or unstructured format, the framework converts the incoming data into keys and values.

  • Key is a reference to the input value.
  • Value is the data set on which to operate.

Map Processing:

  • A function defined by the user – the user can write custom business logic according to his need to process the data.
  • It applies to every value in the input.
  • The output of Map is called intermediate output.
  • It can be of a different type from the input pair.
  • The output of the map is stored on the local disk, from where it is shuffled to the reduce nodes.

Next in the Hadoop MapReduce tutorial is the Reduce abstraction.


2.4. Reduce Abstraction

Now let's discuss the second phase of MapReduce, the Reducer, in this MapReduce tutorial: what the input to the reducer is, what work the reducer does, and where the reducer writes its output.

Reduce takes intermediate key/value pairs as input and processes the output of the mapper. Usually, in the reducer, we do aggregation or summation-style computation.

  • The input given to the reducer is generated by Map (the intermediate output).
  • The key/value pairs provided to reduce are sorted by key.

Reduce processing:

  • A function defined by the user – here also the user can write custom business logic and get the final output.
  • An iterator supplies the values for a given key to the Reduce function.

Reduce produces a final list of key/value pairs:

  • The output of Reduce is called the final output.
  • It can be of a different type from the input pair.
  • The output of Reduce is stored in HDFS.
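Putting the Map and Reduce abstractions together: the sketch below uses the classic maximum-temperature example in plain Python. The map step emits (year, temperature) pairs, the framework's shuffle/sort groups and orders them by key, and the reduce step aggregates each group. The records are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (year, temperature) readings.
records = [("2024", 31), ("2023", 25), ("2024", 28), ("2023", 29)]

def map_fn(year, temp):
    return (year, temp)            # emit an intermediate key/value pair

def reduce_fn(year, temps):
    return (year, max(temps))      # aggregate all values for one key

# Shuffle: group intermediate pairs by key.
grouped = defaultdict(list)
for year, temp in (map_fn(y, t) for y, t in records):
    grouped[year].append(temp)

# Sort by key, then reduce each group to the final output.
final = [reduce_fn(y, ts) for y, ts in sorted(grouped.items())]
print(final)  # [('2023', 29), ('2024', 31)]
```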

Let us understand in this Hadoop MapReduce Tutorial How Map and Reduce work together.

2.5. How Map and Reduce work Together?

Let us understand how Hadoop Map and Reduce work together.

MapReduce Tutorial: Learn how Hadoop Map and Reduce work together.

                                   Hadoop MapReduce Tutorial: Combined working of Map and Reduce


The input data given to the mapper is processed through a user-defined function written at the mapper. All the required complex business logic is implemented at the mapper level, so that heavy processing is done by the mappers in parallel, as the number of mappers is much larger than the number of reducers. The mapper generates an output, which is intermediate data, and this output goes as input to the reducer.

This intermediate result is then processed by a user-defined function written at the reducer, and the final output is generated. Usually, very light processing is done in the reducer. This final output is stored in HDFS, and replication is done as usual.

2.6. MapReduce DataFlow

Now let's understand, in this Hadoop MapReduce tutorial, the complete end-to-end data flow of MapReduce: how input is given to the mapper, how mappers process data, where mappers write the data, how data is shuffled from mapper to reducer nodes, where reducers run, and what type of processing should be done in the reducers.

MapReduce Tutorial: Apache Hadoop MapReduce data flow process.

                                     Hadoop MapReduce Tutorial: Hadoop MapReduce Dataflow Process


As seen in the diagram of the MapReduce workflow in Hadoop, each square block is a slave. There are 3 slaves in the figure. Mappers will run on all 3 slaves, and then a reducer will run on any one of them. For simplicity, the reducer is shown on a different machine in the figure, but it will run on a mapper node.

Let us now discuss the map phase:


The input to a mapper is one block at a time (a split equals a block by default).


The output of a mapper is written to the local disk of the machine on which the mapper is running. Once the map finishes, this intermediate output travels to the reducer nodes (the nodes where the reducers will run).

The reducer is the second phase of processing, where the user can again write custom business logic. The output of the reducer is the final output, written to HDFS.


By default, 2 mappers run at a time on a slave, which can be increased as per requirements. This depends on factors like DataNode hardware, block size, machine configuration, etc. We should not increase the number of mappers beyond a certain limit, because doing so will decrease performance.


The mapper in Hadoop MapReduce writes its output to the local disk of the machine on which it is working. This is temporary data, also called intermediate output. All mappers write their output to the local disk. As each mapper finishes, its output travels from the mapper node to a reducer node; this movement of output from mapper nodes to reducer nodes is called the shuffle.


The reducer is also deployed on one of the DataNodes. The output from all the mappers goes to the reducer; these outputs from different mappers are merged to form the input for the reducer, and this input is also on the local disk. The reducer is another stage where you can write custom business logic. It is the second stage of processing; usually, in the reducer, we write aggregation, summation, and similar functionality. The reducer then produces the final output, which it writes to HDFS.


Map and reduce are stages of processing that run one after the other: only after all mappers complete their processing does the reducer start.


Though each block is present at 3 different locations by default, the framework allows only 1 mapper to process each block, so only 1 mapper processes a particular block out of its 3 replicas. The output of every mapper goes to every reducer in the cluster, i.e. every reducer receives input from all the mappers. The framework then signals the reducer that all the data has been processed by the mappers, so the reducer can process it.


The output from a mapper is partitioned and filtered into many partitions by the partitioner. Each partition goes to a reducer based on certain conditions. Hadoop works on the key/value principle: the mapper and reducer get their input in the form of keys and values and write their output in the same form. Follow this link to learn how Hadoop works internally. MapReduce data flow is the most important topic in this MapReduce tutorial; if you have any query regarding this or any other topic in the MapReduce tutorial, just drop a comment and we will get back to you. Now, let us move ahead in this MapReduce tutorial with the data locality principle.
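The partitioning step can be sketched as follows. Hadoop's default Java HashPartitioner assigns each intermediate key to a reducer via the key's hash; the sketch below uses `crc32` only so the Python example is deterministic, and the key names are hypothetical.

```python
import zlib

def partition(key, num_reducers):
    """Assign a key to exactly one of `num_reducers` partitions,
    like a hash partitioner: same key -> same reducer, always."""
    return zlib.crc32(key.encode()) % num_reducers

keys = ["apple", "banana", "apple", "cherry"]
parts = [partition(k, 3) for k in keys]

# Both occurrences of "apple" land on the same reducer:
print(parts[0] == parts[2])  # True
```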


2.7. Data Locality in MapReduce


Let's understand what data locality is, how it optimizes MapReduce jobs, and how it improves job performance.


“Move computation close to the data rather than data to computation.” A computation requested by an application is much more efficient if it is executed near the data it operates on. This is especially true when the size of the data is very large, as it minimizes network congestion and increases the throughput of the system. The assumption is that it is often better to move the computation closer to where the data is present rather than moving the data to where the application is running. Hence, HDFS provides interfaces for applications to move themselves closer to where the data is present.

Since Hadoop works on huge volumes of data, it is not practical to move such volumes over the network. Hence, Hadoop came up with the innovative principle of moving the algorithm to the data rather than the data to the algorithm. This is called data locality.


This was all about the Hadoop MapReduce Tutorial.


3. Conclusion: Hadoop MapReduce Tutorial

Hence, MapReduce empowers the functionality of Hadoop. Since it works on the concept of data locality, it improves performance. In the next MapReduce tutorial, we will learn about the shuffling and sorting phase in detail.

This was all about the Hadoop MapReduce tutorial. I hope you are now clear on what MapReduce is.

See Also-

If you have any question regarding the Hadoop Mapreduce Tutorial OR if you like the Hadoop MapReduce tutorial please let us know your feedback in the comment section.

We live in a world of information technology and mobile communications. The fast-paced lifestyle of globally interconnected people demands efficiency, and the key to success in this modern world is speed and accuracy. Knowledge comes from information, and information is extracted from data. Data is basically a coded representation of virtually anything that concerns human beings. In the context of computing, data is text, numbers, formulas, images, animation, video, etc.

Data is organized into databases for efficient storage, access, and modification. Structured Query Language (SQL) and relational database management systems (RDBMSs) achieved a lot of success; they helped automate modern businesses and drove the evolution of enterprise-level information systems. But the internet and smartphones have totally changed the game in recent years. We now live in a world of disruptive technologies, and the buzzwords are Big Data, Artificial Intelligence, and Machine Learning.



Streaming video is familiar even to young children, and companies like Google deal with terabytes (TB) of data. Big data refers to the voluminous growth of information sources across the globe, and analytics is the discipline that handles large and diverse types of data. Meaningful information cannot be generated without data processing, and this is where Python comes in as an excellent programming language.

Other contemporary software technologies include Java, R, Oracle NoSQL, Apache Spark, Hadoop, and SAS. These products are variously described as programming languages, software suites, analytical tools, platforms, and computing frameworks. The professionals who use them are designated as data scientists, analysts, statisticians, database administrators, and architects. Their skills include engineering math, computing, algorithmic analysis, AI, and machine learning.


Tools and Technologies

Big data has specific traits such as volume, velocity, veracity, and variety. There are large volumes of diverse data like text, images, audio, and video. In addition, analysts have to use reliable software tools that can cope with rapid data collection. Python stands out as a contemporary language with a rich set of "ready to use" libraries. The language is easy to learn, and skilled programmers can write concise, readable code.
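To illustrate the point about concise, readable code, here is a short standard-library example (the text is made-up sample input, not from any real data set):

```python
from collections import Counter

# Count word frequencies and find the most common word in a few lines --
# the kind of concise, readable code Python is known for.
text = "big data needs big tools and big ideas"
top = Counter(text.split()).most_common(1)
print(top)  # [('big', 3)]
```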


What Happens in Data Analysis?

Raw or structured data has to be subjected to elaborate processing. The procedures or methods aim to explore the various facets of data sets. In a database, the analyst submits queries and extracts useful information. The basic idea is to manipulate data, collect leads, and make informed decisions for a business organization. Traditionally, the beneficiaries were IT start-ups, web services like Google, and business corporations.

However, Big Data requires much more than simple database queries and numerical reports. A comprehensive field of study known as data science is currently in high demand. The scientists aspire to achieve predetermined goals using a sequence of steps. They carry out data retrieval, preparation, analysis, and modeling. Statistical techniques, visualization, and data mining activities are crucial for success.
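The retrieval, preparation, analysis, and modeling steps described above can be sketched in a few lines of toy Python (all function names and data here are illustrative, not part of any real pipeline):

```python
# Minimal sketch of the data-science sequence named above:
# retrieval -> preparation -> analysis -> modeling.

def retrieve():
    # Retrieval: raw records straight from a source, some malformed.
    return ["12", "15", "n/a", "11", "14", ""]

def prepare(raw):
    # Preparation: keep only records that parse as numbers.
    return [int(r) for r in raw if r.isdigit()]

def analyze(values):
    # Analysis: a simple summary statistic.
    return sum(values) / len(values)

def model(values):
    # Modeling: a trivial "always predict the mean" baseline.
    mean = sum(values) / len(values)
    return lambda _x: mean

clean = prepare(retrieve())
print(analyze(clean))  # 13.0
```

Real pipelines replace each stage with far heavier machinery (databases, cleaning rules, statistical models), but the shape of the workflow is the same.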


Why Big Data?

Let us look at the different types of big data to comprehend the underlying challenges -

  1. Structured - Database systems for payroll, HR, inventory, sales management, etc.; an Excel spreadsheet with student grades, a library book catalogue, a pharmacy drug list, and so on.
  2. Unstructured - Good examples are web or mobile based text exchange or email. Twitter or micro-blog messages are also in plain English.
  3. Natural Language - Domain specific communication and comprehension of languages. It relates to linguistic concepts like syntax, structure, semantics, etc.
  4. Graph - In computer science, this is a hierarchical representation of related data. Graphs and trees are traversed to establish networks and understand connections.
  5. Multimedia, Streaming, etc. - Audio, video, pictures, and streamed data. Methods include image sensing, video stream processing, deep learning, 3D modeling, and event handling.


What Are Its Uses?

The internet is not just about accessing websites for information or entertainment, nor is it restricted to online businesses. Social media has boomed, and smartphone dependency is a reality. The volume of diverse data has grown by leaps and bounds from kilo- or megabytes to giga-, tera-, and petabytes. The future is being described in terms of Predictive Analytics, Cloud Computing, Artificial Intelligence, and Quantum Computers. The beneficiaries include the corporate world, government organizations, academia, and the global technocracy.

Python is a highly versatile and multi-purpose language with reusable code and packages. A developer can learn and adapt quickly to the programming environment and tools. There is support for natural language processing via toolkits such as NLTK. Web integration is smooth and dependable, as are extensibility and scalability.


Power of Python

Advantages of Python

  • A practical and pragmatic choice that has easy to learn syntax and semantics.
  • Extensive support for structured and object-oriented programming concepts.
  • User-friendly and flexible integrated development environments (IDEs).
  • Even non-programmers can learn quickly due to easily readable code.
  • Advanced features like modularity, exception handling, and code reusability.
  • Packages for a wide range of applications (bioinformatics, AI, social sciences).
  • Extensive support for networking, databases, web technologies, and data science.
  • Efficient coding and data processing on multiple platforms and operating systems.
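Two features from the list above, modularity and exception handling, can be seen in a tiny reusable helper (the function and values are illustrative, not from any library):

```python
def safe_ratio(numerator, denominator):
    """Reusable helper (modularity) that guards a common failure case."""
    try:
        return numerator / denominator
    except ZeroDivisionError:
        # Exception handling: return a sentinel instead of crashing.
        return None

print(safe_ratio(10, 4))  # 2.5
print(safe_ratio(10, 0))  # None
```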


Data Scientist Support

The R programming language is a force to reckon with in statistical analysis. Python does not lag behind, as it has many powerful packages for data scientists.

  1. Hadoop - It is an open source, Apache platform for big data solutions. Python integrates well with Hadoop HDFS API. The PyDoop package facilitates complex problem solving and data retrieval. It can also be used to develop MapReduce applications (parallel, distributed algorithms for big data processing).
  2. Computing - Modules like NumPy, SciPy, Pandas, etc., assist in math and numerical analysis. Programmers work with multi-dimensional arrays, matrices, linear algebra, and calculus. High level data structures and optimal code perform analysis, transformation, and mapping of data sets.
  3. Machine Learning - Efficient algorithmic analysis and processing through PyBrain, TensorFlow, etc. Scikit-learn is used with SciPy for data clustering, regression, and classification.
  4. Visualization - Matplotlib, Statsmodels, and Gensim assist in data visualization, graph plotting, statistical and topical modeling.
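As a small taste of the array math that packages like NumPy enable, the snippet below computes per-column averages and centers a matrix via broadcasting (the scores are made-up sample data):

```python
import numpy as np

# Illustrative data: rows are students, columns are tests.
scores = np.array([[90, 80], [70, 100], [80, 90]])

col_means = scores.mean(axis=0)  # per-test averages
centered = scores - col_means    # broadcasting subtracts the means row-wise

print(col_means.tolist())        # [80.0, 90.0]
```

The same two lines scale unchanged from a 3x2 toy matrix to millions of rows, which is why these libraries underpin so much data-science work.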



We live in a world of high-performance computers and networking technologies. Data and information have become vital for business and personal success. Information systems with relational databases and user-friendly front ends no longer suffice. Social media is on the upswing, and progress is defined by informed choices and faster data access. Humans cannot handle the information overload alone, and so data science has become popular.

Voluminous amounts of data are generated in the form of text, numbers, pictures, audio, and video. Data scientists develop models and techniques to access and manipulate this Big Data. Their analysis is crucial for information retrieval and business profits. Python and R are rated as sophisticated software for achieving these goals. Python has powerful packages for developing applications in engineering, AI, bioinformatics, and the social sciences (media, politics, governance, Twitter, Facebook, etc.).


  1. Install MapR 6.0 Sandbox:
  2. Ensure you have enough space on the Sandbox to install StreamSets Data Collector and StreamSets Data Collector Edge. Keep at least 5GB of space available. To check how much space is available and/or to add more space, follow this guide:


Install StreamSets Data Collector

  1. SSH into the Sandbox and log in as root


$ ssh mapr@localhost -p 2222


Last login: Wed Jan 31 21:30:50 2018

Welcome to your Mapr Demo virtual machine.

[mapr@maprdemo ~]$ su -


Last login: Wed Jan 31 21:30:54 PST 2018 on pts/0

[root@maprdemo ~]#


  2. Download the RPM and extract the binaries

Get the latest version to install from

Note: We’ll be using StreamSets Data Collector version





[root@maprdemo ~]# wget


Note: If the download link does not work, use the fully qualified download link:


--2018-02-01 05:37:42--

Resolving (

Connecting to (||:80... connected.

HTTP request sent, awaiting response... 200 OK

Length: 3914629120 (3.6G) [application/x-tar]

Saving to: ‘streamsets-datacollector-’


[root@maprdemo ~]# tar -xf streamsets-datacollector-

[root@maprdemo ~]# ls

anaconda-ks.cfg  config.sandbox original-ks.cfg  streamsets-datacollector-  streamsets-datacollector-

[root@maprdemo ~]#


  3. Remove unneeded stage libraries

StreamSets installs each package as a stage library. You can do a full install with all the stage libraries or selectively install only what’s required. A full install takes ~3.5GB of space. We do not need a full install, because half the stage libraries are not required for MapR. Remove the unwanted stage libraries as follows:


[root@maprdemo ~]# cd streamsets-datacollector-

[root@maprdemo streamsets-datacollector-]# rm -rf streamsets-datacollector-cdh* && rm -rf streamsets-datacollector-hdp* && rm -rf streamsets-datacollector-apache-kudu* && rm -rf streamsets-datacollector-mapr_5*


  4. Install

[root@maprdemo streamsets-datacollector-]# pwd


[root@maprdemo streamsets-datacollector-]# yum localinstall streamsets*.rpm

Loaded plugins: fastestmirror, langpacks

Examining streamsets-datacollector- streamsets-datacollector-

Marking streamsets-datacollector- to be installed

Examining streamsets-datacollector-apache-kafka_0_10-lib- streamsets-datacollector-apache-kafka_0_10-lib-

Marking streamsets-datacollector-apache-kafka_0_10-lib- to be installed

Examining streamsets-datacollector-apache-kafka_0_11-lib- streamsets-datacollector-apache-kafka_0_11-lib-

Marking streamsets-datacollector-apache-kafka_0_11-lib- to be installed

Examining streamsets-datacollector-apache-kafka_0_9-lib- streamsets-datacollector-apache-kafka_0_9-lib-

Marking streamsets-datacollector-apache-kafka_0_9-lib- to be installed

Examining streamsets-datacollector-apache-kafka_1_0-lib- streamsets-datacollector-apache-kafka_1_0-lib-

Marking streamsets-datacollector-apache-kafka_1_0-lib- to be installed

Examining streamsets-datacollector-apache-solr_6_1_0-lib- streamsets-datacollector-apache-solr_6_1_0-lib-

Marking streamsets-datacollector-apache-solr_6_1_0-lib- to be installed

Examining streamsets-datacollector-aws-lib- streamsets-datacollector-aws-lib-

Marking streamsets-datacollector-aws-lib- to be installed

Examining streamsets-datacollector-azure-lib- streamsets-datacollector-azure-lib-

Marking streamsets-datacollector-azure-lib- to be installed

Examining streamsets-datacollector-basic-lib- streamsets-datacollector-basic-lib-

Marking streamsets-datacollector-basic-lib- to be installed

Examining streamsets-datacollector-bigtable-lib- streamsets-datacollector-bigtable-lib-

Marking streamsets-datacollector-bigtable-lib- to be installed

Examining streamsets-datacollector-cassandra_3-lib- streamsets-datacollector-cassandra_3-lib-

Marking streamsets-datacollector-cassandra_3-lib- to be installed

Examining streamsets-datacollector-cyberark-credentialstore-lib- streamsets-datacollector-cyberark-credentialstore-lib-

Marking streamsets-datacollector-cyberark-credentialstore-lib- to be installed

Examining streamsets-datacollector-dev-lib- streamsets-datacollector-dev-lib-

Marking streamsets-datacollector-dev-lib- to be installed

Examining streamsets-datacollector-elasticsearch_5-lib- streamsets-datacollector-elasticsearch_5-lib-

Marking streamsets-datacollector-elasticsearch_5-lib- to be installed

Examining streamsets-datacollector-google-cloud-lib- streamsets-datacollector-google-cloud-lib-

Marking streamsets-datacollector-google-cloud-lib- to be installed

Examining streamsets-datacollector-groovy_2_4-lib- streamsets-datacollector-groovy_2_4-lib-

Marking streamsets-datacollector-groovy_2_4-lib- to be installed

Examining streamsets-datacollector-influxdb_0_9-lib- streamsets-datacollector-influxdb_0_9-lib-

Marking streamsets-datacollector-influxdb_0_9-lib- to be installed

Examining streamsets-datacollector-jdbc-lib- streamsets-datacollector-jdbc-lib-

Marking streamsets-datacollector-jdbc-lib- to be installed

Examining streamsets-datacollector-jks-credentialstore-lib- streamsets-datacollector-jks-credentialstore-lib-

Marking streamsets-datacollector-jks-credentialstore-lib- to be installed

Examining streamsets-datacollector-jms-lib- streamsets-datacollector-jms-lib-

Marking streamsets-datacollector-jms-lib- to be installed

Examining streamsets-datacollector-jython_2_7-lib- streamsets-datacollector-jython_2_7-lib-

Marking streamsets-datacollector-jython_2_7-lib- to be installed

Examining streamsets-datacollector-kinetica_6_0-lib- streamsets-datacollector-kinetica_6_0-lib-

Marking streamsets-datacollector-kinetica_6_0-lib- to be installed

Examining streamsets-datacollector-mapr_6_0-lib- streamsets-datacollector-mapr_6_0-lib-

Marking streamsets-datacollector-mapr_6_0-lib- to be installed

Examining streamsets-datacollector-mapr_6_0-mep4-lib- streamsets-datacollector-mapr_6_0-mep4-lib-

Marking streamsets-datacollector-mapr_6_0-mep4-lib- to be installed

Examining streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib- streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib-

Marking streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib- to be installed

Examining streamsets-datacollector-mongodb_3-lib- streamsets-datacollector-mongodb_3-lib-

Marking streamsets-datacollector-mongodb_3-lib- to be installed

Examining streamsets-datacollector-mysql-binlog-lib- streamsets-datacollector-mysql-binlog-lib-

Marking streamsets-datacollector-mysql-binlog-lib- to be installed

Examining streamsets-datacollector-omniture-lib- streamsets-datacollector-omniture-lib-

Marking streamsets-datacollector-omniture-lib- to be installed

Examining streamsets-datacollector-rabbitmq-lib- streamsets-datacollector-rabbitmq-lib-

Marking streamsets-datacollector-rabbitmq-lib- to be installed

Examining streamsets-datacollector-redis-lib- streamsets-datacollector-redis-lib-

Marking streamsets-datacollector-redis-lib- to be installed

Examining streamsets-datacollector-salesforce-lib- streamsets-datacollector-salesforce-lib-

Marking streamsets-datacollector-salesforce-lib- to be installed

Examining streamsets-datacollector-stats-lib- streamsets-datacollector-stats-lib-

Marking streamsets-datacollector-stats-lib- to be installed

Examining streamsets-datacollector-vault-credentialstore-lib- streamsets-datacollector-vault-credentialstore-lib-

Marking streamsets-datacollector-vault-credentialstore-lib- to be installed

Examining streamsets-datacollector-windows-lib- streamsets-datacollector-windows-lib-

Marking streamsets-datacollector-windows-lib- to be installed

Resolving Dependencies

--> Running transaction check

---> Package streamsets-datacollector.noarch 0: will be installed

---> Package streamsets-datacollector-apache-kafka_0_10-lib.noarch 0: will be installed

---> Package streamsets-datacollector-apache-kafka_0_11-lib.noarch 0: will be installed

---> Package streamsets-datacollector-apache-kafka_0_9-lib.noarch 0: will be installed

---> Package streamsets-datacollector-apache-kafka_1_0-lib.noarch 0: will be installed

---> Package streamsets-datacollector-apache-solr_6_1_0-lib.noarch 0: will be installed

---> Package streamsets-datacollector-aws-lib.noarch 0: will be installed

---> Package streamsets-datacollector-azure-lib.noarch 0: will be installed

---> Package streamsets-datacollector-basic-lib.noarch 0: will be installed

---> Package streamsets-datacollector-bigtable-lib.noarch 0: will be installed

---> Package streamsets-datacollector-cassandra_3-lib.noarch 0: will be installed

---> Package streamsets-datacollector-cyberark-credentialstore-lib.noarch 0: will be installed

---> Package streamsets-datacollector-dev-lib.noarch 0: will be installed

---> Package streamsets-datacollector-elasticsearch_5-lib.noarch 0: will be installed

---> Package streamsets-datacollector-google-cloud-lib.noarch 0: will be installed

---> Package streamsets-datacollector-groovy_2_4-lib.noarch 0: will be installed

---> Package streamsets-datacollector-influxdb_0_9-lib.noarch 0: will be installed

---> Package streamsets-datacollector-jdbc-lib.noarch 0: will be installed

---> Package streamsets-datacollector-jks-credentialstore-lib.noarch 0: will be installed

---> Package streamsets-datacollector-jms-lib.noarch 0: will be installed

---> Package streamsets-datacollector-jython_2_7-lib.noarch 0: will be installed

---> Package streamsets-datacollector-kinetica_6_0-lib.noarch 0: will be installed

---> Package streamsets-datacollector-mapr_6_0-lib.noarch 0: will be installed

---> Package streamsets-datacollector-mapr_6_0-mep4-lib.noarch 0: will be installed

---> Package streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib.noarch 0: will be installed

---> Package streamsets-datacollector-mongodb_3-lib.noarch 0: will be installed

---> Package streamsets-datacollector-mysql-binlog-lib.noarch 0: will be installed

---> Package streamsets-datacollector-omniture-lib.noarch 0: will be installed

---> Package streamsets-datacollector-rabbitmq-lib.noarch 0: will be installed

---> Package streamsets-datacollector-redis-lib.noarch 0: will be installed

---> Package streamsets-datacollector-salesforce-lib.noarch 0: will be installed

---> Package streamsets-datacollector-stats-lib.noarch 0: will be installed

---> Package streamsets-datacollector-vault-credentialstore-lib.noarch 0: will be installed

---> Package streamsets-datacollector-windows-lib.noarch 0: will be installed

--> Finished Dependency Resolution

MapR_Core                                                                                                                                                                            | 1.4 kB 00:00:00

MapR_Core/primary                                                                                                                                                                    | 4.7 kB 00:00:00

MapR_Ecosystem                                                                                                                                                                       | 1.4 kB 00:00:00

MapR_Ecosystem/primary                                                                                                                                                               | 14 kB 00:00:00

base/7/x86_64                                                                                                                                                                        | 3.6 kB 00:00:00

base/7/x86_64/group_gz                                                                                                                                                               | 156 kB 00:00:00

base/7/x86_64/primary_db                                                                                                                                                             | 5.7 MB 00:00:02

epel/x86_64/metalink                                                                                                                                                                 | 13 kB 00:00:00

epel/x86_64                                                                                                                                                                          | 4.7 kB 00:00:00

epel/x86_64/group_gz                                                                                                                                                                 | 266 kB 00:00:00

epel/x86_64/updateinfo                                                                                                                                                               | 880 kB 00:00:00

epel/x86_64/primary_db                                                                                                                                                               | 6.2 MB 00:00:01

extras/7/x86_64                                                                                                                                                                      | 3.4 kB 00:00:00

extras/7/x86_64/primary_db                                                                                                                                                           | 166 kB 00:00:00

updates/7/x86_64                                                                                                                                                                     | 3.4 kB 00:00:00

updates/7/x86_64/primary_db                                                                                                                                                          | 6.0 MB 00:00:01


Dependencies Resolved



Package                                                            Arch Version Repository                                                               Size



streamsets-datacollector                                           noarch /streamsets-datacollector-                                           162 M

streamsets-datacollector-apache-kafka_0_10-lib                     noarch /streamsets-datacollector-apache-kafka_0_10-lib-                      38 M

streamsets-datacollector-apache-kafka_0_11-lib                     noarch /streamsets-datacollector-apache-kafka_0_11-lib-                      40 M

streamsets-datacollector-apache-kafka_0_9-lib                      noarch /streamsets-datacollector-apache-kafka_0_9-lib-                       38 M

streamsets-datacollector-apache-kafka_1_0-lib                      noarch /streamsets-datacollector-apache-kafka_1_0-lib-                       40 M

streamsets-datacollector-apache-solr_6_1_0-lib                     noarch /streamsets-datacollector-apache-solr_6_1_0-lib-                      17 M

streamsets-datacollector-aws-lib                                   noarch /streamsets-datacollector-aws-lib-                                    46 M

streamsets-datacollector-azure-lib                                 noarch /streamsets-datacollector-azure-lib-                                  18 M

streamsets-datacollector-basic-lib                                 noarch /streamsets-datacollector-basic-lib-                                  36 M

streamsets-datacollector-bigtable-lib                              noarch /streamsets-datacollector-bigtable-lib-                               55 M

streamsets-datacollector-cassandra_3-lib                           noarch /streamsets-datacollector-cassandra_3-lib-                            17 M

streamsets-datacollector-cyberark-credentialstore-lib              noarch /streamsets-datacollector-cyberark-credentialstore-lib-              5.2 M

streamsets-datacollector-dev-lib                                   noarch /streamsets-datacollector-dev-lib-                                    14 M

streamsets-datacollector-elasticsearch_5-lib                       noarch /streamsets-datacollector-elasticsearch_5-lib-                        18 M

streamsets-datacollector-google-cloud-lib                          noarch /streamsets-datacollector-google-cloud-lib-                           28 M

streamsets-datacollector-groovy_2_4-lib                            noarch /streamsets-datacollector-groovy_2_4-lib-                             19 M

streamsets-datacollector-influxdb_0_9-lib                          noarch /streamsets-datacollector-influxdb_0_9-lib-                           14 M

streamsets-datacollector-jdbc-lib                                  noarch /streamsets-datacollector-jdbc-lib-                                   27 M

streamsets-datacollector-jks-credentialstore-lib                   noarch /streamsets-datacollector-jks-credentialstore-lib-                   2.6 M

streamsets-datacollector-jms-lib                                   noarch /streamsets-datacollector-jms-lib-                                    17 M

streamsets-datacollector-jython_2_7-lib                            noarch /streamsets-datacollector-jython_2_7-lib-                             53 M

streamsets-datacollector-kinetica_6_0-lib                          noarch /streamsets-datacollector-kinetica_6_0-lib-                           32 M

streamsets-datacollector-mapr_6_0-lib                              noarch /streamsets-datacollector-mapr_6_0-lib-                               43 M

streamsets-datacollector-mapr_6_0-mep4-lib                         noarch /streamsets-datacollector-mapr_6_0-mep4-lib-                          94 M

streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib                noarch /streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib-                152 M

streamsets-datacollector-mongodb_3-lib                             noarch /streamsets-datacollector-mongodb_3-lib-                              16 M

streamsets-datacollector-mysql-binlog-lib                          noarch /streamsets-datacollector-mysql-binlog-lib-                           16 M

streamsets-datacollector-omniture-lib                              noarch /streamsets-datacollector-omniture-lib-                               15 M

streamsets-datacollector-rabbitmq-lib                              noarch /streamsets-datacollector-rabbitmq-lib-                               16 M

streamsets-datacollector-redis-lib                                 noarch /streamsets-datacollector-redis-lib-                                  14 M

streamsets-datacollector-salesforce-lib                            noarch /streamsets-datacollector-salesforce-lib-                             20 M

streamsets-datacollector-stats-lib                                 noarch /streamsets-datacollector-stats-lib-                                  32 M

streamsets-datacollector-vault-credentialstore-lib                 noarch /streamsets-datacollector-vault-credentialstore-lib-                 3.8 M

streamsets-datacollector-windows-lib                               noarch /streamsets-datacollector-windows-lib-                                14 M


Transaction Summary


Install  34 Packages


Total size: 1.1 G

Installed size: 1.1 G

Is this ok [y/d/N]: y

Downloading packages:

Running transaction check

Running transaction test

Transaction test succeeded

Running transaction

 Installing : streamsets-datacollector-                                                                                                                                               1/34

 Installing : streamsets-datacollector-salesforce-lib-                                                                                                                                2/34

 Installing : streamsets-datacollector-groovy_2_4-lib-                                                                                                                                3/34

 Installing : streamsets-datacollector-cyberark-credentialstore-lib-                                                                                                                  4/34

 Installing : streamsets-datacollector-aws-lib-                                                                                                                                       5/34

 Installing : streamsets-datacollector-cassandra_3-lib-                                                                                                                               6/34

 Installing : streamsets-datacollector-rabbitmq-lib-                                                                                                                                  7/34

 Installing : streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib-                                                                                                                    8/34

 Installing : streamsets-datacollector-jdbc-lib-                                                                                                                                      9/34

 Installing : streamsets-datacollector-apache-kafka_1_0-lib-                                                                                                                         10/34

 Installing : streamsets-datacollector-dev-lib-                                                                                                                                      11/34

 Installing : streamsets-datacollector-omniture-lib-                                                                                                                                 12/34

 Installing : streamsets-datacollector-mongodb_3-lib-                                                                                                                                13/34

 Installing : streamsets-datacollector-redis-lib-                                                                                                                                    14/34

 Installing : streamsets-datacollector-windows-lib-                                                                                                                                  15/34

 Installing : streamsets-datacollector-jks-credentialstore-lib-                                                                                                                      16/34

 Installing : streamsets-datacollector-jython_2_7-lib-                                                                                                                               17/34

 Installing : streamsets-datacollector-kinetica_6_0-lib-                                                                                                                             18/34

 Installing : streamsets-datacollector-jms-lib-                                                                                                                                      19/34

 Installing : streamsets-datacollector-stats-lib-                                                                                                                                    20/34

 Installing : streamsets-datacollector-elasticsearch_5-lib-                                                                                                                          21/34

 Installing : streamsets-datacollector-apache-solr_6_1_0-lib-                                                                                                                        22/34

 Installing : streamsets-datacollector-apache-kafka_0_11-lib-                                                                                                                        23/34

 Installing : streamsets-datacollector-mapr_6_0-lib-                                                                                                                                 24/34

 Installing : streamsets-datacollector-azure-lib-                                                                                                                                    25/34

 Installing : streamsets-datacollector-mysql-binlog-lib-                                                                                                                             26/34

 Installing : streamsets-datacollector-vault-credentialstore-lib-                                                                                                                    27/34

 Installing : streamsets-datacollector-apache-kafka_0_10-lib-                                                                                                                        28/34

 Installing : streamsets-datacollector-basic-lib-                                                                                                                                    29/34

 Installing : streamsets-datacollector-influxdb_0_9-lib-                                                                                                                             30/34

 Installing : streamsets-datacollector-apache-kafka_0_9-lib-                                                                                                                         31/34

 Installing : streamsets-datacollector-mapr_6_0-mep4-lib-                                                                                                                            32/34

 Installing : streamsets-datacollector-bigtable-lib-                                                                                                                                 33/34

 Installing : streamsets-datacollector-google-cloud-lib-                                                                                                                             34/34

 Verifying  : streamsets-datacollector-salesforce-lib-                                                                                                                                1/34

 Verifying  : streamsets-datacollector-groovy_2_4-lib-                                                                                                                                2/34

 Verifying  : streamsets-datacollector-cyberark-credentialstore-lib-                                                                                                                  3/34

 Verifying  : streamsets-datacollector-aws-lib-                                                                                                                                       4/34

 Verifying  : streamsets-datacollector-cassandra_3-lib-                                                                                                                               5/34

 Verifying  : streamsets-datacollector-rabbitmq-lib-                                                                                                                                  6/34

 Verifying  : streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib-                                                                                                                    7/34

 Verifying  : streamsets-datacollector-jdbc-lib-                                                                                                                                      8/34

 Verifying  : streamsets-datacollector-apache-kafka_1_0-lib-                                                                                                                          9/34

 Verifying  : streamsets-datacollector-dev-lib-                                                                                                                                      10/34

 Verifying  : streamsets-datacollector-omniture-lib-                                                                                                                                 11/34

 Verifying  : streamsets-datacollector-mongodb_3-lib-                                                                                                                                12/34

 Verifying  : streamsets-datacollector-redis-lib-                                                                                                                                    13/34

 Verifying  : streamsets-datacollector-windows-lib-                                                                                                                                  14/34

 Verifying  : streamsets-datacollector-jks-credentialstore-lib-                                                                                                                      15/34

 Verifying  : streamsets-datacollector-jython_2_7-lib-                                                                                                                               16/34

 Verifying  : streamsets-datacollector-kinetica_6_0-lib-                                                                                                                             17/34

 Verifying  : streamsets-datacollector-jms-lib-                                                                                                                                      18/34

 Verifying  : streamsets-datacollector-stats-lib-                                                                                                                                    19/34

 Verifying  : streamsets-datacollector-elasticsearch_5-lib-                                                                                                                          20/34

 Verifying  : streamsets-datacollector-apache-solr_6_1_0-lib-                                                                                                                        21/34

 Verifying  : streamsets-datacollector-apache-kafka_0_11-lib-                                                                                                                        22/34

 Verifying  : streamsets-datacollector-mapr_6_0-lib-                                                                                                                                 23/34

 Verifying  : streamsets-datacollector-azure-lib-                                                                                                                                    24/34

 Verifying  : streamsets-datacollector-mysql-binlog-lib-                                                                                                                             25/34

 Verifying  : streamsets-datacollector-vault-credentialstore-lib-                                                                                                                    26/34

 Verifying  : streamsets-datacollector-apache-kafka_0_10-lib-                                                                                                                        27/34

 Verifying  : streamsets-datacollector-basic-lib-                                                                                                                                    28/34

 Verifying  : streamsets-datacollector-influxdb_0_9-lib-                                                                                                                             29/34

 Verifying  : streamsets-datacollector-apache-kafka_0_9-lib-                                                                                                                         30/34

 Verifying  : streamsets-datacollector-mapr_6_0-mep4-lib-                                                                                                                            31/34

 Verifying  : streamsets-datacollector-bigtable-lib-                                                                                                                                 32/34

 Verifying  : streamsets-datacollector-google-cloud-lib-                                                                                                                             33/34

 Verifying  : streamsets-datacollector-                                                                                                                                              34/34



 streamsets-datacollector.noarch 0:                                                         streamsets-datacollector-apache-kafka_0_10-lib.noarch 0:

 streamsets-datacollector-apache-kafka_0_11-lib.noarch 0:                                   streamsets-datacollector-apache-kafka_0_9-lib.noarch 0:

 streamsets-datacollector-apache-kafka_1_0-lib.noarch 0:                                    streamsets-datacollector-apache-solr_6_1_0-lib.noarch 0:

 streamsets-datacollector-aws-lib.noarch 0:                                                 streamsets-datacollector-azure-lib.noarch 0:

 streamsets-datacollector-basic-lib.noarch 0:                                               streamsets-datacollector-bigtable-lib.noarch 0:

 streamsets-datacollector-cassandra_3-lib.noarch 0:                                         streamsets-datacollector-cyberark-credentialstore-lib.noarch 0:

 streamsets-datacollector-dev-lib.noarch 0:                                                 streamsets-datacollector-elasticsearch_5-lib.noarch 0:

 streamsets-datacollector-google-cloud-lib.noarch 0:                                        streamsets-datacollector-groovy_2_4-lib.noarch 0:

 streamsets-datacollector-influxdb_0_9-lib.noarch 0:                                        streamsets-datacollector-jdbc-lib.noarch 0:

 streamsets-datacollector-jks-credentialstore-lib.noarch 0:                                 streamsets-datacollector-jms-lib.noarch 0:

 streamsets-datacollector-jython_2_7-lib.noarch 0:                                          streamsets-datacollector-kinetica_6_0-lib.noarch 0:

 streamsets-datacollector-mapr_6_0-lib.noarch 0:                                            streamsets-datacollector-mapr_6_0-mep4-lib.noarch 0:

 streamsets-datacollector-mapr_spark_2_1_mep_3_0-lib.noarch 0:                              streamsets-datacollector-mongodb_3-lib.noarch 0:

 streamsets-datacollector-mysql-binlog-lib.noarch 0:                                        streamsets-datacollector-omniture-lib.noarch 0:

 streamsets-datacollector-rabbitmq-lib.noarch 0:                                            streamsets-datacollector-redis-lib.noarch 0:

 streamsets-datacollector-salesforce-lib.noarch 0:                                          streamsets-datacollector-stats-lib.noarch 0:

 streamsets-datacollector-vault-credentialstore-lib.noarch 0:                               streamsets-datacollector-windows-lib.noarch 0:



[root@maprdemo streamsets-datacollector-]#


  1. Setup connectivity to MapR

The command modifies configuration files, creates the required symbolic links, and installs the appropriate MapR stage libraries.

[root@maprdemo streamsets-datacollector-]# cd /opt/streamsets-datacollector/

[root@maprdemo streamsets-datacollector]# ls

api-lib  bin  cli-lib  container-lib  libexec  libs-common-lib  root-lib  sdc-static-web  streamsets-libs  user-libs

[root@maprdemo streamsets-datacollector]# export SDC_HOME=/opt/streamsets-datacollector

[root@maprdemo streamsets-datacollector]# export SDC_CONF=/etc/sdc

[root@maprdemo streamsets-datacollector]# export MAPR_MEP_VERSION=4

[root@maprdemo streamsets-datacollector]# $SDC_HOME/bin/streamsets setup-mapr


+ printf 'Done\n'


+ echo Succeeded



  1. Start the service

[root@maprdemo streamsets-datacollector-]# systemctl start sdc


  1. Check Service Status

[root@maprdemo streamsets-datacollector-]# systemctl status sdc

  • sdc.service - StreamSets Data Collector (SDC)

  Loaded: loaded (/usr/lib/systemd/system/sdc.service; static; vendor preset: disabled)

  Active: active (running) since Thu 2018-02-01 06:19:20 PST; 26s ago

Main PID: 31899 (_sdc)

  CGroup: /system.slice/sdc.service

          ├─31899 /bin/bash /opt/streamsets-datacollector/libexec/_sdc -verbose

          └─31939 /usr/bin/java -classpath /opt/streamsets-datacollector/libexec/bootstrap-libs/main/streamsets-datacollector-bootstrap-* -Djava.secu...


Feb 01 06:19:20 maprdemo.local streamsets[31899]: API_CLASSPATH                  : /opt/streamsets-datacollector/api-lib/*.jar

Feb 01 06:19:20 maprdemo.local streamsets[31899]: CONTAINER_CLASSPATH            : /etc/sdc:/opt/streamsets-datacollector/container-lib/*.jar

Feb 01 06:19:20 maprdemo.local streamsets[31899]: LIBS_COMMON_LIB_DIR            : /opt/streamsets-datacollector/libs-common-lib/

Feb 01 06:19:20 maprdemo.local streamsets[31899]: STREAMSETS_LIBRARIES_DIR       : /opt/streamsets-datacollector/streamsets-libs

Feb 01 06:19:20 maprdemo.local streamsets[31899]: STREAMSETS_LIBRARIES_EXTRA_DIR : /opt/streamsets-datacollector/streamsets-libs-extras/

Feb 01 06:19:20 maprdemo.local streamsets[31899]: USER_LIBRARIES_DIR             : /opt/streamsets-datacollector/user-libs/

Feb 01 06:19:20 maprdemo.local streamsets[31899]: JAVA OPTS                      : -Xmx1024m -Xms1024m -s...amsets-dataco

Feb 01 06:19:20 maprdemo.local streamsets[31899]: MAIN CLASS                     : com.streamsets.datacollector.main.DataCollectorMain

Feb 01 06:19:21 maprdemo.local streamsets[31899]: Logging initialized @945ms to org.eclipse.jetty.util.log.Slf4jLog

Feb 01 06:19:34 maprdemo.local streamsets[31899]: Running on URI : 'http://maprdemo:18630'

Hint: Some lines were ellipsized, use -l to show in full.


  1. Enable port forwarding

To access the UI for StreamSets Data Collector, you need to make port 18630 accessible.

Here’s how to do that if you use VirtualBox.


Select Settings for the Sandbox and then click on Network settings



Add an entry for host port 18630


Select OK.


  1. Log into SDC & verify MapR stages are visible

Log in to the SDC UI at the following URL: http://localhost:18630

Default login is: admin/admin

Verify that you see MapR stages in the UI by first creating a pipeline.


Create a new pipeline


If all goes well, you should be able to see all the MapR stages as shown above.

I came across this series of podcasts about Kubernetes, hosted by Jim Scott, and I thought you might like them!



Happy listening!

What is happening now in machine learning is very much like the homebrew computer movement from a half-century ago.


Article by Ted Dunning, published on February 2, 2018, at TDWI.


Can you name a technology that almost all of us have been using for 30 years that is paradoxically now considered to be the Next Big Thing?


That would be machine learning. It has been powering things (such as credit card fraud detection) since the late 1980s, about the same time banks started widespread use of neural networks for signature and amount verification on checks.


Machine learning has been around a long time, even in very widely used applications. That said, there has been a massive technological revolution over the last 15 years.


This upheaval is normally described in recent news and marketing copy as revolutionary because what appeared to be impossibly hard tasks (such as playing Go, recognizing a wide range of images, or translating text in video on the fly) have suddenly yielded to modern methods, tantalizing us with the promise that stunning new products are just around the corner.


The Real Change Isn't What You Think

In some sense, however, the important thing that has changed is a shift, taking machine learning from something that can be used in a few niche applications supported by an army of Ph.D.-qualified mathematicians and programmers into something that can turn a few weekends of effort by an enthusiastic developer into an impressive project. If you happen to have that army of savants, that is all well and good, but the real news is not what you can do with such an army. Instead, the real news is about what you can do without such an army.


Just recently, academic and industrial researchers have started to accompany the publication of their results with working models and the code used to build them. Interestingly, it is commonly possible to start with these models and tweak them just a bit to perform some new task, often taking just a fraction of a percent as much data and compute time to do this retuning. You can take an image-classification program originally trained to recognize images in any of 1,000 categories using tens of millions of images and thousands of hours of high-performance computer time for training and rejigger it to distinguish chickens from blue jays with a few thousand sample images and a few minutes to hours of time on your laptop. Deep learning has, in a few important areas, turned into cheap learning.


Over the last year or two, this change has resulted in an explosion of hobby-level projects where deep learning was used for all kinds of fantastically fun -- but practically pretty much useless -- projects. As fanciful and even as downright silly as these projects have been, they have had a very practical and important effect of building a reservoir of machine learning expertise among developers who have a wild array of different kinds of domain knowledge.


Coming Soon


Those developers who have been building machines to kick bluejays out of a chicken coop, play video games, sort Legos, or track their cat's activities will inevitably be branching out soon to solve problems they see in their everyday life at work. It will be a short walk from building a system for the joy of messing about to building systems that solve real problems.


What is happening now in machine learning is very much like the homebrew computer movement from a half-century ago. The first efforts resulted in systems that only a hacker could love, but before long we had the Apple II and then the Macintosh. What started as a burst of creative energy changed the world.

We stand on the verge of the same level of change.


About the Author

Ted Dunning is chief applications architect at MapR Technologies and a board member for the Apache Software Foundation. He is a PMC member and committer of the Apache Mahout, Apache Zookeeper, and Apache Drill projects and a mentor for several incubator projects. He was chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems and built fraud detection systems for ID Analytics (LifeLock). He has a Ph.D. in computing science from the University of Sheffield and 24 issued patents to date. He has co-authored a number of books on big data topics including several published by O’Reilly related to machine learning. Find him on Twitter as @ted_dunning. 

The MapR Music Catalog application by Tug Grall explains the key MapR-DB features, and how to use them to build a complete Web application. Here are the steps to develop, build and run the application:

  1. Introduction
  2. MapR Music Architecture
  3. Setup your environment
  4. Import the Data Set
  5. Discover MapR-DB Shell and Apache Drill
  6. Work with MapR-DB and Java
  7. Add Indexes
  8. Create a REST API
  9. Deploy to Wildfly
  10. Build the Web Application with Angular
  11. Work with JSON Arrays
  12. Change Data Capture
  13. Add Full Text Search to the Application
  14. Build a Recommendation Engine

The source code of the MapR Music Catalog application is available in this GitHub Repository.

Tüpras is a Turkish oil refiner: the largest industrial company in Turkey and the seventh-largest oil refinery in Europe. The Big Data & Analytics Team from Tüpras was the 2017 Gold Stevie Winner in the 'IT Team of the Year' category.

"In bullet-list form, briefly summarize up to ten (10) accomplishments of the nominated team since the beginning of 2016 (up to 150 words).

  • Establishing a new “Big Data platform based on Hadoop, MapR”
  • Integrating daily 300 billion raw process data depended on 200K sensors of 4 refineries into Big Data
  • Historical export of 10 years data integrated into Big Data
  • Reducing the data frequencies of 30s and 60s with the new platform to 1s frequency
  • Developing “Tüpras Historian Database-THD” for access and analysis of all refinery data with a single web-based application
  • “Management Information System-MIS” platform developed for visual analysis of approximately 50K metric/KPI calculated from process data. The platform supports self service reporting and Decision Making Support tools for “proactive monitoring & analysis”.
  • Developing “Engineering Platform” to run fast What-IF scenarios
  • Developing “Alarm Management” system to centralize and analyze DCS (distributed control system) alarms
  • Implementing of “Predictive Maintenance” scenarios based on Machine Learning
  • Developing IOS based mobile applications of MIS and THD"
In the attached PowerPoint presentation about the project, you might be particularly interested in slide 4, which shows the platform architecture, and in the slides about why they selected MapR.
Here are also a couple of interesting videos about the company and the project (in Turkish):
  • Tupras Big Data and Analytics video (1'58 duration)
  • Tupras Intro Video


During execution, Gateway and Apex generate event log records that provide an audit trail. This can be used to understand the activity of the system and to diagnose problems. Usually, the event log records are stored in a local file system and can later be used for analysis and diagnostics.

Gateway also provides a universal way to pass and store Gateway and Apex event log records to third-party destinations. You can use external tools to store the log events and also to query and report on them. To do this, you must configure a logger appender in the Gateway configuration files.

Configuring Logger Appenders

Gateway and Apex client processes run on the node where the Gateway instance is installed. Therefore, you can configure the logger appenders using the regular log4j properties file (datatorrent/releases/3.9.0/conf/

Following is an example of log4j properties configuration for Socket Appender:
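The configuration snippet itself did not survive conversion; below is a minimal sketch of a log4j SocketAppender setup. The host name `logstashnode1` and port 5400 match the attribute example later in this post, but treat the whole block as an assumption for your environment:

```
# Send root logger events to the "tcp" socket appender as well
log4j.rootLogger=INFO, tcp

# SocketAppender ships serialized log4j events to a remote listener (e.g., Logstash)
log4j.appender.tcp=org.apache.log4j.net.SocketAppender
log4j.appender.tcp.RemoteHost=logstashnode1
log4j.appender.tcp.Port=5400
log4j.appender.tcp.ReconnectionDelay=10000
log4j.appender.tcp.LocationInfo=true
```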


You can use the regular attribute property “apex.attr.LOGGER_APPENDER” to configure the logger appenders for Apex Application Master and Containers. This can be defined in the configuration file dt-site.xml (global, local, and user) or in the static and runtime application properties.

Use the following syntax to enter the logger appender attribute value:
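The syntax snippet itself was lost in conversion; judging from the example that follows, the value appears to start with the appender name and a semicolon, followed by a comma-separated list of log4j appender properties. Treat this as an inferred sketch, not the documented grammar:

```
<appender_name>;,
    log4j.appender.<appender_name>.<property>=<value>,
    log4j.appender.<appender_name>.<property>=<value>,
    ...
```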


Following is an example of logger appender attribute configuration for Socket Appender:

  <property>
    <name>apex.attr.LOGGER_APPENDER</name>
    <value>tcp;,
        log4j.appender.tcp.RemoteHost=logstashnode1,
        log4j.appender.tcp.Port=5400,
        log4j.appender.tcp.ReconnectionDelay=10000,
        log4j.appender.tcp.LocationInfo=true
    </value>
  </property>

Integrating with ElasticSearch and Splunk

You can use different methods to store event log records in an external data source. However, we recommend the following method:

Gateway and Apex can be configured to use the Socket Appender to send logger events to Logstash, and Logstash can deliver the event log records to any output data source. For instance, the following picture shows the integration workflow with ElasticSearch and Splunk.

Following is an example of Logstash configuration:

input {
  # receive logger events from the Socket Appender
  log4j {
    mode => "server"
    port => 5400
    type => "log4j"
  }
}

filter {
  # transform logger events into event log records
  mutate {
    remove_field => [ "@version", "path", "tags", "host", "type", "logger_name" ]
    rename => { "apex.user" => "user" }
    rename => { "apex.application" => "application" }
    rename => { "apex.containerId" => "containerId" }
    rename => { "apex.applicationId" => "applicationId" }
    rename => { "apex.node" => "node" }
    rename => { "apex.service" => "service" }
    rename => { "dt.node" => "node" }
    rename => { "dt.service" => "service" }
    rename => { "priority" => "level" }
    rename => { "timestamp" => "recordTime" }
  }
  date {
    match => [ "recordTime", "UNIX" ]
    target => "recordTime"
  }
}

output {
  # put event log records into the ElasticSearch cluster
  elasticsearch {
    hosts => ["esnode1:9200", "esnode2:9200", "esnode3:9200"]
    index => "apexlogs-%{+YYYY-MM-dd}"
    manage_template => false
  }
  # put event log records into Splunk
  tcp {
    host => "splunknode"
    mode => "client"
    port => 15000
    codec => "json_lines"
  }
}

ElasticSearch users can use the Kibana reporting tool for analysis and diagnostics. Splunk users can use Splunk Web.

Links to 3rd party tools:



MapR Persistent Application Client Containers (PACCs) support containerization of existing and new applications by providing containers with persistent data access from anywhere. PACCs are purposely built for connecting to MapR services. They offer secure authentication and connection at the container level, extensible support for the application layer, and can be customized and published in Docker Hub.


Microsoft SQL Server 2017 for Linux offers the flexibility of running MSSQL in a Linux environment. Like all RDBMSs, it needs a robust storage platform to persist its databases, where data is managed and protected securely.


By containerizing MSSQL with MapR PACCs, customers get all the benefits of MSSQL, MapR, and Docker combined. Here, MSSQL offers robust RDBMS services that persist data into MapR for disaster recovery and data protection, while leveraging Docker technologies for scalability and agility.


The diagram below shows the architecture for our demonstration:


A MapR Cluster

Before you can deploy the container, you need a MapR cluster to persist data to. There are multiple ways to deploy a MapR cluster: you can use a sandbox, or you can use the MapR Installer for on-premises or cloud deployment. The easiest way to deploy MapR on Azure is through the MapR Azure Marketplace. Once you sign up for Azure, purchase a subscription with enough quota (CPU cores, storage, and so on), fill out a form to answer some basic questions about the infrastructure and MapR, and off you go at the click of a button. A fully deployed MapR cluster should be at your fingertips within 20 minutes.


A VM with Docker CE/EE Running

Second, you need to spin up a VM in the same VNet or subnet where your MapR cluster is located. Docker CE/EE is required; for information on how to install it, see the Docker documentation. Docker supports a wide variety of OS platforms. We used CentOS for our demo.

Deploying the MSSQL Container

Once you have the MapR cluster and VM running, you can kick off your container deployment.


Step 1 - Build a Docker Image


Log in to your VM as root and run the following command:


curl -L | bash


In a few minutes, you should see a similar message to the one below, indicating a successful build:


Execute the following command to verify the image (mapr-azure/pacc-mssql:latest) is indeed stored in the local Docker repository:

Step 2 – Create a Volume for MSSQL

Before starting up the container, you need to create a volume on the MapR cluster to persist the database into. Log in to the MapR cluster as user ‘mapr’ and run the following command to create a volume (e.g., vol1) mounted at path /vol1 in the filesystem:


maprcli volume create -path /vol1 -name vol1


You can get the cluster name by executing this command:


maprcli dashboard info -json | grep name


Step 3 – Start Up the Container

Run the following command to spin up the container with the image we just built in Step 1 above:


# docker run --rm --name pacc-mssql -it \
    --cap-add SYS_ADMIN \
    --cap-add SYS_RESOURCE \
    --device /dev/fuse \
    --security-opt apparmor:unconfined \
    --memory 0 \
    --network=bridge \
    -e SA_PASSWORD=m@prr0cks \
    -e MAPR_CLUSTER=mapr522 \
    -e MSSQL_BASE_DIR=/mapr/mapr522/vol1 \
    -e MAPR_MOUNT_PATH=/mapr \
    -e MAPR_TZ=Etc/UTC \
    -e MAPR_CLDB_HOSTS=<cldb-host-ip> \
    -p 1433:1433 \
    mapr-azure/pacc-mssql:latest



Note that you can replace -it with -d in the first line to run the startup process in the background.

You can customize the environment variables above to fit your environment. The variable SA_PASSWORD is the password for the MSSQL admin user. MAPR_CLUSTER is the cluster name. MSSQL_BASE_DIR is the path in MapR-XD where MSSQL will persist its data; it usually takes the form /mapr/<cluster name>/<volume name>. MAPR_CLDB_HOSTS is the IP address of the CLDB hosts in the MapR cluster; in our case, we have a single-node cluster, so only one IP is used. Finally, the default MSSQL port is 1433. You can use the -p option in Docker to expose it on a port of your choice on the VM host. We kept the same port 1433 in the demo.


There are other environment variables you can pass into MapR PACC. For more information, please refer to this link:


In a few minutes, you should see a message like the one below that indicates the MSSQL server is ready:


2017-11-16 22:54:30.49 spid19s     SQL Server is now ready for client connections. This is an informational message; no user action is required.

Step 4 – Create a Table in MSSQL, and Insert Some Data

Now you are ready to insert some sample data into a test MSSQL database. To do so, find the container ID of the running MSSQL container by issuing this command:

Then use the docker exec command to log in to the container:

Then issue the command below to get to an MSSQL prompt, providing the admin password you set when starting the container in Step 3 above:

Issue the following MSSQL statements to populate an inventory table in a test database, then query the table:
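The statements themselves were lost in conversion; below is a minimal sketch of such a session. The database, table, and rows are illustrative stand-ins, not the author's originals:

```sql
-- Create a test database with an inventory table, insert a couple of rows, then query them
CREATE DATABASE TestDB;
GO
USE TestDB;
GO
CREATE TABLE Inventory (id INT, name NVARCHAR(50), quantity INT);
INSERT INTO Inventory VALUES (1, 'banana', 150), (2, 'orange', 154);
GO
SELECT * FROM Inventory WHERE quantity > 100;
GO
```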

Success! This means the database has been persisted into the MapR volume and is now managed and protected by MapR-XD storage. You can verify by issuing the "ls" command in the container: the MSSQL log, secret, and data directories show up in vol1:

Step 5 – Destroy Current Container, Relaunch a New Container, and Access the Existing Table


Now let’s destroy the current container to simulate a server outage by issuing this command:


# docker rm -f c2e69e75b181


Repeat Step 3 above to launch a new container. As soon as the new container is up and running, log in to it and query the same inventory table:

With a huge sense of relief, you see the data previously entered is still there, thanks to MapR!


Step 6 – Scale It Up and Beyond


With the container technology know-how in place, it is extremely easy to spin up multiple containers all at once. Simply repeat steps 2 and 3 to assign each MSSQL container a new volume in MapR, and off you go.


In this blog, we demonstrated how to containerize MSSQL with a MapR PACC and persist its database into MapR for data protection and disaster recovery. MapR PACCs are a great fit for many other applications that require a scalable and robust storage layer to keep their data managed and distributed for DR and scalability. MapR PACCs can also be deployed at scale with an orchestrator, like Kubernetes, Mesos, or Docker Swarm, to achieve true scalability and high availability.

To learn how to create an HDInsight Spark cluster in the Microsoft Azure portal, please refer to part one of my article. After creating the Spark cluster, I have highlighted the URL of my cluster below.

Microsoft Azure


Microsoft Azure


A total of 4 nodes are created -- 2 Head Nodes and 2 Name Nodes -- for a total of 16 cores out of an available capacity of 60 cores; 16 cores are used, and 44 remain for scaling up. You can also click through to the Cluster Dashboard and Ambari View, and you can scale the size of the cluster.

Apache Ambari provides management and monitoring of Hadoop clusters through a web UI and REST services. Ambari is used to monitor clusters, make configuration changes, and provision, monitor, and manage clusters in an easier way. Using Ambari, you can manage central security setup and gain full visibility into cluster health. The Ambari dashboard looks like this:


Microsoft Azure


Using the Ambari dashboard, you can manage and configure services and hosts, set alerts for critical conditions, and more. Many services are also integrated through the Ambari web UI. Below is the Hive Query Editor in Ambari:


Microsoft Azure


You can write, run, and process Hive queries in the Ambari web UI. You can convert the results into charts, save queries, and manage query history.


Microsoft Azure


The snapshot above shows the list of services available in Ambari, and below is the HDInsight SuketuSpark clients list.


Microsoft Azure


You can open a Jupyter notebook by typing its URL in a new browser tab or by clicking directly on the Jupyter logo in the Azure portal. The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more. Jupyter and Zeppelin are the two notebooks integrated with HDInsight.


Microsoft Azure

You can use Jupyter notebook to run Spark SQL queries against the Spark cluster. HDInsight Spark clusters provide two kernels that you can use with the Jupyter notebook.

  • PySpark (for applications written in Python)
  • Spark (for applications written in Scala)
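For example, in a notebook running the PySpark kernel, you can issue a Spark SQL query directly with the `%%sql` cell magic. The `hivesampletable` below is the sample Hive table that ships with HDInsight clusters; substitute your own table if needed:

```
%%sql
SELECT clientid, querytime, market
FROM hivesampletable
LIMIT 10
```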

PySpark is the Python binding for the Spark platform and API and is not much different from the Java/Scala versions. Learning Scala is a better choice than Python, as Scala, being a functional language, makes it easier to parallelize code, which is a great feature when working with Big Data.

Like Java, Scala is object-oriented, and uses a curly-brace syntax reminiscent of the C programming language. Unlike Java, Scala has many features of functional programming languages like Scheme, Standard ML and Haskell, including currying, type inference, immutability, lazy evaluation, and pattern matching.

When you type the notebook URL, or when you click on the Zeppelin icon in the Azure portal, the Zeppelin notebook opens in a new browser tab. Below is a snapshot of that.


Microsoft Azure


Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive, and collaborative documents with SQL, Scala, and more.