A stack trace is one of the most widely used aids for debugging an application. In the Java world, jstack is the most common tool for getting a stack trace from a single JVM. With Apache Apex, or any other distributed system, getting a stack trace is harder, because application components run in multiple JVMs across multiple machines. In addition, not every user has access to those machines, and even users with access may be unable to take a stack trace because the application is running under a different user account.
To make life easier for Apache Apex users while debugging, we have added a feature that captures a stack trace similar to the one produced by jstack. This feature is available in Apex-Core 3.4 and DataTorrent RTS 3.5.
Apache Apex applications are YARN applications. To debug Apache Apex applications, we need to first understand the programming model and deployment of applications across the Hadoop cluster.
The programming model of Apache Apex is flow based, with operators containing the business logic and the streams connecting operators. Applications, which are mostly represented as a Directed Acyclic Graph, are deployed across the cluster during launch. For more detailed information about the translation of the application logic into an execution plan, check out the blog:
An operator is deployed in a container, which can run on any node of the cluster. The term “container” is overloaded; in this blog, I am referring to a YARN container, which for the purposes of this discussion is a single JVM instance.
A single container can contain one or more operators. Locality can be set on the stream between two operators, which controls how the operators are placed inside containers. Here we consider two options for stream locality: THREAD_LOCAL and CONTAINER_LOCAL. More information about the streams is at:
When two operators are connected by a stream with the locality type CONTAINER_LOCAL, they run in the same JVM on two different threads, and data elements (tuples) are transferred through an in-memory queue. When the locality type is THREAD_LOCAL, a single thread is allocated for both operators, so the operator callbacks are serialized, that is, they run one after another on that thread.
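The difference between the two localities can be illustrated with plain Java threads. The sketch below is not Apex code and does not use the Apex API; the class name `LocalityDemo`, the thread names, and the `END` marker are made up for illustration. It only mimics the threading model described above: two threads and a queue for CONTAINER_LOCAL, one shared thread for THREAD_LOCAL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Plain-Java illustration of the two threading models; this is not Apex code.
final class LocalityDemo {
    private static final int END = -1; // hypothetical end-of-stream marker

    // CONTAINER_LOCAL style: one thread per operator, tuples cross an in-memory queue.
    static List<String> containerLocal(int tuples) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        List<String> log = new ArrayList<>();
        Thread upstream = new Thread(() -> {
            for (int i = 0; i < tuples; i++) {
                queue.add(i);
            }
            queue.add(END);
        }, "1/input:Generator");
        Thread downstream = new Thread(() -> {
            try {
                for (int t = queue.take(); t != END; t = queue.take()) {
                    log.add(Thread.currentThread().getName() + " processed " + t);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "2/output:Printer");
        runToCompletion(upstream, downstream);
        return log;
    }

    // THREAD_LOCAL style: one thread runs both operators, so the downstream
    // callback executes inline, right after the upstream one emits a tuple.
    static List<String> threadLocal(int tuples) {
        List<String> log = new ArrayList<>();
        Thread shared = new Thread(() -> {
            for (int i = 0; i < tuples; i++) {
                log.add(Thread.currentThread().getName() + " processed " + i);
            }
        }, "1/input:Generator");
        runToCompletion(shared);
        return log;
    }

    // Starts all threads, then waits for each of them to finish.
    private static void runToCompletion(Thread... threads) {
        for (Thread t : threads) {
            t.start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
    }

    public static void main(String[] args) {
        containerLocal(3).forEach(System.out::println);
        threadLocal(3).forEach(System.out::println);
    }
}
```

In the first case the log lines carry the name of the downstream thread; in the second they carry the name of the single shared thread, which is why a stack trace shows only one thread for a THREAD_LOCAL group.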
Capturing the Stack Trace
There are two ways to get the stack trace of a running Apache Apex application. The first is to use the Command Line Interface (CLI) of Apache Apex, and the second is to use the GUI in DataTorrent RTS.
Apache Apex CLI
With the Apache Apex CLI, the user identifies the container running the operator and then requests the stack trace from it. The following steps achieve this.
a. Connect to your application
b. Get the information about the specific operator
list-operators [pattern] (the pattern is optional; when given, only the operators that match it are shown)
c. Find the container id from the output of the previous command
Example line: "container": "container_1467701377054_26031_02_000002"
d. Get the stack trace from the containers
e. A container has many threads, but the name of an operator thread contains the operator id, the operator name and the class name. Here is one example of such a name: “4/console:NullOutput”
f. A consecutive group of THREAD_LOCAL operators is represented by a single thread, and that thread is named after the first operator in the group.
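When a dump contains many threads, the naming convention from step e can be used to pick out the operator threads. The following is a minimal sketch that assumes the layout `<operator id>/<operator name>:<class name>`, inferred only from the example “4/console:NullOutput”; the class `OperatorThreadName` is hypothetical and not part of Apex.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name layout is an assumption based on "4/console:NullOutput".
final class OperatorThreadName {
    // <operator id>/<operator name>:<class name>
    private static final Pattern LAYOUT = Pattern.compile("(\\d{1,9})/([^:]+):(.+)");

    final int operatorId;
    final String operatorName;
    final String className;

    private OperatorThreadName(int operatorId, String operatorName, String className) {
        this.operatorId = operatorId;
        this.operatorName = operatorName;
        this.className = className;
    }

    // Returns the parsed parts, or empty if the name does not follow the assumed layout.
    static Optional<OperatorThreadName> parse(String threadName) {
        Matcher m = LAYOUT.matcher(threadName);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(
            new OperatorThreadName(Integer.parseInt(m.group(1)), m.group(2), m.group(3)));
    }

    public static void main(String[] args) {
        for (String name : new String[] {"4/console:NullOutput", "main"}) {
            String described = parse(name)
                .map(n -> "operator " + n.operatorId + " (" + n.operatorName + ", " + n.className + ")")
                .orElse("not an operator thread");
            System.out.println(name + " -> " + described);
        }
    }
}
```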
For more information about Apache Apex CLI
DataTorrent RTS contains the GUI Console, which exposes many features of Apache Apex in an easily operable way. Users can go to the physical tab and then select the required container from which to get a stack trace.
Clicking the “stacktrace” button opens a new page with the stack trace.
Things to note:
1. The first two stack traces represent two operator threads, which together run three operators, because the stream connecting two of the operators has THREAD_LOCAL locality.
2. Refreshing the page gets a new stack trace.
3. Stack traces are sorted by thread name. Because operator thread names start with an integer, they appear at the top of the page.
Other debugging techniques using DataTorrent RTS:
- Stack traces for the containers are collected with the help of an Apex helper thread running in the container. Even if the operators are blocked, you will be able to get the stack trace, making it easy to see why your operators are blocked.
- Users don’t need access to individual nodes on which the containers are running.
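The idea of collecting stack traces from inside the JVM, rather than attaching to it from outside, can be sketched with the standard `Thread.getAllStackTraces()` call. This is only an illustration of the general technique, not the actual Apex helper thread; the class `StackTraceDump` and its methods are hypothetical. Sorting by thread name puts names that start with a digit ahead of names such as `main`.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.CountDownLatch;

// Illustration only; this is not the Apex implementation.
final class StackTraceDump {
    // Collects the stack of every live thread in this JVM, sorted by thread name.
    // Threads that share a name overwrite each other in this simplified version.
    static Map<String, StackTraceElement[]> collect() {
        Map<String, StackTraceElement[]> byName = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            byName.put(e.getKey().getName(), e.getValue());
        }
        return byName;
    }

    // Renders the dump in a jstack-like text form.
    static String format(Map<String, StackTraceElement[]> dump) {
        StringBuilder sb = new StringBuilder();
        dump.forEach((name, frames) -> {
            sb.append('"').append(name).append('"').append('\n');
            for (StackTraceElement frame : frames) {
                sb.append("\tat ").append(frame).append('\n');
            }
            sb.append('\n');
        });
        return sb.toString();
    }

    // Starts a daemon thread that blocks until `release` is counted down;
    // it stands in for a blocked operator thread.
    static Thread startBlocked(String name, CountDownLatch release) {
        CountDownLatch started = new CountDownLatch(1);
        Thread t = new Thread(() -> {
            started.countDown();
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, name);
        t.setDaemon(true);
        t.start();
        try {
            started.await(); // make sure the thread is running before returning
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return t;
    }

    public static void main(String[] args) {
        CountDownLatch release = new CountDownLatch(1);
        startBlocked("4/console:NullOutput", release);
        // The blocked thread cannot report on itself, but another thread can.
        System.out.print(format(collect()));
        release.countDown();
    }
}
```

Because the dump is taken by a different thread in the same JVM, a blocked operator thread still shows up, along with the frame it is blocked in.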
Let’s consider a few examples where this feature is useful:
- The callback methods `setup()` and `activate()` in an operator taking a long time
In an Apex operator’s lifecycle, `setup()` and `activate()` are called to perform one-time activities. These calls may take longer than usual, and a stack trace is a good way to see what is happening inside them. One example in this category is the FileInputOperator from the Apex-Malhar library, where the scan operation can take a long time if the target directory contains many files.
- Extra debugging statements
We often forget to remove extra debugging statements, which can hurt performance.
Here is an app package, https://github.com/sandeshh/myapexapp, containing an app called “ApplicationExtraLogging”. A stack trace of the input operator will show the print statements that should not be there.
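The slow `setup()` case from the examples above can be reproduced in miniature. The sketch below is not an Apex operator: `DirectoryScanner`, `SlowSetupDemo` and the thread name are hypothetical, and only the method name `setup` mirrors the Apex lifecycle. It shows that while a thread is stuck inside `setup()`, a stack trace taken from another thread contains a `setup` frame.

```java
import java.util.concurrent.CountDownLatch;

// Illustration only; DirectoryScanner is a made-up stand-in for an operator.
final class SlowSetupDemo {
    static final class DirectoryScanner {
        private final CountDownLatch entered = new CountDownLatch(1);
        private final CountDownLatch done = new CountDownLatch(1);

        void setup() {
            entered.countDown();
            awaitQuietly(done); // stands in for a long directory scan
        }

        void finishScan() {
            done.countDown();
        }
    }

    // Starts setup() on a named daemon thread and returns once the call is in progress.
    static Thread startSetup(DirectoryScanner op, String threadName) {
        Thread t = new Thread(op::setup, threadName);
        t.setDaemon(true);
        t.start();
        awaitQuietly(op.entered);
        return t;
    }

    // True if the thread's current stack contains a frame for the given method name.
    static boolean isInside(Thread thread, String methodName) {
        for (StackTraceElement frame : thread.getStackTrace()) {
            if (frame.getMethodName().equals(methodName)) {
                return true;
            }
        }
        return false;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        DirectoryScanner op = new DirectoryScanner();
        Thread t = startSetup(op, "1/reader:DirectoryScanner");
        System.out.println("in setup():    " + isInside(t, "setup"));
        System.out.println("in activate(): " + isInside(t, "activate"));
        op.finishScan();
    }
}
```

The same reading applies to leftover debugging statements: if repeated stack traces of an operator thread keep landing in print or logging frames, those calls are likely taking a noticeable share of its time.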
Most of the features in Apache Apex are customer driven, and this one is no exception. Please share your feedback on the features that you want to see in Apex; it will help us prioritize them.
By Sandesh Hegde, Engineer at DataTorrent and Committer for Apache Apex