Top 5 Items to Configure with Drill on MapR 5.x

Blog Post created by aengelbrecht Employee on May 3, 2017

Perform the following configuration steps for Drill deployments on MapR to get the most out of a new installation. The configuration can be adjusted and modified over time as needed, but these steps are meant to provide a good starting point.

 

Note: This list is not for Drill on YARN.

 

 

1 - Drill Query Profile and Log Locations

1.1 Drill Query Profile Files

The MapR Converged Data Platform (MCDP) provides a reliable and scalable POSIX-compliant distributed file system (MapR-FS) that handles large volumes of files efficiently, making it a great choice for storing the Drill query profiles. Storing the profile files on MapR-FS makes them available to all Drill nodes in the cluster, and therefore viewable in the Web UI from any Drill node, and also protects them from node failures.

 

As an added bonus, the JSON profile files can then be queried with Drill itself, making it easy to see the top users, the most common or most expensive queries, and which types of queries are being executed. This is useful for system administration and auditing purposes.
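
For example, once the profiles are stored on MapR-FS as described below, a query along these lines can list the most active users. This is only a sketch: it assumes a dfs workspace named drill that points at maprfs:///user/mapr/drill with defaultInputFormat set to "json" (the profile files use the .sys.drill extension), and the user field name is taken from the profile JSON and may differ between Drill versions.

-- hypothetical example: count queries per user from the JSON query profiles
SELECT t.`user` AS query_user,
       COUNT(*) AS query_count
FROM dfs.drill.`profiles` t
GROUP BY t.`user`
ORDER BY query_count DESC
LIMIT 10;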

 

We recommend that the query profiles be stored in the /user/mapr/drill directory on MapR-FS. To do so, follow these steps:

  • Edit the drill-override.conf file in the /opt/mapr/drill/<drill-version>/conf directory on all Drill nodes. Note: you can edit the file on one node and use a cluster tool like clush to copy it to all the other nodes.
  • Add the following line to the configuration file:
    • sys.store.provider.zk.blobroot: "maprfs:///user/mapr/drill"
  • Restart all Drill nodes (Drillbits) in the cluster.

 

Below is an example configuration file:

 

drill.exec: {
 cluster-id: "drilldev-drillbits",
 zk.connect: "drilldev:5181",
 sys.store.provider.zk.blobroot: "maprfs:///user/mapr/drill",
 impersonation: {
    enabled: true,
    max_chained_user_hops: 3
 },
  security.user.auth {
        enabled: true,
        packages += "org.apache.drill.exec.rpc.user.security",
        impl: "pam",
        pam_profiles: [ "login" ]
  }
}

1.2 Drill Log Files

Similar to the profile files, the log files can also be stored on MapR-FS for resiliency and convenience, since all Drillbit log files will be together in one location. To be able to do this, MapR Enterprise Edition is required with loopback NFS enabled on all the Drill nodes in the cluster.

 

Using loopback NFS on MapR-FS also makes it much easier to read and work with log files using standard Linux tools on the distributed file system. We recommend adding the Drill node hostname to the filename to make it easier to identify which node generated the log files. Follow these steps:

 

  • Create the drill logs directory in MapR-FS.
    • hadoop fs -mkdir -p /user/mapr/drill/logs
  • Edit the logback.xml file in the /opt/mapr/drill/<drill-version>/conf directory on all Drill nodes. Note: you can edit the file on one node and use a cluster tool like clush to copy it to all the other nodes.
  • Edit the following appender section lines in the logback.xml file. Note: replace <insert cluster name here> with the actual cluster name.
    • <file>/mapr/<insert cluster name here>/user/mapr/drill/logs/drillbit_${HOSTNAME}.log</file>
    • <fileNamePattern>/mapr/<insert cluster name here>/user/mapr/drill/logs/drillbit_${HOSTNAME}.log.%i</fileNamePattern>
  • Restart all Drill nodes (Drillbits) in the cluster.

 

Example logback.xml section:

 

  <appender name="FILE" class="ch.qos.logback.core.rolling.RollingFileAppender">
     <file>/mapr/drilldev/user/mapr/drill/logs/drillbit_${HOSTNAME}.log</file>
     <rollingPolicy class="ch.qos.logback.core.rolling.FixedWindowRollingPolicy">
       <fileNamePattern>/mapr/drilldev/user/mapr/drill/logs/drillbit_${HOSTNAME}.log.%i</fileNamePattern>
       <minIndex>1</minIndex>
       <maxIndex>10</maxIndex>
     </rollingPolicy>

     <triggeringPolicy class="ch.qos.logback.core.rolling.SizeBasedTriggeringPolicy">
       <maxFileSize>100MB</maxFileSize>
     </triggeringPolicy>
     <encoder>
       <pattern>%date{ISO8601} [%thread] %-5level %logger{36} - %msg%n</pattern>
     </encoder>
   </appender>

 

1.3 Manage Archival and Retention of Profile and Log Files

After configuring the profile and log files to be stored centrally on MapR-FS, it is important to consider the retention and archival policies.

 

Log Files

By default, the rolling file policy keeps up to 10 log files of 100MB each for every Drillbit in the cluster. This retention can be increased or decreased by changing the file size and/or the number of log files kept. Simply adjust the maxIndex and maxFileSize values in the logback.xml file described in section 1.2 on all Drillbits and restart them.
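
For example, to keep 20 files of 250MB each per Drillbit (values chosen purely for illustration, roughly 5GB of logs per node), the relevant logback.xml lines would be:

<maxIndex>20</maxIndex>
<maxFileSize>250MB</maxFileSize>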

 

Query Profile Files

The query profiles are best archived for future analysis and auditing purposes. Depending on the activity of the cluster, it may make sense to archive the profiles on a daily basis. Keep in mind that once the files are moved out of the profile location, they are no longer visible in the Drill Web UI profile page, but they remain available to the system administrator on MapR-FS. This keeps the Web UI view of the most recent queries fast, but archived profiles require MapR-FS access.

 

It is recommended that the profiles be stored in a date-based subdirectory structure, as it will allow analysis with Drill (or other tools) of the JSON file and the ability to prune directories based on date.

 

Below is a simple Linux script that uses the MapR NFS loopback mount to move files from the default profile location to a date-based archive subdirectory in the profiles directory on MapR-FS. The script assumes it is run from the Drill profile store root directory (for example /mapr/drilldev/user/mapr/drill).

 

#!/bin/bash
# Run this from the Drill profile store root on the NFS loopback mount,
# e.g. /mapr/<cluster name>/user/mapr/drill
# create a new sub directory with the structure yyyy/mm/dd for the files
year=$(date +"%Y")
month=$(date +"%m")
day=$(date +"%d")
newdir=${year}/${month}/${day}
mkdir -p ./profiles/${newdir}
# move profile files from the base directory to the dated archival sub directory
for file in $(find ./profiles -maxdepth 1 -name '*.drill')
do
   mv "${file}" "./profiles/${newdir}/$(basename "${file}")"
done
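
To archive on a daily basis, the script can be scheduled with cron on a node that has the NFS loopback mount. A hypothetical crontab entry (the script path and the drilldev cluster name are assumptions):

# archive Drill query profiles every night at 00:30
30 0 * * * cd /mapr/drilldev/user/mapr/drill && /root/scripts/archive_drill_profiles.sh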

 

2 - Drill Spill Locations

Drill can spill data to disk when operators exceed the Drill memory available on a node. By default, Drill uses the local host filesystem under /tmp as the spill location, which is often limited in both space and performance.

 

By creating an unreplicated MapR-FS local volume on each node, the Drill spill-to-disk operation gets both more storage space and better performance, since a MapR cluster node typically has more storage devices backing a local volume than the local OS /tmp space. To utilize MapR-FS for Drill spill data, follow these steps.

 

2.1 Create Local MapR-FS Volumes for Drill Spill Data

Either create a MapR-FS local volume with replication 1 on each node manually, or use the script below, which checks whether a Drill spill volume already exists on each node and creates one if it does not. Please modify the script as needed for the cluster environment.

#!/bin/bash
# For every node in the cluster, create a local (replication 1) Drill spill
# volume if one does not already exist.
for node in $(maprcli node list -columns hn | awk '{print $1}' | grep -v hostname); do

  # count matching volume entries in the terse listing
  volumetest="$(maprcli volume list -filter [p=="/var/mapr/local/${node}/drillspill"] -output terse | awk '/a/{++cnt} END {print cnt}')"

  if [ "${volumetest:-0}" -gt 0 ]; then
    echo " volume exists: /var/mapr/local/${node}/drillspill"
  else
    echo " volume doesn't exist: /var/mapr/local/${node}/drillspill"
    echo " creating volume: /var/mapr/local/${node}/drillspill"

    maprcli volume create \
      -name mapr.${node}.local.drillspill \
      -path /var/mapr/local/${node}/drillspill \
      -replication 1 \
      -localvolumehost ${node}
  fi
done

 

2.2 Configure Drill on the Nodes to Utilize the Local Volumes for Spill

Use the same approach as in section 1 to edit and distribute the Drill configuration files across the cluster.

Edit drill-env.sh in the same conf directory and add the following lines at the bottom.

 

# use this node's hostname to build the local spill volume path; adjust if the
# MapR node name differs from the output of hostname -f
node=$(hostname -f)
spilloc="/var/mapr/local/${node}/drillspill"
export drill_spilloc=${spilloc}


Edit the drill-override.conf file and add these lines inside the drill.exec block (a merged example is shown below).

 

sort.external.spill.directories: [ ${drill_spilloc} ],
sort.external.spill.fs: "maprfs:///",
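
Building on the example drill-override.conf from section 1.1, the drill.exec block would then look like this (the security settings from that example are omitted here for brevity):

drill.exec: {
 cluster-id: "drilldev-drillbits",
 zk.connect: "drilldev:5181",
 sys.store.provider.zk.blobroot: "maprfs:///user/mapr/drill",
 sort.external.spill.directories: [ ${drill_spilloc} ],
 sort.external.spill.fs: "maprfs:///",
 impersonation: {
    enabled: true,
    max_chained_user_hops: 3
 }
}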

 

 

Restart all the Drill nodes (Drillbits) in the cluster.

3 - Drill Resource Configuration on a MapR Cluster

The MapR Converged Data Platform is designed to support multiple workloads, of which Drill is one, on the same cluster.

 

3.1 MapR Topology and Drill Nodes in the Cluster

MapR provides the ability to configure a cluster into topologies for nodes and volumes. This means that Drill can be deployed either on all the nodes in a cluster or only on certain node topologies. MapR similarly supports volume topologies for data. In most cases, it is recommended to deploy Drill on all the nodes that host the data volumes Drill will need to access.

 

For more information on MapR Topology configuration, see:

http://maprdocs.mapr.com/home/AdministratorGuide/Setting-Up-Topology.html

 

3.2 Drillbit Resource Configuration

In many cases, Drill will be deployed alongside other applications on the same MapR nodes. In these cases, it is important to clearly understand how much of each node's resources will be available to Drill. Keep in mind that the nodes also require resources for MapR core components, other ecosystem components, any additional applications running on the nodes, and the OS.

 

Once there is a clear picture of which nodes in the cluster will run Drill and how much of the resources on these nodes can be allocated to Drill, the configuration can be done. In general, it is best to deploy Drillbits with a homogeneous resource configuration on all nodes.

 

3.2.1 Drill CPU Resource Configuration

Drill CPU consumption is mostly controlled by two configuration settings.

 

planner.width.max_per_node: This setting controls the maximum number of parallel threads (minor fragments) per major fragment of a query on a node. Keep in mind that Drill can execute multiple major fragments of a query at the same time. Consider setting this parameter to 75% of the available cores for Drill clusters with low query concurrency, or to 25% for Drill clusters with higher concurrency. Use this as a starting point and adjust as needed.

 

Example: Drill is deployed on nodes with 32 cores, but only 50% of the CPU resources are allocated to Drill; the rest needs to remain available for other applications. The Drill cluster will be used for data exploration with low user/query concurrency.

 

Total cores available to Drill = 32 x 50% = 16 cores

planner.width.max_per_node = 16 x 75% = 12
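
The planner options are Drill system options and can be set from any SQL client connected to the cluster, such as sqlline or the Drill Web UI query page. A sketch using the value calculated above:

ALTER SYSTEM SET `planner.width.max_per_node` = 12;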

 

planner.width.max_per_query: This setting limits the total number of parallel threads a single query can use across the entire Drill cluster. It can be used in very large Drill clusters to limit the overall resource usage of a single query. Consider changing the default in very large clusters with higher concurrency to prevent a single query from dominating resource consumption; keep in mind this may increase the run time of large queries. Keep the default value and only adjust if needed.

 

3.2.2 Drill Memory Resource Configuration

The following three configuration options are the most important for Drill memory configuration.

 

The total Drillbit memory allocation per node: The total memory allocated to a Drillbit on a node is the sum of the Direct Memory and the Heap Memory. Again, it is important to clearly define how much of the node memory is available to Drill. First, make sure that warden.conf is configured to allow enough free memory for Drill by managing the memory allocation of other Ecosystem components, MapR core components, and OS. For more information on warden.conf, see: http://maprdocs.mapr.com/home/AdministratorGuide/MemoryAllocation-OS-MFS-Hadoop.html

 

Once the total memory available to Drillbits per node is known, the configuration can be done.

 

DRILL_HEAP: Heap memory is used for Java objects (files, columns, data types) and by the query planner. It is recommended to set this parameter to 20% of the memory available to Drill initially and adjust as needed. This parameter is set in the $DRILL_CONF_DIR/drill-env.sh file by uncommenting the corresponding line and setting the appropriate value.

 

DRILL_MAX_DIRECT_MEMORY: Direct memory is used for data operations in Drill. It is recommended to set this parameter to 80% of the memory available to Drill initially and adjust as needed. This parameter is set in the $DRILL_CONF_DIR/drill-env.sh file by uncommenting the corresponding line and setting the appropriate value.

 

planner.memory.max_query_memory_per_node: This is a system and session option, so it can also be altered for individual sessions. It limits the maximum memory per node for sort operators per query. As a system option, it is recommended to set it to the higher of the default (2GB) and 20% of DRILL_MAX_DIRECT_MEMORY. For highly concurrent query workloads, the value may need to be lowered; for low concurrency and very large data sets, it may need to be increased if Out-Of-Memory (OOM) conditions are encountered. If OOM conditions are encountered frequently, see other best practices to limit these issues.

 

Example: Drill is deployed on nodes with 256GB of memory, but only 50% of the memory is allocated to Drill; the rest needs to remain available for other applications. The Drill cluster will be used for data exploration with low user/query concurrency.

 

Total memory available for Drill = 256GB x 50% = 128GB

DRILL_HEAP = 128GB x 20% ~ 26GB

DRILL_MAX_DIRECT_MEMORY = 128GB x 80% ~ 102GB

planner.memory.max_query_memory_per_node = 102GB x 20% ~ 20GB
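
For reference, below is a sketch of the matching drill-env.sh entries for this example (the exact commented-out template lines vary slightly between Drill versions, and the sizes shown are illustrative):

export DRILL_HEAP="26G"
export DRILL_MAX_DIRECT_MEMORY="102G"

The planner.memory.max_query_memory_per_node option is set in bytes as a system (or session) option, for example ALTER SYSTEM SET `planner.memory.max_query_memory_per_node` = 21474836480; for roughly 20GB.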

 

4 - Drill Security Configuration

It is recommended to configure security for Drill immediately when installing on the MapR Platform, since Drill provides SQL access to the data. User authentication and impersonation are two key elements that need to be configured.

 

For more information on securing Drill on MapR, see:

http://maprdocs.mapr.com/home/Drill/securing_drill.html

 

4.1 User Authentication

First, configure User Authentication for Drill. This consists of configuring the Drill Node or Server:

http://maprdocs.mapr.com/home/Drill/configure_server_auth.html

 

And then the clients that connect to Drill:

http://maprdocs.mapr.com/home/Drill/drill_connectors.html

 

4.2 User Impersonation

User Impersonation is needed for various Drill storage plugins on MapR to be configured securely and utilized properly. For more information on how to configure User Impersonation and Chaining, see:

https://drill.apache.org/docs/configuring-user-impersonation/#configuring-impersonation-and-chaining

 

To configure Drill impersonation on the MapR cluster, see:

http://maprdocs.mapr.com/home/Drill/configure_user_impersonation.html

 

Drill also supports inbound impersonation for applications that manage their own sessions and initial connections but submit queries on behalf of the end users connected to those applications. For more information, see:

https://drill.apache.org/docs/configuring-inbound-impersonation/

 

 

5 - MapR-FS Chunk Size

For optimal performance, it is recommended to match the file size of the data used by Drill to the MapR-FS chunk size.

 

5.1 Check MapR-FS Chunk Size

The default chunk size for MapR-FS is 256MB. The chunk size on MapR-FS can be set by directory for flexibility. To check the chunk size of a directory on a MapR-FS Volume, use the following command:

hadoop mfs -ls <path to directory>

 

Example:

[root@drilldev data]# hadoop mfs -ls /data
Found 3 items
drwxrwxr-x  Z U U   - mapr mapr          0 2017-02-28 17:56  536870912 /data/chunk
              p 2049.620.1182378  drilldev:5660
drwxrwxr-x  Z U U   - mapr mapr          7 2017-02-24 17:14  268435456 /data/flat
              p 2049.170.262788  drilldev:5660
drwxrwxr-x  Z U U   - mapr mapr          4 2017-03-17 15:02  268435456 /data/nested
              p 2049.287.263024  drilldev:5660

 

Note the chunk size for /data/chunk is 512MB, whereas the others are 256MB.

 

5.2 Set MapR-FS Chunk Size

To change the chunk size of a directory, use the hadoop mfs -setchunksize command. Note that all existing subdirectories (where the chunk size has not been set), new subdirectories, and new files will then use the new chunk size. However, existing files will continue to use the original chunk size.

 

Example:

hadoop mfs -setchunksize 536870912 /data/flat

 

For more information on MapR-FS chunk size, see:

http://maprdocs.mapr.com/home/AdministratorGuide/Chunk-Size.html

5.3 Drill Block Size

When parquet data is created with Drill, the block size for the parquet files can be set. For more information, see:

https://drill.apache.org/docs/parquet-format/
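
The Parquet block size written by Drill is controlled by the store.parquet.block-size option (the default is 512MB). For example, to match the MapR-FS default chunk size of 256MB, the option could be lowered for a session; the value below is in bytes and purely illustrative:

ALTER SESSION SET `store.parquet.block-size` = 268435456;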

 

To find the optimal Drill block size and MapR-FS chunk size for a data set, consider the total number of files that will be created for the data; in general, it is recommended that the Drill Parquet block size and the MapR-FS chunk size match. See this part of the Drill Best Practices for more information: https://community.mapr.com/thread/18747-in-the-case-of-parquet-does-drill-prefer-a-larger-number-of-small-files-or-a-smaller-number-of-large-files-how-do-i-get-the-best-mileage-of-drill-parallelism-by-controlling-the-layout
