How To Query Drill in Python from Jupyter Notebook

Document created by Rachel Silver Employee on Nov 17, 2016Last modified by aalvarez on Jan 16, 2017
Version 9Show Document
  • View in full screen mode

Introduction

From Project Jupyter:
"The Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, machine learning, and much more."

 

We occasionally hear customers expressing interest in getting Jupyter up and running on the MapR Platform. Here, we're going to explain how to get Jupyter working on the MapR Converged Data Platform with Apache Drill and query an example JSON file.

 

The versions used for this demo are:

 

Install/Configure ODBC (Create DSN)
In the steps below, you'll create a Data Source Name (DSN) to allow Jupyter to connect via ODBC without storing connection information in the notebook itself.

 

Collect connection information
In order to create the ODBC connection to the database, you will have to gather the following pieces of information from your Drill cluster:
Drill Cluster ID
ZooKeeper Quorum

These items can be found in the drill-override.conf file. In a default MapR Drill 1.8 installation, they are located in /opt/mapr/drill/drill-1.8.0/conf/drill-override.conf. The information will look as follows:  

 

drill.exec: {
cluster-id: "cluster_name_com-drillbits",
zk.connect: "host1:5181,host2:5181,host3:5181"}

 

Install required libraries

We're going to use the Python pyodbc library. To install this module and its dependencies, please run the following:

 

yum install unixODBC-devel
pip install pyodbc

 

Install the Drill ODBC driver
Depending on where you've decided to run your Jupyter Notebook, you'll need to select the right directions from: Installing the ODBC Driver - Apache Drill. 

We're installing directly on our (CentOS) MapR cluster, so we'll follow the steps for Linux (here).

Next, select the correct driver version for your system and download/transfer it to the host where you want to install Jupyter. For this example, we'll use the MapR Drill ODBC Driver (64-bit).

 

Examples:

wget http://package.mapr.com/tools/MapR-ODBC/MapR_Drill/MapRDrill_odbc_v1.2.1.1000/MapRDrillODBC-1.2.1.x86_64.rpm

Then, as described here, log in as a 'root' user, navigate to the folder where you stored the RPM, and run the following command, substituting in the correct name for the RPM file:

yum localinstall --nogpgcheck <RPMFileName>

Configure Drill ODBC 

First, we need to copy the .ini files from the Drill ODBC Setup directory to our home directory and rename them as invisible files:

 

cp /opt/mapr/drillodbc/Setup/mapr.drillodbc.ini ~/.mapr.drillodbc.ini
cp /opt/mapr/drillodbc/Setup/odbc.ini ~/.odbc.ini
cp /opt/mapr/drillodbc/Setup/odbcinst.ini ~/.odbcinst.ini

 

Then, set your environment variables (preferably in .bash_profile):

 

export ODBCINI=~/.odbc.ini

export MAPRDRILLINI=~/.mapr.drillodbc.ini

export LD_LIBRARY_PATH=/usr/local/lib:/opt/mapr/drillodbc/lib/64:/usr/lib64

 

And, alter ~/.odbc.ini so it reads as follows, subbing in the values in red:

 

[ODBC]
Trace=no

 

[ODBC Data Sources]
[drill64]

 

# This key is not necessary and is only to give a description of the data source.
Description=MapR Drill ODBC Driver (64-bit) DSN

 

# Driver: The location where the ODBC driver is installed to.
Driver=/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64.so

 

# Values for ConnectionType, AdvancedProperties, Catalog, Schema should be set here.
# If ConnectionType is Direct, include Host and Port. If ConnectionType is ZooKeeper, include ZKQuorum and ZKClusterID
# They can also be specified on the connection string.
# AuthenticationType: No authentication; Basic Authentication
ConnectionType=ZooKeeper
ZKQuorum=<take from zk.connect string above>
ZKClusterID=<take from cluster-id above>
AuthenticationType=No Authentication
UID=[USERNAME]
PWD=[PASSWORD]
AdvancedProperties=CastAnyToVarchar=true;HandshakeTimeout=5;QueryTimeout=180;TimestampTZDisplayTimezone=utc;ExcludedSchemas=sys,INFORMATION_SCHEMA;NumberOfPrefetchBuffers=5;
Catalog=DRILL
Schema=

 

To password protect the DSN, select Basic Authentication for the Authentication Type and configure UID and PWD properties (more here).

 

Install & Run Jupyter

 

Install Jupyter on cluster (or local machine)
Jupyter can be installed using conda & Anaconda or just using Python pip:

 

pip install jupyter

...Collecting jupyter
Downloading jupyter-1.0.0-py2.py3-none-any.whl...

Set server parameters (hostname, port, etc.)

First, generate a config file:

 

jupyter notebook --generate-config

 

Writing default config to: /path/to/.jupyter/jupyter_notebook_config.py

 

Then, navigate to the config file given above (jupyter_notebook_config.py) and set the variables as needed. For example, to set the hostname so you can access this server on a public interface, you would set the following*:

 

c.NotebookApp.ip = '*'

 

* We recommend securing this server using, at least, SSL and a hashed password. Directions can be found here.

 

Start Jupyter Notebook

Then, to start the notebook, simply run the following and note the link it provides you with:

 

jupyter notebook --no-browser

 

[I 15:36:06.049 NotebookApp] Writing notebook server cookie secret to /root/.local/share/jupyter/runtime/notebook_cookie_secret
[I 16:36:00.125 NotebookApp] The port 8888 is already in use, trying another port.
[I 16:36:00.129 NotebookApp] Serving notebooks from local directory: /root
[I 16:36:00.129 NotebookApp] 0 active kernels
[I 16:36:00.129 NotebookApp] The Jupyter Notebook is running at:  http://[all ip addresses on your system]:8889/
[I 16:36:00.129 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).

Navigate to the Jupyter Notebook using your hostname, like http://<hostname>:8889

 

Example
At this point, you have everything you need to query data in Drill from a Jupyter Notebook. So, let's test it out using the steps from Drill In 10 Minutes, adapted for our notebook. This notebook is also available in my GitHub (here).

 

Create the connection and cursor in Jupyter/Python

Create a Python 2 notebook and enter the following into a cell to create the connection to Drill over ODBC:

 

import pyodbc
cnxn = pyodbc.connect("DSN=drill64", autocommit=True)
cursor = cnxn.cursor()

 

Then, in the same cell, let's try out a simple query on the built-in sample Drill data to make sure it's all working:

 

cursor.execute("SELECT * FROM cp.`employee.json` LIMIT 10")
response = cursor.fetchall()
if response:
print(response)

 

You should see something like the following:

 

 

Common Problems
Here are the solutions to common problems while running this query:

 

"Can't open lib '/opt/mapr/drillodbc/lib/64/libmaprdrillodbc64."

 

Please make sure that you set $LD_LIBRARY_PATH as shown above:

export LD_LIBRARY_PATH=/usr/local/lib:/opt/mapr/drillodbc/lib/64:/usr/lib64

 

"Unable to locate SQLGetPrivateProfileString function. (11560) (SQLDriverConnect)"

This issue normally happens if we made mistakes in the "Step 4: Configure the MapR Drill ODBC Driver" here:

Please follow these steps to fix this problem: MapR Drill ODBC fails to connect with error "Unable to locate SQLGetPrivateProfileString function" | Open Knowledge Base.

1 person found this helpful

Attachments

    Outcomes