Drill Parquet INT96 Type

Document created by prakhar Employee on Sep 29, 2017Last modified by anjaliwilliams on Oct 11, 2017
Version 6Show Document
  • View in full screen mode

Environment:

mapr-5.2.0

Drill 1.10

 

Description:

Sometimes you see an error, like below, when you try to run a query over a Parquet file.

 

ERROR:

0: jdbc:drill:>  select * from  dfs.tmp.`part-00000-6a7ab87f-1310-4ddc-8390-3bf26697e105.snappy.parquet`;

Error: SYSTEM ERROR: ArrayIndexOutOfBoundsException

Fragment 0:0

[Error Id: 0072aa97-852a-4ee5-bbfd-bec167f24e11 on vm75-133.support.mapr.com:31010] (state=,code=0)

0: jdbc:drill:>

 

The Parquet file is being written via a Spark job in the cluster. The issue occurs when trying to fetch the columns with INT96 datatype.

 

To test whether you have an INT96 type in a Parquet file, you can check it, as shown below:

 

bash-4.2$ java -jar parquet-mr/parquet-tools/target/parquet-tools-1.6.1-SNAPSHOT.jar  schema /tmp/part-00000-6a7ab87f-1310-4ddc-8390-3bf26697e105.snappy.parquet |grep int96
  required int96 cretd_timstm;
  optional int96 last_modfd_timstm;

 

RESOLUTION:

Drill 1.10 and later can implicitly interpret the Parquet INT96 type as TIMESTAMP (with standard 8 byte/millisecond precision) when the store.parquet.reader.int96_as_timestamp option is enabled. In earlier versions of Drill (1.2 through 1.9) or when the store.parquet.reader.int96_as_timestamp option is disabled, you must use the CONVERT_FROM function for Drill to correctly interpret INT96 values as TIMESTAMP values.

 

0: jdbc:drill:> select convert_from(cretd_timstm,'TIMESTAMP_IMPALA') ,convert_from(last_modfd_timstm ,'TIMESTAMP_IMPALA')  from dfs.tmp.`part-00000-6a7ab87f-1310-4ddc-8390-3bf26697e105.snappy.parquet` limit 3;

+-----------------------------+--------------------------+

|           EXPR$0            |          EXPR$1          |

+-----------------------------+--------------------------+

| 5882405-01-30 21:26:27.686  | 2017-01-13 07:32:21.741  |

| 1189426-06-15 04:03:19.299  | 2017-04-18 08:55:59.095  |

| 52877-01-18 17:02:53.783    | 2016-07-22 11:00:00.0    |

+-----------------------------+--------------------------+

3 rows selected (0.168 seconds)

 

 

Or you can enable the below property to TRUE, if you are using Drill 1.10 or later.

 

Alter Session|System set store.parquet.reader.int96_as_timestamp = true

 

 

0: jdbc:drill:zk=node1:5181> alter session set store.parquet.reader.int96_as_timestamp = true;
+-------+---------------------------------------------------+
|  ok   |                      summary                      |
+-------+---------------------------------------------------+
| true  | store.parquet.reader.int96_as_timestamp updated.  |
+-------+---------------------------------------------------+
1 row selected (0.198 seconds)
0: jdbc:drill:zk=node1:5181>

 

Note: INT96 is really required for nanosecond precision!

From the spate of bugs related to INT96 in various projects like Spark, Hive, Parquet, Drill, etc., INT96 may even be deprecated in the near future.

Ref: https://issues.apache.org/jira/browse/PARQUET-323

 

Hope this helps. 

2 people found this helpful

Attachments

    Outcomes