
Hive query fails with error while reading data from complex data types with TIMESTAMP column, stored in Parquet

Question asked by AmarnathVibhute on Oct 20, 2016
Latest reply on Nov 10, 2016 by takeshi

Hello All!

 

I am getting an error while reading nested records (a complex data type, read with explode) that contain a TIMESTAMP column, from a Hive managed table stored as Parquet.

 

DDL :

create table all_datatype_prq
(id int,
col_string string,
col_timestamp timestamp,
col_address array<struct<city:string,pin:bigint,dob:date,login:timestamp>> )
row format delimited
fields terminated by ","
collection items terminated by "|"
map keys terminated by "~"
stored as parquet;

 

Hive query & error:

hive (test_db)> select
> id,
> col_string,
> address.city,
> address.login
> from all_datatype_prq
> LATERAL VIEW explode(col_address) expl as address;
OK
Failed with exception java.io.IOException:parquet.io.ParquetDecodingException: Can not read value at 0 in block -1 in file maprfs:///PATH/all_datatype_prq/000000_0

Supporting Info:
Hive version: 1.2
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat

Parquet JAR: parquet-hadoop-bundle-1.6.0.jar

 

Observations :

1. If the same Hive table is stored as text or ORC, the Address array reads without any issue, so the problem is specific to data stored as Parquet.

2. If we remove the 'login' attribute (TIMESTAMP) from the Address array, the data reads without any issue.

3. If we change the data type of the 'login' column from TIMESTAMP to STRING, the Address array reads without any issue, even when stored as Parquet.

4. The issue occurs with MAP, ARRAY and STRUCT alike whenever they contain a TIMESTAMP column.

5. If we select only columns outside the Address array, the data reads without any issue.
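
For reference, the workaround implied by observation 3 might look like the sketch below. The table name all_datatype_prq_str is hypothetical, and the sketch assumes the 'login' values are written as strings in Hive's default yyyy-MM-dd HH:mm:ss format so that the cast back to TIMESTAMP succeeds. It only avoids the nested TIMESTAMP; it does not explain the underlying error.

```sql
-- Hypothetical workaround table: same layout as all_datatype_prq,
-- but 'login' inside the struct is STRING instead of TIMESTAMP.
create table all_datatype_prq_str
(id int,
col_string string,
col_timestamp timestamp,
col_address array<struct<city:string,pin:bigint,dob:date,login:string>> )
stored as parquet;

-- Cast 'login' back to TIMESTAMP at read time.
-- Assumes the stored strings look like '2016-10-20 10:15:30'.
select
id,
col_string,
address.city,
cast(address.login as timestamp) as login
from all_datatype_prq_str
LATERAL VIEW explode(col_address) expl as address;
```

The top-level col_timestamp column is left as TIMESTAMP because, per observation 5, columns outside the array read without issue.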

 

Any help on this topic will be appreciated, as I would like to understand how to read a TIMESTAMP column inside an array from a Hive managed table stored as Parquet.

 

Thanks,

Amarnath
