
Selecting from a Parquet table in Hive gives NULLs

Question asked by marknettle on Sep 30, 2015
Latest reply on Oct 4, 2015 by marknettle
I have a very odd situation: if I select from a Parquet table (using Beeline), I can see the values, but if I use those values to insert into another table, I get NULLs. Here's a minimal reproduction. It doesn't matter whether I create the second table first or use a CTAS statement, and it doesn't matter whether the second table is textfile or Parquet. It does not happen when the first table is textfile.

    0: jdbc:hive2://...> create table foo (x string, y int) stored as parquet;
    0: jdbc:hive2://...> insert into table foo values ("fnord",1);
    0: jdbc:hive2://...> select * from foo;
    +--------+--------+--+
    | foo.x  | foo.y  |
    +--------+--------+--+
    | fnord  | 1      |
    +--------+--------+--+
    0: jdbc:hive2://...> create table bar as select * from foo;
    0: jdbc:hive2://...> select * from bar;
    +--------+--------+--+
    | bar.x  | bar.y  |
    +--------+--------+--+
    | NULL   | NULL   |
    +--------+--------+--+
    
    0: jdbc:hive2://...> show create table foo;
    +---------------------------------------------------------------------+--+
    |                           createtab_stmt                            |
    +---------------------------------------------------------------------+--+
    | CREATE TABLE `foo`(                                                 |
    |   `x` string,                                                       |
    |   `y` int)                                                          |
    | ROW FORMAT SERDE                                                    |
    |   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'     |
    | STORED AS INPUTFORMAT                                               |
    |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'   |
    | OUTPUTFORMAT                                                        |
    |   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'  |
    | LOCATION                                                            |
    |   'maprfs:/user/hive/warehouse/tmp.db/foo'                          |
    | TBLPROPERTIES (                                                     |
    |   'COLUMN_STATS_ACCURATE'='true',                                   |
    |   'numFiles'='1',                                                   |
    |   'numRows'='1',                                                    |
    |   'rawDataSize'='2',                                                |
    |   'totalSize'='268',                                                |
    |   'transient_lastDdlTime'='1443646441')                             |
    +---------------------------------------------------------------------+--+

I'm running a small (3-node) cluster on AWS Ubuntu.

    mapr-core                             5.0.0.32987.GA-1
    mapr-hive                             1.2.20150924090
Looking at the "foo" Parquet file itself, I can see it was written by

    parquet-mr version 1.6.0
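(I checked that with parquet-tools; roughly something like the following, though the jar path and data-file name are guesses for your setup:)

```shell
# Dump the file footer, which includes the created_by / parquet-mr version
# string. The jar location and the 000000_0 file name are placeholders --
# adjust for your install and the actual file under the table directory.
hadoop jar parquet-tools-1.6.0.jar meta \
    /user/hive/warehouse/tmp.db/foo/000000_0
```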
Searching around for logs with errors or the like, the most suspicious thing I found was this in /opt/mapr/hive/hive-1.2/logs/hive-mapr-hiveserver2-....out:

    Oct 1, 2015 9:46:59 AM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
    Oct 1, 2015 9:46:59 AM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 1 records.
    Oct 1, 2015 9:46:59 AM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
    Oct 1, 2015 9:46:59 AM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 0 ms. row count = 1
