AnsweredAssumed Answered

Hive Server 2 / Beeline returns no rows with an INNER JOIN on a Hive on HBase table

Question asked by matthew_moisen on May 26, 2016
Latest reply on Jun 10, 2016 by matthew_moisen

TL;DR I can execute a Hive query using an INNER JOIN on a HBase/MapR-DB table in the Hive (Hive Server 1) command line and return the correct rows. However, the same query on the Beeline (Hive Server 2) command line returns no rows. I can do an INNER JOIN fine on regular Hive on HDFS tables fine.

Edit: Any reference to "HBase" below should be understood as MapR-DB tables, not regular HBase tables.

 

I've replicated it on the following MapR environments:

MapR version: 4.0.1.27334.GA -- Hive version: hive-0.13 
MapR version: 5.1.0.37549.GA -- Hive version: hive-1.2.0

I made two HBase tables and populated them like so (Note that in MapR, we use partitions as the base name. If you want to replicate this on a non-MapR environment, remove the /app/my_partition/ part):

create '/app/my_partition/HiveParent', 'f' 
create '/app/my_partition/HiveChild', 'f' 

put '/app/my_partition/HiveParent', 'foo|a|', 'f:foo', 'a'
put '/app/my_partition/HiveParent', 'foo|b|', 'f:foo', 'b' 

put '/app/my_partition/HiveChild', 'foo|a|1|', 'f:foo', 'a'
put '/app/my_partition/HiveChild', 'foo|a|1|', 'f:bar', '1'
put '/app/my_partition/HiveChild', 'foo|a|2|', 'f:foo', 'a'
put '/app/my_partition/HiveChild', 'foo|a|2|', 'f:bar', '2' 

put '/app/my_partition/HiveChild', 'foo|b|1|', 'f:foo', 'b'
put '/app/my_partition/HiveChild', 'foo|b|1|', 'f:bar', '1'
put '/app/my_partition/HiveChild', 'foo|b|2|', 'f:foo', 'b'
put '/app/my_partition/HiveChild', 'foo|b|2|', 'f:bar', '2'

I created the Hive on HBase tables in the hive shell like so:

CREATE EXTERNAL TABLE HiveParent(rk string, foo string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
     'serialization.format'='1',
     'hbase.columns.mapping'='f:foo'
) TBLPROPERTIES ( 'hbase.table.name'='/app/my_partition/HiveParent' ); 

CREATE EXTERNAL TABLE HiveChild(rk string, foo string, bar string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
     'serialization.format'='1',
     'hbase.columns.mapping'='f:foo,f:bar'
) TBLPROPERTIES ( 'hbase.table.name'='/app/my_partition/HiveChild' );

All of the following queries work successfully in Hive (Hive Server 1) and Beeline (Hive Server 2):

SELECT * FROM HiveParent; 
SELECT foo FROM HiveParent;
SELECT foo FROM HiveParent WHERE foo IN ('a', 'b');
SELECT * FROM HiveChild;
SELECT foo, bar FROM HiveChild;
SELECT foo, bar FROM HiveChild WHERE foo IN ('a', 'b');

These queries only return rows in Hive (Hive Server 1). However, they return 0 rows in Beeline (Hive Server 2):

SELECT * FROM HiveParent INNER JOIN HiveChild ON (HiveParent.foo = HiveChild.foo);  
SELECT * FROM HiveParent, HiveChild WHERE HiveParent.foo = HiveChild.foo; 
SELECT * FROM HiveChild WHERE HiveChild.foo IN (SELECT HiveParent.foo FROM HiveParent);

It is concerning that these queries return no rows instead of throwing an exception.

 

The following are hacks that get it to work but which are not a acceptable solutions.

--------

set hive.auto.convert.join = false 

What does hive.auto.convert.join do?

If hive.auto.convert.join is set to true the optimizer not only converts joins to mapjoins but also merges MJ* patterns as much as possible

So, hive by default attempts to convert joins into map-side joins. A Map-side join is one in which the smaller of the two tables is fully loaded into the mappers memory and is much more efficient. By setting hive.auto.convert.join to false, we are disabling map-side joins and forcing all joins to be reduce-side joins, which are much more inefficient. This also doesn't explain why map-side joins on two hive-on-hbase tables fail.

-------

 

We found this solution out incidentally but it might be worth looking into. On one environment, beeline was giving OutOfMemory errors and was hanging. The sysadmins increased the memory allocated to HS2, and after that the INNER JOIN worked fine. However, this is unsettling, because Beeline did not throw any errors but simply returned no rows.

Increasing the memory on the MapR5.1/Hive-1.2 environment from 4GB to 8GB worked. However, on the MapR4.0.1/Hive-0.13 environment, this did not work.

 

In hive-env.sh:

export HADOOP_HEAPSIZE=8000 

 

Again, this isn't an acceptable solution.

-------

 

(Note, I posted this originally on Stackoverflow but cross posted it here because it wasn't getting enough attention)

Outcomes