Understanding Hive Outer Join Behavior

Document created by Hao Zhu Employee on Feb 7, 2016
Version 1Show Document
  • View in full screen mode

Author: Hao Zhu

 

Original Publication Date: June 5, 2015

 

Environment :

Spark 1.3.1

Goal

This article introduces the behavior of Hive outer join.

Solution

From Hive Outer Join Behavior, here are the definitions of Preserved Row table, Null Supplying table, During Join predicate and After Join predicate(Where predicate).

Take a left outer join for example:

select t1.*,t2.name as name_t2 
from t1 left outer join t2 on (t1.id=t2.id and t2.name='aa' )
where  t1.name='a';

Then

  • Preserved Row table: t1.
  • Null Supplying table: t2.
  • During Join predicate: t2.name='aa'.
  • After Join predicate(Where predicate): t1.name='a'.

 

The outer join result depends on the filter pushing down rule which is:

  • During Join predicates cannot be pushed past Preserved Row tables.
  • After Join predicates cannot be pushed past Null Supplying tables.

 

If the filters are put in different location( join predicate or where predicate), the results could be totally different. Sometimes even if the results are the same, the query execution plan could be different.

 

To explorer the behaviors, let's create below 2 Hive tables with sample data:

create table t1 (id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ","; 
create table t2 (id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ",";

Data for t1:

1,a 1,b 2,a 2,b 3,a 3,b 4,b

Data for t2:

1,aa 4,dd

To show a consistent plan using distributed join instead of map join, we disables hive.auto.convert.join in below tests.

set hive.auto.convert.join=false;

Case 1: Both tables have After Join predicates.

SQL:

select t1.*,t2.name as name_t2 
from t1 left outer join t2 on (t1.id=t2.id)
where t1.name='a' and t2.name='aa';

Result:

1 a aa

SQL plan:

STAGE PLANS:

  Stage: Stage-1

  Map Reduce

  Map Operator Tree:

  TableScan

  alias: t1

  Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE

  Filter Operator

  predicate: (name = 'a') (type: boolean)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  Reduce Output Operator

  key expressions: id (type: int)

  sort order: +

  Map-reduce partition columns: id (type: int)

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  value expressions: id (type: int), name (type: string)

  TableScan

  alias: t2

  Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE

  Reduce Output Operator

  key expressions: id (type: int)

  sort order: +

  Map-reduce partition columns: id (type: int)

  Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE

  value expressions: name (type: string)

  Reduce Operator Tree:

  Join Operator

  condition map:

  Left Outer Join0 to 1

  condition expressions:

  0 {VALUE._col0} {VALUE._col1}

  1 {VALUE._col1}

  outputColumnNames: _col0, _col1, _col5

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  Filter Operator

  predicate: (_col5 = 'aa') (type: boolean)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  Select Operator

  expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string)

  outputColumnNames: _col0, _col1, _col2

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  File Output Operator

  compressed: false

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  table:

  input format: org.apache.hadoop.mapred.TextInputFormat

  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 

  Stage: Stage-0

  Fetch Operator

  limit: -1

Only After Join Predicate of Preserved Row table "t1" is pushed before the join.

After Join Predicate of Null Supplying table "t2" is executed as a filter on the output of the join.

The logic is:

( (t1 where t1.name='a') left outer join t2 )  where t2.name='aa'

Case 2: Null Supplying table has During Join predicate, Preserved Row table has After Join predicate

SQL:

select t1.*,t2.name as name_t2 
from t1 left outer join t2 on (t1.id=t2.id and t2.name='aa' )
where t1.name='a';

Result:

 

1 a aa

2 a NULL

3 a NULL

SQL Plan:

STAGE PLANS:

  Stage: Stage-1

  Map Reduce

  Map Operator Tree:

  TableScan

  alias: t1

  Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE

  Filter Operator

  predicate: (name = 'a') (type: boolean)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  Reduce Output Operator

  key expressions: id (type: int)

  sort order: +

  Map-reduce partition columns: id (type: int)

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  value expressions: id (type: int), name (type: string)

  TableScan

  alias: t2

  Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE

  Filter Operator

  predicate: (name = 'aa') (type: boolean)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  Reduce Output Operator

  key expressions: id (type: int)

  sort order: +

  Map-reduce partition columns: id (type: int)

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  value expressions: name (type: string)

  Reduce Operator Tree:

  Join Operator

  condition map:

  Left Outer Join0 to 1

  condition expressions:

  0 {VALUE._col0} {VALUE._col1}

  1 {VALUE._col1}

  outputColumnNames: _col0, _col1, _col5

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  Select Operator

  expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string)

  outputColumnNames: _col0, _col1, _col2

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  File Output Operator

  compressed: false

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  table:

  input format: org.apache.hadoop.mapred.TextInputFormat

  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 

  Stage: Stage-0

  Fetch Operator

  limit: -1

Both predicates are pushed before the join.

The logic is:

(t1 where t1.name='a') left outer join (t2 where t2.name='aa')

Case 3: Null Supplying table has After Join predicate, Preserved Row table has During Join predicate

SQL:

 

select t1.*,t2.name as name_t2

from t1 left outer join t2 on (t1.id=t2.id and t1.name='a' )

where t2.name='aa';

Result:

1 a aa

SQL Plan:

 

STAGE PLANS:

  Stage: Stage-1

  Map Reduce

  Map Operator Tree:

  TableScan

  alias: t1

  Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE

  Reduce Output Operator

  key expressions: id (type: int)

  sort order: +

  Map-reduce partition columns: id (type: int)

  Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE

  value expressions: id (type: int), name (type: string)

  TableScan

  alias: t2

  Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE

  Reduce Output Operator

  key expressions: id (type: int)

  sort order: +

  Map-reduce partition columns: id (type: int)

  Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE

  value expressions: name (type: string)

  Reduce Operator Tree:

  Join Operator

  condition map:

  Left Outer Join0 to 1

  condition expressions:

  0 {VALUE._col0} {VALUE._col1}

  1 {VALUE._col1}

  filter predicates:

  0 {(VALUE._col1 = 'a')}
  1

  outputColumnNames: _col0, _col1, _col5

  Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE

  Filter Operator

  predicate: (_col5 = 'aa') (type: boolean)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  Select Operator

  expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string)

  outputColumnNames: _col0, _col1, _col2

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  File Output Operator

  compressed: false

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  table:

  input format: org.apache.hadoop.mapred.TextInputFormat

  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 

  Stage: Stage-0

  Fetch Operator

  limit: -1

Neither of the two predicates are pushed before the join.

The logic is:

(t1 left outer join t2 )  where t1.name='a' and t2.name='aa'

Case 4:  Both tables have During Join predicates.

SQL:

select t1.*,t2.name as name_t2 
from t1 left outer join t2 on
(t1.id=t2.id and t1.name='a' and t2.name='aa');

Result:

1 a aa 
1 b NULL
2 a NULL
2 b NULL
3 a NULL
3 b NULL
4 b NULL

SQL Plan:

STAGE PLANS:

  Stage: Stage-1

  Map Reduce

  Map Operator Tree:

  TableScan

  alias: t1

  Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE

  Reduce Output Operator

  key expressions: id (type: int)

  sort order: +

  Map-reduce partition columns: id (type: int)

  Statistics: Num rows: 0 Data size: 28 Basic stats: PARTIAL Column stats: NONE

  value expressions: id (type: int), name (type: string)

  TableScan

  alias: t2

  Statistics: Num rows: 0 Data size: 10 Basic stats: PARTIAL Column stats: NONE

  Filter Operator

  predicate: (name = 'aa') (type: boolean)
  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  Reduce Output Operator

  key expressions: id (type: int)

  sort order: +

  Map-reduce partition columns: id (type: int)

  Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE

  value expressions: name (type: string)

  Reduce Operator Tree:

  Join Operator

  condition map:

  Left Outer Join0 to 1

  condition expressions:

  0 {VALUE._col0} {VALUE._col1}

  1 {VALUE._col1}

  filter predicates:

  0 {(VALUE._col1 = 'a')}
  1

  outputColumnNames: _col0, _col1, _col5

  Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE

  Select Operator

  expressions: _col0 (type: int), _col1 (type: string), _col5 (type: string)

  outputColumnNames: _col0, _col1, _col2

  Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE

  File Output Operator

  compressed: false

  Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE

  table:

  input format: org.apache.hadoop.mapred.TextInputFormat

  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

 

  Stage: Stage-0

  Fetch Operator

  limit: -1

 

Only During Join Predicate of Null Supplying table "t2" is pushed before the join.

During Join Predicate of Preserved Row table "t1" is executed as a filter on the output of the join.

The logic is:

( t1 left outer join (t2 where t2.name='aa') )  where t1.name='a'

Key takeaways:

  • The SQL result depends on how you writes the SQL.
  • During Join predicates of Null Supplying tables can be pushed.
  • After Join predicates of Preserved Row tables can be pushed.
  • Sometimes if the predicates of both sides need to be pushed, please refer to case 2.  For example, imagine both "t1" and "t2" are partition tables based on partition key "name", you may want partition pruning to happen before join.

Attachments

    Outcomes