AnsweredAssumed Answered

Apache Hive - Query Optimization techinques

Question asked by Lokesh on Sep 5, 2017
Latest reply on Sep 5, 2017 by MichaelSegel

Hi All,

 

Q) Support Cost Base Optimization and predicate pushdown features are enabled in Hive. Which is true about 2 queries?

 

Query 1:

Select employees.id, b.sales from employees LEFT JOIN sales ON (employees.id = sales.employee_id) WHERE day_id between '2015-01-01' AND '2015-03-31';

 

Query 2:

Select employees.id, b.sales from employees LEFT JOIN sales ON (employees.id = sales.employee_id AND day_id between '2015-01-01' AND '2015-03-31');

 

A) Both queries will run with equal efficiency.

B) Query 1 will be faster.

C) Query 2 will be faster.

D) Both query 1 and query 2 will run faster by setting hive.optimize.ppd = false;

 

I know query 2 will run faster because of pre-filtering that happens on 'day_id' column and hence hive will have lesser data load while joining.

I got to know this by reading into some blog but i am not convinced can somebody please explain principles of CBO and Predicate pushdown? Or at least point me to good website with good introduction into this concepts!

 

Regards,

Lokesh

Outcomes