AnsweredAssumed Answered

wrong results when a partition has to many rows

Question asked by dimamah on May 27, 2013
Latest reply on Jun 3, 2013 by abhinav
hi,
The problem is joining a big partition with more than 2^31 rows. 
When the partition has more than 2147483648 rows (even 2147483649) the output of the join is a single row. 
When the partition has less than 2147483648 rows (event 2147483647) the output is correct. 

Test case : 
---------------- 
create a table with 2147483649 rows in a partition with the value : "1" , join this table to another table with a single row,single column with the value "1" on the partition_key. 
later delete 2 rows and run the same join. 
1st : only a single row is created 
2nd : 2147483647 rows 


    create table max_sint_rows (s1 string)
    partitioned by (p1 string)
    ROW FORMAT DELIMITED
       LINES TERMINATED BY  '\n';

    Create table small_table (p1 string)
    ROW FORMAT DELIMITED
       LINES TERMINATED BY  '\n';

    alter table max_sint_rows add partition (p1="1");
    

Write 2147483649 random rows to max_sint_rows
Write the value “1†into small_table
    
    create table output_rows_over as
    select a.s1
    from  max_sint_rows a join small_table b 
    on (a.p1=b.p1);

in the reducer’s syslog we get this output :  

    2013-05-27 21:51:05,864 INFO ExecReducer: ExecReducer: processing 2147000000 rows: used memory = 715266312
    2013-05-27 21:51:06,634 INFO org.apache.hadoop.mapred.FileInputFormat: Total input paths to process : 1
    2013-05-27 21:51:06,952 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarding 1 rows
    2013-05-27 21:51:06,952 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarding 1 rows
    2013-05-27 21:51:06,953 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Final Path: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_tmp.-ext-10001/000004_1
    2013-05-27 21:51:06,953 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: Writing to temp file: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_task_tmp.-ext-10001/_tmp.000004_1
    2013-05-27 21:51:06,961 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: New Final Path: FS maprfs:/user/hadoop/tmp/hive/hive_2013-05-27_20-50-23_849_6140580929822990686/_tmp.-ext-10001/000004_1
    2013-05-27 21:51:06,990 INFO ExecReducer: ExecReducer: processed 2147483650 rows: used memory = 828336712
    2013-05-27 21:51:06,990 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 finished. closing...
    2013-05-27 21:51:06,990 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarded 1 rows
    2013-05-27 21:51:14,171 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: SKEWJOINFOLLOWUPJOBS:0
    2013-05-27 21:51:14,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 finished. closing...
    2013-05-27 21:51:14,171 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarded 1 rows
    2013-05-27 21:51:14,171 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 finished. closing...
    2013-05-27 21:51:14,171 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 forwarded 0 rows
    2013-05-27 21:51:14,186 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:1
    2013-05-27 21:51:14,186 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 Close done
    2013-05-27 21:51:14,186 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 Close done
    2013-05-27 21:51:14,189 INFO org.apache.hadoop.mapred.Task: Task:attempt_201305071944_2359_r_000004_1 is done. And is in the process of commiting
    2013-05-27 21:51:14,265 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201305071944_2359_r_000004_1' done.
    2013-05-27 21:51:14,269 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-

Notice the TABLE_ID_1_ROWCOUNT:1 
and in fact the table has only one random row. 

Now delete 2 rows from max_sint_rows and rerun : 

    create table output_rows_under as
    select a.s1
    from  max_sint_rows a join small_table b 
    on (a.p1=b.p1);

we get 2147483647 rows in output_rows_under 
and the syslog of the reducer is :  

    2013-05-27 23:43:14,595 INFO ExecReducer: ExecReducer: processed 2147483648 rows: used memory = 243494552
    2013-05-27 23:43:14,595 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 finished. closing...
    2013-05-27 23:43:14,595 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 forwarded 2147483647 rows
    2013-05-27 23:43:14,595 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: SKEWJOINFOLLOWUPJOBS:0
    2013-05-27 23:43:14,595 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 finished. closing...
    2013-05-27 23:43:14,595 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 forwarded 2147483647 rows
    2013-05-27 23:43:14,595 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 finished. closing...
    2013-05-27 23:43:14,595 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: 7 forwarded 0 rows
    2013-05-27 23:43:14,681 INFO org.apache.hadoop.hive.ql.exec.FileSinkOperator: TABLE_ID_1_ROWCOUNT:2147483647
    2013-05-27 23:43:14,681 INFO org.apache.hadoop.hive.ql.exec.SelectOperator: 6 Close done
    2013-05-27 23:43:14,681 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 5 Close done
    2013-05-27 23:43:14,683 INFO org.apache.hadoop.mapred.Task: Task:attempt_201305071944_2360_r_000004_0 is done. And is in the process of commiting
    2013-05-27 23:43:14,742 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201305071944_2360_r_000004_0' done.
    2013-05-27 23:43:14,746 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1








Outcomes