I recently installed Impala on a 3-node MapR cluster. When I run a simple query.The performance is not as good as Impala+HDFS. Here is the query:
FROM ft_test, ft_wafer
WHERE ft_test_parquet.id = ft_wafer_parquet.id
and month = 1
and day = 8
and param = 2913;
It took about 3s. But when using the same query but with HDFS. It takes less than 1 sec for a 30Gb table size.
What I already did is: using parquet, partitioning, compute stats.
I attached the profile of the query. From what I see. Most of the time was spent on Scan HDFS, which is very weird because this is not a time-consuming part usually. Please take a look. Any input would be nice. Thanks.