操作Hive在做多table join的時候
有一個調教performance的要點,在這邊做個筆記
在做table join的時候where 條件可以寫在join外
SELECT * FROM A JOIN B ON A.id=B.id WHERE A.dt='20150408' AND B.dt='20150408'
也可以寫sub-query
SELECT * FROM
(SELECT * FROM A WHERE dt='20150408') a
JOIN
(SELECT * FROM B WHERE dt='20150408') b
ON a.id=b.id
第二種寫法比較複雜,但是在Hive,當資料量夠大的時候可以造成相當程度的時間差距
原因是第一種寫法並在篩選資料之前就會先做full table join
之前有段時間還認為這種小地方Hive應該會自己調整的....
以下是自己的環境數個大小table join兩種寫法的耗費時間,table名稱做了點處理
SELECT
*
FROM
b
LEFT OUTER JOIN
a
ON
a.id=b.id
LEFT OUTER JOIN
c
ON
b.id=c.id
LEFT OUTER JOIN
d
ON
b.id=d.id
EFT OUTER JOIN
e
ON
d.id=e.id
LEFT OUTER JOIN
f
ON
b.id=f.id
WHERE
a.dt='20150330' AND b.dt='20150330' AND c.dt='20150330' AND d.dt='20150330' AND e.dt='20150330' AND f.dt='20150330'
Query time: 14m58s
SELECT
*
FROM
(select * from b where dt='20150330') b
LEFT OUTER JOIN
(select * from a where dt='20150330') a
ON
a.id=b.id
LEFT OUTER JOIN
(select * from c where dt='20150330') c
ON
b.id=c.id
LEFT OUTER JOIN
(select * from d where dt='20150330') d
ON
b.id=d.id
EFT OUTER JOIN
(select * from e where dt='20150330') e
ON
d.id=e.id
LEFT OUTER JOIN
(select * from f where dt='20150330') f
ON
b.id=f.id
Query time: 2m36s
這是幾個百萬比資料table做的join
差距可以差到將近七倍,資料量小的時候沒感覺,資料量大差異就出來了