[Hive] JOIN時的效能調教

操作Hive在做多table join的時候

有一個調教performance的要點,在這邊做個筆記

在做table join的時候where 條件可以寫在join外

SELECT * FROM A JOIN B ON A.id=B.id WHERE A.dt='20150408' AND B.dt='20150408'

也可以寫sub-query

SELECT * FROM 
    (SELECT * FROM A WHERE dt='20150408') a 
JOIN 
    (SELECT * FROM B WHERE dt='20150408') b 
ON a.id=b.id

第二種寫法比較複雜,但是在Hive,當資料量夠大的時候可以造成相當程度的時間差距

原因是第一種寫法並在篩選資料之前就會先做full table join

之前有段時間還認為這種小地方Hive應該會自己調整的....

以下是自己的環境數個大小table join兩種寫法的耗費時間,table名稱做了點處理

SELECT
    *
FROM
    b
LEFT OUTER JOIN
    a
ON
    a.id=b.id
LEFT OUTER JOIN
    c
ON
    b.id=c.id
LEFT OUTER JOIN
    d
ON
    b.id=d.id
EFT OUTER JOIN
    e
ON
    d.id=e.id
LEFT OUTER JOIN
    f
ON
    b.id=f.id
WHERE
    a.dt='20150330' AND b.dt='20150330' AND c.dt='20150330' AND d.dt='20150330' AND e.dt='20150330' AND f.dt='20150330'

Query time: 14m58s

SELECT
    *
FROM
     (select * from b where dt='20150330') b
LEFT OUTER JOIN
     (select * from a where dt='20150330') a
ON
    a.id=b.id
LEFT OUTER JOIN
     (select * from c where dt='20150330') c
ON
    b.id=c.id
LEFT OUTER JOIN
     (select * from d where dt='20150330') d
ON
    b.id=d.id
EFT OUTER JOIN
    (select * from e where dt='20150330') e
ON
    d.id=e.id
LEFT OUTER JOIN
    (select * from f where dt='20150330') f
ON
    b.id=f.id

Query time: 2m36s

這是幾個百萬比資料table做的join

差距可以差到將近七倍,資料量小的時候沒感覺,資料量大差異就出來了

comments powered by Disqus