Sunday, September 16, 2012

Hive join tips !

You can specified the [biggest] table to be streamed during a join operation :
SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1);

And you can specified the [smallest] table to be join in a map join if you want to avoid the reducer
SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a join b on a.key = b.key;

And if you want to do ... WHERE a.key IN (SELECT ...) use :
SELECT a.key, a.val FROM a LEFT SEMI JOIN b on (a.key = b.key);