Study Notes (3): Big Data with Hive - Join Queries

Technology 2025-12-07 16

Course link: https://edu.csdn.net/course/play/8005/164135?utm_source=blogtoedu

Creating the tables

create table customers(id int, name string, age int);
insert into customers(id, name, age) values(1, 'gxf', 23);
create table orders(id int, cid int, orderno int, price float);
insert into orders(id, cid, orderno, price) values(1, 1, 1, 1.2);
insert into orders(id, cid, orderno, price) values(1, 1, 2, 3.2);

Left semi join

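In a left semi join, the select and where clauses cannot reference columns of the right-hand table.

As soon as a left-table row finds a matching row in the right table, the scan of the right table stops for that row. Each left row is therefore returned at most once, and the join is generally more efficient than an inner join.

Hive does not support a right semi join.

select c.id,c.name from customers c left semi join orders o on c.id = o.cid;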
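The difference from an inner join is easiest to see on the sample rows. The following is a minimal Python sketch of the semantics only, not of how Hive executes the query; the helper names `inner_join` and `left_semi_join` are hypothetical, and the tuples simply mirror the rows inserted when the tables were created.

```python
# Sample rows: customers(id, name, age) and orders(id, cid, orderno, price).
customers = [(1, "gxf", 23)]
orders = [(1, 1, 1, 1.2), (1, 1, 2, 3.2)]


def inner_join(left, right):
    """Emit one (id, name) row per matching order, so duplicates are possible."""
    return [(c[0], c[1]) for c in left for o in right if c[0] == o[1]]


def left_semi_join(left, right):
    """Emit each left row at most once, stopping at the first matching order."""
    result = []
    for c in left:
        # any() short-circuits: it stops scanning the right side on the first match.
        if any(c[0] == o[1] for o in right):
            result.append((c[0], c[1]))
    return result


print(inner_join(customers, orders))      # [(1, 'gxf'), (1, 'gxf')]
print(left_semi_join(customers, orders))  # [(1, 'gxf')]
```

The customer has two orders, so the inner join returns the customer twice while the semi join returns it once.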

Cartesian join (m*n rows)

select c.id,c.name from customers c join orders o;
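To make the m*n row count concrete, here is a small Python sketch under the assumption that the tables hold only the sample rows (1 customer, 2 orders). The helper `cross_join` is a hypothetical name used for illustration.

```python
from itertools import product

# Sample rows: customers(id, name, age) and orders(id, cid, orderno, price).
customers = [(1, "gxf", 23)]
orders = [(1, 1, 1, 1.2), (1, 1, 2, 3.2)]


def cross_join(left, right):
    """Pair every left row with every right row and project (id, name)."""
    return [(c[0], c[1]) for c, _ in product(left, right)]


rows = cross_join(customers, orders)
print(len(rows))  # 2, that is 1 customer * 2 orders
```

With no join condition, the result size depends only on the two row counts, which is why this kind of join grows quickly on large tables.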

Map-side join

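A map-side join is performed by the mappers, which load one small table entirely into memory.

In Hive, Map Join means exactly this map-side join.

It works by loading the small table into memory on the map side, then reading the large table and joining each of its rows against the in-memory small table. MapJoin relies on the distributed cache to ship the small table to the mappers.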
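The build-then-probe idea can be sketched in a few lines of Python. This is a single-process illustration, not Hive's implementation: the function `map_side_join` is hypothetical, and joining on `orders.cid = customers.id` is an assumption made for the example.

```python
from collections import defaultdict


def map_side_join(small_table, big_table, small_key, big_key):
    """Hash join: hold the small table in memory, stream the big table past it."""
    # Build phase: index every small-table row by its join key.
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[small_key]].append(row)
    # Probe phase: read the big table once, row by row, with no shuffle or reduce.
    for big_row in big_table:
        for small_row in lookup.get(big_row[big_key], []):
            yield small_row, big_row


# Sample rows: customers(id, name, age) and orders(id, cid, orderno, price).
customers = [(1, "gxf", 23)]
orders = [(1, 1, 1, 1.2), (1, 1, 2, 3.2)]

# Assumed join condition: customers.id (index 0) = orders.cid (index 1).
joined = [(c[0], c[1], o[2]) for c, o in map_side_join(customers, orders, 0, 1)]
print(joined)  # [(1, 'gxf', 1), (1, 'gxf', 2)]
```

Only the small table has to fit in memory; the large table is consumed as a stream, which is why the size of the small table is the limiting factor.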

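Advantages of Map Join:

It consumes no reduce resources in the cluster. Skipping the reduce phase speeds up execution and lowers network load.

Disadvantages of Map Join:

It uses memory, so the table loaded into memory must not be too large, because every compute node loads its own copy. It also produces a larger number of small files.

select /*+mapjoin(c)*/ c.id,c.name,o.orderno from customers c join orders o;
select /*+mapjoin(o)*/ c.id,c.name,o.orderno from customers c join orders o;
-- set the small-table size threshold (in bytes)
set hive.mapjoin.smalltable.filesize=25000000;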

Note: the set command only affects the current session. To make the setting persistent, edit hive-site.xml.

union all

select id, name from customers union all select id, orderno from orders;
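As a rough model of what the query returns, union all appends one result set to the other, matches columns by position, and keeps duplicates. The Python sketch below assumes the sample rows from the table setup; `union_all` is a hypothetical helper, and it does not model Hive's type checking between the two sides.

```python
def union_all(*results):
    """Concatenate result sets in order, keeping duplicates."""
    combined = []
    width = None
    for rows in results:
        for row in rows:
            if width is None:
                width = len(row)
            elif len(row) != width:
                # Both sides of a union must have the same number of columns.
                raise ValueError("column count mismatch")
            combined.append(row)
    return combined


# Sample rows: customers(id, name, age) and orders(id, cid, orderno, price).
customers = [(1, "gxf", 23)]
orders = [(1, 1, 1, 1.2), (1, 1, 2, 3.2)]

left = [(c[0], c[1]) for c in customers]  # select id, name from customers
right = [(o[0], o[2]) for o in orders]    # select id, orderno from orders
print(union_all(left, right))  # [(1, 'gxf'), (1, 1), (1, 2)]
```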
