目录
0 需求分析
小打卡面试题:计算小打卡平台的圈主建了多少圈子,名下有多少用户
1 需求实现
3 小 结
已知,数据如下:
圈子可以理解为微信公众号,用户可以理解为公众号的粉丝 tb_habit圈子表:近千万行数据
master_id(圈住的用户id)habit_id(圈住所建的圈子)open_id1habit_id1open_id1habit_id2open_id1habit_id3open_id2habit_id4open_id2habit_id5open_id3habit_id6open_id3habit_id7.........................user_habit_relation用户与圈子关系表:亿行数据
habit_id(所建的圈子)user_id(加入圈子的用户)habit_id1user_id1habit_id1user_id3habit_id1user_id4habit_id3user_id2habit_id3user_id1habit_id2user_id5habit_id2user_id1habit_id2user_id7habit_id4user_id11habit_id4user_id12habit_id4user_id1habit_id6user_id17
需求:请用hive sql计算出如下结果(同一个圈主名下,同一个用户加多个圈子只计算一次):
(1)数据准备
圈住用户id 所建圈子id open_id1,habit_id1 open_id1,habit_id2 open_id1,habit_id3 open_id2,habit_id4 open_id2,habit_id5 open_id3,habit_id6 open_id3,habit_id7 圈子id 加入圈子的用户id habit_id1,user_id1 habit_id1,user_id3 habit_id1,user_id4 habit_id3,user_id2 habit_id3,user_id1 habit_id2,user_id5 habit_id2,user_id1 habit_id2,user_id7 habit_id4,user_id11 habit_id4,user_id12 habit_id4,user_id1 habit_id6,user_id17(2)创建hive表
drop table if exists dan_test.tb_habit CREATE TABLE dan_test.tb_habit ( master_id string, habit_id string ) ROW format delimited FIELDS TERMINATED BY ","; ———————————————— drop table if exists dan_test.user_habit_relation CREATE TABLE dan_test.user_habit_relation ( habit_id string, user_id string ) ROW format delimited FIELDS TERMINATED BY ",";(3)导入数据
load data local inpath "/home/centos/dan_test/tb_habit.txt" into table tb_habit; load data local inpath "/home/centos/dan_test/user_habit_relation.txt" into table user_habit_relation;(4)需求实现
分析:建立圈子关系树。如下图所示:
很明显本题需要先对两张表进行join(join思维:寻找完整表的数据)通过join将两张表关联起来,采用left join以圈子表为主表,建立圈主与粉丝之间的映射关系,此时记录数将会被补全,对应的会出现重复的数据如habit_id,因而再求个数时需要去重,如上图所示由于用户1出现了不同的圈子但又是隶属于同一个圈主的,因而在算用户数时需要去重。具体SQL如下:
select c.master_id ,count(distinct c.a_habit_id) ,count(distinct c.user_id) from( select a.master_id as master_id ,a.habit_id as a_habit_id ,b.habit_id as b_habit_id ,b.user_id as user_id from tb_habit a left join user_habit_relation b on a.habit_id = b.habit_id ) c group by c.master_id OK open_id1 3 6 open_id2 2 3 open_id3 2 1 Time taken: 10.707 seconds, Fetched: 3 row(s)子表结果如下:
select a.master_id as master_id ,a.habit_id as a_habit_id ,b.habit_id as b_habit_id ,b.user_id as user_id from tb_habit a left join user_habit_relation b on a.habit_id = b.habit_id --------------------------------- OK open_id1 habit_id1 habit_id1 user_id1 open_id1 habit_id1 habit_id1 user_id4 open_id1 habit_id1 habit_id1 user_id3 open_id1 habit_id2 habit_id2 user_id5 open_id1 habit_id2 habit_id2 user_id7 open_id1 habit_id2 habit_id2 user_id1 open_id1 habit_id3 habit_id3 user_id2 open_id1 habit_id3 habit_id3 user_id1 open_id2 habit_id4 habit_id4 user_id11 open_id2 habit_id4 habit_id4 user_id1 open_id2 habit_id4 habit_id4 user_id12 open_id2 habit_id5 NULL NULL open_id3 habit_id6 habit_id6 user_id17 open_id3 habit_id7 NULL NULL Time taken: 10.573 seconds, Fetched: 14 row(s)本题的解题关键:建立圈子关系映射图,如上图所示,通过关系树构建关联表,然后按照题意找出需求结果。本题是典型的SQL思维join思维解题方法的典型代表。
本题是典型的SQL思维之join思维的典型代表题,主要采用join思维进行解题,根据题意建立映射关系图,然后选择合适的join进行求解。
本题拓展案例:
需求:计算平台的每一个用户发过多少朋友圈、获得多少点赞
已知,数据如下:
T1:数十万行数据
需求:请用sql计算出如下结果:
有了本题的分析作为基础,相信聪明的读者应该不难,具体答案评论区见分晓。该拓展案例映射关系如下图所示,SQL由读者给出。
