Hive复合类型操作

技术2025-07-31 40

概览

hive目前支持4种复合类型

array_type : ARRAY < data_type > map_type : MAP < primitive_type, data_type > struct_type : STRUCT < col_name : data_type [COMMENT col_comment], ...> union_type : UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)

常用的为前3种, 这篇文章仅介绍这3种类型的用法

array类型

假设有个选课数据表，有2个字段——学号id, 所选课程列表

定义表

create table if not exists course_select( stu_id string comment 'the id of a student', select_courses array<string> comment 'selected courses of a student') row format delimited fields terminated by '\t' collection items terminated by ',';

注：书写hive sql时，不要使用tab键进行缩进，会产生错误，应使用空格方式缩进

准备数据

cat course_select_data.txt

08102111 小说,戏曲,军事理论 08102244 语文,古代汉语 08102308 小说,外贸,经济学 hive> load data local inpath '/var/lib/hadoop-hdfs/hive-app-data/select_courses.dat' > overwrite into table course_select; Loading data to table default.course_select Table default.course_select stats: [numFiles=1, totalSize=98] OK Time taken: 0.861 seconds

查看数据

hive> set hive.cli.print.header=true; # 显示表结构的配置参数 hive> select * from course_select; OK course_select.stu_id course_select.select_courses 08102111 ["小说","戏曲","军事理论"] 08102244 ["语文","古代汉语"] 08102308 ["小说","外贸","经济学"] # 查询array类型字段中的第一个元素值 hive> select stu_id, select_courses[0] from course_select; stu_id _c1 08102111 小说 08102244 语文 08102308 小说

struct 类型

设定有这样一个学生信息表, 学号id, 学生信息

定义表

create external table if not exists student_tb( stu_id int comment 'the id of a student', basic_info struct<name:string,age:int> comment 'the info of a student') row format delimited fields terminated by '\t' collection items terminated by ',';

准备数据

cat student_info.dat

08102101 张三,23 08102121 李二,25 08102141 马冬梅,20 08102108 XiaoXiu,24 hive> load data local inpath '/var/lib/hadoop-hdfs/hive-app-data/student_info.dat' > overwrite into table student_tb; Loading data to table default.student_tb Table default.student_tb stats: [numFiles=1, totalSize=101] OK Time taken: 0.436 seconds hive> select * from student_tb; OK student_tb.stu_id student_tb.basic_info 8102101 {"name":"张三","age":23} NULL NULL NULL NULL NULL NULL

出现问题了！共有4行数据，只有第一行数据被成功解析了

经过排查，发现第2，3，4行数据字段分隔符是空格，不是表创建时指定的 '\t'

删除原表，重新操作

# 删除原表 hive> drop table student_tb; OK Time taken: 0.169 seconds # 重新定义表 hive> create external table if not exists student_tb( > stu_id int comment 'the id of a student', > basic_info struct<name:string,age:int> comment 'the info of a student') > row format delimited > fields terminated by '\t' > collection items terminated by ','; OK Time taken: 0.082 seconds # 修正数据字段分隔符后,重新导入数据 hive> load data local inpath '/var/lib/hadoop-hdfs/hive-app-data/student_info.dat' > overwrite into table student_tb; Loading data to table default.student_tb Table default.student_tb stats: [numFiles=1, numRows=0, totalSize=80, rawDataSize=0] OK Time taken: 0.759 seconds # 再次查询，这回数据都解析成功了！ hive> select * from student_tb; OK student_tb.stu_id student_tb.basic_info 8102101 {"name":"张三","age":23} 8102121 {"name":"李二","age":25} 8102141 {"name":"马冬梅","age":20} 8102108 {"name":"XiaoXiu","age":24} Time taken: 0.083 seconds, Fetched: 4 row(s)

查询struct类型字段的子字段

hive> select stu_id, basic_info.age from student_tb; Total MapReduce CPU Time Spent: 2 seconds 430 msec OK stu_id age 8102101 23 8102121 25 8102141 20 8102108 24 Time taken: 16.571 seconds, Fetched: 4 row(s)

map 类型

假设有这样一个表, 它的字段有——学号id, 姓名, 家庭成员, 年龄

表定义

create external table if not exists stu_info_map( stu_id int comment 'the id of a student', name string, familly map<string,string>, age int) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated BY ':';

准备数据

cat stu_info_map.dat

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28 2,lisi,father:mayun#mother:huangyi#brother:guanyu,22 3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29 4,mayun,father:mayongzhen#mother:angelababy,26 hive> load data local inpath '/var/lib/hadoop-hdfs/hive-app-data/stu_info_map.dat' > overwrite into table stu_info_map; Loading data to table default.stu_info_map Table default.stu_info_map stats: [numFiles=1, totalSize=222] OK Time taken: 0.409 seconds hive> select * from stu_info_map; OK stu_info_map.stu_id stu_info_map.name stu_info_map.familly stu_info_map.age 1 zhangsan {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} 28 2 lisi {"father":"mayun","mother":"huangyi","brother":"guanyu"} 22 3 wangwu {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} 29 4 mayun {"father":"mayongzhen","mother":"angelababy"} 26 Time taken: 0.071 seconds, Fetched: 4 row(s)

操作map类型字段

# 查询每条数据的map都包含的father信息 hive> select name, familly['father'] from stu_info_map; Total MapReduce CPU Time Spent: 2 seconds 950 msec OK name _c1 zhangsan xiaoming lisi mayun wangwu wangjianlin mayun mayongzhen Time taken: 18.269 seconds, Fetched: 4 row(s) # 查询每条数据的sister信息, 该信息有可能不存在 hive> select name, familly['sister'] from stu_info_map; Total MapReduce CPU Time Spent: 3 seconds 110 msec OK name _c1 zhangsan NULL lisi NULL wangwu jingtian mayun NULL Time taken: 17.363 seconds, Fetched: 4 row(s

复合字段分隔符语法讲解

# 数据表顶级字段之间的分隔符 fields terminated by ',' # 复合字段内元素之间的分隔符, 比如array中, 相邻元素以 '#' 分割 collection items terminated by '#' # 复合字段内单个元素内部组成部分的分隔符，比如map中一个kv对的k与v的分隔符 ':' map keys terminated BY ':';

Processed: 0.011, SQL: 9