Hive复合类型操作

    技术2025-07-31  12

    概览

    hive目前支持4种复合类型

    array_type : ARRAY < data_type > map_type : MAP < primitive_type, data_type > struct_type : STRUCT < col_name : data_type [COMMENT col_comment], ...> union_type : UNIONTYPE < data_type, data_type, ... > -- (Note: Available in Hive 0.7.0 and later)

    常用的为前3种, 这篇文章仅介绍这3种类型的用法

     

    array类型

    假设有个选课数据表,有2个字段——学号id, 所选课程列表

    定义表

    create table if not exists course_select( stu_id string comment 'the id of a student', select_courses array<string> comment 'selected courses of a student') row format delimited fields terminated by '\t' collection items terminated by ',';

    注:书写hive sql时,不要使用tab键进行缩进,会产生错误,应使用空格方式缩进

    准备数据

    cat course_select_data.txt

    08102111 小说,戏曲,军事理论 08102244 语文,古代汉语 08102308 小说,外贸,经济学 hive> load data local inpath '/var/lib/hadoop-hdfs/hive-app-data/select_courses.dat' > overwrite into table course_select; Loading data to table default.course_select Table default.course_select stats: [numFiles=1, totalSize=98] OK Time taken: 0.861 seconds

    查看数据

    hive> set hive.cli.print.header=true; # 显示表结构的配置参数 hive> select * from course_select; OK course_select.stu_id course_select.select_courses 08102111 ["小说","戏曲","军事理论"] 08102244 ["语文","古代汉语"] 08102308 ["小说","外贸","经济学"] # 查询array类型字段中的第一个元素值 hive> select stu_id, select_courses[0] from course_select; stu_id _c1 08102111 小说 08102244 语文 08102308 小说

     

    struct 类型

    设定有这样一个学生信息表,  学号id, 学生信息

    定义表

    create external table if not exists student_tb( stu_id int comment 'the id of a student', basic_info struct<name:string,age:int> comment 'the info of a student') row format delimited fields terminated by '\t' collection items terminated by ',';

    准备数据

    cat student_info.dat

    08102101 张三,23 08102121 李二,25 08102141 马冬梅,20 08102108 XiaoXiu,24 hive> load data local inpath '/var/lib/hadoop-hdfs/hive-app-data/student_info.dat' > overwrite into table student_tb; Loading data to table default.student_tb Table default.student_tb stats: [numFiles=1, totalSize=101] OK Time taken: 0.436 seconds hive> select * from student_tb; OK student_tb.stu_id student_tb.basic_info 8102101 {"name":"张三","age":23} NULL NULL NULL NULL NULL NULL

    出现问题了!共有4行数据,只有第一行数据被成功解析了

    经过排查,发现第2,3,4行数据字段分隔符是空格,不是表创建时指定的 '\t'

    删除原表,重新操作

    # 删除原表 hive> drop table student_tb; OK Time taken: 0.169 seconds # 重新定义表 hive> create external table if not exists student_tb( > stu_id int comment 'the id of a student', > basic_info struct<name:string,age:int> comment 'the info of a student') > row format delimited > fields terminated by '\t' > collection items terminated by ','; OK Time taken: 0.082 seconds # 修正数据字段分隔符后,重新导入数据 hive> load data local inpath '/var/lib/hadoop-hdfs/hive-app-data/student_info.dat' > overwrite into table student_tb; Loading data to table default.student_tb Table default.student_tb stats: [numFiles=1, numRows=0, totalSize=80, rawDataSize=0] OK Time taken: 0.759 seconds # 再次查询,这回数据都解析成功了! hive> select * from student_tb; OK student_tb.stu_id student_tb.basic_info 8102101 {"name":"张三","age":23} 8102121 {"name":"李二","age":25} 8102141 {"name":"马冬梅","age":20} 8102108 {"name":"XiaoXiu","age":24} Time taken: 0.083 seconds, Fetched: 4 row(s)

    查询struct类型字段的子字段

    hive> select stu_id, basic_info.age from student_tb; Total MapReduce CPU Time Spent: 2 seconds 430 msec OK stu_id age 8102101 23 8102121 25 8102141 20 8102108 24 Time taken: 16.571 seconds, Fetched: 4 row(s)

     

    map 类型

    假设有这样一个表, 它的字段有——学号id, 姓名, 家庭成员, 年龄

    表定义

    create external table if not exists stu_info_map( stu_id int comment 'the id of a student', name string, familly map<string,string>, age int) row format delimited fields terminated by ',' collection items terminated by '#' map keys terminated BY ':';

    准备数据

    cat stu_info_map.dat

    1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28 2,lisi,father:mayun#mother:huangyi#brother:guanyu,22 3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29 4,mayun,father:mayongzhen#mother:angelababy,26 hive> load data local inpath '/var/lib/hadoop-hdfs/hive-app-data/stu_info_map.dat' > overwrite into table stu_info_map; Loading data to table default.stu_info_map Table default.stu_info_map stats: [numFiles=1, totalSize=222] OK Time taken: 0.409 seconds hive> select * from stu_info_map; OK stu_info_map.stu_id stu_info_map.name stu_info_map.familly stu_info_map.age 1 zhangsan {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} 28 2 lisi {"father":"mayun","mother":"huangyi","brother":"guanyu"} 22 3 wangwu {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} 29 4 mayun {"father":"mayongzhen","mother":"angelababy"} 26 Time taken: 0.071 seconds, Fetched: 4 row(s)

    操作map类型字段

    # 查询每条数据的map都包含的father信息 hive> select name, familly['father'] from stu_info_map; Total MapReduce CPU Time Spent: 2 seconds 950 msec OK name _c1 zhangsan xiaoming lisi mayun wangwu wangjianlin mayun mayongzhen Time taken: 18.269 seconds, Fetched: 4 row(s) # 查询每条数据的sister信息, 该信息有可能不存在 hive> select name, familly['sister'] from stu_info_map; Total MapReduce CPU Time Spent: 3 seconds 110 msec OK name _c1 zhangsan NULL lisi NULL wangwu jingtian mayun NULL Time taken: 17.363 seconds, Fetched: 4 row(s

     

    复合字段分隔符语法讲解

     

      # 数据表顶级字段之间的分隔符 fields terminated by ','   # 复合字段内元素之间的分隔符, 比如array中, 相邻元素以 '#' 分割   collection items terminated by '#' # 复合字段内单个元素内部组成部分的分隔符,比如map中一个kv对的k与v的分隔符 ':'   map keys terminated BY ':';

     

    Processed: 0.009, SQL: 9