hive窗口函数::基本操作

    技术2022-07-11  72

    窗口函数及其命令总结

    退出安全模式启动服务创建.csv文件进入hive执行建表和导数据操作函数命令window子句ntilerank()、dense_rank()、row_number()lag,leadfirst_value,last_value

    退出安全模式

    [root @cjh1 ~]# hadoop dfsadmin -safemode leave

    启动服务

    [root @cjh1 ~]# start-all.sh [root @cjh1 ~]#hive --service hiveserver2 # 这是后台启动hive2 [root @cjh1 ~]#nohup hiveserver2 1>/dev/null 2>&1 & [root @cjh1 ~]#beeline -u jdbc:hive2://192.168.153.141:10000

    创建.csv文件

    # 在根目录下创建order.csv,内容如下: jack,2015-01-01,10 tony,2015-01-02,15 jack,2015-02-03,23 tony,2015-01-04,29 jack,2015-01-05,46 jack,2015-04-06,42 tony,2015-01-07,50 jack,2015-01-08,55 mart,2015-04-08,62 mart,2015-04-09,68 neil,2015-05-10,12 mart,2015-04-11,75 neil,2015-06-12,80 mart,2015-04-13,94

    进入hive执行建表和导数据操作

    # 建表 0: jdbc:hive2://192.168.153.141:10000> create table t_window( . . . . . . . . . . . . . . . . . . .> name string, . . . . . . . . . . . . . . . . . . .> orderdate date, . . . . . . . . . . . . . . . . . . .> cost int . . . . . . . . . . . . . . . . . . .> ) . . . . . . . . . . . . . . . . . . .> row format delimited fields terminated by ','; # 导入本地文件 order.csv 数据到新建的表中 0: jdbc:hive2://192.168.153.141:10000> load data local inpath '/root/order.csv' overwrite into table t_window;

    函数命令

    window子句

    preceding:往前 following:往后 current row:当前行 unbounded :起点(一般结合preceding,following使用) unbounded preceding:表示该窗口最前面的行(起点) unbounded following:表示该窗口最后面的行(终点)

    比如: rows between unbounded preceding and current row(表示从起点到当前行) rows between 1 preceding and current row(表示往前1行到当前行) rows between 1 preceding and 1 following(表示往前1行到往后1行) rows between current row and unbounded following(表示当前行到终点)

    select name, orderdate,cost, sum(cost) over() as sample1, sum(cost) over(partition by name) as sample2, sum(cost) over(partition by name order by orderdate) as sample3, sum(cost) over(partition by name order by orderdate rows between unbounded preceding and current row) as sample4, sum(cost) over(partition by name order by orderdate rows between 1 preceding and current row) as sample5, sum(cost) over(partition by name order by orderdate rows between 1 preceding and 1 following) as sample6, sum(cost) over(partition by name order by orderdate rows between current row and unbounded following) as sample7 from t_window;

    以下是带有分析功能的窗口函数。

    ntile

    NTILE(n) :用于将分组数据按照顺序切分成n片,返回当前切片值,如果切片不均匀,默认增加第一个切片的分布。NTILE不支持ROWS BETWEEN

    select name, orderdate, cost, ntile(3) over() as sample1, ntile(3) over(partition by name) as sample2, ntile(3) over(order by cost) as sample3, ntile(3) over(partition by name order by cost) as sample3 from t_window;

    rank()、dense_rank()、row_number()

    1、rank():生成数据项在分组中的排名,排名相等会在名次中留下空位。 2、dense_rank():生成数据项在分组中的排名,排名相等会在名次中不会留下空位。 3、row_number():从1开始,按照顺序,生成分组内记录的序列,row_number()的值不会存在重复,当排序的值相同时,按照表中记录的顺序进行排列;通常用于获取分组内排序第一的记录;获取一个session中的第一条refer等。

    select name, orderdate, cost, rank() over(order by name) as r1, dense_rank() over(order by name) as r2, row_number() over(order by name) as r3 from t_window;

    lag,lead

    1、LAG(col,n,DEFAULT) :用于统计窗口内往上第n行值第一个参数为列名,第二个参数为往上第n行(可选,默认为1,不可为负数),第三个参数为默认值(当往上第n行为NULL时候,取默认值,如不指定,则为NULL)

    2、LEAD(col,n,DEFAULT) :用于统计窗口内往下第n行值第一个参数为列名,第二个参数为往下第n行(可选,默认为1,不可为负数),第三个参数为默认值(当往下第n行为NULL时候,取默认值,如不指定,则为NULL)

    select name,orderdate,cost, lag(orderdate,1) over(partition by name order by orderdate) as time1, lag(orderdate,2) over(partition by name order by orderdate) as time2, lead(orderdate,1) over(partition by name order by orderdate) as time3 from t_window;

    first_value,last_value

    1、first_value:取分组内排序后,截止到当前行,第一个值,这最多需要两个参数。第一个参数是您想要第一个值的列,第二个(可选)参数必须是false默认为布尔值的布尔值。如果设置为true,则跳过空值。

    2、last_value:取分组内排序后,截止到当前行,最后一个值,这最多需要两个参数。第一个参数是您想要第一个值的列,第二个(可选)参数必须是false默认为布尔值的布尔值。如果设置为true,则跳过空值。

    select name,orderdate,cost, first_value(orderdate) over(partition by name order by orderdate) as time1, last_value(orderdate) over(partition by name order by orderdate) as time2 from t_window;

    ps:望多多支持,后续更新中。。。

    Processed: 0.017, SQL: 9