clickhouse实现排序排名
如何在ClickHouse中实现ROW_NUMBER OVER 和DENSE_RANK OVER等同效果的查询,它们在一些其他数据库中可用于RANK排序。
CH中并没有直接提供对应的开窗函数,需要利用一些特殊函数变相实现,主要会用到下面几个数组函数,它们分别是:
arrayEnumerate arrayEnumerateDense arrayEnumerateUniq
这些函数均接受一个数组作为输入参数,并返回数组中元素出现的位置,例如:
SELECT arrayEnumerate([10, 20, 30, 10, 40]) AS row_number, arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank, arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank ┌─row_number──┬─dense_rank──┬─uniq_rank───┐ │ [1,2,3,4,5] │ [1,2,3,1,4] │ [1,1,1,2,1] │ └─────────────┴─────────────┴─────────────┘
--建表语句
CREATE TABLE res_report.xxx_loan (
`loan_id` String,--登录ID
`customer_id` String,--用户ID
`loan_dt` Nullable (Date),--登录日期
`end_date` Nullable (Date),--退出日期
`due_days` Nullable (Int64)--与分组排序等暂无关字段,此处可以代指一些业务字段
) ENGINE = MergeTree ()
ORDER BY
VID SETTINGS index_granularity = 8192;
我们的目标是实现开窗查询:
ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerate
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateDense
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateUniq
代码如下
SELECT customer_id , groupArray(loan_dt) AS loan_dt, groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date, groupArray(due_days) AS due_days, groupArray(loan_id) AS loan_id, arrayEnumerate(loan_id) AS row_number, arrayEnumerateDense(loan_id) AS dense_rank, arrayEnumerateUniq(loan_id) AS uniq_rank FROM ( SELECT * FROM res_report.xxx_loan ORDER BY loan_dt ,loan_id ) GROUP BY customer_id
数组展开,利用ARRAY JOIN将数组展开,并按照customer_id 、loan_id 列排序:
SELECT customer_id ,loan_id ,loan_dt ,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt ,due_days ,row_number ,dense_rank ,uniq_rank from ( SELECT customer_id , groupArray(loan_dt) AS loan_dt, groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date, groupArray(due_days) AS due_days, groupArray(loan_id) AS loan_id, arrayEnumerate(loan_id) AS row_number, arrayEnumerateDense(loan_id) AS dense_rank, arrayEnumerateUniq(loan_id) AS uniq_rank FROM ( SELECT * FROM res_report.xxx_loan ORDER BY loan_dt ,loan_id ) GROUP BY customer_id ) ARRAY JOIN loan_dt, loan_id, end_date, due_days, row_number, dense_rank, uniq_rank ORDER BY customer_id ASC, row_number ASC , dense_rank ASC
技巧:因为end_date可能为空值,会导致array长度不一致。报错。需要用特数值填充然后最后再转换回来。
lag/lead实现:
neighbor(column, offset[, default_value])
函数的结果取决于受影响的数据块和数据块中数据的顺序。
如果使用ORDER BY进行子查询,并从子查询外部调用该函数,则可以得到预期的结果。
参数
列-列名或标量表达式。
偏移量—从列的当前行向前或向后的行数。Int64。
默认值-可选。如果偏移量超出块的范围,则返回的值。受影响的数据块的类型。
返回值
如果偏移值不在块边界之外,则当前行偏移距离中的列的值。
如果偏移值超出块边界,则列的默认值。如果给定了默认的_值,则将使用它。
类型:受影响数据块的类型或默认值类型。
代码如下
SELECT customer_id ,loan_id ,loan_dt ,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt ,due_days ,row_number as row_num ,dense_rank ,uniq_rank ,if(neighbor(row_num , 1)>1,neighbor(loan_dt , 1),null) as lead_loan_dt ,if(row_num<>1,neighbor(end_dt, -1),null) as lag_end_dt from ( SELECT customer_id , groupArray(loan_dt) AS loan_dt, groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date, groupArray(due_days) AS due_days, groupArray(loan_id) AS loan_id, arrayEnumerate(loan_id) AS row_number, arrayEnumerateDense(loan_id) AS dense_rank, arrayEnumerateUniq(loan_id) AS uniq_rank FROM ( SELECT * FROM res_report.ipeso_loan ORDER BY loan_dt ,loan_id ) GROUP BY customer_id ) ARRAY JOIN loan_dt, loan_id, end_date, due_days, row_number, dense_rank, uniq_rank ORDER BY customer_id ASC, row_number ASC , dense_rank ASC