clickhouse实现排序排名


如何在ClickHouse中实现ROW_NUMBER OVER 和DENSE_RANK OVER等同效果的查询,它们在一些其他数据库中可用于RANK排序。

CH中并没有直接提供对应的开窗函数,需要利用一些特殊函数变相实现,主要会用到下面几个数组函数,它们分别是:

arrayEnumerate
arrayEnumerateDense
arrayEnumerateUniq

这些函数均接受一个数组作为输入参数,并返回数组中元素出现的位置,例如:

SELECT    
 arrayEnumerate([10, 20, 30, 10, 40]) AS row_number,     
 arrayEnumerateDense([10, 20, 30, 10, 40]) AS dense_rank,    
 arrayEnumerateUniq([10, 20, 30, 10, 40]) AS uniq_rank
 
┌─row_number──┬─dense_rank──┬─uniq_rank───┐
│ [1,2,3,4,5] │ [1,2,3,1,4] │ [1,1,1,2,1] │
└─────────────┴─────────────┴─────────────┘

--建表语句

CREATE TABLE res_report.xxx_loan (
`loan_id` String,--登录ID
`customer_id` String,--用户ID
`loan_dt` Nullable (Date),--登录日期
`end_date` Nullable (Date),--退出日期
`due_days` Nullable (Int64)--与分组排序等暂无关字段,此处可以代指一些业务字段
) ENGINE = MergeTree ()
ORDER BY
VID SETTINGS index_granularity = 8192;

我们的目标是实现开窗查询:

ROW_NUMBER() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerate
DENSE_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateDense
UNIQ_RANK() OVER( PARTITION BY id ORDER BY val )==>arrayEnumerateUniq

代码如下

SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id

 数组展开,利用ARRAY JOIN将数组展开,并按照customer_id 、loan_id  列排序:

SELECT 
customer_id
,loan_id
,loan_dt
,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
,due_days
,row_number 
,dense_rank
,uniq_rank
from 
(
SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.xxx_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
)
ARRAY JOIN  
    loan_dt,
    loan_id, 
    end_date,
    due_days,
    row_number,     
    dense_rank,     
    uniq_rank
ORDER BY     
customer_id ASC,     
row_number ASC ,     
dense_rank ASC

 技巧:因为end_date可能为空值,会导致array长度不一致。报错。需要用特数值填充然后最后再转换回来。

lag/lead实现:

neighbor(column, offset[, default_value])

函数的结果取决于受影响的数据块和数据块中数据的顺序。

如果使用ORDER BY进行子查询,并从子查询外部调用该函数,则可以得到预期的结果。

参数

列-列名或标量表达式。

偏移量—从列的当前行向前或向后的行数。Int64。

默认值-可选。如果偏移量超出块的范围,则返回的值。受影响的数据块的类型。

返回值

如果偏移值不在块边界之外,则当前行偏移距离中的列的值。

如果偏移值超出块边界,则列的默认值。如果给定了默认的_值,则将使用它。

类型:受影响数据块的类型或默认值类型。

代码如下

SELECT 
customer_id
,loan_id
,loan_dt
,IF(end_date=toDate('2099-12-31'),null,end_date) as end_dt
,due_days
,row_number as row_num
,dense_rank
,uniq_rank
,if(neighbor(row_num , 1)>1,neighbor(loan_dt , 1),null)  as lead_loan_dt
,if(row_num<>1,neighbor(end_dt, -1),null)  as lag_end_dt
from 
(
SELECT     
customer_id ,     
groupArray(loan_dt) AS loan_dt, 
groupArray(ifnull(end_date,toDate('2099-12-31'))) AS end_date,  
groupArray(due_days) AS due_days,    
groupArray(loan_id) AS loan_id,    
arrayEnumerate(loan_id) AS row_number,     
arrayEnumerateDense(loan_id) AS dense_rank,     
arrayEnumerateUniq(loan_id) AS uniq_rank
FROM (    
    SELECT   *  FROM res_report.ipeso_loan     ORDER BY loan_dt ,loan_id  
)
GROUP BY customer_id
)
ARRAY JOIN  
    loan_dt,
    loan_id, 
    end_date,
    due_days,
    row_number,     
    dense_rank,     
    uniq_rank
ORDER BY     
customer_id ASC,     
row_number ASC ,     
dense_rank ASC