分组内求topN问题
问题雏形
已知员工表 employee 的结构和数据,
empno 员工号 | ename 员工姓名 | hiredate 入职日期 | sal 薪水 | deptno 部门编号 |
---|---|---|---|---|
VARCHAR(20) | VARCHAR(20) | VARCHAR(20) | int | int |
7521 | WARD | 1981-2-22 | 1250 | 30 |
7566 | JONES | 1981-4-2 | 2975 | 20 |
7876 | ADAMS | 1987-7-13 | 1100 | 20 |
7369 | SMITH | 1980-12-17 | 800 | 20 |
7934 | MILLER | 1982-1-23 | 1300 | 10 |
7844 | TURNER | 1981-9-8 | 1500 | 30 |
7782 | CLARK | 1981-6-9 | 2450 | 10 |
7839 | KING | 1981-11-17 | 5000 | 10 |
7902 | FORD | 1981-12-3 | 3000 | 20 |
7499 | ALLEN | 1981-2-20 | 1600 | 30 |
7654 | MARTIN | 1981-9-28 | 1250 | 30 |
7900 | JAMES | 1981-12-3 | 950 | 30 |
7788 | SCOTT | 1987-7-13 | 3000 | 20 |
7698 | BLAKE | 1981-5-1 | 2850 | 30 |
问题
- 求出每个部门工资最高的前三名员工的信息
- 在上面问题基础上,再计算这些员工的工资占所属部门总工资的百分比
数据准备
create table employee(
empno VARCHAR(20),
ename VARCHAR(20),
hiredate VARCHAR(20),
sal int,
deptno int
);
insert into employee values
('7521', 'WARD', '1981-2-22', 1250, 30),
('7566', 'JONES', '1981-4-2', 2975, 20),
('7876', 'ADAMS', '1987-7-13', 1100, 20),
('7369', 'SMITH', '1980-12-17', 800, 20),
('7934', 'MILLER', '1982-1-23', 1300, 10),
('7844', 'TURNER', '1981-9-8', 1500, 30),
('7782', 'CLARK', '1981-6-9', 2450, 10),
('7839', 'KING', '1981-11-17', 5000, 10),
('7902', 'FORD', '1981-12-3', 3000, 20),
('7499', 'ALLEN', '1981-2-20', 1600, 30),
('7654', 'MARTIN', '1981-9-28', 1250, 30),
('7900', 'JAMES', '1981-12-3', 950, 30),
('7788', 'SCOTT', '1987-7-13', 3000, 20),
('7698', 'BLAKE', '1981-5-1', 2850, 30);
select * from employee;
实现
SELECT
*, CONCAT(ROUND(sal / s_sal,2)*100, '%') as rate
FROM(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY deptno ORDER BY sal DESC) as rn,
sum(sal) OVER(PARTITION BY deptno) as s_sal
FROM employee
) as n
WHERE rn <= 3
依旧是用窗口函数进行计算,除排序外还需要进行组内求和
延申问题
有一个订单表 t_order 。他的字段有,user_id(用户 id), order_id(订单编号), cdate(订单日期),city_id(城市),sale_num(商品个数),sku_id(商品编号)。
user_id | order_id | cdate | city_id | sale_num | sku_id |
---|---|---|---|---|---|
1 | o1 | 2022-05-06 | 北京 | 2 | aj鞋001 |
1 | o1 | 2022-05-06 | 北京 | 3 | ck裤001 |
2 | o2 | 2022-05-06 | 北京 | 1 | xtep衣001 |
3 | o3 | 2022-05-06 | 北京 | 2 | hw手机001 |
4 | o4 | 2022-05-06 | 北京 | 1 | mi耳机001 |
5 | o5 | 2022-05-06 | 上海 | 2 | aj鞋002 |
6 | o6 | 2022-05-06 | 上海 | 3 | ck裤002 |
7 | o7 | 2022-05-06 | 上海 | 1 | xtep衣002 |
8 | o8 | 2022-05-06 | 武汉 | 2 | hw手机002 |
9 | o9 | 2022-05-06 | 武汉 | 1 | mi耳机002 |
1 | o1 | 2022-05-07 | 深圳 | 2 | aj鞋001 |
1 | o1 | 2022-05-07 | 深圳 | 3 | ck裤001 |
2 | o2 | 2022-05-07 | 深圳 | 1 | xtep衣001 |
3 | o3 | 2022-05-07 | 深圳 | 2 | hw手机001 |
4 | o4 | 2022-05-07 | 广州 | 1 | mi耳机001 |
5 | o5 | 2022-05-07 | 广州 | 2 | aj鞋002 |
6 | o6 | 2022-05-07 | 广州 | 3 | ck裤002 |
7 | o7 | 2022-05-07 | 广州 | 1 | xtep衣002 |
8 | o8 | 2022-05-07 | 北京 | 2 | hw手机002 |
9 | o9 | 2022-05-07 | 北京 | 1 | mi耳机002 |
问题
请计算 2022-05-01 至今每日订单量 top 2 的城市及其订单量(订单量需要对 order_id 去重)
数据准备
create table t_order (
user_id VARCHAR(20),
order_id VARCHAR(20),
cdate VARCHAR(20),
city_id VARCHAR(20),
sale_num int,
sku_id VARCHAR(20)
);
insert into t_order values
('1','o1','2022-05-06','北京',2,'aj鞋001'),
('1','o1','2022-05-06','北京',3,'ck裤001'),
('2','o2','2022-05-06','北京',1,'xtep衣001'),
('3','o3','2022-05-06','北京',2,'hw手机001'),
('4','o4','2022-05-06','北京',1,'mi耳机001'),
('5','o5','2022-05-06','上海',2,'aj鞋002'),
('6','o6','2022-05-06','上海',3,'ck裤002'),
('7','o7','2022-05-06','上海',1,'xtep衣002'),
('8','o8','2022-05-06','武汉',2,'hw手机002'),
('9','o9','2022-05-06','武汉',1,'mi耳机002'),
('11','o11','2022-05-07','深圳',2,'aj鞋001'),
('11','o11','2022-05-07','深圳',3,'ck裤001'),
('12','o12','2022-05-07','深圳',1,'xtep衣001'),
('13','o13','2022-05-07','深圳',2,'hw手机001'),
('14','o14','2022-05-07','广州',1,'mi耳机001'),
('15','o15','2022-05-07','广州',2,'aj鞋002'),
('16','o16','2022-05-07','广州',3,'ck裤002'),
('17','o17','2022-05-07','广州',1,'xtep衣002'),
('18','o18','2022-05-07','北京',2,'hw手机002'),
('19','o19','2022-05-07','北京',1,'mi耳机002');
select * from t_order;
实现
WITH t1 as(
select
cdate,city_id,COUNT(DISTINCT order_id) as cnt
from t_order
WHERE cdate >= '2022-05-01'
GROUP BY cdate,city_id
)
,t2 as(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY cdate ORDER BY cnt DESC) as rn
FROM t1
)
SELECT * FROM t2 WHERE rn <= 2
需要注意题目要求的逻辑是求订单量,所以是需要统计订单数。需要先对原始数据进行处理后再来进行组内排序从而得到结果。
总结
【分组内取 topN 】问题的通用场景是【获取每个 xxx 组内按 yyy 排序的前 n 个 zzz】。
它的解法公式是:
select zzz
from
(select *,
row_number() over (partition by xxx order by yyy) as rn
from employee) t1
where rn<=N;
也就是先 row_number() over(partition by 组名xxx order by yyy) as rn
,再 where筛选rn<=N名
,最后 select获取zzz
。
如果是【获取每个 xxx 组内按 yyy 排序的第 1 个 zzz】场景,解决办法则是将上面公式的最后一行改成 where rn=1
即可。