PostgreSQL 直转横统计 一些方式的探讨
前言
在之前的一些分享中,有一些关于直转横的方式,也有固定时段与不固定时段的统计方式.
这篇会对直转横统计及统计后直转横做一些探讨.
直接统计的传统方式
测试资料
create table t231122a ( id int not null generated always as identity primary key, gal text, dt date, qty int);insert into t231122a (gal, dt, qty)select gal , date '2023-10-31' + interval '1 day' * n , floor(random() * 100) from generate_series(1, 4) n , (values ('水卜樱'), ('石川澪'), ('七沢米亚')) a(gal);-- 第二次insert into t231122a (gal, dt, qty)select gal , date '2023-10-31' + interval '1 day' * n , floor(random() * 100) from generate_series(1, 4) n , (values ('水卜樱'), ('石川澪'), ('七沢米亚')) a(gal);select * from t231122a; id | gal | dt | qty ----+----------+------------+----- 1 | 水卜樱 | 2023-11-01 | 83 2 | 石川澪 | 2023-11-01 | 49 3 | 七沢米亚 | 2023-11-01 | 46 4 | 水卜樱 | 2023-11-02 | 84 5 | 石川澪 | 2023-11-02 | 81 6 | 七沢米亚 | 2023-11-02 | 48 7 | 水卜樱 | 2023-11-03 | 25 8 | 石川澪 | 2023-11-03 | 86 9 | 七沢米亚 | 2023-11-03 | 60 10 | 水卜樱 | 2023-11-04 | 27 11 | 石川澪 | 2023-11-04 | 92 12 | 七沢米亚 | 2023-11-04 | 37 13 | 水卜樱 | 2023-11-01 | 62 14 | 石川澪 | 2023-11-01 | 76 15 | 七沢米亚 | 2023-11-01 | 65 16 | 水卜樱 | 2023-11-02 | 82 17 | 石川澪 | 2023-11-02 | 61 18 | 七沢米亚 | 2023-11-02 | 58 19 | 水卜樱 | 2023-11-03 | 47 20 | 石川澪 | 2023-11-03 | 80 21 | 七沢米亚 | 2023-11-03 | 63 22 | 水卜樱 | 2023-11-04 | 8 23 | 石川澪 | 2023-11-04 | 61 24 | 七沢米亚 | 2023-11-04 | 21(24 rows)
使用 case 的方式
select gal , sum(case when dt = date '2023-11-01' then qty else 0 end) as d1 , sum(case when dt = date '2023-11-02' then qty else 0 end) as d2 , sum(case when dt = date '2023-11-03' then qty else 0 end) as d3 , sum(case when dt = date '2023-11-04' then qty else 0 end) as d4 from t231122a group by gal; gal | d1 | d2 | d3 | d4 ----------+-----+-----+-----+----- 水卜樱 | 145 | 166 | 72 | 35 石川澪 | 125 | 142 | 166 | 153 七沢米亚 | 111 | 106 | 123 | 58(3 rows)
case方式,大多数主流资料库都有支援.缺点是繁琐,语法写起来不方便.
时常会看到一些写的很长的,典型的是12个月份的.
filter 子句的方式
select gal , sum(qty) filter (where dt = date '2023-11-01') as d1 , sum(qty) filter (where dt = date '2023-11-02') as d2 , sum(qty) filter (where dt = date '2023-11-03') as d3 , sum(qty) filter (where dt = date '2023-11-04') as d4 from t231122a group by gal; gal | d1 | d2 | d3 | d4 ----------+-----+-----+-----+----- 水卜樱 | 145 | 166 | 72 | 35 石川澪 | 125 | 142 | 166 | 153 七沢米亚 | 111 | 106 | 123 | 58(3 rows)
filter 是PostgreSQL特有的,语法会比case 方便清晰.
不只是sum()之类的可以使用.
select gal , array_agg(qty) filter (where dt = date '2023-11-01') as d1 , array_agg(qty) filter (where dt = date '2023-11-02') as d2 , array_agg(qty) filter (where dt = date '2023-11-03') as d3 , array_agg(qty) filter (where dt = date '2023-11-04') as d4 from t231122a group by gal; gal | d1 | d2 | d3 | d4 ----------+---------+---------+---------+--------- 水卜樱 | {83,62} | {84,82} | {25,47} | {27,8} 石川澪 | {49,76} | {81,61} | {86,80} | {92,61} 七沢米亚 | {46,65} | {48,58} | {60,63} | {37,21}
这样在做发展的过程中,可以方便对照,确认数字的正确性.
统计后再直转横
辅助表的方式
可以透过辅助表join之后取得结果,在我一些分享中有介绍过,在此就不赘述.
pivot 指令或是 tablefunc
Oracle / SQL Server 有 pivot
PostgreSQL 使用 tablefunc
https://www.postgresql.org/docs/current/tablefunc.html
聚合成 json / hstore 再展开
这两种方式,我有分享,请参考传送门
聚合后再展开方式的延伸探讨
在常见的应用中,除了之前有提到的12个月分,还有一个月份,或是一段日期也许是20天或是45天等等.像是商品的每日销售额,或是议题的点击量.
在之前的分享中,是以tag为例,但是因为日期是数字开头,这样不能建立栏位名称,需要变通一下.所以就用来举例,顺带把方式变通方式一併介绍.
使用array 聚合
with t1 as (select gal, dt, sum(qty) as qtys from t231122a group by gal, dt)select gal , array_agg(dt order by dt) as dtarr , array_agg(qtys order by dt) as qtyarr from t1 group by gal; gal | dtarr | qtyarr ----------+-----------------------------------------------+------------------- 水卜樱 | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {145,166,72,35} 石川澪 | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {125,142,166,153} 七沢米亚 | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {111,106,123,58}(3 rows)
array 再转为 hstore
with t1 as (select gal, dt, sum(qty) as qtys from t231122a group by gal, dt), t2 as (select gal , array_agg(dt order by dt) as dtarr , array_agg(qtys order by dt) as qtyarr from t1 group by gal)select gal , hstore(dtarr::text[], qtyarr::text[]) as hs1 from t2;-[ RECORD 1 ]---------------------------------------------------------------------------gal | 水卜樱hs1 | "2023-11-01"=>"145", "2023-11-02"=>"166", "2023-11-03"=>"72", "2023-11-04"=>"35"-[ RECORD 2 ]---------------------------------------------------------------------------gal | 石川澪hs1 | "2023-11-01"=>"125", "2023-11-02"=>"142", "2023-11-03"=>"166", "2023-11-04"=>"153"-[ RECORD 3 ]---------------------------------------------------------------------------gal | 七沢米亚hs1 | "2023-11-01"=>"111", "2023-11-02"=>"106", "2023-11-03"=>"123", "2023-11-04"=>"58"
将日期格式变化
with t1 as (select gal, dt, sum(qty) as qtys from t231122a group by gal, dt), t2 as (select gal , array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr , array_agg(qtys::text order by dt) as qtyarr from t1 group by gal)select gal , hstore(dtarr, qtyarr) as hs1 from t2; gal | hs1 ----------+---------------------------------------------------------------- 水卜樱 | "d1101"=>"145", "d1102"=>"166", "d1103"=>"72", "d1104"=>"35" 石川澪 | "d1101"=>"125", "d1102"=>"142", "d1103"=>"166", "d1104"=>"153" 七沢米亚 | "d1101"=>"111", "d1102"=>"106", "d1103"=>"123", "d1104"=>"58"(3 rows)
使用hsotre展开
DO LANGUAGE plpgsql $$DECLARE v_sqlstring TEXT = '';BEGINv_sqlstring := concat('create temp table tmpxt231122b as select ', (select string_agg(concat('NULL::int AS d', to_char(dt, 'mmDD')), ' ,' order by dt) from generate_series(date '2023-11-01' , date '2023-11-04', interval '1 day') dt ));EXECUTE(v_sqlstring);END $$;---# \d tmpxt231122b Table "pg_temp_3.tmpxt231122b" Column | Type | Collation | Nullable | Default --------+---------+-----------+----------+--------- d1101 | integer | | | d1102 | integer | | | d1103 | integer | | | d1104 | integer | | | with t1 as (select gal, dt, sum(qty) as qtys from t231122a group by gal, dt), t2 as (select gal , array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr , array_agg(qtys order by dt) as qtyarr from t1 group by gal), t3 as (select gal , hstore(dtarr, qtyarr::text[]) as hs1 from t2)select gal , (populate_record(null::tmpxt231122b, hs1)).* from t3; gal | d1101 | d1102 | d1103 | d1104 ----------+-------+-------+-------+------- 水卜樱 | 145 | 166 | 72 | 35 石川澪 | 125 | 142 | 166 | 153 七沢米亚 | 111 | 106 | 123 | 58(3 rows)
统计聚合后先存起来
在前面的例子中,已经把日期的方式做了展示,array当基础,可以转为hstore,需要时再做展开.
因为统计需要消耗计算资源与时间,加以做统计时纪录表可以正在做写入等动作,所以我们可以先将统计资料存起来,避免多次存取.
create table t231122c ( id int not null generated always as identity primary key, gal text, qtys hstore, qtyarr int[]);with t1 as (select gal, dt, sum(qty) as qtys from t231122a group by gal, dt), t2 as (select gal , array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr , array_agg(qtys order by dt) as qtyarr from t1 group by gal)insert into t231122c(gal, qtys, qtyarr)select gal , hstore(dtarr, qtyarr::text[]) , qtyarr from t2; select * from t231122c;-[ RECORD 1 ]----------------------------------------------------------id | 1gal | 水卜樱qtys | "d1101"=>"145", "d1102"=>"166", "d1103"=>"72", "d1104"=>"35"qtyarr | {145,166,72,35}-[ RECORD 2 ]----------------------------------------------------------id | 2gal | 石川澪qtys | "d1101"=>"125", "d1102"=>"142", "d1103"=>"166", "d1104"=>"153"qtyarr | {125,142,166,153}-[ RECORD 3 ]----------------------------------------------------------id | 3gal | 七沢米亚qtys | "d1101"=>"111", "d1102"=>"106", "d1103"=>"123", "d1104"=>"58"qtyarr | {111,106,123,58}
直转横之后的应用
资料库适合做大量资料的处理,但是每次都要下指令.当我们要做资料分析时,需要资料库把资料捞出来,这时候看到资料,才能再做后续的处理,例如挑选某些日期某些类别再做加总等等.所以才会有先把资料聚合后直转横,再传给试算表或是给R,Python Panda 再来做处理,像是某个日期的加总或平均,较为方便直观.
但是试算表之类的速度较慢,若要用资料库,可能建立如31个日期栏位,再把直转横的结果存看来,但是这样要做纵向的加总平均,SQL 指令会很繁琐.
vector 的引进
最近vector已经开始引进了.在PostgreSQL 有 pgvector 这个extension.传送门
vector 的应用
-- 将array 转为 vectorselect id , array_to_vector(qtyarr,4,true) , array_to_vector(qtyarr,4,false) , qtyarr::vector from t231122c; id | array_to_vector | array_to_vector | qtyarr ----+-------------------+-------------------+------------------- 1 | [145,166,72,35] | [145,166,72,35] | [145,166,72,35] 2 | [125,142,166,153] | [125,142,166,153] | [125,142,166,153] 3 | [111,106,123,58] | [111,106,123,58] | [111,106,123,58](3 rows)-- 纵向加总with t1 as (select qtyarr::vector as v1 from t231122c)select sum(v1) from t1; sum ------------------- [381,414,361,246]select qtyarr from t231122c; qtyarr ------------------- {145,166,72,35} {125,142,166,153} {111,106,123,58}(3 rows)-- 验算select 145+125+111; ?column? ---------- 381-- 纵向平均with t1 as (select qtyarr::vector as v1 from t231122c)select sum(v1) , avg(v1) from t1; sum | avg -------------------+------------------------- [381,414,361,246] | [127,138,120.333336,82](1 row)select 381 / 3; ?column? ---------- 127
可以观察到,我们可以透过array转为vector,可以很方便的做纵向的加总与平均.
以array为中心
我们可以做单日加总后,存到array,同时将array转为hstore与vector存到分析工作用的table.
需要转为单日栏位的,可以透过前面介绍的方式,转出来.
也可以转为json,方便传给其他系统.
也可以利用vector,方便做大量的纵向加总平均.这样当我们在做分析时,
例如品项也可以带tag,而这些tag也是利用array存放,还记得前面分享的佐山爱吗?这样就能方便分类计算,试算表需要一行行拉,资料库一下就30天一起算,更不用说强大的过滤条件.
至于array横向的计算,之前有一篇做多栏位计算,就是利用array,所以array的横向计算是不成问题的,另外也有extension提供了一些函数,这个会另外安排分享.
结语
连续几篇,介绍了一些统计的方法,希望这些分享对大家能有一些帮助.
感谢
感谢水卜樱,石川澪,七沢米亚,当然还有爱酱.