PostgreSQL 直转横统计 一些方式的探讨

PostgreSQL 直转横统计 一些方式的探讨

前言

在之前的一些分享中,有一些关于直转横的方式,也有固定时段与不固定时段的统计方式.
这篇会对直转横统计及统计后直转横做一些探讨.

直接统计的传统方式

测试资料

create table t231122a (  id int not null generated always as identity primary key, gal text, dt date, qty int);insert into t231122a (gal, dt, qty)select gal     , date '2023-10-31' + interval '1 day' * n     , floor(random() * 100)  from generate_series(1, 4) n     , (values ('水卜樱'), ('石川澪'), ('七沢米亚')) a(gal);-- 第二次insert into t231122a (gal, dt, qty)select gal     , date '2023-10-31' + interval '1 day' * n     , floor(random() * 100)  from generate_series(1, 4) n     , (values ('水卜樱'), ('石川澪'), ('七沢米亚')) a(gal);select *  from t231122a; id |   gal    |     dt     | qty ----+----------+------------+-----  1 | 水卜樱   | 2023-11-01 |  83  2 | 石川澪   | 2023-11-01 |  49  3 | 七沢米亚 | 2023-11-01 |  46  4 | 水卜樱   | 2023-11-02 |  84  5 | 石川澪   | 2023-11-02 |  81  6 | 七沢米亚 | 2023-11-02 |  48  7 | 水卜樱   | 2023-11-03 |  25  8 | 石川澪   | 2023-11-03 |  86  9 | 七沢米亚 | 2023-11-03 |  60 10 | 水卜樱   | 2023-11-04 |  27 11 | 石川澪   | 2023-11-04 |  92 12 | 七沢米亚 | 2023-11-04 |  37 13 | 水卜樱   | 2023-11-01 |  62 14 | 石川澪   | 2023-11-01 |  76 15 | 七沢米亚 | 2023-11-01 |  65 16 | 水卜樱   | 2023-11-02 |  82 17 | 石川澪   | 2023-11-02 |  61 18 | 七沢米亚 | 2023-11-02 |  58 19 | 水卜樱   | 2023-11-03 |  47 20 | 石川澪   | 2023-11-03 |  80 21 | 七沢米亚 | 2023-11-03 |  63 22 | 水卜樱   | 2023-11-04 |   8 23 | 石川澪   | 2023-11-04 |  61 24 | 七沢米亚 | 2023-11-04 |  21(24 rows)

使用 case 的方式

select gal     , sum(case        when dt = date '2023-11-01' then qty        else 0        end) as d1     , sum(case        when dt = date '2023-11-02' then qty        else 0        end) as d2     , sum(case        when dt = date '2023-11-03' then qty        else 0        end) as d3     , sum(case        when dt = date '2023-11-04' then qty        else 0        end) as d4  from t231122a group by gal;   gal    | d1  | d2  | d3  | d4  ----------+-----+-----+-----+----- 水卜樱   | 145 | 166 |  72 |  35 石川澪   | 125 | 142 | 166 | 153 七沢米亚 | 111 | 106 | 123 |  58(3 rows)

case方式,大多数主流资料库都有支援.缺点是繁琐,语法写起来不方便.
时常会看到一些写的很长的,典型的是12个月份的.

filter 子句的方式

select gal     , sum(qty) filter (where dt = date '2023-11-01') as d1     , sum(qty) filter (where dt = date '2023-11-02') as d2     , sum(qty) filter (where dt = date '2023-11-03') as d3     , sum(qty) filter (where dt = date '2023-11-04') as d4  from t231122a group by gal;   gal    | d1  | d2  | d3  | d4  ----------+-----+-----+-----+----- 水卜樱   | 145 | 166 |  72 |  35 石川澪   | 125 | 142 | 166 | 153 七沢米亚 | 111 | 106 | 123 |  58(3 rows)

filter 是PostgreSQL特有的,语法会比case 方便清晰.
不只是sum()之类的可以使用.

select gal     , array_agg(qty) filter (where dt = date '2023-11-01') as d1     , array_agg(qty) filter (where dt = date '2023-11-02') as d2     , array_agg(qty) filter (where dt = date '2023-11-03') as d3     , array_agg(qty) filter (where dt = date '2023-11-04') as d4  from t231122a group by gal;   gal    |   d1    |   d2    |   d3    |   d4    ----------+---------+---------+---------+--------- 水卜樱   | {83,62} | {84,82} | {25,47} | {27,8} 石川澪   | {49,76} | {81,61} | {86,80} | {92,61} 七沢米亚 | {46,65} | {48,58} | {60,63} | {37,21}

这样在做发展的过程中,可以方便对照,确认数字的正确性.

统计后再直转横

辅助表的方式

可以透过辅助表join之后取得结果,在我一些分享中有介绍过,在此就不赘述.

pivot 指令或是 tablefunc

Oracle / SQL Server 有 pivot
PostgreSQL 使用 tablefunc
https://www.postgresql.org/docs/current/tablefunc.html

聚合成 json / hstore 再展开

这两种方式,我有分享,请参考传送门

聚合后再展开方式的延伸探讨

在常见的应用中,除了之前有提到的12个月分,还有一个月份,或是一段日期也许是20天或是45天等等.像是商品的每日销售额,或是议题的点击量.
在之前的分享中,是以tag为例,但是因为日期是数字开头,这样不能建立栏位名称,需要变通一下.所以就用来举例,顺带把方式变通方式一併介绍.

使用array 聚合

with t1 as (select gal, dt, sum(qty) as qtys  from t231122a group by gal, dt)select gal     , array_agg(dt order by dt) as dtarr     , array_agg(qtys order by dt) as qtyarr  from t1 group by gal;   gal    |                     dtarr                     |      qtyarr       ----------+-----------------------------------------------+------------------- 水卜樱   | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {145,166,72,35} 石川澪   | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {125,142,166,153} 七沢米亚 | {2023-11-01,2023-11-02,2023-11-03,2023-11-04} | {111,106,123,58}(3 rows)

array 再转为 hstore

with t1 as (select gal, dt, sum(qty) as qtys  from t231122a group by gal, dt), t2 as (select gal     , array_agg(dt order by dt) as dtarr     , array_agg(qtys order by dt) as qtyarr  from t1 group by gal)select gal     , hstore(dtarr::text[], qtyarr::text[]) as hs1  from t2;-[ RECORD 1 ]---------------------------------------------------------------------------gal | 水卜樱hs1 | "2023-11-01"=>"145", "2023-11-02"=>"166", "2023-11-03"=>"72", "2023-11-04"=>"35"-[ RECORD 2 ]---------------------------------------------------------------------------gal | 石川澪hs1 | "2023-11-01"=>"125", "2023-11-02"=>"142", "2023-11-03"=>"166", "2023-11-04"=>"153"-[ RECORD 3 ]---------------------------------------------------------------------------gal | 七沢米亚hs1 | "2023-11-01"=>"111", "2023-11-02"=>"106", "2023-11-03"=>"123", "2023-11-04"=>"58"

将日期格式变化

with t1 as (select gal, dt, sum(qty) as qtys  from t231122a group by gal, dt), t2 as (select gal     , array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr     , array_agg(qtys::text order by dt) as qtyarr  from t1 group by gal)select gal     , hstore(dtarr, qtyarr) as hs1  from t2;     gal    |                              hs1                               ----------+---------------------------------------------------------------- 水卜樱   | "d1101"=>"145", "d1102"=>"166", "d1103"=>"72", "d1104"=>"35" 石川澪   | "d1101"=>"125", "d1102"=>"142", "d1103"=>"166", "d1104"=>"153" 七沢米亚 | "d1101"=>"111", "d1102"=>"106", "d1103"=>"123", "d1104"=>"58"(3 rows)

使用hsotre展开

DO LANGUAGE plpgsql $$DECLARE v_sqlstring TEXT = '';BEGINv_sqlstring := concat('create temp table tmpxt231122b as select ',                (select string_agg(concat('NULL::int AS d', to_char(dt, 'mmDD')), ' ,'                         order by dt)                   from generate_series(date '2023-11-01'                        , date '2023-11-04', interval '1 day') dt                   ));EXECUTE(v_sqlstring);END $$;---# \d tmpxt231122b          Table "pg_temp_3.tmpxt231122b" Column |  Type   | Collation | Nullable | Default --------+---------+-----------+----------+--------- d1101  | integer |           |          |  d1102  | integer |           |          |  d1103  | integer |           |          |  d1104  | integer |           |          | with t1 as (select gal, dt, sum(qty) as qtys  from t231122a group by gal, dt), t2 as (select gal     , array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr     , array_agg(qtys order by dt) as qtyarr  from t1 group by gal), t3 as (select gal     , hstore(dtarr, qtyarr::text[]) as hs1  from t2)select gal     , (populate_record(null::tmpxt231122b, hs1)).*  from t3;   gal    | d1101 | d1102 | d1103 | d1104 ----------+-------+-------+-------+------- 水卜樱   |   145 |   166 |    72 |    35 石川澪   |   125 |   142 |   166 |   153 七沢米亚 |   111 |   106 |   123 |    58(3 rows)

统计聚合后先存起来

在前面的例子中,已经把日期的方式做了展示,array当基础,可以转为hstore,需要时再做展开.
因为统计需要消耗计算资源与时间,加以做统计时纪录表可以正在做写入等动作,所以我们可以先将统计资料存起来,避免多次存取.

create table t231122c (  id int not null generated always as identity primary key, gal text, qtys hstore, qtyarr int[]);with t1 as (select gal, dt, sum(qty) as qtys  from t231122a group by gal, dt), t2 as (select gal     , array_agg('d'|| to_char(dt, 'mmDD') order by dt) as dtarr     , array_agg(qtys order by dt) as qtyarr  from t1 group by gal)insert into t231122c(gal, qtys, qtyarr)select gal     , hstore(dtarr, qtyarr::text[])     , qtyarr  from t2;  select *  from t231122c;-[ RECORD 1 ]----------------------------------------------------------id     | 1gal    | 水卜樱qtys   | "d1101"=>"145", "d1102"=>"166", "d1103"=>"72", "d1104"=>"35"qtyarr | {145,166,72,35}-[ RECORD 2 ]----------------------------------------------------------id     | 2gal    | 石川澪qtys   | "d1101"=>"125", "d1102"=>"142", "d1103"=>"166", "d1104"=>"153"qtyarr | {125,142,166,153}-[ RECORD 3 ]----------------------------------------------------------id     | 3gal    | 七沢米亚qtys   | "d1101"=>"111", "d1102"=>"106", "d1103"=>"123", "d1104"=>"58"qtyarr | {111,106,123,58}

直转横之后的应用

资料库适合做大量资料的处理,但是每次都要下指令.当我们要做资料分析时,需要资料库把资料捞出来,这时候看到资料,才能再做后续的处理,例如挑选某些日期某些类别再做加总等等.所以才会有先把资料聚合后直转横,再传给试算表或是给R,Python Panda 再来做处理,像是某个日期的加总或平均,较为方便直观.
但是试算表之类的速度较慢,若要用资料库,可能建立如31个日期栏位,再把直转横的结果存看来,但是这样要做纵向的加总平均,SQL 指令会很繁琐.

vector 的引进

最近vector已经开始引进了.在PostgreSQL 有 pgvector 这个extension.传送门

vector 的应用

-- 将array 转为 vectorselect id     , array_to_vector(qtyarr,4,true)     , array_to_vector(qtyarr,4,false)     , qtyarr::vector  from t231122c; id |  array_to_vector  |  array_to_vector  |      qtyarr       ----+-------------------+-------------------+-------------------  1 | [145,166,72,35]   | [145,166,72,35]   | [145,166,72,35]  2 | [125,142,166,153] | [125,142,166,153] | [125,142,166,153]  3 | [111,106,123,58]  | [111,106,123,58]  | [111,106,123,58](3 rows)-- 纵向加总with t1 as (select qtyarr::vector as v1  from t231122c)select sum(v1)  from t1;        sum        ------------------- [381,414,361,246]select qtyarr  from t231122c;      qtyarr       ------------------- {145,166,72,35} {125,142,166,153} {111,106,123,58}(3 rows)-- 验算select 145+125+111; ?column? ----------      381-- 纵向平均with t1 as (select qtyarr::vector as v1  from t231122c)select sum(v1)     , avg(v1)  from t1;          sum        |           avg           -------------------+------------------------- [381,414,361,246] | [127,138,120.333336,82](1 row)select 381 / 3; ?column? ----------      127

可以观察到,我们可以透过array转为vector,可以很方便的做纵向的加总与平均.

以array为中心

我们可以做单日加总后,存到array,同时将array转为hstore与vector存到分析工作用的table.
需要转为单日栏位的,可以透过前面介绍的方式,转出来.
也可以转为json,方便传给其他系统.
也可以利用vector,方便做大量的纵向加总平均.这样当我们在做分析时,
例如品项也可以带tag,而这些tag也是利用array存放,还记得前面分享的佐山爱吗?这样就能方便分类计算,试算表需要一行行拉,资料库一下就30天一起算,更不用说强大的过滤条件.
至于array横向的计算,之前有一篇做多栏位计算,就是利用array,所以array的横向计算是不成问题的,另外也有extension提供了一些函数,这个会另外安排分享.

结语

连续几篇,介绍了一些统计的方法,希望这些分享对大家能有一些帮助.

感谢

感谢水卜樱,石川澪,七沢米亚,当然还有爱酱.


关于作者: 网站小编

码农网专注IT技术教程资源分享平台,学习资源下载网站,58码农网包含计算机技术、网站程序源码下载、编程技术论坛、互联网资源下载等产品服务,提供原创、优质、完整内容的专业码农交流分享平台。

热门文章