高级查询
除了基本的连接、聚合、分组操作外,MatrixDB还提供了许多高级分析函数,如窗口函数、CTE等。本节将介绍常用的高级分析方法。
下面以统计磁盘使用量为例,演示如何使用高级分析。
简化起见,我们要统计的磁盘信息量信息仅包括读写速度,即:read,write。
建指标表模式如下:
CREATE TABLE disk(
time timestamp with time zone,
tag_id int,
read float,
write float
)
Distributed by (tag_id);
1. 窗口函数
窗口函数在与当前元组相关的一组元组上执行聚合运算。与聚合函数不同的是,聚合函数作用于由GROUP BY子句定义的分组,而窗口函数作用于由一个OVER子句定义的与当前元组相关的元组集合。
下面列出了一些常用的使用窗口函数的案例:
1.1 计算百分位数
percentile_cont用来计算百分位数
ymatrix=# SELECT tag_id,
percentile_cont(0.2) WITHIN GROUP (ORDER BY read) as read,
percentile_cont(0.3) WITHIN GROUP (ORDER BY write) as write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
--------+-------+-------
1 | 19.87 | 29.86
2 | 19.95 | 29.88
3 | 20.06 | 29.93
(3 rows)
参数表示百分比,如果传0.5相当于计算中位数。
1.2 累积和
通过嵌套使用SUM方法,可以计算累积和
如下sql计算了2021-04-10 21:00:00到2021-04-10 21:00:10间tag_id=1的磁盘读写累积和:
ymatrix=# SELECT time,
SUM(SUM(read)) OVER (ORDER BY time) AS read,
SUM(SUM(write)) OVER (ORDER BY time) AS write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:00:10'::timestamp
AND tag_id = 1
GROUP BY time
ORDER BY time;
time | read | write
------------------------+--------+--------------------
2021-04-10 21:00:00+08 | 81.07 | 73.3
2021-04-10 21:00:01+08 | 110.63 | 121.77
2021-04-10 21:00:02+08 | 202.12 | 201.36
2021-04-10 21:00:03+08 | 263.74 | 257.88
2021-04-10 21:00:04+08 | 361.6 | 299.3
2021-04-10 21:00:05+08 | 394.49 | 327.33000000000004
2021-04-10 21:00:06+08 | 438.3 | 334.98
2021-04-10 21:00:07+08 | 523.35 | 431.39
2021-04-10 21:00:08+08 | 583.15 | 461.84
2021-04-10 21:00:09+08 | 609.01 | 533.03
2021-04-10 21:00:10+08 | 669.52 | 535.9
(11 rows)
1.3 移动平均值
移动平均值用来计算该条记录与前n条的平均值
如下sql计算了tag_id为1的磁盘,在2021-4-10 21点到21点1分每10秒的平均读写(对于前9条数据,只是计算了满足条件行数的平均值):
ymatrix=# SELECT time,
round(AVG(read) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS read,
round(AVG(write) OVER(ORDER BY time ROWS BETWEEN 9 PRECEDING AND CURRENT ROW)) AS write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time DESC;
time | read | write
------------------------+------+-------
2021-04-10 21:01:00+08 | 57 | 57
2021-04-10 21:00:59+08 | 49 | 60
2021-04-10 21:00:58+08 | 52 | 56
2021-04-10 21:00:57+08 | 51 | 57
2021-04-10 21:00:56+08 | 53 | 65
2021-04-10 21:00:55+08 | 48 | 64
2021-04-10 21:00:54+08 | 49 | 64
2021-04-10 21:00:53+08 | 47 | 54
2021-04-10 21:00:52+08 | 44 | 54
2021-04-10 21:00:51+08 | 41 | 56
......
1.4 增量
增量通常用来计算对于一个单调序列增幅或降幅,也可以简单的用来计算与前一条数据的变化。
如下语句计算了tag_id为1的磁盘,在2021-4-10 21点到21点1分期间磁盘读的变化值,正数为相比上一秒增长,负数为相比上一秒下降:
ymatrix=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) AS read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
time | read
------------------------+------
2021-04-10 21:00:00+08 |
2021-04-10 21:00:01+08 | -52
2021-04-10 21:00:02+08 | 62
2021-04-10 21:00:03+08 | -30
2021-04-10 21:00:04+08 | 36
2021-04-10 21:00:05+08 | -65
2021-04-10 21:00:06+08 | 11
2021-04-10 21:00:07+08 | 41
2021-04-10 21:00:08+08 | -25
2021-04-10 21:00:09+08 | -34
......
1.5 增速
在增量的基础上,再除以时间间隔,就可以得到增速(因为样例数据是1秒采样一次,所以看到的和增量的结果相同):
ymatrix=# SELECT
time,
(
CASE WHEN lag(read) OVER (ORDER BY time) IS NULL THEN NULL
ELSE round(read - lag(read) OVER (ORDER BY time))
END
) / extract(epoch from time - lag(time) OVER (ORDER BY time)) AS read_rate,
extract(epoch from time - lag(time) OVER (ORDER BY time)) AS "time lag"
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1
ORDER BY time;
time | read_rate | time lag
------------------------+-----------+----------
2021-04-10 21:00:00+08 | |
2021-04-10 21:00:01+08 | -52 | 1
2021-04-10 21:00:02+08 | 62 | 1
2021-04-10 21:00:03+08 | -30 | 1
2021-04-10 21:00:04+08 | 36 | 1
2021-04-10 21:00:05+08 | -65 | 1
2021-04-10 21:00:06+08 | 11 | 1
2021-04-10 21:00:07+08 | 41 | 1
2021-04-10 21:00:08+08 | -25 | 1
2021-04-10 21:00:09+08 | -34 | 1
......
1.6 变化点
变化点列出相比于前一条有变化的记录,该类型查询适合在比较平稳的数据集中找发生变化的点:
ymatrix=# SELECT time, read FROM (
SELECT time,
read,
read - lag(read) OVER (ORDER BY TIME) AS diff
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 21:01:00'::timestamp
AND tag_id = 1 ) ht
WHERE diff IS NULL OR diff != 0
ORDER BY time;
time | read
------------------------+-------
2021-04-10 21:00:00+08 | 81.07
2021-04-10 21:00:01+08 | 29.56
2021-04-10 21:00:02+08 | 91.49
2021-04-10 21:00:03+08 | 61.62
2021-04-10 21:00:04+08 | 97.86
2021-04-10 21:00:05+08 | 32.89
2021-04-10 21:00:06+08 | 43.81
2021-04-10 21:00:07+08 | 85.05
2021-04-10 21:00:08+08 | 59.8
2021-04-10 21:00:09+08 | 25.86
(10 rows)
2. CTE
通用表表达式允许SQL语句定义临时的视图以帮助简化大型查询语句的结构。
如下CTE计算了所有设备中读速度平均值的最大值和最小值:
WITH avg_read (tag_id, avg_read) AS (
SELECT tag_id, AVG(read) AS read FROM disk GROUP BY tag_id
) SELECT MAX(avg_read), MIN(avg_read) FROM avg_read;
3. 时序函数
MatrixDB提供的时序组件matrixts
中还提供了时序场景中经常使用的时序函数,首先要创建该组件:
stats=# CREATE EXTENSION matrixts;
CREATE EXTENSION
3.1 time_bucket
time_bucket可以计算出给定时间段的平均值
如下sql计算了tag_id为1的磁盘,在2021-4-10 21点到22点之间每5分钟的平均读写速度:
ymatrix=# SELECT time_bucket('5 minutes', time) AS five_min,
AVG(read) as read,
AVG(write) as write
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
AND tag_id = 1
GROUP BY five_min
ORDER BY five_min;
five_min | read | write
------------------------+--------------------+--------------------
2021-04-10 21:00:00+08 | 48.614599999999996 | 49.48656666666666
2021-04-10 21:05:00+08 | 50.73533333333335 | 49.992566666666654
2021-04-10 21:10:00+08 | 51.6102333333333 | 49.99359999999999
2021-04-10 21:15:00+08 | 49.29116666666669 | 53.89146666666666
2021-04-10 21:20:00+08 | 49.67863333333332 | 50.47406666666665
2021-04-10 21:25:00+08 | 51.09013333333332 | 47.766733333333335
2021-04-10 21:30:00+08 | 49.55949999999999 | 50.440766666666654
2021-04-10 21:35:00+08 | 48.86253333333333 | 50.57290000000001
2021-04-10 21:40:00+08 | 51.061299999999974 | 47.028766666666684
2021-04-10 21:45:00+08 | 52.10353333333333 | 49.861466666666665
2021-04-10 21:50:00+08 | 51.780566666666694 | 51.4159
2021-04-10 21:55:00+08 | 51.83549999999998 | 49.124366666666674
2021-04-10 22:00:00+08 | 93.96 | 91.07
(13 rows)
3.2 time_bucket_gapfill
如果时间段中有数据缺失,为了避免产生错误查询结果,可以使用time_bucket_gapfill函数为缺失数据做填充。有两种填充策略:
- locf:用聚合组中之前出现的值填充
- interpolate:对缺失的值做线性插值填充
例如:
SELECT time_bucket_gapfill('5 minutes', time) AS five_min,
locf(AVG(read)) as locf_read,
interpolate(AVG(read)) as interpolate_read
FROM disk
WHERE time BETWEEN '2021-04-10 21:00:00'::timestamp AND '2021-04-10 22:00:00'::timestamp
AND tag_id = 1
GROUP BY five_min
ORDER BY five_min;
3.3 first/last
first返回时间最早的值:
ymatrix=# SELECT tag_id,
first(read, time) AS read,
first(write, time) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
--------+-------+-------
1 | 11.51 | 86.61
2 | 50.07 | 25.9
3 | 83.72 | 10.5
(3 rows)
last返回时间最晚的值:
ymatrix=# SELECT tag_id,
last(read, time) AS read,
last(write, time) AS write
FROM disk
GROUP BY tag_id
ORDER BY tag_id;
tag_id | read | write
--------+-------+-------
1 | 5.32 | 4.96
2 | 5.73 | 34.73
3 | 49.03 | 86.02
(3 rows)