Practice: AlertManager

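In the previous chapters we introduced the AlertManager API. In this section, we use Postman to simulate Prometheus sending alerts to AlertManager, and then receive the notifications that AlertManager sends out with a program we wrote ourselves.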

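The experiments in this article mainly verify the Group mechanism in AlertManager, as well as the effect of its three configuration parameters: group_wait, group_interval, and repeat_interval.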

Start the custom program

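Start our own program, which listens on port 10000 and provides the POST /webhook API to receive notifications from AlertManager. The program code is in the appendix of this article.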

Group mechanism

Configure AlertManager as follows, then start AlertManager:

route:
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'webhook'

receivers:
- name: webhook
  webhook_configs:
  - url: http://192.168.2.101:10000/webhook
    send_resolved: true

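Next, use Postman to call the AlertManager API POST /api/v2/alerts to send an alert. The alert content (the request body) is shown below. Note that the times must be set properly: StartsAt can be a time in the past, and EndsAt should be one hour or more after the moment you send this request from Postman: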

[
    {
        "Labels": {
            "alertname": "NodeCpuPressure",
            "IP": "192.168.2.101"
        },
        "Annotations": {
            "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "StartsAt": "2020-02-17T23:00:00.000+08:00", 
        "EndsAt": "2020-02-18T23:00:00.000+08:00"
    }
]

Then we call the AlertManager API to query Alerts (GET /api/v2/alerts) and Groups (GET /api/v2/alerts/groups). You can call these directly from a browser or with curl on the command line.
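The Groups query can also be made from Go. The sketch below is only an illustration under assumptions: the address http://localhost:9093 is assumed, the names group, parseGroups and fetch are hypothetical, and only the fields needed for a one-line summary per group are decoded.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// group holds only the parts of a /api/v2/alerts/groups entry used here.
type group struct {
	Labels map[string]string `json:"labels"`
	Alerts []struct {
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

// parseGroups decodes the JSON array returned by GET /api/v2/alerts/groups.
func parseGroups(data []byte) ([]group, error) {
	var groups []group
	err := json.Unmarshal(data, &groups)
	return groups, err
}

// fetch performs a GET request and returns the response body.
func fetch(url string) ([]byte, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	// Assumption: AlertManager listens on localhost:9093.
	data, err := fetch("http://localhost:9093/api/v2/alerts/groups")
	if err != nil {
		fmt.Println("query failed:", err)
		return
	}
	groups, err := parseGroups(data)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, g := range groups {
		fmt.Printf("group %v: %d alert(s)\n", g.Labels, len(g.Alerts))
	}
}
```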

The Alerts query returns:

[
    {
        "annotations": {
            "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "endsAt": "2020-02-18T23:00:00.000+08:00",
        "fingerprint": "27e1a08813b1ec3b",
        "receivers": [
            {
                "name": "webhook"
            }
        ],
        "startsAt": "2020-02-17T23:00:00.000+08:00",
        "status": {
            "inhibitedBy": [],
            "silencedBy": [],
            "state": "active"
        },
        "updatedAt": "2020-02-17T23:38:38.610+08:00",
        "labels": {
            "IP": "192.168.2.101",
            "alertname": "NodeCpuPressure"
        }
    }
]

The Groups query returns:

[
    {
        "alerts": [
            {
                "annotations": {
                    "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
                },
                "endsAt": "2020-02-18T23:00:00.000+08:00",
                "fingerprint": "27e1a08813b1ec3b",
                "receivers": [
                    {
                        "name": "webhook"
                    }
                ],
                "startsAt": "2020-02-17T23:00:00.000+08:00",
                "status": {
                    "inhibitedBy": [],
                    "silencedBy": [],
                    "state": "active"
                },
                "updatedAt": "2020-02-17T23:38:38.610+08:00",
                "labels": {
                    "IP": "192.168.2.101",
                    "alertname": "NodeCpuPressure"
                }
            }
        ],
        "labels": {
            "alertname": "NodeCpuPressure"
        },
        "receiver": {
            "name": "webhook"
        }
    }
]

We find that AlertManager automatically created a Group whose labels are {alertname=NodeCpuPressure}, and that it contains the alert we just sent.
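The relationship between an alert's labels and its Group's labels can be modelled in a few lines of Go. This is an illustrative model of the behaviour observed in this experiment, not AlertManager's actual implementation; the function names groupLabels and groupKey are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// groupLabels keeps only the labels named in groupBy. With
// group_by: ["alertname"], every alert is reduced to its alertname label.
func groupLabels(labels map[string]string, groupBy []string) map[string]string {
	out := make(map[string]string)
	for _, name := range groupBy {
		if v, ok := labels[name]; ok {
			out[name] = v
		}
	}
	return out
}

// groupKey renders a label set in sorted order, so equal sets give equal keys.
func groupKey(labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, name+"="+labels[name])
	}
	return "{" + strings.Join(parts, ",") + "}"
}

func main() {
	groupBy := []string{"alertname"}
	alert := map[string]string{"alertname": "NodeCpuPressure", "IP": "192.168.2.101"}
	fmt.Println(groupKey(groupLabels(alert, groupBy))) // prints {alertname=NodeCpuPressure}
}
```

In this model, two alerts that differ only in the IP label map to the same key, while an alert with a different alertname maps to a different one.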

Next, we send another Alert with the following content:

[
    {
        "Labels": {
            "alertname": "NodeMemoryPressure",
            "IP": "192.168.2.101"
        },
        "Annotations": {
            "summary": "NodeMemoryPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "StartsAt": "2020-02-17T23:00:00.000+08:00", 
        "EndsAt": "2020-02-18T23:00:00.000+08:00"
    }
]

Then query the Groups again. The result below shows that another Group was created, with labels {alertname=NodeMemoryPressure}:

[
    {
        "alerts": [
            {
                "annotations": {
                    "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
                },
                "endsAt": "2020-02-18T23:00:00.000+08:00",
                "fingerprint": "27e1a08813b1ec3b",
                "receivers": [
                    {
                        "name": "webhook"
                    }
                ],
                "startsAt": "2020-02-17T23:00:00.000+08:00",
                "status": {
                    "inhibitedBy": [],
                    "silencedBy": [],
                    "state": "active"
                },
                "updatedAt": "2020-02-17T23:38:38.610+08:00",
                "labels": {
                    "IP": "192.168.2.101",
                    "alertname": "NodeCpuPressure"
                }
            }
        ],
        "labels": {
            "alertname": "NodeCpuPressure"
        },
        "receiver": {
            "name": "webhook"
        }
    },
    {
        "alerts": [
            {
                "annotations": {
                    "summary": "NodeMemoryPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
                },
                "endsAt": "2020-02-18T23:00:00.000+08:00",
                "fingerprint": "1a354c7333c5b062",
                "receivers": [
                    {
                        "name": "webhook"
                    }
                ],
                "startsAt": "2020-02-17T23:00:00.000+08:00",
                "status": {
                    "inhibitedBy": [],
                    "silencedBy": [],
                    "state": "active"
                },
                "updatedAt": "2020-02-17T23:41:27.790+08:00",
                "labels": {
                    "IP": "192.168.2.101",
                    "alertname": "NodeMemoryPressure"
                }
            }
        ],
        "labels": {
            "alertname": "NodeMemoryPressure"
        },
        "receiver": {
            "name": "webhook"
        }
    }
]

Now we send the following "resolved" alert (that is, an alert whose EndsAt is set to a time in the past):

[
    {
        "Labels": {
            "alertname": "NodeCpuPressure",
            "IP": "192.168.2.101"
        },
        "Annotations": {
            "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "StartsAt": "2020-02-17T23:00:00.000+08:00", 
        "EndsAt": "2020-02-17T23:01:00.000+08:00"
    }
]

Querying the Alerts and Groups again shows that only one of each remains. The Groups result is:

[
    {
        "alerts": [
            {
                "annotations": {
                    "summary": "NodeMemoryPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
                },
                "endsAt": "2020-02-18T23:00:00.000+08:00",
                "fingerprint": "1a354c7333c5b062",
                "receivers": [
                    {
                        "name": "webhook"
                    }
                ],
                "startsAt": "2020-02-17T23:00:00.000+08:00",
                "status": {
                    "inhibitedBy": [],
                    "silencedBy": [],
                    "state": "active"
                },
                "updatedAt": "2020-02-17T23:41:27.790+08:00",
                "labels": {
                    "IP": "192.168.2.101",
                    "alertname": "NodeMemoryPressure"
                }
            }
        ],
        "labels": {
            "alertname": "NodeMemoryPressure"
        },
        "receiver": {
            "name": "webhook"
        }
    }
]

group_wait

Stop AlertManager, clear its data directory, then start AlertManager again with the same configuration as above. At this point AlertManager contains no Alerts or Groups.

Next, send an alert to AlertManager with the following content:

[
    {
        "Labels": {
            "alertname": "NodeCpuPressure",
            "IP": "192.168.2.101"
        },
        "Annotations": {
            "summary": "NodeCpuPressure, IP: 192.168.2.101, Value: 90%, Threshold: 85%"
        },
        "StartsAt": "2020-02-17T23:00:00.000+08:00", 
        "EndsAt": "2020-02-18T23:00:00.000+08:00"
    }
]

Then, within 30 seconds, call the API again to send the following Alert:

[
    {
        "Labels": {
            "alertname": "NodeCpuPressure",
            "IP": "192.168.2.102"
        },
        "Annotations": {
            "summary": "NodeCpuPressure, IP: 192.168.2.102, Value: 95%, Threshold: 85%"
        },
        "StartsAt": "2020-02-17T23:00:00.000+08:00", 
        "EndsAt": "2020-02-18T23:00:00.000+08:00"
    }
]

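Then, 30 seconds after the first alert was sent, our own program receives the notification. Because both alerts share the same alertname and the second one arrived within group_wait, they belong to the same Group and are delivered together.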

Appendix

webhook-receiver.go

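package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
    "time"
)

// MyHandler prints every request body it receives, together with the arrival time.
type MyHandler struct{}

func (h *MyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    defer r.Body.Close()
    // io.ReadAll replaces the deprecated ioutil.ReadAll (Go 1.16 and later).
    body, err := io.ReadAll(r.Body)
    if err != nil {
        fmt.Printf("read body err, %v\n", err)
        http.Error(w, "failed to read body", http.StatusInternalServerError)
        return
    }
    // Print the receive time so the delay of each notification can be observed.
    fmt.Println(time.Now())
    fmt.Printf("%s\n\n", string(body))
}

func main() {
    // Receive AlertManager webhook notifications at /webhook on port 10000.
    http.Handle("/webhook", &MyHandler{})
    // ListenAndServe only returns on failure, so report the error and exit.
    log.Fatal(http.ListenAndServe(":10000", nil))
}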
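The receiver above prints the raw request body. If you would rather see a short summary, the body can be decoded first. The sketch below covers only a subset of the fields in the webhook payload; the names webhookMessage and summarize are hypothetical, and the sample JSON in main is hand-written for illustration, not output captured from AlertManager.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// webhookMessage covers a subset of the JSON that AlertManager posts to a
// webhook receiver.
type webhookMessage struct {
	Status      string            `json:"status"`
	Receiver    string            `json:"receiver"`
	GroupLabels map[string]string `json:"groupLabels"`
	Alerts      []struct {
		Status string            `json:"status"`
		Labels map[string]string `json:"labels"`
	} `json:"alerts"`
}

// summarize decodes a webhook body and describes it in one line.
func summarize(body []byte) (string, error) {
	var m webhookMessage
	if err := json.Unmarshal(body, &m); err != nil {
		return "", err
	}
	return fmt.Sprintf("%s: group %v, %d alert(s)", m.Status, m.GroupLabels, len(m.Alerts)), nil
}

func main() {
	// Hand-written sample for illustration only.
	sample := []byte(`{
		"status": "firing",
		"receiver": "webhook",
		"groupLabels": {"alertname": "NodeCpuPressure"},
		"alerts": [
			{"status": "firing", "labels": {"alertname": "NodeCpuPressure", "IP": "192.168.2.101"}},
			{"status": "firing", "labels": {"alertname": "NodeCpuPressure", "IP": "192.168.2.102"}}
		]
	}`)
	s, err := summarize(sample)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(s) // prints firing: group map[alertname:NodeCpuPressure], 2 alert(s)
}
```

Calling summarize from ServeHTTP in place of the raw print would make it easier to see how many alerts each notification carries.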
