> For the complete documentation index, see [llms.txt](https://pshizhsysu.gitbook.io/prometheus/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://pshizhsysu.gitbook.io/prometheus/ff08-san-ff09-prometheus-gao-jing-chu-li/kuo-zhan-yue-du/shi-jian-ff1a-jie-shou-prometheus-de-gao-jing.md).

# 实践：接收Prometheus的告警

本节我们将写一个程序，来接收Prometheus发送的告警，并将告警打印出来。

首先，我们安装好Prometheus、NodeExporter。Prometheus的配置prometheus.yml如下：

```
global:
  scrape_interval: 15s 
  evaluation_interval: 15s

rule_files:
  - "/usr/local/prometheus/rule_files/node.yml"

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
    - targets: ['192.168.2.101:9100']
      labels:
        IP: '192.168.2.101'

alerting:
  alertmanagers:
  - static_configs:
    - targets: [ '192.168.2.101:19093' ]
```

告警规则文件`node.yml`内容如下：

```
groups:
- name: Node
  rules:
  - alert: NodeCpuPressure
    expr: 100 * (1 - avg(irate(node_cpu_seconds_total{mode="idle"}[2m])) by(IP)) > 80
    for: 2m
    annotations:
      summary: "NodeCpuPressure, IP: {{$labels.IP}}, Value: {{$value}}%, Threshold: 80%"
```

在上面的prometheus.yml中，我们配置的alertmanager的地址为`192.168.2.101:19093`，这个地址就是接下来我们要启动的自已的程序，用来接收prometheus的告警信息并打印出来。

在主机上的某个目录下创建文件alertmanager.go，文件内容见本文附录，然后执行命令`go run alertmanager.go`将程序跑起来。

接着，我们启动Prometheus与NodeExporter。

然后，我们拉高主机CPU的使用率，不一会，我们便可以在Prometheus的Alert页面看到一个告警处理Pending状态

![](/files/-MA1FerERDyVYooDffgF)

再等一会，等告警Firing之后，我们自已的程序会接收到该告警并打印出来：可以发现，`startsAt`就是上面截图中的`Active Since`的时间，也就是Alert变成Pending的时间。`endsAt`刚好是`startAt`加了三分钟（这个暂时不清楚为什么是三分钟）。

```
$ go run alertmanager.go
2020-02-17 20:48:58.411459407 +0800 CST m=+191.432028937
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 100%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:51:58.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]
```

让CPU一直保持在80%以上，我们会发现Prometheus会持续地向AlertManager发送Alert。我们发现，发送的时间间隔为75秒（为什么是75秒还不清楚，很有可能是`evaluation_interval`的五倍），`startsAt`一直没变，而`endsAt`一直在增加，增加的时间为Prometheus发送该告警的时间再加3m。

```
2020-02-17 20:48:58.411459407 +0800 CST m=+191.432028937
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 100%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:51:58.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 20:50:13.405880837 +0800 CST m=+266.426450346
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 100%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:53:13.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 20:51:28.405287147 +0800 CST m=+341.425856660
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 100%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:54:28.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 20:52:43.40958641 +0800 CST m=+416.430155922
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 100%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:55:43.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 20:53:58.405264448 +0800 CST m=+491.425833979
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 100%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:56:58.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 20:55:13.406017704 +0800 CST m=+22.465382727
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 100%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:58:13.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 20:56:28.405989243 +0800 CST m=+97.465354315
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 100%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:59:28.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]
```

接着，我们把CPU降到80%以下（时间大概在`20:56:28`到`20:56:58`之间）。我们会发现，`20:56:28`时发送的告警（上面的最后一条），其`endsAt`的时间为`20:59:28`，说明发送该条告警时，CPU还处于80%以上。而过了30秒后，Promethus又发送了一条告警（下面第一条），这一次没有等到75秒，是因为发送该条告警时CPU已经低于80%了。

这里我们就有两个疑问了：（1）我们如何判断发送下面第一条告警时CPU已经低于80%了呢？（2）下面第一条告警中，CPU的值显示的是81.76666666656578%呀？

其实原因是很简单：我们配置了`evaluation_interval`的值为15秒，Prometheus在上面最后一条告警（`20:56:28`）后，在`20:56:43`和`20:56:58`都会做一次Evaluation，在`20:56:43`这一次计算出来的CPU的值为81.76666666656578%，由于上一条告警还没有过75秒，所以不发送。在`20:56:58`时该计算出来的CPU已经小于80%，相当于告警已经解除，于是Prometheus会立即发送一条告警出去，表明告警已解除。也就是说，下面第一条告警其实是一条“解除告警”，为什么呢？因为`endsAt`的时间就是发送该条告警的时间，当AlertManager接收到以后，发现这个时间已经是一个过去的时间了，也就是说，这条告警已经结束了。既然已经通过`endsAt`标识该告警已经过时了，那么CPU的值也就不重要了，不过Prometheus用了“解除告警”的上一次计算出来的值还有让人有点容易混淆。

```
2020-02-17 20:56:58.405968564 +0800 CST m=+127.465333664
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 81.76666666656578%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:56:58.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 20:58:13.408337229 +0800 CST m=+202.467702513
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 81.76666666656578%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:56:58.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 20:59:28.406675061 +0800 CST m=+277.466040229
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 81.76666666656578%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:56:58.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 21:00:43.40653634 +0800 CST m=+352.465901514
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 81.76666666656578%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:56:58.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]

2020-02-17 21:01:58.405259946 +0800 CST m=+427.464625058
[{"labels":{"IP":"192.168.2.101","alertname":"NodeCpuPressure"},"annotations":{"summary":"NodeCpuPressure, IP: 192.168.2.101, Value: 81.76666666656578%, Threshold: 80%"},"startsAt":"2020-02-17T20:48:58.402147666+08:00","endsAt":"2020-02-17T20:56:58.402147666+08:00","generatorURL":"http://peng01:9090/graph?g0.expr=100+%2A+%281+-+avg+by%28IP%29+%28irate%28node_cpu_seconds_total%7Bmode%3D%22idle%22%7D%5B2m%5D%29%29%29+%3E+80\u0026g0.tab=1"}]
```

另外，根据上面的输出我们还可以发现，当“解除告警”发出去以后，Prometheus还坚持把“解除告警”发送了好多次，而且时间间隔也为75秒。在实验中我们发现大概发送了十几次后，Prometheus就不再发送了。

## 附录

alertmanager.go的代码

```
package main

import (
    "time"
    "io/ioutil"
    "net/http"
    "fmt"
)

type MyHandler struct{}

func (mh *MyHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    body, err := ioutil.ReadAll(r.Body)
    if err != nil {
        fmt.Printf("read body err, %v\n", err)
        return
    }
    fmt.Println(time.Now())
    fmt.Printf("%s\n\n", string(body))
}

func main() {
    http.Handle("/api/v1/alerts", &MyHandler{})
    http.ListenAndServe(":19093", nil)
}
```


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://pshizhsysu.gitbook.io/prometheus/ff08-san-ff09-prometheus-gao-jing-chu-li/kuo-zhan-yue-du/shi-jian-ff1a-jie-shou-prometheus-de-gao-jing.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
