kube-promethesu调整coredns监控

K8s集群版本是二进制部署的1.20.4,kube-prometheus对应选择的版本是kube-prometheus-0.8.0

Coredns是在安装集群的时候部署的,采用的也是该版本的官方文档,kube-prometheus中也有coredns的监控配置信息,但是在prometheus的监控页面并没有发现coredns的servicemonitor.。所以我们需要一步步的去排查该问题。

先看下coredns的servicemonitor 

vim kubernetes-serviceMonitorCoreDNS.yaml

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    app.kubernetes.io/name: coredns
  name: coredns
  namespace: monitoring
spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 15s
    port: metrics
  jobLabel: app.kubernetes.io/name
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-dns

再来看下coredns的service配置

---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  annotations:
    prometheus.io/port: "9153"
    prometheus.io/scrape: "true"
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/name: "CoreDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.0.0.2
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
  - name: metrics
    port: 9153
    protocol: TCP

从上面两段可以看到,servicemonitor去匹配的service是

labels:

app.kubernetes.io/name: coredns

而我们创建的coredns的service的labels

labels:

    k8s-app: kube-dns

    kubernetes.io/cluster-service: "true"

    addonmanager.kubernetes.io/mode: Reconcile

    kubernetes.io/name: "CoreDNS"

两边没有对应上,所以该servicemonitor无法匹配到对应的service,所以监控不到我们的coredns.

因coredns对服务的影响比较大,我们选择去修改servicemonitor

修改labels后重新apply

Kubectl apply -f kubernetes-serviceMonitorCoreDNS.yaml

coredns就加载出来了

配置coredns的监控信息

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app.kubernetes.io/name: kube-prometheus
    app.kubernetes.io/part-of: kube-prometheus
    prometheus: k8s
    role: alert-rules
  name: kubernetes-monitoring-coredns-rules
  namespace: monitoring

spec:
  groups:
  - name: coredns
    rules:
    - alert: CoreDNSDown
      annotations:
        message: CoreDNS has disappeared from Prometheus target discovery.
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsdown
      expr: |
        absent(up{job="kube-dns"} == 1)
      for: 15m
      labels:
        severity: critical
    - alert: CoreDNS的dns请求持续时间延迟高
      annotations:
        message: CoreDNS has 99th percentile latency of {{ $value }} seconds for server
          {{ $labels.server }} zone {{ $labels.zone }} .
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednslatencyhigh
      expr: |
        histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(server, zone, le)) > 4
      for: 10m
      labels:
        severity: critical
    - alert: CoreDNS响应错误高
      annotations:
        message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of   requests.
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednserorshigh
      expr: |
        sum(rate(coredns_dns_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
          /
        sum(rate(coredns_dns_responses_total{job="kube-dns"}[5m])) > 0.03
      for: 10m
      labels:
        severity: critical
    - alert: CoreDNS响应错误高
      annotations:
        message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of   requests.
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednserorshigh
      expr: |
        sum(rate(coredns_dns_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
          /
        sum(rate(coredns_dns_responses_total{job="kube-dns"}[5m])) > 0.01
      for: 10m
      labels:
        severity: warning

    - alert: CoreDNS转发请求持续时间延迟高
      annotations:
        message: CoreDNS has 99th percentile latency of {{ $value }} seconds forwarding requests to {{ $labels.to }}.
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardlatencyhigh
      expr: |
        histogram_quantile(0.99, sum(rate(coredns_forward_request_duration_seconds_bucket{job="kube-dns"}[5m])) by(to, le)) > 4
      for: 10m
      labels:
        severity: critical
 
    - alert: CoreDNSForwardErrorsHigh
      annotations:
        message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }}.
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwarderrorshigh
      expr: |
        sum(rate(coredns_forward_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
          /
        sum(rate(coredns_forward_responses_total{job="kube-dns"}[5m])) > 0.03
      for: 10m
      labels:
        severity: critical
 
    - alert: CoreDNSForwardErrorsHigh
      annotations:
        message: CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }}.
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwarderrorshigh
      expr: |
        sum(rate(coredns_forward_responses_total{job="kube-dns",rcode="SERVFAIL"}[5m]))
          /
        sum(rate(coredns_forward_responses_total{job="kube-dns"}[5m])) > 0.01
      for: 10m
      labels:
        severity: warning
  
    - alert: CoreDNSForwardHealthcheckFailureCount
      annotations:
        message: CoreDNS health checks have failed to upstream server {{ $labels.to }}.
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardhealthcheckfailurecount
      expr: |
        sum(rate(coredns_forward_healthcheck_failures_total{job="kube-dns"}[5m])) by (to) > 0
      for: 10m
      labels:
        severity: warning
  
    - alert: CoreDNSForwardHealthcheckBrokenCount
      annotations:
        message: CoreDNS health checks have failed for all upstream servers.
        runbook_url: https://github.com/povilasv/coredns-mixin/tree/master/runbook.md#alert-name-corednsforwardhealthcheckbrokencount
      expr: |
        sum(rate(coredns_forward_healthcheck_broken_total{job="kube-dns"}[5m])) > 0
      for: 10m
      labels:
        severity: warning
    - alert: CorednsPanicCount
      expr: increase(coredns_panics_total[1m]) > 0
      for: 0m
      labels:
        severity: critical
      annotations:
        summary: CoreDNS Panic Count (instance {{ $labels.instance }})
        description: "Number of CoreDNS panics encountered\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
 

相关推荐

  1. Prometheus监控Linux

    2024-06-12 04:46:01       39 阅读
  2. Prometheus监控nginx

    2024-06-12 04:46:01       37 阅读

最近更新

  1. TCP协议是安全的吗?

    2024-06-12 04:46:01       16 阅读
  2. 阿里云服务器执行yum,一直下载docker-ce-stable失败

    2024-06-12 04:46:01       16 阅读
  3. 【Python教程】压缩PDF文件大小

    2024-06-12 04:46:01       15 阅读
  4. 通过文章id递归查询所有评论(xml)

    2024-06-12 04:46:01       18 阅读

热门阅读

  1. Python学习笔记5:入门知识(五)

    2024-06-12 04:46:01       7 阅读
  2. c++【入门】买文具2

    2024-06-12 04:46:01       8 阅读
  3. 实验4-原地逆转链表(梅开二度)

    2024-06-12 04:46:01       7 阅读
  4. 反射...

    反射...

    2024-06-12 04:46:01      6 阅读
  5. Room数据库使用

    2024-06-12 04:46:01       8 阅读
  6. 【WPF】中的ListBox的ScrollIntoView方法使用

    2024-06-12 04:46:01       9 阅读
  7. 2001NOIP普及组真题 4. 装箱问题

    2024-06-12 04:46:01       16 阅读
  8. postgres常用查询

    2024-06-12 04:46:01       8 阅读