轻量化日志中心 Grafana Loki

需要集中的日志系统的原因

目前现状,每个服务生产上有三台,定位生产问题,需要连上一台机器,然后使用 cd / tail / less / grep / sed / awk 等 linux命令去日志里查找故障原因。如果在这台机器没搜索到线索,就去另外两台机器上查日志。
但在分布式系统中,众多服务分散部署在数十台甚至上百台不同的服务器上,想要快速方便的实现查找、分析和归档等功能,使用Linux命令等传统的方式查询到想要的日志就费事费力,更不要说对日志进行分析与归纳了。
为解决大量日志归档,文件搜索慢,如何多维度查询就得需要集中化日志管理,将所有服务器上的日志收集汇总,常见的解决思路就是建立集中式日志收集系统。

LPG

Loki + Promtail + GRafana

  • Loki:聚合并存储日志数据,可以作为Grafana的数据源,为Grafana提供可视化数据
  • Promtail:日志收集器,有点像Filebeat,可以手机日志文件中的日志,并把收集到的数据推送到Loki。
  • Grafana:从Loki中获取日志信息,进行可视化展示

Loki是 Grafana Labs 团队最新的开源项目,是一个水平可扩展,高可用,多租户的日志聚合系统。它的设计非常经济高效且易于操作,因为它不会为日志内容编制索引,而是为每个日志流编制一组标签,专门为 Prometheus和Kubernetes用户做了相关优化。
该项目受Prometheus启发,官方的介绍就是:Like Prometheus,But For Logs。类似Prometheus的日志系统。

项目地址:
https://github.com/grafana/loki/

与其他日志聚合系统相比,Loki具有以下特性:

  • 不对日志进行全文索引。通过存储压缩非结构化日志和仅索引元数据,Loki操作起来会更简单,更省成本
  • 通过使用与Prometheus相同的标签记录流对日志进行索引和分组,这使得日志的扩展和操作效率更高,能对接 alertmanager
  • 特别适合存储Kubernetes Pod日志;诸如Pod标签之类的元数据会被自动删除和编入索引
  • 受Grafana原生支持,避免Kibana和grafana来回切换
日志收集流程图

架构说明

在这里插入图片描述

部署
安装 docker
# 安装依赖
sudo yum install -y yum-utils device-mapper-persistent-data lvm2

# 添加源
sudo yum-config-manager --add-repo http://mirrors.aliyun.com/docker-ce/linux/centos/docker-ce.repo

# 更新
sudo yum makecache fast

# 安装docker-ce
sudo yum -y install docker-ce

# 创建配置文件目录
sudo mkdir -p /etc/docker

# 创建配置文件
touch /etc/docker/daemon.json

# 新增配置文件内容 /mitr/middleware/docker目录已创建
cat <<EOF | sudo tee /etc/docker/daemon.json
{
"registry-mirrors":["https://docker.mirrors.ustc.edu.cn"],
"data-root": "/mitr/middleware/docker"
}
EOF

# 加载启动并设置开机自启
systemctl daemon-reload && systemctl enable docker && systemctl start docker
安装grafana
# 拉取镜像
docker pull grafana/grafana

# 创建镜像并启动grafana 初始密码 admin/admin
docker run --name=grafana -id -p 3000:3000 grafana/grafana
安装Promtail
# 拉取镜像
docker pull grafana/promtail

# 创建配置文件目录
mkdir -p /etc/promtail

cat <<EOF |sudo tee /etc/promtail/promtail-config.yaml
server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://10.0.0.58:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      host: localhost
      __path__: /var/log/*
EOF
# 创建并启动容器
docker run -di --name=promtail -v /etc/promtail/promtail-config.yaml:/etc/promtail/config.yml -v /var/log:/var/log -p 9080:9080 grafana/promtail
安装Loki

Loki配置文件参考

# 拉取镜像
docker pull grafana/loki

# 创建配置文件目录
mkdir -p /etc/loki

cat <<EOF |sudo tee /etc/loki/local-config.yaml
# 认证配置
# 通过X-Scope-OrgID头启用身份验证,该头必须存在
# 如果情况属实。如果为假,则OrgID将始终设置为"fake"
auth_enabled: false

# 端口配置
server:
  http_listen_port: 3100

ingester:
  # WAL(Write Ahead Log)记录传入的日志,并将他们存储在本地文件系统,在进程雪崩的情况下,保证已确认数据的持久性。
  wal:
    enabled: true
    dir: /loki/wal
  
  # 生命周期配置
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        # Backend storage to use for the ring. Supported values are: consul, etcd,
        # inmemory, memberlist
        # CLI flag: -ring.store
        store: inmemory

      # The number of ingesters to write to and read from 
      # default = 3
      replication_factor: 1
    # 在退出以确保指标被抓取之前休眠的持续时间 default = 30s  
    final_sleep: 0s
  chunk_idle_period: 1h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 1h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to 1.5MB, flushing first if chunk_idle_period or max_chunk_age is reached first
  chunk_retain_period: 30s    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

# 配置Loki存储数据的位置
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

# 压缩块配置压缩器组件,该组件压缩索引碎片以提高性能
compactor:
  working_directory: /loki/boltdb-shipper-compactor
  shared_store: filesystem

limits_config:
  reject_old_samples: true
  # Maximum accepted sample age before rejecting
  reject_old_samples_max_age: 168h

chunk_store_config:
  # 向后查找日志行所需的最大时间。仅适用于即时日志查询 default = 30s
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: true
  retention_period: 360h

# ruler:
#   storage:
#     type: local
#     local:
#       directory: /loki/rules
#   rule_path: /loki/rules-temp
#   alertmanager_url: http://localhost:9093
#   ring:
#     kvstore:
#       store: inmemory
#   enable_api: true  
EOF

# 创建并启动容器
docker run -di --name=loki -v loki-data:/loki -v /etc/loki/local-config.yaml:/etc/loki/local-config.yaml -p 3100:3100 grafana/loki
使用

Grafana配置Loki数据源
如下图:

源地址配置 http://10.0.0.74:3100 即可 保存

保存之后,切换到grafana左侧区域的Explore,即可进入到Loki页面

然后点击Log browser 就可以把当前系统采集的日志标签显示出来,再根据这些标签进行日志的过滤查询即可,比如这里选择 /var/log/message,10.0.0.74,就会把该文件下面的日志过滤展示出来


这里展示的是Promtail容器里面的 /var/log 目录中的日志
Promtail容器 /etc/promtail/config.yml

shellserver:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      __path__: /var/log/*log

这里的 job 就是varlog,文件路径就是 /var/log/*log
只索引标签:之前多次看到Loki和ES最大的不同是Loki只对标签进行索引而不对内容索引。
静态标签匹配模式
以简单的Promtail配置举例

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: message
      __path__: /var/log/messages

解读:

  • 上面这段配置代表启动一个日志采集任务
  • 这个任务有一个固定标签 job: message
  • 采集日志路径为 /var/log/messages,会以一个名为filename的固定标签
  • 在Promtail的web页面上可以看到类似Promtail的target信息页面

可以和使用Prometheus一样的标签匹配语句进行查询 {job=“message”}

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: syslog
      __path__: /var/log/syslog
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: apache
      __path__: /var/log/apache.log

如果我们配置了两个job,则可以使用 {job=~“message|varlogs”} 进行多job匹配;同时也支持正则匹配
标签匹配模式特点

  • 和Prometheus一致,相同标签对应的是一个流Prometheus处理series的模式
  • Prometheus中标签一致对应的同一个hash值和refid(正整数递增的id),也就是同一个series
  • 时序数据不断的append追加到这个memseries中
  • 当有人以标签发生变化时会产生新的hash值和refid,对应新的series

Loki处理日志的模式和Prometheus一致,loki一组标签值会生成一个stream。日志随着时间的递增会追加到这个stream中,最后压缩为chunk。当有人以标签发生变化时会产生新的hash值,对应新的stream
查询过程

  • loki现根据标签算出hash值在倒排索引中找到对应的chunk
  • 然后再根据查询语句中的关键词等进行过滤,这样可以提速
  • 因为这种根据标签算哈希在倒排中查找id,对应找到存储的块在Prometheus中已经被验证过了
  • 属于第开销 速度快

动态标签和高基数

  • 动态标签:标签的value不固定
  • 高基数标签:标签的value可能性太多了,达到10万,100万甚至更多

比如Apache的access日志
11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] “GET /1986.js HTTP/1.1” 200 932 “-” “Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6”
在Promtail中使用regex想要匹配action和status_code两个标签:

scrape_configs:
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: syslog
      __path__: /var/log/syslog
 - job_name: system
   pipeline_stages:
   static_configs:
   - targets:
      - localhost
     labels:
      job: apache
      __path__: /var/log/apache.log

  - job_name: system
    pipeline_stages:
       - regex:
         expression: "^(?P<ip>\\S+) (?P<identd>\\S+) (?P<user>\\S+) \\[(?P<timestamp>[\\w:/]+\\s[+\\-]\\d{4})\\] \"(?P<action>\\S+)\\s?(?P<path>\\S+)?\\s?(?P<protocol>\\S+)?\" (?P<status_code>\\d{3}|-) (?P<size>\\d+|-)\\s?\"?(?P<referer>[^\"]*)\"?\\s?\"?(?P<useragent>[^\"]*)?\"?$"
     - labels:
         action:
         status_code:
    static_configs:
    - targets:
       - localhost
      labels:
       job: apache
       env: dev
       __path__: /var/log/apache.log

那么对应action=get/post 和 status_code=200/400 则对应4个流:

11.11.11.11 - frank [25/Jan/2000:14:00:01 -0500] "GET /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.12 - frank [25/Jan/2000:14:00:02 -0500] "POST /1986.js HTTP/1.1" 200 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.13 - frank [25/Jan/2000:14:00:03 -0500] "GET /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"
11.11.11.14 - frank [25/Jan/2000:14:00:04 -0500] "POST /1986.js HTTP/1.1" 400 932 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 GTB6"

那四个日志行将变成4个单独的流,并开始填充4个单独的块,如果出现另一个独特的标签组合(如 status_code=500),则会创建另一个新流。
高基数问题:就像上面,如果给ip设置一个标签,来自用户的每个不同的ip请求不仅成为唯一的流。可以快速生成成千上万的流,这是高基数,这样可以杀死Loki
如果字段没有被当做标签被索引,会不会查询很慢,Loki的能力是将查询分解为小块并行分发,以便可以在短时间内查询大量日志数据
全文索引问题
大索引既复杂又昂贵。通常日志数据的全文索引的大小等于或大于日志数据本身的大小。
要查询日志数据,需要加载此索引,并且为了提高性能,它可能在内存中。这很难扩展,并随着日志增多,索引会迅速变大。
Loki的索引通常比摄取的日志量小一个数量级,索引的增长非常缓慢。
加速查询没标签字段:以👆提到的ip字段为例,使用过滤器表达式查询
{job=“apache”}|=“11.11.11.11”
loki查询时的分片(按时间范围分段 grep)

  • Loki把查询分解成较小的分片,并且与标签匹配的流打开每个区块,并开始寻找IP地址
  • 这些分片的大小和并行化的数量是可以配置的,并取决于提供的资源
  • 如果需要,可以将分片间隔配置文件5min,部署20个查询器,并在几秒钟内处理千兆字节的日志
  • 或者,可以设置200个查询器并处理TB的日志

两种索引模式对比

  • es的大索引,不管你查不查询,它都必须时刻存在。比如长时间占用过多的内存
  • loki的逻辑是查询时在启动多个分段并行查询

日志量少时少加标签

  • 因为每多加载一个chunk就有额外的开销
  • 举例,如果查询是 {app=“loki”,level!=“bug”}
  • 在没加level标签的情况下只需要加载一个chunk 即app=“loki”的标签
  • 如果加了level的情况,则需要把level=info,warn,error,critical 5个chunk都加载再查询

需要标签是再去添加

  • 当chunk_target_size=1M时 以1M的压缩大小来切割块
  • 对应原始日志大小在5M—10M,如果日志在max_chunk_age时间内能达到10M,考虑添加标签

日志应当按时间递增

  • 这个问题和tsdb中处理旧数据是一样的道理
  • 目前loki为了性能考虑直接拒绝掉旧数据
对接Prometheus监控

修改 prometheus.yml

scrape_configs:
    - job_name: 'Loki'
      static_configs:
        - targets: ["10.0.0.74:3100"]

然后加载Prometheus即可

Grafana dashboard

安装插件
grafana-cli plugins install grafana-piechart-panel
dashboard效果图

import Dashboard Json

{
   
  "annotations": {
   
    "list": [
      {
   
        "$$hashKey": "object:7",
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
   
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "description": "Loki metrics via 2.0",
  "editable": true,
  "gnetId": 13407,
  "graphTooltip": 0,
  "id": 28,
  "links": [],
  "panels": [
    {
   
      "datasource": "Prometheus",
      "description": "",
      "gridPos": {
   
        "h": 2,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "id": 21,
      "interval": "",
      "options": {
   
        "content": "<div class=\"left dashboard-header\">\n   <img src=\"https://marketplace-assets.digitalocean.com/logos/loki-logo.svg\" style=\"height:29px;\"/>\n  <span>Loki Global Metrics</span>\n</div>",
        "mode": "html"
      },
      "pluginVersion": "8.1.3",
      "targets": [
        {
   
          "expr": "loki_build_info",
          "format": "table",
          "instant": false,
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "transparent": true,
      "type": "text"
    },
    {
   
      "datasource": "Prometheus",
      "fieldConfig": {
   
        "defaults": {
   
          "decimals": 4,
          "links": [],
          "mappings": [],
          "thresholds": {
   
            "mode": "absolute",
            "steps": [
              {
   
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
   
        "h": 3,
        "w": 6,
        "x": 0,
        "y": 2
      },
      "id": 59,
      "interval": "",
      "options": {
   
        "colorMode": "background",
        "graphMode": "none",
        "justifyMode": "center",
        "orientation": "auto",
        "reduceOptions": {
   
          "calcs": [
            "last"
          ],
          "fields": "/^version$/",
          "values": false
        },
        "text": {
   },
        "textMode": "auto"
      },
      "pluginVersion": "8.1.3",
      "targets": [
        {
   
          "expr": "loki_build_info",
          "format": "table",
          "instant": false,
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "Loki Version",
      "type": "stat"
    },
    {
   
      "datasource": "Prometheus",
      "fieldConfig": {
   
        "defaults": {
   
          "decimals": 0,
          "links": [],
          "mappings": [],
          "thresholds": {
   
            "mode": "absolute",
            "steps": [
              {
   
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
   
        "h": 3,
        "w": 6,
        "x": 6,
        "y": 2
      },
      "id": 10,
      "interval": "",
      "options": {
   
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "center",
        "orientation": "auto",
        "reduceOptions": {
   
          "calcs": [
            "last"
          ],
          "fields": "",
          "values": false
        },
        "text": {
   },
        "textMode": "auto"
      },
      "pluginVersion": "8.1.3",
      "targets": [
        {
   
          "expr": "sum(log_messages_total)",
          "instant": false,
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "Message Total",
      "type": "stat"
    },
    {
   
      "datasource": "Prometheus",
      "fieldConfig": {
   
        "defaults": {
   
          "decimals": 4,
          "links": [],
          "mappings": [],
          "thresholds": {
   
            "mode": "absolute",
            "steps": [
              {
   
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
   
        "h": 3,
        "w": 4,
        "x": 12,
        "y": 2
      },
      "id": 23,
      "interval": "",
      "options": {
   
        "colorMode": "value",
        "graphMode": "area",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
   
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "text": {
   },
        "textMode": "auto"
      },
      "pluginVersion": "8.1.3",
      "targets": [
        {
   
          "expr": "sum(loki_store_series_total)",
          "instant": false,
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "Store Series Total",
      "type": "stat"
    },
    {
   
      "datasource": "Prometheus",
      "fieldConfig": {
   
        "defaults": {
   
          "decimals": 4,
          "links": [],
          "mappings": [],
          "thresholds": {
   
            "mode": "absolute",
            "steps": [
              {
   
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "bytes"
        },
        "overrides": []
      },
      "gridPos": {
   
        "h": 3,
        "w": 5,
        "x": 16,
        "y": 2
      },
      "id": 41,
      "interval": "",
      "options": {
   
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "auto",
        "orientation": "auto",
        "reduceOptions": {
   
          "calcs": [
            "mean"
          ],
          "fields": "",
          "values": false
        },
        "text": {
   },
        "textMode": "auto"
      },
      "pluginVersion": "8.1.3",
      "targets": [
        {
   
          "expr": "sum(loki_ingester_chunk_stored_bytes_total)",
          "instant": true,
          "interval": "",
          "legendFormat": "",
          "refId": "A"
        }
      ],
      "timeFrom": null,
      "timeShift": null,
      "title": "Store Chunks Total",
      "type": "stat"
    },
    {
   
      "datasource": "Prometheus",
      "fieldConfig": {
   
        "defaults": {
   
          "decimals": 4,
          "links": [],
          "mappings": [],
          "thresholds": {
   
            "mode": "absolute",
            "steps": [
              {
   
                "color": "green",
                "value": null
              }
            ]
          },
          "unit": "short"
        },
        "overrides": []
      },
      "gridPos": {
   
        "h": 3,
        "w": 3,
        "x": 21,
        "y": 2
      },
      "id": 24,
      "interval": "",
      "options": {
   
        "colorMode": "value",
        "graphMode": "none",
        "justifyMode": "center",
        "orientation": "auto",
        "reduceOptions": {
   
          "calcs": [
            "last"
          ],
          "fields": "",
          "values": false
        },
        "text": {
   },
        "textMode": "auto"
      },
      "pluginVersion": "8.1.3",
      "targets": [
        {
   
          "expr": "loki_panic_total",
          "instant": false,
          "interval": "",
          "legendFormat":</

相关推荐

  1. 量化微调使用场景对比

    2024-04-10 00:10:02       29 阅读
  2. 低频量化日报

    2024-04-10 00:10:02       52 阅读

最近更新

  1. docker php8.1+nginx base 镜像 dockerfile 配置

    2024-04-10 00:10:02       98 阅读
  2. Could not load dynamic library ‘cudart64_100.dll‘

    2024-04-10 00:10:02       106 阅读
  3. 在Django里面运行非项目文件

    2024-04-10 00:10:02       87 阅读
  4. Python语言-面向对象

    2024-04-10 00:10:02       96 阅读

热门阅读

  1. C++基础——结构体与类

    2024-04-10 00:10:02       36 阅读
  2. GIT泄露

    GIT泄露

    2024-04-10 00:10:02      34 阅读
  3. 广义表的学习

    2024-04-10 00:10:02       29 阅读
  4. M2 Pro安装 huggingface transformer

    2024-04-10 00:10:02       38 阅读
  5. Leetcode面试经典150_Q189轮转数组

    2024-04-10 00:10:02       35 阅读
  6. LeetCode|501. Find Mode in Binary Search Tree

    2024-04-10 00:10:02       26 阅读
  7. 爬虫之数据神器10---Peewee实现ORM的核心原理

    2024-04-10 00:10:02       30 阅读
  8. Day32 线程安全二

    2024-04-10 00:10:02       35 阅读
  9. Day31 线程安全一

    2024-04-10 00:10:02       24 阅读
  10. 2024.4.7力扣每日一题——王位继承顺序

    2024-04-10 00:10:02       39 阅读
  11. python--异常处理

    2024-04-10 00:10:02       42 阅读
  12. QB/T 4464-2013 家具用蜂窝板检测

    2024-04-10 00:10:02       36 阅读