DeepSpeed Monitoring & Comm. Logging

Monitoring

支持多种后端:Tensorboard、WandB、Comet、CSV文件;

TensorBoard例子:

自动监控:DeepSpeed自动把重要metric记录下来。只需在配置文件里enable相应的看板后端即可:

{
  "tensorboard": {
    "enabled": true,
    "output_path": "output/ds_logs/",
    "job_name": "train_bert"
  }
  "wandb": {
    "enabled": true,
    "team": "my_team",
    "group": "my_group",
    "project": "my_project"
  }
  "comet": {
    "enabled": true,
    "project": "my_project",
    "experiment_name": "my_experiment"
  }
  "csv_monitor": {
    "enabled": true,
    "output_path": "output/ds_logs/",
    "job_name": "train_bert"
  }
}

 自定义监控:

# Step 1: Import monitor (and DeepSpeed config, if needed)
from deepspeed.monitor.monitor import MonitorMaster
from deepspeed.runtime.config import DeepSpeedConfig

# Step 2: Initialized monitor with DeepSpeed config (get DeepSpeed config object, if needed)
ds_config = DeepSpeedConfig("ds_config.json")
monitor = MonitorMaster(ds_config.monitor_config)

for epoch in range(2):

    running_loss = 0.0
    for i, data in enumerate(trainloader):
        pre = time.time()
        inputs, labels = data[0].to(model_engine.local_rank), data[1].to(
            model_engine.local_rank)
        if fp16:
            inputs = inputs.half()
        outputs = model_engine(inputs)
        loss = criterion(outputs, labels)

        model_engine.backward(loss)
        model_engine.step()
        post = time.time()
        # Step 3: Create list of 3-tuple records (single entry in this case)
        events = [("Time per step", post-pre, model_engine.global_samples)]
        # Step 4: Call monitor.write_events on the list from step 3
        monitor.write_events(events)

 [("Time per step", post-pre, model_engine.global_samples)],<表名,纵轴值,横轴值>

 

通信Logging

注意:加了logging, 所有通信将改为同步,对性能会有伤害。

所有deepspeed.comm下的通信,都将被统计上。

在配置文件里打开:

"comms_logger": {
  "enabled": true,
  "verbose": false,
  "prof_all": true,
  "debug": false
}

verbose: 边跑,边把发生的通信,一条条写下来。例:

[2022-06-26 01:39:55,722] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 | comm op: reduce_scatter_tensor | time (ms): 9.46 | msg size: 678.86 MB | algbw (Gbps): 1204.52  | busbw (Gbps): 1129.23
[2022-06-26 01:39:56,470] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 | comm op: all_gather_into_tensor | time (ms): 0.11 | msg size: 6.0 MB | algbw (Gbps): 954.41  | busbw (Gbps): 894.76
[2022-06-26 01:39:56,471] [INFO] [logging.py:69:log_dist] [Rank 0] rank=0 | comm op: all_gather_into_tensor | time (ms): 0.08 | msg size: 6.0 MB | algbw (Gbps): 1293.47  | busbw (Gbps): 1212.63

algbw: algorithm bandwidth, 发生的通信size/实际通信时间;

busbw: 硬件理论带宽;是个固定值;

algbw如果比busbw小太多,说明糟糕,有待进一步优化;

总结式:deepspeed.comm.log_summary()

Comm. Op            Message Size        Count               Total Latency(ms)   Avg Latency(ms)     tput_avg (Gbps)     busbw_avg (Gbps)
broadcast
                    2.0 KB              146                 11.12               0.08                0.43                0.41
                    98.25 MB            1                   8317.12             8317.12             0.20                0.19
reduce_scatter_tensor
                    678.86 MB           40                  602.29              9.69                1468.06             1376.31

展示通信等待时长:

dist.log_summary(show_straggler=True)

 这么计算的:(一次组播通信里,每个rank的完成时间,减去,所有rank里完成最快的,这些"等待"时间,加和到一起)

straggler = sum(t_collectives - allreduce(t_collectives, MIN))

相关推荐

最近更新

  1. TCP协议是安全的吗?

    2024-06-12 08:58:07       16 阅读
  2. 阿里云服务器执行yum,一直下载docker-ce-stable失败

    2024-06-12 08:58:07       16 阅读
  3. 【Python教程】压缩PDF文件大小

    2024-06-12 08:58:07       15 阅读
  4. 通过文章id递归查询所有评论(xml)

    2024-06-12 08:58:07       18 阅读

热门阅读

  1. Unity3D MMORPG背包系统数据获取与通讯详解

    2024-06-12 08:58:07       8 阅读
  2. 设计模式之外观模式

    2024-06-12 08:58:07       10 阅读
  3. pyautogui 等待元素出现的方法

    2024-06-12 08:58:07       10 阅读
  4. app-ios 内嵌h5的缓存问题

    2024-06-12 08:58:07       5 阅读
  5. 简单聊聊Vue

    2024-06-12 08:58:07       10 阅读
  6. 微信小程序·审核

    2024-06-12 08:58:07       9 阅读