PyTorch distributed training: basic concepts of Node, Worker, and Rank

The basic concepts and parameters used in distributed training are defined as follows:

Definitions

  1. Node - A physical instance or a container; maps to the unit that the job manager works with.

  2. Worker - A worker in the context of distributed training.

  3. WorkerGroup - The set of workers that execute the same function (e.g. trainers).

  4. LocalWorkerGroup - A subset of the workers in the worker group running on the same node.

  5. RANK - The rank of the worker within a worker group.

  6. WORLD_SIZE - The total number of workers in a worker group.

  7. LOCAL_RANK - The rank of the worker within a local worker group.

  8. LOCAL_WORLD_SIZE - The size of the local worker group.

  9. rdzv_id - A user-defined id that uniquely identifies the worker group for a job. This id is used by each node to join as a member of a particular worker group.

  10. rdzv_backend - The backend of the rendezvous (e.g. c10d). This is typically a strongly consistent key-value store.

  11. rdzv_endpoint - The rendezvous backend endpoint; usually in form <host>:<port>.
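The three rdzv_* parameters are passed to torchrun on the command line. As a rough sketch, the snippet below assembles such a command in Python. The job id "job42", the endpoint "10.0.0.1:29400", and the script name "train.py" are hypothetical placeholders, and build_torchrun_cmd is an illustrative helper, not part of PyTorch.

```python
import shlex


def build_torchrun_cmd(nnodes, nproc_per_node, rdzv_id, rdzv_endpoint,
                       script, rdzv_backend="c10d"):
    """Assemble a torchrun command line as a list of arguments.

    Hypothetical helper for illustration; it only formats strings and
    does not launch anything.
    """
    return [
        "torchrun",
        f"--nnodes={nnodes}",                  # number of nodes (machines)
        f"--nproc-per-node={nproc_per_node}",  # LOCAL_WORLD_SIZE on each node
        f"--rdzv-id={rdzv_id}",                # same id on every node of the job
        f"--rdzv-backend={rdzv_backend}",
        f"--rdzv-endpoint={rdzv_endpoint}",    # <host>:<port>
        script,
    ]


# Placeholder values: 2 machines with 8 workers each.
cmd = build_torchrun_cmd(2, 8, "job42", "10.0.0.1:29400", "train.py")
print(shlex.join(cmd))
```

Every node in the job would run the same command with the same rdzv_id and rdzv_endpoint, which is how the nodes find each other and join one worker group.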

A Node runs LOCAL_WORLD_SIZE workers, which together form a LocalWorkerGroup. The union of the LocalWorkerGroups on all nodes in the job forms the WorkerGroup.
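When a script is launched with torchrun, each worker can read RANK, WORLD_SIZE, LOCAL_RANK, and LOCAL_WORLD_SIZE from its environment. A minimal sketch follows; the function name read_dist_env and the fallback defaults (a single worker with rank 0) are assumptions made here so the script also runs without a launcher.

```python
import os


def read_dist_env(env=None):
    """Return the rank/size values of this worker as integers.

    `env` defaults to os.environ. The fallbacks describe a single,
    non-distributed process and are an assumption of this sketch.
    """
    env = os.environ if env is None else env
    return {
        "RANK": int(env.get("RANK", 0)),
        "WORLD_SIZE": int(env.get("WORLD_SIZE", 1)),
        "LOCAL_RANK": int(env.get("LOCAL_RANK", 0)),
        "LOCAL_WORLD_SIZE": int(env.get("LOCAL_WORLD_SIZE", 1)),
    }


info = read_dist_env()
print(info)
```

LOCAL_RANK is the value typically used to pick the GPU on the current machine, while RANK identifies the worker across the whole job.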

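In plain terms:

Node: usually one machine; the number of nodes is the number of machines in the job.

Worker: a single training process.

WORLD_SIZE: the total number of training processes, usually equal to the total number of GPUs across all machines (typically one training process per GPU).

RANK: the index of each worker, identifying every training process across all machines.

LOCAL_RANK: the index of a worker within one machine; for example, the workers on an 8-GPU machine have local ranks 0-7.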
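The relation between RANK and LOCAL_RANK can be illustrated with a small calculation. This sketch assumes that every machine runs the same number of workers and that ranks are numbered machine by machine; in a real job the launcher assigns the ranks, so treat the formula as an illustration rather than a guarantee. The names global_rank and node_rank are introduced here for the example.

```python
def global_rank(node_rank, local_rank, local_world_size):
    """Global RANK under the assumed layout: ranks are contiguous per node.

    node_rank is the index of the machine (0-based); this layout is an
    assumption of the sketch, not something the launcher promises.
    """
    if not 0 <= local_rank < local_world_size:
        raise ValueError("local_rank must be in [0, local_world_size)")
    return node_rank * local_world_size + local_rank


# Two 8-GPU machines: the worker with LOCAL_RANK 3 on the second machine.
print(global_rank(node_rank=1, local_rank=3, local_world_size=8))
```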

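Summary:

One node (one machine) runs LOCAL_WORLD_SIZE workers, and these workers form a LocalWorkerGroup.

The LocalWorkerGroups on all machines together form the WorkerGroup.

PS: "Local" refers to the per-machine version of a concept. When there is only one machine, the local values equal the non-local ones (LOCAL_RANK equals RANK, and LOCAL_WORLD_SIZE equals WORLD_SIZE).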
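The single-machine case can be checked with a short enumeration. The sketch below lists every worker for a given number of machines, again assuming equal worker counts per machine and machine-by-machine rank numbering; worker_layout is a hypothetical helper written for this example.

```python
def worker_layout(nnodes, nproc_per_node):
    """Enumerate all workers of a job under the assumed contiguous layout."""
    world_size = nnodes * nproc_per_node
    return [
        {
            "node": node,
            "LOCAL_RANK": local_rank,
            "RANK": node * nproc_per_node + local_rank,
            "LOCAL_WORLD_SIZE": nproc_per_node,
            "WORLD_SIZE": world_size,
        }
        for node in range(nnodes)
        for local_rank in range(nproc_per_node)
    ]


# One machine: every local value matches its global counterpart.
single = worker_layout(nnodes=1, nproc_per_node=8)
print(all(w["RANK"] == w["LOCAL_RANK"] for w in single))

# Two machines: the values diverge on the second machine.
double = worker_layout(nnodes=2, nproc_per_node=8)
print([w["RANK"] for w in double if w["node"] == 1])
```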

Reference:

torchrun (Elastic Launch) — PyTorch 2.1 documentation
