Server - YAML Configuration for Running a PyTorchJob on Kubernetes (K8S)

Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://blog.csdn.net/caroline_wendy/article/details/136499768

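PyTorchJob is a Kubernetes custom resource for running PyTorch training jobs on Kubernetes. It is part of Kubeflow and has stable status. With PyTorchJob, PyTorch jobs can be created and managed like any built-in Kubernetes resource. To use PyTorchJob, the PyTorch Operator must be installed first; by default it is deployed as a controller within the training operator.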

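The YAML configuration is shown below, where:

  • kind is PyTorchJob
  • metadata/name is the name of the Job; it must not duplicate an existing Job's name
  • Nodes use the Worker role; replicas is the number of replicated nodes, and resources sets the number of GPUs per node, so both 2 machines × 1 GPU and 1 machine × 2 GPUs are supported
  • command is the command to run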

Source:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorch-simple-001
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 1
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            file-mount: "true"
            user-mount: "true"
        spec:
          # hostNetwork: false  # New
          containers:
            - name: pytorch
              command:
                - /bin/sh
                - -cl
                - "bash k8s/run_grid0_for_gpu1.sh > nohup.test.log 2>&1"
              image: "harbor.[xxx].com/cryoem:v1.3.1"
              imagePullPolicy: Always
              securityContext:  # New
                privileged: false
                capabilities:
                  add: [ "IPC_LOCK" ]
              resources:
                limits:
                  rdma/hca: 1
                  cpu: 12
                  memory: "100G"
                  nvidia.com/gpu: 2
              workingDir: "workspace/cryoem-project/"
              volumeMounts:
                - name: cache-volume  # change the name to your volume on k8s
                  mountPath: /dev/shm
          nodeSelector:
            gpu.device: "a100"  # support 'a10' or 'a100'
            group: "algo2"
          tolerations:
            - effect: NoSchedule
              key: role
              operator: Equal
              value: "algo2"
          volumes:
            - name: cache-volume  # change the name to your volume on k8s
              emptyDir:
                medium: Memory
                # Memory-backed emptyDir usage counts toward the container's memory limit
                sizeLimit: "960G"
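
The relationship between replicas and the GPU limit can be made concrete with a small script. The following is only a minimal sketch, not part of the original setup: build_pytorchjob and total_gpus are hypothetical helpers, the job name pytorch-simple-002 is made up for illustration, and only the fields discussed above are filled in (mounts, tolerations, and node selectors are omitted). It builds the same Worker-only layout for both 1 machine × 2 GPUs and 2 machines × 1 GPU.

```python
import json


def build_pytorchjob(name, replicas, gpus_per_worker, image):
    """Return a stripped-down PyTorchJob manifest as a plain dict.

    Hypothetical helper: it mirrors the structure of the YAML above but
    keeps only name, replicas, image, and the GPU limit.
    """
    container = {
        "name": "pytorch",
        "image": image,
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Worker": {
                    "replicas": replicas,
                    "template": {"spec": {"containers": [container]}},
                }
            }
        },
    }


def total_gpus(manifest):
    """Total GPUs requested = replicas x GPUs per Worker Pod."""
    worker = manifest["spec"]["pytorchReplicaSpecs"]["Worker"]
    limits = worker["template"]["spec"]["containers"][0]["resources"]["limits"]
    return worker["replicas"] * limits["nvidia.com/gpu"]


# Placeholder registry host, copied as-is from the article.
IMAGE = "harbor.[xxx].com/cryoem:v1.3.1"

one_node_two_gpus = build_pytorchjob("pytorch-simple-001", 1, 2, IMAGE)
two_nodes_one_gpu = build_pytorchjob("pytorch-simple-002", 2, 1, IMAGE)  # made-up name

if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(two_nodes_one_gpu, indent=2))
```

Both layouts request two GPUs in total; they differ only in whether the GPUs sit in one Pod or are spread across two. Whether a multi-replica, Worker-only job is wired up correctly for distributed training depends on the operator version and is not covered by this sketch.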

Check the run status:

kubectl get pytorchjobs
# kubectl delete pytorchjobs pytorch-simple-001
kubectl get pods
kubectl exec -it -n [your name] pytorch-simple-001-worker-0 -- bash
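
The Pod name used above follows the pattern job name, then -worker-, then the replica index. As a minimal sketch (worker_pod_name and exec_command are hypothetical helpers, and my-namespace is a placeholder), the exec command for any Worker can be assembled like this:

```python
import shlex


def worker_pod_name(job_name, index):
    """Pod name pattern seen above: <job name>-worker-<index>."""
    return f"{job_name}-worker-{index}"


def exec_command(namespace, job_name, index=0, shell="bash"):
    """Build the argv for opening an interactive shell in a Worker Pod."""
    pod = worker_pod_name(job_name, index)
    # `--` separates kubectl's own flags from the command run in the Pod.
    return ["kubectl", "exec", "-it", "-n", namespace, pod, "--", shell]


# "my-namespace" is a placeholder; substitute your own namespace.
cmd = exec_command("my-namespace", "pytorch-simple-001")
print(shlex.join(cmd))
# prints: kubectl exec -it -n my-namespace pytorch-simple-001-worker-0 -- bash
```

With replicas greater than 1, passing index=1 and so on targets the other Worker Pods.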

Result (nvidia-smi output):

Thu Mar  7 07:39:13 2024       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A800-SXM...  On   | 00000000:58:00.0 Off |                    0 |
| N/A   52C    P0   259W / 400W |   7833MiB / 81920MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A800-SXM...  On   | 00000000:D0:00.0 Off |                    0 |
| N/A   52C    P0   235W / 400W |  12917MiB / 81920MiB |     93%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
