Python跑pytorch程序抢占公共GPU自动运行脚本

2024-01-12 14:56:03
开发
31

问题描述

当我们有一个服务器，服务器上面有4-5个GPU，那么我们需要时刻看哪个GPU空着，当发现服务器空闲了，我们就可以跑自己的深度学习了。

然而，人盯着总是费时费力的，所以可以让Python看到哪个GPU空闲就插进去吗？
进行下面步骤即可。

第一步、安装GPU信息查看包

名字为：nvidia_ml_py-12.535.133-py3-none-any.whl，
下载地址：https://pypi.org/project/nvidia-ml-py/
在这里插入图片描述

当然，如果网络足够好，可以直接利用pip安装：

pip install nvidia-ml-py

第二步、编码select_gpu

即判定抢占GPU的代码

import torch
from pynvml import *
import time
import sys


# count传GPU个数，threshold是阈值，低于此阈值说明GPU是空闲的，second是每几秒进行继续轮训
def select_gpu(count=torch.cuda.device_count(), threshold=1024, second=5):
    nvmlInit()
    if count == 0:
        return 'cpu'
    # 需要在多个GPU轮训找出空闲的GPU
    current = 0
    while True:
        # 检查当前GPU是否可用
        handle = nvmlDeviceGetHandleByIndex(current)
        info = nvmlDeviceGetMemoryInfo(handle)
        used_memory = info.used // (1024 * 1024)
        if used_memory < threshold:  # 如果刺入小于阈值的内存，那么说明此GPU并没有被占用，可抢占
            sys.stderr.write(
                f'此时GPU{
     current}使用内存为[{
     used_memory}MB]，低于阈值[{
     threshold}]才可抢占----GPU{
     current}可抢占，将抢占GPU：{
     current}号GPU----{
     time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())}\n')
            nvmlShutdown()
            return current
        else:
            sys.stderr.write(
                f'此时GPU{
     current}使用内存为[{
     used_memory}MB]，低于阈值[{
     threshold}]才可抢占----GPU{
     current}不可抢占，继续轮训----{
     time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())}\n')
        time.sleep(second)
        current = (current + 1) % count

select_gpu共三个参数
select_gpu 用于抢占GPU
参数1：count传GPU个数，默认传torch.cuda.device_count()，即存在的GPU个数
参数2：threshold是阈值，低于此阈值说明GPU是空闲的，默认1024MB [GPU一般什么都不跑，也会被占用几十MB]
参数3：second是每几秒进行继续轮训，默认5秒

返回值：为可以选用GPU的编号。

第三步、编写运行深度学习代码，以YOLOv8为例

from ultralytics import YOLO
from select_gpu_util import select_gpu
import sys

if __name__ == '__main__':
    sys.stderr.write('程序已运行\n')
    # select_gpu 用于抢占GPU
    # 参数1：count传GPU个数，默认传torch.cuda.device_count()，即存在的GPU个数
    # 参数2：threshold是阈值，低于此阈值说明GPU是空闲的，默认1024MB [GPU一般什么都不跑，也会被占用几十MB]
    # 参数3：second是每几秒进行继续轮训，默认5秒
    device = select_gpu(threshold=3000)  # 获取能够抢占的GPU
    model = YOLO("ultralytics/cfg/models/v8/yolov8.yaml")  # build a new model from scratch
    # Train the model
    if sys.platform.startswith('win'):  # Windows环境下
        results = model.train(data="YAML/xxx.yaml", epochs=200, device=device, workers=0, batch=1)
    else:  # Linux系统环境下
        results = model.train(data="YAML/xxx.yaml", epochs=200, device=device, workers=2, batch=8)

# 温馨提示：启动命令为 nohup python -u train_shwd+GAM.py >shwdlog/gamdaoer.log &
#                                 -u的作用是不设置缓冲区，让所有的文本直接输出到log

第四步、运行代码，查看运行效果

运行一般让程序不挂起、并且后台运行，运行命令为：

nohup python -u train_shwd+GAM.py >shwdlog/gamdaoer.log &

-u的作用是不设置缓冲区，让所有的文本直接输出到log，若不-u会很难受，不信自行尝试
效果图：
在这里插入图片描述

原文地址:https://blog.csdn.net/qq_29762001/article/details/135550702 本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：https://www.suanlizi.com/kf/1745701153825492992.html 如若内容造成侵权/违法违规/事实不符，请联系《酸梨子》网邮箱：1419361763@qq.com进行投诉反馈，一经查实，立即删除！

阅读全部