[Pitfall] Investigating Why Creating a Sparse Matrix in PyTorch Uses Too Much Memory

Please credit the source when reposting: 小锋学长生活大爆炸 [xfxuezhagn.cn]

If this post helped you, feel free to like, bookmark, and follow~


Contents

Reproducing the Problem

Root Cause Analysis

Solution

Closing Thoughts


Reproducing the Problem

        I created a sparse matrix in COO format. According to the size formula, it should only take about 5120 MB of memory.
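        As a rough illustration of where that estimate comes from: a COO tensor stores an nnz-length values tensor plus a 2 x nnz int64 index tensor. The numbers below (2^28 nonzeros, float32 values, int64 indices) are just one illustrative combination that works out to 5120 MB:

# Illustrative only: 2^28 nonzeros, float32 values, int64 indices.
nnz = 2 ** 28
values_bytes  = nnz * 4        # float32 values: 4 bytes each
indices_bytes = 2 * nnz * 8    # 2 x nnz int64 index tensor: 8 bytes each
print((values_bytes + indices_bytes) / 1024 ** 2, 'MB')   # -> 5120.0 MB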

        But nvidia-smi showed that it actually occupied 10240 MB.

        I couldn't find any discussion of this online, so once again I had to figure it out on my own, bit by bit.

Root Cause Analysis

        For CUDA memory questions like this, torch.cuda.memory_stats() can be used to inspect the allocator's memory usage:

coo_matrix = sparse_matrix.to_sparse_coo()   # convert the existing sparse matrix to COO format
print(torch.cuda.memory_stats())             # dump the CUDA caching allocator's statistics

        Output:

OrderedDict([('active.all.allocated', 24), ('active.all.current', 6), ('active.all.freed', 18), ('active.all.peak', 8), ('active.large_pool.allocated', 11), ('active.large_pool.current', 6), ('active.large_pool.freed', 5), ('active.large_pool.peak', 8), ('active.small_pool.allocated', 13), ('active.small_pool.current', 0), ('active.small_pool.freed', 13), ('active.small_pool.peak', 2), ('active_bytes.all.allocated', 15313152512), ('active_bytes.all.current', 8598454272), ('active_bytes.all.freed', 6714698240), ('active_bytes.all.peak', 13967163392), ('active_bytes.large_pool.allocated', 15312696832), ('active_bytes.large_pool.current', 8598454272), ('active_bytes.large_pool.freed', 6714242560), ('active_bytes.large_pool.peak', 13967163392), ('active_bytes.small_pool.allocated', 455680), ('active_bytes.small_pool.current', 0), ('active_bytes.small_pool.freed', 455680), ('active_bytes.small_pool.peak', 80896), ('allocated_bytes.all.allocated', 15313152512), ('allocated_bytes.all.current', 8598454272), ('allocated_bytes.all.freed', 6714698240), ('allocated_bytes.all.peak', 13967163392), ('allocated_bytes.large_pool.allocated', 15312696832), ('allocated_bytes.large_pool.current', 8598454272), ('allocated_bytes.large_pool.freed', 6714242560), ('allocated_bytes.large_pool.peak', 13967163392), ('allocated_bytes.small_pool.allocated', 455680), ('allocated_bytes.small_pool.current', 0), ('allocated_bytes.small_pool.freed', 455680), ('allocated_bytes.small_pool.peak', 80896), ('allocation.all.allocated', 24), ('allocation.all.current', 6), ('allocation.all.freed', 18), ('allocation.all.peak', 8), ('allocation.large_pool.allocated', 11), ('allocation.large_pool.current', 6), ('allocation.large_pool.freed', 5), ('allocation.large_pool.peak', 8), ('allocation.small_pool.allocated', 13), ('allocation.small_pool.current', 0), ('allocation.small_pool.freed', 13), ('allocation.small_pool.peak', 2), ('inactive_split.all.allocated', 3), ('inactive_split.all.current', 1), ('inactive_split.all.freed', 2), ('inactive_split.all.peak', 2), ('inactive_split.large_pool.allocated', 1), ('inactive_split.large_pool.current', 1), ('inactive_split.large_pool.freed', 0), ('inactive_split.large_pool.peak', 1), ('inactive_split.small_pool.allocated', 2), ('inactive_split.small_pool.current', 0), ('inactive_split.small_pool.freed', 2), ('inactive_split.small_pool.peak', 1), ('inactive_split_bytes.all.allocated', 20376064), ('inactive_split_bytes.all.current', 12451840), ('inactive_split_bytes.all.freed', 7924224), ('inactive_split_bytes.all.peak', 14548480), ('inactive_split_bytes.large_pool.allocated', 15808000), ('inactive_split_bytes.large_pool.current', 12451840), ('inactive_split_bytes.large_pool.freed', 3356160), ('inactive_split_bytes.large_pool.peak', 12451840), ('inactive_split_bytes.small_pool.allocated', 4568064), ('inactive_split_bytes.small_pool.current', 0), ('inactive_split_bytes.small_pool.freed', 4568064), ('inactive_split_bytes.small_pool.peak', 2096640), ('max_split_size', -1), ('num_alloc_retries', 0), ('num_ooms', 0), ('oversize_allocations.allocated', 0), ('oversize_allocations.current', 0), ('oversize_allocations.freed', 0), ('oversize_allocations.peak', 0), ('oversize_segments.allocated', 0), ('oversize_segments.current', 0), ('oversize_segments.freed', 0), ('oversize_segments.peak', 0), ('requested_bytes.all.allocated', 15313145274), ('requested_bytes.all.current', 8598453372), ('requested_bytes.all.freed', 6714691902), ('requested_bytes.all.peak', 13967161592), 
('requested_bytes.large_pool.allocated', 15312695031), ('requested_bytes.large_pool.current', 8598453372), ('requested_bytes.large_pool.freed', 6714241659), ('requested_bytes.large_pool.peak', 13967161592), ('requested_bytes.small_pool.allocated', 450243), ('requested_bytes.small_pool.current', 0), ('requested_bytes.small_pool.freed', 450243), ('requested_bytes.small_pool.peak', 80000), ('reserved_bytes.all.allocated', 14250147840), ('reserved_bytes.all.current', 14250147840), ('reserved_bytes.all.freed', 0), ('reserved_bytes.all.peak', 14250147840), ('reserved_bytes.large_pool.allocated', 14248050688), ('reserved_bytes.large_pool.current', 14248050688), ('reserved_bytes.large_pool.freed', 0), ('reserved_bytes.large_pool.peak', 14248050688), ('reserved_bytes.small_pool.allocated', 2097152), ('reserved_bytes.small_pool.current', 2097152), ('reserved_bytes.small_pool.freed', 0), ('reserved_bytes.small_pool.peak', 2097152), ('segment.all.allocated', 10), ('segment.all.current', 10), ('segment.all.freed', 0), ('segment.all.peak', 10), ('segment.large_pool.allocated', 9), ('segment.large_pool.current', 9), ('segment.large_pool.freed', 0), ('segment.large_pool.peak', 9), ('segment.small_pool.allocated', 1), ('segment.small_pool.current', 1), ('segment.small_pool.freed', 0), ('segment.small_pool.peak', 1)])

        Let me fast-forward here: we really only need to look at reserved_bytes and active_bytes. active_bytes.all.current is the total amount of memory currently in active use; in the output above it is 8598454272 bytes, roughly 8200 MB. reserved_bytes.all.current is the total amount of memory currently reserved; in the output it is 14250147840 bytes, roughly 13590 MB.

        So clearly, the extra memory usage is actually caused by reserved_bytes. The two counters mean the following (a short snippet for reading them directly follows the list):

  • Active memory: the amount of GPU memory currently in use, i.e. memory that has been allocated and is actively being used by live tensors.
  • Reserved memory: GPU memory that has been allocated from the device but is not currently in use. It may be kept around for future allocations, or it may be temporarily unusable for new requests because of fragmentation. In general, the total reserved memory is adjusted dynamically by the caching allocator based on real-time usage and its allocation policy, with the goal of making allocation and deallocation faster and more stable.
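        For reference, here is a minimal sketch of pulling just these two counters out of the stats dictionary; the keys are the same ones shown in the output above, and torch.cuda.memory_allocated() / torch.cuda.memory_reserved() are the convenience wrappers used later in this post:

import torch

stats = torch.cuda.memory_stats()
print('active:  ', stats['active_bytes.all.current'])     # bytes held by live tensors
print('reserved:', stats['reserved_bytes.all.current'])   # bytes held by the caching allocator
# The convenience wrappers report the same picture:
print('allocated:', torch.cuda.memory_allocated())
print('reserved: ', torch.cuda.memory_reserved())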

Solution

        Once the cause is known, the fix is easy: clear the cache with torch.cuda.empty_cache() to give this reserved-but-unused memory back to the device:

coo_matrix = sparse_matrix.to_sparse_coo()
print('memory_allocated: ', torch.cuda.memory_allocated())   # bytes held by live tensors
print('memory_reserved: ', torch.cuda.memory_reserved())     # bytes held by the caching allocator
torch.cuda.empty_cache()                                      # release unused cached blocks back to the device
print('empty_cache done!')
print('memory_allocated: ', torch.cuda.memory_allocated())
print('memory_reserved: ', torch.cuda.memory_reserved())

Output:

memory_allocated:  8598454272
memory_reserved:  14250147840
empty_cache done!
memory_allocated:  8598454272
memory_reserved:  8613003264

        As you can see, the excess reserved memory has been released: memory_reserved dropped from 14250147840 bytes (about 13590 MB) to 8613003264 bytes (about 8214 MB), which is now only slightly above memory_allocated.


Closing Thoughts

        1. There may well be other approaches; feel free to discuss in the comments~

        2. Unless there will be no more GPU memory allocations afterwards, it is actually better to keep this reserved memory. For example, in the code below that creates two matrices back to back, the second allocation does not have to request new memory from the device; it can be served from the reserved pool instead, which is a bit more efficient:

A = create_dense_matrix(size, device=env.device)
B = create_dense_matrix(size, device=env.device)   # can reuse a block already held by the caching allocator instead of a fresh cudaMalloc
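        A runnable sketch of that behaviour (create_dense_matrix, size and env.device above are just placeholders for whatever allocation you perform; here plain torch.randn is used, and the first tensor is freed so the reuse is easy to observe in isolation):

import torch

a = torch.randn(4096, 4096, device='cuda')   # first allocation: the caching allocator reserves a new block
print('reserved after A:', torch.cuda.memory_reserved())
del a                                         # the block returns to the caching allocator, not to the driver
b = torch.randn(4096, 4096, device='cuda')    # same-size allocation: served from the reserved pool
print('reserved after B:', torch.cuda.memory_reserved())   # typically unchanged, no extra cudaMalloc needed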
