How to Find Duplicate Files on Your Computer with Python

2024-03-30 21:10:02

I. Use Case

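A computer can accumulate a large number of duplicate files. Chat applications in particular tend to download the same attachments automatically again and again, and those copies keep taking up disk space. Because the copies cannot easily be tracked down one keyword at a time, it is hard to find and delete them to free up space.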

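Duplicates can be found by computing a hash of each file's contents and comparing the hashes. The script only needs a directory to search: it walks that directory, computes a hash for every file, and groups the paths of files whose hashes match. Files with identical contents are found quickly this way, whatever their names are. Each hash is displayed together with the paths of its duplicate files, which makes follow-up actions such as deletion easier. The code is as follows: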

import os
import hashlib
from collections import defaultdict
from typing import List, Dict, Any, Tuple

def get_file_hash(file_path: str, block_size: int = 65536) -> str:
    """Return the SHA-256 hash of a file's contents."""
    hasher = hashlib.sha256()
    with open(file_path, 'rb') as file:
        # Read in fixed-size blocks so large files are not loaded into memory at once.
        while True:
            data = file.read(block_size)
            if not data:
                break
            hasher.update(data)
    return hasher.hexdigest()

def find_duplicate_files(starting_directory: str) -> List[Tuple[str, List[str]]]:
    """Find duplicate files under starting_directory."""
    # Maps each content hash to the paths of all files that produce it.
    file_hashes: Dict[str, List[str]] = defaultdict(list)

    for root, _, files in os.walk(starting_directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            try:
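
The listing breaks off at the `try:` statement. One plausible way to finish the idea described earlier is sketched below; it is not the original author's code. The body of the `try` block, the `OSError` handling, the sorted output, and the `print_duplicates` helper are all assumptions made for illustration.

```python
import hashlib
import os
from collections import defaultdict
from typing import Dict, List, Tuple


def get_file_hash(file_path: str, block_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size blocks."""
    hasher = hashlib.sha256()
    with open(file_path, "rb") as file:
        # iter() with a sentinel stops at the empty bytes object returned at EOF.
        for data in iter(lambda: file.read(block_size), b""):
            hasher.update(data)
    return hasher.hexdigest()


def find_duplicate_files(starting_directory: str) -> List[Tuple[str, List[str]]]:
    """Return (hash, paths) pairs for every hash shared by two or more files."""
    file_hashes: Dict[str, List[str]] = defaultdict(list)

    for root, _, files in os.walk(starting_directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            try:
                file_hashes[get_file_hash(file_path)].append(file_path)
            except OSError as error:
                # Assumption: unreadable files are reported and skipped.
                print(f"Skipping {file_path}: {error}")

    # Keep only hashes that occur more than once; sort paths for stable output.
    return [
        (file_hash, sorted(paths))
        for file_hash, paths in file_hashes.items()
        if len(paths) > 1
    ]


def print_duplicates(duplicates: List[Tuple[str, List[str]]]) -> None:
    """Hypothetical helper: print each hash followed by its duplicate paths."""
    for file_hash, paths in duplicates:
        print(f"Hash: {file_hash}")
        for path in paths:
            print(f"  {path}")
```

Calling `print_duplicates(find_duplicate_files(some_directory))` would then list every group of files with identical contents. Hashing every file is simple but reads the whole directory tree; on large trees, grouping files by size first and hashing only same-size files is a common way to reduce the work.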

Original article: https://blog.csdn.net/helloshili2011/article/details/137154377
