Document Deduplication (TF-IDF, MinHash, SimHash)

Two docs are similar in some parts and different in others; how do we measure that degree of similarity?

Computing the Jaccard distance directly between every pair of docs is too expensive.
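To make the cost concrete, here is a minimal sketch of the exact computation on word sets. The `jaccard` helper and the three sample documents are illustrative assumptions, not from the original notes. Jaccard distance is 1 minus the similarity computed here.

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)

# Hypothetical sample documents
docs = [
    "the quick brown fox",
    "the quick brown dog",
    "a completely different text",
]
word_sets = [set(d.split()) for d in docs]

# Every pair is compared, so n documents need n * (n - 1) / 2 set comparisons,
# and each comparison touches every word of both documents.
pairs = {(i, j): jaccard(word_sets[i], word_sets[j])
         for i, j in combinations(range(len(docs)), 2)}
```

The pairwise loop is the problem: the number of comparisons grows quadratically with the number of documents, which is what the signature-based methods below try to make cheaper.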

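TF-IDF: the product TF * IDF.

TF: the number of times the word occurs in the document.

IDF: let DF be the number of documents, out of N in total, that contain the word; then IDF = lg(N / (1 + DF)).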
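A small worked sketch of the formula above. It assumes that lg means the base-10 logarithm and that TF is the raw count; the `tf_idf` function and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Per-document TF-IDF weights using TF * lg(N / (1 + DF))."""
    n = len(docs)
    # DF: number of documents that contain the word at least once
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)  # TF: raw count of the word in this document
        scores.append({w: tf[w] * math.log10(n / (1 + df[w])) for w in tf})
    return scores

# Hypothetical toy corpus with N = 4 documents
corpus = [
    "apple banana apple".split(),
    "banana cherry".split(),
    "cherry durian".split(),
    "durian eggplant".split(),
]
weights = tf_idf(corpus)
# "apple": TF = 2, DF = 1, so the weight is 2 * lg(4 / 2)
# "banana": TF = 1, DF = 2, so the weight is 1 * lg(4 / 3)
```

With the 1 + DF smoothing, a word that appears in N - 1 documents gets IDF 0, and a word that appears in all N documents gets a slightly negative IDF.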

MinHash

Code:

import random
from typing import List, Set, Tuple

class MinHash:
    def __init__(self, num_hashes: int = 100):
        self.num_hashes = num_hashes
        self.hash_functions = self._generate_hash_functions()

    def _generate_hash_functions(self) -> List[Tuple[int, int, int]]:
        """Generate hash functions of the form (a*x + b) % c."""
        max_prime = 2**31 - 1  # Mersenne prime
        # Keep a in [1, c-1] and b in [0, c-1]; a == c would reduce the hash to the constant b
        return [(random.randint(1, max_prime - 1), random.randint(0, max_prime - 1), max_prime)
                for _ in range(self.num_hashes)]

    def _minhash(self, document: Set[str]) -> List[int]:
        """Compute MinHash signature for a document."""
        signature = [float('inf')] * self.num_hashes
        for word in document:
            for i, (a, b, c) in enumerate(self.hash_functions):
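                # Note: the built-in hash() of a str is salted per process, so
                # signatures are only comparable within a single run
                hash_value = (a * hash(word) + b) % c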
                signature[i] = min(signature[i], hash_value)
        return signature

    def jaccard_similarity(self, sig1: List[int], sig2: List[int]) -> float:
        """Estimate Jaccard similarity from MinHash signatures."""
        return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / self.num_hashes

    def deduplicate(self, documents: List[str], threshold: float = 0.5) -> List[str]:
        """Deduplicate documents based on MinHash similarity."""
        # Preprocess documents into sets of words
        doc_sets = [set(doc.lower().split()) for doc in documents]
        
        # Compute MinHash signatures for all documents
        signatures = [self._minhash(doc_set) for doc_set in doc_sets]
        
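        # Find unique documents; remember the index of each kept document so
        # that later documents are compared against the right signatures
        unique_docs = []
        unique_indices = []
        for i, doc in enumerate(documents):
            is_duplicate = False
            for j in unique_indices:
                if self.jaccard_similarity(signatures[i], signatures[j]) >= threshold:
                    is_duplicate = True
                    break
            if not is_duplicate:
                unique_docs.append(doc)
                unique_indices.append(i)

        return unique_docs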

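Use 100 hash functions.

For each hash function, hash every word in the doc and record the smallest of those hash values.

Each doc thus gets 100 minimum hash values; this 100-dimensional vector is its signature.

The similarity of two docs is the number of dimensions in which their signatures agree, divided by 100; this fraction is an estimate of their Jaccard similarity.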
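The steps above can be sketched in a few lines and compared with the exact Jaccard value. This is an illustrative sketch, not the class above: `stable_hash`, `signature`, the fixed seed, and the two sample documents are assumptions. It uses an MD5-based word hash because Python's built-in `hash()` for strings changes between runs.

```python
import hashlib
import random
from typing import List, Set, Tuple

PRIME = 2**31 - 1  # modulus of the (a*x + b) % c hash family

def stable_hash(word: str) -> int:
    """Deterministic 32-bit hash of a word, taken from the first 4 MD5 bytes."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:4], "big")

def signature(words: Set[str], params: List[Tuple[int, int]]) -> List[int]:
    """One minimum per hash function; assumes `words` is non-empty."""
    return [min((a * stable_hash(w) + b) % PRIME for w in words) for a, b in params]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
params = [(rng.randint(1, PRIME - 1), rng.randint(0, PRIME - 1)) for _ in range(100)]

doc_a = set("the quick brown fox jumps over the lazy dog".split())
doc_b = set("the quick brown fox jumps over the sleeping dog".split())

sig_a, sig_b = signature(doc_a, params), signature(doc_b, params)

# Fraction of the 100 positions where the two signatures agree
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(params)
# Exact value for comparison: 7 shared words out of 9 distinct words
exact = len(doc_a & doc_b) / len(doc_a | doc_b)
```

The estimate is a random quantity that should land near the exact value; more hash functions reduce its variance at the cost of longer signatures.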

SimHash

Reference: "MinHash and SimHash" (CSDN blog)

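SimHash targets deduplication of massive text collections while tolerating a certain amount of noise. Take the N highest-weighted words (or phrases) in a document and hash each one. Map every bit of the hash to +1 (bit 1) or -1 (bit 0) and multiply by the word's weight. Add the N resulting vectors position by position, then decode the sums back to bits (positive gives 1, negative gives 0) to obtain the document's fingerprint. The distance between two documents is the Hamming distance between their fingerprints; if it is below a threshold (for example 3), the two documents are considered similar.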
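A minimal sketch of the weighted variant just described. The word weights are assumed to be given (for example from TF-IDF); the `weighted_simhash` and `hamming` helpers and the sample weights are hypothetical and use MD5 as the per-word hash.

```python
import hashlib
from typing import Dict

def weighted_simhash(weights: Dict[str, float], bits: int = 64) -> int:
    """Fingerprint built from {word: weight}; supports up to 128 bits (MD5 size)."""
    totals = [0.0] * bits
    for word, weight in weights.items():
        h = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest(), "big")
        for i in range(bits):
            # Bit 1 contributes +weight, bit 0 contributes -weight
            totals[i] += weight if (h >> i) & 1 else -weight
    # Decode: a positive total becomes bit 1, anything else becomes bit 0
    return sum(1 << i for i in range(bits) if totals[i] > 0)

def hamming(x: int, y: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")

# Hypothetical top-N words with made-up weights
doc1 = {"quick": 3.0, "brown": 2.0, "fox": 2.5, "lazy": 1.0, "dog": 1.5}
doc2 = {"quick": 3.0, "brown": 2.0, "fox": 2.5, "sleeping": 1.0, "dog": 1.5}

fp1, fp2 = weighted_simhash(doc1), weighted_simhash(doc2)
is_near_duplicate = hamming(fp1, fp2) < 3  # threshold from the text
```

Because each word votes with its weight, replacing a low-weight word can only flip the bits where the remaining votes are nearly tied, which is why small edits tend to leave most of the fingerprint unchanged.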

import hashlib
from typing import List, Tuple

class SimHash:
    def __init__(self, hash_bits: int = 64):
        self.hash_bits = hash_bits

    def _string_hash(self, text: str) -> int:
        """Create a hash for a given string."""
        return int(hashlib.md5(text.encode('utf-8')).hexdigest(), 16)

    def _create_shingles(self, text: str, k: int = 2) -> List[str]:
        """Create k-shingles from the text."""
        return [text[i:i+k] for i in range(len(text) - k + 1)]

    def _compute_simhash(self, features: List[str]) -> int:
        """Compute the SimHash of a list of features."""
        v = [0] * self.hash_bits
        
        for feature in features:
            feature_hash = self._string_hash(feature)
            for i in range(self.hash_bits):
                bitmask = 1 << i
                if feature_hash & bitmask:
                    v[i] += 1
                else:
                    v[i] -= 1
        
        fingerprint = 0
        for i in range(self.hash_bits):
            if v[i] >= 0:
                fingerprint |= 1 << i
        
        return fingerprint

    def _hamming_distance(self, hash1: int, hash2: int) -> int:
        """Compute the Hamming distance between two hashes."""
        xor = hash1 ^ hash2
        return bin(xor).count('1')

    def compute_similarity(self, hash1: int, hash2: int) -> float:
        """Compute the similarity between two SimHashes."""
        distance = self._hamming_distance(hash1, hash2)
        return 1 - (distance / self.hash_bits)

    def deduplicate(self, documents: List[str], threshold: float = 0.9) -> List[Tuple[str, int]]:
        """Deduplicate documents based on SimHash similarity."""
        unique_docs = []
        
        for doc in documents:
            shingles = self._create_shingles(doc.lower())
            doc_hash = self._compute_simhash(shingles)
            
            is_duplicate = False
            for unique_doc, unique_hash in unique_docs:
                if self.compute_similarity(doc_hash, unique_hash) >= threshold:
                    is_duplicate = True
                    break
            
            if not is_duplicate:
                unique_docs.append((doc, doc_hash))
        
        return unique_docs

# Example usage
if __name__ == "__main__":
    simhash = SimHash(hash_bits=64)
    
    documents = [
        "The quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the sleeping dog",
        "The lazy dog is sleeping",
        "A completely different document"
    ]
    
    unique_docs = simhash.deduplicate(documents, threshold=0.7)
    
    print("Original documents:")
    for doc in documents:
        print(f"- {doc}")
    
    print("\nUnique documents:")
    for doc, _ in unique_docs:
        print(f"- {doc}")
