Implementing a Spell Checker with Bayes' Theorem

2024-07-21 15:10:04


We want to solve: $\arg\max_c P(c \mid W) = \arg\max_c \frac{P(W \mid c)\,P(c)}{P(W)}$

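P(c): the probability that the correctly spelled word c appears in a text, that is, how common c is in English writing.
P(W|c): the probability that the user types W when they meant to type c, that is, how likely c is to be mistyped as W.
argmax: enumerates every possible c and selects the one with the highest probability.
Because P(W) is the same for every candidate c, it does not change which c wins and can be ignored.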
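
To make the argmax concrete, here is a minimal sketch with made-up numbers. The typed word `thew`, the two candidates, and every probability below are hypothetical values chosen only for illustration; they do not come from a real corpus.

```python
# Toy illustration of argmax_c P(W|c) * P(c) for the typed word W = 'thew'.
# All numbers are hypothetical, picked only to show the mechanics.
prior = {'the': 0.02, 'thaw': 0.00001}        # P(c): how common each candidate is
likelihood = {'the': 0.0001, 'thaw': 0.001}   # P(W|c): chance of typing 'thew' for c

def score(c):
    # P(W) is omitted because it is identical for every candidate c
    return likelihood[c] * prior[c]

best = max(prior, key=score)
print(best)
```

With these numbers 'thaw' is the likelier source of the typo, but 'the' is so much more frequent that it still gets the higher score.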

import collections
import re

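# Extract every word from the corpus: lowercase the text and keep only runs of
# letters, so punctuation and other special characters are dropped
def words(text):
    return re.findall('[a-z]+', text.lower())

# Count how often each word occurs; unseen words default to a count of 1
def train(features):
    model = collections.defaultdict(lambda: 1)

    for f in features:
        model[f] += 1

    return model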

NWORDS = train(words(open('big.txt').read()))
print(NWORDS)
"""
defaultdict(<function train.<locals>.<lambda> at 0x00000106C4B1B0D0>, {'throughout': 2, 'my': 8, 'time': 2, 'in': 9, 'college': 4, 'several': 2, 'memorable': 2, 'event': 4, 'left': 2, 'a': 9, 'significant': 3, 'impact': 2, 'on': 4, 'life': 3, 'however': 2, 'one': 4, 'particular': 3, 'stands': 2, 'out': 2, 'as': 5, 'the': 20, 'most': 3, 'unforgettable': 2, 'our': 2, 'annual': 3, 'university': 2, 'cultural': 7, 'festival': 8, 'this': 2, 'vibrant': 2, 'week': 2, 'long': 2, 'celebration': 2, 'not': 5, 'only': 5, 'showcased': 2, 'diverse': 4, 'talents': 4, 'and': 20, 'creativity': 2, 'of': 9, 'student': 2, 'body': 2, 'but': 5, 'also': 5, 'provide': 2, 'unique': 2, 'platform': 2, 'for': 3, 'students': 7, 'from': 6, 'various': 4, 'backgrounds': 2, 'interests': 3, 'to': 9, 'connect': 2, 'collaborate': 2, 'learn': 2, 'another': 2, 'was': 4, 'high': 2, 'anticipated': 2, 'campus': 4, 'drawing': 2, 'large': 2, 'crowds': 2, 'enthusiastic': 2, 'faculty': 2, 'even': 2, 'visitors': 2, 'other': 3, 'institutions': 2, 'atmosphere': 2, 'electric': 2, 'with': 2, 'performances': 3, 'competitions': 3, 'workshops': 4, 'taking': 2, 'place': 2, 'simultaneous': 2, 'across': 2, 'captivating': 2, 'dance': 2, 'shows': 2, 'soul': 2, 'stirring': 2, 'music': 2, 'thought': 2, 'provoking': 2, 'debates': 2, 'mesmerizing': 2, 'art': 2, 'exhibitions': 2, 'offered': 2, 'something': 2, 'everyone': 2, 'aspects': 2, 'opportunity': 2, 'it': 2, 'afforded': 2, 'showcase': 2, 'their': 5, 'skills': 3, 'supportive': 2, 'inclusive': 2, 'environment': 2, 'inter': 2, 'fostered': 3, 'spirit': 2, 'healthy': 2, 'rivalry': 2, 'sportsmanship': 2, 'amang': 2, 'participants': 2, 'these': 3, 'events': 2, 'encouraged': 3, 'hone': 2, 'abilities': 2, 'allowed': 2, 'them': 2, 'appreciate': 2, 'perspectives': 2, 'present': 2, 'within': 2, 'wider': 2, 'community': 2, 'moreover': 2, 'facilitated': 2, 'personal': 3, 'growth': 3, 'development': 2, 'through': 2, 'wide': 2, 'range': 2, 'led': 2, 'by': 2, 'experts': 2, 'fields': 2, 
'were': 2, 'explore': 2, 'new': 3, 'cultivate': 2, 'develop': 2, 'hobbies': 2, 'ranging': 2, 'photography': 2, 'painting': 2, 'pottery': 2, 'disciplines': 2, 'enriched': 2, 'extracurricular': 2, 'lives': 2, 'broadened': 3, 'horizons': 3, 'lifelong': 2, 'love': 2, 'learning': 2, 'conclusion': 2, 'remains': 2, 'etched': 2, 'memory': 2, 'an': 2, 'exceptional': 2, 'experience': 2, 'that': 3, 'significantly': 2, 'contributed': 2, 'professional': 2, 'taught': 2, 'me': 2, 'invaluable': 2, 'lessons': 2, 'teamwork': 2, 'perseverance': 2, 'leadership': 2, 'i': 2, 'reflect': 2, 'days': 2, 'serves': 2, 'poignant': 2, 'reminder': 2, 'enriching': 2, 'experiences': 2, 'helped': 2, 'shape': 2})
"""

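What happens when we meet a new word we have never seen before? Suppose a word is spelled perfectly correctly, but the corpus does not contain it, so it never shows up in the training data. We would then have to report a probability of 0 for that word. That is not good: a probability of 0 says the event can never happen, whereas in our probability model we would rather represent this case with a very small probability. This is why the model is created with defaultdict(lambda: 1): every word starts with a count of 1, which also explains why a word that occurs once in the corpus is printed with a count of 2.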
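
The effect of the default value can be checked on a tiny input. The sketch below uses a made-up three-word list instead of big.txt. It also shows a detail that is easy to miss: `in` and `.get()` do not trigger the default, only indexing with `[]` does.

```python
import collections

def train(features):
    # Every word starts at 1, so a word seen once ends up with a count of 2
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

model = train(['life', 'life', 'left'])   # hypothetical toy corpus
print(model['life'])      # seen twice, count is 3
print(model['left'])      # seen once, count is 2
print('lfe' in model)     # membership test does not create the key
print(model.get('lfe'))   # .get() bypasses the default factory
print(model['lfe'])       # indexing falls back to the smoothed count
```

Note that indexing an unseen word also stores it in the dictionary, so membership tests should be done before any such lookup.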

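Edit distance

The edit distance between two words is the number of operations needed to turn one word into the other, where each operation is an insertion (adding a single letter to the word), a deletion (removing a single letter), a transposition (swapping two adjacent letters), or an alteration (replacing one letter with another).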
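
The spell checker never computes this distance directly, but a small function helps to check the definition on examples. The sketch below is one possible way to do it, using the "optimal string alignment" variant of the distance, which counts exactly the four operations listed above. The function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i letters of a and the first j of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i            # delete all i letters
    for j in range(cols):
        d[0][j] = j            # insert all j letters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # alteration (or match)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)   # transposition
    return d[rows - 1][cols - 1]

print(edit_distance('lfe', 'life'))     # one insertion
print(edit_distance('teh', 'the'))      # one transposition
print(edit_distance('hallo', 'hello'))  # one alteration
```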

# Return the set of all strings at edit distance 1 from word
def edits(word):
    n = len(word)
    alphabet = 'abcdefghijklmnopqrstuvwxyz'
    return set([word[0:i] + word[i+1:] for i in range(n)] +    # deletion
               [word[0:i] + word[i+1] + word[i] + word[i+2:] for i in range(n-1)] +    # transposition
               [word[0:i] + c + word[i+1:] for i in range(n) for c in alphabet] +    # alteration
               [word[0:i] + c + word[i:] for i in range(n+1) for c in alphabet])    # insertion
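
As a sanity check on the four operations, the sketch below generates the same kind of candidates by first splitting the word at every position, which turns each operation into a one-line comprehension. The name `edits1` and the use of `string.ascii_lowercase` are choices made here for illustration; this is an alternative formulation, not the article's code.

```python
import string

def edits1(word):
    # Hypothetical helper: all strings one edit away from word
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

candidates = edits1('lfe')
print('life' in candidates)   # reached by inserting 'i'
print('lef' in candidates)    # reached by transposing 'f' and 'e'
print('le' in candidates)     # reached by deleting 'f'
print('lie' in candidates)    # reached by replacing 'f' with 'i'
print(len(candidates))
```

For a word of length n the four lists hold n deletions, n-1 transpositions, 26n alterations, and 26(n+1) insertions, so the set has at most 54n+25 elements before duplicates are removed.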

Optimization: among all the strings within edit distance 2, only those that are real, known words are kept as candidates.

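# Return all strings at edit distance 2 from word (apply edits twice).
# Only the real words among them become candidates; that filtering is done by known()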
def edits2(word):
    return set(e2 for e1 in edits(word) for e2 in edits(e1))

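In practice, mistyping one vowel as another is more likely than confusing two consonants (people often type hallo for hello), misspelling the first letter of a word is relatively unlikely, and so on. To keep things simple, however, we use a crude rule instead: known words at edit distance 1 take priority over those at edit distance 2, and a known word at edit distance 0 takes priority over those at edit distance 1.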

def known(words):
    return set(w for w in words if w in NWORDS)
# If a known(...) call returns a non-empty set, candidates takes that set and the later terms are never evaluated
def correct(word):
    candidates = known([word]) or known(edits(word)) or known(edits2(word)) or [word]
    return max(candidates, key=NWORDS.get)
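
The priority rule relies on how Python's `or` treats sets: an empty set is falsy, so evaluation moves on to the next term, and it stops at the first non-empty one. The sketch below makes this visible with a made-up word count table and hand-written candidate lists; the `tier` helper and the `calls` list are hypothetical and exist only to record which terms were actually evaluated.

```python
counts = {'life': 3, 'left': 2}   # hypothetical toy word counts
calls = []                        # records which tiers were evaluated

def known(words):
    return set(w for w in words if w in counts)

def tier(name, words):
    calls.append(name)
    return known(words)

candidates = (tier('distance 0', ['lfe'])
              or tier('distance 1', ['life', 'lfa', 'le'])
              or tier('distance 2', ['left'])
              or ['lfe'])

print(candidates)                        # the distance 1 tier wins
print(calls)                             # the distance 2 tier was never run
print(max(candidates, key=counts.get))   # most frequent word in the winning tier
```

Skipping the later terms matters in practice, because the set of strings at edit distance 2 is far larger than the set at distance 1.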

Running the spell checker

NWORDS = train(words(open('big.txt').read()))
print(correct('lfe'))
# life

Original article: https://blog.csdn.net/weixin_73044854/article/details/140572550