08.C2W3.Auto-complete and Language Models

N-Grams: Overview

● Create language model (LM) from text corpus to
○ Estimate probability of word sequences
○ Estimate probability of a word following a sequence of words
● Apply this concept to autocomplete a sentence with most likely suggestions
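Language models are widely used in natural language processing (NLP) and artificial intelligence. Typical applications include:
Speech recognition:
A speech recognition system converts human speech into written text. Language models play a key role here because they let the system take the context and grammatical structure of the words in an utterance into account. By estimating how likely a word sequence is, the language model improves the accuracy of speech-to-text conversion.

Spelling correction:
Language models can detect and correct spelling errors in text. Because a language model knows which word sequences are common or grammatical in a language, it can flag words that do not fit and suggest or automatically apply the correct spelling. For example, if a user types “recieve”, the model can recognize that this is not a common word and suggest “receive”.

Augmentative and alternative communication (AAC):
AAC devices and systems help people with speech impairments or communication difficulties express themselves. A language model integrated into such a system can offer personalized predictions and suggestions so that users build sentences faster. For example, it can predict the next word or phrase the user is likely to want, which makes communication more efficient.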

Main objectives:
● Process text corpus to N-gram language model
● Out of vocabulary words
● Smoothing for previously unseen N-grams
● Language model evaluation

N-grams and Probabilities

N-grams

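An N-gram is a basic unit used by statistical models of text in NLP. Put simply, an N-gram is a sequence of N consecutive words. Each word in the sequence is called a “gram”, and the sequence captures local context in the text.

Some examples for different values of N:
Unigram (1-gram): N=1, a single word. For example, “cat” is a unigram.
Bigram (2-gram): N=2, two consecutive words. For example, “cat sat” is a bigram.
Trigram (3-gram): N=3, three consecutive words. For example, “cat sat on” is a trigram.
N-gram models matter for language modeling because they can estimate the probability of the next word in a sequence. In a bigram model, for example, the model estimates the probability of the second word given the first. Such models are used in spelling correction, parsing, machine translation, and speech recognition.

N-gram models also have limitations. When N is large they suffer from data sparsity, because many word sequences occur only rarely or never in the training data. In addition, N-gram models generally ignore information beyond the local word sequence, such as syntax and semantics.

The key to understanding N-grams is that they offer a simple but effective way to capture local dependencies in text.
Another example:
Corpus: I am happy because I am learning
Unigrams: {I, am, happy, because, learning}
Bigrams: {I am, am happy, happy because, …}. “I happy” is not a bigram, because the two words must be consecutive. “I am” occurs twice in the corpus but is recorded only once in the set.
Trigrams: {I am happy, am happy because, …}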
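
The extraction above can be sketched in a few lines of Python. This is a minimal illustration that assumes whitespace tokenization; the helper name `ngrams` is made up for this example and does not come from the course material.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) of consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "I am happy because I am learning".split()

# Sets keep each distinct n-gram once, even if it occurs several times.
unigrams = set(ngrams(corpus, 1))
bigrams = set(ngrams(corpus, 2))
trigrams = set(ngrams(corpus, 3))

print(sorted(bigrams))
print(("I", "happy") in bigrams)  # the two words are not adjacent in the corpus
```

Using a set mirrors the note above that “I am” is recorded only once; a counter would be used instead when the frequencies are needed.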

Sequence notation

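Suppose the corpus contains $m = 500$ words.
The whole word sequence can then be written as:
$w_1^m=w_1w_2\cdots w_m$
For example, the sequence from the first to the third word is:
$w_1^3=w_1w_2w_3$
The last three words of the corpus are: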
$w_{m-2}^m=w_{m-2}w_{m-1}w_m$

Unigram probability

Suppose the corpus is: I am happy because I am learning
Corpus size: $m = 7$
For the word I: $P(I)=\cfrac{2}{7}$
For the word happy: $P(happy)=\cfrac{1}{7}$
The unigram probability formula is:
$P(w)=\cfrac{C(w)}{m}$
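
As a quick check of the formula, the two values above can be reproduced with a word counter. This is only a sketch; `unigram_probability` is an illustrative name, and exact fractions are used so the results match the hand calculation.

```python
from collections import Counter
from fractions import Fraction

corpus = "I am happy because I am learning".split()
counts = Counter(corpus)   # C(w) for every word
m = len(corpus)            # corpus size

def unigram_probability(word):
    """P(w) = C(w) / m"""
    return Fraction(counts[word], m)

print(unigram_probability("I"), unigram_probability("happy"))
```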

Bigram probability

Suppose the corpus is: I am happy because I am learning
The probability that am follows I is: $P(am|I)=\cfrac{C(I\space am)}{C(I)}=\cfrac{2}{2}=1$
The probability that happy follows I is: $P(happy|I)=\cfrac{C(I\space happy)}{C(I)}=\cfrac{0}{2}=0$
The probability that learning follows am is: $P(learning|am)=\cfrac{C(am\space learning)}{C(am)}=\cfrac{1}{2}$
The bigram probability formula is:
$P(y|x)=\cfrac{C(x\space y)}{\sum_wC(x\space w)}=\cfrac{C(x\space y)}{C(x)}$

Trigram Probability

Suppose the corpus is: I am happy because I am learning
The probability that happy follows the two words I am is: $P(happy|I\space am)=\cfrac{C(I\space am\space happy)}{C(I\space am)}=\cfrac{1}{2}$
The trigram probability formula is:
$P(w_3|w_1^2)=\cfrac{C(w_1^2w_3)}{C(w_1^2)}$

N -gram probability

The general formula is:
$P(w_N|w_1^{N-1})=\cfrac{C(w_1^{N-1}w_N)}{C(w_1^{N-1})}$
The numerator is simply the count of the full N-gram: $C(w_1^{N-1}w_N)=C(w_1^{N})$
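
The count-based estimate can be written as one small function that works for any N. The sketch below assumes whitespace tokens and the toy corpus from this section; `count_ngrams` and `ngram_probability` are illustrative names, not part of any library.

```python
from collections import Counter
from fractions import Fraction

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_probability(tokens, context, word):
    """Estimate P(word | context) as C(context + word) / C(context)."""
    context = tuple(context)
    numerator = count_ngrams(tokens, len(context) + 1)[context + (word,)]
    denominator = count_ngrams(tokens, len(context))[context]
    # An unseen context gives no evidence; this sketch returns 0 in that case.
    return Fraction(numerator, denominator) if denominator else Fraction(0)

corpus = "I am happy because I am learning".split()

print(ngram_probability(corpus, ["I"], "am"))           # bigram
print(ngram_probability(corpus, ["I", "am"], "happy"))  # trigram
```

Recounting the corpus on every call keeps the example short; a real model would count once and reuse the tables.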

Quiz

Corpus:
“In every place of great resort the monster was the fashion. They sang of it in the cafes, ridiculed it in the papers, and represented it on the stage.” (Jules Verne, Twenty Thousand Leagues under the Sea)
In the context of our corpus, what is the probability of the word “papers” following the phrase “it in the”?
Answer: 1/2
Explanation: “it in the” occurs twice in the corpus and is followed by “papers” once.

Sequence Probabilities

Probability of a sequence

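How do we compute the probability of a given sentence?
By the chain rule:
$P(A,B,C,D)=P(A)P(B|A)P(C|A,B)P(D|A,B,C)$
which follows from the definition of conditional probability:
$P(B|A)=\cfrac{P(A,B)}{P(A)}\Rightarrow P(A,B)=P(A)P(B|A)$
The probability of a sentence is therefore:
$P(the\space teacher\space drinks\space tea)=P(the)P(teacher|the)P(drinks|the\space teacher)P(tea|the\space teacher\space drinks)$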

Sequence probability shortcomings

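The biggest problem: the corpus almost never contains the exact sentence we are interested in, or even its longer subsequences.
Take the last factor of the example above:
$P(tea|the\space teacher\space drinks)=\cfrac{C(the\space teacher\space drinks\space tea)}{C(the\space teacher\space drinks)}$
Both the numerator and the denominator are likely to be 0 in the corpus. The factor then cannot be estimated reliably, and a zero factor makes the whole product for P(the teacher drinks tea) zero as well.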

Approximation by N gram probabilities

To avoid this, restrict the condition to the previous word only:
$P(tea|the\space teacher\space drinks)\approx P(tea|drinks)$
$P(the\space teacher\space drinks\space tea)=P(the)P(teacher|the)P(drinks |the\space teacher)P(tea |the\space teacher\space drinks)\\ \approx P(the)P(teacher|the)P(drinks |teacher)P(tea |drinks)$
More generally, this is the Markov assumption: only last N words matter
Bigram approximation of the probability of a word:
$P(w_n | w_1^{n-1}) \approx P(w_n | w_{n-1})$
N-gram approximation of the probability of a word:
$P(w_n | w_1^{n-1}) \approx P(w_n|w_{n-N+1}^{n-1})$
Bigram approximation of the probability of a whole sentence:
$P(w_1^n)\approx P(w_1)P(w_2|w_1)\cdots P(w_n|w_{n-1})$

Quiz

Given these conditional probabilities
P(Mary)=0.1;
P(likes)=0.2;
P(cats)=0.3
P(Mary|likes) =0.2;
P(likes|Mary) =0.3;
P(cats|likes)=0.1;
P(likes|cats)=0.4
Approximate the probability of the following sentence with bigrams: “Mary likes cats”
Answer:0.003
Explanation: P(Mary likes cats)≈P(Mary)P(likes|Mary)P(cats|likes)=0.1×0.3×0.1=0.003
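
The quiz calculation can be sketched as follows. The probability tables are the ones given in the question; the dictionary layout, where the key `(prev, word)` stores P(word | prev), is just a convention chosen for this example.

```python
# Unigram and bigram probabilities taken from the quiz statement.
unigram = {"Mary": 0.1, "likes": 0.2, "cats": 0.3}
bigram = {
    ("likes", "Mary"): 0.2,   # P(Mary | likes)
    ("Mary", "likes"): 0.3,   # P(likes | Mary)
    ("likes", "cats"): 0.1,   # P(cats | likes)
    ("cats", "likes"): 0.4,   # P(likes | cats)
}

def sentence_probability(words):
    """Bigram approximation: P(w1) * product of P(w_i | w_{i-1})."""
    prob = unigram[words[0]]
    for prev, word in zip(words, words[1:]):
        prob *= bigram[(prev, word)]
    return prob

print(round(sentence_probability(["Mary", "likes", "cats"]), 6))
```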

Starting and Ending Sentences

Start of sentence token <s>

$P(the\space teacher\space drinks\space tea) \approx P(the)P(teacher|the)P(drinks |teacher)P(tea |drinks)$
The first word has no preceding word, so its factor cannot be written as a bigram conditional probability. We therefore add a special start token so that every factor on the right-hand side becomes a bigram: the teacher drinks tea becomes <s> the teacher drinks tea, and the probability becomes:
$P(<s>\space the\space teacher\space drinks\space tea) \approx P(the|<s>)P(teacher|the)P(drinks |teacher)P(tea |drinks)$

For a trigram model:
$P(the\space teacher\space drinks\space tea)\approx P(the)P(teacher|the)P(drinks| the\space teacher)P(tea|teacher\space drinks)$
two <s> tokens are needed, which gives: <s> <s> the teacher drinks tea

In general, an N-gram model needs N-1 <s> tokens.

End of sentence token </s> -motivation

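First motivation:
In the formula
$P(y|x)=\cfrac{C(x,y)}{\sum_wC(x,w)}=\cfrac{C(x,y)}{C(x)}$
the two denominators are not necessarily equal when x is the last word of a sentence, that is: $\sum_wC(x,w)\neq C(x)$
For example, take the corpus:
<s> Lyn drinks chocolate
<s> John drinks
The number of times drinks is followed by some word is 1:
$\sum_wC(drinks,w)=1$
but drinks itself occurs 2 times:
$C(drinks)=2$
Second motivation:
Suppose the corpus is:
<s> yes no
<s> yes yes
<s> no no
First list all possible sentences of length 2:
<s> yes yes
<s> yes no
<s> no no
<s> no yes
Taking the first one, <s> yes yes, as an example, its probability is:
$P(<s>\space yes\space yes)=P(yes|<s>)\times P(yes|yes)\\ =\cfrac{C(<s>,yes)}{\sum_wC(<s>,w)}\times\cfrac{C(yes,yes)}{\sum_wC(yes,w)}\\ =\cfrac{2}{3}\times\cfrac{1}{2}=\cfrac{1}{3}$
In the same way, <s> yes no has probability 1/3, <s> no no has probability 1/3, and <s> no yes has probability 0.
So the probabilities of all sentences of length 2 sum to: $\sum_{2\space word}P(\cdots)=1/3+1/3+1/3+0=1$
The same computation for all sentences of length 3 also gives a total of 1.
This is not what we want. The probabilities of all sentences the model can generate, over all lengths, should sum to 1, rather than the sentences of each individual length summing to 1 on their own:
$\sum_{2\space word}P(\cdots)+\sum_{3\space word}P(\cdots)+\cdots=1$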
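
The second motivation can be checked numerically. The sketch below builds a bigram model from the three-sentence yes/no corpus without an end token and sums the probabilities of every possible sentence of a given length. All function names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

corpus = [["<s>", "yes", "no"], ["<s>", "yes", "yes"], ["<s>", "no", "no"]]

bigram_counts = Counter()
for sentence in corpus:
    bigram_counts.update(zip(sentence, sentence[1:]))

def prob(word, prev):
    # Denominator: number of bigrams that start with `prev`.
    total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
    return Fraction(bigram_counts[(prev, word)], total)

def sentence_prob(words):
    """Probability of a sentence given without its leading <s>."""
    p = Fraction(1)
    for prev, word in zip(["<s>"] + list(words), words):
        p *= prob(word, prev)
    return p

def total_mass(length):
    """Sum the probabilities of every yes/no sentence of this length."""
    return sum(sentence_prob(w) for w in product(["yes", "no"], repeat=length))

print(total_mass(2), total_mass(3))
```

Each fixed length takes up the whole probability mass on its own, which is the inconsistency described above.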

End of sentence token </s> -solution

The solution is to append </s> to the end of each sentence. For example, <s> the teacher drinks tea </s> has probability:
$P(the|<s>)P(teacher|the)P(drinks|teacher)P(tea|drinks)P(</s>|tea)$
Note: unlike the start token, a single </s> is enough even for higher-order N-grams. For a trigram model:
the teacher drinks tea => <s> <s> the teacher drinks tea </s>
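
The padding rule (N-1 start tokens, one end token) can be sketched as a small helper. The name `pad_sentence` is hypothetical and only used for illustration.

```python
def pad_sentence(tokens, n):
    """Pad a tokenized sentence for an n-gram model:
    n-1 start tokens in front, a single end token at the back."""
    return ["<s>"] * (n - 1) + list(tokens) + ["</s>"]

sentence = "the teacher drinks tea".split()
print(pad_sentence(sentence, 2))  # bigram model
print(pad_sentence(sentence, 3))  # trigram model
```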

Revisiting the first motivation:
<s> Lyn drinks chocolate </s>
<s> John drinks </s>
The number of times drinks is followed by some token is now 2:
$\sum_wC(drinks,w)=2$
and drinks itself occurs 2 times:
$C(drinks)=2$

Example-bigram

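Suppose the corpus is:
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Some of the resulting bigram probabilities:
$P(John|<s>)=\cfrac{1}{3}\quad P(</s>|tea)=\cfrac{1}{1}$
$P(chocolate|eats)=\cfrac{1}{1}\quad P(Lyn |<s>)=\cfrac{2}{3}$
The probability of the first sentence is:
$P(sentence)=\cfrac{2}{3}\times\cfrac{1}{2}\times\cfrac{1}{2}\times\cfrac{2}{2}=\cfrac{1}{6}$
This is lower than the 1/3 one might expect for one of three sentences in the corpus. The remaining probability mass is spread over the other sentences that the bigram model can generate from the corpus, which is how the model generalizes.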
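
The numbers in this example can be reproduced with a few lines of Python. This is a sketch under the assumption that sentences are already padded with <s> and </s>, as in the corpus above; the function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

corpus = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]
sentences = [line.split() for line in corpus]

unigram_counts = Counter(w for s in sentences for w in s)
bigram_counts = Counter(b for s in sentences for b in zip(s, s[1:]))

def bigram_prob(word, prev):
    """P(word | prev) = C(prev, word) / C(prev)"""
    return Fraction(bigram_counts[(prev, word)], unigram_counts[prev])

def sentence_prob(tokens):
    p = Fraction(1)
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(word, prev)
    return p

print(bigram_prob("Lyn", "<s>"))
print(sentence_prob(sentences[0]))
```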

The N-gram Language Model

Count matrix

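In the N-gram formula:
$P(w_n|w_{n-N+1}^{n-1})=\cfrac{C(w_{n-N+1}^{n-1},w_n)}{C(w_{n-N+1}^{n-1})}$
the numerator is: $C(w_{n-N+1}^{n-1},w_n)$
The count matrix stores these co-occurrence counts for every N-gram in the corpus.
Its rows are the unique (N-1)-grams of the corpus (the preceding words).
Its columns are the unique words of the corpus (the current word).
Example of a bigram count matrix:
Corpus: <s> I study I learn </s>
For example, the bigram study I occurs once in the corpus, so the cell in row study and column I is 1.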

Probability matrix

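The count matrix gives the numerators. Dividing by the denominators then yields the probability matrix.
Divide each cell by its row sum
$sum(row)=\sum_{w\in V}C(w_{n-N+1}^{n-1},w)=C(w_{n-N+1}^{n-1})$
First compute the sum of each row of the count matrix, then divide every cell by its row sum to obtain the probability matrix.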

Language model

With the probability matrix, a language model can compute:
○ Sentence probability
○ Next word prediction
For example, using the probability matrix from the previous section, the probability of the sentence <s> I learn </s> is:
$P(sentence)=P(I|<s>)P(learn|I)P(</s>|learn)=1\times0.5\times1=0.5$
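
The count matrix, the probability matrix, and the sentence probability above can be sketched with plain dictionaries. This is an illustrative layout (nested dicts indexed as `matrix[previous][current]`), not the representation used in the course assignment.

```python
from collections import Counter

tokens = "<s> I study I learn </s>".split()
vocab = sorted(set(tokens))
bigram_counts = Counter(zip(tokens, tokens[1:]))

# count_matrix[prev][word] = C(prev, word)
count_matrix = {p: {w: bigram_counts[(p, w)] for w in vocab} for p in vocab}

# Divide each cell by its row sum to get P(word | prev).
prob_matrix = {}
for prev, row in count_matrix.items():
    row_sum = sum(row.values())
    if row_sum:  # "</s>" never starts a bigram, so its row is skipped
        prob_matrix[prev] = {w: c / row_sum for w, c in row.items()}

def sentence_probability(words):
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= prob_matrix[prev][word]
    return p

print(count_matrix["study"]["I"])
print(sentence_probability("<s> I learn </s>".split()))
```

Next word prediction follows from the same table: for a given previous word, pick the column with the highest probability in its row.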

Log probability

Here again many probabilities are multiplied together, so we work with logarithms to avoid numerical underflow.
$P(w_1^n ) \approx\prod_{i=1}^{n} P(w_i | w_{i-1})$
Taking the logarithm:
$\log(P(w_1^n ) )\approx\sum_{i=1}^{n}\log( P(w_i | w_{i-1}))$
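
A short sketch of why the log form matters in practice: multiplying many small probabilities underflows to zero in floating point, while summing their logarithms stays well behaved. The probability lists below are made-up inputs for illustration.

```python
import math

def log_sentence_probability(bigram_probs):
    """Sum log-probabilities instead of multiplying raw probabilities."""
    return sum(math.log(p) for p in bigram_probs)

# P(I|<s>), P(learn|I), P(</s>|learn) from the example above.
short = [1.0, 0.5, 1.0]
print(math.exp(log_sentence_probability(short)))

# A long sequence of small factors: the plain product underflows.
long_sequence = [0.1] * 400
print(math.prod(long_sequence))
print(log_sentence_probability(long_sequence))
```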

Generative Language model

A generative language model produces a sentence roughly as follows:

● Choose sentence start
● Choose next bigram starting with previous word
● Continue until </s> is picked
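
The steps above can be sketched as a sampling loop. The corpus is the three-sentence example from the bigram section, reused here as an assumption; `generate`, `successors`, and the `max_len` safety limit are illustrative choices.

```python
import random
from collections import Counter, defaultdict

corpus = [
    "<s> Lyn drinks chocolate </s>",
    "<s> John drinks tea </s>",
    "<s> Lyn eats chocolate </s>",
]

# successors[prev] counts the words observed right after `prev`.
successors = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev][word] += 1

def generate(rng, max_len=20):
    """Sample words bigram by bigram until </s> is drawn."""
    sentence, prev = [], "<s>"
    while len(sentence) < max_len:
        words = list(successors[prev])
        weights = [successors[prev][w] for w in words]
        prev = rng.choices(words, weights=weights)[0]
        if prev == "</s>":
            break
        sentence.append(prev)
    return sentence

print(" ".join(generate(random.Random(0))))
```

Sampling in proportion to the bigram counts is equivalent to sampling from the rows of the probability matrix. The model can produce sentences that are not in the corpus, such as a mix of two training sentences.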

Language Model Evaluation

Test data

	For smaller corpora	For large corpora (typical for text)
Train	80%	98%
Validation	10%	1%
Test	10%	1%

● Split method
Either split the corpus as continuous text, so that each set is one contiguous block, or split it into random short sequences.

Perplexity

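Perplexity measures how well a language model predicts a text and is a standard metric for evaluating language models. It is defined as:

$\text{PP}(W) = P(w_1 ,w_2 ,...,w_m)^{-\frac{1}{m}}$
where:
$P(w_1 ,w_2 ,...,w_m)$ is the probability the language model assigns to the observed word sequence
$m$ is the total number of words in the sequence.
Specifically, $P(w_1 ,w_2 ,...,w_m)$ expands to:

$P(w_1, w_2, ..., w_m) = \prod_{i=1}^{m} P(w_i | w_1, w_2, ..., w_{i-1})$
Here:
$w_i$ is the $i$-th word of the sequence.
$P(w_i|w_1 ,w_2 ,...,w_{i-1})$ is the probability of the $i$-th word given the preceding $i - 1$ words.
The exponent $-\frac{1}{m}$ makes perplexity the reciprocal of the geometric mean of the per-word probabilities. The geometric mean is the $m$-th root of the product of the probabilities, and taking the reciprocal means that a higher average probability gives a lower perplexity.
The lower the perplexity, the better the model predicts the data, that is, the less “perplexed” it is by the word sequence. In practice, a model with low perplexity predicts the next word better and tends to generate more natural, coherent sentences.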
Smaller perplexity = better model
Character level models PP < word based models PP

Perplexity for bigram models

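$PP(W)=\sqrt[m]{\prod_{i=1}^{S}\prod_{j=1}^{|s_i|}\cfrac{1}{P(w_j^{(i)}|w_{j-1}^{(i)})}}$
$w_j^{(i)}$ is the $j$-th word of the $i$-th sentence, $S$ is the number of sentences in the test set $W$, and $m$ is the total number of words in $W$

Alternatively, concatenate all sentences in W
and compute the bigram perplexity as the product of all bigram probabilities raised to the power $-1/m$:
$PP(W)=\sqrt[m]{\prod_{i=1}^m\cfrac{1}{P(w_i|w_{i-1})}}$
$w_{i}$ is the $i$-th word of the test set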

Log perplexity

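As before, taking the logarithm turns the product into a sum. Note the minus sign, which comes from the reciprocal inside the root:
$\log_2 PP(W)=-\cfrac{1}{m}\sum_{i=1}^m\log_2(P(w_i|w_{i-1}))$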
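
A minimal sketch of the computation, assuming the per-word bigram probabilities of the test set have already been looked up. Perplexity is obtained as 2 raised to the negative average base-2 log probability; the input lists are made-up values for illustration.

```python
import math

def perplexity(bigram_probs):
    """Perplexity from the per-word bigram probabilities of a test set."""
    m = len(bigram_probs)
    log_pp = -sum(math.log2(p) for p in bigram_probs) / m
    return 2 ** log_pp

probs = [0.5, 0.25, 0.5, 0.25]
print(perplexity(probs))

# Sanity check: if every word has probability 1/k, the perplexity is k.
print(perplexity([1 / 170] * 10))
```

The second call shows a common reading of the metric: a perplexity of k means the model is, on average, as uncertain as if it chose uniformly among k words.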

Example

Training 38 million words, test 1.5 million words, WSJ corpus
Perplexity Unigram: 962 Bigram: 170 Trigram: 109
The WSJ corpus, short for Wall Street Journal Corpus, is a widely used text corpus based on articles from The Wall Street Journal. It is well known in NLP, particularly for training and evaluating language models.

Out of Vocabulary Words

Out of vocabulary words

Closed vs. Open vocabularies
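A closed vocabulary simplifies text processing but may sacrifice the ability to handle new words. An open vocabulary is more flexible and adapts better to varied language use, but may increase model complexity and computational cost.
Closed vocabularies:
With a closed vocabulary, a fixed set of words is defined before training. It contains every word or token the model will use during training and prediction.
Any word outside the vocabulary is usually ignored or replaced with a special unknown token (such as <UNK>).
A closed vocabulary reduces model complexity because it limits the number of words the model has to learn and predict.
A drawback is that it handles new or rare out-of-vocabulary words poorly, which can hurt the model's understanding of new text.

Open vocabularies:
An open vocabulary does not restrict the set of words the model can handle. The model can process any word it encounters, whether or not the word appeared in the training data.
In this setting, models usually rely on subword segmentation, such as Byte Pair Encoding (BPE) or WordPiece, to handle words that are not in the training set.
An open vocabulary copes better with varied text, including technical terms and newly coined words, because they are not simply replaced with an unknown token.
However, this approach may increase model complexity, because the model has to learn more vocabulary items and the relationships between them.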

Unknown word = Out of vocabulary word (OOV)
special tag <UNK> in corpus and in input

Using <UNK> in corpus

Steps:
● Create vocabulary V
● Replace any word in corpus and not in V by <UNK>
● Count the probabilities with <UNK> as with any other word
Example:
Corpus
<s> Lyn drinks chocolate </s>
<s> John drinks tea </s>
<s> Lyn eats chocolate </s>
Set the vocabulary threshold to a minimum of two occurrences: Min frequency f=2
<s> Lyn drinks chocolate </s>
<s> <UNK> drinks <UNK> </s>
<s> Lyn <UNK> chocolate </s>
The resulting vocabulary is:
Vocabulary
Lyn, drinks, chocolate
At query time, any input word that is not in the vocabulary is also replaced with <UNK>:
<s> Adam drinks chocolate </s>
<s> <UNK> drinks chocolate </s>
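
The procedure can be sketched as two small functions. The names `build_vocabulary` and `replace_oov` are illustrative, and the sentence markers are passed through unchanged as a design choice of this sketch.

```python
from collections import Counter

def build_vocabulary(sentences, min_freq):
    """Keep the words that occur at least `min_freq` times."""
    counts = Counter(w for s in sentences for w in s)
    return {w for w, c in counts.items() if c >= min_freq}

def replace_oov(sentence, vocab, keep=("<s>", "</s>")):
    """Map every word outside the vocabulary to <UNK>."""
    return [w if w in vocab or w in keep else "<UNK>" for w in sentence]

corpus = [
    "Lyn drinks chocolate".split(),
    "John drinks tea".split(),
    "Lyn eats chocolate".split(),
]
vocab = build_vocabulary(corpus, min_freq=2)

print(sorted(vocab))
print(replace_oov("<s> Adam drinks chocolate </s>".split(), vocab))
```

The same `replace_oov` call is applied to the training corpus before counting and to every query at prediction time.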

How to create vocabulary V

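Two possible criteria:

● Set a minimum word frequency: words that occur at least that often enter the vocabulary, and all others are mapped to <UNK>
● Set a maximum vocabulary size $|V|$: sort the words by frequency, keep the top $|V|$ words, and map all others to <UNK>

<UNK> tends to lower perplexity, but use it sparingly: with too many <UNK> tokens, the sentences you generate will be full of <UNK>.
When comparing perplexity, only compare LMs with the same V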

Quiz

Given the training corpus and minimum word frequency=2, what would the vocabulary for the corpus
preprocessed with <UNK> look like?
“<s> I am happy I am learning </s> <s> I am happy I can study </s>”
Answer:
V = (I,am,happy)

Smoothing

Missing N-grams in training corpus

Problem: N-grams made of known words still might be missing in the training corpus
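How should we handle an N-gram whose words all occur in the corpus but which never occurs itself?
For example, the corpus contains “John” and “eats” but not “John eats”. The count of “John eats” is then 0, so its bigram probability is 0, and the probability of any sentence containing it is 0 as well.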

Smoothing

Add-one smoothing (Laplacian smoothing)

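$P(w_n|w_{n-1})=\cfrac{C(w_{n-1},w_n)+1}{\sum_{w\in V}(C(w_{n-1},w)+1)}=\cfrac{C(w_{n-1},w_n)+1}{C(w_{n-1})+|V|}$
Add-one smoothing only works well when the corpus is large enough for the real counts to outweigh the added ones. Otherwise too much probability goes to unseen N-grams.
With a larger corpus, add-k smoothing can be used instead (it also applies to higher-order N-grams such as trigrams and 4-grams):
$P(w_n|w_{n-1})=\cfrac{C(w_{n-1},w_n)+k}{\sum_{w\in V}(C(w_{n-1},w)+k)}=\cfrac{C(w_{n-1},w_n)+k}{C(w_{n-1})+k\times |V|}$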
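
Add-k smoothing for bigrams can be sketched as below; add-one smoothing is the special case k=1. The function name and the toy corpus are assumptions made for this illustration, and the vocabulary is taken to be the set of distinct corpus words.

```python
from collections import Counter

def add_k_probability(tokens, prev, word, k=1):
    """P(word | prev) with add-k smoothing."""
    vocab = set(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # C(prev) as the number of bigrams that start with `prev`.
    prev_count = sum(bigram_counts[(prev, w)] for w in vocab)
    return (bigram_counts[(prev, word)] + k) / (prev_count + k * len(vocab))

tokens = "I am happy I am learning".split()

print(add_k_probability(tokens, "I", "am", k=1))     # seen bigram
print(add_k_probability(tokens, "I", "happy", k=1))  # unseen bigram, no longer zero
```

With k=1 the unseen bigram “I happy” gets (0+1)/(2+4), while the seen bigram “I am” drops from 1 to (2+1)/(2+4), which shows how mass moves from seen to unseen events.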
Advanced methods:
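Kneser-Ney smoothing:
Kneser-Ney smoothing, proposed by Reinhard Kneser and Hermann Ney, is a smoothing technique for estimating conditional probability distributions.
It redistributes probability mass toward low-frequency and unseen N-grams, which improves the generalization of the language model.
Kneser-Ney takes the contexts of a word into account: it discounts the observed counts and combines them with a lower-order estimate that depends on how many distinct contexts the word appears in.
It is particularly suited to large corpora, because it makes effective use of the statistics in the corpus.

Good-Turing smoothing:
Good-Turing smoothing, proposed by I. J. Good, estimates the probability of items that never occur in the corpus.
It relies on the frequency of frequencies: the number of items seen exactly once is used to estimate the total probability of unseen items.
In effect, Good-Turing moves probability mass from seen items to items that did not occur in the training corpus.
The method is simple and computationally efficient, but less flexible than Kneser-Ney because it does not distinguish between the contexts in which a word occurs.
Both methods have strengths and limitations. Kneser-Ney smoothing usually performs better in practice because it uses context information, at a higher computational cost. Good-Turing smoothing is sometimes chosen for its simplicity and efficiency, especially when resources are limited.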

Backoff

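If N-gram missing => use (N-1)-gram, … There are two ways to back off:
The first discounts the probabilities of observed N-grams and gives the freed mass to the lower-order estimates: Probability discounting e.g. Katz backoff
The second multiplies the lower-order estimate by a constant (0.4 works well in practice) and uses it directly: “Stupid” backoff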
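
A rough sketch of “stupid” backoff: if the full N-gram was seen, use its relative frequency; otherwise drop the oldest context word, multiply by the constant, and try again, ending at the unigram frequency. The function name, the toy corpus, and the default constant `alpha=0.4` are assumptions for this illustration. The result is a score rather than a true probability, because nothing is renormalized.

```python
from collections import Counter

def stupid_backoff_score(tokens, context, word, alpha=0.4):
    """Score `word` after `context`, backing off to shorter contexts."""
    context = tuple(context)
    weight = 1.0
    while context:
        n = len(context)
        grams = Counter(tuple(tokens[i:i + n + 1]) for i in range(len(tokens) - n))
        ctx = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        if grams[context + (word,)] > 0:
            return weight * grams[context + (word,)] / ctx[context]
        context = context[1:]  # drop the oldest word
        weight *= alpha        # penalize each backoff step
    return weight * tokens.count(word) / len(tokens)

tokens = ("<s> Lyn drinks chocolate </s> <s> John drinks tea </s> "
          "<s> Lyn eats chocolate </s>").split()

print(stupid_backoff_score(tokens, ["John", "drinks"], "tea"))        # trigram seen
print(stupid_backoff_score(tokens, ["John", "drinks"], "chocolate"))  # backs off once
```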

Interpolation

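Interpolation is another smoothing technique for language models. It combines the estimates of several models or distributions, which reduces overfitting to sparse higher-order counts and improves generalization. The most common form is linear interpolation, which takes a weighted average of the probabilities given by N-gram models of different orders.
The weights $\lambda$ sum to 1 and can be learned from data, typically on the validation set.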
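
For a trigram model, linear interpolation can be written as $\hat{P}(w_n|w_{n-2}w_{n-1})=\lambda_1P(w_n|w_{n-2}w_{n-1})+\lambda_2P(w_n|w_{n-1})+\lambda_3P(w_n)$ with $\lambda_1+\lambda_2+\lambda_3=1$. The sketch below assumes the three component probabilities are already available; the default weights are placeholder values for illustration, not tuned ones.

```python
def interpolate(trigram_p, bigram_p, unigram_p, lambdas=(0.7, 0.2, 0.1)):
    """Weighted average of trigram, bigram, and unigram estimates."""
    l1, l2, l3 = lambdas
    if abs(l1 + l2 + l3 - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return l1 * trigram_p + l2 * bigram_p + l3 * unigram_p

# An unseen trigram (probability 0) still gets a non-zero estimate
# from the lower-order models.
print(interpolate(trigram_p=0.0, bigram_p=0.5, unigram_p=0.1))
```

Unlike backoff, interpolation always mixes all orders, even when the highest-order N-gram was observed.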

Quiz

Question:
Corpus: “I am happy I am learning”
In the context of our corpus, what is the estimated probability of word “can” following the word “I” using the
bigram model and add k smoothing where k=3.
Answer:
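$P(can|I)=\cfrac{C(I\space can)+k}{C(I)+k\times |V|}=\cfrac{0+3}{2+3\times 4}=\cfrac{3}{14}$
Explanation: “I can” never occurs, I occurs twice, and the vocabulary {I, am, happy, learning} has 4 words.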