[Machine Translation] Machine Translation Challenge with Terminology Dictionary Intervention

Challenge link: https://challenge.xfyun.cn/topic/info?type=machine-translation-2024

Challenge Overview

Installing Libraries

spacy

1. Check the locally installed spacy version:
pip show spacy

My spacy version is 3.6.0, so I install the matching en_core_web_sm model from the downloaded archive:

pip install en_core_web_sm-3.6.0.tar.gz

en_core_web_sm download link: https://github.com/explosion/spacy-models/releases?q=en_core_web_sm&expanded=true
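To confirm the model installed correctly, a minimal check is to load it in Python and tokenize a sentence; the sample sentence below is only an illustration, and the exact model/spacy versions depend on what you installed:

import spacy

# Load the English model installed from the tar.gz archive above
nlp = spacy.load("en_core_web_sm")
print(spacy.__version__, nlp.meta["version"])  # both should report a 3.6.x version here

# Quick tokenization sanity check (the sentence is just an example)
doc = nlp("Neural machine translation maps a source sentence to a target sentence.")
print([token.text for token in doc])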

Data Preprocessing

Challenge data
  • Training set: bilingual data - more than 140,000 Chinese-English sentence pairs
  • Development set: 1,000 English-Chinese sentence pairs
  • Test set: 1,000 English-Chinese sentence pairs
  • Terminology dictionary: 2,226 English-Chinese entries (a loading sketch follows below)
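The TranslationDataset class below takes a terminology dictionary as its second argument, but the loading step is not shown. The helper below is a minimal sketch that assumes the dictionary file stores one tab-separated english_term / chinese_term pair per line; the function name and file layout are assumptions, so adjust them to the actual challenge files:

def load_terminology_dictionary(dict_file):
    """Load the terminology dictionary into a {english_term: chinese_term} dict.

    Assumes one tab-separated pair per line; adjust if the challenge file
    uses a different layout.
    """
    terminology = {}
    with open(dict_file, 'r', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            en_term, zh_term = line.split('\t')
            terminology[en_term] = zh_term
    return terminology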
# Required imports
from collections import Counter

import torch
import torch.nn as nn
from torch.utils.data import Dataset
from torchtext.data.utils import get_tokenizer


# Dataset class, extended to handle the terminology dictionary
class TranslationDataset(Dataset):
    def __init__(self, filename, terminology):
        self.data = []
        with open(filename, 'r', encoding='utf-8') as f:
            for line in f:
                en, zh = line.strip().split('\t')
                self.data.append((en, zh))

        self.terminology = terminology

        # Build the vocabularies; terms from the terminology dictionary
        # must also be included in the English vocabulary
        self.en_tokenizer = get_tokenizer('basic_english')
        self.zh_tokenizer = list  # character-level tokenization for Chinese

        en_vocab = Counter(self.terminology.keys())  # make sure terminology terms are counted
        zh_vocab = Counter()

        for en, zh in self.data:
            en_vocab.update(self.en_tokenizer(en))
            zh_vocab.update(self.zh_tokenizer(zh))

        # Special tokens first (so <pad> gets index 0), then terminology terms,
        # then the most frequent corpus tokens
        self.en_vocab = ['<pad>', '<sos>', '<eos>', '<unk>'] + list(self.terminology.keys()) + [word for word, _ in en_vocab.most_common(10000)]
        self.zh_vocab = ['<pad>', '<sos>', '<eos>', '<unk>'] + [word for word, _ in zh_vocab.most_common(10000)]

        self.en_word2idx = {word: idx for idx, word in enumerate(self.en_vocab)}
        self.zh_word2idx = {word: idx for idx, word in enumerate(self.zh_vocab)}

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        en, zh = self.data[idx]
        # Map out-of-vocabulary tokens to <unk> and append <eos>
        en_tensor = torch.tensor([self.en_word2idx.get(word, self.en_word2idx['<unk>']) for word in self.en_tokenizer(en)] + [self.en_word2idx['<eos>']])
        zh_tensor = torch.tensor([self.zh_word2idx.get(word, self.zh_word2idx['<unk>']) for word in self.zh_tokenizer(zh)] + [self.zh_word2idx['<eos>']])
        return en_tensor, zh_tensor


def collate_fn(batch):
    en_batch, zh_batch = [], []
    for en_item, zh_item in batch:
        en_batch.append(en_item)
        zh_batch.append(zh_item)

    # Pad the English and Chinese sequences separately (<pad> has index 0)
    en_batch = nn.utils.rnn.pad_sequence(en_batch, padding_value=0, batch_first=True)
    zh_batch = nn.utils.rnn.pad_sequence(zh_batch, padding_value=0, batch_first=True)

    return en_batch, zh_batch
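As a usage sketch, the dataset and collate function plug into a standard PyTorch DataLoader as shown below. The file names ('en-zh.dic', 'train.txt') and the batch size are assumptions rather than the challenge's exact paths, and load_terminology_dictionary is the hypothetical helper sketched above:

from torch.utils.data import DataLoader

# Hypothetical paths; replace with the actual challenge data files
terminology = load_terminology_dictionary('en-zh.dic')
train_dataset = TranslationDataset('train.txt', terminology)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)

en_batch, zh_batch = next(iter(train_loader))
print(en_batch.shape, zh_batch.shape)  # (batch_size, max_en_len), (batch_size, max_zh_len)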

Main Function Walkthrough
