[NLP] How to quickly load a pretrained gensim word2vec word-vector model

2024-03-16 06:56:02

1 Problem

The code below loads the word2vec word vectors, but every load takes several minutes, which is very inefficient.

from gensim.models import KeyedVectors

# Load the Chinese word-vector model (the model file must be downloaded beforehand)
word2vec_model = KeyedVectors.load_word2vec_format('hy-tmp/word2vec.bz2', binary=False)

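2 Solutions

(1) Option 1
After the first load, save the model in gensim's native format, which loads quickly. From the second run on, the model is read back from that file.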

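import os
from gensim.models import KeyedVectors

file_path = "word2vec/train_bio_word"
if os.path.exists(file_path):
    # Fast path: load gensim's native format and memory-map the vector array read-only
    word2vec_model = KeyedVectors.load(file_path, mmap='r')
else:
    # Slow path: load the Chinese word-vector model (the model file must be downloaded beforehand)
    word2vec_model = KeyedVectors.load_word2vec_format('hy-tmp/word2vec.bz2', binary=False)
    # L2-normalize the vectors in place (note: gensim 4.x marks init_sims as deprecated)
    word2vec_model.init_sims(replace=True)
    # Make sure the target directory exists before saving
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    word2vec_model.save(file_path)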
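
To illustrate why the cached file loads so much faster, here is a minimal sketch that uses only numpy rather than gensim. It assumes a toy three-word vocabulary and temporary file names (all hypothetical). It contrasts parsing the word2vec text format line by line with memory-mapping a saved array, which is the same principle that `KeyedVectors.load(..., mmap='r')` relies on.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy data standing in for a real pretrained model.
words = ["cat", "dog", "fish"]
vectors = np.random.default_rng(0).random((3, 4)).astype(np.float32)

tmp_dir = tempfile.mkdtemp()
text_path = os.path.join(tmp_dir, "toy_vectors.txt")
npy_path = os.path.join(tmp_dir, "toy_vectors.npy")

# Write the word2vec text format: a "count dim" header, then one word per line.
with open(text_path, "w", encoding="utf-8") as fh:
    fh.write(f"{len(words)} {vectors.shape[1]}\n")
    for word, vec in zip(words, vectors):
        fh.write(word + " " + " ".join(repr(float(x)) for x in vec) + "\n")


def load_text_format(path):
    """Slow path: every number is parsed from text, line by line."""
    with open(path, encoding="utf-8") as fh:
        count, dim = map(int, fh.readline().split())
        vocab = []
        matrix = np.empty((count, dim), dtype=np.float32)
        for row, line in enumerate(fh):
            parts = line.rstrip().split(" ")
            vocab.append(parts[0])
            matrix[row] = [float(x) for x in parts[1:]]
    return vocab, matrix


# First run: pay the parsing cost once, then cache the raw array.
vocab, matrix = load_text_format(text_path)
np.save(npy_path, matrix)

# Later runs: memory-map the cached array; nothing is parsed or copied up front.
cached = np.load(npy_path, mmap_mode="r")
```

With a real model the text file holds hundreds of thousands of lines, so skipping the parse step is where the minutes are saved. The memory-mapped array is read-only and its pages are read from disk only when they are accessed.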

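(2) Option 2
After the first load, save only the word vectors that are actually used to a local table (CSV). Later runs then load just the vectors in that table instead of the entire word2vec model.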

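import os
import numpy as np
import pandas as pd

file_path = "word2vec/train_vocabulary_vector.csv"
if os.path.exists(file_path):
    # Read the word-to-vector table back (CSV to dict); index_col=0 skips the row index written by to_csv
    vocabulary_vector = dict(pd.read_csv(file_path, index_col=0))
    # Convert each column (a pandas Series) back to np.array for later use
    for key, value in vocabulary_vector.items():
        vocabulary_vector[key] = np.array(value)
else:
    # Build the vocabulary from all texts. text_data1 is the tokenized corpus; each element must be
    # an iterable of words (if an element is a space-separated str, iterate over item.split() instead)
    vocabulary = list(set(word for item in text_data1 for word in item))
    # Build the word-to-vector dict, keeping only words the model knows
    vocabulary_vector = {}
    for word in vocabulary:
        if word in word2vec_model:
            vocabulary_vector[word] = word2vec_model[word]
    # Save the dict; JSON does not store numpy vectors well, so use CSV (one column per word)
    pd.DataFrame(vocabulary_vector).to_csv(file_path)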
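
A CSV round trip stores every number as text, so float32 vectors come back as float64. As an alternative to the CSV approach above (not the author's method), the following sketch caches the same kind of dict in numpy's `.npz` format, which keeps the dtype. The toy dict and the cache file name are hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for the vocabulary_vector dict built from the full model.
vocabulary_vector = {
    "apple": np.array([0.1, 0.2, 0.3], dtype=np.float32),
    "pear": np.array([0.4, 0.5, 0.6], dtype=np.float32),
}

cache_path = os.path.join(tempfile.mkdtemp(), "train_vocabulary_vector.npz")

# Save: one array of words and one (n_words, dim) matrix; row i belongs to word i.
words = list(vocabulary_vector)
np.savez(
    cache_path,
    words=np.array(words),
    vectors=np.stack([vocabulary_vector[w] for w in words]),
)

# Load: rebuild the word -> vector dict from the two arrays.
with np.load(cache_path) as data:
    restored = {str(w): vec for w, vec in zip(data["words"], data["vectors"])}
```

Storing the words and the matrix as two aligned arrays also keeps the file compact, since each vector is written in binary instead of as a column of decimal strings.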

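(3) Option 3
Instead of the original word2vec pretrained weights, use the embeddings library. It downloads the weight files automatically and then serves lookups efficiently.
Reference: https://github.com/vzhong/embeddings
Install the library: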

pip install embeddings  # from pypi
pip install git+https://github.com/vzhong/embeddings.git  # from github

from embeddings import GloveEmbedding, FastTextEmbedding, KazumaCharEmbedding, ConcatEmbedding

g = GloveEmbedding('common_crawl_840', d_emb=300, show_progress=True)
f = FastTextEmbedding()
k = KazumaCharEmbedding()
c = ConcatEmbedding([g, f, k])
for w in ['canada', 'vancouver', 'toronto']:
    print('embedding {}'.format(w))
    print(g.emb(w))
    print(f.emb(w))
    print(k.emb(w))
    print(c.emb(w))

Original article: https://blog.csdn.net/weixin_43935696/article/details/136657972
