Week N2: Building a Vocabulary

Article link: https://blog.csdn.net/a536723241/article/details/142750127

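🍨 This post is a learning log for the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
🍖 Original author: K同学啊

This week's task:
Build a vocabulary from the .txt file used in week N1.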

Imports and data

from torchtext.vocab import build_vocab_from_iterator
from collections import Counter
from torchtext.data.utils import get_tokenizer
import jieba,re,torch

data = [
    '我是k同学啊！',
    '我是一个深度学习博主，',
    '这是我的365深度学习训练营教案',
    '你可以通过百度、微信搜索关键词【K同学啊】找到我'
]

Set up the tokenizer

# Chinese word segmentation
tokenizer = jieba.lcut
# Load a custom user dictionary
jieba.load_userdict('任务文件.txt')

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\admin\AppData\Local\Temp\jieba.cache
Loading model cost 0.686 seconds.
Prefix dict has been built successfully.

Remove punctuation and stopwords

# Remove punctuation
def remove_punctuation(text):
    # Strip everything except word characters (\w) and whitespace (\s)
    return re.sub(r'[^\w\s]','',text)

# Stopword list
stopwords = set([
    '这','是','的'
])

# Remove stopwords
def remove_stopwords(words):
    return [word for word in words if word not in stopwords]
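
As a quick check of the two cleaning helpers, the sketch below applies them to one sentence from the sample data. It is self-contained and makes two assumptions: the pattern is written as `[^\w\s]` (keep word characters and whitespace), and the token list passed to `remove_stopwords` is typed by hand instead of coming from jieba, so the block runs without any third-party package.

```python
import re

# Stopword list copied from the post.
stopwords = {'这', '是', '的'}

def remove_punctuation(text):
    # In Python 3, \w also matches CJK characters, so only punctuation
    # such as '、', '【' and '】' is stripped.
    return re.sub(r'[^\w\s]', '', text)

def remove_stopwords(words):
    return [word for word in words if word not in stopwords]

cleaned = remove_punctuation('你可以通过百度、微信搜索关键词【K同学啊】找到我')
print(cleaned)

# Hand-written tokens standing in for jieba output (an assumption).
kept = remove_stopwords(['这', '是', '我', '的', '365'])
print(kept)
```

Full-width punctuation is removed because it is neither a word character nor whitespace under Python's Unicode-aware `re` matching.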

Set up the iterator

# Yield one cleaned token list per input text
def yield_tokens(data_iter):
    for text in data_iter:
        text = remove_punctuation(text)
        text = tokenizer(text)
        text = remove_stopwords(text)
        yield text
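
One detail worth keeping in mind: `yield_tokens` is a generator, so each generator object can be walked only once. The sketch below illustrates this with a simplified version of the function. The `tokenizer` parameter and its default `list` (which splits a string into single characters) are hypothetical stand-ins for `jieba.lcut`, used only so the example runs without jieba.

```python
def yield_tokens(data_iter, tokenizer=list):
    # `tokenizer=list` is a character-level stand-in for jieba.lcut (an assumption).
    for text in data_iter:
        yield tokenizer(text)

data = ['深度学习', '训练营']
token_iter = yield_tokens(data)

first_pass = list(token_iter)   # consumes the generator
second_pass = list(token_iter)  # nothing is left to yield

print(first_pass)
print(second_pass)
```

If the tokens are needed more than once, calling `yield_tokens(data)` again creates a fresh generator.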

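Build the vocabulary

# Build the vocabulary, reserving '<unk>' for out-of-vocabulary tokens
vocab = build_vocab_from_iterator(yield_tokens(data),specials=['<unk>'])
# Tokens not in the vocabulary map to the index of '<unk>'
vocab.set_default_index(vocab['<unk>'])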

Convert text to indices

# Print the vocabulary
print('Vocabulary size:',len(vocab))
print('Token-to-index mapping:',vocab.get_stoi())

text = '这是我的365天深度学习训练营教案'
words = remove_stopwords(jieba.lcut(text))
print('\n')
print('Text after jieba segmentation:',jieba.lcut(text))
print('Text after stopword removal:',words)
print('Text as indices:',[vocab[word] for word in words])

Vocabulary size: 21
Token-to-index mapping: {'<unk>': 0, '可以': 13, '我': 1, '深度': 5, '微信': 14, '学习': 4, '同学': 2, '啊': 3, '365': 6, 'K': 7, 'k': 8, '一个': 9, '你': 10, '关键词': 11, '博主': 12, '找到': 15, '搜索': 16, '教案': 17, '百度': 18, '训练营': 19, '通过': 20}


Text after jieba segmentation: ['这', '是', '我', '的', '365', '天', '深度', '学习', '训练营', '教案']
Text after stopword removal: ['我', '365', '天', '深度', '学习', '训练营', '教案']
Text as indices: [1, 6, 0, 5, 4, 19, 17]
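
The third index in the result is 0 because '天' never appeared in the training sentences, so the lookup falls back to the default index, which was set to the index of `<unk>`. The fallback can be pictured with a plain dictionary, as in the sketch below. This is not how torchtext implements it; `stoi` is just a hand-copied subset of the mapping printed above, and `itos` is a hypothetical reverse mapping added for illustration.

```python
# Subset of the token-to-index mapping printed above.
stoi = {'<unk>': 0, '我': 1, '学习': 4, '深度': 5, '365': 6, '教案': 17, '训练营': 19}
words = ['我', '365', '天', '深度', '学习', '训练营', '教案']

# Unknown tokens fall back to the index of '<unk>'.
unk_index = stoi['<unk>']
ids = [stoi.get(word, unk_index) for word in words]
print(ids)

# Mapping the indices back shows where information was lost.
itos = {index: token for token, index in stoi.items()}
decoded = [itos[i] for i in ids]
print(decoded)
```

The round trip turns '天' into `<unk>`, which is why a vocabulary built from a small corpus loses any word it has not seen.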

Summary

The vocabulary-building function build_vocab_from_iterator()
Function prototype:

build_vocab_from_iterator(iterator: Iterable,
                          min_freq: int = 1,
                          specials: Optional[List[str]] = None,
                          special_first: bool = True,
                          max_tokens: Optional[int] = None)
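
To make the parameters more concrete, here is a plain-Python approximation of how they shape the token-to-index mapping. It is a sketch, not torchtext's code: `build_stoi` is a hypothetical helper, and it assumes that tokens are ordered by descending frequency with ties broken by token string, and that `max_tokens` counts the special tokens too. The first assumption is consistent with the mapping printed earlier in this post.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional


def build_stoi(iterator: Iterable[List[str]],
               min_freq: int = 1,
               specials: Optional[List[str]] = None,
               special_first: bool = True,
               max_tokens: Optional[int] = None) -> Dict[str, int]:
    """Approximate the token-to-index mapping (hypothetical helper)."""
    specials = specials or []
    counter = Counter()
    for tokens in iterator:
        counter.update(tokens)
    # Specials are placed explicitly, so drop them from the counted tokens.
    for tok in specials:
        counter.pop(tok, None)
    # Descending frequency, ties broken by token string.
    ordered = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_tokens is not None:
        # Assumption: the cap includes the special tokens.
        ordered = ordered[:max(max_tokens - len(specials), 0)]
    tokens = [tok for tok, freq in ordered if freq >= min_freq]
    tokens = specials + tokens if special_first else tokens + specials
    return {tok: idx for idx, tok in enumerate(tokens)}


# Pre-tokenized toy corpus (hand-written, not jieba output).
corpus = [['我', '同学', '啊'], ['我', '深度', '学习'], ['深度', '学习', '教案']]

full = build_stoi(corpus, specials=['<unk>'])
frequent = build_stoi(corpus, min_freq=2, specials=['<unk>'])
capped = build_stoi(corpus, specials=['<unk>'], max_tokens=3)

print(full)
print(frequent)
print(capped)
```

With `min_freq=2` the three tokens that occur only once are dropped, and with `max_tokens=3` only `<unk>` plus the two highest-ranked tokens survive.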