Python Deep Learning: Text and Language

czslxk

Category: Python Deep Learning with PyTorch. Tags: python, deep learning, natural language processing

Article link: https://blog.csdn.net/weixin_45717457/article/details/104315134

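Text and Language

This article covers the following topics:
1. Text preprocessing
2. Language models

Text Preprocessing

Text is a kind of sequence data: an article can be viewed as a sequence of characters or of words. This section introduces the common preprocessing steps for text data. Preprocessing usually consists of four steps:

1. Read in the text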

import collections
import re

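def read_time_machine():
    """Read the text file and return one normalized string per line."""
    with open('/home/kesci/input/timemachine7163/timemachine.txt', 'r') as f:
        # Lowercase each line and replace every run of non-letter characters with a single space
        lines = [re.sub('[^a-z]+', ' ', line.strip().lower()) for line in f]
    return lines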


lines = read_time_machine()
print('# sentences %d' % len(lines))

out: # sentences 3221
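
To make the cleaning step concrete, here is a minimal sketch that applies the same regular expression to a single made-up line. The helper name `normalize_line` and the sample string are assumptions for illustration only; they are not part of the original code.

```python
import re

def normalize_line(line):
    """Lowercase a line and replace each run of non-letter characters with one space."""
    return re.sub('[^a-z]+', ' ', line.strip().lower())

# Hypothetical sample line, not taken from the data file
sample = "The Time Machine, by H. G. Wells [1898]"
cleaned = normalize_line(sample)
print(repr(cleaned))               # 'the time machine by h g wells '
print(repr(normalize_line('\n')))  # '' (a blank line stays an empty string)
```

Note that punctuation and digits at the end of a line leave a trailing space, and blank lines become empty strings. Both effects show up again after tokenization.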

2. Tokenization
Tokenize each sentence, that is, split it into a list of tokens so that the sentence becomes a sequence of words.

def tokenize(sentences, token='word'):
    """Split sentences into word or char tokens"""
    if token == 'word':
        return [sentence.split(' ') for sentence in sentences]
    elif token == 'char':
        return [list(sentence) for sentence in sentences]
    else:
        raise ValueError('unknown token type: ' + token)

tokens = tokenize(lines)
tokens[0:2]

out: [['the', 'time', 'machine', 'by', 'h', 'g', 'wells', ''], ['']]
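
The empty strings in this output come from `split(' ')`, which keeps an empty token wherever a separator sits at the edge of a string or the string itself is empty. As a hedged alternative, not the original author's code, the sketch below compares it with `split()` without arguments, which splits on any whitespace and drops empty tokens. The names `tokenize_words` and `lines_demo` are assumptions for illustration.

```python
def tokenize_words(sentences):
    """Split each sentence on whitespace, discarding empty tokens."""
    return [sentence.split() for sentence in sentences]

# Two lines shaped like the cleaned text: one with a trailing space, one empty
lines_demo = ['the time machine by h g wells ', '']

print([s.split(' ') for s in lines_demo])
# [['the', 'time', 'machine', 'by', 'h', 'g', 'wells', ''], ['']]
print(tokenize_words(lines_demo))
# [['the', 'time', 'machine', 'by', 'h', 'g', 'wells'], []]
```

Whether to drop the empty tokens is a design choice: keeping them preserves the original behavior, while dropping them avoids giving the empty string its own vocabulary entry.
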
3. Build a vocabulary that maps each token to a unique index (index)
For convenience
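
As a minimal sketch of the mapping this step describes, the code below counts tokens with `collections.Counter` and assigns each distinct token an integer index. The function name `build_vocab`, the `min_freq` parameter, the `'<unk>'` placeholder at index 0, and the frequency-based ordering are all assumptions made for illustration; the original implementation may differ.

```python
import collections

def build_vocab(tokens, min_freq=1):
    """Return (idx_to_token, token_to_idx) for a list of token lists."""
    # Count how often each token occurs across all sentences
    counter = collections.Counter(tk for sentence in tokens for tk in sentence)
    # Keep tokens seen at least min_freq times; most frequent first, ties alphabetical
    kept = sorted((tk for tk, freq in counter.items() if freq >= min_freq),
                  key=lambda tk: (-counter[tk], tk))
    # Index 0 is reserved for unknown tokens (an assumption of this sketch)
    idx_to_token = ['<unk>'] + kept
    token_to_idx = {tk: idx for idx, tk in enumerate(idx_to_token)}
    return idx_to_token, token_to_idx

# Hypothetical toy input, not taken from the data file
demo_tokens = [['the', 'time', 'machine'], ['the', 'time', 'traveller']]
idx_to_token, token_to_idx = build_vocab(demo_tokens)
print(idx_to_token)
# ['<unk>', 'the', 'time', 'machine', 'traveller']
print([token_to_idx.get(tk, 0) for tk in ['the', 'machine', 'wells']])
# [1, 3, 0]
```

Looking tokens up with `.get(tk, 0)` sends any word missing from the vocabulary to the unknown index instead of raising a `KeyError`.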
