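Download the text in txt format: https://www.ixdzs.com/d/77/77362/#download
After downloading, open the file in Notepad and save it again with UTF-8 encoding.
Code for splitting the text data: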
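The re-encoding step can also be done in Python rather than by hand in Notepad. The following is a minimal sketch, not the author's method: the helper name `convert_to_utf8` is hypothetical, and the default source encoding `gb18030` is an assumption (it is common for Chinese txt downloads, but the actual encoding of this file is not stated here, so check it and pass the right value).

```python
from pathlib import Path


def convert_to_utf8(src, dst, src_encoding="gb18030"):
    """Re-encode a text file as UTF-8 and return the number of characters written.

    src_encoding is an assumption; pass the file's real encoding if it differs.
    """
    # Decode the whole file using the assumed source encoding.
    text = Path(src).read_text(encoding=src_encoding)
    # Write the same text back out as UTF-8.
    Path(dst).write_text(text, encoding="utf-8")
    return len(text)


# Hypothetical usage (file names are placeholders):
# convert_to_utf8("downloaded.txt", "红楼梦.txt")
```

If the assumed encoding is wrong, `read_text` will usually raise a `UnicodeDecodeError`, which is a useful signal to try another encoding instead of silently producing garbled text.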
# -*- coding: utf-8 -*-
"""
Created on Mon Aug 27 06:14:13 2018
@author: xiaozhe
"""
import re
#pattern = re.compile(r'[\n,。!?]')
pattern = re.compile(r'\s+')
with open('红楼梦.txt', 'r', encoding='utf-8') as f:
    jay_file = pattern.sub('', f.read())
    # jay_file = f.read()
size = len(jay_file)
# Split by character position: the first 70% is training data, the next 20% validation data
train_data = jay_file[:int(size*0.7)]
vali_data = jay_file[int(size*0.7):int(size*0.9)]
test_data = jay_file[int(size*