Syntax notes:
The points below cover the string helpers used throughout this section:
strip() trims leading/trailing whitespace, including '\n'
jieba.cut() yields individual words, one token at a time
" ".join() concatenates individual words back into a single string
import pandas as pd
import numpy as np
Note when reading: the raw file has no header row (column index), which changes how read_csv behaves:
train_df = pd.read_csv('3.0text_train.txt', sep='\t', header=None)
# header=None tells read_csv that the raw file has no header row; otherwise the first data row would be treated as the column names
print(train_df.shape)
train_df.head()
test_df = pd.read_csv('3.0text_test.txt', sep='\t', header=None)
print(test_df.shape)
test_df.head()
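A minimal sketch of what header=None changes, using a made-up two-line file held in memory:

```python
import io
import pandas as pd

raw = '体育\t昨晚的比赛非常精彩\n财经\t股市今日小幅上涨\n'
with_header = pd.read_csv(io.StringIO(raw), sep='\t')             # first row is consumed as column names
no_header = pd.read_csv(io.StringIO(raw), sep='\t', header=None)  # columns are simply numbered 0, 1
print(with_header.shape)  # (1, 2) -- one data row was swallowed by the header
print(no_header.shape)    # (2, 2) -- both rows kept as data
```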
train_df = train_df.sample(frac=0.35, random_state=1)
test_df = test_df.sample(frac=0.35, random_state=1)
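sample(frac=0.35) keeps a random 35% of the rows to shrink the working set; because random_state is fixed, the draw is reproducible, which a quick check confirms:

```python
# two draws with the same random_state return the identical subset
a = train_df.sample(frac=0.35, random_state=1)
b = train_df.sample(frac=0.35, random_state=1)
print(a.equals(b))  # True
```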
# **Rename the columns**

```python
# Rename the columns: the first column is each sample's category label, the second its text
train_df = train_df.rename(columns={0: 'category', 1: 'text'})
test_df = test_df.rename(columns={0: 'category', 1: 'text'})
train_df.head(3)
train_df.groupby('category').count()
```
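groupby('category').count() shows how many articles each class contributes; value_counts() is a common one-liner for the same per-class tally (a sketch, assuming the renamed train_df above):

```python
print(train_df['category'].value_counts())  # per-class article counts, sorted by frequency
```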
Read the stop words:
# read the txt file of stop words and assign it to stopWordlist
with open('C:/Users/lb/Desktop/test/3.0stopwords.txt', encoding='utf8') as file:
    stopWordlist = [k.strip() for k in file.readlines()]  # strip() filters out spaces and '\n'
with open('3.0stopwords.txt', encoding='utf8') as file:
    print([k.strip() for k in file.readlines()])
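Membership tests like `k not in stopWordlist` scan the whole list each time; if the stop-word file is large, converting it to a set is a cheap win (a sketch over the same file):

```python
with open('3.0stopwords.txt', encoding='utf8') as file:
    stop_word_set = {k.strip() for k in file if k.strip()}  # set lookups are O(1)
print(len(stop_word_set))
```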
Tokenization:
import jieba
import time

train_df.columns = ['分类', '文章']  # overwrite the column names via DataFrame.columns
stopword_list = [k.strip() for k in open('3.0stopwords.txt', encoding='utf8').readlines() if k.strip() != '']
cutWords_list = []
i = 0
startTime = time.time()
for article in train_df['文章']:
    # tokenize with jieba.cut and filter out the stop words in one pass
    cutWords = [k for k in jieba.cut(article) if k not in stopword_list]
    i += 1
    if i % 1000 == 0:
        print('Tokenizing the first %d articles took %.2f seconds' % (i, time.time() - startTime))
    cutWords_list.append(cutWords)
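The filter `k not in stopword_list` is the slow part of the loop above, since each check walks the whole list; a variant with the same output that trades the list for a set:

```python
stopword_set = set(stopword_list)  # one-time conversion; each membership test is now O(1)
cutWords_list = [
    [k for k in jieba.cut(article) if k not in stopword_set]
    for article in train_df['文章']
]
```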
Write the tokenization results to a file (note the newline appended to each line):
with open('3.0cutWords_list.txt', 'w', encoding='utf8') as file:  # write as utf8 explicitly so Windows' default codec can't choke
    for cutWords in cutWords_list:
        file.write(' '.join(cutWords) + '\n')
Read the finished tokenization results back in:
with open('3.0cutWords_list.txt', encoding='utf8') as file:
    cutWords_list = [k.split() for k in file.readlines()]
print(len(cutWords_list))     # number of articles
print(len(cutWords_list[0]))  # number of tokens in article 0
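split() with no argument splits on any run of whitespace and silently drops the trailing '\n', so no extra strip() is needed here; a quick check:

```python
line = '今天 天气 不错\n'
print(line.split())  # ['今天', '天气', '不错'] -- the newline disappears automatically
```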
Row 0:
print(cutWords_list[0])
Use " ".join() to connect the tokens:
" ".join(cutWords_list[0])
Join all of them together:
document = [" ".join(sent0) for sent0 in cutWords_list]
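Each entry of document is now one article as a single space-separated string, the flat format most downstream text tools expect; a quick sanity check (the preview slice is arbitrary):

```python
print(len(document) == len(cutWords_list))  # True: one joined string per article
print(document[0][:50])                     # preview of the first joined article
```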