一. Dataset
train.dat contains 393,366 records. In each record, the first field is the user ID, the second is the item ID, the third is the user's rating of that item, the fourth is the review count, and the fifth is the review text.
In test.dat, the first field is the user ID and the second is the item ID.
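Since the preprocessing code treats each line as space-separated (the review text being everything after the fourth field), a single train.dat record can be pulled apart like this. The sample line is made up for illustration:

```python
# Sketch: splitting one train.dat record, assuming space-separated fields
# (the sample line below is invented for illustration)
line = "u123 i456 5 2 great product works well"
fields = line.strip("\n").split(" ")

# First four fields are metadata; the rest is the review text
user_id, item_id, rating, review_count = fields[0:4]
review_text = " ".join(fields[4:])

print(user_id, item_id, rating, review_count)  # u123 i456 5 2
print(review_text)                             # great product works well
```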
二. Basic Approach and Implementation
For now I'll just post the code; a detailed walkthrough will be added later~
1. Data Preprocessing
The review text is preprocessed with the nltk package: common stopwords and other irrelevant words are removed, the remaining words are stemmed, and so on. The result is saved as newdat.dat.
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

def textPrecessing(text):
    # Tokenize, drop English stopwords, keep only nouns, then stem
    wordLst = nltk.word_tokenize(text)
    filtered = [w for w in wordLst if w not in stopwords.words('english')]
    refiltered = nltk.pos_tag(filtered)
    filtered = [w for w, pos in refiltered if pos.startswith('NN')]
    ps = PorterStemmer()
    filtered = [ps.stem(w) for w in filtered]
    return " ".join(filtered)

def split_word():
    x = []
    with open("E:/project/o/comdata/train.dat", encoding='utf-8') as f:
        for data in f.readlines():
            data = data.strip("\n").split(" ")
            temp = []
            temp.append(' '.join(data[0:4]))  # user ID, item ID, rating, review count
            text = data[4:]                   # review text
            temp.append(textPrecessing(" ".join(text)))
            x.append(' '.join(temp))
    with open("E:/project/o/comdata/newdat.dat", "w", encoding='utf-8') as w:
        for i in x:
            w.write(i)
            w.write('\n')
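To see what split_word writes out without touching the data files, here is a self-contained sketch of the same per-line transformation. It uses a simplified stand-in for textPrecessing (a plain lowercase stopword filter, with no nltk tagging or stemming), so the exact output differs from the real pipeline:

```python
# Simplified stand-in for textPrecessing: no nltk, just drop a few stopwords.
# The real version also POS-tags (keeping nouns) and applies Porter stemming.
STOPWORDS = {"the", "a", "is", "and", "it"}

def simple_preprocess(text):
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)

def transform_line(line):
    # Same reshaping as split_word: keep the four metadata fields,
    # replace the raw review text with its preprocessed form.
    data = line.strip("\n").split(" ")
    meta = ' '.join(data[0:4])
    review = simple_preprocess(" ".join(data[4:]))
    return ' '.join([meta, review])

print(transform_line("u1 i9 4 1 the battery is great and it lasts"))
# u1 i9 4 1 battery great lasts
```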