The aclImdb dataset has the following structure:
aclImdb
|- test
|-- neg
|-- pos
|- train
|-- neg
|-- pos
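As shown, the dataset is split into a training set and a test set. Each contains two folders, neg and pos, holding negative and positive reviews respectively. Each of these folders contains many small txt files, one per review. I merged all of the small files into a single txt file per class (pos.txt and neg.txt), one review per line. The code is as follows: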
import os


class HandData:
    def handData(self, dirname):
        # Merge every review under dirname/pos and dirname/neg into
        # pos.txt and neg.txt in the current directory, one review per line.
        # The folders are named explicitly instead of being discovered with
        # os.walk, so neither class can be skipped by accident.
        for label in ("pos", "neg"):
            label_dir = os.path.join(dirname, label)
            # Sort the file names so the output order is reproducible.
            files = sorted(os.listdir(label_dir))
            with open(label + ".txt", "w", encoding="utf8") as f:
                for name in files:
                    with open(os.path.join(label_dir, name), encoding="utf8") as fe:
                        f.write(fe.read() + "\n")


fileHander = HandData()
fileHander.handData("aclImdb/train")
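Once pos.txt and neg.txt have been written, a natural next step is to read them back as labelled samples. The following is only a minimal sketch: the function name load_merged and the label encoding (1 for positive, 0 for negative) are my own assumptions for illustration and are not part of the code above.

```python
def load_merged(pos_path="pos.txt", neg_path="neg.txt"):
    """Read the merged files and return (texts, labels).

    Assumed label encoding (hypothetical): 1 = positive, 0 = negative.
    """
    texts, labels = [], []
    for path, label in ((pos_path, 1), (neg_path, 0)):
        with open(path, encoding="utf8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    texts.append(line)
                    labels.append(label)
    return texts, labels
```

Calling texts, labels = load_merged() after the merge step should give two lists of equal length, which makes it easy to check that the number of samples matches the number of original review files.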