Project repository:
https://github.com/Lee991211/Innovation_training.git
Data Cleaning
Once a large amount of Weibo data has been crawled, the redundant fields need to be cleaned away so that the data follows a fixed format and meets the requirements for model training (@杨涛同学). This task is much simpler than the crawling itself; to keep a backup of the intermediate data, my cleaning pipeline is split into two scripts.
wash:
import pandas as pd

# Load the raw crawler output
data1 = pd.read_csv("keyword.csv")
# data2 = pd.read_csv("Aprilplus.csv")

# Drop every column except the post text and the publish time
data1.drop(data1.columns[[0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]], axis=1, inplace=True)
# data2.drop(data2.columns[[0, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16]], axis=1, inplace=True)

# data1 is already a DataFrame, so it can be written out directly
data1.to_csv('keywordtemp.csv', index=None)
# data2.to_csv('washAprilPlus.csv', index=None)
wash strips the redundant columns from the raw crawled file, keeping only two columns: the post text and the publish time.
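The same column pruning can also be done at read time via `read_csv`'s `usecols` parameter, which avoids the separate drop step. A minimal sketch on a hypothetical 17-column file, assuming the kept columns sit at positions 4 and 12 as implied by the drop list above:

```python
import pandas as pd
from io import StringIO

# Tiny 17-column stand-in for the real crawler output (hypothetical data)
csv_text = ",".join(f"c{i}" for i in range(17)) + "\n" + \
           ",".join(str(i) for i in range(17)) + "\n"

# usecols keeps only the wanted columns while parsing
df = pd.read_csv(StringIO(csv_text), usecols=[4, 12])
print(list(df.columns))  # ['c4', 'c12']
```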
wash2:
import pandas as pd

data1 = pd.read_csv("keywordtemp.csv")
# data2 = pd.read_csv("washAprilPlus.csv")

# Keep only the date part of the '发布时间' (publish time) column, dropping the clock time
data1['发布时间'] = data1['发布时间'].str.split(' ', expand=True)[0]
# Convert YYYY-MM-DD into MM/DD
data1['发布时间'] = data1['发布时间'].str.split('-', expand=True)[1] + '/' + data1['发布时间'].str.split('-', expand=True)[2]
# For each date of interest, collect the first 15 matching posts into result
result = []
for date in ['05/01', '05/02', '05/03', '05/04']:
    temp = []
    count = 1
    for row in data1.index:
        if data1.loc[row].values[1] == date:
            temp.append(data1.loc[row].values)
            count = count + 1
            if count == 16:
                result = result + temp
                break
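The per-date sampling above can also be written with pandas' `groupby(...).head(n)`, which keeps the first n rows of each group in one call. A minimal sketch on hypothetical data (the column name '微博正文' is assumed here, not taken from the script):

```python
import pandas as pd

# Hypothetical stand-in for the cleaned Weibo table: post text plus MM/DD date
df = pd.DataFrame({
    '微博正文': [f'post {i}' for i in range(40)],
    '发布时间': ['05/01'] * 20 + ['05/02'] * 20,
})

# Keep the first 15 posts per date, mirroring the count == 16 cut-off in wash2
sampled = df[df['发布时间'].isin(['05/01', '05/02'])].groupby('发布时间').head(15)
print(len(sampled))  # 30 rows: 15 for each of the two dates
```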