1. Data Collection
Collection target: the World Cup results summary table, WorldCupsSummary.
The table summarizes all 21 World Cup tournaments (1930-2018): host country, top four teams, number of participating teams, total goals, live attendance, and so on. It contains the following fields: Year (host year), HostCountry (host country), Winner (champion), Second (runner-up), Third (third place), Fourth (fourth place), GoalsScored (total goals), QualifiedTeams (number of participating teams), MatchesPlayed (total matches played), Attendance (total live attendance), HostContinent (continent of the host country), and WinnerContinent (continent of the winning team). The collected target fields are shown in the screenshot below.
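Once collection is finished, the schema above can be verified with a quick check like the following sketch, which assumes the table was saved as WorldCupsSummary.csv (the filename used at the end of this section):

import pandas as pd

# Expected fields of the WorldCupsSummary table
expected = ['Year', 'HostCountry', 'Winner', 'Second', 'Third', 'Fourth',
            'GoalsScored', 'QualifiedTeams', 'MatchesPlayed', 'Attendance',
            'HostContinent', 'WinnerContinent']
summary = pd.read_csv('WorldCupsSummary.csv')
# True only if every expected field is present in the collected table
print(set(expected) <= set(summary.columns))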
Collection workflow:
Step 1: import the required libraries, such as requests, json, and pandas. The code is shown below:
import requests
import json
import time
import pandas as pd
import random
Step 2: decide on the paging method, set the URL and the request parameters, and prepare to send the request and receive the response. The code is shown below:
df = []   # accumulator for the scraped rows
i = 0
# API endpoint
url = 'https://api.bilibili.com/x/space/arc/search?'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36 Edg/92.0.902.67',
    'referer': 'https://www.bilibili.com/',
}
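Step 3 passes a params object to requests.get, but the report never shows it being defined. The lines below are a minimal sketch; the parameter names (mid for the page id, pn for the page number, ps for the page size) are assumptions about this endpoint, not taken from the report:

# Assumed request parameters (not shown in the report):
# mid = id of the page, pn = page number, ps = records per page
params = {
    'mid': 1935882,   # placeholder: the first id from the list in step 6
    'pn': 1,
    'ps': 30,
}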
Step 3: send the request and confirm that the site responds normally and can be scraped. The code is shown below:
response = requests.get(url, headers=headers, params=params)
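The loop in the next step iterates over info_list, which the report never defines. A minimal sketch of how it could be derived from the response is shown below; the 'data' -> 'list' path into the JSON payload is an assumption, not confirmed by the report:

# A 200 status code means the site answered normally and can be scraped
print(response.status_code)
# Assumed JSON layout: the records sit under data -> list
info_list = response.json().get('data', {}).get('list', [])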
Step 4: define a list and send requests with the requests library. Since we need data from multiple pages, we exploit the predictable differences between page URLs to scrape page by page, locating the target fields in each record. The code is shown below:
# Fields to pull out of each record; every missing field falls back to ''
fields = ['RoundID', 'MatchID', 'Team_Initials', 'Coach_Name',
          'Line_up', 'Shirt_Number', 'Player_Name', 'Position', 'Event']
for value in info_list:
    # dict.get replaces the nine repeated try/except blocks and fixes their
    # inconsistent defaults (Player_Name and Event fell back to 0, Position to '无')
    df.append([value.get(field, '') for field in fields])
Step 5: add a timing function to throttle the code, slowing the page turns so they imitate manual browsing and the scraper does not run too fast. The code is shown below:
# Pause 1-2 seconds between requests so the scraper is not too fast
time.sleep(random.randint(1, 2))
Step 6: once the site is confirmed to be scrapeable, collect the information with a for loop, then write the scraped rows out to a CSV file. The code is shown below:
if __name__ == '__main__':
    start = time.time()
    # Number of pages to scrape per id
    num = 3
    # Ids of the pages to scrape
    mid_list = [1935882, 351498044, 390461123, 99157282, 60384544, 9008159, 8960728, 71851529]
    for value in mid_list:
        get_info(url, headers, num, value)
    end = time.time()
    print('Scraping took %.1f s' % (end - start))
    # Store the data as a DataFrame and save it to a CSV file
    df = pd.DataFrame(df, columns=["RoundID", "MatchID", "Team Initials", "Coach Name", "Line-up",
                                   "Shirt Number", "Player Name", "Position", "Event"])
    df.to_csv('WorldCupsSummary.csv', encoding='gb18030', index=False)
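The main block calls get_info, but its definition never appears in the report. The sketch below is a hypothetical reconstruction that stitches steps 2 through 5 together: for one id it pages through the API, pulls the record list out of the JSON, extracts the fields, and pauses between pages. The parameter names and the JSON path are the same assumptions flagged earlier:

def get_info(url, headers, num, mid):
    # Hypothetical reconstruction; appends rows to the module-level df list
    fields = ['RoundID', 'MatchID', 'Team_Initials', 'Coach_Name',
              'Line_up', 'Shirt_Number', 'Player_Name', 'Position', 'Event']
    for page in range(1, num + 1):
        params = {'mid': mid, 'pn': page, 'ps': 30}   # assumed parameter names
        response = requests.get(url, headers=headers, params=params)
        # Assumed JSON layout: the records sit under data -> list
        info_list = response.json().get('data', {}).get('list', [])
        for value in info_list:
            df.append([value.get(field, '') for field in fields])
        # Pause 1-2 seconds to imitate manual paging
        time.sleep(random.randint(1, 2))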
With that, running the complete script yields a CSV file containing the collected page information, ready for analysis; a screenshot of part of the collected data is shown in the figure.
2. Data Preprocessing
2.1 Goals of data preprocessing (a pandas sketch of both goals follows the list):
(1) Data cleaning
(2) Missing value handling
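As a preview of these two goals, the sketch below shows typical pandas operations on the collected file; the duplicate-dropping and fill strategy are illustrative choices, not steps prescribed by the report:

import pandas as pd

# Load the collected file with the same encoding it was saved with
data = pd.read_csv('WorldCupsSummary.csv', encoding='gb18030')
# (1) Data cleaning: drop exact duplicate rows
data = data.drop_duplicates()
# (2) Missing value handling: count the gaps per column, then fill them
print(data.isnull().sum())
data = data.fillna('')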