Date:
2022/11/20
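Background:
I want to scrape the page linked below, which holds League of Legends match data (this is for a course assignment). The goal is to extract the relevant data from the page, including player names and some match statistics. I'm still new to web scraping and can't follow a lot of other people's code QWQ.
2022 Worlds Schedule, Standings and Match Results | QWER.GG: https://qwer.gg/leagues/Worlds/2022?tournament=%22969%22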
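Problem:
I first tried to fetch the page source, but ran into a small problem, so I switched to requesting the data endpoint directly, as shown in the screenshot. That request also fails and returns no data. How can I fix this?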
Code:
import requests
from lxml import etree

if __name__ == '__main__':
    # Spoofed User-Agent and request parameters
    headers = {
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/107.0.0.0 Safari/537.36 Edg/107.0.1418.42 ",
    }
    param = {
        "operationName": "ListPlayerStatisticsByTournament",
        "variables": {"tournamentId": "969"},
        "query": "fragment CorePlayer on Player {\n id\n nickName\n firstName\n lastName\n imageUrl\n birthday\n nationality\n position\n __typename\n}\n\nfragment CoreTeam on Team {\n id\n name\n acronym\n imageUrl\n nationality\n __typename\n}\n\nfragment CorePlayerStatistic on PlayerStatistic {\n player {\n ...CorePlayer\n __typename\n }\n team {\n ...CoreTeam\n __typename\n }\n playerId\n teamId\n tournamentId\n position\n games\n wins\n loses\n winRate\n kda\n kills\n deaths\n assists\n wardsPlaced\n wardsKilled\n dpm\n dtpm\n gpm\n cspm\n dpgr\n firstBlood\n firstTower\n __typename\n}\n\nquery ListPlayerStatisticsByTournament($tournamentId: ID!, $playerId: ID, $teamId: ID) {\n playerStatisticsByTournament(\n tournamentId: $tournamentId\n playerId: $playerId\n teamId: $teamId\n ) {\n ...CorePlayerStatistic\n __typename\n }\n}"
    }
    url_html = "https://qwer.gg/leagues/Worlds/2022?tournament=%22969%22"
    url_date = "https://qwer.gg/general/graphql"
    # Send the request and read the response data
    rp = requests.get(url=url_date, headers=headers, params=param)
    text = rp.text
    date = rp.json()
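Once the request does return JSON, the result of rp.json() still has to be unpacked. By GraphQL convention the payload sits under a top-level "data" key, inside a field named after the root field of the query (here playerStatisticsByTournament), and failures are reported under "errors". The sketch below flattens such a response into rows. The helper name flatten_player_stats and the sample values are made up for illustration, and the actual response from qwer.gg has not been verified.

```python
import json


def flatten_player_stats(payload):
    """Turn a GraphQL response dict into a list of flat row dicts."""
    # GraphQL servers report failures under a top-level "errors" key.
    if payload.get("errors"):
        raise ValueError(f"GraphQL errors: {payload['errors']}")
    # "data" may be missing or null, so fall back to an empty list.
    stats = (payload.get("data") or {}).get("playerStatisticsByTournament") or []
    rows = []
    for stat in stats:
        player = stat.get("player") or {}
        team = stat.get("team") or {}
        rows.append({
            "nickName": player.get("nickName"),
            "team": team.get("acronym"),
            "position": stat.get("position"),
            "games": stat.get("games"),
            "kda": stat.get("kda"),
        })
    return rows


# Made-up sample in the assumed response shape (not real tournament data).
sample = json.loads(
    '{"data": {"playerStatisticsByTournament": ['
    '{"player": {"nickName": "PlayerA"}, "team": {"acronym": "AAA"},'
    ' "position": "MID", "games": 10, "kda": 4.5}]}}'
)
rows = flatten_player_stats(sample)
print(rows)
```

In the script above, the same helper would be called as flatten_player_stats(rp.json()) after the request succeeds.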
Some questions about scraping:
1. How do I work out what goes in param?
2. Besides 'User-Agent', which headers are needed, and is a Cookie required?
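A possible starting point for both questions, offered as a sketch and not a verified fix: the three keys in param (operationName, variables, query) are normally copied from the request payload shown in the browser's developer tools (Network tab), and GraphQL endpoints usually expect that payload as a JSON body in a POST request, not as URL query parameters. Besides User-Agent, a Content-Type: application/json header is then the usual minimum. A Cookie tends to matter only when the data sits behind a login, which is not confirmed for this site. The helper build_graphql_request is hypothetical, the query is shortened for readability, and nothing is actually sent.

```python
import json
import urllib.request


def build_graphql_request(url, operation_name, query, variables, user_agent):
    """Build (but do not send) a POST request carrying a GraphQL JSON body."""
    body = json.dumps({
        "operationName": operation_name,
        "variables": variables,  # stays a nested JSON object in the body
        "query": query,
    }).encode("utf-8")
    headers = {
        "Content-Type": "application/json",  # tells the server the body is JSON
        "Accept": "application/json",
        "User-Agent": user_agent,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Shortened query for illustration; the full query from the post works the same way.
short_query = (
    "query ListPlayerStatisticsByTournament($tournamentId: ID!) {"
    " playerStatisticsByTournament(tournamentId: $tournamentId) { playerId kda } }"
)
req = build_graphql_request(
    "https://qwer.gg/general/graphql",
    "ListPlayerStatisticsByTournament",
    short_query,
    {"tournamentId": "969"},
    "Mozilla/5.0",
)
print(req.get_method())
print(req.data.decode("utf-8"))
```

With requests, the equivalent call would be requests.post(url_date, headers=headers, json=param); the json= argument serializes the dict and sets the Content-Type header itself, whereas params= only builds a URL query string and does not send the nested variables dict as JSON.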