1、分析网页结构
目标网站:http://movie.mtime.com/boxoffice/#CN/2019
选择抓取2018年内陆票房信息
点击下一页url链接不变,是通道加载,ajax结构,需要抓包来提取信息。
第一页:http://www.mtime.com/boxoffice/?year=2018&area=china&type=MovieRankingYear&category=all&page=0&display=list×tamp=1587368887785&version=07bb781100018dd58eafc3b35d42686804c6df8d&dataType=json
第二页:http://www.mtime.com/boxoffice/?year=2018&area=china&type=MovieRankingYear&category=all&page=1&display=list×tamp=1587368896027&version=07bb781100018dd58eafc3b35d42686804c6df8d&dataType=json
第三页:http://www.mtime.com/boxoffice/?year=2018&area=china&type=MovieRankingYear&category=all&page=2&display=list×tamp=1587368898757&version=07bb781100018dd58eafc3b35d42686804c6df8d&dataType=json
最后页:http://www.mtime.com/boxoffice/?year=2018&area=china&type=MovieRankingYear&category=all&page=9&display=list×tamp=1587369336977&version=07bb781100018dd58eafc3b35d42686804c6df8d&dataType=json
2、请求服务器
此时设置User-Agent请求,由于网站有反爬虫,所以请求服务器失败,因此需要再利用cookie值请求。
import requests,time,csv#导入包
from lxml import etree#树状筛选
#第一页url链接
url="http://www.mtime.com/boxoffice/?year=2018&area=china&type=MovieRankingYear&category=all&page=0&display=list×tamp=1587368887785&version=07bb781100018dd58eafc3b35d42686804c6df8d&dataType=json"
xheaders={
'Cookie':'userId=0; defaultCity=%25E5%258C%2597%25E4%25BA%25AC%257C290; _userCode_=20204201539583959; _userIdentity_=20204201539584524; __utmc=221034756; _tt_=114FE2FE246906E7E72C255389769B90; DefaultCity-CookieKey=290; __utmc=196937584; __utmz=196937584.1587368454.1.1.utmcsr=movie.mtime.com|utmccn=(referral)|utmcmd=referral|utmcct=/; maxShowNewbie=2; __utma=196937584.1241856532.1587368454.1587383179.1587386571.3; __utma=221034756.1078113473.1587368399.1587368399.1587389201.2; __utmz=221034756.1587389201.2.2.utmcsr=localhost:8889|utmccn=(referral)|utmcmd=referral|utmcct=/notebooks/%E7%88%AC%E7%94%B5%E5%BD%B1%E4%BF%A1%E6%81%AF/%E7%88%AC%E7%94%B5%E5%BD%B1%E4%BF%A1%E6%81%AF.ipynb; Hm_lvt_6dd1e3b818c756974fb222f0eae5512e=1587368401,1587389201,1587389214; _ydclearance=da5f6a298f9e5bb95822db94-eae8-4307-90d1-c49a99893d02-1587397694; Hm_lpvt_6dd1e3b818c756974fb222f0eae5512e=1587390503; __utmt=1; __utmt_~1=1; __utmb=221034756.6.10.1587389201',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.92 Safari/537.36'
}
response=requests.get(url=url,headers=xheaders)
返回结果为200,说明请求服务器成功
3、解析网页,定位节点
利用response.json()解析html
response.json()["html"]
根据HTML标签定位到自己需要爬取的内容,再分析定位节点规律,利用for循环爬取整页中的信息。
html_et