Goal: the site http://www.5er0.com/ lets you search for information and download links related to movies and TV series. Given a video name, we want to crawl its download links.
First, open the site's home page:
There is a search box. Entering 战狼 (Wolf Warrior) and clicking search gives the following results:
Since we are writing a crawler, we need to see how the data is actually requested. Back on the home page, inspect the search box element:
It turns out to be a POST form.
Now look at the search results page; its URL is:
http://www.5er0.com/search.php?mod=portal&searchid=822100&searchsubmit=yes&kw=%D5%BD%C0%C7
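As an aside, kw=%D5%BD%C0%C7 is simply 战狼 encoded as GB2312 bytes and then percent-encoded, which is easy to verify in Python:

```python
import urllib.parse

# 战狼 as GB2312 bytes, percent-encoded
kw = urllib.parse.quote('战狼'.encode('gb2312'))
print(kw)  # %D5%BD%C0%C7
```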
The other parameters are easy to understand, but what on earth is this searchid?
(Skipping countless rounds of experimentation, here is the conclusion.)
searchid is the ordinal number of this search across the entire site (all users' searches combined). If there have already been 100 searches, your request must pass 101. What happens if we don't pass 101?
1. Pass a number larger than 101:
The page says the search does not exist.
2. Pass a number smaller than 101:
Say we pass 99: we get the results of the 99th search (e.g. 大话西游, if that is what was searched), no matter what our keyword kw is.
Clearly, then, obtaining searchid is the key. Since my front-end skills are poor and I'm not familiar with the JavaScript machinery involved, I turned to the packet-capture tool Fiddler 4 to study where searchid comes from.
First, start Fiddler and open the site's home page:
Enter 战狼:
searchid appears in the second URL, so it must have been obtained in the first request.
Let's look at the message for the first URL:
In text form:
POST http://www.5er0.com/search.php?searchsubmit=yes HTTP/1.1
Host: www.5er0.com
Connection: keep-alive
Content-Length: 121
Cache-Control: max-age=0
Origin: http://www.5er0.com
Upgrade-Insecure-Requests: 1
Content-Type: application/x-www-form-urlencoded
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Referer: http://www.5er0.com/
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.8
Cookie: IJKY_2132_saltkey=WB3xlz0N; IJKY_2132_lastvisit=1509431731; UM_distinctid=15f715b6341816-05f143994a3908-464a0129-1fa400-15f715b6342a79; IJKY_2132_lastact=1509959135%09portal.php%09view; IJKY_2132_sid=Ub2EJ2; CNZZDATA1264763675=1237924191-1509431778-%7C1509957167
mod=search&formhash=f62b70ea&srchtype=title&srhfid=0&srhlocality=portal%3A%3Aindex&srchtxt=%D5%BD%C0%C7&searchsubmit=true
So the request URL is http://www.5er0.com/search.php?searchsubmit=yes, with body parameters mod=search&formhash=f62b70ea&srchtype=title&srhfid=0&srhlocality=portal%3A%3Aindex&srchtxt=%D5%BD%C0%C7&searchsubmit=true. I then tried pasting
http://www.5er0.com/search.php?searchsubmit=yes&mod=search&formhash=f62b70ea&srchtype=title&srhfid=0&srhlocality=portal%3A%3Aindex&srchtxt=%D5%BD%C0%C7&searchsubmit=true
into the browser, and it returned the expected search results. That sidesteps the searchid problem entirely.
The rest is straightforward. On to the code:
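That trick is easy to automate. Here is a minimal sketch that builds the merged bypass URL for an arbitrary keyword (it assumes the captured formhash value keeps being accepted as a constant):

```python
import urllib.parse

# Template for the merged search URL; only srchtxt varies per query.
# Assumes the captured formhash stays valid.
SEARCH_URL = ('http://www.5er0.com/search.php?mod=search&formhash=f62b70ea'
              '&srchtype=title&srhfid=0&srhlocality=portal::index&%s&searchsubmit=true')

def build_search_url(keyword):
    # The site expects the keyword as GB2312 bytes, percent-encoded
    qs = urllib.parse.urlencode({'srchtxt': keyword.encode('gb2312')})
    return SEARCH_URL % qs

print(build_search_url('战狼'))
```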
import re
import time
import random
import urllib.parse
import urllib.request

from bs4 import BeautifulSoup


class LongbuluoCrawler:

    # formhash seems to be accepted as a fixed value; searchid is bypassed entirely
    SEARCH_URL = u'http://www.5er0.com/search.php?mod=search&formhash=f62b70ea&srcht' \
                 u'ype=title&srhfid=0&srhlocality=portal::index&%s&searchsubmit=true'
    BASE_URL = u'http://www.5er0.com/%s'
    PAGE_URL = u'http://www.5er0.com/search.php?searchid=%s&searchsubmit=yes&page=%d'

    def _getHtml(self, url):
        # Minimal fetch helper: returns the raw response body as bytes
        req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        return urllib.request.urlopen(req).read()

    def _crawlSiteVideo(self, title):
        # The site expects the keyword as GB2312 bytes
        data = {'srchtxt': title.encode('gb2312')}
        url = self.SEARCH_URL % urllib.parse.urlencode(data)
        video_list = []
        url_list = []
        try:
            html = self._getHtml(url).decode('gbk')
            soup = BeautifulSoup(html, "html.parser")
            pg = soup.find_all('div', class_='pg')
            url_list.extend(self._getUrlList(url))
            if pg:  # more than one result page
                pg = str(pg[0])
                page_count = int(re.findall('共.+?([0-9]+).+?页', pg)[0])
                searchid = re.findall('searchid=([0-9]+)&', pg)[0]
                for i in range(2, page_count + 1):
                    time.sleep(random.randint(2, 6))  # throttle requests
                    url_list.extend(self._getUrlList(self.PAGE_URL % (searchid, i)))
            for url in url_list:
                time.sleep(random.randint(2, 6))
                video = self._getVideoInfo(url)
                if video:
                    video_list.append(video)
        except Exception:
            pass
        return video_list

    def _getUrlList(self, url):
        url_list = []
        try:
            soup = BeautifulSoup(self._getHtml(url).decode('gbk'), "html.parser")
            for ele in soup.find_all('h3', class_='xs3'):
                ele = str(ele)
                title = re.findall('target="_blank">(.+?)</a>', ele)[0]
                # strip the highlight markup the site wraps around matched keywords
                title = title.replace('<strong><font color="#ff0000">', "")
                title = title.replace('</font></strong>', "")
                if self._isVideo(title):
                    url_list.append(self.BASE_URL % re.findall('href="(.+?)" target', ele)[0])
        except Exception:
            pass
        return url_list

    def _isVideo(self, title):
        # Keep only results whose title suggests an actual video/download page
        keywords = (u'高清', u'下载', u'百度', u'迅雷', u'网盘', u'在线', u'观看')
        return any(kw in title for kw in keywords)

    def _getVideoInfo(self, url):
        try:
            html = self._getHtml(url).decode('gbk')
            title = re.findall('<title>(.+?)</title>', html)[0]
            soup = BeautifulSoup(html, "html.parser")
            xg1 = soup.find_all('p', class_='xg1')
            author = re.findall('html">(.+?)</a>', str(xg1[0]))
            article_content = str(soup.find_all('td', id='article_content')[0])
            duration = self._time_transfer(re.findall('([0-9]+)分钟', article_content))
            return {"title": title, "url": url,
                    "author": author[0] if author else None,
                    "duration": duration}
        except Exception:
            return None

    def _time_transfer(self, minutes):
        # Turn a "NN分钟" match into an H:MM:SS-style string
        try:
            minutes = int(minutes[0])
            if minutes > 60:
                return str(minutes // 60) + ":" + str(minutes % 60) + ":00"
            return "00:" + str(minutes) + ":00"
        except (IndexError, ValueError):
            return None
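For reference, the pagination parsing in _crawlSiteVideo can be exercised against a made-up div.pg fragment (the markup below is hypothetical, modeled on the site's pager; the regexes are the ones used above):

```python
import re

# Hypothetical pager markup in the style of the site's div.pg block
pg = ('<div class="pg">'
      '<a href="search.php?searchid=822100&searchsubmit=yes&page=2">2</a>'
      '<label>共 3 页</label></div>')

page_count = int(re.findall('共.+?([0-9]+).+?页', pg)[0])  # total page count
searchid = re.findall('searchid=([0-9]+)&', pg)[0]         # search session id
print(page_count, searchid)
```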