This exercise is about scraping a dynamically loaded page. Since I'm personally fond of horror films, I picked Douban's horror category as the target.
The page loads content dynamically: clicking "load more" brings in additional entries. So open the browser's developer tools with F12, go to the Network tab, filter by XHR, and watch what gets requested.
Looking at the Request URL under Headers reveals the endpoint that returns the data. After clicking "load more" a few times, the URLs are:
https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start=0
https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start=20
https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start=40
Only the trailing start value changes, growing by 20 each time; 20 is the number of movies loaded per request, which you can confirm by counting. That pins down the URL pattern to scrape.
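In code, generating these page URLs is just string formatting over start. A minimal sketch (the base URL is copied verbatim from above):

base = ('https://movie.douban.com/j/new_search_subjects'
        '?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start={0}')
for page in range(3):  # the first three "load more" batches
    print(base.format(page * 20))  # start = 0, 20, 40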
Next, switch to the Preview tab and inspect the returned JSON to decide which fields to scrape, e.g. id, title, rate.
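A quick sketch of pulling one page and reading those fields, assuming the response keeps the shape shown in Preview (a top-level 'data' list of movie dicts):

import requests

url = ('https://movie.douban.com/j/new_search_subjects'
       '?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start=0')
headers = {'User-Agent': 'Mozilla/5.0'}  # any plausible browser UA string
js = requests.get(url, headers=headers, timeout=6).json()
for movie in js['data']:
    print(movie['id'], movie['title'], movie['rate'])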
But I found these fields weren't enough: I also wanted each film's release year, director, cast, and runtime. Not being able to get them kept this crawler unfinished, until one day, while browsing another movie category on Douban, I noticed that hovering the cursor over certain movie posters pops up much more detailed information, as shown in the image below. Opening F12 and checking XHR again: sure enough, this data is also loaded asynchronously as JSON, and it is far richer.
As the figure shows, this endpoint needs the movie's id to return the corresponding data, and getting the id is trivial: just parse the JSON fetched earlier. The remaining problems are how to crawl fast and how to avoid getting the IP banned. Using the multiprocessing module from the previous exercise, a pool of worker processes takes care of speed. For rotating IPs, we can first write a small scraper for Kuaidaili's (快代理) free proxy list and feed the collected addresses into the Douban crawler.
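Before the full script, one detail worth spelling out: requests accepts proxies as a dict keyed by the lowercase URL scheme, so rotating IPs just means swapping that dict between calls. A minimal sketch (the address is made up for illustration):

import requests

proxies = {'http': 'http://123.45.67.89:8080',   # fictional address,
           'https': 'http://123.45.67.89:8080'}  # for illustration only
# requests picks the entry matching the target URL's scheme
r = requests.get('https://movie.douban.com/', proxies=proxies, timeout=6)
print(r.status_code)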
Here is the code:
import requests
from multiprocessing import Pool
import pymysql
import random
def get_ip():  # read a random proxy from the pool file and return it as a requests-style dict
    with open('ip池.txt', 'r') as r:
        datas = r.readlines()
    data = random.choice(datas)  # pick a random line, whatever the pool size
    addr = data.strip().split(':', 1)[1].strip("'")  # e.g. http://1.2.3.4:8080
    # register the address under both schemes: requests looks proxies up
    # by the lowercase scheme of the target URL
    return {'http': addr, 'https': addr}
def get_id(page):  # fetch one page of the list API and store every movie on it
    db = pymysql.connect(host='localhost', user='your_username', password='your_password',
                         db='your_database', charset='utf8')  # store results with pymysql
    cursor = db.cursor()
    url = 'https://movie.douban.com/j/new_search_subjects?sort=T&range=0,10&tags=%E7%94%B5%E5%BD%B1,%E6%81%90%E6%80%96&start={0}'.format(page * 20)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
    try:  # guard against banned IPs and hung connections
        proxies = get_ip()
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    except requests.exceptions.ConnectTimeout:
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()  # retry once with the same proxy
    except:
        proxies = get_ip()  # anything else: draw a fresh proxy and retry
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    try:
        for i in range(20):
            data = js['data'][i]
            title = data['title']
            movie_id = data['id']  # don't shadow the built-in id()
            score, year, duration, directors, actors = get_detail(movie_id)
            print('Storing {0}, via proxy {1}'.format(title, proxies['http']))  # show progress as we go
            sql = 'insert into horror values (%s,%s,%s,%s,%s,%s)'
            cursor.execute(sql, (str(title), score, year, duration, directors, actors))
            db.commit()
    except IndexError:
        print('no more entries')  # the last page holds fewer than 20 movies
    db.close()
def get_detail(movie_id):  # fetch the richer per-movie JSON by id
    url = 'https://movie.douban.com/j/subject_abstract?subject_id={0}'.format(movie_id)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36'}
    try:
        proxies = get_ip()
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    except requests.exceptions.ConnectTimeout:
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    except:
        proxies = get_ip()
        js = requests.get(url, headers=headers, proxies=proxies, timeout=6).json()
    data = js['subject']
    score = data['rate']
    year = data.get('release_year', 'unknown')  # some entries carry no release year
    duration = data['duration']
    directors = data['directors'][0] if data.get('directors') else 'unknown'
    actors = data.get('actors', [])[:4]  # keep at most the first four actors
    return str(score), year, duration, directors, ', '.join(actors)
def main(i):
    get_id(i)

if __name__ == '__main__':
    pages = list(range(550))  # 550 pages * 20 movies each
    pool = Pool(8)  # 8 worker processes crawl in parallel
    pool.map(main, pages)  # map must run before close()/join()
    pool.close()
    pool.join()
Everything ends up in the database:
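The INSERT above assumes a six-column horror table already exists. Its schema isn't shown here, but a plausible one matching the column order of the INSERT could look like this (the names and types are my own guess):

import pymysql

db = pymysql.connect(host='localhost', user='your_username', password='your_password',
                     db='your_database', charset='utf8')
cursor = db.cursor()
# columns follow the INSERT order: title, score, year, duration, directors, actors
cursor.execute('''create table if not exists horror (
    title varchar(100),
    score varchar(10),
    year varchar(10),
    duration varchar(20),
    directors varchar(100),
    actors varchar(255)
)''')
db.close()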
To round things off, here is the scraper for the free proxy IPs (it needs a sleep between pages; going too fast misses a lot of addresses, as the site seems to throttle crawlers):
import requests
from lxml import etree
import time
def get_ip(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
    res = requests.get(url, headers=headers).text
    s = etree.HTML(res)
    for i in range(1, 16):  # 15 proxies per page
        ip = s.xpath('//*[@id="list"]/table/tbody/tr[{0}]/td[1]/text()'.format(i))
        port = s.xpath('//*[@id="list"]/table/tbody/tr[{0}]/td[2]/text()'.format(i))
        htt = s.xpath('//*[@id="list"]/table/tbody/tr[{0}]/td[4]/text()'.format(i))
        proxies = {htt[0].lower(): '{0}://{1}:{2}'.format(htt[0].lower(), ip[0], port[0])}
        url2 = 'https://www.baidu.com/'  # probe URL for testing whether the proxy works
        try:
            ress = requests.get(url2, proxies=proxies, timeout=6)
        except requests.exceptions.RequestException:  # a dead proxy raises instead of returning a status
            print('{0} unusable'.format(ip[0]))
            continue
        if ress.status_code == 200:
            print('ok, storing {0}'.format(ip[0]))
            with open('ip池.txt', 'a') as f:
                f.write('\'{0}\':\'http://{1}:{2}\'\n'.format(htt[0], ip[0], port[0]))
        else:
            print('{0} unusable'.format(ip[0]))
def main():
    open('ip池.txt', 'w').close()  # opening in 'w' mode truncates the file, refreshing the pool
    for i in range(1, 11):
        url = 'https://www.kuaidaili.com/free/inha/{0}'.format(i)
        get_ip(url)
        time.sleep(1)  # be polite, or the site starts dropping requests

if __name__ == '__main__':
    main()
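One caveat: a proxy that can reach Baidu is not guaranteed to reach Douban, so before a long crawl it can be worth re-testing the saved pool against the actual target. A sketch, assuming the 'SCHEME':'address' line format written by get_ip above:

import requests

def check_pool(path='ip池.txt'):  # re-test every saved proxy against the real target
    with open(path, 'r') as f:
        for line in f:
            addr = line.strip().split(':', 1)[1].strip("'")  # e.g. http://1.2.3.4:8080
            proxies = {'http': addr, 'https': addr}
            try:
                r = requests.get('https://movie.douban.com/', proxies=proxies, timeout=6)
                print(addr, 'status', r.status_code)
            except requests.exceptions.RequestException as e:
                print(addr, 'failed:', type(e).__name__)

if __name__ == '__main__':
    check_pool()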