Scraping movie posters with Python: downloading the original Douban Top250 posters

This article shows how to scrape the full-resolution original posters of Douban's Top 250 movies with Python. Douban filters requests for original images strictly: each request must carry a specific Referer. By analyzing the request headers and constructing complete requests, we can fetch the images. The code uses the urllib2 and pyquery modules to scrape movie information page by page and download the high-resolution posters.

The full source code is included below.

Requirements analysis

There are plenty of Douban movie-ranking scrapers online, but I was building an HTML5 gallery that needed a large amount of movie material, and the images on the list and detail pages weren't sharp enough, so I decided to scrape Douban's original posters. Viewing an original appeared to require logging in, yet after logging in the request for the original image carried no cookie at all, so I tried constructing the image URLs directly. Those requests came back as 302 redirects. Re-examining the request headers, I noticed a Referer field and tried Douban's homepage as the Referer; no luck. It turned out that each original image must be requested with the page of that image's own medium-size thumbnail as the Referer. Douban clearly does fairly strict filtering on the server side. I've set up image hotlink protection on nginx myself, but never anything this sneaky; well, if they're going to be sneaky, we can't be shy either: construct the request headers one by one and scrape the originals with Python.
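To make the Referer trick concrete, here is a minimal sketch of building such a request. It uses Python 3's urllib.request rather than the urllib2 used in the full script below; the URL prefixes come from the article, while the photo id is a made-up example:

```python
import urllib.request

# URL prefixes taken from the article; the photo id used below is hypothetical
RAW_PREFIX = 'https://img3.doubanio.com/view/photo/raw/public/'
PHOTO_PAGE = 'https://movie.douban.com/photos/photo/'

def build_image_request(photo_id):
    """Build a request for the raw image, carrying that photo's own page as Referer."""
    url = RAW_PREFIX + 'p' + photo_id + '.jpg'
    return urllib.request.Request(url, headers={
        'User-Agent': 'Mozilla/5.0',
        # the per-photo page, not the Douban homepage, is what gets past the filter
        'Referer': PHOTO_PAGE + photo_id + '/',
    })

req = build_image_request('2561716440')
print(req.full_url)
print(req.get_header('Referer'))
```

Passing this request to `urllib.request.urlopen` would then fetch the raw image, assuming Douban's filtering still works the way the article describes.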

Required modules

urllib2 for fetching page content

pyquery for parsing the pages (a module that lets you query HTML with jQuery-style selectors)

The code follows. (It's been a while since I wrote Python, so I'm a little rusty ฅʕ•̫͡•ʔฅ. The images total roughly 200 MB; scraping the list pages is fast, and the bulk of the time is the final I/O of downloading, so overall speed depends on your connection.)

```python
#coding:utf-8
import urllib2
import re
import sys
from pyquery import PyQuery as pq

# list pages: http://movie.douban.com/top250?start=0&filter=&type=
class Douban:

    def __init__(self):
        reload(sys)
        sys.setdefaultencoding('utf-8')
        self.start = 0
        self.param = '&filter=&type='
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.113 Safari/537.36'}
        self.movieList = []
        self.filePath = './img/'
        # raw (original-size) image URL prefix, and the photo page used as Referer
        self.imgpath = 'https://img3.doubanio.com/view/photo/raw/public/'
        self.refer = 'https://movie.douban.com/photos/photo/'

    def getPage(self):
        print '---getPage start---'
        try:
            URL = 'http://movie.douban.com/top250?start=' + str(self.start)
            request = urllib2.Request(url=URL, headers=self.headers)
            response = urllib2.urlopen(request)
            page = response.read().decode('utf-8')
            pageNum = (self.start + 25) / 25
            print 'scraping page ' + str(pageNum) + '...'
            self.start += 25
            return page
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print 'failed, reason:', e.reason

    def htmlparse(self):
        print '---getMovie start---'
        while self.start < 250:
            page = self.getPage()
            html = pq(page)
            movies = html(".grid_view>li")
            for item in movies:
                item = pq(item)
                info = {}
                info['name'] = item(".hd>a").text()
                info['des'] = item(".bd p:first").text()
                info['img'] = item(".pic img").attr('src')
                # p<id>.jpg in the thumbnail URL gives both the raw
                # filename and the photo id needed for the Referer page
                group = re.findall(r'\/(p(\d+)\.jpg)', info['img'])
                info['img'] = self.imgpath + group[0][0]
                info['refer'] = self.refer + group[0][1] + '/'
                self.movieList.append([info['name'], info['des'], info['img'], info['refer']])

    def hook(self):
        mfile = open(self.filePath + 'movielist.txt', 'w')
        try:
            for index, movie in enumerate(self.movieList):
                print movie[0].encode('gbk', 'ignore')
                self.downImg(movie[2], movie[3], self.filePath + 'movie' + str(index + 1) + '.jpg')
                mfile.write(str(index + 1) + '、' + movie[0] + '\n' + movie[1] + '\n')
            print 'write done'
        finally:
            mfile.close()

    def downImg(self, URL, refer, imgpath):
        head = dict(self.headers)
        head['Referer'] = refer  # the per-photo Referer is what gets past the filter
        request = urllib2.Request(url=URL, headers=head)
        try:
            res = urllib2.urlopen(request).read()
            f = open(imgpath, 'wb')
            f.write(res)
            f.close()
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print 'failed, reason:', e.reason

    def main(self):
        print '---main start---'
        self.htmlparse()
        print len(self.movieList)
        self.hook()

DB = Douban()
DB.main()
```
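The id-extraction step in `htmlparse` can be exercised on its own. The Python 3 sketch below shows how one list-page thumbnail URL yields both the raw-image URL and the matching Referer; the thumbnail URL here is a hypothetical example in the shape the list page serves:

```python
import re

RAW_PREFIX = 'https://img3.doubanio.com/view/photo/raw/public/'
PHOTO_PAGE = 'https://movie.douban.com/photos/photo/'

def raw_url_and_referer(thumb_url):
    # The thumbnail filename p<id>.jpg carries the raw filename and the photo id
    m = re.search(r'/(p(\d+)\.jpg)', thumb_url)
    if m is None:
        return None
    return RAW_PREFIX + m.group(1), PHOTO_PAGE + m.group(2) + '/'

# hypothetical thumbnail URL, shaped like the src attribute on the list page
img, ref = raw_url_and_referer(
    'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p457760035.jpg')
print(img)
print(ref)
```

If the URL does not contain a `p<id>.jpg` filename, the helper returns `None` rather than raising, so a caller can simply skip entries it cannot map.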

>Sample run

![Screenshot from Jianshu App](http://upload-images.jianshu.io/upload_images/3079704-08647100c5f3059e.jpg)![Screenshot from Jianshu App](http://upload-images.jianshu.io/upload_images/3079704-4efdd343c37adcee.jpg)![Screenshot from Jianshu App](http://upload-images.jianshu.io/upload_images/3079704-ea57849490623d6b.jpg)![Screenshot from Jianshu App](http://upload-images.jianshu.io/upload_images/3079704-a6a4050dddbcfab8.jpg)
