Python Crawler | Scraping Dynamically Loaded Websites

Scraping dynamically loaded images


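This article describes a way to scrape websites that load their images dynamically via AJAX. Using the Duowan gallery (多玩图库) as an example, it shows how to locate the real image addresses and automate the download with a Python script.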

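The previous post covered how to scrape a static website: https://www.jianshu.com/p/bbf4386f7527. While scraping, you may find that some sites do not put their content in the HTML at all and instead load it dynamically via AJAX.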

For example: http://tu.duowan.com/gallery/138916.html#p1

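Open the page in a browser and the address of the original image is easy to find. So we excitedly request the same page with a crawler, only to find no address at all: the element is simply empty. Baffling. Compare the two screenshots below.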

The browser's F12 view

The HTML the crawler received

Clearly, the HTML we requested contains no image address, while the browser's copy does.
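One way to confirm this programmatically is to collect the `src` of every `<img>` tag in the HTML the crawler received. The sketch below uses the standard library's `html.parser`. The names `ImgSrcCollector` and `find_image_sources` and the two sample HTML strings are illustrative assumptions, not the actual markup of the Duowan page.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the non-empty src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_image_sources(html):
    """Return the list of image URLs found in an HTML string."""
    collector = ImgSrcCollector()
    collector.feed(html)
    return collector.sources


# Hypothetical samples: the first mimics a page whose images are filled in
# later by JavaScript, the second mimics the DOM after the browser has run it.
raw_html = '<div class="gallery"><img src=""></div>'
rendered_html = '<div class="gallery"><img src="http://example.com/a.jpg"></div>'

print(find_image_sources(raw_html))
print(find_image_sources(rendered_html))
```

An empty result for the crawler's HTML, next to a non-empty one for what the browser shows, is a strong hint that the images arrive through a separate request.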

This is because the site uses AJAX, i.e. XMLHttpRequest, to load the data after the page itself has loaded. So how do we find the real address?

XHR

From here (the XHR entries in the developer tools) we can find the address of the XHR request: http://tu.duowan.com/index.php?r=show/getByGallery/&gid=138916&_=1558600256687. Requesting that link returns JSON:

Where the addresses really live
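The shape of that request and its response can be explored offline with a small sketch. It assumes the `_` parameter is a millisecond timestamp (it has 13 digits in the URL above) and that the JSON carries a `picInfo` list whose entries have `source` and `add_intro` fields, which are the fields the article's script reads. The helper names and the sample payload are hypothetical.

```python
import json
import time
from urllib.parse import urlencode


def build_gallery_url(gid):
    """Build the XHR URL for one gallery id with a millisecond timestamp."""
    # Assumption: "_" is only a cache-busting timestamp in milliseconds.
    query = urlencode({"gid": gid, "_": int(time.time() * 1000)})
    return "http://tu.duowan.com/index.php?r=show/getByGallery/&" + query


def extract_pictures(payload):
    """Return (source, add_intro) pairs from the JSON text of a response."""
    data = json.loads(payload)
    return [(info["source"], info["add_intro"]) for info in data.get("picInfo", [])]


# Hypothetical payload: only the field names come from the real response.
sample = (
    '{"picInfo": ['
    '{"source": "http://example.com/1.jpg", "add_intro": "first"},'
    '{"source": "http://example.com/2.gif", "add_intro": "second"}'
    ']}'
)

print(build_gallery_url(138916))
for source, intro in extract_pictures(sample):
    print(intro, source)
```

Keeping the URL building and the JSON parsing in separate functions means the parsing can be checked against a saved response without touching the network.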

That makes things easy. Now that we have found the real addresses, we can apply what we learned last time.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import json
import os
import time
import urllib.request


class Pic:
    """One image to download: its URL, its caption, and the local target path."""

    def __init__(self, url, desc, path):
        self.url = url
        self.desc = desc
        self.path = path


# Root directory for the downloads; change it to suit your machine.
local = "/Users/y/PythonWorkSpace/DUOWAN/"


def test():
    # 20000-20700
    # Gallery ids to crawl; end_index itself is excluded.
    start_index = 137882
    end_index = 138930
    for i in range(start_index, end_index):
        download_pic(i)


def download_pic(index):
    # The "_" parameter carries a timestamp, as in the request the browser sends.
    curr_time = str(time.time()).replace(".", "0")
    url = "http://tu.duowan.com/index.php?r=show/getByGallery/&gid=%d&_=%s" % (index, curr_time)
    print("Starting task %s" % url)

    # Request accepts url, data and headers. With no data to send,
    # the headers are added one at a time with add_header().
    request = urllib.request.Request(url)
    request.add_header(
        "User-Agent",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36")
    request.add_header("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7,fr;q=0.6")
    request.add_header(
        "Accept",
        "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3")

    response = urllib.request.urlopen(request)
    if response.status != 200:
        return
    html = response.read()
    if html.strip() == b"":
        return
    data = json.loads(html.decode("GBK"))

    pic_list = []
    pic_info = data["picInfo"]
    current_dir = local + str(index) + "/"
    for info in pic_info:
        source = info["source"]
        desc = info["add_intro"]
        if source.endswith("gif"):
            suffix = ".gif"
        elif source.endswith("jpg"):
            suffix = ".jpg"
        else:
            # Skip formats other than GIF and JPG.
            continue
        path = current_dir + desc + suffix
        pic_list.append(Pic(source, desc, path))

    for pic in pic_list:
        if not os.path.exists(current_dir):
            os.makedirs(current_dir)
        print("------------- downloading ---------------", pic.url, pic.path)
        urllib.request.urlretrieve(pic.url, pic.path)
        print("Taking a 3 s break")
        time.sleep(3)


if __name__ == "__main__":
    test()
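The script uses each picture's `add_intro` caption directly as a file name, which can fail when a caption contains a slash or other characters the file system rejects. A possible hardening step, offered only as a sketch, is to sanitize the caption and take the extension from the URL path. The function name `safe_filename`, the `fallback` parameter, and the set of replaced characters are assumptions, not part of the original script.

```python
import os
import re
from urllib.parse import urlparse


def safe_filename(desc, source, fallback="untitled"):
    """Build a file name from a caption and the extension of its image URL."""
    # Take the extension from the URL path, ignoring any query string.
    ext = os.path.splitext(urlparse(source).path)[1].lower()
    # Replace runs of whitespace and characters that common file systems reject.
    name = re.sub(r'[\\/:*?"<>|\s]+', "_", desc).strip("_")
    return (name or fallback) + ext


# Hypothetical captions and URLs, for illustration only.
print(safe_filename("a/b: c?", "http://example.com/x/1.JPG?size=big"))
print(safe_filename("", "http://example.com/2.gif"))
```

In `download_pic`, the result could stand in for `desc + suffix` when building `path`, although captions that repeat within a gallery would still overwrite each other.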

All done, time to call it a day!
