1. Asynchronous-Loading Crawlers
A static-page crawler can easily fetch a site's content, but a static page has to load all of the site's data in full, which puts heavy pressure on access and bandwidth. Sites facing high concurrency and large traffic volumes therefore use AJAX-related techniques for asynchronous loading, fetching data only as it is needed. Take the pexels site as an example: press F12, switch to the XHR tab under Network, and scroll down the page. Data loads incrementally, and the XHR panel gradually accumulates the requested URLs. Clicking one of them reveals an address similar to https://www.pexels.com/search/book/?page=3&seed=2018-02-22+05:21:39++0000&format=js&seed=2018-02-22 05:21:39 +0000. Simplifying it to https://www.pexels.com/search/book/?page=3 and changing the value after page shows that different pages of content can still be reached; this is how the list of URLs to crawl is constructed.
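The URL-construction step described above can be sketched as a small helper. This is an illustrative sketch (the helper name `build_page_urls` is mine; the URL pattern is the one discovered in the XHR tab):

```python
# Build the paginated URLs discovered in the Network/XHR tab.
# The seed/format query parameters can be dropped; only `page` matters.
BASE = 'https://www.pexels.com/search/book/?page={}'

def build_page_urls(start, end):
    """Return the list of search-result page URLs from start to end inclusive."""
    return [BASE.format(i) for i in range(start, end + 1)]

print(build_page_urls(1, 3))
```

Each returned URL points at one incrementally loaded batch of results, so the crawler can walk them in order instead of loading the whole site at once.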
2. Code Listing
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
#author: HappyLau
#blog: https://www.cnblogs.com/cloudlab

import os
import sys
import time
import random

import requests
from lxml import etree

reload(sys)
sys.setdefaultencoding('utf8')


def get_jianshu(url):
    '''
    Fetch the HTML of the given URL (helper name kept from an earlier jianshu demo)
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            return req.text.encode('utf8')
        else:
            return ''
    except Exception as e:
        print e


def get_picture(url, download_dir):
    '''
    @url: page whose images will be extracted
    @download_dir: local directory the images are saved to
    '''
    if not os.path.exists(download_dir):
        os.mkdir(download_dir)
    html = get_jianshu(url)
    selector = etree.HTML(html)
    for src in selector.xpath('//img[@class="photo-item__img"]/@src'):
        picture_name = src.split("?")[0].split("/")[-1]
        print "downloading picture %s" % (picture_name)
        with open(os.path.join(download_dir, picture_name), 'wb') as f:
            f.write(requests.get(src).content)
        time.sleep(random.randint(1, 3))


if __name__ == "__main__":
    url_lists = ['https://www.pexels.com/search/book/?page={}'.format(i) for i in range(1, 21)]
    for url in url_lists:
        get_picture(url, '/root/pexels')
```
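The XPath extraction at the heart of `get_picture` can be exercised on a small inline fragment without hitting the network. The HTML below is a made-up stand-in that mimics the `photo-item__img` structure the crawler targets:

```python
from lxml import etree

# A minimal HTML fragment mimicking the markup the crawler parses;
# the class name photo-item__img matches the XPath used in the listing.
html = '''
<div>
  <img class="photo-item__img" src="https://images.pexels.com/photos/1/book.jpeg?h=350"/>
  <img class="photo-item__img" src="https://images.pexels.com/photos/2/pages.jpeg?h=350"/>
</div>
'''

selector = etree.HTML(html)
srcs = selector.xpath('//img[@class="photo-item__img"]/@src')
# Derive the local file name the same way the crawler does:
# strip the query string, then take the last path component.
names = [s.split("?")[0].split("/")[-1] for s in srcs]
print(names)  # ['book.jpeg', 'pages.jpeg']
```

Testing the selector this way is useful because site markup changes are the most common reason a crawler like this silently stops finding images.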
3. Alternative Way to Download the Images
The code above downloads images via requests.get().content. The same can be achieved with urllib.urlretrieve(), whose signature is urlretrieve(url, filename=None, reporthook=None, data=None): url is the address to fetch and filename is the local path where the image is stored. The modified code is as follows:
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
#author: HappyLau
#blog: https://www.cnblogs.com/cloudlab

import os
import sys
import time
import random
import urllib

import requests
from lxml import etree

reload(sys)
sys.setdefaultencoding('utf8')


def get_jianshu(url):
    '''
    Fetch the HTML of the given URL (helper name kept from an earlier jianshu demo)
    '''
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    try:
        req = requests.get(url, headers=headers)
        if req.status_code == 200:
            return req.text.encode('utf8')
        else:
            return ''
    except Exception as e:
        print e


def get_picture(url, download_dir):
    '''
    @url: page whose images will be extracted
    @download_dir: local directory the images are saved to
    Downloads each image with urllib.urlretrieve() instead of requests
    '''
    if not os.path.exists(download_dir):
        os.mkdir(download_dir)
    html = get_jianshu(url)
    selector = etree.HTML(html)
    for src in selector.xpath('//img[@class="photo-item__img"]/@src'):
        picture_name = os.path.join(download_dir, src.split("?")[0].split("/")[-1])
        print "downloading picture %s" % (picture_name)
        urllib.urlretrieve(src, picture_name)   # download the image
        time.sleep(random.randint(1, 3))


if __name__ == "__main__":
    url_lists = ['https://www.pexels.com/search/book/?page={}'.format(i) for i in range(1, 21)]
    for url in url_lists:
        get_picture(url, '/root/pexels')
```
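Note that in Python 3 this function lives at urllib.request.urlretrieve. The reporthook parameter mentioned in the signature accepts a callback invoked after each block is transferred, which is handy for progress reporting. A sketch (the helper names `progress_percent` and `report` are mine):

```python
import urllib.request  # in Python 2 this was simply `import urllib`

def progress_percent(block_num, block_size, total_size):
    """Percentage downloaded so far, or -1 when the server sent no Content-Length."""
    if total_size <= 0:
        return -1
    return min(100, block_num * block_size * 100 // total_size)

def report(block_num, block_size, total_size):
    """reporthook callback: called by urlretrieve after each block is received."""
    print("downloaded %d%%" % progress_percent(block_num, block_size, total_size))

# Usage (requires network access):
# urllib.request.urlretrieve(image_url, local_path, reporthook=report)
```

The percentage is capped at 100 because the final block is usually only partially filled, so block_num * block_size can slightly exceed total_size.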