Lazy loading of images
-
The image resources on a page are not all requested at once; instead, lazy loading is implemented by listening for events (such as scroll) and storing the real image URL in a pseudo attribute of the img tag (a small fallback sketch follows the sample output below).
- Pseudo attribute: any custom, arbitrarily named attribute with no built-in meaning (this site uses src2).
-
import requests
from lxml import etree

headers = {
    # spoofed request header
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
url = 'https://sc.chinaz.com/tupian/renwusuxie.html'
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
page_text = response.text

# parse the data: image link + name
tree = etree.HTML(page_text)
div_list = tree.xpath('//*[@id="container"]/div')
for div in div_list:
    img_name = div.xpath('./div/a/img/@alt')[0] + '.jpg'
    img_src = 'https:' + div.xpath('./div/a/img/@src2')[0]
    print(img_name, img_src)
C:\Python\Python36\python.exe C:/Users/learn/爬虫/main.py
冬季欧美女生写真图片jpg https://scpic3.chinaz.net/Files/pic/pic9/202108/apic34778_s.jpg
欧美帅哥户外大片写真图片jpg https://scpic3.chinaz.net/Files/pic/pic9/202108/apic34772_s.jpg
美女低眸瞬间图片jpg https://scpic3.chinaz.net/Files/pic/pic9/202108/apic34751_s.jpg
戴防毒面具的两女孩互相拥抱图片jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/hpic4368_s.jpg
春天花海美女小清新图片jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/apic34739_s.jpg
秋季美女背影图片摄影jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/apic34741_s.jpg
性感亚洲女神写真图片jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/apic34715_s.jpg
可爱新生儿艺术照jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/apic34717_s.jpg
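The block above hard-codes the @src2 pseudo attribute. On some list pages the first few images are rendered eagerly with a normal src while the rest still carry the pseudo attribute; a small defensive sketch (reusing the tree object from the code above, attribute names assumed) can fall back between the two:

# Sketch: prefer the lazy-load pseudo attribute, fall back to the real src.
for div in tree.xpath('//*[@id="container"]/div'):
    src_list = div.xpath('./div/a/img/@src2') or div.xpath('./div/a/img/@src')
    if src_list:
        print('https:' + src_list[0])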
Advanced requests operations
-
cookie
-
Requirement: crawl the feed data on the Xueqiu homepage, https://xueqiu.com/
-
Analysis: the data is loaded dynamically; with the browser's packet-capture (network) tool we can locate the URL of the dynamically loaded data together with its request parameters.
-
import requests

headers = {
    # spoofed request header
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
}
url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=242002&size=15'
data = requests.get(url, headers=headers).json()
print(data)
-
Returned result:
-
{'error_description': '遇到错误,请刷新页面或者重新登录帐号后再试', 'error_uri': '/statuses/hot/listV2.json', 'error_data': None, 'error_code': '400016'}
-
Even though the program carries a UA, it still does not get the data you want:
- the browser is not being simulated thoroughly enough!
-
Full simulation (UA + Referer + Cookie): the data is crawled successfully
-
import requests

headers = {
    # spoofed request headers: UA + Referer + Cookie copied from the browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
    'Referer': 'https://xueqiu.com/',
    'Cookie': 'acw_tc=b65cfd3816291614515901029e4c915ddf73bc3deb5210a205445051cc7566; xq_a_token=0de231800ecb3f75e824dc0a23866218ead61a8e; xqat=0de231800ecb3f75e824dc0a23866218ead61a8e; xq_r_token=55c21eea0ba3549a92f908d2f8ee69f0a03d067b; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYzMDY5MDEwMSwiY3RtIjoxNjI5MTYxMzk5NDk1LCJjaWQiOiJkOWQwbjRBWnVwIn0.c8vwqYbhIVzPfCfSw8k_UieWLs2hM5tCsqHXMBjlp1A5C4dxVqcEgdkQE9Kn5TK7CKSmuFubsO231LnDsey52fcR6onDc2aaamRQCvbQRkEXgNLaD20P065Q5BRV-PqjhnLAG9E2cCqyz78awn8QTrbMxEd17Bktm-98bIbwtJ4L5fcLLQqWDxYWpuM1Tm_Sy0dozPAUYfJt9FtvnlTlknVO7vuS3Co-I8XFMRGJyDZDAbUllCPiVzfDdVum1Xs0V-94PPSEQi15IBRzwTruVuuFCk6ps2-x3Tu6RFtSmc3dAuLkpUITxRjzGRpoh-PEpUgH9_-k452bAbPPQAmsCg; u=801629161451595; Hm_lvt_1db88642e346389874251b5a1eded6e3=1629161324; device_id=d7ca45f0ef9ed7f63659797fc6dcba18; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1629161377'
}
url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=242002&size=15'
data = requests.get(url, headers=headers).json()
print(data)
-
Capturing the cookie dynamically with the session mechanism
- Create a session object: requests.Session()
- Purpose of the session object: the session captures cookies dynamically for us, and requests can be sent through the session object (which then carries those cookies automatically).
- Note: the session object must be used for at least two requests (the first obtains the cookie, the following ones carry it).
-
import requests

headers = {
    # spoofed request header
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}
sess = requests.Session()  # create a session object

# first request: hit the main page so the session captures the cookie dynamically
main_url = 'https://xueqiu.com/'
sess.get(main_url, headers=headers)

# second request: the session now carries the cookie automatically
url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=242002&size=15'
data = sess.get(url, headers=headers).json()
print(data)
-
-
Proxies
-
How to think about a proxy: it is a proxy server
-
What do proxies have to do with crawling?
- What a proxy does: forwards requests and responses.
- When a crawler runs into IP restrictions, a proxy server can be used to change the IP the requests come from.
-
Proxy types (a minimal proxies-dict sketch follows this list):
- http: forwards HTTP requests
- https: forwards HTTPS requests
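With requests, which proxy is used depends on the scheme key in the proxies dict; a minimal sketch with placeholder addresses:

import requests

# Placeholder addresses -- substitute real proxy IP:port values.
proxies = {
    'http': 'http://111.11.11.11:4251',   # used when the target URL is http://
    'https': 'http://111.11.11.11:4251',  # used when the target URL is https://
}
response = requests.get('https://www.sogou.com/web?query=ip',
                        proxies=proxies, timeout=5)
print(response.status_code)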
-
Proxy anonymity levels (a quick way to check them is sketched after this list)
- transparent
- anonymous
- elite (high anonymity)
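One way to see what the target server actually receives at each anonymity level is to send a proxied request to an echo service and inspect the forwarded headers; a rough sketch using httpbin.org (proxy address is a placeholder):

import requests

# Placeholder proxy -- replace with a real one to test.
proxies = {'http': 'http://111.11.11.11:4251', 'https': 'http://111.11.11.11:4251'}

# httpbin echoes back the headers it received.
# A transparent proxy typically leaks your real IP via X-Forwarded-For,
# an anonymous proxy reveals that a proxy is in use (e.g. a Via header) but hides your IP,
# and an elite proxy sends neither.
resp = requests.get('http://httpbin.org/headers', proxies=proxies, timeout=5)
print(resp.json()['headers'])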
-
Using a proxy server:
- Platform: http://http.zhiliandaili.cn/
-
Test:
-
Without a proxy
-
import requests
from lxml import etree

headers = {
    # spoofed request header
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}
url = 'https://www.sogou.com/web?query=ip'
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
address = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
print(address)
-
With a proxy
-
import requests
from lxml import etree

headers = {
    # spoofed request header
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}
url = 'https://www.sogou.com/web?query=ip'
# the proxies parameter applies the proxy
page_text = requests.get(url, headers=headers, proxies={'https': '27.42.139.248:45131'}).text
tree = etree.HTML(page_text)
address = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
print(address)
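A single free proxy tends to die or get blocked quickly, so in practice it is common to rotate through a small pool and retry on failure. A rough sketch, assuming a hypothetical list of proxy addresses (placeholders below, e.g. bought from the platform above):

import random
import requests

# Hypothetical proxy pool -- replace with addresses from a proxy provider.
proxy_pool = ['111.11.11.11:4251', '122.22.22.22:4256', '133.33.33.33:4257']

def get_with_proxy(url, headers, retries=3):
    for _ in range(retries):
        proxy = random.choice(proxy_pool)
        try:
            return requests.get(url, headers=headers,
                                proxies={'https': proxy}, timeout=5)
        except requests.exceptions.RequestException:
            continue  # this proxy failed or timed out, try another one
    return None  # all attempts failed

Called like get_with_proxy('https://www.sogou.com/web?query=ip', headers), it returns None only if every attempt fails.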
-
-
-
Captcha recognition
-
Captcha-solving platform: http://www.ttshitu.com/?spm=null
-
import requests
from lxml import etree
import base64
import json

headers = {
    # spoofed request header
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}

def base64_api(uname, pwd, img, typeid):
    with open(img, 'rb') as f:
        base64_data = base64.b64encode(f.read())
        b64 = base64_data.decode()
    data = {"username": uname, "password": pwd, "typeid": typeid, "image": b64}
    result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
    if result['success']:
        return result["data"]["result"]
    else:
        return result["message"]

# parse the captcha image URL and save the image locally
main_url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = requests.get(main_url, headers=headers).text
tree = etree.HTML(page_text)
code_img_src = 'https://so.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
code_img_data = requests.get(code_img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(code_img_data)

# use the ttshitu API to recognize the captcha
img_path = "./code.jpg"
result = base64_api(uname='xxx', pwd='xxx', img=img_path, typeid=3)
print(result)
-
-
Simulated login
-
import requests
from lxml import etree
import base64
import json

headers = {
    # spoofed request header
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
}
sess = requests.Session()

def base64_api(uname, pwd, img, typeid):
    with open(img, 'rb') as f:
        base64_data = base64.b64encode(f.read())
        b64 = base64_data.decode()
    data = {"username": uname, "password": pwd, "typeid": typeid, "image": b64}
    result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
    if result['success']:
        return result["data"]["result"]
    else:
        return result["message"]

# parse the captcha image URL and save the image locally (use the session so the cookie is captured)
main_url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = sess.get(main_url, headers=headers).text
tree = etree.HTML(page_text)
code_img_src = 'https://so.gushiwen.cn' + tree.xpath('//*[@id="imgCode"]/@src')[0]
code_img_data = sess.get(code_img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(code_img_data)

# use the ttshitu API to recognize the captcha
img_path = "./code.jpg"
result = base64_api(uname='bb328410948', pwd='bb328410948', img=img_path, typeid=3)
print(result)

# simulated login
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
username = input('enter user name:')
password = input('enter password:')
data = {
    '__VIEWSTATE': 'WDOU/FLrhpAUbZQlbYgLVRWKoFdIj7hS9dJdOGN1m4dkVeva94H7EaMo/tDa+sUqYk1zFiRvVg2jvYnKFmkXNM1JGpz1FPx3ibk+c6O5SbcKJDfPF+pYtSfnbTc=',
    '__VIEWSTATEGENERATOR': 'C93BE1AE',
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': username,
    'pwd': password,
    'code': result,
    'denglu': '登录'
}
# perform the login with the same session (it already holds the cookie)
logined_page_text = sess.post(login_url, data=data, headers=headers).text
with open('./gushiwen.html', 'w', encoding='utf-8') as fp:
    fp.write(logined_page_text)
-
Asynchronous crawling
-
Thread pool
-
Synchronous version:
-
import time

start = time.time()
urls = [
    'www.1.com',
    'www.2.com',
    'www.3.com'
]

def get_request(url):
    print('requesting url:', url)
    time.sleep(2)
    print('request finished:', url)

for url in urls:
    get_request(url)

print('total time:', time.time() - start)
-
Asynchronous version using a thread pool:
-
import time
from multiprocessing.dummy import Pool  # thread pool

start = time.time()
urls = [
    'www.1.com',
    'www.2.com',
    'www.3.com'
]

def get_request(url):
    print('requesting url:', url)
    time.sleep(2)
    print('request finished:', url)
    return 123

# create a thread pool object
pool = Pool(3)
# get_request is called once for every element of the urls list
result_list = pool.map(get_request, urls)
print(result_list)
print('total time:', time.time() - start)
Producer-consumer pattern
import threading
import requests
from lxml import etree
from urllib import request
from queue import Queue


class Producer(threading.Thread):
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
    }

    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse_page(url)

    def parse_page(self, url):
        response = requests.get(url=url, headers=self.headers)
        text = response.text
        html = etree.HTML(text)
        img_list = html.xpath('//div[@class="page-content text-center"]/div/a/img')
        for img in img_list:
            img_url = img.xpath('./@data-original')[0]
            img_name = img.xpath('./@alt')[0] + '.jpg'
            self.img_queue.put((img_url, img_name))


class Consumer(threading.Thread):
    def __init__(self, page_queue, img_queue, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.img_queue = img_queue

    def run(self):
        while True:
            if self.page_queue.empty() and self.img_queue.empty():
                break
            img_url, img_name = self.img_queue.get()
            request.urlretrieve(img_url, "imgs/" + img_name)
            print(img_name + " downloaded!")


# entry point: build the queues, seed the page URLs, start producers and consumers
def main():
    page_queue = Queue(50)   # holds the page URLs
    img_queue = Queue(100)   # holds the parsed image links
    # crawl the first 10 pages
    for x in range(1, 11):
        url = "https://www.doutula.com/photo/list/?page=%d" % x
        page_queue.put(url)  # put the 10 page URLs into page_queue
    for x in range(3):
        t = Producer(page_queue, img_queue)
        t.start()
    for x in range(3):
        t = Consumer(page_queue, img_queue)
        t.start()


if __name__ == '__main__':
    main()
-
-
Single thread + multi-task asynchronous coroutines
-
Special functions
- A function defined with the async keyword is a special function.
- What is special about it:
- When a special function is called, the statements in its body are not executed immediately (see the sketch after this list).
- The call returns a coroutine object.
- special function == a specified group of operations == coroutine object
- coroutine object == a specified group of operations
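A minimal sketch of this point, reusing the same get_request example used below:

import asyncio

async def get_request(url):   # a "special" (async) function
    print('requesting url:', url)
    return 123

c = get_request('www.1.com')  # the body does NOT run here, nothing is printed
print(c)                      # <coroutine object get_request at 0x...>
# To actually execute the body, the coroutine must be scheduled, e.g. via the
# task/event-loop machinery introduced below (or asyncio.run(c)); otherwise
# Python warns that the coroutine was never awaited.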
-
Coroutines
- How to create one: returned by calling a special function
- coroutine object == a specified group of operations
-
Task objects
-
A task object is a higher-level coroutine object
-
task object == coroutine object == a specified group of operations
- task object == a specified group of operations
-
Creating a task object:
-
asyncio.ensure_future(c)
-
-
Binding a callback function to a task object:
-
# define the callback function
def parse(t):  # must take exactly one parameter: the task object that invoked the callback
    result = t.result()  # result() returns the return value of the special function wrapped by task t
    print(result)

task.add_done_callback(parse)  # bind the callback to the task object
-
-
-
Event loop object
- Role: acts as a container that holds task objects. Once the event loop is started, the task object(s) loaded into it are executed asynchronously.
- Creation: loop = asyncio.get_event_loop()
- Load and start: loop.run_until_complete(task)
-
Complete implementation:
-
import asyncio
import time

# definition of the special function
async def get_request(url):
    print('requesting url:', url)
    time.sleep(2)
    print('request finished:', url)
    return 123

# define the callback function
def parse(t):  # must take exactly one parameter: the task object that invoked the callback
    result = t.result()  # result() returns the return value of the special function wrapped by task t
    print(result)

# coroutine object
c = get_request('www.1.com')
# task object
task = asyncio.ensure_future(c)
task.add_done_callback(parse)  # bind the callback to the task object

# event loop object
loop = asyncio.get_event_loop()  # create an event loop object
loop.run_until_complete(task)    # load the task object into the loop and start the event loop
-
-
Achieving the asynchronous effect with multiple tasks
- wait(): grants every task object in the tasks list permission to be suspended.
- Suspension: makes the current task object give up its use of the CPU.
- Note: inside the special function, code from modules that do not support async must not appear, otherwise the whole asynchronous effect is broken!
- The await keyword: guarantees that the blocking operation will actually be executed!
import asyncio
import time

start = time.time()
urls = [
    'www.1.com',
    'www.2.com',
    'www.3.com'
]

# definition of the special function
async def get_request(url):
    print('requesting url:', url)
    await asyncio.sleep(2)  # async-friendly sleep, replaces time.sleep
    print('request finished:', url)
    return 123

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print('total time:', time.time() - start)
-
Crawl data from our own server to demonstrate the asynchronous effect
-
Server.py
-
from flask import Flask, render_template
from time import sleep

app = Flask(__name__)

@app.route('/bobo')
def index1():
    sleep(2)
    return render_template('test.html')

@app.route('/jay')
def index2():
    sleep(2)
    return render_template('test.html')

@app.route('/tom')
def index3():
    sleep(2)
    return render_template('test.html')

if __name__ == '__main__':
    app.run(debug=True)
-
import asyncio
import requests
import time
from lxml import etree

start = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

async def get_request(url):
    response = requests.get(url)
    page_text = response.text
    return page_text

def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    data = tree.xpath('//body/text()')[0]
    print(data)

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print('total time:', time.time() - start)
-
The code above does not achieve any asynchronous effect, because the requests module does not support async.
- Use aiohttp instead of requests
-
aiohttp:
-
pip install aiohttp
-
Coding steps:
-
Write the rough skeleton
-
async def get_request(url):
    with aiohttp.ClientSession() as sess:  # create a session (request) object
        with sess.get(url) as response:    # send the request, get the response object
            page_text = response.text()    # read() would return bytes instead
            return page_text
-
-
Fill in the details
-
Add the async keyword in front of every with
-
Add the await keyword in front of every blocking step
-
async def get_request(url):
    async with aiohttp.ClientSession() as sess:      # create a session (request) object
        async with await sess.get(url) as response:  # send the request, get the response object
            page_text = await response.text()        # read() would return bytes instead
            return page_text
-
-
Complete version
-
import asyncio
import aiohttp
import time
from lxml import etree

start = time.time()
urls = [
    'http://127.0.0.1:5000/bobo',
    'http://127.0.0.1:5000/jay',
    'http://127.0.0.1:5000/tom',
]

async def get_request(url):
    async with aiohttp.ClientSession() as sess:      # create a session (request) object
        async with await sess.get(url) as response:  # send the request, get the response object
            page_text = await response.text()        # read() would return bytes instead
            return page_text

def parse(task):
    page_text = task.result()
    tree = etree.HTML(page_text)
    data = tree.xpath('//body/text()')[0]
    print(data)

tasks = []
for url in urls:
    c = get_request(url)
    task = asyncio.ensure_future(c)
    task.add_done_callback(parse)
    tasks.append(task)

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
print('total time:', time.time() - start)
-
-
-