Crawler Basics and Examples 2

Image lazy loading

  • The image resources on a page are not all requested at once; instead, lazy loading is implemented with event listeners combined with a pseudo-attribute on the img tag

    • Pseudo-attribute: any arbitrarily named attribute with no built-in meaning
  • import requests
    from lxml import etree
    headers = { # spoofed request headers
    	'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
    }
    url = 'https://sc.chinaz.com/tupian/renwusuxie.html'
    response = requests.get(url,headers=headers)
    response.encoding = 'utf-8'
    page_text = response.text
    
    # Data parsing: image link + name
    tree = etree.HTML(page_text)
    div_list = tree.xpath('//*[@id="container"]/div')
    for div in div_list:
    	img_name = div.xpath('./div/a/img/@alt')[0]+'.jpg'
    	img_src = 'https:'+div.xpath('./div/a/img/@src2')[0]
    	print(img_name,img_src)
    
    C:\Python\Python36\python.exe C:/Users/learn/爬虫/main.py
    冬季欧美女生写真图片jpg https://scpic3.chinaz.net/Files/pic/pic9/202108/apic34778_s.jpg
    欧美帅哥户外大片写真图片jpg https://scpic3.chinaz.net/Files/pic/pic9/202108/apic34772_s.jpg
    美女低眸瞬间图片jpg https://scpic3.chinaz.net/Files/pic/pic9/202108/apic34751_s.jpg
    戴防毒面具的两女孩互相拥抱图片jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/hpic4368_s.jpg
    春天花海美女小清新图片jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/apic34739_s.jpg
    秋季美女背影图片摄影jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/apic34741_s.jpg
    性感亚洲女神写真图片jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/apic34715_s.jpg
    可爱新生儿艺术照jpg https://scpic2.chinaz.net/Files/pic/pic9/202108/apic34717_s.jpg
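
  • The loop above only prints each name and lazy-load URL. A minimal follow-up sketch (reusing div_list, requests and headers from the code above; the imgs folder name is just an example) persists the images to disk:

  • import os
    os.makedirs('./imgs', exist_ok=True) # folder for the downloaded pictures
    for div in div_list:
        img_name = div.xpath('./div/a/img/@alt')[0]+'.jpg'
        img_src = 'https:'+div.xpath('./div/a/img/@src2')[0]
        # .content returns the binary image data, which is written to disk as-is
        img_data = requests.get(img_src,headers=headers).content
        with open('./imgs/'+img_name,'wb') as fp:
            fp.write(img_data)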
    

Advanced requests operations

  • cookie

    • Requirement: scrape the news feed data from the Xueqiu homepage, https://xueqiu.com/

    • Analysis: the data is loaded dynamically; with the packet-capture tool we located the URL of the dynamically loaded data along with its request parameters

    • import requests
      from lxml import etree
      headers = { # spoofed request headers
      	'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'
      }
      url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=242002&size=15'
      data = requests.get(url,headers=headers).json()
      print(data)
      
    • Returned result:

    • {'error_description': '遇到错误,请刷新页面或者重新登录帐号后再试', 'error_uri': '/statuses/hot/listV2.json', 'error_data': None, 'error_code': '400016'}
      
    • Even with the UA attached, the program still fails to get the data you want:

      • The browser is not being simulated thoroughly enough!
    • Full simulation: the data is scraped successfully

    • import requests
      from lxml import etree
      headers = { # spoofed request headers
      	'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
      	'Referer': 'https://xueqiu.com/',
      	'Cookie':'acw_tc=b65cfd3816291614515901029e4c915ddf73bc3deb5210a205445051cc7566; xq_a_token=0de231800ecb3f75e824dc0a23866218ead61a8e; xqat=0de231800ecb3f75e824dc0a23866218ead61a8e; xq_r_token=55c21eea0ba3549a92f908d2f8ee69f0a03d067b; xq_id_token=eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiJ9.eyJ1aWQiOi0xLCJpc3MiOiJ1YyIsImV4cCI6MTYzMDY5MDEwMSwiY3RtIjoxNjI5MTYxMzk5NDk1LCJjaWQiOiJkOWQwbjRBWnVwIn0.c8vwqYbhIVzPfCfSw8k_UieWLs2hM5tCsqHXMBjlp1A5C4dxVqcEgdkQE9Kn5TK7CKSmuFubsO231LnDsey52fcR6onDc2aaamRQCvbQRkEXgNLaD20P065Q5BRV-PqjhnLAG9E2cCqyz78awn8QTrbMxEd17Bktm-98bIbwtJ4L5fcLLQqWDxYWpuM1Tm_Sy0dozPAUYfJt9FtvnlTlknVO7vuS3Co-I8XFMRGJyDZDAbUllCPiVzfDdVum1Xs0V-94PPSEQi15IBRzwTruVuuFCk6ps2-x3Tu6RFtSmc3dAuLkpUITxRjzGRpoh-PEpUgH9_-k452bAbPPQAmsCg; u=801629161451595; Hm_lvt_1db88642e346389874251b5a1eded6e3=1629161324; device_id=d7ca45f0ef9ed7f63659797fc6dcba18; Hm_lpvt_1db88642e346389874251b5a1eded6e3=1629161377'
      }
      url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=242002&size=15'
      data = requests.get(url,headers=headers).json()
      print(data)
      
    • Capture the cookie dynamically using the Session mechanism

      • Create a session object: requests.Session()
      • Purpose of the session object: the session captures cookies for us dynamically, and requests can be sent through the session object.
      • Note: the session object must be used for at least two requests (one to capture the cookie, one that carries it).
    • import requests
      from lxml import etree
      headers = { # spoofed request headers
      	'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
      }
      sess = requests.Session() # create a session object
      # first request through sess to capture the cookie dynamically
      main_url = 'https://xueqiu.com/'
      sess.get(main_url,headers=headers)
      
      url = 'https://xueqiu.com/statuses/hot/listV2.json?since_id=-1&max_id=242002&size=15'
      data = sess.get(url,headers=headers).json() # this request carries the captured cookie
      print(data)
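
    • To verify that the first request really captured the cookie, a quick sketch (assuming the sess object from the code above) can inspect the session's cookie jar before the JSON request is sent:

    • # the cookies captured by the first sess.get() are stored on the session object
      print(sess.cookies.get_dict()) # should contain entries such as acw_tc / xq_a_token, like the hand-copied Cookie header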
      
  • Proxies

    • How to understand a proxy: a proxy server

    • How are proxies related to crawlers?

      • Role of a proxy: forwarding requests and responses
      • When a crawler runs into IP restrictions, it can use a proxy server to change the request IP.
    • Proxy types:

      • http: forwards http requests
      • https: forwards https requests
    • Proxy anonymity levels

      • Transparent
      • Anonymous
      • Elite (high anonymity)
    • Using a proxy server:

      • Platform: http://http.zhiliandaili.cn/
    • Test:

      • Without a proxy

      • import requests
        from lxml import etree
        headers = { # spoofed request headers
        	'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        }
        url = 'https://www.sogou.com/web?query=ip'
        page_text = requests.get(url,headers=headers).text
        tree = etree.HTML(page_text)
        address = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
        print(address)
        
      • With a proxy

      • import requests
        from lxml import etree
        headers = { # spoofed request headers
        	'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        }
        url = 'https://www.sogou.com/web?query=ip'
        # the proxies parameter applies the proxy to this request
        page_text = requests.get(url,headers=headers,proxies={'https':'27.42.139.248:45131'}).text
        tree = etree.HTML(page_text)
        address = tree.xpath('//*[@id="ipsearchresult"]/strong/text()')[0]
        print(address)
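
      • A single proxy IP tends to get banned quickly, so in practice the extracted proxies are rotated. A minimal sketch, assuming proxy_list holds ip:port strings pulled from the proxy platform (the values below are placeholders):

      • import random
        import requests
        headers = { # spoofed request headers
            'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
        }
        # placeholder proxy addresses; replace them with ones extracted from the platform
        proxy_list = ['27.42.139.248:45131', '121.232.148.97:9000']
        url = 'https://www.sogou.com/web?query=ip'
        # pick a random proxy per request so no single IP takes all the traffic
        page_text = requests.get(url,headers=headers,proxies={'https':random.choice(proxy_list)}).text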
        
  • Captcha recognition

    • Captcha-solving platform: http://www.ttshitu.com/?spm=null

    • import requests
      from lxml import etree
      import base64
      import json
      headers = { # spoofed request headers
      	'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
      }
      def base64_api(uname, pwd, img, typeid):
          with open(img, 'rb') as f:
              base64_data = base64.b64encode(f.read())
              b64 = base64_data.decode()
          data = {"username": uname, "password": pwd, "typeid": typeid, "image": b64}
          result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
          if result['success']:
              return result["data"]["result"]
          else:
              return result["message"]
          return ""
      
      # Parse the captcha image URL and save the image locally
      main_url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
      page_text = requests.get(main_url,headers=headers).text
      tree = etree.HTML(page_text)
      code_img_src = 'https://so.gushiwen.cn'+tree.xpath('//*[@id="imgCode"]/@src')[0]
      code_img_data = requests.get(code_img_src,headers=headers).content
      with open('./code.jpg','wb') as fp:
      	fp.write(code_img_data)
      
      # Use the TuJian (ttshitu) API to recognize the captcha
      img_path = "./code.jpg"
      result = base64_api(uname='xxx', pwd='xxx', img=img_path, typeid=3)
      print(result)
      
  • Simulated login

    • import requests
      from lxml import etree
      import base64
      import json
      headers = { # spoofed request headers
      	'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36',
      }
      sess = requests.Session()
      def base64_api(uname, pwd, img, typeid):
          with open(img, 'rb') as f:
              base64_data = base64.b64encode(f.read())
              b64 = base64_data.decode()
          data = {"username": uname, "password": pwd, "typeid": typeid, "image": b64}
          result = json.loads(requests.post("http://api.ttshitu.com/predict", json=data).text)
          if result['success']:
              return result["data"]["result"]
          else:
              return result["message"]
          return ""
      
      # Parse the captcha image URL and save the image locally
      main_url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
      page_text = sess.get(main_url,headers=headers).text
      tree = etree.HTML(page_text)
      code_img_src = 'https://so.gushiwen.cn'+tree.xpath('//*[@id="imgCode"]/@src')[0]
      code_img_data = sess.get(code_img_src,headers=headers).content
      with open('./code.jpg','wb') as fp:
      	fp.write(code_img_data)
      
      # Use the TuJian (ttshitu) API to recognize the captcha
      img_path = "./code.jpg"
      result = base64_api(uname='bb328410948', pwd='bb328410948', img=img_path, typeid=3)
      print(result)
      
      # Simulated login
      login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
      username = input('enter user name:')
      password = input('enter password:')
      data = {
          '__VIEWSTATE': 'WDOU/FLrhpAUbZQlbYgLVRWKoFdIj7hS9dJdOGN1m4dkVeva94H7EaMo/tDa+sUqYk1zFiRvVg2jvYnKFmkXNM1JGpz1FPx3ibk+c6O5SbcKJDfPF+pYtSfnbTc=',
          '__VIEWSTATEGENERATOR': 'C93BE1AE',
          'from': 'http://so.gushiwen.cn/user/collect.aspx',
          'email': username,
          'pwd': password,
          'code': result,
          'denglu': '登录'
      }
      # Perform the login
      logined_page_text = sess.post(login_url,data=data,headers=headers).text
      with open('./gushiwen.html','w',encoding='utf-8') as fp:
          fp.write(logined_page_text)
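
    • A quick way to sanity-check the login result (a sketch; it relies on the assumption that only the login page contains the captcha element with id="imgCode" used in the xpath above, so its presence in the returned page suggests the form was served again):

    • # if the captcha element is still in the response, the login most likely failed
      if 'id="imgCode"' in logined_page_text:
          print('Login probably failed: check the captcha result or the credentials')
      else:
          print('Login appears successful; page saved to gushiwen.html')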
      

Asynchronous crawlers

  • Thread pool

    • Synchronous behaviour:

    • import requests
      from multiprocessing.dummy import Pool # thread pool
      import time
      start = time.time()
      urls = [
          'www.1.com',
          'www.2.com',
          'www.3.com'
      ]
      def get_request(url):
          print('requesting url:',url)
          time.sleep(2)
          print('request finished:',url)
      
      for url in urls:
          get_request(url)
      
      print('total time:',time.time()-start)
      
    • Asynchronous behaviour with a thread pool:

    • import requests
      from multiprocessing.dummy import Pool # thread pool
      import time
      start = time.time()
      urls = [
          'www.1.com',
          'www.2.com',
          'www.3.com'
      ]
      def get_request(url):
          print('requesting url:',url)
          time.sleep(2)
          print('request finished:',url)
          return 123
      
      # create a thread pool object
      pool = Pool(3)
      # get_request is called once per element of the urls list
      result_list = pool.map(get_request,urls)
      print(result_list)
      
      print('total time:',time.time()-start)
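
    • The same pattern applies directly to real requests: pool.map hands each URL to a worker thread and collects the responses. A minimal sketch (the URLs below are just examples standing in for the pages you actually want to crawl):

    • import requests
      from multiprocessing.dummy import Pool
      urls = [
          'https://www.baidu.com/',
          'https://www.sogou.com/',
          'https://www.qq.com/'
      ]
      def get_page(url):
          # each call runs in one of the pool's worker threads
          return requests.get(url).text
      pool = Pool(3)
      # map blocks until every url has been fetched and returns the page texts in order
      page_text_list = pool.map(get_page,urls)
      print(len(page_text_list))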
      

    Producer-consumer pattern

    import threading
    import requests
    from lxml import etree
    import os
    from urllib import request
    from queue import Queue
    
    
    class Producer(threading.Thread):
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
        }
    
        def __init__(self, page_queue, img_queue, *args, **kwargs):
            super(Producer, self).__init__(*args, **kwargs)
            self.page_queue = page_queue
            self.img_queue = img_queue
    
        def run(self):
            while True:
                if self.page_queue.empty():
                    break
                url = self.page_queue.get()
                self.parse_page(url)
    
        def parse_page(self, url):
            response = requests.get(url=url,headers=self.headers)
            text = response.text
            html = etree.HTML(text)
    
            img_list = html.xpath('//div[@class="page-content text-center"]/div/a/img')
            for img in img_list:
                img_url = img.xpath('./@data-original')[0]
                img_name = img.xpath('./@alt')[0]+'.jpg'
                self.img_queue.put((img_url, img_name))
    
    
    
    
    class Consumer(threading.Thread):
        def __init__(self, page_queue, img_queue, *args, **kwargs):
            super(Consumer, self).__init__(*args, **kwargs)
            self.page_queue = page_queue
            self.img_queue = img_queue
    
        def run(self):
            while True:
                if self.page_queue.empty() and self.img_queue.empty():
                    break
                img_url, img_name = self.img_queue.get()
                request.urlretrieve(img_url, "imgs/" + img_name)
            print(img_name + " downloaded!")
    
    # main method: builds the queues and feeds the page URLs to the worker threads
    def main():
        page_queue = Queue(50) # holds the page URLs
        img_queue = Queue(100) # holds the parsed image links
        os.makedirs('imgs', exist_ok=True) # the consumer writes into imgs/, so make sure it exists
        # crawl the first 10 pages
        for x in range(1, 11):
            url = "https://www.doutula.com/photo/list/?page=%d" % x
            page_queue.put(url) # add the 10 page URLs to page_queue
    
        for x in range(3):
            t = Producer(page_queue, img_queue)
            t.start()
    
        for x in range(3):
            t = Consumer(page_queue, img_queue)
            t.start()
    
    
    if __name__ == '__main__':
        main() 
    
  • Single thread + multi-task asynchronous coroutines

    • Special functions

      • A function defined with the async keyword is a "special function"
      • What makes it special:
        • When a special function is called, the statements in its body are not executed immediately
        • The call returns a coroutine object
        • special function == a specified group of operations == coroutine object
          • coroutine object == a specified group of operations
    • Coroutines

      • How to create one: returned by calling a special function
      • coroutine object == a specified group of operations
    • Task objects

      • A task object is a higher-level coroutine object

      • task object == coroutine object == a specified group of operations

        • task object == a specified group of operations
      • Creating a task object:

        • asyncio.ensure_future(c)
          
      • Binding a callback to a task object:

        • # define the callback
          def parse(t): # must take exactly one parameter: the task object that invoked the callback
              result = t.result() # result() returns the return value of the special function wrapped by task t
              print(result)
              
          task.add_done_callback(parse) # bind the callback to the task object
          
          
    • Event loop object

      • Purpose: acts as a container holding task objects. Once the event loop is started, the task objects loaded into it are executed asynchronously.
      • Create: loop = asyncio.get_event_loop()
      • Load and start: loop.run_until_complete(task)
    • Full implementation:

    • import asyncio
      import time
      # definition of the special function
      async def get_request(url):
          print('requesting url:',url)
          time.sleep(2)
          print('request finished:',url)
          return 123
      # define the callback
      def parse(t): # must take exactly one parameter: the task object that invoked the callback
          result = t.result() # result() returns the return value of the special function wrapped by task t
          print(result)
      # coroutine object
      c = get_request('www.1.com')
      # task object
      task = asyncio.ensure_future(c)
      task.add_done_callback(parse) # bind the callback to the task object
      # event loop object
      loop = asyncio.get_event_loop() # create an event loop object
      loop.run_until_complete(task) # load the task object into loop and start the event loop
      
      
  • Demonstrating the asynchronous effect with multiple tasks

    • wait(): grants every task object in the tasks list permission to be suspended.
      • Suspend: the current task object gives up its hold on the CPU.
    • Note: inside a special function, do not use code from modules that lack async support, otherwise the whole asynchronous effect is broken!
    • The await keyword: guarantees that the blocking operation will actually be executed!
    import asyncio
    import time
    
    start = time.time()
    urls = [
        'www.1.com',
        'www.2.com',
        'www.3.com'
    ]
    # definition of the special function
    async def get_request(url):
        print('requesting url:',url)
        await asyncio.sleep(2) # async-friendly sleep, replaces time.sleep
        print('request finished:',url)
        return 123
    tasks = []
    for url in urls:
        c = get_request(url)
        task = asyncio.ensure_future(c)
        tasks.append(task)
    
    loop = asyncio.get_event_loop()
    loop.run_until_complete(asyncio.wait(tasks))
    
    print('total time:',time.time()-start)
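
    • On Python 3.7+ the manual loop management can be replaced by asyncio.run and asyncio.gather. A sketch of the equivalent call, reusing the get_request and urls defined above:

    • # Python 3.7+ equivalent of the get_event_loop / run_until_complete pair
      async def main():
          # gather schedules all the coroutines concurrently and waits for them all to finish
          await asyncio.gather(*[get_request(url) for url in urls])
      asyncio.run(main())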
    
    • Crawl data from your own server to demonstrate the asynchronous effect

      • Server.py

      • from flask import Flask,render_template
        from time import sleep
        app = Flask(__name__)
        
        @app.route('/bobo')
        def index1():
        	sleep(2)
        	return render_template('test.html')
        @app.route('/jay')
        def index2():
        	sleep(2)
        	return render_template('test.html')
        @app.route('/tom')
        def index3():
        	sleep(2)
        	return render_template('test.html')
        if __name__ == '__main__':
            app.run(debug=True)
        
      • import asyncio
        import requests
        import time
        from lxml import etree
        start = time.time()
        urls = ['http://127.0.0.1:5000/bobo',
               'http://127.0.0.1:5000/jay',
               'http://127.0.0.1:5000/tom',]
        
        async def get_request(url):
            response = requests.get(url)
            page_text = response.text
            return page_text
        
        def parse(task):
            page_text = task.result()
            tree = etree.HTML(page_text)
            data = tree.xpath('//body/text()')[0]
            print(data)
        
        tasks = []
        for url in urls:
            c = get_request(url)
            task = asyncio.ensure_future(c)
            task.add_done_callback(parse)
            tasks.append(task)
        
        loop = asyncio.get_event_loop()
        loop.run_until_complete(asyncio.wait(tasks))
        
        print('total time:',time.time()-start)
        
      • The code above achieves no asynchronous effect, because the requests module does not support async.

        • Use aiohttp instead of requests
      • aiohttp:

        • pip install aiohttp

        • Coding workflow:

          • Write the rough skeleton

            • async def get_request(url):
                  with aiohttp.ClientSession() as sess: # create a session object
                      with sess.get(url) as response: # send the request, get the response object
                          page_text = response.text() # read() would return bytes instead
                          return page_text
              
          • Fill in the details

            • Add the async keyword before every with

            • Add the await keyword before every blocking step

            • async def get_request(url):
                  async with aiohttp.ClientSession() as sess: # create a session object
                      async with await sess.get(url) as response: # send the request, get the response object
                          page_text = await response.text() # read() would return bytes instead
                          return page_text
              
          • Full version

          • import asyncio
            import aiohttp
            import time
            from lxml import etree
            start = time.time()
            urls = ['http://127.0.0.1:5000/bobo',
                   'http://127.0.0.1:5000/jay',
                   'http://127.0.0.1:5000/tom',]
            
            async def get_request(url):
                async with aiohttp.ClientSession() as sess: # create a session object
                    async with await sess.get(url) as response: # send the request, get the response object
                        page_text = await response.text() # read() would return bytes instead
                        return page_text
            
            def parse(task):
                page_text = task.result()
                tree = etree.HTML(page_text)
                data = tree.xpath('//body/text()')[0]
                print(data)
            
            tasks = []
            for url in urls:
                c = get_request(url)
                task = asyncio.ensure_future(c)
                task.add_done_callback(parse)
                tasks.append(task)
            
            loop = asyncio.get_event_loop()
            loop.run_until_complete(asyncio.wait(tasks))
            
            print('total time:',time.time()-start)
            