2021-08-09 Web scraping study notes

20210806_zhihu

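Zhihu seems to have anti-scraping measures, so I will come back to it later. It cost me more than two hours; I will count it as practice.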

20210808_baidu&nankai

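  1. Unbelievable: the order of a tag's attributes in the fetched page can differ from what F12 (the browser dev tools) shows. No wonder my regular expression found nothing. I wasted more time on this and only noticed it by reading the fetched page source. Truly bizarre.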

    By the way, this time it was Baidu 😂

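  2. Regex: a pattern with literal spaces around (.*?) fails when the captured field is empty, even though (.*?) itself can match an empty string. So do not put spaces before and after the group; capture with (.*?) alone and remove any extra matched characters afterwards with re.sub().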

  3. Decoding

    Problem: when scraping Douyu live-stream data with Python, decoding with str(bytes_read, encoding) raised: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

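    Cause: stepping through with the debugger showed that the bytes returned by r.read() start with b'\x1f\x8b\x08', which means the data is gzip-compressed. That is why decoding fails, so the received bytes have to be decompressed before they are decoded. The revised code: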

    from urllib import request
    from io import BytesIO
    import gzip


    class Spider():
        url = 'https://www.douyu.com/'

        def __fetch_content(self):
            r = request.urlopen(Spider.url)
            htmls = r.read()
            # The body is gzip-compressed: decompress it first, then decode
            buff = BytesIO(htmls)
            f = gzip.GzipFile(fileobj=buff)
            htmls = f.read().decode('utf-8')
            return htmls

        # Entry point
        def go(self):
            self.__fetch_content()


    spider = Spider()
    spider.go()
    

    Author: 狗子的进阶史

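  4. html=requests.get(danmu_url).content does return a usable body, because requests decompresses gzip on its own; the method from the author above raised an error for me 💢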
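
One plausible reason the gzip-only code raised an error is that the server sometimes sends an uncompressed body, which gzip.GzipFile cannot read. A minimal sketch, assuming the body is either gzip-compressed or plain text in a known encoding, that checks for the gzip magic bytes before decompressing. The helper name decode_body and the sample text are made up for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # the first two bytes of every gzip stream


def decode_body(raw, encoding="utf-8"):
    """Decompress raw if it looks like gzip, then decode it to str."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# The same text, once plain and once gzip-compressed
plain = "斗鱼 douyu".encode("utf-8")
packed = gzip.compress(plain)
print(decode_body(plain) == decode_body(packed))
```

With urllib this would be called as decode_body(r.read()), so both kinds of response go through the same path.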

20210809_zhihu

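  1. requests decompresses gzip-compressed responses on its own. At first it looked unsuited to uncompressed responses, but an uncompressed body is simply returned unchanged.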

  2. It looks like a single request header was the reason Zhihu did not return the complete page.

  3. Working code:

    import random
    import re

    import requests
    from bs4 import BeautifulSoup


    def getdata_zhihu_2(url):
        print('Zhihu hot list')
        data = []
        d = []
        agent_list = [
            "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533."
            "17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            # IPod
            "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17."
            "9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            # IPAD
            "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML"
            ", like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
            "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML,"
            " like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
            # Android
            "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebK"
            "it/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 "
            "(KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            # QQ Browser, Android version
            "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod"
            "-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
            # Android Opera Mobile
            "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8."
            "149 Version/11.10",
            # Android Pad Moto Xoom
            "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML"
            ", like Gecko) Version/4.0 Safari/534.13",
            # BlackBerry
            "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) V"
            "ersion/6.0.0.337 Mobile Safari/534.1+",
            # WebOS HP Touchpad
            "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko)"
            " wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
            # Nokia N97
            "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuratio"
            "n/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
            # Windows Phone Mango
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
            # UC Browser
            "UCWEB7.0.2.37/28/999",
            "NOKIA5700/ UCWEB7.0.2.37/28/999",
            # UCOpenwave
            "Openwave/ UCWEB7.0.2.37/28/999",
            # UC Opera
            "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
        ]
        head = {'Referer': 'https://www.zhihu.com/hot',
                "user-agent": random.choice(agent_list),
                'x-requested-with': 'fetch'}
        response = requests.get(url, headers=head)
        soup = BeautifulSoup(response.text, features="lxml")
        for item in soup.find_all('a', class_='css-hi1lih'):
            # Earlier attempt with find().string, kept for reference:
            # info_hot = item.find(class_='css-1ixcu37').string
            # info_title = item.find(class_='css-3yucnr').string
            # info_id = item.find(class_='css-qg55th').string
            item = str(item)
            # "万热度" ("ten-thousand heat") is literal text on the page
            info_hot = re.findall(r'<div class=".*?">(\d*? 万热度)</div>', item)[0]
            info_title = re.findall(r'<h1 class="css-3yucnr">(.*?)</h1>', item)[0]
            info_id = re.findall(r'<div class=".*?">(\d*?)</div>', item)[0]
            info_link = re.findall(r'a class="css-hi1lih" href="(.*?)" rel="noopener noreferrer"', item)[0]
            # Build a new list per entry; clearing and reusing one list would
            # leave every element of data pointing at the same object.
            d = [info_id, info_title, info_link, info_hot]
            print(d, '-'*100)
            data.append(d)
        # Zhihu sometimes returns an incomplete page: retry until entries are found
        if not data:
            return getdata_zhihu_2(url)
        return data
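
The retry at the end of getdata_zhihu_2 calls itself with no limit, so a page that never parses would eventually hit Python's recursion limit. A minimal sketch of a bounded alternative follows. The names fetch_with_retry, max_attempts and delay are hypothetical, and fetch stands for any function that takes a URL and returns a list of rows (empty when the page was incomplete).

```python
import time


def fetch_with_retry(fetch, url, max_attempts=5, delay=1.0):
    """Call fetch(url) until it returns a non-empty list or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        rows = fetch(url)
        if rows:
            return rows
        if attempt < max_attempts:
            time.sleep(delay)  # short pause before the next attempt
    return []  # every attempt came back empty
```

This assumes the scraping function returns its list of rows instead of retrying itself; it could then be called as fetch_with_retry(getdata_zhihu_2, url).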
    

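    Zhihu is conquered, hahaha! The key was the request headers. Sometimes it still returns a bad response, though, so I added the two lines at the end of getdata_zhihu_2: if nothing was parsed, try again until it succeeds. The code below from another author helped me get there (thanks!), although its find().string calls did not seem to work for me, possibly because .string is None whenever a tag has more than one child: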

    import requests
    from bs4 import BeautifulSoup
    import lxml
    import random
    # List of User-Agent strings
    url = 'https://www.zhihu.com/billboard'
    agent_list = [
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        # IPod
        "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        # IPAD
        "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
        "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        # Android
        "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        # QQ Browser, Android version
        "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        # Android Opera Mobile
        "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
        # Android Pad Moto Xoom
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        # BlackBerry
        "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
        # WebOS HP Touchpad
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
        # Nokia N97
        "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
        # Windows Phone Mango
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
        # UC Browser
        "UCWEB7.0.2.37/28/999",
        "NOKIA5700/ UCWEB7.0.2.37/28/999",
        # UCOpenwave
        "Openwave/ UCWEB7.0.2.37/28/999",
        # UC Opera
        "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
        ]
    
    head = {"user-agent":random.choice(agent_list)}
    response = requests.get(url, headers = head)
    html = BeautifulSoup(response.text, features="lxml")
    ZY = html.select('.Card > .HotList-item')
    for Strzy in ZY:
        PM = Strzy.find(class_ = 'HotList-itemIndex').string
        BT = Strzy.find(class_ = 'HotList-itemTitle').string
        host = Strzy.find(class_ = 'HotList-itemMetrics').string
        print(PM,BT,host)
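
The regex for info_link in getdata_zhihu_2 assumes that class, href and rel appear in exactly that order, which the 20210808 note found is not guaranteed. A minimal sketch of an order-independent alternative using only re: grab each <a ...> opening tag first, then parse its attributes into a dict. The function names and the sample markup are made up for illustration; only the class name comes from the code above. It handles double-quoted attribute values only.

```python
import re

A_TAG_RE = re.compile(r'<a\b([^>]*)>')             # attribute text of each <a ...> tag
ATTR_RE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')  # name="value" pairs, double quotes only


def anchor_attrs(html):
    """Return one {attribute: value} dict per <a ...> opening tag."""
    return [dict(ATTR_RE.findall(m.group(1))) for m in A_TAG_RE.finditer(html)]


def links_with_class(html, css_class):
    """Return the href of every <a> tag that carries css_class."""
    return [attrs["href"] for attrs in anchor_attrs(html)
            if css_class in attrs.get("class", "").split() and "href" in attrs]


# The same link with its attributes in a different order (made-up sample markup)
as_in_f12 = '<a class="css-hi1lih" href="https://example.com/q/1" rel="noopener noreferrer">'
as_fetched = '<a rel="noopener noreferrer" href="https://example.com/q/1" class="css-hi1lih">'

print(links_with_class(as_in_f12, "css-hi1lih"))
print(links_with_class(as_fetched, "css-hi1lih"))
```

Both calls print the same one-element list, so the extraction no longer depends on how the server orders the attributes.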
    
    