20210806_zhihu
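Zhihu seems to have anti-scraping measures. I'll come back to it later. It cost me more than two hours, so I'll just count it as practice.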
20210808_baidu&nankai
-
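I give up: the HTML shown in the F12 dev tools and the HTML I actually fetched can list a tag's attributes in a different order. No wonder my regular expression found nothing. I wasted some more time, and in the end I only noticed it by reading the fetched page source. Truly bizarre.
By the way, this time it was Baidu 😂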
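One way to sidestep the attribute-order problem is to parse the tag instead of matching the raw text. Below is a minimal sketch using the standard library's html.parser. The sample markup, the class name `title`, and the helper names `LinkCollector` and `collect_links` are made up for illustration and are not taken from Baidu's page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given class (hypothetical helper)."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict lookup ignores their order.
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "a" and self.wanted_class in classes and "href" in attr_map:
            self.links.append(attr_map["href"])


def collect_links(html, wanted_class):
    parser = LinkCollector(wanted_class)
    parser.feed(html)
    return parser.links


# The same link with its attributes in two different orders (made-up sample).
as_seen_in_devtools = '<a class="title" href="/s?wd=1">one</a>'
as_fetched = '<a href="/s?wd=1" class="title">one</a>'
print(collect_links(as_seen_in_devtools, "title"))
print(collect_links(as_fetched, "title"))
```

Both calls give the same result, because the lookup goes through attribute names rather than their position in the tag.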
-
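Regex: (.*?) can capture an empty string, but literal spaces written around it in the pattern still have to be present in the text, so an empty field fails to match. For fields that may be empty, leave the spaces out of the pattern and use re.sub() to replace the extra characters that get captured.
-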
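A small sketch of the regex advice above. The `<td>` rows are made-up sample data, and the variable names are only for illustration.

```python
import re

# Made-up sample rows: the second one has an empty field.
rows = ['<td> hello </td>', '<td></td>']

# Literal spaces around the group must exist in the text, so the empty cell is missed.
strict = re.compile(r'<td> (.*?) </td>')
# Without the spaces, the lazy group is free to capture an empty string.
loose = re.compile(r'<td>(.*?)</td>')

strict_hits = [strict.findall(row) for row in rows]
loose_hits = [loose.findall(row) for row in rows]

# re.sub() then removes the surrounding whitespace that the looser pattern captures.
cleaned = [re.sub(r'^\s+|\s+$', '', hit[0]) for hit in loose_hits]

print(strict_hits)
print(loose_hits)
print(cleaned)
```

The strict pattern finds nothing in the empty cell, while the loose pattern returns an empty string for it, which is easier to handle than an IndexError on `[0]`.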
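Decoding
Problem: while scraping Douyu live-stream data with Python, decoding with str(raw_bytes, encoding) raised: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
Cause: stepping through with a debugger showed that the bytes returned by r.read() start with b'\x1f\x8b\x08', which means the data is gzip-compressed. That is why decoding failed, so the received bytes have to be decompressed before they are decoded. The revised code: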
from urllib import request
from io import BytesIO
import gzip


class Spider():
    url = 'https://www.douyu.com/'

    def __fetch_content(self):
        r = request.urlopen(Spider.url)
        htmls = r.read()
        # Wrap the raw bytes in a file-like object so GzipFile can read them
        buff = BytesIO(htmls)
        f = gzip.GzipFile(fileobj=buff)
        # Decompress first, then decode the result as UTF-8
        htmls = f.read().decode('utf-8')

    # Entry point
    def go(self):
        self.__fetch_content()


spider = Spider()
spider.go()
Author: 狗子的进阶史
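The gzip check described above can be made explicit, so the same function also copes with a body that is not compressed. This is a minimal sketch, not the quoted author's code: `decode_body` is a hypothetical helper, and the sample page string is made up.

```python
import gzip


def decode_body(raw, encoding="utf-8"):
    """Decode a response body, gunzipping it first only when it looks gzip-compressed."""
    # gzip streams begin with the two magic bytes 0x1f 0x8b.
    if raw[:2] == b"\x1f\x8b":
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Made-up page content, once as plain bytes and once gzip-compressed.
page = "<html>斗鱼</html>"
plain = page.encode("utf-8")
compressed = gzip.compress(plain)

print(decode_body(plain))
print(decode_body(compressed))
```

With urllib, the bytes from r.read() could be passed straight to such a helper instead of always wrapping them in GzipFile.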
-
html=requests.get(danmu_url).content
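This approach decodes fine (requests decompresses gzip responses on its own, so no manual gzip step is needed), whereas the method from the author above raised an error for me 💢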
20210809_zhihu
-
In my tests, requests worked for the gzip-compressed response, but did not seem to work for the uncompressed one.
-
It seems that a single request header is the reason Zhihu does not return the full response.
-
import random
import re

import requests
from bs4 import BeautifulSoup


def getdata_zhihu_2(url):
    print('Zhihu hot list')
    data = []
    agent_list = [
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        # iPod
        "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        # iPad
        "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
        "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
        # Android
        "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        # QQ Browser for Android
        "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        # Android Opera Mobile
        "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
        # Android Pad Moto Xoom
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        # BlackBerry
        "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
        # WebOS HP Touchpad
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
        # Nokia N97
        "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
        # Windows Phone Mango
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
        # UC Browser
        "UCWEB7.0.2.37/28/999",
        "NOKIA5700/ UCWEB7.0.2.37/28/999",
        # UCOpenwave
        "Openwave/ UCWEB7.0.2.37/28/999",
        # UC Opera
        "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
    ]
    head = {
        'Referer': 'https://www.zhihu.com/hot',
        'user-agent': random.choice(agent_list),
        'x-requested-with': 'fetch',
    }
    response = requests.get(url, headers=head)
    soup = BeautifulSoup(response.text, features="lxml")
    # print(soup)
    # for item in soup.find_all('div', class_='css-1mx3lj4'):
    #     item = str(item)
    #     print(item, '-'*100)
    for item in soup.find_all('a', class_='css-hi1lih'):
        # info_hot = item.find(class_='css-1ixcu37').string
        # info_title = item.find(class_='css-3yucnr').string
        # info_id = item.find(class_='css-qg55th').string
        item = str(item)
        # print(item, '-'*100)
        info_hot = re.findall(r'<div class=".*?">(\d*? 万热度)</div>', item)[0]
        info_title = re.findall(r'<h1 class="css-3yucnr">(.*?)</h1>', item)[0]
        info_id = re.findall(r'<div class=".*?">(\d*?)</div>', item)[0]
        info_link = re.findall(r'a class="css-hi1lih" href="(.*?)" rel="noopener noreferrer"', item)[0]
        # Build a fresh list for every row. Reusing one list and calling clear()
        # on it would leave every entry of data pointing at the same list.
        d = [info_id, info_title, info_link, info_hot]
        print(d, '-'*100)
        data.append(d)
    # Zhihu sometimes returns an incomplete page: retry until rows come back.
    if not data:
        return getdata_zhihu_2(url)
    return data
I finally conquered Zhihu, hahaha. The key was the request headers. It still sometimes returns a bad response, so I added the last two lines of code: if the fetch fails, try again until it succeeds. The code below from this guy helped a lot, although his find().string did not seem to work for me (possibly because .string returns None when a tag has more than one child node). Thanks!
import requests
from bs4 import BeautifulSoup
import lxml
import random

# User-agent list
url = 'https://www.zhihu.com/billboard'
agent_list = [
    "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    # iPod
    "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    # iPad
    "Mozilla/5.0 (iPad; U; CPU OS 4_2_1 like Mac OS X; zh-cn) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8C148 Safari/6533.18.5",
    "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    # Android
    "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    # QQ Browser for Android
    "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    # Android Opera Mobile
    "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    # Android Pad Moto Xoom
    "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    # BlackBerry
    "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    # WebOS HP Touchpad
    "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    # Nokia N97
    "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    # Windows Phone Mango
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    # UC Browser
    "UCWEB7.0.2.37/28/999",
    "NOKIA5700/ UCWEB7.0.2.37/28/999",
    # UCOpenwave
    "Openwave/ UCWEB7.0.2.37/28/999",
    # UC Opera
    "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
]
head = {"user-agent": random.choice(agent_list)}
response = requests.get(url, headers=head)
html = BeautifulSoup(response.text, features="lxml")
ZY = html.select('.Card > .HotList-item')
for Strzy in ZY:
    PM = Strzy.find(class_='HotList-itemIndex').string      # rank
    BT = Strzy.find(class_='HotList-itemTitle').string      # title
    host = Strzy.find(class_='HotList-itemMetrics').string  # heat metric
    print(PM, BT, host)
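The "if it fails, try again" idea from my notes above retries without any limit. A bounded variant might look like the sketch below. `fetch_with_retry`, `max_attempts` and `delay_seconds` are hypothetical names, and the lambda stands in for a real scraper call such as getdata_zhihu_2(url).

```python
import time


def fetch_with_retry(fetch, max_attempts=5, delay_seconds=0.0):
    """Call fetch() until it returns a non-empty result or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        rows = fetch()
        if rows:
            return rows
        # Pause between attempts, but not after the final one.
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return []


# Stand-in for a scraper that returns nothing on its first two calls (made-up data).
responses = iter([[], [], [["1", "title", "/question/1", "100 万热度"]]])
result = fetch_with_retry(lambda: next(responses))
print(result)
```

Capping the attempts avoids hitting Python's recursion limit, or hammering the site, when the page never comes back complete.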