Here You Go: Python Web Scraping Notes to Help Your Annual Salary Break 200K

Latest recommended article published on 2024-08-16 18:55:06

Python文泽老师

Category: python. Tags: python, web scraping, programming languages, pycharm, pygame

Article link: https://blog.csdn.net/python_9988/article/details/120911640

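These notes cover requests, BeautifulSoup, scrapy, and re. So far I have worked through all of them except scrapy. I also reproduced a few small projects from other authors, such as scraping rental listings from 58.com (58同城) and scraping Taobao product search results. The summary below has three parts: basic methods, hands-on practice, and problems encountered.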

1. Basic methods

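The first library is requests, a simple and practical HTTP request library for Python. Its main functions are request(), get(), head(), post(), put(), patch(), and delete(). requests.request() builds and sends a request, and all the other functions are built on top of it. get() retrieves a page, head() retrieves only the response headers (not the HTML head tag), post() and put() submit data, patch() submits a partial modification, and delete() submits a delete request.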

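The focus here is requests.get(), whose signature is requests.get(url, params=None, **kwargs).

Here url is the page link, params is a dictionary of extra query parameters appended to the URL, and **kwargs covers 12 further parameters that control the request: data, json, headers, cookies, auth, files, timeout, proxies, allow_redirects, stream, verify, and cert.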

In most cases, get() is all we need to fetch the content of a page.
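As a minimal sketch of how params and headers shape a request, the snippet below builds a GET request without sending it, so it runs offline. The URL https://example.com/search, the helper name build_get, and the parameter values are placeholders chosen for illustration; they are not from the original notes.

```python
import requests


def build_get(url, params=None, headers=None):
    """Prepare (but do not send) a GET request so its final URL and headers can be inspected."""
    req = requests.Request("GET", url, params=params, headers=headers)
    return req.prepare()


prepared = build_get(
    "https://example.com/search",           # placeholder URL (assumption)
    params={"q": "python", "page": 2},      # encoded into the query string
    headers={"User-Agent": "Mozilla/5.0"},  # sent as a request header
)
print(prepared.url)                    # the URL with the query string appended
print(prepared.headers["User-Agent"])
```

To actually send the request, the same arguments go straight to requests.get(), ideally with a timeout so a stalled server cannot hang the script, for example requests.get(url, params=..., headers=..., timeout=10).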

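Next is the Response object returned by a request. Its most commonly used attributes are:

r.status_code: the HTTP status code; 200 means success
r.text: the response body decoded to a string
r.encoding: the encoding guessed from the HTTP headers
r.apparent_encoding: the encoding detected from the content itself
r.content: the response body as raw bytes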

Here are a few commonly used snippets.

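(1) Scraping a JD.com product page

import requests

url = "https://item.jd.com/2967929.html"
try:
    r = requests.get(url)
    r.raise_for_status()  # raise an exception if the response has an error status
    r.encoding = r.apparent_encoding  # decode with the encoding detected from the content
    print(r.text[:1000])
except requests.RequestException:
    print("Scrape failed!")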

(2) Scraping Amazon, which requires changing the headers field to imitate a browser request

import requests

url = "https://www.amazon.cn/gp/product/B01M8L5Z3Y"
try:
    kv = {'user-agent': 'Mozilla/5.0'}  # simulated request header
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.status_code)
    print(r.text[:1000])
except requests.RequestException:
    print("Scrape failed")

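(3) Submitting a Baidu search keyword through params

import requests

url = "http://www.baidu.com/s"
try:
    kv = {'wd': 'Python'}
    r = requests.get(url, params=kv)
    print(r.request.url)  # the final URL, with the keyword in the query string
    r.raise_for_status()
    print(len(r.text))
    print(r.text[500:5000])
except requests.RequestException:
    print("Scrape failed")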

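(4) Downloading and saving an image

import os

import requests

url = "http://tc.sinaimg.cn/maxwidth.800/tc.service.weibo.com/p3_pstatp_com/6da229b421faf86ca9ba406190b6f06e.jpg"
root = "D://pics//"
path = root + url.split('/')[-1]  # name the file after the last segment of the URL
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        r.raise_for_status()  # do not save an error page as an image
        with open(path, 'wb') as f:  # the with block closes the file automatically
            f.write(r.content)  # r.content holds the image as raw bytes
        print("File saved")
    else:
        print("File already exists")
except (requests.RequestException, OSError):
    print("Scrape failed")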
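The file-handling part of the image snippet can be tried without any download. The sketch below uses only the standard library; the helper names local_path_for and save_bytes, the example.com URL, and the fake image bytes are all illustrative assumptions. It also strips a query string from the file name, which url.split('/')[-1] would keep.

```python
import os
import tempfile
from urllib.parse import urlsplit


def local_path_for(url, root):
    """Return the local file path for `url` inside `root`, creating `root` if needed."""
    os.makedirs(root, exist_ok=True)  # also creates missing parent directories
    filename = os.path.basename(urlsplit(url).path)  # last path segment, without the query string
    return os.path.join(root, filename)


def save_bytes(data, path):
    """Write `data` to `path` unless the file already exists; return True if it was written."""
    if os.path.exists(path):
        return False
    with open(path, "wb") as f:  # closed automatically when the block ends
        f.write(data)
    return True


# Demo with a temporary directory and fake bytes instead of a real download.
demo_root = os.path.join(tempfile.mkdtemp(), "pics")
demo_path = local_path_for("http://example.com/images/cat.jpg?size=800", demo_root)
first = save_bytes(b"\xff\xd8fake-jpeg-bytes", demo_path)
second = save_bytes(b"\xff\xd8fake-jpeg-bytes", demo_path)
print(os.path.basename(demo_path), first, second)  # cat.jpg True False
```

In a real run, the bytes passed to save_bytes would be r.content from a successful requests.get() call.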

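Next is the BeautifulSoup library, which parses the content of a web page.

BeautifulSoup(mk, 'html.parser') builds a parse tree from the markup mk. The parser can be html.parser, lxml, xml, or html5lib; html.parser is used here.

The basic elements are Tag, Name, Attributes, NavigableString, and Comment. A tag is accessed as soup.a, its attributes as a.attrs['class'], and the non-attribute string inside it (a NavigableString) as tag.string. A Comment is an HTML comment.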

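The tag tree can be traversed in three directions:

Downward: .contents, .children, .descendants
Upward: .parent, .parents
Sideways: .next_sibling, .previous_sibling, .next_siblings, .previous_siblings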

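In addition, soup.prettify() returns the markup with line breaks and indentation that follow the nesting of the tags, which makes it easier to read.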
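A small sketch of the element types and the three traversal directions, run on a hand-written document (the HTML string is made up for illustration, and html.parser is used so that no extra parser needs to be installed):

```python
from bs4 import BeautifulSoup

# A tiny hand-written document, not taken from a real site.
html = (
    "<html><body><div id='box'>"
    "<p class='course'>Intro</p><a href='/a'>A</a><a href='/b'>B</a>"
    "</div></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p                    # Tag: the first <p> in the document
p_classes = first_p.attrs["class"]  # class is multi-valued, so this is a list
p_text = first_p.string             # NavigableString inside the tag

div = soup.div
child_names = [child.name for child in div.children]  # downward: direct children only
descendant_count = len(list(div.descendants))         # downward: tags and strings at any depth
parent_names = [p.name for p in soup.a.parents]       # upward: every ancestor of the first <a>
next_name = first_p.next_sibling.name                 # sideways: the node right after <p>

pretty = soup.prettify()
print(pretty)
```

Because the sample has no whitespace between tags, next_sibling lands on the following tag. In real pages it is often a whitespace string instead, so it is worth checking the type of what comes back.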

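Information is usually extracted with find_all(). To search by tag name, use soup.find_all('a'); by attribute, soup.find_all('p', class_='course') (class_ takes a trailing underscore because class is a Python keyword); by string, soup.find_all(string='…'); and with a regular expression, soup.find_all(re.compile('link')), which matches tag names containing "link".

find(): searches and returns the first result as a Tag, or None if nothing matches
find_parents(): searches the ancestor nodes and returns a list
find_parent(): returns the first result among the ancestor nodes
find_next_siblings(): searches the following sibling nodes and returns a list
find_next_sibling(): returns the first result among the following sibling nodes
find_previous_siblings(): searches the preceding sibling nodes and returns a list
find_previous_sibling(): returns the first result among the preceding sibling nodes
find_all(name, attrs, recursive, string, **kwargs): returns a list holding all results

Parameters of find_all():
name: the tag name to search for; pass a list to match several tags, or find_all(True) to match every tag
attrs: the attribute values to search for; a plain string such as find_all('p', 'course') is matched against the class attribute, and other attributes can be given as a dictionary, for example attrs={'id': 'link1'}
recursive: whether to search all descendants; the default is True, and False searches only the direct children of the current node
string: a search string matched against the text between <> and </>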
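The search variants above can be compared side by side on a small sample. This is a sketch on invented HTML; the class names and ids are placeholders.

```python
import re

from bs4 import BeautifulSoup

html = """
<div>
  <p class="course">Python</p>
  <p class="note">Other</p>
  <a href="/link1" id="link1">One</a>
  <a href="/link2" id="link2">Two</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")                       # by tag name: a list of Tag objects
courses = soup.find_all("p", class_="course")    # by attribute; note the trailing underscore
by_text = soup.find_all(string="Python")         # by exact string: a list of NavigableString
by_regex = soup.find_all(id=re.compile("link"))  # attribute value matched with a regex
first_link = soup.find("a")                      # a single Tag
missing = soup.find("table")                     # None, because nothing matches
top_level = soup.find_all("p", recursive=False)  # only direct children of soup: none are <p>

print([a["href"] for a in links])
```

Since find() returns None when nothing matches, chaining something like soup.find('table').text raises AttributeError on pages that lack the element, so a None check before using the result is a sensible habit.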

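Finally, the re regular expression library.

The quantifiers are as follows:

*: zero or more repetitions
+: one or more repetitions
?: zero or one repetition
{m}: exactly m repetitions
{m,n}: between m and n repetitions

Greedy matching means a quantifier matches as many characters as possible; non-greedy (minimal) matching means it matches as few as possible. The * and + quantifiers are greedy by default, and appending a ? to them makes them non-greedy.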
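The difference is easiest to see on a short string. A minimal sketch with made-up sample text:

```python
import re

text = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.*</b>", text)  # .* runs to the last </b>
lazy = re.findall(r"<b>.*?</b>", text)   # .*? stops at the first </b>

numbers = "1 22 333 4444"
greedy_digits = re.findall(r"\d{2,3}", numbers)  # takes 3 digits whenever it can
lazy_digits = re.findall(r"\d{2,3}?", numbers)   # settles for 2 digits

print(greedy)         # ['<b>one</b><b>two</b>']
print(lazy)           # ['<b>one</b>', '<b>two</b>']
print(greedy_digits)  # ['22', '333', '444']
print(lazy_digits)    # ['22', '33', '44', '44']
```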

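Patterns in re are usually written as raw strings, i.e. r'text', so that backslashes reach the regex engine unchanged. To match a character that has a special meaning in a pattern, escape it with a backslash (\).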

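The main functions are:

re.search(pattern, string, flags=0): scans a string for the first location where the pattern matches and returns a match object
re.match(): matches the pattern only at the start of the string and returns a match object; check the result before using it, because it is None when there is no match
re.findall(): searches the string and returns all matching substrings as a list
re.split(): splits the string at every match and returns a list
re.finditer(): searches the string and returns an iterator whose elements are match objects
re.sub(): replaces every matching substring and returns the resulting string
re.compile(pattern, flags): compiles the string form of a regular expression into a pattern object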
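The functions listed above, applied to one sample string. The text and the six-digit pattern are illustrative choices, not from the original notes.

```python
import re

text = "BIT 100081, TSU 100084"
pattern = r"[1-9]\d{5}"  # a six-digit number that does not start with 0

first = re.search(pattern, text)    # first match anywhere in the string
at_start = re.match(pattern, text)  # None: the string does not start with digits
all_codes = re.findall(pattern, text)
parts = re.split(pattern, text)
replaced = re.sub(pattern, "<code>", text)
spans = [m.span() for m in re.finditer(pattern, text)]

print(first.group(0))  # 100081
print(at_start)        # None
print(all_codes)       # ['100081', '100084']
print(parts)           # ['BIT ', ', TSU ', '']
print(replaced)        # BIT <code>, TSU <code>
print(spans)           # [(4, 10), (16, 22)]
```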

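The flags argument defaults to 0. Three commonly used values are re.I (ignore case), re.M (^ and $ match at the start and end of every line), and re.S (the dot matches every character, including newlines).
The above is the functional style; there is also an object-oriented style:

pat = re.compile('')
pat.search(text)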
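A short sketch of the three flags together with a compiled pattern object; the sample text is invented for the demonstration.

```python
import re

text = "Python\npython 3\nPYTHON"

pat = re.compile(r"^python", re.I | re.M)  # ignore case; ^ matches at the start of every line
line_starts = pat.findall(text)

dot_default = re.search(r"Python.python", text)    # '.' does not match '\n' by default
dot_all = re.search(r"Python.python", text, re.S)  # with re.S, '.' also matches '\n'

print(line_starts)  # ['Python', 'python', 'PYTHON']
print(dot_default)  # None
```

Compiling once pays off when the same pattern is applied many times, for example to every page of a crawl.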

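Finally, the attributes and methods of the match object:

1. Attributes
1) string: the text being matched
2) re: the pattern object (regular expression) used for the match
3) pos: the position in the text where the search starts
4) endpos: the position in the text where the search ends

2. Methods
1) .group(0): the matched string
2) .start(): the start position of the match in the original string
3) .end(): the end position of the match in the original string
4) .span(): the tuple (.start(), .end())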
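These attributes and methods can be inspected directly. A minimal sketch with an illustrative string and pattern:

```python
import re

pat = re.compile(r"[1-9]\d{5}")
text = "BIT100081 TSU100084"

m = pat.search(text)
print(m.string == text)     # True: the text that was searched
print(m.re is pat)          # True: the pattern object that produced the match
print(m.pos, m.endpos)      # 0 19: the search window, here the whole string
print(m.group(0))           # 100081
print(m.start(), m.end())   # 3 9
print(m.span())             # (3, 9)

# A compiled pattern's search() accepts a start position, which shows up as m2.pos.
m2 = pat.search(text, 9)
print(m2.pos, m2.group(0))  # 9 100084
```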

2. Hands-on practice

The two examples chosen are Taobao product search and 58.com rental listings. The source articles are 'https://blog.csdn.net/u014135206/article/details/103216129?depth_1-utm_source=distribute.pc_relevant_right.none-task-blog-BlogCommendFromBaidu-8&utm_source=distribute.pc_relevant_right.none-task-blog-BlogCommendFromBaidu-8' and 'https://cloud.tencent.com/developer/article/1611414'.

Taobao search

import re

import requests


def getHTMLText(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0'
    }
    # Paste the Cookie value of a logged-in Taobao session here. To find it, open
    # Inspect Element, go to the Network tab, refresh, and look under the request headers.
    usercookies = ''
    cookies = {}
    for a in usercookies.split(';'):
        if '=' not in a:
            continue  # skip empty fragments, e.g. while usercookies is still ''
        name, value = a.strip().split('=', 1)
        cookies[name] = value
    print(cookies)
    try:
        r = requests.get(url, headers=headers, cookies=cookies, timeout=60)
        r.raise_for_status()  # raise an exception on an error status
        print(r.status_code)  # print the status code
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return 'failed'


def parsePage(ilt, html):
    try:
        plt = re.findall(r'"view_price":"[\d.]*"', html)
        tlt = re.findall(r'"raw_title":".*?"', html)
        for i in range(len(plt)):
            # Split at the first colon only, then strip the surrounding quotes
            price = plt[i].split(':', 1)[1].strip('"')
            title = tlt[i].split(':', 1)[1].strip('"')
            ilt.append([price, title])
    except IndexError:
        print("Failed to parse the page")


def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))  # header row
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))


def main():
    goods = '足球'  # search keyword ("football")
    depth = 3
    start_url = 'http://s.taobao.com/search?q={}&s='.format(goods)  # URL of the first results page
    infoList = []
    for i in range(depth):  # scrape the result pages one by one
        try:
            url = start_url + str(44 * i)  # the s parameter steps through 0, 44, 88, ...
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)


main()
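The parsing and cookie steps can be checked offline. The sketch below is a variation on the code above, not the original author's version: it uses capture groups instead of split(':'), so a colon inside a title is harmless, and zip so that mismatched list lengths cannot raise IndexError. The function names parse_goods and parse_cookie_string and the sample text are hypothetical; the sample only imitates the "raw_title" and "view_price" fields that the regular expressions look for.

```python
import re


def parse_goods(html):
    """Extract (price, title) pairs from text that embeds view_price and raw_title fields."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    return list(zip(prices, titles))  # zip stops at the shorter list


def parse_cookie_string(raw):
    """Turn a 'name=value; name2=value2' header string into a dict, ignoring empty fragments."""
    cookies = {}
    for item in raw.split(";"):
        if "=" in item:
            name, value = item.strip().split("=", 1)  # values may themselves contain '='
            cookies[name] = value
    return cookies


# Invented sample in the assumed field format.
sample = (
    '{"raw_title":"Football size 5","view_price":"59.00"},'
    '{"raw_title":"Kids: training ball","view_price":"25.50"}'
)
goods = parse_goods(sample)
print(goods)

print(parse_cookie_string("a=1; b=x=y"))  # {'a': '1', 'b': 'x=y'}
print(parse_cookie_string(""))            # {}
```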

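Scraping rental listings from 58.com. This part involves a lot of code, so only the key pieces are shown.

1. Decoding the obfuscated font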

import base64

import requests
from fontTools.ttLib import TTFont
from lxml import etree

# headers is not defined in this excerpt; a minimal User-Agent is assumed here.
headers = {'User-Agent': 'Mozilla/5.0'}


# Fetch the font file and convert it to an XML file
def get_font(page_url, page_num, proxies):
    response = requests.get(url=page_url, headers=headers, proxies=proxies)
    # Extract the base64-encoded string of the obfuscated font
    base64_string = response.text.split("base64,")[1].split("'")[0].strip()
    # print(base64_string)
    # Decode the base64 font string into binary data
    bin_data = base64.decodebytes(base64_string.encode())
    # Save it as a font file
    with open('58font.woff', 'wb') as f:
        f.write(bin_data)
    print('Page visit ' + str(page_num) + ': font file saved!')
    # Load the font file and convert it to an XML file
    font = TTFont('58font.woff')
    font.saveXML('58font.xml')
    print('Font file converted to XML!')
    return response.text


# Match the obfuscated font codes with the real digits
def find_font():
    # Digits that correspond to the glyph names
    glyph_list = {
        'glyph00001': '0',
        'glyph00002': '1',
        'glyph00003': '2',
        'glyph00004': '3',
        'glyph00005': '4',
        'glyph00006': '5',
        'glyph00007': '6',
        'glyph00008': '7',
        'glyph00009': '8',
        'glyph00010': '9'
    }
    # The ten obfuscated font codes
    unicode_list = ['0x9476', '0x958f', '0x993c', '0x9a4b', '0x9e3a', '0x9ea3', '0x9f64', '0x9f92', '0x9fa4', '0x9fa5']
    num_list = []
    # Query the XML file with XPath
    font_data = etree.parse('./58font.xml')
    for unicode in unicode_list:
        # Look up the glyph name that this code maps to in the XML file
        result = font_data.xpath("//cmap//map[@code='{}']/@name".format(unicode))[0]
        # print(result)
        # If the glyph name is a key of the dictionary, record its digit
        if result in glyph_list:
            num_list.append(glyph_list[result])
    print('Digits for all codes found!')
    # print(num_list)
    # Return the list of digits
    return num_list


# Replace every obfuscated character in the page
def replace_font(num, page_response):
    # 9476 958F 993C 9A4B 9E3A 9EA3 9F64 9F92 9FA4 9FA5
    result = (
        page_response
        .replace('鑶', num[0])
        .replace('閏', num[1])
        .replace('餼', num[2])
        .replace('驋', num[3])
        .replace('鸺', num[4])
        .replace('麣', num[5])
        .replace('齤', num[6])
        .replace('龒', num[7])
        .replace('龤', num[8])
        .replace('龥', num[9])
    )
    print('All obfuscated characters replaced!')
    return result
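Two steps of this pipeline can be exercised without the site: pulling the base64 payload out of the page text, and substituting the obfuscated characters. The sketch below uses only the standard library. The fake page, the payload b"demo-font" (not a real WOFF file), the digit mapping, and the function names are all invented for the demonstration; str.translate is swapped in for the chain of replace() calls and gives the same result in a single pass.

```python
import base64


def extract_font_bytes(page_text):
    """Pull out the base64 payload that follows 'base64,' and decode it to raw bytes."""
    payload = page_text.split("base64,")[1].split("'")[0].strip()
    return base64.b64decode(payload)


def replace_with_table(page_text, obfuscated_chars, digits):
    """Replace each obfuscated character with its digit in one pass."""
    table = str.maketrans(dict(zip(obfuscated_chars, digits)))
    return page_text.translate(table)


# Fake page text in the shape the split() calls expect.
fake_payload = base64.b64encode(b"demo-font").decode()
fake_page = (
    "src:url('data:application/font-woff;charset=utf-8;base64,"
    + fake_payload
    + "') rent: 鑶閏餼 per month"
)

font_bytes = extract_font_bytes(fake_page)
decoded_page = replace_with_table(fake_page, "鑶閏餼", ["1", "5", "0"])  # made-up mapping
print(font_bytes)
print(decoded_page[-19:])
```

On the real site the mapping has to be rebuilt from the font downloaded with each page, which is what find_font() does.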

2. Scraping the rental listings

import time

from bs4 import BeautifulSoup


# Extract the rental listings
def parse_pages(pages):
    num = 0
    soup = BeautifulSoup(pages, 'lxml')
    # Find the li tags that hold the individual listings
    all_house = soup.find_all('li', class_='house-cell')
    for house in all_house:
        # Title
        # title = house.find('a', class_='strongbox').text.strip()
        # print(title)
        # Price
        price = house.find('div', class_='money').text.strip()
        price = str(price)
        print(price)
        # Layout and floor area
        layout = house.find('p', class_='room').text.replace(' ', '')
        layout = str(layout)
        print(layout)
        # Residential complex and address
        address = house.find('p', class_='infor').text.replace(' ', '').replace('\n', '')
        address = str(address)
        print(address)
        num += 1
        print('Listing ' + str(num) + ' scraped, pausing for 3 seconds!')
        time.sleep(3)
        # encoding='utf-8' is needed because the default file encoding may differ from
        # that of the scraped text; 'a+' appends to the end of the file.
        with open('58.txt', 'a+', encoding='utf-8') as f:
            f.write(price + '\t' + layout + '\t' + address + '\n')
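One way to make this step easier to test is to separate parsing from printing, sleeping, and file writing. The sketch below is a variation under that assumption, not the original author's code: the function name extract_listings and the sample HTML are invented, html.parser replaces lxml so no extra parser is required, and all whitespace is removed from the layout and address fields rather than only spaces. The class names house-cell, money, room, and infor are the ones used above.

```python
from bs4 import BeautifulSoup


def extract_listings(page_html):
    """Return one dict per <li class="house-cell">, skipping cells that lack a field."""
    soup = BeautifulSoup(page_html, "html.parser")
    rows = []
    for house in soup.find_all("li", class_="house-cell"):
        price = house.find("div", class_="money")
        layout = house.find("p", class_="room")
        address = house.find("p", class_="infor")
        if price is None or layout is None or address is None:
            continue  # find() returns None when a field is absent
        rows.append({
            "price": price.get_text(strip=True),
            "layout": "".join(layout.get_text().split()),    # drop all whitespace
            "address": "".join(address.get_text().split()),
        })
    return rows


# Invented sample: one complete listing and one that lacks a price and an address.
sample = """
<ul>
  <li class="house-cell">
    <p class="room">2 rooms   80 m2</p>
    <p class="infor">Area A
      Road 1</p>
    <div class="money">3500 yuan/month</div>
  </li>
  <li class="house-cell"><p class="room">1 room</p></li>
</ul>
"""
listings = extract_listings(sample)
print(listings)
```

The rows can then be written to 58.txt in one place, still with encoding='utf-8'.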

3. Because 58.com bans the IP addresses of scrapers, proxy IP addresses also have to be scraped and rotated.

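import requests
from bs4 import BeautifulSoup


def getiplists(page_num):
    # Scrape proxy IP addresses from pages 1 to page_num - 1 into a list.
    # headers is the same request-header dictionary used above.
    proxy_list = []
    for page in range(1, page_num):
        url = "  " + str(page)  # the address of the proxy-list site is blank in the source
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        ips = soup.find_all('tr')
        for x in range(5, len(ips)):
            ip = ips[x]
            tds = ip.find_all("td")  # the td cells of this row
            # .contents returns the child nodes; the NavigableString between tags is also a node
            ip_temp = 'http://' + tds[1].contents[0] + ":" + tds[2].contents[0]
            proxy_list.append(ip_temp)
    proxy_list = set(proxy_list)  # remove duplicates
    proxy_list = list(proxy_list)
    print('Scraped ' + str(len(proxy_list)) + ' IP addresses')
    return proxy_list

Passing an updated proxies dictionary to requests.get() on each request keeps rotating the IP address. Here item is one address taken from proxy_list:

proxies = {
    'http': item,
    'https': item,
}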
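A minimal sketch of the rotation itself, using itertools.cycle from the standard library. The generator name proxy_rotation is hypothetical, and the addresses are placeholders from the 192.0.2.0/24 documentation range, not working proxies.

```python
from itertools import cycle


def proxy_rotation(proxy_list):
    """Yield a requests-style proxies dict for each address, looping forever."""
    for address in cycle(sorted(set(proxy_list))):  # de-duplicate; sort for a stable order
        yield {"http": address, "https": address}


# Placeholder addresses (assumption), with one duplicate to show the de-duplication.
rotation = proxy_rotation([
    "http://192.0.2.1:8080",
    "http://192.0.2.2:8080",
    "http://192.0.2.1:8080",
])
first = next(rotation)
second = next(rotation)
third = next(rotation)  # wraps around to the first address again
print(first["http"], second["http"], third["http"])
```

Each request then takes the next dictionary, for example requests.get(url, headers=headers, proxies=next(rotation), timeout=10). Free proxies fail often, so wrapping the call in try/except and moving on to the next address is usually necessary.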

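3. Lessons learned

The problems encountered along the way are summarized below:

1. Most sites require a simulated request header, namely User-Agent.

2. Taobao requires simulating a logged-in session with cookies; the cookie information can be found with the browser's Inspect Element tool.

3. This approach can only scrape static pages. Dynamic pages whose data is rendered by JavaScript require additional techniques.

4. The IP address is easily banned while scraping. The workaround is to scrape addresses from a proxy-list site and keep rotating them by adding the proxies parameter to get().

5. The price strings on 58.com are displayed with an obfuscated font and must be decoded.

6. When writing to a text file, pass encoding='utf-8' to avoid errors.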
