The Robots Protocol
The crawler protocol, better known as the Robots protocol, is formally called the Robots Exclusion Standard.
It tells search engines which pages may be crawled and which may not.
The protocol usually takes the form of a plain-text robots.txt file placed in the root directory of a website.
When a well-behaved crawler visits a site, it first checks whether a robots.txt file exists in the site's root directory.
If the file is found, the crawler follows its rules; if not, every page of the site is treated as crawlable.
1. Baidu's robots.txt
You can view it directly at https://www.baidu.com/robots.txt
Sample robots.txt entries (Baidu):
User-agent: Baiduspider    # names the crawler the rules apply to; a value of * would mean any crawler
Disallow: /baidu           # Disallow lists paths that must not be crawled; a value of / would block the whole site
2. Parsing robots.txt
'''Parsing robots.txt
RobotFileParser decides, based on a site's robots.txt, whether a given crawler may fetch a given URL.
'''
from urllib.robotparser import RobotFileParser
rp = RobotFileParser("https://www.136book.com/robots.txt")
rp.read()  # read() fetches and parses robots.txt; it returns nothing, but the rules are now loaded
'''Check whether a URL may be fetched'''
# can_fetch(useragent, url)
flag = rp.can_fetch('*', 'https://www.136book.com/anlianjushenghuainanquanji/qlrcxeqleb/')
print(flag)  # True
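The same parser can also be pointed at the Baidu robots.txt from section 1. The expected results in the sketch below are assumptions based on the sample rules shown above; the live file may of course differ.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://www.baidu.com/robots.txt")  # set_url() is an alternative to passing the URL to the constructor
rp.read()
# Assuming "User-agent: Baiduspider / Disallow: /baidu" is still in effect:
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/baidu/anything'))  # expected False
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/index.html'))      # expected True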
The parse module in urllib
The urllib library includes a parse module that offers a standard interface for handling URLs, such as splitting a URL into its components, combining components back into a URL, and converting links.
urlparse(url, scheme, allow_fragments): identifies and splits a URL into its components
url: the URL to parse
scheme: the default protocol, used when the URL itself carries none
allow_fragments: whether the fragment (anchor) is parsed as a separate component
1. Parsing a URL
The urlparse() function in urllib.parse splits a URL into its different parts.
from urllib.parse import urlparse
result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
# result = urlparse("http://www.baidu.com/index.html;user?id=5#comment",scheme='',allow_fragments=False)
print(type(result),result)
# <class 'urllib.parse.ParseResult'>
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
'''Breaking down the output above:
scheme: the protocol
netloc: the domain (network location)
path: the path
params: the parameters after ;
query: the query string after ? (typically used in GET-style URLs)
fragment: the anchor after #, which jumps straight to a position within the page
General form: scheme://netloc/path;params?query#fragment
'''
result = urlparse('www.ZangYushun.com',scheme='https')
print(result)
# ParseResult(scheme='https', netloc='', path='www.ZangYushun.com', params='', query='', fragment='')
# allow_fragments: when False, the fragment is not split out but stays attached to the preceding component
result = urlparse("http://www.baidu.com/index.html;user?id=5#comment",scheme='',allow_fragments=False)
print(result.scheme,result[0],result.netloc,result[1],sep='\n')  # fields can be read by attribute or by index
2. Rebuilding a URL: urlunparse()
urlunparse(): its argument is an iterable whose length must be exactly 6, otherwise an exception is raised
'''scheme, netloc, path, params, query, fragment'''
from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data)) # http://www.baidu.com/index.html;user?a=6#comment
from urllib.parse import urlsplit
# urlsplit() is similar to urlparse(), except that it does not split out the params part, so it returns only 5 fields
result = urlsplit('http://www.baidu.com/index.html;user?a=6#comment')
print(result)  # SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='a=6', fragment='comment')
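For completeness, urllib.parse also provides urlunsplit(), the counterpart of urlsplit(); it takes an iterable of exactly 5 parts:
from urllib.parse import urlunsplit
data = ['http', 'www.baidu.com', 'index.html', 'a=6', 'comment']  # scheme, netloc, path, query, fragment
print(urlunsplit(data))  # http://www.baidu.com/index.html?a=6#comment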
bs4.BeautifulSoup
Like lxml, BeautifulSoup is an HTML/XML parser whose main job is parsing documents and extracting data from them.
lxml only traverses the parts it needs, whereas BeautifulSoup is built around the HTML DOM: it loads the whole document and parses the entire DOM tree.
Its time and memory overhead are therefore larger, so its performance is lower than lxml's.
BeautifulSoup is simple to use for parsing HTML and has a friendly API.
It supports CSS selectors, the html.parser from the Python standard library, and lxml's HTML/XML parsers.
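The parser is chosen when the soup object is created. A minimal sketch contrasting the standard-library parser with lxml (lxml has to be installed separately; the snippet and variable names are only illustrative):
from bs4 import BeautifulSoup

html_snippet = "<p class='demo'>hello</p>"
soup_std = BeautifulSoup(html_snippet, 'html.parser')  # pure-Python parser from the standard library
soup_lxml = BeautifulSoup(html_snippet, 'lxml')        # faster C-based parser from the lxml package
print(soup_std.p.string)   # hello
print(soup_lxml.p.string)  # hello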
1. Parsing a page with BeautifulSoup
from bs4 import BeautifulSoup
html = '''
<html><head><title>The title</title></head>
<body>
<p class = "title" name="dromouse"><b>The Dormouse's story</b></p>
<p class = "story">
<a href = "https://www.baidu.com" class = "sister" id = "link1">1<!--测试注释--></a>
<a href = "https://www.baidu.com" class = "sister" id = "link2">2</a>
<a href = "https://www.baidu.com" class = "sister" id = "link3">3</a>
ceshi2</p>
<p class = "story">...</p>
'''
# Create the BeautifulSoup object
soup = BeautifulSoup(html,features='lxml')
# soup = BeautifulSoup(open('index.html'))
# Pretty-print the soup content, with missing tags completed
print(soup.prettify())
print(soup.title)
print(soup.head)
print(soup.a)
print(type(soup.p))
print(soup.head.name) # for tags inside the document, .name is the tag's own name
print(soup.p.name)
'''Part of the output printed above:
<title>The title</title>
<head><title>The title</title></head>
<a class="sister" href="https://www.baidu.com" id="link1">1<!--测试注释--></a>
<class 'bs4.element.Tag'>
'''
# Print all attributes of the p tag as a dict
print(soup.p.attrs) # {'class': ['title'], 'name': 'dromouse'}
print(soup.p['class']) # ['title']
print(soup.p.get('class')) # ['title']
# Modify an attribute
soup.p['class']='newClass'
print(soup.p) # <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
# Delete an attribute
del soup.p['class']
print(soup.p) # <p name="dromouse"><b>The Dormouse's story</b></p>
# Print the string contained in the tag
print(soup.p.string) # The Dormouse's story
print(type(soup.name)) # <class 'str'>
print(soup.name) # [document]
print(soup.a.string) # None, because the first <a> contains both text and a comment
2. bs4 examples
BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All objects can be grouped into 4 kinds:
1. Tag
2. NavigableString
3. BeautifulSoup
4. Comment
Tag: an HTML tag; every tag in the document becomes a Tag object.
We can access a tag simply through the soup (e.g. soup.p); these objects have type bs4.element.Tag.
Note, however, that this shorthand only returns the first tag in the document that matches (it cannot return all matching tags).
NavigableString: the text inside a tag, obtained with .string.
BeautifulSoup: this object represents the content of the whole document; most of the time it can be treated as a special kind of Tag.
print(type(soup))  # <class 'bs4.BeautifulSoup'>
Comment is a special kind of NavigableString; printing it does not include the comment markers. If a tag mixes text with a comment (as the first <a> above does), .string returns None.
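A minimal sketch that checks the four types against the soup object built from the sample html above (the isinstance checks and the contents index are only illustrative):
from bs4.element import Tag, NavigableString, Comment

print(isinstance(soup.p, Tag))                     # True -> Tag
print(isinstance(soup.b.string, NavigableString))  # True -> NavigableString ("The Dormouse's story")
print(isinstance(soup, BeautifulSoup))             # True -> the whole document
comment = soup.a.contents[1]                       # the second child of the first <a> is the HTML comment
print(isinstance(comment, Comment))                # True -> Comment; printing it omits the <!-- --> markers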
——————————————————————————————————————————————————————————
Traversing the document tree:
1. Direct children: the .contents and .children attributes
The .contents attribute returns a tag's children as a list (see the sketch just below).
The .children attribute does not return a list; it is a generator that you iterate over to get all the children.
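A minimal sketch of both attributes, reusing the soup object built above:
# .contents returns a real Python list of the children of <body>
print(soup.body.contents)

# .children is a generator, so it has to be iterated
for child in soup.body.children:
    print(child)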
Searching the document tree:
1. find_all(name, attrs, recursive, text, limit, **kwargs)
The name parameter: finds every Tag whose name matches; plain string nodes are ignored automatically.
Passing a list: BeautifulSoup returns every tag that matches any element of the list.
2. select(): searching with CSS selectors
① by tag name
② by class name
③ by id
④ combined lookup: tag name, class and id are combined just as when writing a CSS rule
⑤ attribute lookup: an attribute condition can be added in square brackets; note that the attribute and the tag belong to the same node,
so no space may appear between them, otherwise nothing is matched
print(soup.select('a[class="sister"]'))
⑥ getting the text content
Example verifying the methods above:
from bs4 import BeautifulSoup
html = '''
<html><head><title>The title</title></head>
<body>
<p class = "title" name="dromouse"><b>The Dormouse's story</b></p>
<p class = "story">
<a href = "https://www.baidu.com" class = "sister" id = "link1">1<!--测试注释--></a>
<a href = "https://www.baidu.com" class = "sister" id = "link2">2</a>
<a href = "https://www.baidu.com" class = "sister" id = "link3">3</a>
ceshi2</p>
<p class = "story">...</p>
'''
# Create the BeautifulSoup object
soup = BeautifulSoup(html,features='lxml')
# print(soup.body.contents)
# print(soup.body.contents[1])
#
# soup.find_all()
#
# for child in soup.body.children:  # prints every child of <body>, i.e. the <p> tags
# print(child)
#
# text = soup.find_all('a')
# print(text)
text = soup.find_all(['a','b'])
text = soup.find_all(id = 'link2')
text = soup.find_all(text = '1')
text = soup.find_all(text = ['1','2','3'])
text = soup.select('title')
text = soup.select('.sister')
text = soup.select('#link1')
text = soup.select('p #link1')
text = soup.select('head > title')
text = soup.select('a[href="https://www.baidu.com"]')
# print(text.get_text())
print(soup.select('a[href="https://www.baidu.com"]'))
text = soup.select('a[href="https://www.baidu.com"]')[0].get_text()
print(text)
for title in soup.select('title'):
    print(title.get_text())
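One more method used in the cases below: find() takes the same arguments as find_all() but returns only the first match (or None instead of an empty list). A minimal sketch against the same soup:
first_a = soup.find('a', id='link2')  # the first <a> whose id is link2, or None if nothing matches
all_a = soup.find_all('a')            # a list of every <a> tag
print(first_a['href'], len(all_a))    # https://www.baidu.com 3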
Example: scraping Double Color Ball lottery results
Open the page: http://zst.aicai.com/ssq/openInfo
Scrape the latest Double Color Ball draw (2021-05-16):
1. Scraping the data with select()
'''Scrape the latest Double Color Ball draw and print the red-ball and blue-ball numbers'''
import requests
from bs4 import BeautifulSoup
# Fetch the page
def get_html(url, headers):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("Fetch succeeded, page length:", len(response.text))
            response.encoding = 'utf-8'
            return response.text
    except BaseException as e:
        print("Error while fetching:", e)
# Parse the page
def get_soup(html):
    soup = BeautifulSoup(html, 'lxml')  # can parse both XML and HTML
    # soup = BeautifulSoup(html, 'html.parser')  # HTML only
    # date = soup.select('div.dataContent form div.mainTab.mainTab_lskj table tbody tr td')
    date = soup.select('tbody tr:nth-child(3) td:nth-child(2)')
    print(date)
    red = soup.select('tbody tr:nth-child(3) td.redColor')
    print(red)
    blue = soup.select('tbody tr:nth-child(3) td.blueColor')
    print(blue)
    print("Latest draw date", end=':')
    for i in range(len(date)):
        print(date[i].string, end=' ')
    print("Red balls", end=':')
    for i in range(len(red)):
        print(red[i].string, end=' ')
    print("Blue ball", end=':')
    for i in range(len(blue)):
        print(blue[i].string, end=' ')
if __name__ == '__main__':
    url = 'http://zst.aicai.com/ssq/openInfo'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    html = get_html(url, headers)
    get_soup(html)
2. Scraping the element data with find()
import requests
from bs4 import BeautifulSoup
# Fetch the page
def get_html(url, headers):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("Fetch succeeded, page length:", len(response.text))
            response.encoding = 'utf-8'
            return response.text
    except BaseException as e:
        print("Error while fetching:", e)
# Parse the page
def get_soup(html):
    soup = BeautifulSoup(html, 'lxml')  # can parse both XML and HTML
    tr = soup.find('tr', attrs={'onmouseout': "this.style.background=''"})
    print(tr)
    tds = tr.find_all('td')
    print('Latest draw date:', tds[1].string)
    print('Red balls', end=':')
    for i in range(2, 8):
        print(tds[i].string, end=' ')
    print('\nBlue ball:', tds[8].string)
# Latest draw date: 2021 - 05 - 16
# Red balls: 07 10 14 16 24 33
# Blue ball: 16
if __name__ == '__main__':
    url = 'http://zst.aicai.com/ssq/openInfo'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    html = get_html(url, headers)
    get_soup(html)
Example: scraping images and displaying them in tkinter
Scrape the links of the top three cars in the ranking, follow each link to scrape an image, and display the images in a tkinter window where buttons switch between them.
URL: https://www.autohome.com.cn/channel2/bestauto/list.aspx?type=1
'''Scrape the car ranking and its images, and show them in tkinter'''
import requests
from bs4 import BeautifulSoup
# Fetch the page
def get_html(url, headers):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            print("Fetch succeeded, page length:", len(response.text))
            # response.encoding = 'utf-8'
            return response.text
    except BaseException as e:
        print("Error while fetching:", e)
# Get the links of the top three cars
def get_soup_top(html):
    soup = BeautifulSoup(html, 'lxml')  # can parse both XML and HTML
    Alllist = soup.select("div.pc_rank table tr td.n2 p a")
    top = []
    top.append(Alllist[0]['href'])
    top.append(Alllist[1]['href'])
    top.append(Alllist[2]['href'])
    return top
# Visit the top-three links and scrape the image URLs
def get_top_pic(top, headers):
    pics = []
    for i in range(len(top)):
        if "http" not in top[i]:
            top[i] = 'https:' + top[i]  # turn protocol-relative links into absolute URLs
        request = requests.get(top[i], headers=headers)
        soup1 = BeautifulSoup(request.text, features='lxml')  # can parse both XML and HTML
        pic = soup1.find('img', attrs={'width': '744'})
        if "http" not in pic['src']:
            pics.append("https:" + pic['src'])
        else:
            pics.append(pic['src'])
    # print(top)
    print(pics)
    return pics
# Save the images to disk
def save_pic(pics, headers):
    for i in range(len(pics)):
        path = "D://12119//pics//" + pics[i].split('/')[-1]
        r = requests.get(pics[i], headers=headers)
        with open(path, 'wb') as f:
            f.write(r.content)  # the with block closes the file automatically
# Show the images in a tkinter window
def show_pic():
    import tkinter, os
    from PIL import Image, ImageTk
    # Create the main window
    top = tkinter.Tk()
    top.geometry("500x309")
    text = tkinter.StringVar()
    path = 'D://12119//pics//'
    filelist = os.listdir(path)
    img_open = Image.open(path + filelist[0])
    img = ImageTk.PhotoImage(img_open)
    img_open2 = Image.open(path + filelist[1])
    img2 = ImageTk.PhotoImage(img_open2)
    img_open3 = Image.open(path + filelist[2])
    img3 = ImageTk.PhotoImage(img_open3)
    label_img = tkinter.Label(top, image=img)
    # Callbacks that switch the displayed image
    def one():
        label_img.configure(image=img)
    def two():
        label_img.configure(image=img2)
    def three():
        label_img.configure(image=img3)
    button1 = tkinter.Button(top, text="1", foreground='red', command=one,
                             activebackground='white', activeforeground='black').place(x=100, y=50)
    button2 = tkinter.Button(top, text="2", foreground='red', command=two,
                             activebackground='white', activeforeground='black').place(x=100, y=100)
    button3 = tkinter.Button(top, text="3", foreground='red', command=three,
                             activebackground='white', activeforeground='black').place(x=100, y=150)
    label_img.pack()
    top.mainloop()
if __name__ == '__main__':
    url = 'https://www.autohome.com.cn/channel2/bestauto/list.aspx?type=1'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    html = get_html(url, headers)
    top = get_soup_top(html)
    pics = get_top_pic(top, headers)
    save_pic(pics, headers)
    show_pic()
Result: clicking the buttons on the left switches between the images.
If this article helped you, please give it a like~