Python Web Crawling: the Robots Protocol, URL Parsing with urllib.parse, and bs4 Examples (Old)

川野先生

Last modified: 2022-03-26 19:51:25

Category: Advanced Crawler Case Tutorials. Tags: python, crawler

First published: 2021-05-19 16:07:07

Article link: https://blog.csdn.net/to_upper/article/details/117015483

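Crawler protocol

The crawler protocol is the Robots protocol, formally the Robots Exclusion Standard.
    It tells search engines and other crawlers which pages may be fetched and which may not.
    The protocol is normally a plain-text file named robots.txt, placed in the root directory of the website.
    A well-behaved crawler first checks whether robots.txt exists in the site's root directory before crawling.
    If the file exists, the crawler follows its rules; if it does not, every page of the site is treated as crawlable.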

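1. Baidu's robots protocol

    It can be viewed directly at https://www.baidu.com/robots.txt
    Sample robots.txt entry (Baidu):
    User-agent: Baiduspider     the crawler that the rules below apply to, here Baiduspider; * would match any crawler
    Disallow: /baidu            Disallow gives a path prefix that must not be fetched; / (the root) would block the whole site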
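
As a minimal offline sketch of how such rules are evaluated, the snippet below feeds a small rule set to RobotFileParser.parse() instead of downloading a file. The rules are hypothetical and written only for illustration (they are not Baidu's actual robots.txt), and OtherBot is a made-up crawler name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules written for illustration; not Baidu's actual robots.txt.
rules = """\
User-agent: Baiduspider
Disallow: /baidu

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())   # parse() accepts the file content as a list of lines

spider_blocked_path = rp.can_fetch("Baiduspider", "https://www.baidu.com/baidu")
spider_other_path = rp.can_fetch("Baiduspider", "https://www.baidu.com/index.html")
other_bot = rp.can_fetch("OtherBot", "https://www.baidu.com/index.html")

print(spider_blocked_path)  # False: /baidu is disallowed for Baiduspider
print(spider_other_path)    # True: no rule for Baiduspider covers this path
print(other_bot)            # False: the * entry disallows everything
```

A crawler that has its own User-agent entry uses only that entry; the * entry applies to crawlers that are not named anywhere else.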

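2. Parsing the crawler protocol

'''Parse the crawler protocol.
    RobotFileParser reads a site's robots.txt file and decides whether a crawler is allowed to fetch a URL.
'''
from urllib.robotparser import RobotFileParser
rp = RobotFileParser("https://www.136book.com/robots.txt")
rp.read()   # read() downloads and parses robots.txt; it returns nothing, but it must be called before can_fetch()

'''Check whether a given URL may be fetched'''
# can_fetch(self, useragent, url)
flag = rp.can_fetch('*', 'https://www.136book.com/anlianjushenghuainanquanji/qlrcxeqleb/')
print(flag) # True (at the time of writing)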

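The parse module in urllib

The urllib library provides the parse module, a standard interface for handling URLs: for example, extracting the components of a URL, joining them back together, and converting links.
        urlparse(url,scheme,allow_fragments): recognizes a URL and splits it into components
        url: the URL to parse
        scheme: the default scheme, used only when the URL itself does not contain one
        allow_fragments: whether the fragment (anchor) is parsed as a separate component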

1. Parsing a URL

The urlparse function in urllib.parse parses a URL and splits it into its components.

from urllib.parse import urlparse
result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
# result = urlparse("http://www.baidu.com/index.html;user?id=5#comment",scheme='',allow_fragments=False)
print(type(result),result)
# <class 'urllib.parse.ParseResult'>
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
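'''Breakdown of the result printed above
scheme: protocol
netloc: network location (domain name)
path: path
params: parameters of the last path segment (after ;)
query: query string after ? (typically used in GET URLs)
fragment: anchor after #, which jumps to a position within the page
Format: scheme://netloc/path;params?query#fragment
'''
result = urlparse('www.ZangYushun.com',scheme='https')
print(result)
# ParseResult(scheme='https', netloc='', path='www.ZangYushun.com', params='', query='', fragment='')
# Without a leading //, the host name is treated as a path, so netloc stays empty.

# allow_fragments controls whether the fragment (anchor) is parsed separately
result = urlparse("http://www.baidu.com/index.html;user?id=5#comment",scheme='',allow_fragments=False)
print(result.scheme,result[0],result.netloc,result[1],sep='\n')
# The result is a named tuple, so each field can be read by attribute or by index.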
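
To make the effect of the two optional arguments concrete, here is a small sketch. The www.example.com URLs are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse

url = "http://www.baidu.com/index.html;user?id=5#comment"

with_fragment = urlparse(url)
without_fragment = urlparse(url, allow_fragments=False)
print(with_fragment.query, with_fragment.fragment)               # id=5 comment
print(without_fragment.query, repr(without_fragment.fragment))   # id=5#comment ''

# scheme= is only a fallback: it is used when the URL has no scheme of its own.
fallback = urlparse("//www.example.com/path", scheme="https")    # hypothetical URL
explicit = urlparse("http://www.example.com/path", scheme="https")
print(fallback.scheme, fallback.netloc)   # https www.example.com
print(explicit.scheme)                    # http
```

With allow_fragments=False the text after # is not split off, so it stays attached to the query.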

2. Reverse parsing with urlunparse()

urlunparse(): takes an iterable as its argument; it must contain exactly 6 items, otherwise an exception is raised

'''scheme, netloc, path, params, query, fragment'''
from urllib.parse import urlunparse
data = ['http','www.baidu.com','index.html','user','a=6','comment']
print(urlunparse(data)) # http://www.baidu.com/index.html;user?a=6#comment

from urllib.parse import urlsplit
# urlsplit() is similar to urlparse(), except that it does not split out params separately, so it returns only 5 components
result = urlsplit('http://www.baidu.com/index.html;user?a=6#comment')
print(result)   # SplitResult(scheme='http', netloc='www.baidu.com', pa
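
The following sketch checks the two length rules side by side: urlsplit() yields 5 components with params left inside path, and urlunparse() rejects an iterable that does not have 6 items. It uses the same URL as above; urlunsplit() is the standard-library counterpart of urlsplit() and is not used in the original text.

```python
from urllib.parse import urlparse, urlsplit, urlunparse, urlunsplit

url = "http://www.baidu.com/index.html;user?a=6#comment"

parts = urlsplit(url)
print(len(parts))            # 5
print(parts.path)            # /index.html;user  (params stay inside path)
print(len(urlparse(url)))    # 6

# urlunsplit() is the inverse of urlsplit() and takes the 5 components back.
rebuilt = urlunsplit(parts)
print(rebuilt == url)        # True

# urlunparse() needs exactly 6 components; 5 items cannot be unpacked.
try:
    urlunparse(["http", "www.baidu.com", "index.html", "user", "a=6"])
    error_name = None
except ValueError as exc:
    error_name = type(exc).__name__
print(error_name)            # ValueError
```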

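bs4.BeautifulSoup

Like lxml, BeautifulSoup is an HTML/XML parser whose main job is to parse HTML/XML and extract data from it.
    lxml only traverses part of the document, whereas BeautifulSoup is built on the HTML DOM: it loads the whole document and parses the entire DOM tree.
    Its time and memory overhead are therefore higher, so it performs worse than lxml.

    BeautifulSoup makes parsing HTML fairly simple, and its API is user-friendly.
    It supports CSS selectors and can use either the HTML parser in the Python standard library or lxml's HTML and XML parsers.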

1. Parsing a web page with BeautifulSoup

from bs4 import BeautifulSoup
html = '''
<html><head><title>The title</title></head>
<body>
<p class = "title" name="dromouse"><b>The Dormouse's story</b></p>
<p class = "story">
    <a href = "https://www.baidu.com" class = "sister" id = "link1">1<!--测试注释--></a>
    <a href = "https://www.baidu.com" class = "sister" id = "link2">2</a>
    <a href = "https://www.baidu.com" class = "sister" id = "link3">3</a>
ceshi2</p>
<p class = "story">...</p>
'''
# Create the BeautifulSoup object
soup = BeautifulSoup(html,features='lxml')
# soup = BeautifulSoup(open('index.html'))

# Pretty-print the soup; missing closing tags are filled in
print(soup.prettify())
print(soup.title)
print(soup.head)
print(soup.a)
print(type(soup.p))
print(soup.head.name)   # for a tag, .name is the tag's own name: head
print(soup.p.name)      # p
'''Output of the code above (after the prettify() output):
	<title>The title</title>
	<head><title>The title</title></head>
	<a class="sister" href="https://www.baidu.com" id="link1">1<!--测试注释--></a>
	<class 'bs4.element.Tag'>
	head
	p
'''
# Print all attributes of the first p tag as a dict
print(soup.p.attrs) # {'class': ['title'], 'name': 'dromouse'}
print(soup.p['class'])  # ['title']
print(soup.p.get('class'))  # ['title']

# Modify an attribute
soup.p['class']='newClass'
print(soup.p)  # <p class="newClass" name="dromouse"><b>The Dormouse's story</b></p>
# Delete an attribute
del soup.p['class']
print(soup.p)   #  <p name="dromouse"><b>The Dormouse's story</b></p>
# Print the content of a tag
print(soup.p.string)    # The Dormouse's story
print(type(soup.name))  # <class 'str'>
print(soup.name)        # [document]
print(soup.a.string)    # None, because the first a tag holds both a string and a comment

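2. bs4 examples

BeautifulSoup converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into 4 kinds:
    1. Tag
    2. NavigableString
    3. BeautifulSoup
    4. Comment

    Tag: a tag in the HTML; each tag becomes a Tag object.
        Writing soup followed by a tag name (for example soup.title) returns that tag; its type is bs4.element.Tag.
        Note that this returns only the first matching tag in the document, not all of them.

    NavigableString: the text inside a tag, obtained with .string.

    BeautifulSoup: represents the whole document. Most of the time it can be treated as a Tag object; it is a special Tag.
    print(type(soup)) # <class 'bs4.BeautifulSoup'>

    Comment: a special kind of NavigableString whose printed content does not include the comment markers. If a tag contains a comment alongside other text, .string returns None.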
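
The four object kinds can be checked with a short sketch like the one below. The markup is hypothetical and kept to one line, and html.parser is used here only so that the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup for illustration.
html = '<p><a id="link1">1<!--note--></a><a id="link2">2</a><b><!--only a comment--></b></p>'
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__, isinstance(soup, Tag))   # BeautifulSoup True
print(type(soup.a).__name__)                        # Tag  (the first <a> only)

link2_text = soup.find(id="link2").string
print(type(link2_text).__name__, link2_text)        # NavigableString 2

only_comment = soup.b.string
print(type(only_comment).__name__, only_comment)    # Comment only a comment
print(isinstance(only_comment, NavigableString))    # True

# A tag holding both text and a comment has two children, so .string is None.
mixed = soup.a.string
print(mixed)                                        # None
```

Checking isinstance(x, Comment) before using .string is one way to avoid treating comment text as page content.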

——————————————————————————————————————————————————————————

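Traversing the document tree:
    1. Direct children: the .contents and .children attributes
        .contents: returns the tag's direct children as a list
        .children: does not return a list but an iterator; loop over it to get all direct children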
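
A minimal sketch of the difference between the two attributes, using a hypothetical one-line snippet and html.parser (chosen only to avoid an extra dependency):

```python
from bs4 import BeautifulSoup

html = '<p class="story"><a id="link1">1</a><a id="link2">2</a>ceshi2</p>'
soup = BeautifulSoup(html, "html.parser")

# .contents is a real list, so it supports len() and indexing.
contents = soup.p.contents
print(type(contents).__name__, len(contents))   # list 3
print(contents[0]["id"])                        # link1

# .children is not a list; it has to be consumed by iterating.
children = soup.p.children
children_is_list = isinstance(children, list)
print(children_is_list)                         # False
child_types = [type(child).__name__ for child in children]
print(child_types)                              # ['Tag', 'Tag', 'NavigableString']
```

Both attributes cover direct children only; text nodes such as ceshi2 are included alongside tags.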

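Searching the document tree:
    1. find_all(name, attrs, recursive, text, limit, **kwargs)
        name: finds every Tag whose name is name; string nodes are ignored automatically
        list argument: a tag is returned if it matches any element of the list
        text: matches string nodes instead of tags (since Beautiful Soup 4.4.0 this argument is also available as string)

    2. select(): search with CSS selectors
        ① by tag name
        ② by class name
        ③ by id
        ④ combined: as in a CSS file, tag names can be combined with class names and ids
        ⑤ by attribute: attribute conditions go in square brackets. The attribute belongs to the same node as the tag,
               so there must be no space between them, otherwise nothing matches
            print(soup.select('a[class="sister"]'))
        ⑥ getting the content: call get_text() on a selected element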
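
The sketch below illustrates the list argument of find_all() and the point about spaces in attribute selectors. The markup is a shortened, hypothetical variant of the sample page, and html.parser is used only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

html = '''
<p class="story">
    <a href="https://www.baidu.com" class="sister" id="link1">1</a>
    <a href="https://www.baidu.com" class="sister" id="link2">2</a>
    <b>bold</b>
</p>
'''
soup = BeautifulSoup(html, "html.parser")

# A list matches any of the names; limit caps the number of results.
names = [tag.name for tag in soup.find_all(["a", "b"])]
print(names)                                    # ['a', 'a', 'b']
limited = soup.find_all("a", limit=1)
print(len(limited))                             # 1

# No space: the attribute condition applies to the <a> tag itself.
same_node = soup.select('a[class="sister"]')
print(len(same_node))                           # 2
# With a space it means "a descendant of <a> with that class", which matches nothing here.
descendant = soup.select('a [class="sister"]')
print(len(descendant))                          # 0

ids = [a["id"] for a in soup.select("p.story > a")]
print(ids)                                      # ['link1', 'link2']
link2_text = soup.select("#link2")[0].get_text()
print(link2_text)                               # 2
```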

Example verifying the methods above:

from bs4 import BeautifulSoup
html = '''
<html><head><title>The title</title></head>
<body>
<p class = "title" name="dromouse"><b>The Dormouse's story</b></p>
<p class = "story">
    <a href = "https://www.baidu.com" class = "sister" id = "link1">1<!--测试注释--></a>
    <a href = "https://www.baidu.com" class = "sister" id = "link2">2</a>
    <a href = "https://www.baidu.com" class = "sister" id = "link3">3</a>
ceshi2</p>
<p class = "story">...</p>
'''
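# Create the BeautifulSoup object
soup = BeautifulSoup(html,features='lxml')
# print(soup.body.contents)
# print(soup.body.contents[1])
#
# soup.find_all()
#
# for child in soup.body.children:    # iterates over the direct children of body: the p tags and the whitespace strings between them
#     print(child)
#
# text = soup.find_all('a')
# print(text)

# Each assignment below overwrites text; print it after any line to inspect that result
text = soup.find_all(['a','b'])                 # every a tag and every b tag
text = soup.find_all(id = 'link2')              # tags whose id is link2
text = soup.find_all(text = '1')                # strings equal to '1'
text = soup.find_all(text = ['1','2','3'])      # strings equal to any item in the list
text = soup.select('title')                     # by tag name
text = soup.select('.sister')                   # by class name
text = soup.select('#link1')                    # by id
text = soup.select('p #link1')                  # combined: the element with id link1 inside a p tag
text = soup.select('head > title')              # a title that is a direct child of head
text = soup.select('a[href="https://www.baidu.com"]')  # by attribute

# print(text.get_text())    # fails: select() returns a list of tags, so pick one element first
print(soup.select('a[href="https://www.baidu.com"]'))
text = soup.select('a[href="https://www.baidu.com"]')[0].get_text()
print(text)

for title in soup.select('title'):
    print(title.get_text())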

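Example: scraping Double Color Ball lottery results

Open the page: http://zst.aicai.com/ssq/openInfo
Scrape the latest Double Color Ball draw (2021-5-16 at the time of writing):

1. Scraping the data with select()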

'''Scrape the most recent Double Color Ball draw and print the red and blue ball numbers'''
import requests
from bs4 import BeautifulSoup
# Fetch the page
def get_html(url, headers):
    try:
        # headers must be passed by keyword: the second positional argument of requests.get() is params
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            response.encoding = 'utf-8'
            print("Fetched successfully, page length:", len(response.text))
            return response.text
        print("Unexpected status code:", response.status_code)
    except requests.RequestException as e:
        print("Failed to fetch the page:", e)
    return None
# Parse the data
def get_soup(html):
    soup = BeautifulSoup(html,'lxml')   # lxml's HTML parser
    # soup = BeautifulSoup(html,'html.parser')  # the HTML parser from the standard library
    #date = soup.select('div.dataContent form div.mainTab.mainTab_lskj table tbody tr td')
    date = soup.select('tbody tr:nth-child(3) td:nth-child(2)')
    print(date)
    red = soup.select('tbody tr:nth-child(3) td.redColor')
    print(red)
    blue = soup.select('tbody tr:nth-child(3) td.blueColor')
    print(blue)
    print("Latest date",end=':')
    for td in date:
        print(td.string,end=' ')
    print("\nRed balls",end=':')
    for td in red:
        print(td.string,end=' ')
    print("\nBlue ball",end=':')
    for td in blue:
        print(td.string,end=' ')
if __name__ == '__main__':
    url = 'http://zst.aicai.com/ssq/openInfo'
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    html = get_html(url,headers)
    if html:    # get_html() returns None when the request fails
        get_soup(html)
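
Because the live page may change or be unavailable, the selectors can be tried offline first. The sketch below assumes a simplified, hypothetical table layout (two header rows followed by one row per draw); only the class names redColor and blueColor and the 2021-05-16 numbers are taken from this article.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two header rows, then one row per draw (newest first).
html = """
<table><tbody>
<tr><th>Label</th><th>Date</th><th colspan="6">Red</th><th>Blue</th></tr>
<tr><th colspan="9">sub-header</th></tr>
<tr><td>latest</td><td>2021-05-16</td>
    <td class="redColor">07</td><td class="redColor">10</td><td class="redColor">14</td>
    <td class="redColor">16</td><td class="redColor">24</td><td class="redColor">33</td>
    <td class="blueColor">16</td></tr>
<tr><td>older</td><td>placeholder</td>
    <td class="redColor">00</td><td class="blueColor">00</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

row = "tbody tr:nth-child(3)"   # the third <tr> inside <tbody>
date = [td.get_text() for td in soup.select(row + " td:nth-child(2)")]
red = [td.get_text() for td in soup.select(row + " td.redColor")]
blue = [td.get_text() for td in soup.select(row + " td.blueColor")]

print(date)   # ['2021-05-16']
print(red)    # ['07', '10', '14', '16', '24', '33']
print(blue)   # ['16']
```

nth-child counts element children only, so the whitespace between rows does not shift the index.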

2. Scraping element data with find()

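import requests
from bs4 import BeautifulSoup
# Fetch the page
def get_html(url, headers):
    try:
        # headers must be passed by keyword: the second positional argument of requests.get() is params
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            response.encoding = 'utf-8'
            print("Fetched successfully, page length:", len(response.text))
            return response.text
        print("Unexpected status code:", response.status_code)
    except requests.RequestException as e:
        print("Failed to fetch the page:", e)
    return None
# Parse the data
def get_soup(html):
    soup = BeautifulSoup(html,'lxml')   # lxml's HTML parser
    # find() returns only the first row with this attribute, which is the latest draw
    tr = soup.find('tr', attrs={'onmouseout':"this.style.background=''"})
    if tr is None:
        print("No matching row found; the page layout may have changed")
        return
    print(tr)
    tds = tr.find_all('td')
    print('Latest date:',tds[1].string)
    print('Red balls',end=':')
    for i in range(2,8):
        print(tds[i].string,end=' ')
    print('\nBlue ball:',tds[8].string)
    # Latest date: 2021 - 05 - 16
    # Red balls: 07 10 14 16 24 33
    # Blue ball: 16

if __name__ == '__main__':
    url = 'http://zst.aicai.com/ssq/openInfo'
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    html = get_html(url,headers)
    if html:    # get_html() returns None when the request fails
        get_soup(html)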
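
The find() lookup by attribute can also be checked offline. The table below is hypothetical and much shorter than the real page; it only reuses the onmouseout attribute value that the code above searches for.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a header row, then data rows carrying the onmouseout attribute.
html = """
<table>
<tr class="head"><td>No.</td><td>Date</td><td>R1</td><td>R2</td><td>Blue</td></tr>
<tr onmouseout="this.style.background=''"><td>1</td><td>2021-05-16</td><td>07</td><td>10</td><td>16</td></tr>
<tr onmouseout="this.style.background=''"><td>2</td><td>older</td><td>00</td><td>00</td><td>00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match, so only the newest row is returned.
row = soup.find("tr", attrs={"onmouseout": "this.style.background=''"})
cells = [str(td.string) for td in row.find_all("td")]
print(cells[1])      # 2021-05-16
print(cells[2:4])    # ['07', '10']
print(cells[4])      # 16

# When nothing matches, find() returns None instead of raising.
missing = soup.find("tr", attrs={"onmouseout": "no-such-value"})
print(missing)       # None
```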

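Example: scraping pictures and showing them with tkinter

Scrape the links of the top three cars in the ranking, follow each link to scrape a picture, and show the pictures in tkinter, switching between them by clicking buttons.

URL: https://www.autohome.com.cn/channel2/bestauto/list.aspx?type=1
[Figure: car ranking]
[Figure: car pictures]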
'''Scrape the car ranking and its pictures, then display them in tkinter'''
import os
import requests
from bs4 import BeautifulSoup
# Fetch the page
def get_html(url, headers):
    try:
        # headers must be passed by keyword: the second positional argument of requests.get() is params
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            print("Fetched successfully, page length:", len(response.text))
            # response.encoding = 'utf-8'
            return response.text
        print("Unexpected status code:", response.status_code)
    except requests.RequestException as e:
        print("Failed to fetch the page:", e)
    return None
# Get the links of the top 3 cars
def get_soup_top(html):
    soup = BeautifulSoup(html,'lxml')   # lxml's HTML parser
    Alllist = soup.select("div.pc_rank table tr td.n2 p a")
    return [a['href'] for a in Alllist[:3]]
# Visit the top-3 links and collect the picture URLs
def get_top_pic(top,headers):
    pics = []
    for link in top:
        # Links may be protocol-relative (//host/path), so add the scheme when it is missing
        if not link.startswith("http"):
            link = 'https:' + link
        response = requests.get(link, headers=headers, timeout=10)
        soup1 = BeautifulSoup(response.text, features='lxml')
        pic = soup1.find('img', attrs={'width': '744'})
        if pic is None:
            continue    # no matching picture on this page
        # Check the src attribute itself, not the Tag object
        src = pic['src']
        if not src.startswith("http"):
            src = "https:" + src
        pics.append(src)
    # print(top)
    print(pics)
    return pics

def save_pic(pics, headers):
    os.makedirs("D://12119//pics//", exist_ok=True)    # make sure the target directory exists
    for url in pics:
        path = "D://12119//pics//" + url.split('/')[-1]
        r = requests.get(url, headers=headers, timeout=10)
        with open(path, 'wb') as f:     # the with block closes the file automatically
            f.write(r.content)

def show_pic():
    import tkinter
    from PIL import Image, ImageTk
    # Create the main window, top
    top = tkinter.Tk()
    top.geometry("500x309")
    path = 'D://12119//pics//'
    filelist = os.listdir(path)
    # Keep the PhotoImage objects in variables so they are not garbage-collected
    img = ImageTk.PhotoImage(Image.open(path + filelist[0]))
    img2 = ImageTk.PhotoImage(Image.open(path + filelist[1]))
    img3 = ImageTk.PhotoImage(Image.open(path + filelist[2]))
    label_img = tkinter.Label(top, image=img)
    def one():
        label_img.configure(image = img)
    def two():
        label_img.configure(image = img2)
    def three():
        label_img.configure(image = img3)
    # Each button switches the label to the corresponding picture
    tkinter.Button(top, text="1", foreground='red',command = one,
                   activebackground='white', activeforeground='black').place(x=100,y=50)
    tkinter.Button(top, text="2", foreground='red',command = two,
                   activebackground='white', activeforeground='black').place(x=100,y=100)
    tkinter.Button(top, text="3", foreground='red',command = three,
                   activebackground='white', activeforeground='black').place(x=100,y=150)
    label_img.pack()

    top.mainloop()

if __name__ == '__main__':
    url = 'https://www.autohome.com.cn/channel2/bestauto/list.aspx?type=1'
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    html = get_html(url,headers)
    if html:    # get_html() returns None when the request fails
        top = get_soup_top(html)
        pics = get_top_pic(top, headers)
        save_pic(pics, headers)
        show_pic()
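
The scheme fix-up for links and picture paths can also be done with urljoin() from the standard library, which resolves a link against the page it was found on. This is an alternative sketch, not what the code above does; the example.com links and the helper name file_name_from_url are hypothetical.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

page_url = "https://www.autohome.com.cn/channel2/bestauto/list.aspx?type=1"

# Hypothetical href/src values of the kinds a page may contain.
candidates = [
    "//car.example.com/pics/a.jpg",          # protocol-relative
    "/spec/123/",                            # root-relative
    "https://img.example.com/pics/b.jpg",    # already absolute
]
absolute = [urljoin(page_url, link) for link in candidates]
for link in absolute:
    print(link)
# https://car.example.com/pics/a.jpg
# https://www.autohome.com.cn/spec/123/
# https://img.example.com/pics/b.jpg

def file_name_from_url(url):
    """Hypothetical helper: take the last path segment and ignore any query string."""
    return posixpath.basename(urlsplit(url).path)

name = file_name_from_url("https://img.example.com/pics/b.jpg?size=large")
print(name)   # b.jpg
```

Unlike prepending 'https:', urljoin() also handles root-relative paths and leaves absolute URLs unchanged.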