Data Extraction: Basic Usage of BeautifulSoup4

十五十六

Category: python爬虫 (Python web scraping). Tags: CSS, Beautiful Soup4, web scraping, python

Article link: https://blog.csdn.net/L835311324/article/details/86553419

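Introduction

Beautiful Soup is a Python library for extracting data from HTML and XML files. It works with the parser of your choice and provides idiomatic ways of navigating, searching, and modifying the document tree. It makes parsing HTML straightforward, its API is friendly, and it supports CSS selectors. It can use the HTML parser from the Python standard library as well as lxml's HTML and XML parsers.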

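Four commonly used objects

Tag: every tag in Beautiful Soup is a Tag, and the BeautifulSoup object itself is also a Tag. Methods such as find and find_all are in fact defined on Tag, not on BeautifulSoup.
NavigableString: inherits from Python's str and can be used just like a regular str.
BeautifulSoup: inherits from Tag and represents the whole parsed document tree.
Comment: inherits from NavigableString and represents an HTML or XML comment.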
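
The inheritance relationships listed above can be checked directly. The following is a minimal sketch; the one-line HTML string is made up for illustration, and `html.parser` is used here (rather than `lxml`) only so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# A tiny made-up document: one <p> holding a text node and a comment
soup = BeautifulSoup("<p>Hello<!-- a note --></p>", "html.parser")

p = soup.p                # a Tag
text, note = p.contents   # a NavigableString and a Comment

print(issubclass(BeautifulSoup, Tag))     # the soup object is itself a Tag
print(isinstance(p, Tag))
print(isinstance(text, NavigableString), isinstance(text, str))
print(isinstance(note, Comment), isinstance(note, NavigableString))
```

Because a Comment is also a NavigableString, code that collects text nodes may want to check for Comment explicitly when comments should be skipped.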

Basic usage

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

# Get all <p> tags
ps = soup.find_all('p')
for p in ps:
    print(p)
    print("= "*30)

# Get the third <p> tag (limit=3 stops the search after three matches)
p = soup.find_all('p',limit=3)[2]
print(p)

# Get all <p> tags whose class is "title"
# Method 1: the class_ keyword (class is a reserved word in Python)
p = soup.find_all('p',class_='title')
print(p)
# Method 2: the attrs dictionary
p = soup.find_all('p',attrs={'class':'title'})
print(p)

# Get the <a> tag whose id is "link3" and whose class is "sister"
# Method 1: keyword arguments
a = soup.find_all('a',class_ = "sister",id = "link3" )
print(a)
# Method 2: the attrs dictionary
a = soup.find_all('a',attrs = {'class':'sister','id':'link3'})
print(a)

# Get the href attribute of every <a> tag
alist = soup.find_all('a')
for a in alist:
    # 1. Subscript access
    href = a['href']
    print(href)
    # 2. Through the attrs dictionary
    href = a.attrs['href']
    print(href)
    # 3. The get() method
    href = a.get('href')
    print(href)

# Get the text inside each <a> tag
names = soup.find_all('a')
for name in names:
    print(name.string)
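
Two details of the calls above are easy to miss: `find` returns only the first match (or `None`), while `find_all` returns every match, and the three ways of reading an attribute behave differently when the attribute is absent. Here is a minimal sketch, assuming a made-up two-link snippet in which the second link has no `href`; `html.parser` is used only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second <a> deliberately has no href attribute
html = '<p><a href="http://example.com/elsie" id="link1">Elsie</a><a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")          # first match only, a single Tag
all_links = soup.find_all("a")  # every match, a list-like ResultSet
missing = soup.find("table")    # no match, so None

second = all_links[1]
safe = second.get("href")       # get() returns None for a missing attribute
try:
    second["href"]              # subscript access raises KeyError instead
    raised = False
except KeyError:
    raised = True

print(first["id"], len(all_links), missing, safe, raised)
```

When scraping real pages, where attributes are often missing, `get` is usually the more forgiving choice.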

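Parsers

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	1. Built into Python 2. Decent speed 3. Tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	1. Fast 2. Tolerant of malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "lxml-xml") or BeautifulSoup(markup, "xml")	1. Fast 2. The only parser that supports XML	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	1. Best tolerance of malformed documents 2. Parses documents the way a browser does 3. Produces valid HTML5	1. Slow 2. Must be installed separately (a pure-Python package, so no C extension is needed)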
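
To make the differences in error tolerance concrete, the sketch below feeds the same malformed fragment to each HTML parser from the table. The fragment `<a></p>` is an illustrative choice; lxml and html5lib are optional installs, so the code assumes they may be missing and skips them when bs4 raises `FeatureNotFound`.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A deliberately broken fragment: a stray closing </p> and an unclosed <a>
broken = "<a></p>"

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        results[parser] = None  # this parser is not installed

for parser, output in results.items():
    print(f"{parser:12} {output if output is not None else '(not installed)'}")
```

Each parser repairs the fragment in its own way, which is why it helps to name the parser explicitly instead of relying on whichever one happens to be installed.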

Example:

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')
print(soup.prettify())

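Differences between string, strings, stripped_strings, and get_text

# string: returns the single string directly inside a tag as a NavigableString. If the tag has more than one child, string is None; use contents or strings instead.
# strings: yields every string among the tag's descendants; returns a generator.
# stripped_strings: like strings, but strips leading and trailing whitespace and skips whitespace-only strings; returns a generator.
# get_text(): joins every string among the tag's descendants and returns one ordinary str.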
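
The four accessors described above can be compared side by side. This is a minimal sketch on a made-up paragraph with two links, parsed with `html.parser` so that only bs4 is needed.

```python
from bs4 import BeautifulSoup

# Made-up paragraph: text, a link, a comma plus a line break, another link
html = '<p class="story">Once <a id="link1">Elsie</a>,\n <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")
a = soup.find("a")

single = a.string                      # <a> has exactly one string child
none_for_many = p.string               # <p> has several children, so None
raw = list(p.strings)                  # every string, whitespace kept
stripped = list(p.stripped_strings)    # whitespace trimmed
joined = p.get_text()                  # one plain str

print(repr(single))         # 'Elsie'
print(none_for_many)        # None
print(raw)                  # ['Once ', 'Elsie', ',\n ', 'Lacie']
print(stripped)             # ['Once', 'Elsie', ',', 'Lacie']
print(repr(joined))         # 'Once Elsie,\n Lacie'
```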

Scraping operations engineer job postings from BOSS Zhipin

import requests
from bs4 import BeautifulSoup

# Search results page for the query 运维工程师 (operations engineer)
url = 'https://www.zhipin.com/c101280600/?query=运维工程师&page=1'
# Send a browser-like User-Agent with the request
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
r = requests.get(url, headers=headers)

soup = BeautifulSoup(r.text, 'lxml')

# Each job posting sits in a <div class="job-primary">
divs = soup.find_all('div', class_='job-primary')

jobs = []

for div in divs:
    p = div.find_all('p')
    # First <p>: location, experience, education
    info1 = list(p[0].stripped_strings)
    address, exp, edu = info1[0], info1[1], info1[2]
    # Second <p>: industry first, then funding stage and company size
    info2 = list(p[1].strings)
    category = info2[0]
    scale = " ".join(info2[1:])
    # Third <p>: publication date
    released = p[2].get_text()
    job = {
        'address': address,
        'exp': exp,
        'edu': edu,
        'category': category,
        'scale': scale,
        'released': released
    }
    jobs.append(job)

for job in jobs:
    print(job)

Partial page source

# The part matched by soup.find_all('div',class_='job-primary')

<li>
                                <div class="job-primary">
                                    <div class="info-primary">
                                        <h3 class="name">
                                            <a href="/job_detail/f9530c8151a4a52d1HZ83N6-FFI~.html" data-jid="f9530c8151a4a52d1HZ83N6-FFI~" data-itemid="1" data-lid="1ia40n0MBbj.search" data-jobid="32673340" data-index="1" ka="search_list_1" target="_blank">
                                                <div class="job-title">高级运维工程师</div>
                                                <span class="red">30k-60k</span>
                                                <div class="info-detail"></div>
                                            </a>
                                        </h3>
                                        <p>深圳  <em class="vline"></em>5-10年<em class="vline"></em>本科</p>
                                    </div>
                                    <div class="info-company">
                                        <div class="company-text">
                                            <h3 class="name"><a href="/gongsi/71f70f7aa52429bd33R43d28.html" ka="search_list_company_1_custompage" target="_blank">vivo</a></h3>
                                            <p>移动互联网<em class="vline"></em>不需要融资<em class="vline"></em>10000人以上</p>
                                        </div>
                                    </div>
                                    <div class="info-publis">
                                        <h3 class="name"><img src="https://img2.bosszhipin.com/boss/avatar/avatar_8.png?x-oss-process=image/resize,w_40,limit_0" />高先生<em class="vline"></em>运维</h3>
                                        <p>发布于01月15日</p>
                                    </div>
                                    <a href="javascript:;" data-url="/gchat/addRelation.json?jobId=f9530c8151a4a52d1HZ83N6-FFI~&lid=1ia40n0MBbj.search"
                                       redirect-url="/geek/new/index/chat?id=8464dfc5a8c2081a1HFz3Ny1F1Y~" target="_blank" class="btn btn-startchat">立即沟通
                                    </a>
                                </div>
                            </li>

Output
[Figure: screenshot of the program output]
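
The extraction steps can also be checked offline against a trimmed copy of the page source shown earlier. This is a sketch under assumptions: the snippet is a shortened version of that markup, `html.parser` replaces `lxml` so that only bs4 is required, and the `title` and `salary` fields are an addition that the scraping script does not extract.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one job-primary block from the page source
snippet = """
<div class="job-primary">
  <div class="info-primary">
    <h3 class="name"><a href="#"><div class="job-title">高级运维工程师</div><span class="red">30k-60k</span></a></h3>
    <p>深圳  <em class="vline"></em>5-10年<em class="vline"></em>本科</p>
  </div>
  <div class="info-company">
    <div class="company-text">
      <h3 class="name"><a href="#">vivo</a></h3>
      <p>移动互联网<em class="vline"></em>不需要融资<em class="vline"></em>10000人以上</p>
    </div>
  </div>
  <div class="info-publis">
    <h3 class="name">高先生<em class="vline"></em>运维</h3>
    <p>发布于01月15日</p>
  </div>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")
div = soup.find("div", class_="job-primary")
p = div.find_all("p")

address, exp, edu = list(p[0].stripped_strings)
info2 = list(p[1].strings)

job = {
    "title": div.find("div", class_="job-title").get_text(),   # added field
    "salary": div.find("span", class_="red").get_text(),       # added field
    "address": address,
    "exp": exp,
    "edu": edu,
    "category": info2[0],
    "scale": " ".join(info2[1:]),
    "released": p[2].get_text(),
}
print(job)
```

Note that `stripped_strings` is what removes the trailing spaces after 深圳 in the first paragraph; `strings` would keep them.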

CSS selectors

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc,'lxml')

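# Select by tag name
print(soup.select('a'))

# Select by class name: prefix the class name with '.'
print(soup.select('.sister'))

# Select by id: prefix the id with '#'
print(soup.select('#link1'))

# Combined selector: the element with id link1 anywhere inside a <p>
print(soup.select("p #link1"))

# Direct child selector: use '>'
print(soup.select("head > title"))

# Select by attribute: write the tag name first, then the attribute and its value in square brackets
print(soup.select('a[href="http://example.com/elsie"]'))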
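
`select` always returns a list, so pulling data out of the matches takes one more step. Here is a minimal sketch on a shortened copy of the sample document, using `select_one` for a single match; `html.parser` and the non-existent id `#link9` are choices made for illustration.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample document
html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# select_one returns the first matching Tag instead of a list
first_link = soup.select_one("a.sister")
href = first_link["href"]

# Combine a class selector with the direct child combinator, then read the text
names = [a.get_text() for a in soup.select("p.story > a.sister")]

# select_one returns None when nothing matches
no_match = soup.select_one("#link9")

print(href, names, no_match)
```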