The Data Road - Python Web Scraping - The BeautifulSoup Library

Latest recommended article published on 2024-04-15 16:51:42

weixin_30706691

Tags: python, web scraping, c/c++

Original link: http://www.cnblogs.com/Iceredtea/p/11286170.html

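I. Introduction to BeautifulSoup

Beautiful Soup is a Python library for parsing HTML and XML, which makes it convenient to extract data from web pages. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.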

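Workflow:
    - Import: from bs4 import BeautifulSoup
    - Usage: convert an HTML document into a BeautifulSoup object, then use that object's methods and attributes to look up the nodes you want
        (1) Parsing a local file:
             - soup = BeautifulSoup(open('local file', encoding='utf-8'), 'lxml')
        (2) Parsing content fetched from the network:
             - soup = BeautifulSoup('a str or bytes object', 'lxml')
        (3) Printing the soup object displays the content of the HTML document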
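The two ways of building a soup object can be sketched roughly as follows. This is a minimal illustration, not the original author's code: the file name sample.html and the markup string are made up for the example, and html.parser is used instead of lxml only so that nothing beyond bs4 needs to be installed.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page.
markup = "<html><body><p id='greeting'>hello</p></body></html>"

# (1) Local file: write a throwaway file, then hand the open file object
#     to BeautifulSoup, which reads it inside the constructor.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "sample.html"  # hypothetical file name
    path.write_text(markup, encoding="utf-8")
    with open(path, encoding="utf-8") as fh:
        soup_from_file = BeautifulSoup(fh, "html.parser")

# (2) Network content: either a str or a bytes object, such as a response body.
soup_from_str = BeautifulSoup(markup, "html.parser")
soup_from_bytes = BeautifulSoup(markup.encode("utf-8"), "html.parser")

print(soup_from_file.p.string)        # hello
print(soup_from_str.p["id"])          # greeting
print(soup_from_bytes.p.get_text())   # hello
```

Opening the file with an explicit encoding and inside a with block avoids platform-dependent decoding and leaked file handles.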
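Basics:
    (1) Look up by tag name
        - soup.a   returns only the first matching tag
    (2) Get attributes
        - soup.a.attrs  returns all of the tag's attributes and their values as a dict
        - soup.a.attrs['href']   gets the href attribute
        - soup.a['href']   shorthand for the line above
    (3) Get content
        - soup.a.string
        - soup.a.text
        - soup.a.get_text()
       [Note] If a tag has more than one child node (for example, text mixed with nested tags), string returns None, while the other two still return the combined text
    (4) find: returns the first matching tag
        - soup.find('a')  the first <a> tag
        - soup.find('a', title="xxx")
        - soup.find('a', alt="xxx")
        - soup.find('a', class_="xxx")
        - soup.find('a', id="xxx")
    (5) find_all: returns all matching tags
        - soup.find_all('a')
        - soup.find_all(['a','b']) all <a> and <b> tags
        - soup.find_all('a', limit=2)  at most the first two matches
    (6) Select content with CSS selectors
        - select: soup.select('#feng')
        - Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
            - Hierarchy selectors:
                div .dudu #lala .meme .xixi  descendants at any depth
                div > p > a > .lala          direct children only
        [Note] select always returns a list, so use an index to pull out a specific element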
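The points in the list above can be tried out on a small document. The snippet below is only a sketch: the markup, the id feng, and the class name link are invented for illustration, and html.parser is assumed so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> mixes text with a nested <b> tag.
doc = """
<div id="feng">
  <a href="/one" class="link" title="first">One</a>
  <a href="/two" class="link">Two <b>bold</b></a>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

first = soup.a                      # tag-name lookup: first <a> only
print(first.attrs)                  # all attributes as a dict
print(first["href"])                # /one
print(first.string)                 # One

second = soup.find_all("a")[1]
print(second.string)                # None, because <a> has two child nodes
print(second.get_text())            # Two bold

print(soup.find("a", title="first")["href"])     # /one
print(len(soup.find_all("a", limit=1)))          # 1
print(soup.select("#feng > a.link")[1]["href"])  # /two
```

Note that first.attrs["class"] is a list, since class is a multi-valued attribute.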

II. A simple BeautifulSoup example

from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''
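soup = BeautifulSoup(html,'lxml')
print(soup.prettify())            # the repaired document, indented
print(soup.title)                 # the whole <title> tag
print(soup.title.name)            # tag name: title
print(soup.title.string)          # text inside <title>
print(soup.title.parent.name)     # name of the parent tag: head
print(soup.p)                     # first <p> tag
print(soup.p["class"])            # class is multi-valued, so this is a list: ['title']
print(soup.a)                     # first <a> tag
print(soup.find_all('a'))         # all three <a> tags
print(soup.find(id='link3'))      # the tag whose id is link3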

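III. Parsers supported by Beautiful Soup

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; decent speed; lenient with malformed documents	Less lenient in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	BeautifulSoup(markup, "xml")	Very fast; the only XML parser supported	Requires an external C library
html5lib	BeautifulSoup(markup, "html5lib")	Most lenient; parses documents the way a browser does; produces valid HTML5	Very slow; requires an external Python package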
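Because lxml and html5lib are optional installs, one possible way to cope is to try parsers in order of preference and fall back to the built-in one. The helper make_soup below is a hypothetical name introduced for this sketch and is not part of bs4; it relies on the FeatureNotFound exception that bs4 raises when a requested parser is not available.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Hypothetical helper: returns a (soup, parser_name) pair.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


# Deliberately unclosed tags: each parser repairs them in its own way.
soup, parser_used = make_soup("<p>Hello<b>world")
print(parser_used)
print(soup.b.string)   # world
```

Different parsers may build different trees from the same broken markup, so naming the parser explicitly keeps results reproducible across machines.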

IV. Basic BeautifulSoup usage

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

1. Tag selectors

Writing soup.<tag name> returns that tag together with its contents.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.head)
print(soup.p)    # if there are several <p> tags, only the first is returned

2. Tag selectors: getting the name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

3. Tag selectors: getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
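One detail that is easy to trip over is that tag.name is the tag's own name, while an HTML attribute that happens to be called name is read with square brackets. A small sketch, using made-up markup and assuming html.parser:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document with a multi-valued class attribute.
soup = BeautifulSoup(
    '<p class="title intro" name="dromouse"><b>Hi</b></p>', "html.parser"
)
p = soup.p

print(p.name)            # p         (the tag name)
print(p["name"])         # dromouse  (the HTML attribute called name)
print(p.attrs["name"])   # dromouse
print(p["class"])        # ['title', 'intro']
print(p.get("id"))       # None; p["id"] would raise KeyError instead
```

Using get() is a reasonable choice when an attribute may be missing on some tags.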

4. Child and descendant nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)        # direct children, as a list

print(soup.p.children)        # direct children, as an iterator
for i,child in enumerate(soup.p.children):
    print(i,child)

print(soup.p.descendants)     # all descendants, produced lazily
for i,child in enumerate(soup.p.descendants):
    print(i,child)

5. Parent, ancestor, and sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.a.parent)                                # the direct parent
print(list(enumerate(soup.a.parents)))              # all ancestors

print(list(enumerate(soup.a.next_siblings)))        # all following siblings
print(list(enumerate(soup.a.previous_siblings)))    # all preceding siblings
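The navigation attributes are easier to follow on a document small enough to count by hand. The markup below is invented for this sketch, and html.parser is assumed so that no wrapper tags are added around it.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> holding text and two links.
doc = '<div><p id="story">Once <a id="link1">Elsie</a>, <a id="link2">Lacie</a> end</p></div>'
soup = BeautifulSoup(doc, "html.parser")
p = soup.p

# Direct children: "Once ", <a>, ", ", <a>, " end"
print(len(p.contents))                       # 5
# Descendants also include the text inside each <a>
print(len(list(p.descendants)))              # 7

first_a = soup.a
print(first_a.parent["id"])                  # story
print([t.name for t in first_a.parents])     # ['p', 'div', '[document]']

# Siblings include plain text nodes, not only tags.
print(list(first_a.next_siblings))
print(list(first_a.previous_siblings))
# To skip the text nodes, ask for the next sibling tag by name.
print(first_a.find_next_sibling("a")["id"])  # link2
```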

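V. Method selectors

find_all() searches the document by tag name, attributes, or text content.
find_all(name, attrs, recursive, string, limit, **kwargs)    # older code passes text= instead of string=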

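import re

# soup here is assumed to come from a document that contains <ul> elements

# Query by tag name
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

# Query by attribute
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# Query by text (string= was called text= in older releases)
print(soup.find_all(string=re.compile('link')))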

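find_all()                  # returns all matching elements
find()                      # returns the first matching element
find_parents()              # returns all matching ancestors
find_parent()               # returns the nearest matching ancestor (the direct parent when called with no arguments)
find_next_siblings()        # returns all following siblings
find_next_sibling()         # returns the first following sibling
find_previous_siblings()    # returns all preceding siblings
find_previous_sibling()     # returns the first preceding sibling
find_all_next()             # returns all matching nodes after this node
find_next()                 # returns the first matching node after this node
find_all_previous()         # returns all matching nodes before this node
find_previous()             # returns the first matching node before this node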
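A few of the methods listed above, run against a small document, might look like this. The markup is made up for the sketch (the name="elements" attribute and the word "link" in the text are there only so every query has something to match), and html.parser is assumed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup with two lists.
doc = """
<div class="panel">
  <ul class="list" id="list-1" name="elements">
    <li class="element">Foo link</li>
    <li class="element">Bar</li>
  </ul>
  <ul class="list" id="list-2">
    <li class="element">Jay link</li>
  </ul>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

# Query by tag name.
uls = soup.find_all(name="ul")
print(len(uls), type(uls[0]))

# Query by attribute. The attrs dict is needed for an attribute called
# "name", because name= already means the tag name.
print(soup.find_all(attrs={"id": "list-1"})[0]["name"])     # elements
print(soup.find_all(attrs={"name": "elements"})[0]["id"])   # list-1

# Query by text: returns the matching strings, not the tags around them.
texts = soup.find_all(string=re.compile("link"))
print(texts)

# Relative searches starting from a node.
first_li = soup.find("li")
print(first_li.find_parent("ul")["id"])          # list-1
print(first_li.find_next_sibling("li").string)   # Bar
print(first_li.find_next("ul")["id"])            # list-2
```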

VI. CSS selectors

Pass a CSS selector straight to select() to make the selection.

html= '''
<div class='panel'>
    <div class='panel-heading'>
        <h4>Hello</h4>
    </div>    
    <div class='panel-body'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
        </ul>
    </div>
</div>
'''

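1. Selecting tags

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))    # nodes with class panel-heading inside .panel
print(soup.select('ul li'))                    # every <li> inside a <ul>
print(soup.select('#list-2 .element'))         # .element nodes inside #list-2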

2. Selecting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

3. Selecting text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())
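The three uses of select() above can be combined in one self-contained sketch. The markup is a trimmed, hypothetical variant of the panel document, html.parser is assumed, and select() depends on the soupsieve package that recent bs4 releases install alongside themselves.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened version of the panel markup.
doc = """
<div class="panel">
  <div class="panel-heading"><h4>Hello</h4></div>
  <div class="panel-body">
    <ul class="list" id="list-1">
      <li class="element">Foo</li>
      <li class="element">Bar</li>
    </ul>
    <ul class="list list-small" id="list-2">
      <li class="element">Jay</li>
    </ul>
  </div>
</div>
"""
soup = BeautifulSoup(doc, "html.parser")

# A space means "descendant of"; without it, both classes must sit on one element.
print(len(soup.select(".panel .panel-heading")))   # 1
print(len(soup.select(".panel.panel-heading")))    # 0
print(soup.select(".list.list-small")[0]["id"])    # list-2

# Attributes and text of the selected nodes.
ids = [ul["id"] for ul in soup.select("ul")]
texts = [li.get_text() for li in soup.select("#list-1 > li")]
print(ids)     # ['list-1', 'list-2']
print(texts)   # ['Foo', 'Bar']

# select_one() returns the first match directly instead of a list.
print(soup.select_one("li").get_text())   # Foo
```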

Reposted from: https://www.cnblogs.com/Iceredtea/p/11286170.html
