Detailed Notes on BeautifulSoup

Installation

If you have a full Python installation (including pip), just run this on the command line:

pip install beautifulsoup4

Importing it into a project

from bs4 import BeautifulSoup

Quick start

from bs4 import BeautifulSoup

html_doc = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>

    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

Initializing BeautifulSoup

soup = BeautifulSoup(html_doc, 'html.parser')
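A quick sketch of the parser choice: the second argument names the parser, where 'html.parser' ships with Python while 'lxml' and 'html5lib' are optional third-party alternatives. The short document below is a trimmed stand-in for the html_doc above.

```python
from bs4 import BeautifulSoup

html_doc = (
    "<html><head><title>The Dormouse's story</title></head>"
    "<body><p class='title'><b>The Dormouse's story</b></p></body></html>"
)

# 'html.parser' needs no extra install; 'lxml'/'html5lib' are faster or
# more lenient but must be pip-installed separately.
soup = BeautifulSoup(html_doc, "html.parser")
print(soup.title.string)  # The Dormouse's story
```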

Pretty-printing

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

Accessing tags

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

Accessing a tag's attributes

soup.p['class']
# ['title']  (class is a multi-valued attribute, so bs4 returns a list)

soup.a  # only finds the first match
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
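For attribute access in general, a small sketch: subscripting raises KeyError for missing attributes, while Tag.get() returns None, and .attrs exposes the whole dictionary.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/" id="link1">Elsie</a>', "html.parser"
)
tag = soup.a

print(tag["href"])     # http://example.com/
print(tag.get("rel"))  # None -- .get() returns None instead of raising KeyError
print(tag.attrs)       # the full attribute dictionary
```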

Conditional searches

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

Extracting the text shown on the page

print(soup.get_text())  # important: soup.get_text("", strip=True) strips the whitespace around each piece of text
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.

Accessing comments

soup.tag.string  # if a tag's only child is a comment, .string returns it as a Comment object
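A small sketch of this: a comment comes back from .string as a Comment, which is a subclass of NavigableString, so you can check its type to tell it apart from ordinary text.

```python
from bs4 import BeautifulSoup, Comment

soup = BeautifulSoup(
    "<b><!--Hey, buddy. Want to buy a used parser?--></b>", "html.parser"
)
comment = soup.b.string  # the comment is the <b> tag's only child
print(type(comment).__name__)  # Comment
print(comment)                 # Hey, buddy. Want to buy a used parser?
```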

Objects

  1. Tag
    1. name: tag.name is the tag's name
    2. attributes: tag['attribute_name'] is the value of that attribute
    3. tag.prettify() pretty-prints the whole tag
  2. NavigableString

    1. tag.string: the string inside the tag
    2. unicode_string = unicode(tag.string) converts it to Unicode (Python 2)
    3. tag.string.replace_with("No longer bold") replaces the tag's string
  3. BeautifulSoup: this object represents the document itself; you can also treat it as one big tag, and most Tag methods work on it
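The replace_with() call from item 2.3 above can be sketched like this: it swaps the NavigableString inside the tag for new text, mutating the tree in place.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
tag.string.replace_with("No longer bold")  # mutates the tree in place
print(tag)  # <b class="boldest">No longer bold</b>
```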

Navigating between tags

soup.body.b

soup.find_all('a')

# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

.contents and .children

tag.contents returns a tag's direct children as a list; .children yields the same children as a generator.

.descendants iterates over all children, grandchildren, and so on, recursively.

# head_tag is the <head> tag from the document above:
for child in head_tag.descendants:
    print(child)
# <title>The Dormouse's story</title>
# The Dormouse's story

string and strings

If a tag contains exactly one NavigableString, .string retrieves it.

With more than one, use .strings, a generator. It yields the text verbatim, so '\n' shows up too;
use .stripped_strings instead to get each piece with surrounding whitespace stripped.
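A minimal sketch of the difference: .strings keeps the indentation and newlines between tags, while .stripped_strings drops whitespace-only pieces and trims the rest.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>\n  One\n  <b>Two</b>\n</p>", "html.parser")
print(list(soup.strings))           # includes the '\n  ' whitespace pieces
print(list(soup.stripped_strings))  # ['One', 'Two']
```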

Tag relationships

.parent
.parents
.next_sibling and .previous_sibling
.next_element and .previous_element
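A short sketch of these navigation attributes. Note that siblings can be NavigableStrings, not just tags, so the comma between the two links counts as a sibling.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>,<a id="link2">Lacie</a></p>', "html.parser"
)
link = soup.find(id="link1")
print(link.parent.name)                # p
print(repr(link.next_sibling))         # ',' -- a string, not a tag
print(link.next_sibling.next_sibling)  # <a id="link2">Lacie</a>
```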

Searching

First, the kinds of values each filter argument can accept:
1. A string: soup.find_all('b')
2. A regular expression: soup.find_all(re.compile("^b"))
3. A list: soup.find_all(["a", "b"]) matches <a> or <b> tags
4. A function: a custom function that takes a tag and returns a boolean
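The function and regular-expression filters above can be sketched together (the filter function here is my own example, not from the original notes):

```python
import re
from bs4 import BeautifulSoup

html = '<p class="title"><b>Title</b></p><p class="story">Story <a href="#">x</a></p>'
soup = BeautifulSoup(html, "html.parser")

# A function filter receives each Tag and returns True to keep it:
def has_class_but_no_href(tag):
    return tag.has_attr("class") and not tag.has_attr("href")

print([t.name for t in soup.find_all(has_class_but_no_href)])  # ['p', 'p']
print([t.name for t in soup.find_all(re.compile("^b"))])       # tag names starting with 'b'
```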

A closer look at find_all()
Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

  • name: a tag name that narrows the search; it accepts any of the filter values above
  • The keyword arguments: id, href, and so on (attributes such as data-foo that are not valid Python identifiers must be passed through attrs instead)

    • .find_all(href=re.compile(".jpg"))
    • .find_all(id=True)
    • .find_all(href=re.compile("elsie"), id='link1')
  • Searching by CSS class: class is a reserved word, so it cannot be used directly; use class_ instead

    • earlier versions used: soup.find_all("a", attrs={"class": "sister"})
    • the value can be a string, a function returning a boolean, or even a regular expression
  • The string argument
    • with string you search for strings instead of tags; it accepts the same filter values as name
    • by itself it returns only the matched strings
    • combined with a tag name, it returns the tags whose string matches
    • string is simply the newer name for the old text argument
  • The recursive argument: set it to False to search only direct children instead of recursing all the way down
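The arguments above can be sketched in one place, on a trimmed version of the three-sisters document:

```python
import re
from bs4 import BeautifulSoup

html_doc = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>"""
soup = BeautifulSoup(html_doc, "html.parser")

print(len(soup.find_all("a", class_="sister")))          # 2 -- class_ stands in for class
print(soup.find_all(href=re.compile("elsie"))[0]["id"])  # link1 -- keyword arg + regex
print(soup.find_all("a", string="Lacie")[0]["id"])       # link2 -- tag name + string
print(soup.find_all("a", recursive=False))               # [] -- <a> is not a direct child of the soup
```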

Output

1. Formatted output
tag.prettify() # Unicode output, one tag per line, like a code formatter
2. Unformatted output

If you only want the string, use

str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'

unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'

encode() to get a bytestring, and decode() to get Unicode.
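A minimal sketch of encode()/decode() on a tag (the sample markup is my own):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Sacré bleu!</p>", "html.parser")
data = soup.p.encode("utf-8")  # a bytestring
print(type(data).__name__)     # bytes
print(data.decode("utf-8"))    # <p>Sacré bleu!</p>
```
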
3. Formatters

For the methods above, you can pass a formatter= argument to control how the output is escaped.
formatter='html'

print(soup.prettify(formatter="html"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
#   </p>
#  </body>
# </html>

formatter=None

print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>

A custom function

def uppercase(str):
    return str.upper()

print(soup.prettify(formatter=uppercase))
# <html>
#  <body>
#   <p>
#    IL A DIT <<SACRÉ BLEU!>>
#   </p>
#  </body>
# </html>

print(link_soup.a.prettify(formatter=uppercase))
# <a href="HTTP://EXAMPLE.COM/?FOO=VAL1&BAR=VAL2">
#  A LINK
# </a>

If you're writing your own function, you should know about the EntitySubstitution class in the bs4.dammit module. This class implements Beautiful Soup's standard formatters as class methods: the "html" formatter is EntitySubstitution.substitute_html, and the "minimal" formatter is EntitySubstitution.substitute_xml. You can use these functions to simulate formatter="html" or formatter="minimal", but then do something extra.

For example:

from bs4.dammit import EntitySubstitution
def uppercase_and_substitute_html_entities(str):
    return EntitySubstitution.substitute_html(str.upper())

print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
# <html>
#  <body>
#   <p>
#    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
#   </p>
#  </body>
# </html>

get_text()

If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string:

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, 'html.parser')

soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'

You can specify a string to be used to join the bits of text together:

soup.get_text("|")
# u'\nI linked to |example.com|\n'

You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text:

soup.get_text("|", strip=True)
# u'I linked to|example.com'

But at that point you might want to use the .stripped_strings generator instead, and process the text yourself:

[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']