The python bs4 module

Latest recommended article published 2024-06-05 09:13:17

淋巴不想动

Categories: python, web scraping

Article link: https://blog.csdn.net/weixin_43067754/article/details/87778256

0. Overview

Fetching pages: urllib, requests
Parsing page content: regular expressions, BeautifulSoup4 (BS4)
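
As a rough sketch of the parsing half of this split, the snippet below pulls the same title out of a small page with a regular expression and with BeautifulSoup. The `page` string is invented for illustration; in a real crawler it would be downloaded first with urllib or requests.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page content; a real crawler would download this first.
page = "<html><head><title>demo page</title></head><body><p>hi</p></body></html>"

# Regular expression: works on the raw text and knows nothing about HTML structure.
match = re.search(r"<title>(.*?)</title>", page)
title_by_regex = match.group(1) if match else None

# BeautifulSoup: builds a tree first, then navigates it by tag name.
soup = BeautifulSoup(page, "html.parser")
title_by_bs4 = soup.title.string

print(title_by_regex)
print(title_by_bs4)
```

The regular expression is fine for a fixed, simple pattern like this one; the tree-based approach tends to hold up better once attributes, nesting, or sloppy markup are involved.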

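1. Introduction to BS4

Beautiful Soup provides a few simple, Pythonic functions for navigating, searching, and modifying the parse tree. It is a toolbox that parses a document and gives you the data you want to extract from it. Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8. You do not need to think about encodings unless the document does not declare one and Beautiful Soup cannot detect it; in that case you only need to state the original encoding.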
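
A minimal sketch of stating the original encoding, assuming the input arrives as GBK-encoded bytes with no charset declaration. The sample text and the choice of GBK are made up for illustration; `from_encoding` is the constructor argument Beautiful Soup offers for this.

```python
from bs4 import BeautifulSoup

text = "你好, bs4"
raw = f"<p>{text}</p>".encode("gbk")  # bytes with no charset declaration

# from_encoding tells Beautiful Soup what the original encoding of the bytes is.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

decoded = soup.p.string             # inside the tree everything is Unicode
utf8_bytes = soup.encode("utf-8")   # byte output, encoded as UTF-8

print(decoded)
print(utf8_bytes)
```

`from_encoding` only matters when the input is bytes; if you pass an already decoded `str`, there is nothing left to guess.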

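2. The 4 kinds of objects in BS4

2-1. The BeautifulSoup object

2-2. The Tag object

A Tag corresponds to a tag in the HTML document. After parsing, a tag is accessed as soup.name, where name is the name of a tag in the HTML (for example soup.title). This returns the first tag with that name.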
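
A short sketch of the soup.name access pattern. The `doc` string is a made-up fragment, separate from the article's main example, so that the snippet runs on its own.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical fragment used only for this illustration.
doc = ('<html><head><title>demo</title></head>'
       '<body><p class="intro" id="p1">first</p><p>second</p></body></html>')
soup = BeautifulSoup(doc, "html.parser")

first_p = soup.p             # soup.<tag name> returns the FIRST matching tag
print(type(first_p))         # a bs4 Tag object
print(first_p.name)          # the tag's name
print(first_p.attrs)         # all attributes as a dict
print(first_p["id"])         # a single attribute, dict-style
print(first_p["class"])      # class is multi-valued, so it comes back as a list
```

To get every matching tag instead of only the first one, `soup.find_all("p")` returns them all.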

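2-3. NavigableString

Now that we have the tag itself, how do we get the text inside it? Simple: use .string, which returns the text as a NavigableString when the tag contains a single piece of text.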
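
A small sketch of how .string behaves, using a made-up `doc` fragment. It also shows the case where .string gives None because the tag has more than one child; `get_text()` is used there as the fallback.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical fragment used only for this illustration.
doc = "<p>plain text</p><div><b>one</b><i>two</i></div>"
soup = BeautifulSoup(doc, "html.parser")

text = soup.p.string             # single text child: a NavigableString
nothing = soup.div.string        # several children: .string is None
all_text = soup.div.get_text()   # concatenates all nested text instead

print(repr(text), type(text))
print(nothing)
print(all_text)
```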

2-4. Comment

A Comment object is a special type of NavigableString. When printed, its content does not include the comment delimiters (<!-- and -->).
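
Because a Comment looks like ordinary text when read through .string, it is worth checking the type before treating it as page content. A minimal sketch, with a made-up one-link `doc`:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Hypothetical fragment: a link whose only child is an HTML comment.
doc = '<a href="#"><!-- Elsie --></a>'
soup = BeautifulSoup(doc, "html.parser")

value = soup.a.string
print(repr(value))                    # the comment text, without <!-- and -->
print(isinstance(value, Comment))     # distinguishes comments from real text
print(soup.a)                         # the delimiters reappear when the tag is serialized

# Only keep the text if it is not a comment.
visible_text = None if isinstance(value, Comment) else value
```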

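Install bs4: pip install beautifulsoup4 (pip install bs4 also works; that name is a wrapper that installs the same package)
The example below is used to look at bs4 in detail.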

html = """
<html>
<head><title class='title'>story12345</title></head>
<body>
<p class="title" name="dromouse">The Dormouse's story</p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><span>westos</span><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister1" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>

<input type="text">
<input type="password">
"""

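This HTML is not quite well formed (the closing </body> and </html> tags are missing). The parser repairs it while building the tree, and prettify() prints the repaired tree with indentation.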

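import re  # not used in this snippet
from bs4 import BeautifulSoup

# Parse the string with Python's built-in HTML parser
soup = BeautifulSoup(html, 'html.parser')
# Print the repaired tree, one tag or string per line, indented by depth
print(soup.prettify())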

Output:

<html>
 <head>
  <title class="title">
   story12345
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   The Dormouse's story
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sist
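
The repair is easier to see on a tiny fragment. A minimal sketch, assuming a made-up `snippet` string in which none of the tags are closed:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with every closing tag missing.
snippet = "<html><body><p>hello"
soup = BeautifulSoup(snippet, "html.parser")

compact = str(soup)        # the repaired tree on one line
pretty = soup.prettify()   # the same tree, indented

print(compact)
print(pretty)
```

The closing tags are added when the tree is built, so they show up in both `str(soup)` and `prettify()`; prettify only changes the layout.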
