Basic Usage of bs4 for Python Web Scraping

啧不应该啊

Last modified 2024-09-25 21:06:11

First published 2024-09-25 20:46:31

Article link: https://blog.csdn.net/Wgq0731/article/details/142532206

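BS4 (Beautiful Soup 4) is a Python library for extracting data from HTML or XML files. It provides convenient ways to parse, traverse, search, and modify a document's tree structure.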

1. Installation and import:

Install it with the package manager:

pip3 install beautifulsoup4

Import:

from bs4 import BeautifulSoup

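Parsers supported by Beautiful Soup: html.parser, lxml, html5lib

html.parser ships with Python's standard library and needs no extra installation; lxml and html5lib must be installed separately. If you do not name a parser, Beautiful Soup picks the best one that is installed (lxml first, then html5lib, then html.parser) and issues a warning, so it is safer to always specify the parser explicitly.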
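Because the available parsers depend on what is installed, it helps to name the parser explicitly and fall back to the built-in one. Below is a minimal sketch. The helper name make_soup and the sample markup are hypothetical and used only for illustration; they are not part of bs4.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse markup with the preferred parser, falling back to html.parser.

    make_soup is a hypothetical helper for this example, not a bs4 API.
    """
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised when the requested parser is not installed;
        # html.parser is always available in the standard library.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>Hello, <b>bs4</b></p>")
text = soup.b.get_text()
print(text)
```

Naming the parser this way also avoids the warning Beautiful Soup emits when it has to guess which parser to use.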

2. Basic syntax and examples

Because bs4 extracts data by tag, most of its syntax revolves around tags.

Take the following data as an example:

html='''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

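The line soup = BeautifulSoup(html, 'lxml') calls the BeautifulSoup class and creates a new object that represents the entire HTML or XML document. That object is assigned to the variable soup, which you can then use to access and manipulate the document's contents.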
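As a rough illustration of what the soup object allows, the sketch below parses the sample document and reads a few tags and attributes. It uses 'html.parser' instead of 'lxml' purely so that it runs without an extra dependency; that substitution is an assumption of this example, and the variable names other than html and soup are illustrative.

```python
from bs4 import BeautifulSoup

html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
'''

# Assumption: html.parser is used here so no third-party parser is required
soup = BeautifulSoup(html, 'html.parser')

# Attribute-style access returns the first matching tag
title_text = soup.title.string

# Tag attributes are read like dictionary keys; class is multi-valued, so it is a list
first_p_class = soup.p['class']
first_p_name = soup.p['name']

# find_all returns every matching tag; find returns only the first match
links = [a['href'] for a in soup.find_all('a')]
second_link_text = soup.find(id='link2').get_text()

print(title_text)
print(first_p_class, first_p_name)
print(links)
print(second_link_text)
```

Note that soup.p only gives the first p tag; to reach the others, use find_all or a more specific search such as find(id=...).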