An Introduction to BeautifulSoup and a Basic Crawler Project

Latest recommended article published on 2024-08-08 15:42:55

SunLight Jr

Category: Crawlers

Article link: https://blog.csdn.net/qq_37597345/article/details/83781420

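This article introduces the Python library BeautifulSoup: installation, creating a BeautifulSoup object, the object types (Tag, NavigableString, BeautifulSoup, Comment), and the methods for traversing and searching the document tree. Examples show how to use BeautifulSoup to scrape web page content, such as finding specific tags and attributes and using CSS selectors. It also covers a basic crawler project: crawling 100 Baidu Baike entries.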

The Powerful BeautifulSoup

1. Brief introduction

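BeautifulSoup is a Python library for extracting data from HTML or XML files. It works with the parser of your choice to provide idiomatic ways of navigating, searching, and modifying the document tree.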
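To make "navigating, searching, and modifying" concrete, here is a minimal sketch. The markup in html_doc is a made-up sample, not taken from a real page, and it uses Python's built-in 'html.parser' so that it runs without any extra parser package.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, made up for this illustration.
html_doc = """
<html><head><title>Demo page</title></head>
<body>
<p class="intro">Hello, <b>world</b>!</p>
<a href="http://example.com/a" id="link1">First</a>
<a href="http://example.com/b" id="link2">Second</a>
</body></html>
"""

# 'html.parser' ships with Python, so nothing else has to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

# Navigate: attribute access walks to the first tag with that name.
title_text = soup.title.string

# Search: find_all returns every matching tag in document order.
hrefs = [a["href"] for a in soup.find_all("a")]

# Modify: attributes and text can be changed in place.
soup.p["class"] = "lead"
soup.b.string = "BeautifulSoup"

print(title_text)
print(hrefs)
print(soup.p)
```

Attribute access such as soup.title only reaches the first matching tag; find_all is the tool when every match is needed.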

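2. Installing Beautiful Soup
Method 1: pip install bs4 (the bs4 package on PyPI is a thin wrapper that installs beautifulsoup4, so pip install beautifulsoup4 is equivalent; the 'lxml' parser used below is a separate package: pip install lxml)
Method 2: In PyCharm, go to File -> Settings -> Project Interpreter, click the plus button on the right, then search for bs4 in the window that opens and install it.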
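A quick way to check that the installation worked is to import the package and parse a trivial snippet. This is only a sanity-check sketch; it uses the built-in 'html.parser' so that it does not depend on lxml being installed.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed Beautiful Soup version.
print(bs4.__version__)

# Parse a one-tag document to confirm the library is usable.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.string)
```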

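3. Using BeautifulSoup
[0]. Importing the bs4 library
from bs4 import BeautifulSoup
[1]. Creating a BeautifulSoup object
In the calls below, 'lxml' is an explicitly specified parser. If the argument is omitted, BeautifulSoup picks the best parser that is installed (and emits a warning, because the choice can differ between machines); if it is given, BeautifulSoup uses that parser.
Method 1. Create it directly from a string
soup = BeautifulSoup(html_str, 'lxml')  # add from_encoding='utf-8' only if html_str is bytes; it is ignored for str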

   Example:
   from bs4 import BeautifulSoup
   import requests
   import chardet

   url = 'http://www.baidu.com'
   response = requests.get(url)
   # Guess the encoding from the raw bytes so that response.text decodes correctly
   response.encoding = chardet.detect(response.content)['encoding']
   text = response.text

   # Parse the decoded HTML and print it with indentation
   soup = BeautifulSoup(text, 'lxml')
   print(soup.prettify())
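The role of from_encoding is easy to miss: it matters only when the markup is passed as bytes, while for an already-decoded str BeautifulSoup ignores it and warns. The sketch below shows both input types offline. The sample markup and the choice of GBK are assumptions made for illustration, and 'html.parser' is used so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup containing non-ASCII text.
html_str = "<html><body><p>你好, BeautifulSoup</p></body></html>"

# str input: already decoded, so no encoding hint is needed.
soup_from_str = BeautifulSoup(html_str, "html.parser")

# bytes input: from_encoding tells BeautifulSoup how to decode the bytes.
html_bytes = html_str.encode("gbk")
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser", from_encoding="gbk")

print(soup_from_str.p.string)
print(soup_from_bytes.p.string)
```

Both objects end up with the same text; the only difference is who does the decoding, your code or BeautifulSoup.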
Method 2. Create it from an HTML file
   from bs4 import BeautifulSoup
   import requests
   import chardet