Web Scraping: An Introduction to BeautifulSoup4 and How to Use It

Latest recommended article published on 2023-12-03 19:30:00

若金


Tags: python

Article link: https://blog.csdn.net/weixin_45680364/article/details/106269284


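1. Introduction to bs4
1.1 Basic concepts
Beautiful Soup is a library for extracting data from HTML or XML files.
1.2 Source code and installation
• Download the source from GitHub
• Install:
• pip install lxml (the parser used in the quick start)
• pip install bs4 (a thin wrapper on PyPI that installs the beautifulsoup4 package)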
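
A quick way to check the installation is to parse a tiny document. The sketch below is only an illustration: the helper name make_soup is hypothetical, and the fallback to Python's built-in html.parser is an assumption for machines where lxml is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Parse markup with lxml when it is installed, else with html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # bs4 raises FeatureNotFound when the requested parser is missing
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)  # hello
```

If this prints hello, both the library and at least one parser are working.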
2. Using bs4
2.1 Quick start
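from bs4 import BeautifulSoup

# Sample document; only the <title> and <p> tags used below are assumed here
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p>The Dormouse's story</p>
<p>Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.</p>
<p>...</p>
</body></html>
"""

# Create the BeautifulSoup object
bs = BeautifulSoup(html_doc, 'lxml')
# Print the document with normalized, indented markup
print(bs.prettify())
print(bs.title)         # the whole <title> tag: <title>The Dormouse's story</title>
print(bs.title.name)    # the tag's name: title
print(bs.title.string)  # the text inside the tag: The Dormouse's story
print(bs.p)             # the first <p> tag

2.2 Object types in bs4
• Tag : a tag
• NavigableString : a navigable string
• BeautifulSoup : the bs object
• Comment : a comment

3. Traversing the tree: child nodes
Working with a bs object comes down to three kinds of operations: traversing, searching, and modifying.
3.1 contents, children, descendants
• contents returns a list of the direct children
• children returns an iterator over the direct children
• descendants returns a generator that walks all descendants, at every depth
3.2 .string, .strings, .stripped_strings
• string returns the text inside a tag (when the tag holds a single string; otherwise it is None)
• strings returns a generator, used to get the text of several tags
• stripped_strings is basically the same as strings, but it strips the surrounding whitespace

4. Traversing the tree: parent nodes
parent and parents
• parent returns the direct parent node
• parents iterates over all ancestors

5. Traversing the tree: sibling nodes
• next_sibling : the next sibling node
• previous_sibling : the previous sibling node
• next_siblings : all following sibling nodes
• previous_siblings : all preceding sibling nodes

6. Searching the tree
• String filter
• Regular expression filter: compile a pattern with re.compile and pass it to find() or find_all() to search by regular expression
• List filter
• True filter
• Function filter

7. find_all() and find()
7.1 find_all()
• find_all() returns every matching tag as a list
• find() returns only the first match
• Parameters of find_all():
def find_all(self, name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs):
• name : tag name
• attrs : tag attributes
• recursive : whether to search recursively
• text : text content
• limit : maximum number of results to return
• kwargs : keyword arguments
7.2 find_parents(), find_parent(), find_next_siblings(), find_next_sibling()
• find_parents() searches all ancestors
• find_parent() returns a single ancestor
• find_next_siblings() searches all following siblings
• find_next_sibling() returns a single following sibling
7.3 find_previous_siblings(), find_previous_sibling(), find_all_next(), find_next()
• find_previous_siblings() searches all preceding siblings
• find_previous_sibling() returns a single preceding sibling
• find_all_next() searches all elements that come later in the document
• find_next() returns the first matching element that comes later

8. Modifying the document tree
• Change a tag's name and attributes
• Change the text: assigning to the string attribute replaces the tag's existing content with the new content
• append() adds content to a tag, much like the .append() method of a Python list
• decompose() removes a tag and its content, which is handy for deleting parts of a page you do not need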
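
The traversal attributes from sections 3 to 5 can be tried on a small document. This is a minimal sketch under a few assumptions: the markup and its id values are made up for illustration, and it uses Python's built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup, kept on one line so that no whitespace-only
# text nodes sit between the tags.
html = '<div id="box"><p id="a">one</p><p id="b">two <b>bold</b></p><p id="c">three</p></div>'
soup = BeautifulSoup(html, "html.parser")

box = soup.div
first = box.find(id="a")
second = box.find(id="b")

# Children: contents is a list, children is an iterator
child_count = len(box.contents)                        # 3
child_names = [child.name for child in box.children]   # ['p', 'p', 'p']

# Descendants include text nodes, so keep only Tag objects here
descendant_tags = [n.name for n in second.descendants if isinstance(n, Tag)]

# string is None when a tag holds more than one child
single = first.string                     # 'one'
mixed = second.string                     # None
raw = list(second.strings)                # ['two ', 'bold']
clean = list(second.stripped_strings)     # ['two', 'bold']

# Parents and siblings
parent_name = second.parent.name
ancestors = [p.name for p in second.parents]
next_id = second.next_sibling["id"]            # 'c'
previous_id = second.previous_sibling["id"]    # 'a'
following_ids = [s["id"] for s in first.next_siblings]

print(child_names, descendant_tags, clean, parent_name, following_ids)
```

In real pages the markup is usually indented, so sibling and child lists also contain whitespace strings; filtering with isinstance(node, Tag), as above, is one way to skip them.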
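
The five filter types from section 6 and the find methods from section 7 might be exercised as follows. Again this is a sketch, not a reference: the markup, the id values, and the function name has_class_but_no_id are illustrative assumptions, and html.parser stands in for lxml. The example passes string= for text matching, which is the name recent Beautiful Soup releases use for the text argument shown in the signature above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
html = ('<div><a href="/x" id="link1">Elsie</a><b>bold</b>'
        '<p class="story">text</p><a href="/y" id="link2">Lacie</a></div>')
soup = BeautifulSoup(html, "html.parser")

# String filter: exact tag name
by_string = [t.name for t in soup.find_all("a")]
# Regular expression filter: tag names starting with "b"
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]
# List filter: any of the listed tag names, in document order
by_list = [t.name for t in soup.find_all(["a", "b"])]
# True filter: every tag, but no text nodes
by_true = [t.name for t in soup.find_all(True)]


# Function filter: receives a tag and returns True to keep it
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_function = [t.name for t in soup.find_all(has_class_but_no_id)]

# find_all() parameters
limited = soup.find_all("a", limit=1)
by_attrs = soup.find_all("a", attrs={"id": "link2"})
top_level_only = soup.find_all("a", recursive=False)   # <a> is nested in <div>
by_text = soup.find_all(string="Elsie")

# find() and the relative search methods
link1 = soup.find("a")
next_link_id = link1.find_next_sibling("a")["id"]
prev_link_id = by_attrs[0].find_previous_sibling("a")["id"]
parent_name = link1.find_parent("div").name
later_tags = [t.name for t in link1.find_all_next(True)]

print(by_list, by_function, next_link_id, later_tags)
```

Note that recursive=False only looks at direct children of the object it is called on, which is why the search from soup finds nothing while soup.div.find_all("a", recursive=False) would find both links.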
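
The four modifications listed in section 8 can be chained on one small tree. The markup below is a made-up example, and html.parser is used only so the sketch has no extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup for illustration
soup = BeautifulSoup(
    '<div><b class="old">hello</b><p id="ad">remove me</p></div>',
    "html.parser",
)

tag = soup.b
tag.name = "strong"      # rename the tag
tag["class"] = "new"     # change an attribute
tag.string = "hi"        # replace the tag's content
tag.append(" there")     # add content at the end, like list.append()

soup.find(id="ad").decompose()   # remove the <p> and everything inside it

result = str(soup)
print(result)  # <div><strong class="new">hi there</strong></div>
```

All of these calls change the tree in place, so the soup object itself reflects the edits and can be printed or searched again afterwards.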
