Study Notes on "Python Web Crawling and Information Extraction" (Part 2)

Author: 一只小白来了

Category: python学习 (Python learning). Tags: python

Article link: https://blog.csdn.net/weixin_44866139/article/details/104299498

“The website is the API”

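Installing the Beautiful Soup library

Command to upgrade pip:
python -m pip install --upgrade pip
Open the command line as administrator.
Command to install the Beautiful Soup library:
pip install beautifulsoup4
Page for a quick installation test:
https://python123.io/ws/demo.html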

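How do you cook up a pot of soup? It takes only three lines of code:
1. from bs4 import BeautifulSoup (bs4 is the library's short name; BeautifulSoup is a class)
2. soup = BeautifulSoup(r.text, "html.parser") ("html.parser" selects the parser built on HTMLParser, Python's built-in module for parsing HTML)
3. print(soup.prettify())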
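The three lines above can be tried without a network request. The following is a minimal sketch, assuming a small hand-written HTML string (the `html` variable is hypothetical and stands in for `r.text`):

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for r.text, so the sketch runs offline
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='title'><b>Hello</b></p></body></html>"
)

# Parse the markup with Python's built-in HTML parser
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the tree as a string, one tag or string per line
pretty = soup.prettify()
print(pretty)

# The parsed tree can be queried by tag name
print(soup.title.string)
```

If the installation succeeded, the output should show the same markup spread over indented lines, followed by the page title.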

Basic elements of the BeautifulSoup library

The BeautifulSoup library is a utility library for parsing, traversing, and maintaining the "tag tree".

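The NavigableString returned by .string can span several tag levels: if a tag's only child is another tag, .string returns the text inside that inner tag.
The .attrs attribute always returns a dict, whether or not the tag has attributes (an empty dict if it has none).
To tell whether the returned string is a comment, check its type (Comment rather than plain NavigableString).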
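These three points can be checked on a small example. This is only a sketch under assumptions: the `html` string and the `is_comment` helper are hypothetical names introduced for illustration, not part of the course material.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical markup: a nested tag, a tag holding a comment, a tag without attributes
html = (
    '<p class="title"><b>The demo page</b></p>'
    '<a href="http://example.com/one" id="link1"><!--a comment--></a>'
    "<i>plain</i>"
)
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
print(tag.name)        # the tag's name
print(tag.attrs)       # attributes as a dict
print(soup.i.attrs)    # an empty dict when there are no attributes

# .string reaches through <p> into <b> because <b> is the only child of <p>
print(soup.p.string)


def is_comment(node):
    """Hypothetical helper: True if the node is an HTML comment."""
    return isinstance(node, Comment)


print(is_comment(soup.a.string))  # the <a> tag holds only a comment
print(is_comment(soup.p.string))  # ordinary text is not a comment
```

Checking with `isinstance` matters because printing a Comment shows only its text, which looks just like an ordinary string.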

Traversing HTML content with the bs4 library

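The .contents attribute (an attribute, not a method) returns a list of the tag's child nodes.
Child nodes include string nodes as well as tag nodes.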
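A short sketch of downward traversal, assuming a hand-written `html` string (hypothetical) so that the node counts are easy to verify by eye:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: <body> has two tag children with a text node between them
html = "<body><p>first</p> and <b>second</b></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a plain list of direct children
children = body.contents
print(type(children).__name__, len(children))

# Direct children mix Tag and NavigableString nodes
kinds = [type(node).__name__ for node in children]
print(kinds)

# .children iterates over the same direct children lazily
child_count = len(list(body.children))

# .descendants also walks into the children of children
descendant_count = len(list(body.descendants))
print(child_count, descendant_count)
```

Code that expects only tags should filter on the node type, since whitespace and text between tags also appear as children.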

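The parent of the html tag is the BeautifulSoup object itself (which prints as the whole document), and the parent of soup is None.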

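import requests
from bs4 import BeautifulSoup

# Fetch the demo page and parse it with the built-in parser
r = requests.get("https://python123.io/ws/demo.html", timeout=10)
r.raise_for_status()
demo = r.text
soup = BeautifulSoup(demo, "html.parser")
# .parents walks upward from the first <a> tag. It never yields None;
# the last item is the BeautifulSoup object, whose name is "[document]".
for parent in soup.a.parents:
    print(parent.name)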

Nodes reached by sibling (parallel) traversal may be NavigableString objects rather than tags.
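The following sketch illustrates this on an assumed `html` string (hypothetical), where a text node sits between two sibling tags:

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Hypothetical markup: text sits between the <a> and <b> siblings
html = "<p><a>one</a> and <b>two</b></p>"
soup = BeautifulSoup(html, "html.parser")

# The next sibling of <a> is the text node, not the <b> tag
nxt = soup.a.next_sibling
print(repr(nxt), isinstance(nxt, NavigableString))

# Stepping once more reaches the <b> tag
print(nxt.next_sibling.name)

# There is nothing before <a> inside <p>
print(soup.a.previous_sibling)

# Keep only Tag siblings when text nodes are not wanted
tag_siblings = [s for s in soup.a.next_siblings if isinstance(s, Tag)]
print([t.name for t in tag_siblings])
```

Filtering with `isinstance(s, Tag)` is one way to skip the string nodes. Whether that is appropriate depends on whether the text between tags carries information you need.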

HTML formatting and encoding with the bs4 library
