Python Web Scraping Notes (5): BeautifulSoup

mittyQAQ

Article link: https://blog.csdn.net/weixin_43525209/article/details/107505940

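pip install beautifulsoup
Installing under the package name beautifulsoup failed with this error:

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

A quick search showed that this is a version problem: the name beautifulsoup refers to the old 3.x release, and appending a 4 installs the current one.
pip install beautifulsoup4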
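
A quick way to confirm that the install worked is to import the package and parse a tiny string. This is a minimal sketch; the HTML snippet is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# The PyPI package is called beautifulsoup4, but the importable module is bs4.
print(bs4.__version__)

# Parse a made-up snippet to check that the built-in parser is usable.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
print(soup.p.string)
```

If the import fails, the package most likely went into a different Python environment than the one running the script.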

import requests
from bs4 import BeautifulSoup
r = requests.get("https://python123.io/ws/demo.html")
# print(r.text)
demo = r.text
soup = BeautifulSoup(demo , "html.parser")
print(soup.prettify())

The input can be an HTML string (for example, the text of a downloaded page) or an HTML file opened from disk:
soup = BeautifulSoup("data", "html.parser")
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")
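
Both input styles can be tried without a network connection. The sketch below writes a made-up HTML file to a temporary directory (the file content and path are assumptions) and opens it in a with-block so the file is closed after parsing.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a small made-up HTML file so the example is self-contained.
html = "<html><head><title>demo</title></head><body><p>hi</p></body></html>"
path = os.path.join(tempfile.mkdtemp(), "demo.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

# 1) Parse from a string.
soup = BeautifulSoup(html, "html.parser")

# 2) Parse from an open file object; the with-block closes the file.
with open(path, encoding="utf-8") as f:
    soup2 = BeautifulSoup(f, "html.parser")

print(soup.title.string, soup2.title.string)
```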

# Print the whole HTML page, nicely indented
print(soup.prettify())

# Return the title tag
print(soup.title)

# Return the a tag
print(soup.a)

Result:

<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>

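As long as the page contains the tag, it is returned; when there are several, only the first one is returned. Getting all of them takes one more step, such as calling soup.find_all('a').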
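
The difference between the first match and all matches can be sketched as follows. The HTML is a made-up stand-in for the demo page, and the second link's URL is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the demo page; the second link is hypothetical.
html = """
<p class="course">Courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and
<a class="py2" href="http://example.com/advanced" id="link2">Advanced Python</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.a["id"])         # soup.a is only the first <a>
links = soup.find_all("a")  # list of every <a> in the document
for link in links:
    print(link["id"], link.get("href"))
```
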
# The attributes of a tag are available as a dict through .attrs
tag = soup.a
print(tag.attrs)

{'href': 'http://www.icourse163.org/course/BIT-268001', 'class': ['py1'], 'id': 'link1'}

Individual attributes can then be read with square brackets:

print(tag.attrs['href'])

Result: http://www.icourse163.org/course/BIT-268001
type() shows the type of an object.
Example: type(tag), type(tag.attrs)
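
To see which kinds of objects are involved, the types can be printed directly. A minimal sketch on a one-link snippet (the wrapping p tag is an assumption):

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = '<p><a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a></p>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a

print(type(tag))         # the tag itself
print(type(tag.attrs))   # its attributes
print(type(tag.string))  # the text inside it
print(tag.name, tag["class"])

# isinstance checks are handy when mixing tags and text nodes.
print(isinstance(tag, Tag), isinstance(tag.string, NavigableString))
```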

The name of the parent tag can be read as shown here; dropping .parent gives the name of the a tag itself.
print(soup.a.parent.name)

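Downward traversal of the tag tree
.contents: list of child nodes; all direct children of <tag> are stored in a list
.children: iterator over the child nodes; like .contents, used to loop over the direct children
.descendants: iterator over all descendant nodes, used to loop over every node below the tag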

print(soup.head)
print(soup.head.contents)
print(soup.body.contents)
# Loop over the direct children of <body>
for child in soup.body.children:
    print(child)
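
The three downward attributes differ in how deep they go. The sketch below uses a made-up body without whitespace between tags, so the counts are easy to follow; on a real page the newlines between tags also show up as text nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

html = "<body><p><b>bold</b> text</p><a>link</a></body>"
soup = BeautifulSoup(html, "html.parser")
body = soup.body

# .contents is a plain list of the direct children.
print(len(body.contents))

# .children iterates over the same direct children.
child_names = [c.name for c in body.children if isinstance(c, Tag)]

# .descendants walks the whole subtree, tags and text nodes alike.
desc_tags = [d.name for d in body.descendants if isinstance(d, Tag)]
desc_strings = [str(d) for d in body.descendants if isinstance(d, NavigableString)]

print(child_names)
print(desc_tags)
print(desc_strings)
```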

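Upward traversal of the tag tree
.parent: the parent tag of a node
.parents: iterator over a node's ancestor tags, used to loop over the ancestors
soup.title.parent
soup.html.parent  # html is the top-level tag, so its parent is the BeautifulSoup object itself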
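
Walking upward from a tag can be sketched like this, using a made-up nested snippet. The loop ends at the BeautifulSoup object, whose own parent is None.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.a.parent.name)

# Walk up from <a>; the last ancestor is the BeautifulSoup object itself.
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)

print(soup.html.parent is soup)  # the top-level tag hangs off the soup object
print(soup.parent)               # the soup object has no parent
```
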
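
Sibling traversal of the tag tree
.next_sibling: returns the next sibling node in HTML document order
.previous_sibling: returns the previous sibling node in HTML document order
.next_siblings: iterator over all following sibling nodes in HTML document order
.previous_siblings: iterator over all preceding sibling nodes in HTML document order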

Sibling traversal only applies to nodes that share the same parent tag.

# Loop over the following siblings
for sibling in soup.a.next_siblings:
    print(sibling)
# Loop over the preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)
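
One thing worth knowing when looping over siblings is that text between tags counts as a sibling too. A minimal sketch on a made-up paragraph with two links (the ids mirror the demo page; the rest is an assumption):

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>Intro <a id="link1">Basic Python</a> and <a id="link2">Advanced Python</a>.</p>'
soup = BeautifulSoup(html, "html.parser")
first = soup.a

print(repr(first.next_sibling))      # a text node, not the next <a>
print(repr(first.previous_sibling))

# Keep only sibling tags, skipping the text nodes in between.
next_tags = [s for s in first.next_siblings if isinstance(s, Tag)]
print([t["id"] for t in next_tags])

# find_next_sibling jumps straight to the next matching tag.
print(first.find_next_sibling("a")["id"])
```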
