Beautiful Soup Study Notes (Python)

wnma3mz

Article link: https://blog.csdn.net/wnma3mz/article/details/74832678

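# After installing Python, install Beautiful Soup (imported as bs4) and the lxml parser.
# Run these in a shell, not at the Python >>> prompt.
pip install beautifulsoup4
pip install lxml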

# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup

# Parse the document held in html (a string of HTML markup) with the lxml parser
soup = BeautifulSoup(html, 'lxml')
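
The snippets in these notes assume that `html` already holds a document. A minimal sketch of a complete setup follows; the sample markup is hypothetical (it only reuses the names `elsie`, `sister`, `link1` and `title` that appear later in the notes), and it uses the built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, made up for illustration.
html = """
<html><head><title>Sample page</title></head>
<body>
<p class="title"><b>Sample heading</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
</body></html>
"""

# "html.parser" ships with Python; pass "lxml" instead if lxml is installed.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string       # text of the <title> tag
link_count = len(soup.find_all("a")) # number of <a> tags in the document
print(page_title, link_count)
```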

# Find the first <a> tag (returns None if there is none)
soup.find("a")

# Find all <a> tags; the return value is a list (a ResultSet)
soup.find_all("a")

# Get all the text in the document
soup.get_text()
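
The three calls above differ mainly in what they return when nothing matches. A small sketch, assuming a made-up one-line document, that shows the return values side by side:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = '<p>Hello <a href="/one">first</a> and <a href="/two">second</a></p>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")               # a Tag: the first <a> in document order
missing = soup.find("table")         # None, because there is no <table>
links = soup.find_all("a")           # list-like ResultSet with both <a> tags
none_found = soup.find_all("table")  # empty ResultSet, not None
text = soup.get_text()               # all text nodes joined together

print(first["href"], missing, len(links), len(none_found), text)
```

Because `find()` can return None, it is usually worth checking the result before indexing into it.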

# Get the class attribute of an <a> tag; class is multi-valued, so the result is a list of class names
tag_a = soup.find("a")
tag_a["class"]

# Get the text inside the <a> tag
tag_a = soup.find("a")
tag_a.string
# .string is a NavigableString; convert it to a plain string with str() (unicode() in Python 2)
str(tag_a.string)
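
Attribute access and `.string` have a few edge cases that are easy to trip over. The sketch below uses hypothetical markup to show them: multi-valued `class`, a missing attribute, and a tag with more than one child.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = '<a href="/elsie" class="sister big" id="link1">Elsie</a><p>one <b>two</b></p>'
soup = BeautifulSoup(html, "html.parser")
tag_a = soup.find("a")

classes = tag_a["class"]      # class is multi-valued, so this is a list
link_id = tag_a["id"]         # single-valued attributes come back as str
no_title = tag_a.get("title") # .get() returns None instead of raising KeyError
a_text = str(tag_a.string)    # plain str copy of the NavigableString

p = soup.find("p")
p_string = p.string           # None: <p> has more than one child node
p_text = p.get_text()         # get_text() still collects all nested text

print(classes, link_id, no_title, a_text, p_string, p_text)
```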

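# Filtering with regular expressions
import re
# Find all tags whose name contains the letter "a" (the pattern is applied with search())
soup.find_all(re.compile("a"))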

# Find all <a> and <b> tags
soup.find_all(["a", "b"])
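
A regular expression is matched anywhere in the tag name, so `re.compile("a")` also picks up tags such as `<span>`. A short sketch on hypothetical markup compares it with an anchored pattern and a list filter:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = "<body><a>x</a><b>y</b><span>z</span><p>w</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Unanchored pattern: any tag name containing "a"
contains_a = [t.name for t in soup.find_all(re.compile("a"))]

# Anchored pattern: only tags named exactly "a"
exactly_a = [t.name for t in soup.find_all(re.compile("^a$"))]

# List filter: tags whose name equals any item in the list
a_or_b = [t.name for t in soup.find_all(["a", "b"])]

print(contains_a, exactly_a, a_or_b)
```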


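# find_all() in detail
# Find all <p> tags whose CSS class is "title" (a string as the second argument filters on class, not on a title attribute)
soup.find_all("p", "title")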
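
A string passed as the second positional argument of `find_all()` is matched against the CSS class, not against attribute names. A minimal sketch, on hypothetical markup, contrasting that with a filter on the presence of a `title` attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<p class="title">Heading</p>
<p title="tooltip">Paragraph with a title attribute</p>
<p>Plain paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Second positional argument as a string: filter on CSS class
by_class = soup.find_all("p", "title")

# attrs dict with True: filter on the presence of the attribute itself
by_attr = soup.find_all("p", attrs={"title": True})

print([p.get_text() for p in by_class])
print([p.get_text() for p in by_attr])
```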

# Find all tags whose href attribute matches this regular expression and whose id is "link1"
import re
soup.find_all(href=re.compile("elsie"), id="link1")

# Find all <a> tags with the class sister; class is a reserved keyword in Python, so the keyword argument is spelled class_
soup.find_all("a", class_="sister")

# Find <a> tags, limiting the returned list to at most 2 results
soup.find_all("a", limit=2)
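
The keyword filters above can be combined in a single call. The following sketch, again on hypothetical markup, runs the three variants together so their results can be compared:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Several keyword filters must all match the same tag
elsie = soup.find_all(href=re.compile("elsie"), id="link1")

# class_ filters on CSS class
sisters = soup.find_all("a", class_="sister")

# limit stops the search after the given number of matches
first_two = soup.find_all("a", class_="sister", limit=2)

print(len(elsie), len(sisters), [a["id"] for a in first_two])
```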