Python crawler HTML parsers: how does a Python crawler parse this kind of HTML?

Latest recommended article published 2023-03-25 10:55:47

Sunflower向阳而生

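Copyright notice: This is an original article by the author, licensed under CC 4.0 BY-SA. When reposting, include a link to the original source and this notice.

Article link: https://blog.csdn.net/weixin_42394785/article/details/113672566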

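1. The re module is essentially brute-force extraction from a string: it matches patterns in the raw text rather than parsing the document structure.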
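To make the point about re concrete, here is a minimal sketch of pulling the same values out with regular expressions alone. The `sample_html` markup and the two patterns are assumptions chosen for illustration, not the author's code, and they only hold up for simple, predictable markup like this.

```python
import re

# Sample markup assumed for illustration only.
sample_html = """
<div>
<p>哈哈</p>
<p><img src="www.baidu.com">你真</p>
<p>搞笑</p>
</div>
"""

# Capture the value of the src attribute on the first <img> tag.
src_match = re.search(r'<img[^>]*\bsrc="([^"]*)"', sample_html)
img_src = src_match.group(1) if src_match else None

# Split the string on every tag, then keep the non-blank pieces of text.
texts = [t.strip() for t in re.split(r"<[^>]+>", sample_html) if t.strip()]

print(img_src)
print(texts)
```

Because the patterns know nothing about nesting, a change in attribute quoting or tag layout can silently break them, which is why the structure-aware parsers are usually the safer choice.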

The available parsing approaches include:

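# Sample markup; the tags are an assumed reconstruction inferred from the selectors used below
html = """
<div>
<p>哈哈</p>
<p><img src="www.baidu.com">你真</p>
<p>搞笑</p>
</div>
"""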

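# Parse with the Selector class from Scrapy
from scrapy import Selector

response = Selector(text=html)
# "div *::text" selects the text nodes of every element nested inside a <div>
text_list = response.css("div *::text").getall()
# Drop whitespace-only nodes
print([t for t in text_list if t.strip()])
# .get() returns the first match, or None when nothing matches
url = response.css("div img::attr(src)").get()
print(url)
"""
Output:
['哈哈', '你真', '搞笑']
www.baidu.com
"""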

# Parse with BeautifulSoup
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package
haha = soup.div.p.string  # text of the first <p> inside the first <div>
print(haha)
url = soup.img.attrs['src']
print(url)
# Second non-blank string inside the <div>
nizhen = list(soup.div.stripped_strings)[1]
print(nizhen)
gaoxiao = soup.div.find_all('p')[-1].string  # text of the last <p>
print(gaoxiao)
print(haha, url, nizhen, gaoxiao)
"""
Output:
哈哈
www.baidu.com
你真
搞笑
哈哈 www.baidu.com 你真 搞笑
"""

# Parse with lxml
from lxml import etree

tree = etree.HTML(html)  # build an element tree that can be queried with XPath
text_lists = tree.xpath("//div//text()")
# Drop whitespace-only nodes
print([t for t in text_lists if t.strip()])
url = tree.xpath("//div/p/img/@src")  # an attribute query returns a list of strings
print(url)
"""
Output:
['哈哈', '你真', '搞笑']
['www.baidu.com']
"""

Other modules that can parse HTML include pyquery, re, and others.
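The list above is open-ended. One further option, not named in the original, is the standard library's `html.parser`, which needs no third-party install. The sketch below is only an illustration: the `TextAndImageCollector` class and the `sample_html` markup are hypothetical names and data introduced here.

```python
from html.parser import HTMLParser


class TextAndImageCollector(HTMLParser):
    """Hypothetical helper: collects non-blank text and <img> src values."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.image_sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.image_sources.append(src)

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only chunks.
        if data.strip():
            self.texts.append(data.strip())


# Sample markup assumed for illustration only.
sample_html = """
<div>
<p>哈哈</p>
<p><img src="www.baidu.com">你真</p>
<p>搞笑</p>
</div>
"""

collector = TextAndImageCollector()
collector.feed(sample_html)
collector.close()

print(collector.texts)
print(collector.image_sources)
```

This event-driven style is more verbose than CSS or XPath selectors, so it tends to suit small scripts where adding a dependency is not worthwhile.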
