Introduction to Python Web Scraping (Part 3)

Latest recommended article published on 2023-05-24 10:51:23

wyyyyyyyy_

Article link: https://blog.csdn.net/wyyyyyyyy_/article/details/89054435

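XPath in Detail
Selecting nodes
nodename selects the child elements of the current node that are named nodename; for example, bookstore selects the bookstore children of the current node
/ selects from the root node; for example, /bookstore selects the root element bookstore
// selects matching nodes anywhere in the document; for example, //bookstore selects every bookstore element, wherever it sits
@ selects attributes; for example, //book[@price] selects every book element that has a price attribute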
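
The four selectors above can be tried out with lxml. This is a minimal sketch, assuming a small bookstore document invented for the illustration; the variable names are also just for this example.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
xml = """<bookstore>
  <book price="10"><title>A</title></book>
  <book price="25"><title>B</title></book>
  <book><title>C</title></book>
</bookstore>"""
root = etree.fromstring(xml)   # root is the <bookstore> element
tree = root.getroottree()      # the whole document

# nodename: child elements of the context node with that name.
books = root.xpath("book")
# /: absolute path starting at the document root.
stores = tree.xpath("/bookstore")
# //: matching elements anywhere in the document.
titles = root.xpath("//title")
# @: attribute test, keeps only the books that carry a price attribute.
priced = root.xpath("//book[@price]")

print(len(books), len(stores), len(titles), len(priced))  # 3 1 3 2
```

Note that the bare name `book` is relative: it only finds something because the context node is `<bookstore>`.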

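Predicates
/bookstore/book[1] selects the first book child of bookstore
/bookstore/book[last()] selects the last book child of bookstore (use last()-1 for the second-to-last)
/bookstore/book[position() < 3] selects the first two book children of bookstore
//book[@price=10] selects every book element whose price attribute equals 10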
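
To check what each predicate returns, the sketch below runs them against a made-up bookstore document. The document and the `titles` helper are assumptions for this example only.

```python
from lxml import etree

# Hypothetical sample document, made up for this illustration.
xml = """<bookstore>
  <book price="10"><title>A</title></book>
  <book price="25"><title>B</title></book>
  <book price="10"><title>C</title></book>
</bookstore>"""
root = etree.fromstring(xml)


def titles(expr):
    """Return the title text of every book matched by the XPath expression."""
    return [book.findtext("title") for book in root.xpath(expr)]


first = titles("/bookstore/book[1]")                    # ['A']  (XPath counts from 1)
last = titles("/bookstore/book[last()]")                # ['C']
second_last = titles("/bookstore/book[last()-1]")       # ['B']
first_two = titles("/bookstore/book[position() < 3]")   # ['A', 'B']
price_10 = titles("//book[@price=10]")                  # ['A', 'C']

print(first, last, second_last, first_two, price_10)
```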

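Parsing HTML strings and files with lxml

Parsing an HTML string

from lxml import etree

# Put the HTML source to parse between the triple quotes.
text = """   """
# etree.HTML uses the lenient HTML parser and returns the root element.
html = etree.HTML(text)
print(etree.tostring(html, encoding='utf-8').decode('utf-8'))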
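
As a rough illustration of what etree.HTML does with incomplete input, the sketch below feeds it a fragment invented for this example: it has no html or body wrapper and one li is never closed.

```python
from lxml import etree

# Hypothetical fragment: no <html>/<body>, and the second <li> is never closed.
text = "<ul><li>first</li><li>second</ul>"
html = etree.HTML(text)

# The parser wraps the fragment in <html><body> and closes the open <li>.
root_tag = html.tag
has_body = html.find("body") is not None
items = html.xpath("//li/text()")

print(root_tag, has_body, items)
print(etree.tostring(html, encoding="utf-8").decode("utf-8"))
```

Printing the serialized tree is a quick way to see how the parser repaired the markup before writing XPath expressions against it.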

Parsing an HTML file

from lxml import etree

# etree.parse uses the XML parser by default, so the file must be well-formed.
html = etree.parse("    .html")
print(etree.tostring(html, encoding='utf-8').decode('utf-8'))

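Handling malformed HTML

from lxml import etree

# An explicit HTMLParser tolerates unclosed tags and similar problems.
parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse("    .html", parser=parser)
print(etree.tostring(html, encoding='utf-8').decode('utf-8'))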
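
The difference between the default parser and HTMLParser can be seen with a small experiment. This is a sketch under assumptions: the malformed page is invented, and it is written to a temporary file only so the example is self-contained.

```python
import os
import tempfile

from lxml import etree

# Hypothetical malformed page: <p> and <br> are never closed.
broken = "<html><body><p>hello<br>world</body></html>"

with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as f:
    f.write(broken)
    path = f.name

try:
    # The default XML parser rejects the mismatched tags.
    try:
        etree.parse(path)
        xml_ok = True
    except etree.XMLSyntaxError:
        xml_ok = False

    # The HTML parser repairs the markup instead of failing.
    parser = etree.HTMLParser(encoding="utf-8")
    html = etree.parse(path, parser=parser)
    texts = html.xpath("//p//text()")
finally:
    os.remove(path)

print(xml_ok, texts)  # False ['hello', 'world']
```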

Using lxml together with XPath

from lxml import etree

parser = etree.HTMLParser(encoding='utf-8')
html = etree.parse("    .html", parser=parser)

Get all tr elements

# xpath() returns a list of matching elements.
trs = html.xpath("//tr")
for tr in trs:
    print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))

Get the second tr element

# //tr[2] matches every tr that is the second tr under its parent; [0] takes the first match.
tr = html.xpath("//tr[2]")[0]
print(etree.tostring(tr, encoding='utf-8').decode('utf-8'))
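
One subtlety of //tr[2]: the predicate is applied per parent, so a page with several tables yields one match per table. Wrapping the path in parentheses indexes the combined result instead. The sketch below assumes a made-up page with two tables.

```python
from lxml import etree

# Hypothetical page with two tables of two rows each.
html = etree.HTML("""
<table><tr><td>a1</td></tr><tr><td>a2</td></tr></table>
<table><tr><td>b1</td></tr><tr><td>b2</td></tr></table>
""")

# Second tr within each parent: one hit per table.
per_parent = html.xpath("//tr[2]/td/text()")    # ['a2', 'b2']
# Second and third tr in the document as a whole.
second_overall = html.xpath("(//tr)[2]/td/text()")   # ['a2']
third_overall = html.xpath("(//tr)[3]/td/text()")    # ['b1']

print(per_parent, second_overall, third_overall)
```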

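Get the href attribute of all a elements

# "as" is a reserved word in Python, so the result needs a different name.
hrefs = html.xpath("//a/@href")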
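
Selecting @href returns plain strings, and a elements without the attribute are skipped. When the link text is needed too, one option is to select the elements and read the attribute in Python. A minimal sketch, assuming an invented table of links:

```python
from lxml import etree

# Hypothetical page, made up for this illustration.
html = etree.HTML("""
<table>
  <tr><td><a href="/jobs/1">Engineer</a></td></tr>
  <tr><td><a href="/jobs/2">Designer</a></td></tr>
  <tr><td><a>No link</a></td></tr>
</table>
""")

# Attribute selection: a list of strings, one per a element that has an href.
hrefs = html.xpath("//a/@href")

# Element selection: keep text and href together; get() returns None if missing.
links = [(a.text, a.get("href")) for a in html.xpath("//a")]

print(hrefs)
print(links)
```

Here `hrefs` holds two entries while `links` holds three, because the last a element has no href.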
