lxml and XPath_02

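Working with HTML in lxml is similar to working with XML.

The difference: XML must be well-formed, so it consists only of properly nested nodes, whereas real-world HTML mixes nodes with free text and is often not well-formed. For that reason lxml parses HTML with a separate, more tolerant parser, HTMLParser.

Prepare an HTML file and save it as test.html: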
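The contrast between the two parsers can be sketched with a small experiment. The markup string and the variable names below are made up for illustration; only `etree.fromstring`, `etree.HTML`, and `etree.XMLSyntaxError` come from lxml.

```python
from lxml import etree

# Deliberately malformed markup (hypothetical sample): <p> is never closed.
broken = "<div><p>hello</div>"

# The strict XML parser rejects markup that is not well-formed.
try:
    etree.fromstring(broken)
    xml_ok = True
except etree.XMLSyntaxError:
    xml_ok = False

# The tolerant HTML parser repairs the tree instead of failing.
html_root = etree.HTML(broken)
p_text = html_root.xpath("//p/text()")

print("XML parser accepted it:", xml_ok)
print("Text recovered by the HTML parser:", p_text)
```

This is why the examples in this post go through `etree.HTMLParser()` or `etree.HTML()` rather than the default XML parser.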

<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8">
    <title>这是一个网页</title>
</head>
<body>
</body>
</html>

Code that reads and parses the HTML file:

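from lxml import etree
# Create an lxml.etree.HTMLParser object
parser = etree.HTMLParser()
print(type(parser))
# Read and parse test.html
tree = etree.parse('test.html', parser)
# Get the root node
root = tree.getroot()
# Serialize the HTML document in a readable (pretty-printed) form
result = etree.tostring(root, encoding='utf-8', pretty_print=True, method='html')
print(result.decode('utf-8'))
# Print the name of the root node
print(root.tag)
# Print the value of the root node's lang attribute
print('lang=', root.get('lang'))
# Print the value of the meta node's charset attribute (first child of <head>)
print('charset=', root[0][0].get('charset'))
# Print the text of the title node (second child of <head>)
print('title=', root[0][1].text)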

Output:

 

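Positional indexing such as `root[0][1]` depends on the exact order of the child nodes, so it breaks as soon as another tag is added to `<head>`. A minimal sketch of looking the same nodes up by tag name follows; it assumes the sample page from this post, inlined as a string (the variable `page` is introduced here only so the snippet runs without test.html).

```python
from lxml import etree

# Inline copy of the sample page so the sketch runs without test.html.
page = """<!DOCTYPE html>
<html lang="en">
<head><meta charset="UTF-8">
    <title>这是一个网页</title>
</head>
<body>
</body>
</html>"""

# etree.HTML parses a string and returns the root <html> element.
root = etree.HTML(page)

# Look nodes up by tag name rather than by position.
meta = root.find("head/meta")
charset = meta.get("charset") if meta is not None else None
title = root.findtext("head/title")

print("lang=", root.get("lang"))
print("charset=", charset)
print("title=", title)
```

`find` returns `None` when nothing matches, so the explicit check avoids an `AttributeError` on pages that lack the tag.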
---------------------------------------------------------------------------------------------------------------------------------

Use XPath to extract the title from test.html:

from lxml import etree
parser = etree.HTMLParser()
tree = etree.parse('test.html', parser)
# Use XPath to locate the title node; xpath() returns a list of matching nodes
titles = tree.xpath('/html/head/title')
if titles:
    print(titles[0].text)
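The query above returns element nodes, and the text is read from them afterwards. XPath can also return the text directly. The sketch below shows a few variants on a shortened, inlined version of the page (the `page` string and the tag name `nosuchtag` are illustrative assumptions, not part of test.html).

```python
from lxml import etree

# Shortened inline version of the sample page (assumption for illustration).
page = "<html><head><title>这是一个网页</title></head><body></body></html>"
tree = etree.HTML(page)

# text() selects text nodes, so the result is a list of strings.
texts = tree.xpath("/html/head/title/text()")

# string() converts the first matching node to a single string and
# yields "" (not an empty list) when nothing matches.
title = tree.xpath("string(/html/head/title)")
missing = tree.xpath("string(/html/head/nosuchtag)")

# // searches at any depth, so the full path is not required.
anywhere = tree.xpath("//title/text()")

print(texts, repr(title), repr(missing), anywhere)
```

Which form to use is a matter of taste: `text()` keeps the "list of matches" shape, while `string()` is convenient when exactly one value is expected.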

 

Define an HTML snippet and extract the href attribute value and the text of a specific <a> node:

from lxml import etree

html = '''
<div>
    <ul>
        <li class="item1"><a href="https://geekori.com">geekori.com</a></li>
        <li class="item2"><a href="https://jd.com">京东商城</a></li>
        <li class="item3"><a href="https://taobao.com">淘宝</a></li>
    </ul>
</div>
'''
# Parse the HTML code
tree = etree.HTML(html)
# Use XPath to locate the <a> node inside the <li> node whose class attribute is item2
aTags = tree.xpath("//li[@class='item2']/a")
if aTags:
    # Get the href attribute value and the text of that <a> node
    print(aTags[0].get('href'), aTags[0].text)
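The same idea extends to every `<li>` in the list. The following is a minimal sketch, not the only way to do it: it walks all `<li>` nodes and collects the class, the link target, and the link text of each. The list name `links` is introduced here for illustration.

```python
from lxml import etree

html = '''
<div>
    <ul>
        <li class="item1"><a href="https://geekori.com">geekori.com</a></li>
        <li class="item2"><a href="https://jd.com">京东商城</a></li>
        <li class="item3"><a href="https://taobao.com">淘宝</a></li>
    </ul>
</div>
'''
tree = etree.HTML(html)

# Collect (class, href, text) for every <li> that contains an <a> child.
links = []
for li in tree.xpath("//li"):
    a = li.find("a")
    if a is not None:
        links.append((li.get("class"), a.get("href"), a.text))

for css_class, href, text in links:
    print(css_class, href, text)
```

Note that the attributes live on different nodes: `class` belongs to the `<li>`, while `href` and the text belong to the `<a>` inside it.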

 

