Parsing HTML with XPath in Python


爱唱歌de小青蛙


This is an original article by the author and may not be reproduced without the author's permission.

Article link: https://blog.csdn.net/xm_csdn/article/details/64444355


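When scraping web pages, locating the right HTML nodes is the key to extracting the data you want. Regular expressions sometimes cannot match the target at all, or cannot match it precisely. In those cases you can use the lxml module (built for analysing XML document structure, but equally able to handle HTML) and match against the HTML with the xpath method that lxml.html provides.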
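As a minimal sketch of the idea, the snippet below parses an inline HTML string with lxml.html and selects nodes by XPath. The HTML content and the variable names are made up for illustration; no network access is involved.

```python
from lxml import html

# Hypothetical HTML snippet, used only for illustration.
SAMPLE_HTML = """
<html><body>
  <div id="news">
    <a href="/a.html">First</a>
    <a href="/b.html">Second</a>
  </div>
  <div id="ads"><a href="/ad.html">Ad</a></div>
</body></html>
"""

# Parse the string into an element tree.
doc = html.fromstring(SAMPLE_HTML)

# Select only the links inside the div whose id is "news".
links = doc.xpath('//div[@id="news"]/a')
titles = [a.text for a in links]

# An XPath ending in /@href returns the attribute values as strings.
hrefs = doc.xpath('//div[@id="news"]/a/@href')

print(titles)
print(hrefs)
```

The predicate [@id="news"] is what a regular expression struggles to express reliably: it restricts the match by document structure rather than by surrounding text.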

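First, we need to install a Python library that supports XPath. The Python binding currently recommended on the libxml2 website is lxml. BeautifulSoup is another popular HTML parsing option (although it does not evaluate XPath expressions itself), and if you don't mind the extra work you can build your own matcher with regular expressions. This article uses lxml as its example.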
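lxml is typically installed with pip install lxml. Once it is installed, a quick check along the following lines can confirm that the library imports and that XPath queries run. This is only a sketch; the HTML fragment is a made-up example.

```python
from lxml import etree

# LXML_VERSION and LIBXML_VERSION are version tuples exposed by lxml.etree.
print("lxml version:", etree.LXML_VERSION)
print("libxml2 version:", etree.LIBXML_VERSION)

# Smoke test: parse a small fragment and run one XPath query against it.
root = etree.HTML("<p>hello</p>")
result = root.xpath("//p/text()")
print(result)
```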

Using lxml directly:

------------------------------------------------------------------------------------------------------

# basics.py
# -*- coding: utf-8 -*-
# Ported to Python 3: urllib2 is now urllib.request / urllib.error, and the
# reload(sys) / sys.setdefaultencoding workaround is no longer needed.
import urllib.error
import urllib.request

from lxml import etree


def basics_xpath(url, str_xpath):
    # Define the request headers
    head = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}
    try:
        # Build the request
        request = urllib.request.Request(url, headers=head)
        # Fetch the page with urlopen
        response = urllib.request.urlopen(request)
        # Read the page and decode it as UTF-8
        page_code = response.read().decode('utf-8')
    except urllib.error.URLError as e:
        if hasattr(e, "reason"):
            print("Connection failed, reason:", e.reason)
        # Always stop here: page_code is undefined when the request fails
        return None
    # Parse the HTML with etree
    tree = etree.HTML(page_code)
    # Filter the page content with the XPath expression
    items = tree.xpath(str_xpath)
    return list(items)

----------------------------------------------------------------------------------

# perform.py
import basics


def page_data():
    # Target URL (urlopen requires the scheme, so "http://" is included)
    url = "http://www.baidu.com"
    # XPath rule. "//a" is only a placeholder: the original left this empty,
    # and an empty expression is not valid XPath. Replace it with your own rule.
    str_xpath = "//a"
    pg_data = basics.basics_xpath(url, str_xpath)
    # basics_xpath returns None when the request fails
    for item in pg_data or []:
        print(item.text)


if __name__ == '__main__':
    page_data()
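One thing to watch when printing item.text: an XPath that ends in /text() or /@href returns strings rather than elements, and an element's .text holds only the text before its first child. The sketch below shows one way to handle both cases. The helper item_to_text and the sample markup are hypothetical and not part of the original code.

```python
from lxml import etree

# Hypothetical page content, standing in for a downloaded page.
page = '<ul><li><a href="/x">X <b>bold</b></a></li><li><a href="/y">Y</a></li></ul>'
tree = etree.HTML(page)


def item_to_text(item):
    """Return readable text for either an element or a string result."""
    if isinstance(item, str):
        # Results of .../text() or .../@attr are already strings.
        return str(item)
    # Elements: join every text node inside the element.
    return "".join(item.itertext())


elements = tree.xpath("//a")
hrefs = tree.xpath("//a/@href")

element_texts = [item_to_text(i) for i in elements]
href_texts = [item_to_text(i) for i in hrefs]

print(element_texts)
print(href_texts)
```

With a helper like this, the printing loop can accept whatever XPath rule is supplied, instead of assuming that every result is an element.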
