#coding=utf-8
#xpath
# XPath's functionality is somewhat similar to regular expressions and to the soup object in BeautifulSoup
'''
1. Make sure everything passed to lxml is unicode.
2. The file-like object you get from urlopen(), or the file object you get from open() on disk, is not necessarily unicode.
3. unicode(file-like-object.read(), "utf-8") gives you something that is definitely unicode.
4. Convert this way first, then pass the result to lxml's fromstring.
5. The same applies to xml.etree.ElementTree.
6. Although lxml.html.parse() accepts a file-like object as an argument, don't use it: you can't tell whether a file-like object you pass in is unicode, and any Chinese text will come out garbled.
7. Always converting with unicode(file-like-object.read(), "utf-8") is certainly bad for performance, but for now this clumsy method is all I know.
'''
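A minimal sketch of notes 3-5 above, using the stdlib xml.etree.ElementTree (the same decode-first pattern applies to lxml's fromstring); the sample markup here is made up for illustration:

```python
# -*- coding: utf-8 -*-
from xml.etree import ElementTree

# Simulated bytes, as urlopen().read() or open().read() would return them
raw = u'<root><name>\u4e2d\u6587</name></root>'.encode('utf-8')

# Decode to unicode first (the Python 2 spelling is unicode(raw, "utf-8")),
# then hand the guaranteed-unicode text to fromstring
text = raw.decode('utf-8')
tree = ElementTree.fromstring(text)
print(tree.find('name').text)
```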
# XPath and HTML structure
# starts-with(@attr, "value")   string(.)
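A small self-contained sketch of the two XPath constructs named above, starts-with() and string(.); the toy markup is invented for illustration and only mirrors the class names used further down:

```python
# -*- coding: utf-8 -*-
from lxml import etree

# Toy markup (made up for illustration)
doc = etree.HTML(u'<div class="box box1">'
                 u'<a href="javascript:void(0);">first</a>'
                 u'<a href="/page">second</a></div>')

# starts-with(@attr, "prefix"): keep nodes whose attribute begins with a prefix
links = doc.xpath('//a[starts-with(@href, "javascript:")]/text()')

# string(.): concatenate all text under the context node
box = doc.xpath('//div[@class="box box1"]')[0]
print(links)
print(box.xpath('string(.)'))
```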
import urllib2
from lxml import etree
#1: pull link hrefs and category titles from yun.itheima.com
html=urllib2.urlopen('http://yun.itheima.com/').read()
html=unicode(html,'utf-8') # ensure unicode; alternatively set a header via res.add_header('User-Agent','Mozilla 5.10') on a Request
selector=etree.HTML(html)
name=selector.xpath('/html/body/div[2]/div/div[1]/ul/li/a/@href')
c1=selector.xpath('/html/body/div[4]/div/div[1]/div[1]/ul/li/a/text()')
for i in name:
    print i
print
for i in c1:
    print i
print "******************************"
s1=selector.xpath('/html/body/div[4]/div/div[2]/div[1]/ul/li/a[starts-with(@href,"javascript:void(0);")]/text()')
for i in s1:
    print i
data=selector.xpath('/html/body/div[4]/div/div[@class="box box1"]')[0]
info=data.xpath('string(.)').split()
for i in info:
    print i.encode('utf-8')
#2: scrape the qiushibaike.com hot page, adding a User-Agent header to the request
res=urllib2.Request('http://www.qiushibaike.com/hot/')
res.add_header('User-Agent','Mozilla 5.10')
html=urllib2.urlopen(res).read()
sel=etree.HTML(html)
things1=sel.xpath('//*[@class="article block untagged mb15"]/a[1]/div/span/text()')
things2=sel.xpath('//*[starts-with(@id,"qiushi_tag_")]/a[1]/div/span/text()')
name1=sel.xpath('//*[starts-with(@id,"qiushi_tag_")]/@class')
print str(len(things1))+" "+str(len(things2))
for i in things2:
    print i
print "***************************************"
data=sel.xpath('//*[@id="qiushi_tag_119111465"]')[0]
info=data.xpath('string(.)').split()
for i in info:
    print i
print "****************************************"