#coding=utf-8
#xpath
# XPath's functionality is somewhat similar to regular expressions and to the soup object in BeautifulSoup
'''
1. Make sure everything passed to lxml is unicode.
2. The file-like object you get from urlopen(), or the file object you get from open() on disk, is not necessarily unicode.
3. unicode(file-like-object.read(), "utf-8") gives you something that is definitely unicode.
4. Convert this way first, then pass the result to lxml's fromstring.
5. The same applies to xml.etree.ElementTree.
6. Although lxml.html.parse() accepts a file-like object as an argument, don't use it: you can't tell whether a file-like object you pass in is unicode, and any Chinese text will come out garbled.
7. Always converting with unicode(file-like-object.read(), "utf-8") is certainly bad for performance, but for now this clumsy method is all I know.
'''
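A minimal sketch of notes 3-5 above, using the stdlib xml.etree.ElementTree (the same decode-first pattern applies to lxml's fromstring); the sample markup here is made up for illustration:

```python
# -*- coding: utf-8 -*-
from xml.etree import ElementTree

# Simulated bytes, as urlopen().read() or open().read() would return them
raw = u'<root><name>\u4e2d\u6587</name></root>'.encode('utf-8')

# Decode to unicode first (the Python 2 spelling is unicode(raw, "utf-8")),
# then hand the guaranteed-unicode text to fromstring
text = raw.decode('utf-8')
tree = ElementTree.fromstring(text)
print(tree.find('name').text)
```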
# XPath and HTML structure
# starts-with(@attr, "value")   string(.)
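A small self-contained sketch of the two XPath constructs named above, starts-with() and string(.); the toy markup is invented for illustration and only mirrors the class names used further down:

```python
# -*- coding: utf-8 -*-
from lxml import etree

# Toy markup (made up for illustration)
doc = etree.HTML(u'<div class="box box1">'
                 u'<a href="javascript:void(0);">first</a>'
                 u'<a href="/page">second</a></div>')

# starts-with(@attr, "prefix"): keep nodes whose attribute begins with a prefix
links = doc.xpath('//a[starts-with(@href, "javascript:")]/text()')

# string(.): concatenate all text under the context node
box = doc.xpath('//div[@class="box box1"]')[0]
print(links)
print(box.xpath('string(.)'))
```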
import urllib2
from lxml import etree
#1: pull link hrefs and category titles from yun.itheima.com
html=urllib2.urlopen('http://yun.itheima.com/').read()
html=unicode(html,'utf-8') # ensure unicode; alternatively set a header via res.add_header('User-Agent','Mozilla 5.10') on a Request
selector=etree.HTML(html)
name=selector.xpath('/html/body/div[2]/div/div[1]/ul/li/a/@href')
c1=selector.xpath('/html/body/div[4]/div/div[1]/div[1]/ul/li/a/text()')
for i in name:
    print i
print
for i in c1:
    print i
print "******************************"
s1=selector.xpath('/html/body/div[4]/div/div[2]/div[1]/ul/li/a[starts-with(@href,"javascript:void(0);")]/text()')
for i in s1:
    print i
data=selector.xpath('/html/body/div[4]/div/div[@class="box box1"]')[0]
info=data.xpath('string(.)').split()
for i in info:
    print i.encode('utf-8')
#2: scrape the qiushibaike.com hot page, adding a User-Agent header to the request
res=urllib2.Request('http://www.qiushibaike.com/hot/')
res.add_header('User-Agent','Mozilla 5.10')
html=urllib2.urlopen(res).read()
sel=etree.HTML(html)
things1=sel.xpath('//*[@class="article block untagged mb15"]/a[1]/div/span/text()')
things2=sel.xpath('//*[starts-with(@id,"qiushi_tag_")]/a[1]/div/span/text()')
name1=sel.xpath('//*[starts-with(@id,"qiushi_tag_")]/@class')
print str(len(things1))+" "+str(len(things2))
for i in things2:
    print i
print "***************************************"
data=sel.xpath('//*[@id="qiushi_tag_119111465"]')[0]
info=data.xpath('string(.)').split()
for i in info:
    print i
print "****************************************"