This is a consolidated reference for the data-parsing step that follows data scraping, meant to be easy to look up and to keep the different tools from getting mixed up.
The techniques covered are XPath, BeautifulSoup, PyQuery, and re (regular expressions).
First, two sample HTML snippets are given as running examples for the comparisons below.
Before parsing, the HTML source has to be converted into the corresponding parser object; each library does this as follows:
XPath:
In [7]: from lxml import etree
In [8]: text = etree.HTML(html)
BeautifulSoup:
In [2]: from bs4 import BeautifulSoup
In [3]: soup = BeautifulSoup(html, 'lxml')
PyQuery:
In [10]: from pyquery import PyQuery as pq
In [11]: doc = pq(html)
re: no object is needed; the regular expression is matched directly against the raw string (a combined setup sketch follows the Example 1 HTML below).
Example 1
html = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>
'''
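To make the console snippets below easy to reproduce, here is a minimal combined setup sketch that builds the three parser objects from the html string above; re needs no object and works on the raw string:

from lxml import etree
from bs4 import BeautifulSoup
from pyquery import PyQuery as pq
import re

text = etree.HTML(html)             # lxml tree, queried with XPath
soup = BeautifulSoup(html, 'lxml')  # BeautifulSoup document
doc = pq(html)                      # PyQuery document
# re.findall(pattern, html) is applied directly to the raw html string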
Next, let's analyze this sample HTML with each of the parsing methods.
Extract the title text:
XPath:
In [16]: text.xpath('//title/text()')[0]
Out[16]: "The Dormouse's story"
BeautifulSoup:
In [18]: soup.title.string
Out[18]: "The Dormouse's story"
PyQuery:
In [20]: doc('title').text()
Out[20]: "The Dormouse's story"
re:
In [11]: re.findall(r'<title>(.*?)</title>', html)[0]
Out[11]: "The Dormouse's story"
Extract the href attribute of the third a tag:
XPath: # recommended
In [36]: text.xpath('//a[@id="link3"]/@href')[0]
Out[36]: 'http://example.com/tillie'
BeautifulSoup:
In [27]: soup.find_all(attrs={'id':'link3'})
Out[27]: [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
In [33]: soup.find_all(attrs={'id':'link3'})[0].attrs['href']
Out[33]: 'http://example.com/tillie'
PyQuery: # recommended
In [45]: doc("#link3").attr.href
Out[45]: 'http://example.com/tillie'
re:
In [46]: re.findall(r'<a href="(.*?)" class="sister" id="link3">Tillie</a>;', html)[0]
Out[46]: 'http://example.com/tillie'
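As a small aside (not part of the original snippets), both object-based libraries also offer a one-liner for this lookup; a minimal sketch, assuming the soup and doc objects from the setup above:

soup.find(id='link3')['href']    # a BeautifulSoup Tag supports dict-style attribute access
doc('#link3').attr('href')       # callable form, equivalent to doc('#link3').attr.href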
Extract the full text of the story p tag:
XPath:
In [48]: text.xpath('string(//p[@class="story"])').strip()
Out[48]: 'Once upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.'
In [51]: ' '.join(text.xpath('string(//p[@class="story"])').split('\n'))
Out[51]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well.'
BeautifulSoup:
In [89]: ' '.join(list(soup.body.stripped_strings)).replace('\n', '')
Out[89]: "The Dormouse's story Once upon a time there were three little sisters; and their names were Elsie,Lacie and Tillie; and they lived at the bottom of a well. ..."
PyQuery:
In [99]: doc('.story').text()
Out[99]: 'Once upon a time there were three little sisters; and their names were Elsie, Lacie and Tillie; and they lived at the bottom of a well. ...'
re: not recommended here, it is far too cumbersome
In [101]: re.findall(r'<p class="story">(.*?)<a href="http://example.com/elsie" class="sister" id="link1">(.*?)</a>(.*?)<a href="http://example.com/lacie" class="sister" id="link2">(.*?)</a>(.*?)<a href="http://example.com/tillie" class="sister" id="link3">(.*?)</a>;(.*?)</p>', html, re.S)[0]
Out[101]: ('Once upon a time there were three little sisters; and their names were\n', 'Elsie', ',\n', 'Lacie', ' and\n', 'Tillie', '\nand they lived at the bottom of a well.')
Example 2
html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''
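The Example 2 snippets below assume the parser objects have been rebuilt from this new html string, with the same calls as before; a minimal sketch (imports as in the setup for Example 1):

text = etree.HTML(html)             # rebuild the lxml tree
soup = BeautifulSoup(html, 'lxml')  # rebuild the BeautifulSoup document
doc = pq(html)                      # rebuild the PyQuery document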
Extract the text "second item":
XPath:
In [14]: text.xpath('//li[2]/a/text()')[0]
Out[14]: 'second item'
BeautifulSoup:
In [23]: soup.find_all(attrs={'class': 'item-1'})[0].string
Out[23]: 'second item'
PyQuery:
In [34]: doc('.item-1>a')[0].text
Out[34]: 'second item'
re:
In [35]: re.findall(r'<li class="item-1"><a href="link2.html">(.*?)</a>', html)[0]
Out[35]: 'second item'
Extract the href attribute inside the fifth li tag:
XPath:
In [36]: text.xpath('//li[@class="item-0"]/a/@href')[0]
Out[36]: 'link5.html'
BeautifulSoup:
In [52]: soup.find_all(attrs={'class': 'item-0'})
Out[52]:
[<li class="item-0">first item</li>,
 <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>,
 <li class="item-0"><a href="link5.html">fifth item</a></li>]
In [53]: soup.find_all(attrs={'class': 'item-0'})[-1].a.attrs['href']
Out[53]: 'link5.html'
PyQuery:
In [75]: [i.attr.href for i in doc('.item-0 a').items()][1]
Out[75]: 'link5.html'
re:
In [95]: re.findall(r'<li class="item-0"><a href="(.*?)">fifth item</a></li>', html)[0]
Out[95]: 'link5.html'
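To wrap up this snippet (a small addition, not from the original console session), all four tools can also collect every href in one expression; a minimal sketch, assuming the objects rebuilt above:

text.xpath('//li/a/@href')                     # ['link2.html', 'link3.html', 'link4.html', 'link5.html']
[a['href'] for a in soup.find_all('a')]        # same list with BeautifulSoup
[a.attr('href') for a in doc('li a').items()]  # same list with PyQuery
re.findall(r'href="(.*?)"', html)              # same list with a plain regex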
Example 3
html = '''
<li><span class="label">房屋用途</span>普通住宅</li>
'''
Extract 房屋用途 and 普通住宅 separately:
XPath:
In [47]: text.xpath('//li/span/text()')[0]
Out[47]: '房屋用途'
In [49]: text.xpath('//li/text()')[0]
Out[49]: '普通住宅'
BeautifulSoup:
In [65]: soup.span.string
Out[65]: '房屋用途'
In [69]: soup.li.contents[1]  # contents returns the tag's direct child nodes
Out[69]: '普通住宅'
PyQuery:
In [70]: doc('li span').text()
Out[70]: '房屋用途'
In [75]: doc('li .label')[0].tail
Out[75]: '普通住宅'
re: omitted
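The .tail trick above relies on how lxml models mixed content: indexing a PyQuery selection returns the underlying lxml element, whose .text is the text inside the tag and whose .tail is the text that follows its closing tag but still belongs to the parent. A minimal self-contained sketch:

from lxml import etree

span = etree.HTML('<li><span class="label">房屋用途</span>普通住宅</li>').xpath('//span')[0]
print(span.text)   # 房屋用途 -> the text inside <span>
print(span.tail)   # 普通住宅 -> the text after </span>, i.e. the rest of the <li>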
Example 4
26667元/平米