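PyQuery is a powerful and flexible library for parsing web pages. If regular expressions feel too cumbersome to write and BeautifulSoup's syntax is too hard to remember, and you already know jQuery's syntax, PyQuery is an excellent choice.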
Code implementation
'''
Partial page source
<tr class="mBg2" onmouseover="jQuery(this).addClass('bgon');" onmouseout="jQuery(this).removeClass('bgon');">
<td class="wd1">1</td>
<td class="wd2 bOS">
<a target="_blank" href="http://xf.house.163.com/gz/0SJN.html">南沙心意华庭</a></td>
<td class="wd3"><a href="#" onclick="gotrend('南沙心意华庭')"></a></td>
<td class="wd4">南沙</td>
<td class="wc5 bgOnS">4</td>
<td class="wc6">425</td>
<td class="wc7">--</td>
<td class="wc8">--</td>
<td class="wc9">658</td>
<td class="wc10">54202</td>
<td class="wc11">18</td>
<td class="wc12">1944</td>
'''
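from pyquery import PyQuery as pq

while True:
    url = 'http://data.house.163.com/'
    for j in range(3):  # retry count: attempt the request at most 3 times
        try:
            doc = pq(url=url, encoding='utf-8')  # fetch the HTML; retry if the request fails or times out
            break
        except Exception:
            continue
    else:
        # all 3 attempts failed, so there is no document to parse
        raise RuntimeError('could not fetch ' + url)
    if not doc('.pager_a.next-page'):  # stop when there is no next-page link
        break
    else:
        name = [i.text() for i in doc('.wd2.bOS a').items()]  # select nodes by class
        place = [i.text() for i in doc('.wd4').items()]
        # [1:] drops the first match of each class (presumably the header cell)
        number1 = [i.text() for i in doc('.wc5.bgOnS').items()][1:]
        area1 = [i.text() for i in doc('.wc6').items()][1:]
        number2 = [i.text() for i in doc('.wc9').items()][1:]
        area2 = [i.text() for i in doc('.wc10').items()][1:]
        number3 = [i.text() for i in doc('.wc11').items()][1:]
        area3 = [i.text() for i in doc('.wc12').items()][1:]
        quhua = [i.text() for i in doc('.wd14').items()]
        # The original snippet does not show how url advances to the next page.
        # Stop here so the loop does not request the same page forever.
        break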
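The retry loop above can also be pulled out into a small helper, so the scraping code stays focused on selectors. This is only a minimal sketch: `retry`, `attempts`, and `delay` are hypothetical names introduced here for illustration and are not part of PyQuery.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(func: Callable[[], T], attempts: int = 3, delay: float = 0.0) -> T:
    """Call func until it succeeds or attempts run out, then re-raise the last error.

    Hypothetical helper for illustration; not part of PyQuery.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as exc:  # remember the failure and try again
            last_error = exc
            if attempt < attempts - 1:
                time.sleep(delay)  # optional pause between attempts
    raise last_error
```

With this in place, the fetch in the loop could be written as `doc = retry(lambda: pq(url=url, encoding='utf-8'))`. Unlike a bare `except: continue`, the last exception is raised once every attempt has failed, so a dead connection is not silently ignored.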
pyquery is used in much the same way as BeautifulSoup, and it can also be combined with requests.