1. XPath
XPath is a language for locating information in XML documents.
A tutorial is available here: http://www.runoob.com/xpath/xpath-intro.html
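To make the idea concrete, here is a minimal sketch of a few XPath queries run through lxml. The bookstore document is made up for illustration and is not taken from the tutorial or from the site scraped later in this post.

```python
from lxml import etree

# Hypothetical sample document, invented for this example only.
xml = b"""<bookstore>
  <book category="web"><title>Learning XML</title><price>39.95</price></book>
  <book category="cooking"><title>Everyday Italian</title><price>30.00</price></book>
</bookstore>"""

root = etree.fromstring(xml)

# '//' searches at any depth; text() returns the text nodes of the match.
titles = root.xpath('//book/title/text()')

# A predicate in square brackets filters by attribute value.
web_titles = root.xpath('//book[@category="web"]/title/text()')

# Predicates can also compare the value of a child element.
cheap_titles = root.xpath('//book[price < 35]/title/text()')

print(titles)
print(web_titles)
print(cheap_titles)
```

Note that xpath() returns a list even when there is a single match, which is why the scraping code in this post indexes the results with [0].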
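2. Scraping DXY (丁香园) with XPath
I explained the usual XPath routine fairly thoroughly in the second post of my crawler series.
The target page: http://www.dxy.cn/bbs/thread/626626#626626
(1) Look up the XPath of a comment: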
//*[@id="post_1"]/table/tbody/tr/td[2]/div[2]/div[2]/table/tbody/tr/td
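One thing to watch with paths copied from the browser: browsers insert a tbody element into tables when they build the DOM, while lxml parses the raw HTML as served and does not add one. Whether this particular page ships tbody in its source is not checked here, so the following is only a sketch on made-up markup showing what happens when it is absent, and one tolerant alternative.

```python
from lxml import etree

# Hypothetical raw HTML with no <tbody>, as a server might send it.
html = ("<html><body><div id='post_1'>"
        "<table><tr><td>hello</td></tr></table>"
        "</div></body></html>")
doc = etree.HTML(html)

# Path as a browser would report it: matches nothing in this markup.
with_tbody = doc.xpath('//*[@id="post_1"]/table/tbody/tr/td/text()')

# Same path without tbody: matches.
without_tbody = doc.xpath('//*[@id="post_1"]/table/tr/td/text()')

# '//' skips intermediate levels, so it works with or without tbody.
tolerant = doc.xpath('//*[@id="post_1"]/table//td/text()')

print(with_tbody, without_tbody, tolerant)
```

If a copied path returns an empty list, checking for tbody in the page source is a cheap first step.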
The boilerplate code:
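from lxml import etree
import requests

# Download the thread page and parse it into an element tree.
url = 'http://www.dxy.cn/bbs/thread/626626#626626'
data = requests.get(url).text
s = etree.HTML(data)
# XPath copied from the browser for the first post's comment cell;
# the trailing text() returns the text nodes inside that cell.
info = s.xpath('//*[@id="post_1"]/table/tbody/tr/td[2]/div[2]/div[2]/table/tbody/tr/td/text()')
print(info)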
The result is as follows:
Next, look at the XPath of each comment:
//*[@id="post_1"]/table/tbody/tr/td[2]/div[2]/div[2]/table/tbody/tr/td
//*[@id="post_2"]/table/tbody/tr/td[2]/div[2]/div[1]/table/tbody/tr/td
//*[@id="post_3"]/table/tbody/tr/td[2]/div[2]/div[1]/table/tbody/tr/td
//*[@id="post_4"]/table/tbody/tr/td[2]/div[2]/div[1]/table/tbody/tr/td
Then the XPath of each user name:
//*[@id="post_1"]/table/tbody/tr/td[1]/div[2]/a
//*[@id="post_2"]/table/tbody/tr/td[1]/div[2]/a
//*[@id="post_3"]/table/tbody/tr/td[1]/div[2]/a
//*[@id="post_4"]/table/tbody/tr/td[1]/div[2]/a
Points:
//*[@id="post_1"]/table/tbody/tr/td[1]/div[4]/ul/li[1]/div/a
//*[@id="post_2"]/table/tbody/tr/td[1]/div[4]/ul/li[1]/div/a
//*[@id="post_3"]/table/tbody/tr/td[1]/div[4]/ul/li[1]/div/a
//*[@id="post_4"]/table/tbody/tr/td[1]/div[4]/ul/li[1]/div/a
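The paths listed above differ only in the post number in the id. Besides formatting that number into the path, another option is to select every post node first and then query relative to each one. The sketch below uses simplified, made-up markup; only the post_N id pattern mirrors the real page, and the inner structure (a, p) is an assumption for illustration.

```python
from lxml import etree

# Simplified, hypothetical markup; only the id pattern mirrors the real page.
html = """
<html><body>
<div id="post_1"><a>alice</a><p>first comment</p></div>
<div id="post_2"><a>bob</a><p>second comment</p></div>
<div id="other"><a>nav</a><p>not a post</p></div>
</body></html>
"""
doc = etree.HTML(html)

# starts-with() is a standard XPath 1.0 function; it picks every post_N node.
posts = doc.xpath('//div[starts-with(@id, "post_")]')

rows = []
for post in posts:
    # A leading './' makes the query relative to the current post node.
    name = post.xpath('./a/text()')
    text = post.xpath('./p/text()')
    rows.append((post.get('id'), name[0], text[0]))

for row in rows:
    print(row)
```

This avoids hard-coding how many posts the page has, at the cost of having to write the relative paths by hand rather than copying them from the browser.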
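Because the last div in the comment path is not always the same (div[2] for the first post, div[1] for the others), I used a branch. The code is as follows: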
from lxml import etree
import requests

url = 'http://www.dxy.cn/bbs/thread/626626#626626'
data = requests.get(url).text
s = etree.HTML(data)

for i in range(1, 5):
    # The name and the points sit at the same path in every post.
    name = s.xpath('//*[@id="post_{}"]/table/tbody/tr/td[1]/div[2]/a/text()'.format(i))
    scores = s.xpath('//*[@id="post_{}"]/table/tbody/tr/td[1]/div[4]/ul/li[1]/div/a/text()'.format(i))
    try:
        # Most posts keep the comment under div[1].
        info = s.xpath('//*[@id="post_{}"]/table/tbody/tr/td[2]/div[2]/div[1]/table/tbody/tr/td/text()'.format(i))
        info1 = info[0].replace(" ", "").replace("\n", "")
    except IndexError:
        # Nothing matched under div[1] (as for post_1), so fall back to div[2].
        info = s.xpath('//*[@id="post_{}"]/table/tbody/tr/td[2]/div[2]/div[2]/table/tbody/tr/td/text()'.format(i))
        info1 = info[0].replace(" ", "").replace("\n", "")
    print(name[0])
    print(info1)
    print(scores[0])
Result:
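The div[1] / div[2] fallback can also be written without exceptions, by trying a list of candidate paths in order and keeping the first one that matches. This is only a sketch: first_text is a hypothetical helper, not something from lxml, and the markup is invented to reproduce the "comment sits under a different div" situation.

```python
from lxml import etree

def first_text(doc, *paths):
    """Return the cleaned first text node of the first matching path, else None."""
    for path in paths:
        found = doc.xpath(path)
        if found:
            # Same cleanup as in the post: drop spaces and newlines.
            return found[0].replace(" ", "").replace("\n", "")
    return None

# Hypothetical markup: post_1 has an extra leading div, post_2 does not.
html = """
<html><body>
<div id="post_1"><div>banner</div><div><p>  first-comment \n</p></div></div>
<div id="post_2"><div><p>second-comment</p></div></div>
</body></html>
"""
doc = etree.HTML(html)

results = []
for i in (1, 2, 3):
    text = first_text(
        doc,
        '//*[@id="post_{}"]/div[1]/p/text()'.format(i),
        '//*[@id="post_{}"]/div[2]/p/text()'.format(i),
    )
    results.append(text)

print(results)
```

A missing post (post_3 here) yields None instead of raising, which makes it easier to skip incomplete entries explicitly.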