Python 3 -- Parsing the Douban Top 250 with XPath
Development tool: PyCharm
Environment: macOS, Python 3
Target URL: https://movie.douban.com/top250 (for learning purposes only)
1. Environment setup: install the required packages
Run this directly in PyCharm's terminal (requests is needed as well, if it is not already installed):
pip install requests lxml
2. Import the packages
import requests
from lxml import etree
3. Now on to the code itself. Define the request headers:
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}
4. Declare the main function
While learning, I found XPath to be quite flexible: there is no single standard way to write an expression, and, as you can see below, different expressions can produce the same output.
def main():
    url = 'https://movie.douban.com/top250'
    responseIndex = requests.get(url, headers=headers)
    selector = etree.HTML(responseIndex.text)
    # Parse starting from the #content node
    resultContentJson = selector.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
    # Parse starting from the ol node
    resultOLJson = selector.xpath('//ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
    print('Parsed from content: ' + resultContentJson)
    print('Parsed from the ol node: ' + resultOLJson)

main()
Output:
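The two XPath expressions in the code above really do select the same node. A minimal, self-contained sketch makes this checkable offline; the HTML fragment below is made up to mimic the page's structure (it is not the live site's markup), and the shorter expressions use `//a` where the real page needs the full `div/div[2]/div[1]/a` chain:

```python
from lxml import etree

# Made-up fragment mimicking the Top 250 list structure (hypothetical sample data).
html = '''
<div id="content"><div><div>
  <ol>
    <li><div><div class="hd"><div><a><span>The Shawshank Redemption</span></a></div></div></div></li>
  </ol>
</div></div></div>
'''
selector = etree.HTML(html)

# Absolute path anchored at the #content node...
by_content = selector.xpath('//*[@id="content"]/div/div[1]/ol/li[1]//a/span[1]/text()')[0]
# ...and a shorter path anchored at the ol node.
by_ol = selector.xpath('//ol/li[1]//a/span[1]/text()')[0]

print(by_content == by_ol)  # prints True: both expressions reach the same <span>
```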
5. Grabbing all the titles: just add a loop
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

def main():
    url = 'https://movie.douban.com/top250'
    responseIndex = requests.get(url, headers=headers)
    selector = etree.HTML(responseIndex.text)
    # Parse starting from the #content node
    resultContentJson = selector.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
    # Parse starting from the ol node
    resultOLJson = selector.xpath('//ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
    print('Parsed from content: ' + resultContentJson)
    print('Parsed from the ol node: ' + resultOLJson)

    # Loop over the titles
    resultTitleJson = selector.xpath('//ol/li/div/div[2]/div[1]/a/span[1]/text()')
    resultJson = selector.xpath('//div[@class="info"]/div/a/span[1]/text()')
    for moviesTitle in resultTitleJson:
        print(moviesTitle)
    for moviesJson in resultJson:
        print(moviesJson)

main()
Note that
selector.xpath('//*[@id="content"]/div/div[1]/ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
and
selector.xpath('//ol/li[1]/div/div[2]/div[1]/a/span[1]/text()')[0]
have exactly the same effect! Make a note of it: the results are identical; only the way the expression is written differs.
Output:
6. Applying the same idea to the other fields. The complete code:
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
}

def main():
    url = 'https://movie.douban.com/top250'
    responseIndex = requests.get(url, headers=headers)
    for item in parse_result(responseIndex):
        print(item)

def parse_result(responseIndex):
    selector = etree.HTML(responseIndex.text)
    resultInfo = selector.xpath('//ol[@class="grid_view"]/li')
    for getResult in resultInfo:
        movies_index = getResult.xpath('./div/div/em/text()')[0]
        movies_name = getResult.xpath('./div/div[2]/div[1]/a/span[1]/text()')[0]
        # strip() removes the surrounding whitespace in the raw text node
        movies_author = getResult.xpath('./div/div[2]/div[2]/p/text()')[0].strip()
        movies_star = getResult.xpath('./div/div[2]/div[2]/div/span[2]/text()')[0]
        movies_num = getResult.xpath('./div/div[2]/div[2]/div/span[4]/text()')[0]
        movies_pic = getResult.xpath('./div/div/a/img/@src')[0]
        # print(movies_index, movies_name, movies_star, movies_num, movies_pic, movies_author)
        yield {
            'index': movies_index,
            'name': movies_name,
            'star': movies_star,
            'contentNum': movies_num,
            'pic': movies_pic,
            'author': movies_author
        }

main()
Output:
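The per-item pattern above (select every li node, then evaluate relative `./` paths inside each one, yielding a dict per movie) can be exercised offline too. A sketch with made-up sample markup, not the live page:

```python
from lxml import etree

# Made-up fragment with two list items, loosely mimicking the grid_view structure.
html = '''
<ol class="grid_view">
  <li><div><div class="pic"><em>1</em></div>
      <div class="info"><div class="hd"><a><span>Movie A</span></a></div></div></div></li>
  <li><div><div class="pic"><em>2</em></div>
      <div class="info"><div class="hd"><a><span>Movie B</span></a></div></div></div></li>
</ol>
'''

def parse(selector):
    # Paths starting with './' are evaluated relative to each <li> in turn.
    for li in selector.xpath('//ol[@class="grid_view"]/li'):
        yield {
            'index': li.xpath('.//em/text()')[0],
            'name': li.xpath('.//span/text()')[0],
        }

items = list(parse(etree.HTML(html)))
print(items)
```

Because parse is a generator, each dict is produced lazily as the caller iterates, just like parse_result in the full code.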
If you need to save the results to a local file, there are plenty of examples online you can refer to. That's all for today.
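As one option for saving locally, the dicts yielded by parse_result can be written out as JSON Lines (one JSON object per line). The movie list and filename below are made-up placeholders; in practice you would iterate over parse_result(responseIndex) instead:

```python
import json

# Hypothetical sample of what parse_result yields.
movies = [
    {'index': '1', 'name': 'Movie A', 'star': '9.7'},
    {'index': '2', 'name': 'Movie B', 'star': '9.6'},
]

# ensure_ascii=False keeps Chinese titles human-readable in the file.
with open('douban_top250.jsonl', 'w', encoding='utf-8') as f:
    for movie in movies:
        f.write(json.dumps(movie, ensure_ascii=False) + '\n')
```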