Parsing Web Page Data with the requests Library and XPath

Latest recommended article published on 2024-08-13 18:29:56

小马-向前冲

Article link: https://blog.csdn.net/weixin_62383575/article/details/128520628

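Task description:

Use requests and XPath to extract each attraction, its URL, and the comments for that attraction, then save the results to a txt file and a csv file.

Task implementation: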

import csv

import requests
from lxml import etree

# Listing page of Shandong attractions on Qunar
url = 'https://travel.qunar.com/search/place/23-shandong-298984/4-----0/1'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'}

r = requests.get(url, headers=headers)
r.encoding = 'utf-8'
print(r.status_code)
tree = etree.HTML(r.text)

# Detail-page link of every attraction in the result list
links = tree.xpath('//div[@class="right_bar"]/ul/li/div[2]/h2/a/@href')

for link in links:
    # Send the same browser-like headers to the detail page
    rep = requests.get(link, headers=headers)
    rep.encoding = 'utf-8'
    detail = etree.HTML(rep.text)

    # Paragraphs describing the attraction
    info = detail.xpath('//*[@id="gs"]/div[1]//p/text()')
    # Comment text embedded in the detail page HTML; further pages of
    # comments would have to come from the site's comment endpoint, which is not covered here
    comments = detail.xpath('//*[@id="comment_box"]//li//div[1]//div//div[3]//p[1]/text()')

    # One line per attraction: URL, attraction info, comments
    line = "网址：" + link + "，景点信息：" + str(info) + "，评论信息：" + str(comments)
    with open('./评论.txt', 'a', encoding="utf-8") as fp:
        fp.write(line + '\n')
    # Write a properly quoted CSV row so commas inside the text do not split columns
    with open('./评论.csv', 'a', encoding="utf-8", newline='') as f:
        csv.writer(f).writerow([link, ' '.join(info), ' '.join(comments)])
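
The XPath expressions in the script depend on the live page layout, so it can help to try the same pattern on a small static snippet first. The following is a minimal sketch under stated assumptions: `SAMPLE_HTML`, the `example.com` links, the attraction names, and the helper `extract_attractions` are all hypothetical stand-ins, and the real Qunar markup may differ.

```python
from lxml import etree

# Hypothetical stand-in for the listing page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<div class="right_bar">
  <ul>
    <li>
      <div class="img"><img src="a.jpg"/></div>
      <div class="info"><h2><a href="https://example.com/p-1">Mount Tai</a></h2></div>
    </li>
    <li>
      <div class="img"><img src="b.jpg"/></div>
      <div class="info"><h2><a href="https://example.com/p-2">Baotu Spring</a></h2></div>
    </li>
  </ul>
</div>
</body></html>
"""


def extract_attractions(html):
    """Return (name, link) pairs for every <li> in the result list."""
    tree = etree.HTML(html)
    results = []
    for li in tree.xpath('//div[@class="right_bar"]/ul/li'):
        # A relative XPath (leading "./") is evaluated against this <li> only,
        # which keeps each name paired with its own link.
        names = li.xpath('./div[2]/h2/a/text()')
        links = li.xpath('./div[2]/h2/a/@href')
        if names and links:
            results.append((names[0].strip(), links[0]))
    return results


attractions = extract_attractions(SAMPLE_HTML)
for name, link in attractions:
    print(name, link)
```

Iterating over the `<li>` nodes and querying relative to each one avoids the misalignment that can occur when names and links are collected in two separate page-wide queries and one item is missing a field.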

Stored results:

csv file:
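
Scraped descriptions and comments often contain commas, so writing them through Python's standard `csv` module keeps every field in its own column. The sketch below is only an illustration: the column names, the sample row, and the helper `write_rows` are assumptions, and it writes to an in-memory buffer instead of a file so that it can be run without touching the disk.

```python
import csv
import io


def write_rows(stream, rows):
    """Write a header plus one CSV row per attraction to a text stream."""
    writer = csv.writer(stream)
    writer.writerow(["url", "info", "comments"])
    for url, info, comments in rows:
        # Join the lists returned by XPath into single cells.
        writer.writerow([url, " ".join(info), " | ".join(comments)])


# Hypothetical sample data in the shape of (url, info list, comment list).
rows = [
    (
        "https://example.com/p-1",
        ["A famous mountain, in Shandong."],
        ["Great view", "Crowded, but worth it"],
    ),
]

buffer = io.StringIO()
write_rows(buffer, rows)

# Read the text back to confirm that embedded commas stayed inside their cells.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed)
```

When writing to a real file, opening it with `newline=''` is the usage recommended by the `csv` module documentation, since the writer emits its own line endings.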