Scraping Attraction Comments from Ctrip with selenium

Latest recommended article published 2024-04-19 15:01:16

迟到不早退的边牧

Tags: python, web scraping, selenium, chrome

Article link: https://blog.csdn.net/liu_xuemin/article/details/118967823

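Step 1: Open Ctrip and get the URL of the attraction's page. In my case, I scraped the comments for the Enshi Grand Canyon scenic area in Enshi Prefecture, Hubei Province. The URL is: https://you.ctrip.com/sight/enshi487/51386.html#ctm_ref=www_hp_his_lst
Replace it with the URL of whichever Ctrip attraction page you need.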
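The `#ctm_ref=...` part of the URL above is a fragment. Browsers do not send fragments to the server, so the page should normally load the same without it, which makes it easier to swap in another attraction's address. A minimal sketch using the standard library's `urllib.parse.urldefrag`; the helper name `clean_sight_url` is made up for illustration:

```python
from urllib.parse import urldefrag


def clean_sight_url(url: str) -> str:
    """Return the URL without its '#...' fragment (hypothetical helper)."""
    return urldefrag(url).url


page_url = clean_sight_url(
    "https://you.ctrip.com/sight/enshi487/51386.html#ctm_ref=www_hp_his_lst"
)
print(page_url)
```

Either form of the URL can be passed to `driver.get` in Step 2.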

Step 2: Write the code.

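from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
import pandas as pd

# Drive Chrome through chromedriver. If chromedriver is not on your PATH, pass its path explicitly.
# (Selenium 4 style: the executable_path argument and the find_element(s)_by_* helpers were removed.)
driver = webdriver.Chrome(service=Service('C:/Users/***/Desktop/chromedriver.exe'))
# Open the attraction page.
driver.get('https://you.ctrip.com/sight/enshi487/51386.html#ctm_ref=www_hp_his_lst')
comment_list = []  # collected comment texts
for i in range(300):  # scrape 300 pages of comments
    print('Scraping page', i + 1)
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to the bottom of the page
    # Each comment sits in a div.commentDetail node; collect every comment on the current page.
    comments = driver.find_elements(By.CSS_SELECTOR, 'div.commentDetail')
    for comment in comments:
        comment_list.append(comment.text)
    # Turn the page: locate the ant-pagination-next node and click it via JavaScript.
    next_button = driver.find_element(By.CLASS_NAME, 'ant-pagination-next')
    driver.execute_script("arguments[0].click();", next_button)
    time.sleep(2)  # wait 2 seconds for the next page of comments to load

driver.quit()  # close the browser once scraping is done

comment_dataframe = pd.DataFrame(comment_list)  # convert the list to a pandas DataFrame

# Save the comments as CSV; adjust the path to your own setup.
# Use the utf_8_sig encoding (UTF-8 with a BOM), otherwise the CSV may show garbled characters when opened.
comment_dataframe.to_csv('C:/Users/***/Desktop/恩施大峡谷.csv', encoding='utf_8_sig')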
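Because the loop clicks "next" a fixed 300 times, the list may end up with blank or repeated entries, for example if the attraction has fewer than 300 pages of comments. A small sketch of a cleanup pass before building the DataFrame, assuming exact-duplicate texts are unwanted (two visitors who wrote the identical comment would be merged). `clean_comments` is a hypothetical helper, not part of the original script, and the sample strings are made up:

```python
def clean_comments(comments):
    """Strip whitespace, drop empty texts, and remove exact duplicates in first-seen order."""
    stripped = (text.strip() for text in comments)
    non_empty = (text for text in stripped if text)
    # dict keeps insertion order, so fromkeys() de-duplicates without reordering
    return list(dict.fromkeys(non_empty))


# Made-up sample standing in for scraped comment texts.
sample = ["Great views", "  Great views  ", "", "Worth the trip", "Great views"]
cleaned = clean_comments(sample)
print(cleaned)
```

In the script above, this would be applied as `pd.DataFrame(clean_comments(comment_list))`.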

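Note: Ctrip's pages are dynamic. When you click through to the next page of comments, the URL does not change, so the script uses selenium to drive Chrome through chromedriver and turn the pages automatically. You need to download chromedriver yourself, in a version that matches your installed Chrome.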
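Since the URL stays the same, one way to tell that a page turn has actually taken effect is to watch the content: remember the first comment before clicking, then poll until it changes, rather than always sleeping a fixed 2 seconds. The sketch below is plain Python and not tied to selenium. `wait_for_change` and its parameters are hypothetical names, and `read_value` stands for any function that returns the current first comment's text:

```python
import time


def wait_for_change(read_value, previous, timeout=10.0, interval=0.5):
    """Poll read_value() until it returns something other than `previous`.

    Returns the new value, or None if nothing changed within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        current = read_value()
        if current != previous:
            return current
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for a page whose first comment changes on the third read.
reads = iter(["old first comment", "old first comment", "new first comment"])
result = wait_for_change(lambda: next(reads), "old first comment", timeout=5, interval=0.01)
print(result)
```

In the scraper, `read_value` could be a function that returns the text of the first `div.commentDetail` element on the page. A `None` result would then suggest that the click did nothing, for instance on the last page. selenium also ships its own `WebDriverWait` class for this kind of polling.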
