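Scraping TripAdvisor reviews with Python (this used to work, but the page structure seems to have changed and it no longer does)

A TripAdvisor scraper implemented in Python

This scraper works in two main steps. First, the review text is indexed by review ID, so step one is to collect the list of IDs needed, using whichever filters apply. Second, the collected ID list is passed as the data of a requests POST request: the request is sent, the response is parsed, and the data is stored.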
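The two steps can be tried offline before touching the network. The sketch below is only an illustration: the HTML fragment is hand-written (not real TripAdvisor markup), the helper names `extract_review_ids` and `extract_review_texts` are hypothetical, and it swaps in the standard library's `xml.etree.ElementTree` so that it runs without third-party packages. The script itself uses lxml with equivalent XPath expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written stand-in for one review-list page
# (assumed structure, not real TripAdvisor markup).
SAMPLE = """
<div>
  <div class="review-container" data-reviewid="111">
    <p class="partial_entry">Great <b>view</b> from the top.</p>
  </div>
  <div class="review-container" data-reviewid="222">
    <p class="partial_entry">Crowded at noon.</p>
  </div>
</div>
"""


def extract_review_ids(root):
    """Step 1: collect the data-reviewid attribute of every review container."""
    return [div.get("data-reviewid")
            for div in root.iterfind(".//div[@class='review-container']")]


def extract_review_texts(root):
    """Step 2: collect the full text of each review, including nested tags."""
    return ["".join(p.itertext()).strip()
            for p in root.iterfind(".//p[@class='partial_entry']")]


root = ET.fromstring(SAMPLE)
ids = extract_review_ids(root)
joined_ids = ",".join(ids)  # the comma-joined form the script sends as 'reviews'
texts = extract_review_texts(root)
print(joined_ids)
print(texts)
```

Joining the text of all child nodes plays the same role here as `string(.)` in the lxml version: text inside nested tags such as `<b>` is kept rather than dropped.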

import requests
from lxml import etree
from openpyxl import Workbook

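wb = Workbook()  # the data is stored in an Excel sheet, so create a workbook
ws = wb.active   # get the active worksheet
ws.cell(row=1, column=1).value = 'sequence number'  # write the header row
ws.cell(row=1, column=2).value = 'comment content'
j = 2  # important: the next row to write; data starts on row 2, below the header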
headers1 = {
    'Accept': 'text/html, */*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Content-Length': '191',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'TART=%1%enc%3AsOCgP97OMc94qTI7BbqgvJQJCDUDFss0n3x%2FZ%2Fd6XM8rnz7FqGQc%2FLugm%2F%2F0288qOIAVHiKul%2B0%3D; TAUnique=%1%enc%3AtldHBELviqGpBw07J2agR%2Fh5NzRGtcROrG%2FndUeNzaA3Ld2Nq8Qi%2Fg%3D%3D; TASSK=enc%3AAJpg4anlEb%2Fy78qh4oqLBdI6%2BaDGUCf54WvbP2G3NacydYkpvvBNbaP78PeAJZq8hpjaN8qEGtpiCDa%2B%2Ff8mH2G9lEOxYgYdEi96%2FMH4X35EhZV8K6f3jQSERO%2FKDFTQYg%3D%3D; _ga=GA1.2.904448747.1548682673; _gid=GA1.2.1300052258.1548682673; _smt_uid=5c4f05b1.155ab2bd; __gads=ID=e479a7f803325efe:T=1548682674:S=ALNI_MbCp05Ak_tfFsAWIrbTOWSJYQE_EA; TATravelInfo=V2*AY.2019*AM.2*AD.10*DY.2019*DM.2*DD.11*A.2*MG.-1*HP.2*FL.3*DSM.1548687914272*RS.1; CommercePopunder=SuppressAll*1548687992996; TALanguage=ALL; ServerPool=A; VRMCID=%1%V1*id.16631*llp.%2F-m16631-a_ttcampaign%5C.MTYpc-a_ttgroup%5C.title*e.1549330640723; CM=%1%PremiumMobSess%2C%2C-1%7Ct4b-pc%2C%2C-1%7CRestAds%2FRPers%2C%2C-1%7CRCPers%2C%2C-1%7CWShadeSeen%2C%2C-1%7CTheForkMCCPers%2C%2C-1%7CHomeASess%2C4%2C-1%7CPremiumSURPers%2C%2C-1%7CPremiumMCSess%2C%2C-1%7CRestPartSess%2C%2C-1%7CUVOwnersSess%2C%2C-1%7CCCUVOwnPers%2C%2C-1%7CRestPremRSess%2C%2C-1%7CCCSess%2C%2C-1%7CPremRetPers%2C%2C-1%7CViatorMCPers%2C%2C-1%7Csesssticker%2C%2C-1%7CPremiumORSess%2C%2C-1%7Ct4b-sc%2C%2C-1%7CRestAdsPers%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS2%2C%2C-1%7Cb2bmcpers%2C%2C-1%7CPremMCBtmSess%2C%2C-1%7CPremiumSURSess%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS%2C%2C-1%7CLaFourchette+Banners%2C%2C-1%7Csess_rev%2C%2C-1%7Csessamex%2C%2C-1%7CPremiumRRSess%2C%2C-1%7CTADORSess%2C%2C-1%7CAdsRetPers%2C%2C-1%7CTARSWBPers%2C%2C-1%7CSPMCSess%2C%2C-1%7CTheForkORSess%2C%2C-1%7CTheForkRRSess%2C%2C-1%7Cpers_rev%2C%2C-1%7CSPMCWBPers%2C%2C-1%7CRBAPers%2C%2C-1%7CRestAds%2FRSess%2C%2C-1%7CHomeAPers%2C%2C-1%7CPremiumMobPers%2C%2C-1%7CRCSess%2C%2C-1%7CLaFourchette+MC+Banners%2C%2C-1%7CRestAdsCCSess%2C%2C-1%7CRestPartPers%2C%2C-1%7CRestPremRPers%2C%2C-1%7CCCUVOwnSess%2C%2C-1%7CUVOwnersPers%2C%2C-1%7Csh%2C%2C-1%7Cpssamex%2C%2C-1%7CTheForkMCCSess%2C%2C-1%7CCCPers%2C%2C-1%7Cb2bmcsess%2C%2C-1%7CSPMCPers%2C%2C-1%7CPremRetSess%2C%2C-1%7CViatorMCSess%2C%2C-1%7CPremiumMCPers%2C%2C-1%7CAdsRetSess%2C%2C-1%7CPremiumRRPers%2C%2C-1%7CRestAdsCCPers%2C%2C-1%7CTADORPers%2C%2C-1%7CTheForkORPers%2C%2C-1%7CPremMCBtmPers%2C%2C-1%7CTheForkRRPers%2C%2C-1%7CTARSWBSess%2C%2C-1%7CPremiumORPers%2C%2C-1%7CRestAdsSess%2C%2C-1%7CRBASess%2C%2C-1%7CSPORPers%2C%2C-1%7Cperssticker%2C%2C-1%7CSPMCWBSess%2C%2C-1%7C; TAReturnTo=%1%%2FAttraction_Review-g294212-d325811-Reviews-Mutianyu_Great_Wall-Beijing.html; roybatty=TNI1625!AJhFLb2y1T2wjWIj0nZ%2Fn2y4GflEeZBMemyC6d%2F8wchv1Dczm9RbSQQeA97E7bMrwRblS2I0%2BtTLucrkB5pBCrTH561lfiNtsd7ZC0i2bvNZ5SzdEJ6L9beTk6vyZ5ZAjSY4LM9oEmvAuOpXvVB5maxfbgx0XPutfzEN5uTmrJCo%2C1; TASession=%1%V2ID.09B9CACFFB2F6B5B99B3E1309F5F2BE0*SQ.24*MC.16631*LR.https%3A%2F%2Fsp0%5C.baidu%5C.com%2F9q9JcDHa2gU2pMbgoY3K%2Fadrc%5C.php%3Fssl_s%3D1%26ssl_c%3Dssl1_1689740e32a%26h_search_ext%3D%257B%2522count%2522%253A4%252C%2522list%2522%253A%255B%257B%2522txt%2522%253A%2522%255Cu59dc%255Cu6210%255Cu52cb%255Cu88ab%255Cu7206%255Cu8d2a%255Cu6c61%2522%252C%2522cid%2522%253A%252246606222%2522%252C%2522sellv%2522%253A*LP.%2F-m16631-a_ttcampaign%5C.MTYpc-a_ttgroup%5C.title*LS.DemandLoadAjax*GR.41*TCPAR.97*TBR.65*EXEX.10*ABTR.89*PHTB.11*FS.48*CPU.4*HS.recommended*ES.popularity*DS.5*SAS.popularity*FPS.oldFirst*LF.zhCN*FA.1*DF.0*TRA.false*LD.325811; TAUD=LA-1548682670471-1*RDD-1-2019_01_28*HC-71023*HDD-5243681-2019_02_10.2019_02_11*LD-43273088-2019.2.10.2019.2.11*LG-43273090-2.1.F.',
    'Host': 'www.tripadvisor.cn',
    'Origin': 'https://www.tripadvisor.cn',
    'Referer': 'https://www.tripadvisor.cn/Attraction_Review-g294212-d325811-Reviews-Mutianyu_Great_Wall-Beijing.html',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36',
    'X-Puid': 'XE@utMCoCwwAAEmaUNMAAAAP',
    'X-Requested-With': 'XMLHttpRequest',
}
# The request is sent with POST. filterLang is the language of the reviews;
# the other fields can be used to filter in the same way. These are the data
# and headers used in step one; the filtering is applied through headers1.
data1 = {
    'preferFriendReviews': 'FALSE',
    't': '',
    'q': '',
    'filterSeasons': '',
    'filterLang': 'en',
    'filterSegment': '',
    'trating': '',
    'reqNum': '1',
    'isLastPoll': 'false',
    'paramSeqId': '1',
    'waitTime': '107',
    'changeSet': 'REVIEW_LIST',
    'puid': 'XE%40utMCoCwwAAEmaUNMAAAAP'
}
headers2 = {
    'Accept': 'text/html, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Content-Length': '1069',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'TART=%1%enc%3AsOCgP97OMc94qTI7BbqgvJQJCDUDFss0n3x%2FZ%2Fd6XM8rnz7FqGQc%2FLugm%2F%2F0288qOIAVHiKul%2B0%3D; TAUnique=%1%enc%3AtldHBELviqGpBw07J2agR%2Fh5NzRGtcROrG%2FndUeNzaA3Ld2Nq8Qi%2Fg%3D%3D; TASSK=enc%3AAJpg4anlEb%2Fy78qh4oqLBdI6%2BaDGUCf54WvbP2G3NacydYkpvvBNbaP78PeAJZq8hpjaN8qEGtpiCDa%2B%2Ff8mH2G9lEOxYgYdEi96%2FMH4X35EhZV8K6f3jQSERO%2FKDFTQYg%3D%3D; _ga=GA1.2.904448747.1548682673; _gid=GA1.2.1300052258.1548682673; _smt_uid=5c4f05b1.155ab2bd; __gads=ID=e479a7f803325efe:T=1548682674:S=ALNI_MbCp05Ak_tfFsAWIrbTOWSJYQE_EA; CommercePopunder=SuppressAll*1548687992996; TALanguage=ALL; ServerPool=A; TATravelInfo=V2*AY.2019*AM.2*AD.10*DY.2019*DM.2*DD.11*A.2*MG.-1*HP.2*FL.3*DSM.1548731398056*RS.1; _gat_UA-79743238-4=1; VRMCID=%1%V1*id.16631*llp.%2F-m16631-a_ttcampaign%5C.MTYpc-a_ttgroup%5C.title*e.1549336282886; CM=%1%PremiumMobSess%2C%2C-1%7Ct4b-pc%2C%2C-1%7CRestAds%2FRPers%2C%2C-1%7CRCPers%2C%2C-1%7CWShadeSeen%2C%2C-1%7CTheForkMCCPers%2C%2C-1%7CHomeASess%2C5%2C-1%7CPremiumSURPers%2C%2C-1%7CPremiumMCSess%2C%2C-1%7CRestPartSess%2C%2C-1%7CUVOwnersSess%2C%2C-1%7CCCUVOwnPers%2C%2C-1%7CRestPremRSess%2C%2C-1%7CCCSess%2C%2C-1%7CPremRetPers%2C%2C-1%7CViatorMCPers%2C%2C-1%7Csesssticker%2C%2C-1%7CPremiumORSess%2C%2C-1%7Ct4b-sc%2C%2C-1%7CRestAdsPers%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS2%2C%2C-1%7Cb2bmcpers%2C%2C-1%7CPremMCBtmSess%2C%2C-1%7CPremiumSURSess%2C%2C-1%7CMC_IB_UPSELL_IB_LOGOS%2C%2C-1%7CLaFourchette+Banners%2C%2C-1%7Csess_rev%2C%2C-1%7Csessamex%2C%2C-1%7CPremiumRRSess%2C%2C-1%7CTADORSess%2C%2C-1%7CAdsRetPers%2C%2C-1%7CTARSWBPers%2C%2C-1%7CSPMCSess%2C%2C-1%7CTheForkORSess%2C%2C-1%7CTheForkRRSess%2C%2C-1%7Cpers_rev%2C%2C-1%7CSPMCWBPers%2C%2C-1%7CRBAPers%2C%2C-1%7CRestAds%2FRSess%2C%2C-1%7CHomeAPers%2C%2C-1%7CPremiumMobPers%2C%2C-1%7CRCSess%2C%2C-1%7CLaFourchette+MC+Banners%2C%2C-1%7CRestAdsCCSess%2C%2C-1%7CRestPartPers%2C%2C-1%7CRestPremRPers%2C%2C-1%7CCCUVOwnSess%2C%2C-1%7CUVOwnersPers%2C%2C-1%7Csh%2C%2C-1%7Cpssamex%2C%2C-1%7CTheForkMCCSess%2C%2C-1%7CCCPers%2C%2C-1%7Cb2bmcsess%2C%2C-1%7CSPMCPers%2C%2C-1%7CPremRetSess%2C%2C-1%7CViatorMCSess%2C%2C-1%7CPremiumMCPers%2C%2C-1%7CAdsRetSess%2C%2C-1%7CPremiumRRPers%2C%2C-1%7CRestAdsCCPers%2C%2C-1%7CTADORPers%2C%2C-1%7CTheForkORPers%2C%2C-1%7CPremMCBtmPers%2C%2C-1%7CTheForkRRPers%2C%2C-1%7CTARSWBSess%2C%2C-1%7CPremiumORPers%2C%2C-1%7CRestAdsSess%2C%2C-1%7CRBASess%2C%2C-1%7CSPORPers%2C%2C-1%7Cperssticker%2C%2C-1%7CSPMCWBSess%2C%2C-1%7C; TAReturnTo=%1%%2FAttraction_Review-g294212-d325811-Reviews-Mutianyu_Great_Wall-Beijing.html; roybatty=TNI1625!ANG6fwAEw4MiYShLuTZ9N9WPeY6fUh4dRd78w9OaSBkvQxj%2F60hlYf6y0oPtxKiK3BS1eW2%2FOjsVDQO0MRIVGMNgdm214FfrygtGMgt1eh6uLbPio2a1wOeAgDDbGaFbpZWzO1gHlhqRrmTZtTIbmGKmi81WnuJvqgNNp%2Fu3wlRa%2C1; TASession=%1%V2ID.09B9CACFFB2F6B5B99B3E1309F5F2BE0*SQ.76*MC.16631*LR.https%3A%2F%2Fsp0%5C.baidu%5C.com%2F9q9JcDHa2gU2pMbgoY3K%2Fadrc%5C.php%3Fssl_s%3D1%26ssl_c%3Dssl1_1689740e32a%26h_search_ext%3D%257B%2522count%2522%253A4%252C%2522list%2522%253A%255B%257B%2522txt%2522%253A%2522%255Cu59dc%255Cu6210%255Cu52cb%255Cu88ab%255Cu7206%255Cu8d2a%255Cu6c61%2522%252C%2522cid%2522%253A%252246606222%2522%252C%2522sellv%2522%253A*LP.%2F-m16631-a_ttcampaign%5C.MTYpc-a_ttgroup%5C.title*LS.DemandLoadAjax*GR.41*TCPAR.97*TBR.65*EXEX.10*ABTR.89*PHTB.11*FS.48*CPU.4*HS.recommended*ES.popularity*DS.5*SAS.popularity*FPS.oldFirst*LF.ALL*FA.1*DF.0*MS.-1*RMS.-1*FLO.293920*TRA.false*LD.325811; TAUD=LA-1548682670471-1*RDD-1-2019_01_28*HC-48657123*HDD-48727532-2019_02_10.2019_02_11*LD-48857965-2019.2.10.2019.2.11*LG-48857967-2.1.F.',
    'Host': 'www.tripadvisor.cn',
    'Origin': 'https://www.tripadvisor.cn',

    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36',
    'X-Puid': 'XE@utMCoCwwAAEmaUNMAAAAP',
    'X-Requested-With': 'XMLHttpRequest',
}

def get_tripadvisor_comment(reviewid):
    data2 = {
        'reviews': reviewid,
        'contextChoice': 'DETAIL',
        'loadMtHeader': 'true',
        'haveJses': 'earlyRequireDefine,amdearly,promise-polyfill-standalone,global_error,long_lived_global,apg-Attraction_Review,apg-Attraction_Review-in,bootstrap,responsive-calendar-templates-dust-zh_CN,@ta/common.global,@ta/tracking.interactions,@ta/public.maps,@ta/overlays.pieces,@ta/overlays.shift,@ta/overlays.internal,@ta/overlays.attached-overlay,@ta/overlays.managers,@ta/overlays.attached-arrow-overlay,@ta/overlays.popover,social.share-cta,attractions.tab-bar-commerce,@ta/overlays.fullscreen-overlay,@ta/overlays.modal,attractions.attraction-detail-about-card,@ta/daodao.mobile-app-smartbutton,@ta/platform.import,@ta/platform.runtime,masthead_search_late_load,p13n_masthead_search__deferred__lateHandlers',
        'haveCsses': 'apg-Attraction_Review-in,responsive_calendars_control',
        'Action': 'install'
    }
    url = 'https://www.tripadvisor.cn/OverlayWidgetAjax?Mode=EXPANDED_HOTEL_REVIEWS_RESP&metaReferer='
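    response = requests.post(url, data=data2, headers=headers2)  # send the request
    html = etree.HTML(response.text)  # parse the response text into an HTML tree; only then can XPath be used on it
    # Select the visitors' reviews with XPath (a Chrome XPath plugin makes finding the expression very easy).
    comment = html.xpath("//p[@class='partial_entry']")
    # A second loop (loop 2, in get_reviewid) is still needed to cover the reviews on every page.
    global j
    for entry in comment:  # loop 1: the reviews on one page (normally 10); a short page no longer raises IndexError
        ws.cell(row=j, column=1).value = j - 1  # this is why j starts at 2: the content begins on the second row
        # Using string(.) here is important: it returns all the text inside the node.
        ws.cell(row=j, column=2).value = entry.xpath('string(.)')
        j = j + 1  # move on to the next row
        print(j)  # print progress so that errors are easy to spot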

def get_reviewid():
    # The site locates each review by its ID, so the review IDs have to be collected first.
    # Collect the IDs page by page and pass them to the second function; 3160 is the number of reviews to fetch.
    for k in range(0, 3160, 10):
        url = 'https://www.tripadvisor.cn/Attraction_Review-g14133707-d480640-Reviews-or%d-Isetan_Shinjuku_Store-Shinjuku_3_Chome_Shinjuku_Tokyo_Tokyo_Prefecture_Kanto.html' % k
        response = requests.post(url, data=data1, headers=headers1)
        html = etree.HTML(response.text)
        reviewid = html.xpath("//div[@class='review-container']/@data-reviewid")  # the ID numbers, used next as the index for fetching the reviews
        print(reviewid)
        reviewid_final = ",".join(reviewid)  # join the IDs with commas
        get_tripadvisor_comment(reviewid=reviewid_final)  # nested call to the review-fetching function, which yields the reviews we are after


get_reviewid()
wb.save('伊勢丹.xlsx')
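The paging in get_reviewid relies on the `or%d` slot in the URL, which takes the offset of the first review on each page (0, 10, 20, ...). A minimal sketch of that idea in isolation follows; the helper name `review_page_urls` and the shortened template URL are hypothetical and serve only as illustration, and a page size of 10 is assumed as in the script above.

```python
def review_page_urls(template, total_reviews, page_size=10):
    """Yield one URL per review page; the offset fills the 'or%d' slot."""
    for offset in range(0, total_reviews, page_size):
        yield template % offset


# Hypothetical shortened template; the real one is the long URL in get_reviewid().
TEMPLATE = "https://www.tripadvisor.cn/Attraction_Review-g0-d0-Reviews-or%d-Example.html"

urls = list(review_page_urls(TEMPLATE, total_reviews=35))
for u in urls:
    print(u)
```

Keeping the URL generation separate makes it easy to check the offsets for a small review count before starting a long crawl.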












