Shandong University Innovation Project Training: Personal Work Log (8)

Latest recommended article published on 2021-07-28 20:12:37

afyzju

Article link: https://blog.csdn.net/afyzju/article/details/116545096


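I finished crawling ticket information from Ctrip.
While working on Ctrip, I found that one of its request parameters is generated dynamically, so we cannot reproduce that value ourselves and send the request directly. For this crawl I therefore used Selenium.
Selenium is a tool for testing web applications. It runs directly in the browser, just as a real user would operate it.
So all we need to do is simulate a user visiting the page, and we can read the information we want from it.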

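from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

# Run Chrome headless (no visible window); executable_path is the Selenium 3-style argument.
options = Options()
options.add_argument('--headless')
self.chrome = Chrome(executable_path='D:\\py\\aaaaaaaaa\\selenuim\\chromedriver.exe', options=options)
self.chrome.get(url)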

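        # Needs `import re` plus a `fuzz` module providing token_sort_ratio
        # (e.g. fuzzywuzzy; the post does not say which library is used).
        # find_element_by_class_name is the Selenium 3 API; Selenium 4 uses find_element(By.CLASS_NAME, ...).
        content = self.chrome.find_element_by_class_name('right-content-list').get_attribute('innerHTML')
        # Pull every (href, title) pair out of the result list's HTML.
        cons = re.findall(r'href="(.*?)" title="(.*?)"', content)
        for con in cons:
            self.detail_url = 'https:' + con[0]
            self.title = con[1]
            # Skip results whose title barely resembles the search keyword.
            result = fuzz.token_sort_ratio(self.title, keyword)
            if result <= 20:
                continue
            self.get_detail()
        return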
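
The link extraction and similarity filter can be tried offline without a browser. The sketch below is not the crawler's own code: the HTML sample and its URLs are made up, `filter_links` and `token_sort_similarity` are hypothetical names, and the standard library's `difflib.SequenceMatcher` stands in for `fuzz.token_sort_ratio`, so the scores will not match the real library exactly.

```python
import re
from difflib import SequenceMatcher

# Hypothetical sample of the innerHTML of the result list (not real Ctrip markup).
SAMPLE_HTML = (
    '<a href="//piao.ctrip.com/ticket/dest/t1001.html" title="Mount Tai Scenic Area">...</a>'
    '<a href="//piao.ctrip.com/ticket/dest/t1002.html" title="City Aquarium">...</a>'
)


def token_sort_similarity(a: str, b: str) -> int:
    """Rough stand-in for a token-sort ratio: sort the tokens, then compare (0-100)."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, norm_a, norm_b).ratio())


def filter_links(html: str, keyword: str, threshold: int = 20):
    """Return (url, title) pairs whose title scores above the threshold."""
    matches = []
    for href, title in re.findall(r'href="(.*?)" title="(.*?)"', html):
        if token_sort_similarity(title, keyword) > threshold:
            # The hrefs are protocol-relative, so prepend the scheme.
            matches.append(("https:" + href, title))
    return matches


if __name__ == "__main__":
    for url, title in filter_links(SAMPLE_HTML, "Mount Tai"):
        print(title, url)
```

A low threshold such as 20 only discards titles that have almost nothing in common with the keyword, which seems to be the intent of the `result <= 20` check.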

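This gives us every attraction that matches the keyword. The next task is to open each detail page and get the ticket information, which is stored in a JSON object embedded in that page: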

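    def get_ticket(self):
        # Needs `import json` and `import requests` at module level.
        # The last path segment of the detail URL identifies the attraction.
        spot_id = self.detail_url.split('/')[-1]
        ticket_url = f'https://piao.ctrip.com/ticket/dest/{spot_id}?onlyContent=true&onlyShelf=true'
        # verify=False disables TLS certificate verification for this request.
        ticket_res = requests.get(ticket_url, verify=False, headers=self.headers).text
        # Strip newlines and spaces so the marker offset below is predictable.
        # Note that this also removes spaces inside the JSON string values.
        ticket_res = ticket_res.replace('\n', '').replace(' ', '')
        # Keep the text between the two markers. The +25 skips the 24-character
        # marker plus one more character, presumably the '=' that follows it.
        start = ticket_res.find('window.__INITIAL_STATE__') + 25
        end = ticket_res.find('window.__APP_SETTINGS__')
        info = json.loads(ticket_res[start:end])
        ticketinfos = info['detailInfo']['ressHash']
        # Group the tickets by their 'propleproperty' value.
        slist = {}
        for ticketinfo in ticketinfos.values():
            title = ticketinfo['name']
            price = ticketinfo['price']
            ticket_type = ticketinfo['saleunitinfo']['propleproperty']
            fromw = '携程旅游 ' + ticketinfo['brandname']  # "Ctrip Travel" + brand name
            slist.setdefault(ticket_type, []).append(
                {'name': title, 'type': ticket_type, 'price': price, 'url': self.detail_url,
                 'buy': '', 'from': fromw, 'isReturnable': '',
                 'bookTime': '', 'outTime': '', 'useTime': '',
                 'discription': ''})  # key spelling kept as in the original
        self.spotsInfo[self.title] = slist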
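
Slicing with a fixed offset works, but it depends on the exact spacing around the marker. As a hedged alternative, the sketch below finds the first `{` after the marker and lets `json.JSONDecoder.raw_decode` stop at the end of the object, so no whitespace stripping is needed. The page fragment is a made-up miniature of the real payload, and `extract_state` and `group_tickets` are hypothetical helper names, not part of the crawler.

```python
import json
from collections import defaultdict

# Hypothetical page fragment; the real payload is much larger.
SAMPLE_PAGE = """
<script>
window.__INITIAL_STATE__ = {"detailInfo": {"ressHash": {
  "1": {"name": "Adult Ticket", "price": 120, "brandname": "Official",
        "saleunitinfo": {"propleproperty": "Adult"}},
  "2": {"name": "Child Ticket", "price": 60, "brandname": "Official",
        "saleunitinfo": {"propleproperty": "Child"}}
}}}
window.__APP_SETTINGS__ = {}
</script>
"""


def extract_state(page: str, marker: str = "window.__INITIAL_STATE__") -> dict:
    """Decode the first JSON object that follows `marker` in the page source."""
    start = page.index("{", page.index(marker))
    # raw_decode parses one JSON value starting at `start` and ignores what follows.
    state, _end = json.JSONDecoder().raw_decode(page, start)
    return state


def group_tickets(state: dict) -> dict:
    """Group ticket entries by their 'propleproperty' value."""
    grouped = defaultdict(list)
    for item in state["detailInfo"]["ressHash"].values():
        ticket_type = item["saleunitinfo"]["propleproperty"]
        grouped[ticket_type].append({"name": item["name"], "price": item["price"]})
    return dict(grouped)


if __name__ == "__main__":
    print(group_tickets(extract_state(SAMPLE_PAGE)))
```

Because the text is not stripped of spaces first, names such as "Adult Ticket" keep their original spacing.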