Crawling Tripadvisor Hotel Data with the pyspider Framework

Requirements

Site entry point: www.tripadvisor.com
At the bottom of the page, iterate over the city links and open each one:
Each link leads to the full list of hotels in that city.
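The first step of this flow, collecting the city links and turning them into absolute URLs, can be sketched with the standard library alone. This is a minimal illustration rather than the crawler itself: the HTML fragment, the `CityLinkParser` class, and the `extract_city_links` helper are hypothetical names introduced here, and the only markup assumption is that each city sits in an `<li class="item">` that contains an `<a>`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://www.tripadvisor.com/"


class CityLinkParser(HTMLParser):
    """Collect (name, absolute_url) pairs from <li class="item"><a href=...> entries."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []         # finished (name, url) pairs
        self._in_item = False   # True while inside an <li class="item">
        self._href = None       # href of the <a> currently open, if any
        self._text = []         # text chunks of the <a> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._in_item = True
        elif tag == "a" and self._in_item and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []

    def handle_data(self, data):
        # Only keep text that belongs to an open city link
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            # urljoin resolves relative hrefs against the site root
            self.links.append((name, urljoin(self.base_url, self._href)))
            self._href = None
        elif tag == "li":
            self._in_item = False


def extract_city_links(html, base_url=BASE_URL):
    parser = CityLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical fragment, made up for illustration only
SAMPLE = """
<ul class="lst ui_column is-4">
  <li class="item"><a href="/Hotels-g1-Example_City-Hotels.html">Example City</a></li>
  <li class="item"><a href="/Hotels-g2-Another_City-Hotels.html">Another City</a></li>
</ul>
"""

city_links = extract_city_links(SAMPLE)
```

Resolving links with `urljoin` instead of plain string concatenation keeps the result correct whether the page emits relative or already-absolute hrefs.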

Code
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
# Created on 2018-09-06 11:16:59
# Project: trip_hotel

from pyspider.libs.base_handler import *
import datetime
import re
import json
import copy

from pymongo import MongoClient

# Connect to the offline database
DB_IP = ''
DB_PORT = None  # placeholder: set this to your MongoDB port

#DB_IP = '127.0.0.1'
#DB_PORT = 27017

client = MongoClient(host=DB_IP, port=DB_PORT)

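# The account is defined on the admin database: connect, authenticate, then switch databases.
# Note: Database.authenticate() is only available in PyMongo versions before 4.0.
db_auth = client.admin
db_auth.authenticate("", "")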

DB_NAME = 'research'
db = client[DB_NAME]



def get_today():
    # Today's date as a datetime at midnight (time part zeroed)
    return datetime.datetime.combine(datetime.date.today(), datetime.time.min)

class Handler(BaseHandler):
    crawl_config = {
        'headers': {
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
            'cookie': 'SetCurrency=USD'
        },
        'proxy': 'http://10.15.100.94:6666',
        'retries': 5
    }

    url = 'https://www.tripadvisor.com/'
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl(self.url, callback=self.index_page)

    @config(age=60)
    def index_page(self, response):
        page = response.etree

        city_list = page.xpath("//div[@class='customSelection']/div[@class='boxhp collapsibleLists']/div[@class='section']/div[@class='ui_columns' or @class='ui_columns no-collapse']/ul[@class='lst ui_column is-4']/li[@class='item']")

        print(len(city_list))
        base_url = 'https://www.tripadvisor.com'
        for each in city_list:
            city_name = each.xpath("./a/text()")[0]
            city_link = base_url + each.xpa
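The database setup in the listing authenticates through `Database.authenticate`, which was removed in PyMongo 4.0. One common alternative is to put the credentials into the connection URI and name the authentication database with `authSource`. The sketch below only builds such a URI with the standard library; `build_mongo_uri` and the sample credentials are hypothetical and do not come from the original post.

```python
from urllib.parse import quote_plus


def build_mongo_uri(user, password, host, port, auth_source="admin"):
    """Build a MongoDB URI whose credentials are checked against auth_source."""
    # Percent-escape the credentials so characters such as '@' or '/' stay valid in the URI
    return "mongodb://{}:{}@{}:{}/?authSource={}".format(
        quote_plus(user), quote_plus(password), host, port, auth_source
    )


# Hypothetical credentials, for illustration only
uri = build_mongo_uri("crawler", "p@ss/word", "127.0.0.1", 27017)

# With PyMongo installed, the URI would be used roughly like this (assumption, not run here):
#   client = MongoClient(uri)
#   db = client["research"]
```

Because the credentials are attached to the client itself, the separate connect, authenticate, and switch steps collapse into a single `MongoClient(...)` call followed by `client[DB_NAME]`.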