Scraping the Total Travel-Note Count of Each City on Mafengwo with Python

The complete code is at the bottom of this post.
Every province and municipality on Mafengwo has a corresponding ID, so the first job is to write code that collects those IDs.
Step one is to create a Spider class. headers is a dict whose user-agent value comes from the browser: open any page in Chrome, right-click and choose Inspect, switch to the Network tab, and you will find the user-agent there, as shown below:
(screenshot: the user-agent header shown in Chrome DevTools' Network tab)

Copy the user-agent string into the headers dict.
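
For reference, the finished dict looks something like this; the user-agent value below is only a sample string, substitute whatever your own browser reports:

headers = {
    # sample value only; paste the string copied from your own browser
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/96.0.4664.110 Safari/537.36'
}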

class Spider:
    # UA spoofing
    headers = {'user-agent': 'paste it here'}
    # start URL
    start_url = 'http://www.mafengwo.cn/mdd/'
    food_id = 1  # food id
    sight_id = 1  # sight id
    province_id = []  # province ids
    municipality_id = []  # municipality ids
    city_id = []  # city ids
    city_name = []  # city names

Next, write a method that collects the province and municipality IDs. requests.get sends the request and returns a response object; its text attribute holds the page as a str, which lxml's etree parses so elements can be located by XPath. To find a path, for example Yunnan's: right-click and Inspect, open the Elements tab, click the arrow icon in the top-left corner, then click the element on the page; the matching markup is highlighted in Elements, and right-clicking it lets you copy its XPath, as shown below:
(screenshot: copying an element's XPath from the Elements panel in Chrome DevTools)

Store each a tag's href in a list, iterate over the hrefs with a for loop, and extract the ID with a regular expression; the municipality IDs are stored first, then the province IDs.

    # collect province ids
    def get_province_id(self):
        response = requests.get(url=self.start_url, headers=self.headers)
        page_text = response.text
        tree = etree.HTML(page_text)
        # the municipality hrefs come first
        city_list = tree.xpath('/html/body/div[2]/div[2]/div/div[3]/div[1]/div[1]/dl[1]/dd/a/@href')
        for city_href in city_list:
            id_ = re.findall('/travel-scenic-spot/mafengwo/(.*?).html', city_href)[0]
            self.municipality_id.append(id_)
        # all a tags under dt elements
        a_list = tree.xpath('/html/body/div[2]/div[2]/div/div[3]/div[1]/div/dl/dt/a')
        # the province hrefs
        province_href = []
        for a in a_list:
            href = a.xpath('./@href')[0]
            province_href.append(href)
        # store the province ids in province_id
        for href in province_href:
            id_str = re.findall('/travel-scenic-spot/mafengwo/(.*?).html', href)[0]
            self.province_id.append(id_str)
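
A quick way to sanity-check the ID regex is to run it against a single href; the ID 10088 below is made up purely for illustration:

import re

href = '/travel-scenic-spot/mafengwo/10088.html'  # hypothetical href
print(re.findall('/travel-scenic-spot/mafengwo/(.*?).html', href))  # ['10088']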

check_csv checks whether the CSV directory and the named csv file exist under the current directory, creating them if they do not.

    # make sure the csv file exists
    def check_csv(self, csv_name):
        path = './CSV'
        file = path + '/' + csv_name + '.csv'
        if not os.path.exists(path):
            os.makedirs(path)
        if not os.path.exists(file):
            # os.mknod(file)
            f = open(file, 'w')
            f.close()
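
If you prefer the standard-library Path API, an equivalent sketch of check_csv (same behaviour, assuming Python 3.4+ for pathlib) would be:

from pathlib import Path

def check_csv(csv_name):
    # create ./CSV if missing, then make sure <csv_name>.csv exists
    path = Path('./CSV')
    path.mkdir(parents=True, exist_ok=True)
    (path / (csv_name + '.csv')).touch(exist_ok=True)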

With the province IDs in hand, collect the IDs of all the cities in each province. The city list is loaded dynamically via Ajax, so the pages are scraped with the Selenium browser-automation tool. Chrome updates itself automatically, so Firefox is used to run Selenium here. The scraped rows go into a pandas DataFrame. A while loop walks through all of a province's city pages and breaks once the page number equals the maximum page count. Clicking the next-page button occasionally throws an exception, so the click is wrapped in a try.

    # collect city ids
    def get_city_id(self):
        bro = webdriver.Firefox(executable_path='Crawler/geckodriver')  # executable_path points at geckodriver
        # the same works for Chrome:
        # bro = webdriver.Chrome(executable_path='./chromedriver')
        city_id = 1
        # empty dataframe for the scraped rows
        city_list_dataframe = pd.DataFrame([], columns=['city_id', 'province_name', 'city_name', 'city_href'])
        for id_str in self.province_id:
            url = 'http://www.mafengwo.cn/mdd/citylist/%s.html'
            new_url = format(url % id_str)
            # load the page in the browser
            bro.get(new_url)
            # grab the current page source
            page_text = bro.page_source
            tree = etree.HTML(page_text)
            max_page_number_str = tree.xpath('//*[@id="citylistpagination"]/div/span[1]/text()')[0]
            # parse the maximum page count (from text like 共5页) as an int
            max_page_number = int(re.findall('共(.*?)页', max_page_number_str)[0])
            province_name = tree.xpath('//*[@id="container"]/div[1]/div/div[1]/div[3]/div/span/a/text()')[0]
            while True:
                # scroll all the way down; the page only fills in fully once scrolled
                bro.execute_script('document.documentElement.scrollTop=10000')
                # wait until the page has finished loading
                WebDriverWait(bro, 30).until(lambda driver: bro.execute_script("return jQuery.active == 0"))
                page_ = bro.page_source
                tree = etree.HTML(page_)
                a_list = tree.xpath('//div[@class="bd"]/ul[@class="clearfix"]/li/div/a')
                page_number = tree.xpath('//*[@id="citylistpagination"]/div/span[2]/text()')[0]
                for a in a_list:
                    href = a.xpath('./@href')[0]
                    name = a.xpath('./div/text()')[0]
                    id_ = re.findall('/travel-scenic-spot/mafengwo/(.*?).html', href)[0]

                    city_list_dataframe.loc[city_id - 1, 'city_id'] = city_id
                    city_list_dataframe.loc[city_id - 1, 'province_name'] = province_name
                    # strip surrounding whitespace
                    name = name.strip()
                    city_list_dataframe.loc[city_id - 1, 'city_name'] = name
                    city_list_dataframe.loc[city_id - 1, 'city_href'] = id_
                    city_id += 1
                    # self.schedule = int((city_id*100)/2193)

                if int(page_number) == max_page_number:
                    break
                else:
                    # the '后一页' (next page) button
                    btn = bro.find_element_by_link_text('后一页')
                    # click it
                    try:
                        bro.execute_script("arguments[0].click();", btn)
                    except Exception:
                        btn = bro.find_element_by_link_text('后一页')
                        btn.click()

        self.check_csv("city_list_dataframe")
        city_list_dataframe.set_index('city_id', inplace=True)
        city_list_dataframe.to_csv('./CSV/city_list_dataframe.csv')
        bro.quit()
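
If you would rather not have a browser window pop up, Firefox can usually run headless; a sketch, assuming a Selenium 3.x install whose FirefoxOptions exposes the headless flag:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True  # no visible window
bro = webdriver.Firefox(executable_path='Crawler/geckodriver', options=options)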

This creates city_list_dataframe.csv inside the CSV folder under the current directory.
(screenshot: the CSV directory containing city_list_dataframe.csv)

The contents of the CSV look like this:
(screenshot: rows of city_list_dataframe.csv)
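
A quick way to eyeball the output is to read the file back (head() prints the first few rows):

import pandas as pd

df = pd.read_csv('./CSV/city_list_dataframe.csv', header=0)
print(df.head())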

This step collects each city's total travel-note count, visiting the municipalities first and then the cities of every province. Mafengwo takes anti-scraping countermeasures, so the visits are throttled: a random delay of 0.2 to 1 second is inserted between requests.
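
To make the URL scheme concrete: a city's travel-note listings live under http://www.mafengwo.cn/yj/<id>/, and the method below appends s-0-0-<n>-0-1-0.html with n running from 1 to 4, summing the counts it finds. A tiny demonstration, again with the made-up ID 10088:

city_url = 'http://www.mafengwo.cn/yj/%s/' % '10088'  # hypothetical city id
page_url = city_url + ('s-0-0-%s-0-1-0.html' % 1)
print(page_url)  # http://www.mafengwo.cn/yj/10088/s-0-0-1-0-1-0.html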

    # fetch the data
    def travel_notes(self):
        mafengwo_id = 1
        # read the csv into a dataframe; header=0 means the first row holds the column labels
        df = pd.read_csv('./CSV/city_list_dataframe.csv', header=0)
        # the city_href column, as a series
        x = df['city_href']
        # convert the series to a list
        city_id = x.values.tolist()
        # empty dataframe for the scraped rows
        mafengwo_dataframe = pd.DataFrame([], columns=['id', 'province', 'city', 'county', 'travel_number'])
        all_num = len(self.municipality_id) + len(city_id)
        print('total:', all_num)
        # municipality data
        response = requests.get(url=self.start_url, headers=self.headers)
        page_text = response.text
        tree = etree.HTML(page_text)
        # the municipality hrefs come first
        municipality = []  # municipality ids
        city_list_municipality = tree.xpath('/html/body/div[2]/div[2]/div/div[3]/div[1]/div[1]/dl[1]/dd/a/@href')
        for city_href in city_list_municipality:
            id_ = re.findall('/travel-scenic-spot/mafengwo/(.*?).html', city_href)[0]
            municipality.append(id_)
        for _id in municipality:
            url = 'http://www.mafengwo.cn/yj/%s/'
            city_url = format(url % _id)
            number = 0
            # travel-note count
            for num in range(1, 5):
                url_ = city_url + 's-0-0-%s-0-1-0.html'
                new_url = format(url_ % num)
                response_ = requests.get(url=new_url, headers=self.headers)
                page_text_ = response_.text
                tree_ = etree.HTML(page_text_)
                try:
                    num_str = tree_.xpath('//div[@class="_pagebar"]/div/span[1]/span[2]/text()')[0]
                    number_ = int(num_str)
                except Exception:
                    list_ = []
                    a_list = tree_.xpath('//div[@class="post-list"]/ul/li/h2/a[2]')
                    for a in a_list:
                        text = a.xpath('./text()')[0]
                        list_.append(text)
                    number_ = len(list_)
                number += number_
            # a random one-decimal delay between 0.2 and 1 second
            random_sleep_time = float(round(random.uniform(0.2, 1), 1))
            # sleep that long
            time.sleep(random_sleep_time)
            # province, city, and county names
            response = requests.get(url=city_url, headers=self.headers)
            page_text = response.text
            tree = etree.HTML(page_text)
            province_name = tree.xpath('//div[@class="crumb"]/div[3]/div[1]/span/a/text()')[0]
            city_name = province_name
            county = '空'  # '空' marks a missing county
            # print(mafengwo_id, province_name, city_name, county, number)
            mafengwo_dataframe.loc[mafengwo_id - 1, 'id'] = mafengwo_id
            mafengwo_dataframe.loc[mafengwo_id - 1, 'province'] = province_name
            mafengwo_dataframe.loc[mafengwo_id - 1, 'city'] = city_name
            mafengwo_dataframe.loc[mafengwo_id - 1, 'county'] = county
            mafengwo_dataframe.loc[mafengwo_id - 1, 'travel_number'] = number
            mafengwo_id += 1
        # city data
        for _id_ in city_id:
            url = 'http://www.mafengwo.cn/yj/%s/'
            city_url = format(url % _id_)
            number = 0  # running travel-note count
            # travel-note count
            for num in range(1, 5):
                url_ = city_url + 's-0-0-%s-0-1-0.html'
                new_url = format(url_ % num)
                response_ = requests.get(url=new_url, headers=self.headers)
                page_text_ = response_.text
                tree_ = etree.HTML(page_text_)
                try:
                    num_str = tree_.xpath('//div[@class="_pagebar"]/div/span[1]/span[2]/text()')[0]
                    number_ = int(num_str)
                except Exception:
                    list_ = []
                    a_list = tree_.xpath('//div[@class="post-list"]/ul/li/h2/a[2]')
                    for a in a_list:
                        text = a.xpath('./text()')[0]
                        list_.append(text)
                    number_ = len(list_)
                # a random one-decimal delay between 0.2 and 1 second
                random_sleep_time = float(round(random.uniform(0.2, 1), 1))
                # sleep that long
                time.sleep(random_sleep_time)
                number += number_
            # province, city, and county names
            response = requests.get(url=city_url, headers=self.headers)
            page_text = response.text
            tree = etree.HTML(page_text)
            div_list = tree.xpath('//div[@class="crumb"]/div')
            if len(div_list) == 5:
                province_name = tree.xpath('//div[@class="crumb"]/div[2]/div[1]/span/a/text()')[0]
                city_name = tree.xpath('//div[@class="crumb"]/div[3]/div[1]/span/a/text()')[0]
                county = tree.xpath('//div[@class="crumb"]/div[4]/div[1]/span/a/text()')[0]
            else:
                province_name = tree.xpath('//div[@class="crumb"]/div[2]/div[1]/span/a/text()')[0]
                city_name = tree.xpath('//div[@class="crumb"]/div[3]/div[1]/span/a/text()')[0]
                county = '空'

            # print(city_id, province_name, city_name, county, number)
            mafengwo_dataframe.loc[mafengwo_id - 1, 'id'] = mafengwo_id
            mafengwo_dataframe.loc[mafengwo_id - 1, 'province'] = province_name
            mafengwo_dataframe.loc[mafengwo_id - 1, 'city'] = city_name
            mafengwo_dataframe.loc[mafengwo_id - 1, 'county'] = county
            mafengwo_dataframe.loc[mafengwo_id - 1, 'travel_number'] = number
            mafengwo_id += 1

        self.check_csv("mafengwo_dataframe")
        mafengwo_dataframe.set_index('id', inplace=True)
        mafengwo_dataframe.to_csv('./CSV/mafengwo_dataframe.csv')

Next come the sights and food. The search results page contains not only sights and food but also transport, shopping, and other entries, so the titles have to be filtered (the filter logic is sketched after the screenshot below). Thirty records each are collected for food and for sights per city; the loop breaks once 30 records are reached or 10 pages have been scanned.
(screenshot: a Mafengwo search results page mixing sight, food, transport, and shopping entries)
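
Pulled out on its own, the title filter amounts to this little sketch (not a helper the original defines, just the branching logic of the method below):

def classify(title):
    # keep only entries whose title mentions 景点 (sight) or 美食 (food)
    if title.find('景点') != -1:
        return 'sight'
    if title.find('美食') != -1:
        return 'food'
    return None  # transport, shopping, etc. are skipped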

    # sight (and food) data
    def sight(self):
        # empty dataframes for the scraped rows
        food_dataframe = pd.DataFrame([], columns=['id', 'city', 'food', 'place', 'comment_number', 'travel_notes_number'])
        sight_dataframe = pd.DataFrame([], columns=['id', 'city', 'sight', 'place', 'comment_number', 'travel_notes_number'])
        food_food_id = 1
        sight_sight_id = 1
        df = pd.read_csv('./CSV/city_list_dataframe.csv', header=0)
        x = df['city_name']
        city_name = x.values.tolist()
        # np.unique deduplicates (and sorts) city_name
        city_name_list = np.unique(city_name)
        for c_name in city_name_list:
            last_id_food = food_food_id
            last_id_sight = sight_sight_id
            # up to 10 pages
            for i in range(1, 11):
                url = 'http://www.mafengwo.cn/search/q.php?q=' + c_name + '&p=' + str(i) + '&t=pois&kt=1'
                response = requests.get(url=url, headers=self.headers)
                page_text = response.text
                tree = etree.HTML(page_text)
                li_list = tree.xpath('//*[@id="_j_search_result_left"]/div/div/ul/li')
                for li in li_list:
                    try:
                        # entry title
                        a_name = li.xpath('./div/div[2]/h3/a/text()')[0]
                        # location
                        place = li.xpath('./div/div[2]/ul/li[1]/a/text()')[0]
                        # review count
                        fengping = li.xpath('./div/div[2]/ul/li[2]/a/text()')[0]
                        # travel-note mention count
                        youji = li.xpath('./div/div[2]/ul/li[3]/a/text()')[0]
                    except Exception:
                        continue
                    # drop entries whose title mentions neither 景点 nor 美食
                    if a_name.find('景点') == -1:
                        if a_name.find('美食') == -1:
                            # skip this entry
                            continue
                        # the title mentions 美食
                        else:
                            if food_food_id == (last_id_food + 30):
                                continue
                            # food name
                            food_name = a_name.replace('美食 -', '')
                            # review count, e.g. 蜂评(4633)
                            comment_number = int(re.findall(r'蜂评\((\d+)\)', fengping)[0])
                            # travel-note mention count
                            travel_notes_number = int(re.findall(r'游记\((\d+)\)', youji)[0])

                            # sql = "INSERT INTO food(id,city,food,place,comment_number,travel_notes_number)VALUE('%d','%s','%s','%s','%d','%d')" % (
                            #     food_food_id, c_name, food_name, place, comment_number, travel_notes_number)

                            food_dataframe.loc[food_food_id-1, 'id'] = food_food_id
                            food_dataframe.loc[food_food_id-1, 'city'] = c_name
                            food_dataframe.loc[food_food_id-1, 'food'] = food_name
                            food_dataframe.loc[food_food_id-1, 'place'] = place
                            food_dataframe.loc[food_food_id-1, 'comment_number'] = comment_number
                            food_dataframe.loc[food_food_id-1, 'travel_notes_number'] = travel_notes_number

                            print(food_food_id, c_name, food_name, place, comment_number, travel_notes_number)
                            food_food_id += 1

                            # move on to the next entry
                            continue
                    if sight_sight_id == (last_id_sight + 30):
                        if food_food_id == (last_id_food + 30):
                            break
                        else:
                            continue
                    # sight name
                    sight_name = a_name.replace('景点 - ', '')
                    # review count, e.g. 蜂评(4633)
                    comment_number = int(re.findall(r'蜂评\((\d+)\)', fengping)[0])
                    # travel-note mention count
                    travel_notes_number = int(re.findall(r'游记\((\d+)\)', youji)[0])
                    # sql = "INSERT INTO sight(id,city,sight,place,comment_number,travel_notes_number)VALUE('%d','%s','%s','%s','%d','%d')" % (
                    #     self.sight_id, c_name, sight_name, place, comment_number, travel_notes_number)

                    sight_dataframe.loc[sight_sight_id - 1, 'id'] = sight_sight_id
                    sight_dataframe.loc[sight_sight_id - 1, 'city'] = c_name
                    sight_dataframe.loc[sight_sight_id - 1, 'sight'] = sight_name
                    sight_dataframe.loc[sight_sight_id - 1, 'place'] = place
                    sight_dataframe.loc[sight_sight_id - 1, 'comment_number'] = comment_number
                    sight_dataframe.loc[sight_sight_id - 1, 'travel_notes_number'] = travel_notes_number
                    print(sight_sight_id, c_name, sight_name, place, comment_number, travel_notes_number)
                    sight_sight_id += 1

                # a random one-decimal delay between 0.2 and 1 second
                random_sleep_time = float(round(random.uniform(0.2, 1), 1))
                # sleep that long
                time.sleep(random_sleep_time)
        self.check_csv("food_dataframe")
        self.check_csv("sight_dataframe")
        food_dataframe.set_index('id', inplace=True)
        food_dataframe.to_csv('./CSV/food_dataframe.csv')
        sight_dataframe.set_index('id', inplace=True)
        sight_dataframe.to_csv('./CSV/sight_dataframe.csv')

The run method:

    def run(self):
        self.get_province_id()
        self.get_city_id()
        self.travel_notes()
        self.sight()

And the main block:

if __name__ == '__main__':
    spider = Spider()
    spider.run()

Project directory:
(screenshot: project directory layout)

The complete code:

import os
import re
import csv
import random
import time
import requests
import pymysql
import numpy as np
import pandas as pd
from lxml import etree
from sqlalchemy import create_engine
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait


class Spider:
    # UA spoofing
    headers = {'user-agent': 'the user-agent copied from your browser'}
    # start URL
    start_url = 'http://www.mafengwo.cn/mdd/'
    # schedule = 0  # progress
    db_id = 1  # database id
    food_id = 1  # food id
    sight_id = 1  # sight id
    province_id = []  # province ids
    municipality_id = []  # municipality ids
    city_id = []  # city ids
    city_name = []  # city names

    # collect province ids
    def get_province_id(self):
        response = requests.get(url=self.start_url, headers=self.headers)
        page_text = response.text
        tree = etree.HTML(page_text)
        # the municipality hrefs come first
        city_list = tree.xpath('/html/body/div[2]/div[2]/div/div[3]/div[1]/div[1]/dl[1]/dd/a/@href')
        for city_href in city_list:
            id_ = re.findall('/travel-scenic-spot/mafengwo/(.*?).html', city_href)[0]
            self.municipality_id.append(id_)
        # all a tags under dt elements
        a_list = tree.xpath('/html/body/div[2]/div[2]/div/div[3]/div[1]/div/dl/dt/a')
        # the province hrefs
        province_href = []
        for a in a_list:
            href = a.xpath('./@href')[0]
            province_href.append(href)
        # store the province ids in province_id
        for href in province_href:
            id_str = re.findall('/travel-scenic-spot/mafengwo/(.*?).html', href)[0]
            self.province_id.append(id_str)

    # collect city ids
    def get_city_id(self):
        bro = webdriver.Firefox(executable_path='./geckodriver')
        city_id = 1
        # empty dataframe for the scraped rows
        city_list_dataframe = pd.DataFrame([], columns=['city_id', 'province_name', 'city_name', 'city_href'])
        for id_str in self.province_id:
            url = 'http://www.mafengwo.cn/mdd/citylist/%s.html'
            new_url = format(url % id_str)
            # load the page in the browser
            bro.get(new_url)
            # grab the current page source
            page_text = bro.page_source
            tree = etree.HTML(page_text)
            max_page_number_str = tree.xpath('//*[@id="citylistpagination"]/div/span[1]/text()')[0]
            # parse the maximum page count (from text like 共5页) as an int
            max_page_number = int(re.findall('共(.*?)页', max_page_number_str)[0])
            province_name = tree.xpath('//*[@id="container"]/div[1]/div/div[1]/div[3]/div/span/a/text()')[0]
            while True:
                # scroll all the way down; the page only fills in fully once scrolled
                bro.execute_script('document.documentElement.scrollTop=10000')
                # wait until the page has finished loading
                WebDriverWait(bro, 30).until(lambda driver: bro.execute_script("return jQuery.active == 0"))
                page_ = bro.page_source
                tree = etree.HTML(page_)
                a_list = tree.xpath('//div[@class="bd"]/ul[@class="clearfix"]/li/div/a')
                page_number = tree.xpath('//*[@id="citylistpagination"]/div/span[2]/text()')[0]
                for a in a_list:
                    href = a.xpath('./@href')[0]
                    name = a.xpath('./div/text()')[0]
                    id_ = re.findall('/travel-scenic-spot/mafengwo/(.*?).html', href)[0]

                    city_list_dataframe.loc[city_id - 1, 'city_id'] = city_id
                    city_list_dataframe.loc[city_id - 1, 'province_name'] = province_name
                    # strip surrounding whitespace
                    name = name.strip()
                    city_list_dataframe.loc[city_id - 1, 'city_name'] = name
                    city_list_dataframe.loc[city_id - 1, 'city_href'] = id_
                    city_id += 1
                    # self.schedule = int((city_id*100)/2193)

                if int(page_number) == max_page_number:
                    break
                else:
                    # the '后一页' (next page) button
                    btn = bro.find_element_by_link_text('后一页')
                    # click it
                    try:
                        bro.execute_script("arguments[0].click();", btn)
                    except Exception:
                        btn = bro.find_element_by_link_text('后一页')
                        btn.click()
                # demo cap: stop after 50 cities (see the note at the end of the post)
                if city_id > 50:
                    break
            if city_id > 50:
                break

        self.check_csv("city_list_dataframe")
        city_list_dataframe.set_index('city_id', inplace=True)
        city_list_dataframe.to_csv('./CSV/city_list_dataframe.csv')
        bro.quit()

    # fetch the data
    def travel_notes(self):
        mafengwo_id = 1
        # read the csv into a dataframe; header=0 means the first row holds the column labels
        df = pd.read_csv('./CSV/city_list_dataframe.csv', header=0)
        # the city_href column, as a series
        x = df['city_href']
        # convert the series to a list
        city_id = x.values.tolist()
        # empty dataframe for the scraped rows
        mafengwo_dataframe = pd.DataFrame([], columns=['id', 'province', 'city', 'county', 'travel_number'])
        all_num = len(self.municipality_id) + len(city_id)
        print('total:', all_num)
        # municipality data
        response = requests.get(url=self.start_url, headers=self.headers)
        page_text = response.text
        tree = etree.HTML(page_text)
        # the municipality hrefs come first
        municipality = []  # municipality ids
        city_list_municipality = tree.xpath('/html/body/div[2]/div[2]/div/div[3]/div[1]/div[1]/dl[1]/dd/a/@href')
        for city_href in city_list_municipality:
            id_ = re.findall('/travel-scenic-spot/mafengwo/(.*?).html', city_href)[0]
            municipality.append(id_)
        for _id in municipality:
            url = 'http://www.mafengwo.cn/yj/%s/'
            city_url = format(url % _id)
            number = 0
            # travel-note count
            for num in range(1, 5):
                url_ = city_url + 's-0-0-%s-0-1-0.html'
                new_url = format(url_ % num)
                response_ = requests.get(url=new_url, headers=self.headers)
                page_text_ = response_.text
                tree_ = etree.HTML(page_text_)
                try:
                    num_str = tree_.xpath('//div[@class="_pagebar"]/div/span[1]/span[2]/text()')[0]
                    number_ = int(num_str)
                except Exception:
                    list_ = []
                    a_list = tree_.xpath('//div[@class="post-list"]/ul/li/h2/a[2]')
                    for a in a_list:
                        text = a.xpath('./text()')[0]
                        list_.append(text)
                    number_ = len(list_)
                number += number_
            # a random one-decimal delay between 0.2 and 1 second
            random_sleep_time = float(round(random.uniform(0.2, 1), 1))
            # sleep that long
            time.sleep(random_sleep_time)
            # province, city, and county names
            response = requests.get(url=city_url, headers=self.headers)
            page_text = response.text
            tree = etree.HTML(page_text)
            province_name = tree.xpath('//div[@class="crumb"]/div[3]/div[1]/span/a/text()')[0]
            city_name = province_name
            county = '空'  # '空' marks a missing county
            # print(mafengwo_id, province_name, city_name, county, number)
            mafengwo_dataframe.loc[mafengwo_id - 1, 'id'] = mafengwo_id
            mafengwo_dataframe.loc[mafengwo_id - 1, 'province'] = province_name
            mafengwo_dataframe.loc[mafengwo_id - 1, 'city'] = city_name
            mafengwo_dataframe.loc[mafengwo_id - 1, 'county'] = county
            mafengwo_dataframe.loc[mafengwo_id - 1, 'travel_number'] = number
            mafengwo_id += 1
        # city data
        for _id_ in city_id:
            url = 'http://www.mafengwo.cn/yj/%s/'
            city_url = format(url % _id_)
            number = 0  # running travel-note count
            # travel-note count
            for num in range(1, 5):
                url_ = city_url + 's-0-0-%s-0-1-0.html'
                new_url = format(url_ % num)
                response_ = requests.get(url=new_url, headers=self.headers)
                page_text_ = response_.text
                tree_ = etree.HTML(page_text_)
                try:
                    num_str = tree_.xpath('//div[@class="_pagebar"]/div/span[1]/span[2]/text()')[0]
                    number_ = int(num_str)
                except Exception:
                    list_ = []
                    a_list = tree_.xpath('//div[@class="post-list"]/ul/li/h2/a[2]')
                    for a in a_list:
                        text = a.xpath('./text()')[0]
                        list_.append(text)
                    number_ = len(list_)
                # a random one-decimal delay between 0.2 and 1 second
                random_sleep_time = float(round(random.uniform(0.2, 1), 1))
                # sleep that long
                time.sleep(random_sleep_time)
                number += number_
            # province, city, and county names
            response = requests.get(url=city_url, headers=self.headers)
            page_text = response.text
            tree = etree.HTML(page_text)
            div_list = tree.xpath('//div[@class="crumb"]/div')
            if len(div_list) == 5:
                province_name = tree.xpath('//div[@class="crumb"]/div[2]/div[1]/span/a/text()')[0]
                city_name = tree.xpath('//div[@class="crumb"]/div[3]/div[1]/span/a/text()')[0]
                county = tree.xpath('//div[@class="crumb"]/div[4]/div[1]/span/a/text()')[0]
            else:
                province_name = tree.xpath('//div[@class="crumb"]/div[2]/div[1]/span/a/text()')[0]
                city_name = tree.xpath('//div[@class="crumb"]/div[3]/div[1]/span/a/text()')[0]
                county = '空'

            # print(city_id, province_name, city_name, county, number)
            mafengwo_dataframe.loc[mafengwo_id - 1, 'id'] = mafengwo_id
            mafengwo_dataframe.loc[mafengwo_id - 1, 'province'] = province_name
            mafengwo_dataframe.loc[mafengwo_id - 1, 'city'] = city_name
            mafengwo_dataframe.loc[mafengwo_id - 1, 'county'] = county
            mafengwo_dataframe.loc[mafengwo_id - 1, 'travel_number'] = number
            mafengwo_id += 1

        self.check_csv("mafengwo_dataframe")
        mafengwo_dataframe.set_index('id', inplace=True)
        mafengwo_dataframe.to_csv('./CSV/mafengwo_dataframe.csv')

    # sight (and food) data
    def sight(self):
        # empty dataframes for the scraped rows
        food_dataframe = pd.DataFrame([], columns=['id', 'city', 'food', 'place', 'comment_number', 'travel_notes_number'])
        sight_dataframe = pd.DataFrame([], columns=['id', 'city', 'sight', 'place', 'comment_number', 'travel_notes_number'])
        food_food_id = 1
        sight_sight_id = 1
        df = pd.read_csv('./CSV/city_list_dataframe.csv', header=0)
        x = df['city_name']
        city_name = x.values.tolist()
        # np.unique deduplicates (and sorts) city_name
        city_name_list = np.unique(city_name)
        for c_name in city_name_list:
            last_id_food = food_food_id
            last_id_sight = sight_sight_id
            # up to 10 pages
            for i in range(1, 11):
                url = 'http://www.mafengwo.cn/search/q.php?q=' + c_name + '&p=' + str(i) + '&t=pois&kt=1'
                response = requests.get(url=url, headers=self.headers)
                page_text = response.text
                tree = etree.HTML(page_text)
                li_list = tree.xpath('//*[@id="_j_search_result_left"]/div/div/ul/li')
                for li in li_list:
                    try:
                        # entry title
                        a_name = li.xpath('./div/div[2]/h3/a/text()')[0]
                        # location
                        place = li.xpath('./div/div[2]/ul/li[1]/a/text()')[0]
                        # review count
                        fengping = li.xpath('./div/div[2]/ul/li[2]/a/text()')[0]
                        # travel-note mention count
                        youji = li.xpath('./div/div[2]/ul/li[3]/a/text()')[0]
                    except Exception:
                        continue
                    # drop entries whose title mentions neither 景点 nor 美食
                    if a_name.find('景点') == -1:
                        if a_name.find('美食') == -1:
                            # skip this entry
                            continue
                        # the title mentions 美食
                        else:
                            if food_food_id == (last_id_food + 30):
                                continue
                            # food name
                            food_name = a_name.replace('美食 -', '')
                            # review count, e.g. 蜂评(4633)
                            comment_number = int(re.findall(r'蜂评\((\d+)\)', fengping)[0])
                            # travel-note mention count
                            travel_notes_number = int(re.findall(r'游记\((\d+)\)', youji)[0])

                            # sql = "INSERT INTO food(id,city,food,place,comment_number,travel_notes_number)VALUE('%d','%s','%s','%s','%d','%d')" % (
                            #     food_food_id, c_name, food_name, place, comment_number, travel_notes_number)

                            food_dataframe.loc[food_food_id-1, 'id'] = food_food_id
                            food_dataframe.loc[food_food_id-1, 'city'] = c_name
                            food_dataframe.loc[food_food_id-1, 'food'] = food_name
                            food_dataframe.loc[food_food_id-1, 'place'] = place
                            food_dataframe.loc[food_food_id-1, 'comment_number'] = comment_number
                            food_dataframe.loc[food_food_id-1, 'travel_notes_number'] = travel_notes_number

                            print(food_food_id, c_name, food_name, place, comment_number, travel_notes_number)
                            food_food_id += 1

                            # move on to the next entry
                            continue
                    if sight_sight_id == (last_id_sight + 30):
                        if food_food_id == (last_id_food + 30):
                            break
                        else:
                            continue
                    # sight name
                    sight_name = a_name.replace('景点 - ', '')
                    # review count, e.g. 蜂评(4633)
                    comment_number = int(re.findall(r'蜂评\((\d+)\)', fengping)[0])
                    # travel-note mention count
                    travel_notes_number = int(re.findall(r'游记\((\d+)\)', youji)[0])
                    # sql = "INSERT INTO sight(id,city,sight,place,comment_number,travel_notes_number)VALUE('%d','%s','%s','%s','%d','%d')" % (
                    #     self.sight_id, c_name, sight_name, place, comment_number, travel_notes_number)

                    sight_dataframe.loc[sight_sight_id - 1, 'id'] = sight_sight_id
                    sight_dataframe.loc[sight_sight_id - 1, 'city'] = c_name
                    sight_dataframe.loc[sight_sight_id - 1, 'sight'] = sight_name
                    sight_dataframe.loc[sight_sight_id - 1, 'place'] = place
                    sight_dataframe.loc[sight_sight_id - 1, 'comment_number'] = comment_number
                    sight_dataframe.loc[sight_sight_id - 1, 'travel_notes_number'] = travel_notes_number
                    print(sight_sight_id, c_name, sight_name, place, comment_number, travel_notes_number)
                    sight_sight_id += 1

                # a random one-decimal delay between 0.2 and 1 second
                random_sleep_time = float(round(random.uniform(0.2, 1), 1))
                # sleep that long
                time.sleep(random_sleep_time)
        self.check_csv("food_dataframe")
        self.check_csv("sight_dataframe")
        food_dataframe.set_index('id', inplace=True)
        food_dataframe.to_csv('./CSV/food_dataframe.csv')
        sight_dataframe.set_index('id', inplace=True)
        sight_dataframe.to_csv('./CSV/sight_dataframe.csv')

    # close the browser and the database
    # def close(self):
    #     # close the browser
    #     # self.bro.quit()
    #     # close the cursor and the database connection
    #     self.cursor.close()
    #     self.db.close()

    # make sure the csv file exists
    def check_csv(self, csv_name):
        path = './CSV'
        file = path + '/' + csv_name + '.csv'
        if not os.path.exists(path):
            os.makedirs(path)
        if not os.path.exists(file):
            # os.mknod(file)
            f = open(file, 'w')
            f.close()

    def run(self):
        self.get_province_id()
        self.get_city_id()
        self.travel_notes()
        self.sight()
        # self.close()


if __name__ == '__main__':
    spider = Spider()
    spider.run()

Because of Mafengwo's anti-scraping measures, the crawler fetches one page every 0.2 to 1 second, which makes it slow: a full run takes several hours, and even the 50-city run in this post took about 20 minutes. For learning and exchange only.
