携程旅行网广东省景点搜集

小小爬虫爱好者

于 2024-12-09 01:59:19 发布

阅读量1k

点赞数 18

文章标签： python 爬虫旅游

本文链接：https://blog.csdn.net/m0_74417296/article/details/144335895

版权

本项目内容是：在携程旅行网中对广东省的景点数据进行爬取。

以下为详细步骤及思路：
1. 对爬取网页中数据的存储位置进行分析，广东省景点网页
打开开发者工具，使用搜索功能找到数据存放的位置

2. 通过搜索到的文件，可以发现数据存放地址

3. 获取该链接的json数据，即可找到所有数据内容。

4. 对第二页的网页进行对比，目标便明确了，高亮为需进行逆向的属性
爬取目标链接为：
**https://m.ctrip.com/restapi/soa2/18109/json/getAttractionList?_fxpcqlniredt=09031099117558213960&x-traceID=09031099117558213960-1733674776796-5297961**
请求方式为：
post请求
参数为：
_fxpcqlniredt: 09031099117558213960
**x-traceID: 09031099117558213960-1733674776796-5297961**
请求负载为：
{"head":{ **"cid":"09031099117558213960"** ,"ctok":"","cver":"1.0","lang":"01","sid":"8888","syscode":"999","auth":"","xsid":"","extension":[]},"scene":"online","districtId":100051, **"index":1** ,"sortType":1,"count":10,"filter":{"filterItems":["0"]},"coordinate":{"latitude":21.28874136580043,"longitude":110.33244777197878,"coordinateType":"WGS84"},"returnModuleType":"all"}

5. 通过重复比较，可以发现 x-traceID 的值时刻变化着，便确实x-traceID存在加密，因此需要进行js逆向，而index的值是由于翻页产生的变化。

从上图可找到x-traceID的加密方法是：

#var n = e + "-" + (new Date).getTime() + "-" + Math.floor(1e7 * Math.random());

#用python语言表示为：

import time
import random

n1 = int( time.time() * 1000 )
n2 = int(1e7 * random.random())

6.找到变化原因后，接着就是对网页进行爬取
由于数据存储在json里，因此需要对该链接的json数据进行获取

resp = requests.post(
        "https://m.ctrip.com/restapi/soa2/18109/json/getAttractionList",
        headers=headers,
        cookies=cookies,
        params=params,
        json=json_data
    )
html_json = resp.json()

7. json数据的获取与字典相似，参考以下图片可编写如下代码：

card = html_json['attractionList'][i]['card']
poiName = card['poiName'] #景点名
districtName = card['districtName']  # 城市
displayField = card['displayField']  # 地址
url = card['detailUrlInfo']['url']  # 链接
commentScore = card.get('commentScore',0) #评分
heatScore = card.get('heatScore',0) #热度
price = card.get('price','未提供')  # 门票
features = card.get('shortFeatures','未提供')[0]  # 特点
sightCategoryInfo = card.get('sightCategoryInfo','未提供')  # 类别

8. 完整代码如下所示：

import pymysql
import requests
import time
import random

#连接数据库
conn = pymysql.connect(host="127.0.0.1", database="gsc_db", user='root', password='123456', charset='utf8')
cursor = conn.cursor()

cookies = {
    'UBT_VID': '1728978531192.7a32iNJ24XYh',
    'MKT_CKID': '1728978531548.ta9kw.rmqq',
    'GUID': '09031099117558213960',
    '_RSG': 'rJaVC01L4g5oWzt1DGJhE9',
    '_RDG': '2835c4e1c5f8bb2ed003d7adb458381615',
    '_RGUID': '5a845410-0c7e-44f5-93ab-4d0341dfd832',
    'nfes_isSupportWebP': '1',
    '_ga': 'GA1.1.691848462.1728980346',
    '_bfaStatusPVSend': '1',
    'cticket': '67A823297D1D925133ED11DC609DBD186BF1B271E9FE1C4759326381FF82177B',
    'login_type': '0',
    'login_uid': '5DB309DF4469D70D7DE6CDF186BEE12E',
    'DUID': 'u=274F4632961D3B1B682CA5A0E3A1FDD1&v=0',
    'IsNonUser': 'F',
    'AHeadUserInfo': 'VipGrade=0&VipGradeName=%C6%D5%CD%A8%BB%E1%D4%B1&UserName=&NoReadMessageCount=0',
    '_RF1': '2408%3A8459%3A6a30%3A2848%3A555%3A3bd2%3A55be%3A5372',
    'MKT_Pagesource': 'PC',
    '_ubtstatus': '%7B%22vid%22%3A%221728978531192.7a32iNJ24XYh%22%2C%22sid%22%3A20%2C%22pvid%22%3A4%2C%22pid%22%3A0%7D',
    '_bfi': 'p1%3D0%26p2%3D0%26v1%3D4%26v2%3D0',
    '_bfaStatus': 'success',
    'Hm_lvt_a8d6737197d542432f4ff4abc6e06384': '1732626315,1733409627,1733454424,1733471439',
    'Hm_lpvt_a8d6737197d542432f4ff4abc6e06384': '1733471439',
    'HMACCOUNT': 'C4603FF5FD89DD5B',
    '_ga_9BZF483VNQ': 'GS1.1.1733471247.20.1.1733471450.0.0.0',
    '_ga_5DVRDQD429': 'GS1.1.1733471247.20.1.1733471450.0.0.0',
    '_ga_B77BES1Z8Z': 'GS1.1.1733471247.20.1.1733471450.49.0.0',
    '_jzqco': '%7C%7C%7C%7C1733409631243%7C1.437472813.1728978531563.1733471461338.1733471505912.1733471461338.1733471505912.undefined.0.0.136.136',
    '_bfa': '1.1728978531192.7a32iNJ24XYh.1.1733471503148.1733471502143.20.11.10650142842',
}
headers = {
    'accept': '*/*',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'content-type': 'application/json',
    'cookieorigin': 'https://you.ctrip.com',
    'origin': 'https://you.ctrip.com',
    'priority': 'u=1, i',
    'referer': 'https://you.ctrip.com/',
    'sec-ch-ua': '"Microsoft Edge";v="131", "Chromium";v="131", "Not_A Brand";v="24"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'empty',
    'sec-fetch-mode': 'cors',
    'sec-fetch-site': 'same-site',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36 Edg/131.0.0.0',
}

#x-traceID的生成
n1 = int( time.time() * 1000 )
n2 = int(1e7 * random.random())
t = cookies.get("GUID")
n = f'{t} + "-" + {n1} + "-" + {n2}'

params = {
    '_fxpcqlniredt': t,
    'x-traceID': n,
}
for index in range(1,301):
    json_data = {
    'head': {
        'cid': t,
        'ctok': '',
        'cver': '1.0',
        'lang': '01',
        'sid': '8888',
        'syscode': '999',
        'auth': '',
        'xsid': '',
        'extension': [],
    },
    'scene': 'online',
    'districtId': 100051,
    'index': index,
    'sortType': 1,
    'count': 10,
    'filter': {
        'filterItems': [
            '0',
        ],
    },
    'coordinate': {
        'latitude': 21.28942580322746,
        'longitude': 110.33324580318082,
        'coordinateType': 'WGS84',
    },
    'returnModuleType': 'all',
}
    resp = requests.post(
        "https://m.ctrip.com/restapi/soa2/18109/json/getAttractionList",
        headers=headers,
        cookies=cookies,
        params=params,
        json=json_data
    )
    html_json = resp.json()

    for i in range(0,10):
        try:
            card = html_json['attractionList'][i]['card']
            # 不会缺少的数据量
            poiName = card['poiName'] #景点名
            districtName = card['districtName']  # 城市
            displayField = card['displayField']  # 地址
            url = card['detailUrlInfo']['url']  # 链接

            # 部分会缺少的数据量，用get设置默认值
            commentScore = card.get('commentScore',0) #评分
            heatScore = card.get('heatScore',0) #热度
            price = card.get('price','未提供')  # 门票
            features = card.get('shortFeatures','未提供')[0]  # 特点
            sightCategoryInfo = card.get('sightCategoryInfo','未提供')  # 类别

            sql = ("INSERT INTO gd_province (poiName,price,features,sightCategoryInfo,commentScore,heatScore,districtName,displayField,url) "
                   "VALUES ( %s, %s, %s, %s,%s, %s, %s, %s, %s)")
            cursor.execute(sql, (poiName,price,features,sightCategoryInfo,commentScore,heatScore,districtName,displayField,url))

            conn.commit()
        except:
            print(f"第{index}页的第{i+1}个信息框出错")

    print(f"第{index}页爬取成功")

# 写入数据库后关闭数据库
cursor.close()
conn.close()

9. 数据存入数据库中，使用sql语句查看数据内容