Innovation Project Training (2) - CSDN Blog

Article link: https://blog.csdn.net/a0939763286/article/details/115836444

Preface

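Our group plans to build a price-comparison website for domestic travel in China,
and my part is collecting and cleaning up data from the major hotel-booking sites.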

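This mostly follows experience posts shared on this site, plus my own changes and understanding.
These are the notes of a complete beginner with zero experience, learning while scraping.
If you spot mistakes or have better suggestions, feel free to point them out and discuss.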

Tuniu Hotels

Tools: Python + requests
Goal: get each hotel's name, star rating, user score, number of reviews, and lowest price

Here is the result first.
As before, the output is about as bare-bones as it gets.
[Screenshot of the results]

Now for the main part

As before, search for hotels in Shanghai. Prices are visible without logging in.

URL structure:
https://hotel.tuniu.com/list/{unknown parameter}?checkindate={check-in date}&checkoutdate={check-out date}&cityName={city name}
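The template above can be filled in with a few lines of Python. This is only a minimal sketch: `build_list_url` is a hypothetical helper name, `"0000"` is a placeholder for the still-unknown path parameter, and the `YYYY-MM-DD` date format is an assumption. The city name is percent-encoded because it contains non-ASCII characters.

```python
from urllib.parse import quote


def build_list_url(path_param, checkin, checkout, city_name):
    """Fill in the search-page URL template (hypothetical helper)."""
    template = (
        "https://hotel.tuniu.com/list/{}"
        "?checkindate={}&checkoutdate={}&cityName={}"
    )
    # quote() percent-encodes the UTF-8 bytes of the city name
    return template.format(path_param, checkin, checkout, quote(city_name))


# "0000" is a placeholder value and the date format is assumed
print(build_list_url("0000", "2021-05-01", "2021-05-02", "上海"))
```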

Press F12 to open the browser's developer tools and inspect the page.

Click the XHR filter and you will see two list requests: one is city/list and the other is hotel/list.
The city/list response contains the cityCode of each city.

cityCode: 2500, cityName: "上海"
So the unknown parameter in the search URL is the cityCode of the city.

Save the city/list response data directly to a text file (City.txt).

Then extract the values with regular expressions.

import re

# Build a mapping from city name to its cityCode
def getCity():
    with open('City.txt', 'r', encoding='utf-8') as f:
        data = f.read()  # read the saved city/list response
    find_city = re.findall(r'cityName.*?\"\:\"(.*?)\"', data)
    find_city_code = re.findall(r'cityCode.*?\"\:(.*?)\,', data)
    city_dict = {}
    for city, code in zip(find_city, find_city_code):
        city_dict[city] = code

    # Alternatively, save the city-to-code mapping to its own text file for later use
    # with open('dict.txt', 'w', encoding='utf-8') as f:
    #     f.write(str(city_dict))
    return city_dict
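One thing to watch with two separate findall calls is that the name list and the code list can drift out of step if one field is missing somewhere. Below is a sketch of an alternative that captures both values in a single match. It assumes cityCode is numeric and appears before cityName inside the same JSON object (as in the snippet shown earlier). `parse_city_codes` is a hypothetical name, and the sample string is made up for illustration; only the 2500 / 上海 pair comes from the real response.

```python
import re

# One pattern captures the code and the name from the same {...} object.
# [^{}]*? stops the match from crossing into a neighbouring object.
PAIR = re.compile(r'"cityCode"\s*:\s*(\d+)[^{}]*?"cityName"\s*:\s*"\s*([^"]*?)\s*"')


def parse_city_codes(text):
    """Return {cityName: cityCode}; codes stay strings, as in getCity()."""
    return {name: code for code, name in PAIR.findall(text)}


# Made-up sample; the second entry is purely illustrative
sample = '[{"cityCode":2500,"cityName":"上海"},{"cityCode":200,"cityName":"北京"}]'
print(parse_city_codes(sample))
```

Surrounding whitespace inside the quoted name is dropped by the pattern, so a value like `" 上海"` still maps to the key `上海`.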

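When you go to the next page, the URL does not change, but another hotel/list request shows up.
It uses the POST method, and the Request Payload below it holds the parameters sent with the request.
Comparing the two hotel/list requests, pageNo changes from 1 to 2.

So to fetch the data for other pages, you only need to change pageNo.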
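As a rough sketch of that idea, the snippet below produces one JSON request body per page by changing only pageNo. `BASE_PAYLOAD` is a trimmed-down stand-in (the real Request Payload has many more fields), and `payloads_for_pages` is a hypothetical helper name. No request is actually sent here.

```python
import copy
import json

# Trimmed-down stand-in for the Request Payload; the real one has more fields.
BASE_PAYLOAD = {"pageNo": 1, "PageSize": 20}


def payloads_for_pages(base, first_page, last_page):
    """Yield one JSON request body per page, changing only pageNo."""
    for page in range(first_page, last_page + 1):
        payload = copy.deepcopy(base)  # leave the template untouched
        payload["pageNo"] = page
        yield json.dumps(payload)


for body in payloads_for_pages(BASE_PAYLOAD, 1, 3):
    print(body)
```

Each body could then be sent as the POST data for the hotel/list request, one page at a time.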
OK, time to write the code.

Just copy the Headers and the Request Payload parameters from the browser.

import json
import urllib.parse as p

def getData(session, page, city, checkin, checkout):
    city_code = getCity()  # look up the cityCode mapping
    url = 'https://hotel.tuniu.com/hotel-api/hotel/list?c=%7B%22ct%22%3A20000%7D'
    # Search-page URL, sent as the referer
    r_url = 'https://hotel.tuniu.com/list/{}p0s0b0?checkindate={}&checkoutdate={}&cityName={}&city={}&poi=0&stars=0&brands=0'.format(city_code[city], checkin, checkout, p.quote(city), city_code[city])
    # Request Payload copied from the browser; only pageNo changes between pages
    data = {
        "primary": {
            "checkIn": checkin,
            "checkOut": checkout,
            "cityCode": city_code[city],
            "cityType": 0,
            'adultNum': 2,
            "childNum": 0, "childAges": [], "keyword": "", "roomNum": 1
        },
        "secondary": {
            "poi": {"locationType": 2, "pois": []}, "prices": [], "stars": [], "brands": [], "features": [],
            "facilities": [], "commentScore": "", "bedTypes": []
        },
        "threeStages": [], "suggest": {}, "sort": 0, "customerClient": 2,
        "returnDistance": 'true',
        "secondaryDist": {"pValue": "", "userType": 0},
        "pageNo": page,
        "PageSize": 20
    }
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
        'referer': r_url,
        'cookie': 'your own cookie'
    }
    jdata = json.dumps(data)  # serialize the payload to a JSON string
    res = session.post(url=url, data=jdata, headers=headers)
    print(res.text)

Extract the fields we need

    # Extract each field with a regular expression.
    # `list` is presumably the response text (res.text); note that the name shadows the built-in.
    ID = re.findall(r'\"hotelId\"\:(.*?)\,', list)
    Name = re.findall(r'\"chineseName\"\:\"(.*?)\"\,', list)
    Score = re.findall(r'\"score\"\:(.*?)\,', list)
    Star = re.findall(r'\"starName\"\:\"(.*?)\"\,', list)
    comment = re.findall(r'\,\"count\"\:(.*?)\}', list)
    Pic = re.findall(r'\"firstPic\"\:\"(.*?)\"\,', list)
    Price = re.findall(r'\"lowestPrice\"\:(.*?)\,', list)
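The separate lists still have to be stitched back together into one record per hotel. Here is a minimal sketch of that step, reusing the same patterns (firstPic is left out for brevity). `parse_hotels` and `FIELDS` are hypothetical names, and the sample text is made up: it only borrows the field names, so the real response layout may well differ. The length check is there because zipping lists of different lengths would silently pair values from different hotels.

```python
import re

# Same patterns as above, keyed by the output field name
FIELDS = {
    "id": r'"hotelId":(.*?),',
    "name": r'"chineseName":"(.*?)",',
    "score": r'"score":(.*?),',
    "star": r'"starName":"(.*?)",',
    "comments": r',"count":(.*?)\}',
    "price": r'"lowestPrice":(.*?),',
}


def parse_hotels(text):
    """Return one dict per hotel; raise if the field lists do not line up."""
    columns = {key: re.findall(pattern, text) for key, pattern in FIELDS.items()}
    if len({len(values) for values in columns.values()}) != 1:
        raise ValueError("field counts differ; the lists cannot be zipped safely")
    return [dict(zip(columns, row)) for row in zip(*columns.values())]


# Made-up sample using only the field names from the real response
sample = (
    '[{"hotelId":101,"chineseName":"Hotel A","score":4.5,'
    '"starName":"Economy","lowestPrice":199,"count":12},'
    '{"hotelId":102,"chineseName":"Hotel B","score":4.8,'
    '"starName":"Luxury","lowestPrice":899,"count":340}]'
)
for hotel in parse_hotels(sample):
    print(hotel)
```

All values come back as strings, just like the findall results, so convert prices and scores to numbers before comparing them.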