Scraping Taobao Product Information with the requests Library

This article shows how to fetch the MTOP JSONP data behind a Taobao product search: setting the request headers (cookie, referer and user-agent), sending the request with the requests library, and then parsing the returned JSONP payload to extract the product information.

I. Implementation Steps

Find the URL behind the search request and copy its cookie, referer and user-agent values. A GET request to that URL returns data in mtopjsonp1(...) format containing the product information; the json library converts it into a dict, which makes the data easy to extract afterwards.
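Concretely, the response body looks like `mtopjsonp1({...})`: a JSON object wrapped in a callback name. The conversion step can be sketched as follows (the sample payload here is made up for illustration):

```python
import json
import re

def loads_jsonp(jsonp_text):
    # Pull out the {...} body inside the callback wrapper and parse it as JSON.
    match = re.match(r".*?({.*}).*", jsonp_text, re.S)
    if match is None:
        raise ValueError("Invalid JSONP input")
    return json.loads(match.group(1))

# Made-up response for illustration
sample = 'mtopjsonp1({"data": {"itemsArray": [{"title": "backpack"}]}})'
result = loads_jsonp(sample)
print(result["data"]["itemsArray"][0]["title"])  # → backpack
```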

II. Preparation

1. Open Taobao, log in, and search for 书包 (backpack).

2. Open the browser's developer tools and switch to the Network tab.

3. Refresh the page, then copy a product name and search for it inside developer tools (Ctrl+F). The request whose URL starts with "h5api.m.taobao.com/h5/mtop.relationrecommend.wirelessrecomme..." contains that product's information; click the search result and open the Preview tab to inspect the data. This is the data we want to scrape.

4. Now click the Headers tab and find the cookie, referer and user-agent values; that is everything our request headers need.

5. Find the correct URL. Note that it is not the page address "...search?commend=all&ie=utf8&initiative_id=tbinde..." — requesting that address returns the wrong response.

Instead, take the URL directly from the request we located by searching for the product name; it is the long string of characters shown in the code below.
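To make sense of that long string, percent-decode its `data` query parameter. Here is a sketch using a shortened, made-up fragment of that parameter (the `appId` and the double-encoded keyword are taken from the real URL):

```python
from urllib.parse import unquote

# A shortened, made-up fragment of the percent-encoded `data` parameter
encoded = "%7B%22appId%22%3A%2234385%22%2C%22params%22%3A%22%7B%5C%22q%5C%22%3A%5C%22%25E4%25B9%25A6%25E5%258C%2585%5C%22%7D%22%7D"
decoded = unquote(encoded)
print(decoded)  # JSON text; the search keyword (q), page, pageSize etc. live inside it

# The keyword itself is percent-encoded a second time, so decode once more to read it:
print(unquote("%E4%B9%A6%E5%8C%85"))  # → 书包
```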

III. Code

import requests
import json
import re

# URL captured from DevTools for the 书包 (backpack) search; a different keyword
# produces a different URL, since the keyword, timestamp and sign are baked into it.
url = 'https://h5api.m.taobao.com/h5/mtop.relationrecommend.wirelessrecommend.recommend/2.0/?jsv=2.6.2&appKey=12574478&t=1710678823434&sign=e7680aee4bec4909677ef8ca16f573fb&api=mtop.relationrecommend.WirelessRecommend.recommend&v=2.0&type=jsonp&dataType=jsonp&callback=mtopjsonp1&data=%7B%22appId%22%3A%2234385%22%2C%22params%22%3A%22%7B%5C%22device%5C%22%3A%5C%22HMA-AL00%5C%22%2C%5C%22isBeta%5C%22%3A%5C%22false%5C%22%2C%5C%22grayHair%5C%22%3A%5C%22false%5C%22%2C%5C%22from%5C%22%3A%5C%22nt_history%5C%22%2C%5C%22brand%5C%22%3A%5C%22HUAWEI%5C%22%2C%5C%22info%5C%22%3A%5C%22wifi%5C%22%2C%5C%22index%5C%22%3A%5C%224%5C%22%2C%5C%22rainbow%5C%22%3A%5C%22%5C%22%2C%5C%22schemaType%5C%22%3A%5C%22auction%5C%22%2C%5C%22elderHome%5C%22%3A%5C%22false%5C%22%2C%5C%22isEnterSrpSearch%5C%22%3A%5C%22true%5C%22%2C%5C%22newSearch%5C%22%3A%5C%22false%5C%22%2C%5C%22network%5C%22%3A%5C%22wifi%5C%22%2C%5C%22subtype%5C%22%3A%5C%22%5C%22%2C%5C%22hasPreposeFilter%5C%22%3A%5C%22false%5C%22%2C%5C%22prepositionVersion%5C%22%3A%5C%22v2%5C%22%2C%5C%22client_os%5C%22%3A%5C%22Android%5C%22%2C%5C%22gpsEnabled%5C%22%3A%5C%22false%5C%22%2C%5C%22searchDoorFrom%5C%22%3A%5C%22srp%5C%22%2C%5C%22debug_rerankNewOpenCard%5C%22%3A%5C%22false%5C%22%2C%5C%22homePageVersion%5C%22%3A%5C%22v7%5C%22%2C%5C%22searchElderHomeOpen%5C%22%3A%5C%22false%5C%22%2C%5C%22search_action%5C%22%3A%5C%22initiative%5C%22%2C%5C%22sugg%5C%22%3A%5C%22_4_1%5C%22%2C%5C%22sversion%5C%22%3A%5C%2213.6%5C%22%2C%5C%22style%5C%22%3A%5C%22list%5C%22%2C%5C%22ttid%5C%22%3A%5C%22600000%40taobao_pc_10.7.0%5C%22%2C%5C%22needTabs%5C%22%3A%5C%22true%5C%22%2C%5C%22areaCode%5C%22%3A%5C%22CN%5C%22%2C%5C%22vm%5C%22%3A%5C%22nw%5C%22%2C%5C%22countryNum%5C%22%3A%5C%22156%5C%22%2C%5C%22m%5C%22%3A%5C%22pc%5C%22%2C%5C%22page%5C%22%3A%5C%221%5C%22%2C%5C%22n%5C%22%3A48%2C%5C%22q%5C%22%3A%5C%22%25E4%25B9%25A6%25E5%258C%2585%5C%22%2C%5C%22tab%5C%22%3A%5C%22all%5C%22%2C%5C%22pageSize%5C%22%3A48%2C%5C%22totalPage%5C%22%3A100%2C%5C%22totalResults%5C%22%3A4800%2C%5C%22sourceS%5C%22%3A%5C%220%5C%22%2C%5C%22sort%5C%22%3A%5C%22_coefp%5C%22%2C%5C%22bcoffset%5C%22%3A%5C%22%5C%22%2C%5C%22ntoffset%5C%22%3A%5C%22%5C%22%2C%5C%22filterTag%5C%22%3A%5C%22%5C%22%2C%5C%22service%5C%22%3A%5C%22%5C%22%2C%5C%22prop%5C%22%3A%5C%22%5C%22%2C%5C%22loc%5C%22%3A%5C%22%5C%22%2C%5C%22start_price%5C%22%3Anull%2C%5C%22end_price%5C%22%3Anull%2C%5C%22startPrice%5C%22%3Anull%2C%5C%22endPrice%5C%22%3Anull%2C%5C%22itemIds%5C%22%3Anull%2C%5C%22p4pIds%5C%22%3Anull%2C%5C%22categoryp%5C%22%3A%5C%22%5C%22%7D%22%7D'
headers = {
    "Cookie": "your cookie here",  # paste the cookie copied from DevTools
    "referer": "https://s.taobao.com/",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36",
}
response = requests.get(url, headers=headers)

def loads_jsonp(jsonp):
    # Extract the {...} body from the mtopjsonp1(...) wrapper and parse it as JSON.
    match = re.match(r".*?({.*}).*", jsonp, re.S)
    if match is None:
        raise ValueError('Invalid JSONP input')
    return json.loads(match.group(1))

json_data = loads_jsonp(response.text)
# Number of products extracted
print(len(json_data['data']['itemsArray']))

Running the code prints the number of products in itemsArray (the URL above requests 48 items per page).
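Each element of itemsArray describes one product. The exact field names depend on the mtop response version; the names below (`title`, `price`) are assumptions for illustration — inspect json_data yourself to confirm what your response contains:

```python
# Hypothetical itemsArray elements; inspect the real response to confirm field names.
items_array = [
    {"title": "Student backpack", "price": "59.00"},
    {"title": "Travel backpack", "price": "129.00"},
]
for item in items_array:
    # .get() avoids a KeyError when a field is missing on some items
    print(item.get("title"), item.get("price"))
```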

To scrape further pages, modify the URL: percent-decode its data parameter, find the page field inside it, and change the value.
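A sketch of that rewrite, using a toy data value standing in for the real one (note the real request also carries a sign parameter computed over data, so a modified URL may be rejected unless the sign is recomputed):

```python
import json
from urllib.parse import quote, unquote

# Toy stand-in for the real `data` parameter (the real one carries many more fields).
encoded_data = quote(json.dumps({
    "appId": "34385",
    "params": json.dumps({"page": "1", "pageSize": 48}),
}))

def set_page(encoded_data, page):
    # Decode the outer JSON, bump the inner page field, and re-encode.
    outer = json.loads(unquote(encoded_data))
    inner = json.loads(outer["params"])
    inner["page"] = str(page)
    outer["params"] = json.dumps(inner)
    return quote(json.dumps(outer))

page2 = set_page(encoded_data, 2)
print(json.loads(unquote(page2))["params"])
```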

 
