Targeted scraping of Taobao product names and prices (based on Prof. Song Tian's course)

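Prof. Song Tian's code can no longer scrape Taobao as written, because Taobao has since upgraded its anti-scraping measures.

The fix: replace the cookie in the request headers with the cookie from your own Taobao session (every user's cookie value is different).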

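For the detailed steps, see: 通过requests库re库进行淘宝商品爬虫爬取(对中国大学mooc嵩天老师爬虫进行修改) ("Scraping Taobao products with the requests and re libraries: a modification of Prof. Song Tian's China University MOOC crawler"), Omann's blog on CSDN.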
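As a rough illustration of the cookie replacement, the sketch below builds a header dict from a cookie string that you supply yourself. The helper names `build_headers` and `cookie_from_env` and the environment variable `TAOBAO_COOKIE` are assumptions made for this example; they are not part of the course code or of any library.

```python
import os


def cookie_from_env(var_name="TAOBAO_COOKIE"):
    """Read the cookie string from an environment variable (hypothetical name)."""
    return os.environ.get(var_name, "")


def build_headers(cookie, user_agent="Mozilla/5.0"):
    """Return a minimal header dict that carries the caller's own cookie."""
    if not cookie:
        # Without a cookie from a logged-in browser session the request
        # would be sent anonymously, which is the case this post works around.
        raise ValueError("a cookie string copied from your own browser session is required")
    return {"user-agent": user_agent, "cookie": cookie}
```

A call such as `build_headers(cookie_from_env())` would then produce a dict that can be passed as `headers=` to `requests.get`. Reading the cookie from the environment keeps a personal session value out of the source file.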

# -*- coding: utf-8 -*-
"""
Created on Mon Oct  4 00:06:08 2021

@author: saiban
"""
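# Prof. Song Tian's code can no longer scrape Taobao because its anti-scraping measures were upgraded.
# We need to replace the cookie (and the referer, if one is sent) in the headers with values from our own Taobao session.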
import requests
import re
def getHtmlText(url):  # Fetch the page and return its text
    try:
        header = {
    'authority': 's.taobao.com',
    'cache-control': 'max-age=0',
    'sec-ch-ua': '"Chromium";v="94", "Google Chrome";v="94", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-user': '?1',
    'sec-fetch-dest': 'document',
    'accept-language': 'zh-CN,zh;q=0.9',
    # Replace the cookie below with the one copied from your own logged-in browser session
    'cookie': 'cna=w7y7GYch4kMCAasijBL1tcnw; xlly_s=1; t=d77b430a77fdf76d9c17f2806b57c2ff; hng=CN%7Czh-CN%7CCNY%7C156; thw=cn; _m_h5_tk=5ccf9d80baa24976d1bd97719fb7d377_1633327797887; _m_h5_tk_enc=2d2f907d5f3a86f04dbbf2aa73897ecd; _samesite_flag_=true; cookie2=16bd808d5d4c077a1e5f8e3abc00787e; _tb_token_=e3e5e0fd3333d; sgcookie=E1004cVmZPbUKd%2Bo2Y6ewiI7lmCD2rFerQ9K0Rx3PgQSSoYp%2FWW8LOvfo4oThh7eNLIEFm5uGhkQ9IUsgsWnv4%2BYBvCt6Z2xxPQJp498ChIaTCg%3D; unb=2202345247497; uc3=nk2=F5RBx%2BGr84TAocRa&lg2=URm48syIIVrSKA%3D%3D&vt3=F8dCujaCTG0Yn%2BEGbMY%3D&id2=UUphyItuGYNeDyMxrA%3D%3D; csg=fa72b214; lgc=tb4216148421; cancelledSubSites=empty; cookie17=UUphyItuGYNeDyMxrA%3D%3D; dnk=tb4216148421; skt=a19b5b0102a7b5ee; existShop=MTYzMzM1MjE0MQ%3D%3D; uc4=nk4=0%40FY4KoqHi383HYtpSM0RDmlOwk4iA8Tg%3D&id4=0%40U2grE1hEVww3EVoATgMbl4PMiyTEeIZt; tracknick=tb4216148421; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; sg=17e; _nk_=tb4216148421; cookie1=W8743JHOqTkZp4234GIqb8W2j3pRPRi2%2Ftn1Y16wf2Y%3D; enc=lEu3iTdRRzo2bKJ%2FSRTJ7W3KJmqkZoqJ8qTWcN7Fqxv4oVm4619kntnz84TzJb6SnF8AjFC43wovrgqFDVvISLE2T0wQC8D4h3ZzkSjIpSs%3D; mt=ci=0_1; uc1=existShop=false&pas=0&cookie21=Vq8l%2BKCLjhZM&cookie16=UtASsssmPlP%2Ff1IHDsDaPRu%2BPw%3D%3D&cookie14=Uoe3dP4mSshcOw%3D%3D&cookie15=WqG3DMC9VAQiUQ%3D%3D; JSESSIONID=338CCA02A39CF025F71F1BF78290B3CC; tfstk=c_YCB2ijKJ2I0Vnz8BGaUpf5nnb5aLH1sD6pO3g3IVJs9xRcBsmQutMG732Gz_C1.; l=eBxAgJEmghiX7hpyBO5Cnurza77OFIRbzPVzaNbMiInca6OA9FiEjNCLQg5JWdtjgt5xNFtzh0NBGRE6SuzLRxGjL77kRs5mpI96Re1..; isg=BD09yvnBAAYXt6RpoStY_zgXTJk32nEsV3r_rv-C1hTDNl9owymX_Iuk4Gpws4nk',
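}
        r = requests.get(url, headers=header, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        print("Request failed")
        return ""  # Return an empty string so the caller can still parse safely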
def parsePage(ilt, html):  # Parse one result page and append [price, title] pairs to ilt
    try:
        # Capture groups return only the quoted value, so eval() and split(':') are not needed
        plt = re.findall(r'"view_price":"([\d.]*)"', html)
        tlt = re.findall(r'"raw_title":"(.*?)"', html)
        for price, title in zip(plt, tlt):
            ilt.append([price, title])
    except TypeError:
        # html is not a string, for example None after a failed fetch
        print("Parsing failed")
        
def printGoodList(ilt):  # Print the collected results
    tplt = '{:4}\t{:8}\t{:16}'
    print(tplt.format('No.', 'Price', 'Product name'))
    for count, g in enumerate(ilt, start=1):
        print(tplt.format(count, g[0], g[1]))
        
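def main():
    goods = 's书包'  # Search keyword (书包 means "schoolbag")
    depth = 2  # Number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infolist = []
    for i in range(depth):
        try:
            # Each result page holds 44 items; `s` is the item offset of the page
            url = start_url + '&s=' + str(44 * i)
            html = getHtmlText(url)
            parsePage(infolist, html)
        except Exception:
            continue
    printGoodList(infolist)


if __name__ == '__main__':
    main()  # The code above only defines main(); this call actually runs the program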
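The regular-expression step can be tried offline, without a cookie or a network request. The sketch below runs the same two patterns on a made-up string; `SAMPLE` only imitates the `"raw_title"` and `"view_price"` fields the script looks for and is not real Taobao output, and `extract_items` is a hypothetical helper name. It uses capture groups instead of `eval`, so a title that contains a colon is still returned whole.

```python
import re

# Invented sample that imitates the fields the patterns search for
SAMPLE = (
    '{"raw_title":"Canvas schoolbag: large","view_price":"59.00"},'
    '{"raw_title":"Kids backpack","view_price":"128.50"}'
)


def extract_items(html):
    """Return [price, title] pairs found in the given page text."""
    prices = re.findall(r'"view_price":"([\d.]*)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    # zip stops at the shorter list, so a missing field cannot raise IndexError
    return [[price, title] for price, title in zip(prices, titles)]


items = extract_items(SAMPLE)
for number, (price, title) in enumerate(items, start=1):
    print(number, price, title)
```

If the real page returns no matches at all, the usual cause is that the response was a login or verification page rather than search results, which is worth checking before changing the patterns.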

    

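A note on the site titled "Convert curl command syntax to Python requests, Ansible URI, browser fetch, MATLAB, Node.js, R, PHP, Strest, Go, Dart, Java, JSON, Elixir, and Rust code": what we copy from the browser's developer tools (Inspect) is a curl command. This site converts curl syntax into Python, Node.js, PHP, R, Go, Rust, Elixir, Java, MATLAB, Ansible URI, Strest, Dart, JSON, and other formats, which is how the header dict in the script above was produced.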
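To show the idea behind that conversion, here is a minimal sketch that pulls the `-H` headers out of a curl command and returns them as a dict suitable for `requests`. It is not how the converter site is implemented; `headers_from_curl` is a hypothetical helper, the example command is invented, and only the `-H`/`--header` flags are handled.

```python
import shlex


def headers_from_curl(command):
    """Extract the -H/--header values of a curl command into a dict."""
    # Join lines that end with a backslash, as in a command copied from the browser
    command = command.replace("\\\n", " ")
    tokens = shlex.split(command)
    headers = {}
    # Look at each token together with the one that follows it
    for flag, value in zip(tokens, tokens[1:]):
        if flag in ("-H", "--header"):
            name, _, content = value.partition(":")
            headers[name.strip().lower()] = content.strip()
    return headers


example = (
    "curl 'https://s.taobao.com/search?q=test' "
    "-H 'accept-language: zh-CN,zh;q=0.9' "
    "-H 'cookie: a=1; b=2'"
)
print(headers_from_curl(example))
```

The resulting dict has the same shape as the `header` variable in the script, so the cookie copied from your own browser ends up in the `cookie` entry.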

 

Saiban learns web scraping.... and is worn out
