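How to write a crawler that fetches all Taobao product categories along with their key properties, sales properties, and non-key properties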

Taobao enforces rate limits. When crawling data from Taobao, we want to keep its data interface from returning a message like this one:

u'\r\nvar propvalues={"error_response":{"code":7,"msg":"App Call Limited","sub_code":"accesscontrol.limited-by-dynamic-access-count","sub_msg":"This ban will last for 1 more seconds","request_id":"2elclr9dnika"}}'

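The usual fix is to add sleep(15) right after the res.text line in the crawler to slow down the request rate, but crawling that way is far too slow.

In my experiments, Taobao's rate limit does not kick in immediately for a new URL (one whose cid value has changed). So I simply leave out the sleep call, and instead note which cid value was being requested whenever a reply comes back with the App Call Limited message.

We can then submit those URLs by hand to recover the data the crawler failed to fetch.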
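The bookkeeping described above could be sketched roughly as follows. The names `is_rate_limited` and `fetch_or_record` are hypothetical helpers, not part of the crawler in this post, and the check relies only on the `error_response` and `App Call Limited` strings visible in the sample message. The `fetch` argument is a stand-in for whatever function performs the real request.

```python
LIMIT_MARKER = 'App Call Limited'


def is_rate_limited(text):
    """Return True if the reply carries the rate-limit error shown above."""
    return '"error_response"' in text and LIMIT_MARKER in text


def fetch_or_record(cid, fetch, failed_cids):
    """Call fetch(cid); on a rate-limited reply, remember cid and return None."""
    text = fetch(cid)
    if is_rate_limited(text):
        failed_cids.append(cid)
        return None
    return text


# Usage with a stand-in fetch function (no network access involved).
sample = ('\r\nvar propvalues={"error_response":{"code":7,'
          '"msg":"App Call Limited",'
          '"sub_code":"accesscontrol.limited-by-dynamic-access-count"}}')
failed = []
result = fetch_or_record(50010850, lambda cid: sample, failed)
```

After the run, `failed` holds the cid values whose URLs still need to be submitted by hand.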

In this example I originally wanted to crawl all the data in one go, with the top-level category file category-top.csv defined as follows:

0|{"itemcats_get_response":{"item_cats":{"item_cat":[
{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"},
{"cid":120886001,"is_parent":true,"name":"公益","parent_cid":0,"status":"normal"},
{"cid":98,"is_parent":true,"name":"包装","parent_cid":0,"status":"normal"},
{"cid":120950002,"is_parent":true,"name":"天猫点券","parent_cid":0,"status":"normal"},
{"cid":50802001,"is_parent":true,"name":"数字阅读","parent_cid":0,"status":"normal"},
{"cid":120894001,"is_parent":true,"name":"淘女郎","parent_cid":0,"status":"normal"},
{"cid":50023722,"is_parent":true,"name":"隐形眼镜/护理液","parent_cid":0,"status":"normal"},
{"cid":50026555,"is_parent":true,"name":"购物提货券","parent_cid":0,"status":"normal"},
{"cid":50026523,"is_parent":true,"name":"休闲娱乐","parent_cid":0,"status":"normal"},
{"cid":50008075,"is_parent":true,"name":"餐饮美食卡券","parent_cid":0,"status":"normal"},
{"cid":50019095,"is_parent":true,"name":"消费卡","parent_cid":0,"status":"normal"},
{"cid":50014927,"is_parent":true,"name":"教育培训","parent_cid":0,"status":"normal"},
{"cid":26,"is_parent":true,"name":"汽车/用品/配件/改装","parent_cid":0,"status":"normal"},
{"cid":50020808,"is_parent":true,"name":"家居饰品","parent_cid":0,"status":"normal"},
{"cid":50020857,"is_parent":true,"name":"特色手工艺","parent_cid":0,"status":"normal"},
{"cid":50025707,"is_parent":true,"name":"度假线路/签证送关/旅游服务","parent_cid":0,"status":"normal"},
{"cid":50024099,"is_parent":true,"name":"电子元器件市场","parent_cid":0,"status":"normal"},
{"cid":30,"is_parent":true,"name":"男装","parent_cid":0,"status":"normal"},
{"cid":50008164,"is_parent":true,"name":"住宅家具","parent_cid":0,"status":"normal"},
{"cid":50020611,"is_parent":true,"name":"商业/办公家具","parent_cid":0,"status":"normal"},
{"cid":50010788,"is_parent":true,"name":"彩妆/香水/美妆工具","parent_cid":0,"status":"normal"},
{"cid":1801,"is_parent":true,"name":"美容护肤/美体/精油","parent_cid":0,"status":"normal"},
{"cid":50023282,"is_parent":true,"name":"美发护发/假发","parent_cid":0,"status":"normal"},
{"cid":1512,"is_parent":false,"name":"手机","parent_cid":0,"status":"normal"},
{"cid":14,"is_parent":true,"name":"数码相机/单反相机/摄像机","parent_cid":0,"status":"normal"},
{"cid":1201,"is_parent":false,"name":"MP3/MP4/iPod/录音笔","parent_cid":0,"status":"normal"},
{"cid":1101,"is_parent":false,"name":"笔记本电脑","parent_cid":0,"status":"normal"},
{"cid":50019780,"is_parent":false,"name":"平板电脑/MID","parent_cid":0,"status":"normal"},
{"cid":50018222,"is_parent":true,"name":"DIY电脑","parent_cid":0,"status":"normal"},
{"cid":11,"is_parent":true,"name":"电脑硬件/显示器/电脑周边","parent_cid":0,"status":"normal"},
{"cid":50018264,"is_parent":true,"name":"网络设备/网络相关","parent_cid":0,"status":"normal"},
{"cid":50008090,"is_parent":true,"name":"3C数码配件","parent_cid":0,"status":"normal"},
{"cid":50012164,"is_parent":true,"name":"闪存卡/U盘/存储/移动硬盘","parent_cid":0,"status":"normal"},
{"cid":50007218,"is_parent":true,"name":"办公设备/耗材/相关服务","parent_cid":0,"status":"normal"},
{"cid":50018004,"is_parent":true,"name":"电子词典/电纸书/文化用品","parent_cid":0,"status":"normal"},
{"cid":20,"is_parent":true,"name":"电玩/配件/游戏/攻略","parent_cid":0,"status":"normal"},
{"cid":50022703,"is_parent":true,"name":"大家电","parent_cid":0,"status":"normal"},
{"cid":50011972,"is_parent":true,"name":"影音电器","parent_cid":0,"status":"normal"},
{"cid":50012100,"is_parent":true,"name":"生活电器","parent_cid":0,"status":"normal"},
{"cid":50012082,"is_parent":true,"name":"厨房电器","parent_cid":0,"status":"normal"},
{"cid":50002768,"is_parent":true,"name":"个人护理/保健/按摩器材","parent_cid":0,"status":"normal"},
{"cid":27,"is_parent":true,"name":"家装主材","parent_cid":0,"status":"normal"},
{"cid":124912001,"is_parent":false,"name":"合约机","parent_cid":0,"status":"normal"},
{"cid":50020332,"is_parent":true,"name":"基础建材","parent_cid":0,"status":"normal"},
{"cid":50020485,"is_parent":true,"name":"五金/工具","parent_cid":0,"status":"normal"},
{"cid":50026535,"is_parent":true,"name":"医疗及健康服务","parent_cid":0,"status":"normal"},
{"cid":50020579,"is_parent":true,"name":"电子/电工","parent_cid":0,"status":"normal"},
{"cid":50050471,"is_parent":true,"name":"婚庆/摄影/摄像服务","parent_cid":0,"status":"normal"},
{"cid":50011949,"is_parent":true,"name":"特价酒店/特色客栈/公寓旅馆","parent_cid":0,"status":"normal"},
{"cid":21,"is_parent":true,"name":"居家日用","parent_cid":0,"status":"normal"},
{"cid":50016349,"is_parent":true,"name":"厨房/烹饪用具","parent_cid":0,"status":"normal"},
{"cid":50016348,"is_parent":true,"name":"家庭/个人清洁工具","parent_cid":0,"status":"normal"},
{"cid":50008163,"is_parent":true,"name":"床上用品","parent_cid":0,"status":"normal"},
{"cid":35,"is_parent":true,"name":"奶粉/辅食/营养品/零食","parent_cid":0,"status":"normal"},
{"cid":50014812,"is_parent":true,"name":"尿片/洗护/喂哺/推车床","parent_cid":0,"status":"normal"},
{"cid":50022517,"is_parent":true,"name":"孕妇装/孕产妇用品/营养","parent_cid":0,"status":"normal"},
{"cid":50008165,"is_parent":true,"name":"童装/婴儿装/亲子装","parent_cid":0,"status":"normal"},
{"cid":50020275,"is_parent":true,"name":"传统滋补营养品","parent_cid":0,"status":"normal"},
{"cid":50002766,"is_parent":true,"name":"零食/坚果/特产","parent_cid":0,"status":"normal"},
{"cid":50016422,"is_parent":true,"name":"粮油米面/南北干货/调味品","parent_cid":0,"status":"normal"},
{"cid":121380001,"is_parent":true,"name":"国内机票/国际机票/增值服务","parent_cid":0,"status":"normal"},
{"cid":121536003,"is_parent":true,"name":"数字娱乐","parent_cid":0,"status":"normal"},
{"cid":121536007,"is_parent":true,"name":"全球购代购市场","parent_cid":0,"status":"normal"},
{"cid":40,"is_parent":true,"name":"腾讯QQ专区","parent_cid":0,"status":"normal"},
{"cid":50010728,"is_parent":true,"name":"运动/瑜伽/健身/球迷用品","parent_cid":0,"status":"normal"},
{"cid":50013886,"is_parent":true,"name":"户外/登山/野营/旅行用品","parent_cid":0,"status":"normal"},
{"cid":50011699,"is_parent":true,"name":"运动服/休闲服装","parent_cid":0,"status":"normal"},
{"cid":25,"is_parent":true,"name":"玩具/童车/益智/积木/模型","parent_cid":0,"status":"normal"},
{"cid":50011665,"is_parent":true,"name":"网游装备/游戏币/帐号/代练","parent_cid":0,"status":"normal"},
{"cid":50008907,"is_parent":true,"name":"手机号码/套餐/增值业务","parent_cid":0,"status":"normal"},
{"cid":99,"is_parent":true,"name":"网络游戏点卡","parent_cid":0,"status":"normal"},
{"cid":23,"is_parent":true,"name":"古董/邮币/字画/收藏","parent_cid":0,"status":"normal"},
{"cid":50007216,"is_parent":true,"name":"鲜花速递/花卉仿真/绿植园艺","parent_cid":0,"status":"normal"},
{"cid":50004958,"is_parent":true,"name":"移动/联通/电信充值中心","parent_cid":0,"status":"normal"},
{"cid":50011740,"is_parent":true,"name":"流行男鞋","parent_cid":0,"status":"normal"},
{"cid":50006843,"is_parent":true,"name":"女鞋","parent_cid":0,"status":"normal"},
{"cid":50006842,"is_parent":true,"name":"箱包皮具/热销女包/男包","parent_cid":0,"status":"normal"},
{"cid":1625,"is_parent":true,"name":"女士内衣/男士内衣/家居服","parent_cid":0,"status":"normal"},
{"cid":50010404,"is_parent":true,"name":"服饰配件/皮带/帽子/围巾","parent_cid":0,"status":"normal"},
{"cid":50011397,"is_parent":true,"name":"珠宝/钻石/翡翠/黄金","parent_cid":0,"status":"normal"},
{"cid":28,"is_parent":true,"name":"ZIPPO/瑞士军刀/眼镜","parent_cid":0,"status":"normal"},
{"cid":33,"is_parent":true,"name":"书/杂志/报纸","parent_cid":0,"status":"normal"},
{"cid":34,"is_parent":true,"name":"音乐/影视/明星/音像","parent_cid":0,"status":"normal"},
{"cid":50017300,"is_parent":true,"name":"乐器/吉他/钢琴/配件","parent_cid":0,"status":"normal"},
{"cid":29,"is_parent":true,"name":"宠物/宠物食品及用品","parent_cid":0,"status":"normal"},
{"cid":2813,"is_parent":true,"name":"成人用品/情趣用品","parent_cid":0,"status":"normal"},
{"cid":50012029,"is_parent":true,"name":"运动鞋new","parent_cid":0,"status":"normal"},
{"cid":50013864,"is_parent":true,"name":"饰品/流行首饰/时尚饰品新","parent_cid":0,"status":"normal"},
{"cid":50014811,"is_parent":true,"name":"网店/网络服务/软件","parent_cid":0,"status":"normal"},
{"cid":50023724,"is_parent":true,"name":"其他","parent_cid":0,"status":"normal"},
{"cid":50017652,"is_parent":true,"name":"TP服务商大类","parent_cid":0,"status":"normal"},
{"cid":50023575,"is_parent":true,"name":"房产/租房/新房/二手房/委托服务","parent_cid":0,"status":"normal"},
{"cid":50023717,"is_parent":true,"name":"OTC药品/医疗器械/计生用品","parent_cid":0,"status":"normal"},
{"cid":50023878,"is_parent":true,"name":"自用闲置转让","parent_cid":0,"status":"normal"},
{"cid":50024186,"is_parent":true,"name":"保险","parent_cid":0,"status":"normal"},
{"cid":50024612,"is_parent":true,"name":"阿里健康送药服务","parent_cid":0,"status":"normal"},
{"cid":50024971,"is_parent":true,"name":"新车/二手车","parent_cid":0,"status":"normal"},
{"cid":50025004,"is_parent":true,"name":"个性定制/设计服务/DIY","parent_cid":0,"status":"normal"},
{"cid":50025110,"is_parent":true,"name":"电影/演出/体育赛事","parent_cid":0,"status":"normal"},
{"cid":50025618,"is_parent":true,"name":"理财","parent_cid":0,"status":"normal"},
{"cid":50025705,"is_parent":true,"name":"洗护清洁剂/卫生巾/纸/香薰","parent_cid":0,"status":"normal"},
{"cid":50025968,"is_parent":true,"name":"司法拍卖拍品专用","parent_cid":0,"status":"normal"},
{"cid":50026316,"is_parent":true,"name":"咖啡/麦片/冲饮","parent_cid":0,"status":"normal"},
{"cid":50023804,"is_parent":true,"name":"装修设计/施工/监理","parent_cid":0,"status":"normal"},
{"cid":50026800,"is_parent":true,"name":"保健食品/膳食营养补充食品","parent_cid":0,"status":"normal"},
{"cid":50050359,"is_parent":true,"name":"水产肉类/新鲜蔬果/熟食","parent_cid":0,"status":"normal"},
{"cid":50074001,"is_parent":true,"name":"摩托车/装备/配件","parent_cid":0,"status":"normal"},
{"cid":50158001,"is_parent":true,"name":"网络店铺代金/优惠券","parent_cid":0,"status":"normal"},
{"cid":50230002,"is_parent":true,"name":"服务商品","parent_cid":0,"status":"normal"},
{"cid":50454031,"is_parent":true,"name":"景点门票/演艺演出/周边游","parent_cid":0,"status":"normal"},
{"cid":50468001,"is_parent":true,"name":"手表","parent_cid":0,"status":"normal"},
{"cid":50510002,"is_parent":true,"name":"运动包/户外包/配件","parent_cid":0,"status":"normal"},
{"cid":50008141,"is_parent":true,"name":"酒类","parent_cid":0,"status":"normal"},
{"cid":50734010,"is_parent":true,"name":"资产","parent_cid":0,"status":"normal"},
{"cid":50025111,"is_parent":true,"name":"本地化生活服务","parent_cid":0,"status":"normal"},
{"cid":121938001,"is_parent":false,"name":"淘点点预定点菜","parent_cid":0,"status":"normal"},
{"cid":121940001,"is_parent":false,"name":"淘点点现金券","parent_cid":0,"status":"normal"},
{"cid":122650005,"is_parent":true,"name":"童鞋/婴儿鞋/亲子鞋","parent_cid":0,"status":"normal"},
{"cid":122684003,"is_parent":true,"name":"自行车/骑行装备/零配件","parent_cid":0,"status":"normal"},
{"cid":122718004,"is_parent":true,"name":"家庭保健","parent_cid":0,"status":"normal"},
{"cid":122852001,"is_parent":true,"name":"居家布艺","parent_cid":0,"status":"normal"},
{"cid":122950001,"is_parent":true,"name":"节庆用品/礼品","parent_cid":0,"status":"normal"},
{"cid":122952001,"is_parent":true,"name":"餐饮具","parent_cid":0,"status":"normal"},
{"cid":122928002,"is_parent":true,"name":"收纳整理","parent_cid":0,"status":"normal"},
{"cid":122966004,"is_parent":true,"name":"处方药","parent_cid":0,"status":"normal"},
{"cid":123536002,"is_parent":true,"name":"阿里通信专属类目","parent_cid":0,"status":"normal"},
{"cid":123500005,"is_parent":true,"name":"资产(政府类专用)","parent_cid":0,"status":"normal"},
{"cid":123690003,"is_parent":true,"name":"精制中药材","parent_cid":0,"status":"normal"},
{"cid":124024001,"is_parent":true,"name":"农业生产资料(农村淘宝专用)","parent_cid":0,"status":"normal"},
{"cid":124044001,"is_parent":true,"name":"品牌台机/品牌一体机/服务器","parent_cid":0,"status":"normal"},
{"cid":124050001,"is_parent":true,"name":"全屋定制","parent_cid":0,"status":"normal"},
{"cid":124242008,"is_parent":true,"name":"智能设备","parent_cid":0,"status":"normal"},
{"cid":124354002,"is_parent":true,"name":"电动车/配件/交通工具","parent_cid":0,"status":"normal"},
{"cid":124466001,"is_parent":true,"name":"农用物资","parent_cid":0,"status":"normal"},
{"cid":124468001,"is_parent":true,"name":"农机/农具/农膜","parent_cid":0,"status":"normal"},
{"cid":124470001,"is_parent":true,"name":"畜牧/养殖物资","parent_cid":0,"status":"normal"},
{"cid":124470006,"is_parent":true,"name":"整车(经销商)","parent_cid":0,"status":"normal"},
{"cid":124484008,"is_parent":true,"name":"模玩/动漫/周边/cos/桌游","parent_cid":0,"status":"normal"},
{"cid":124458005,"is_parent":true,"name":"茶","parent_cid":0,"status":"normal"},
{"cid":124568010,"is_parent":true,"name":"室内设计师","parent_cid":0,"status":"normal"},
{"cid":124750013,"is_parent":true,"name":"俪人购(俪人购专用)","parent_cid":0,"status":"normal"},
{"cid":124698018,"is_parent":true,"name":"装修服务","parent_cid":0,"status":"normal"},
{"cid":124844002,"is_parent":true,"name":"拍卖会专用","parent_cid":0,"status":"normal"},
{"cid":124868003,"is_parent":true,"name":"盒马","parent_cid":0,"status":"normal"},
{"cid":124852003,"is_parent":true,"name":"二手数码","parent_cid":0,"status":"normal"},
{"cid":125102006,"is_parent":true,"name":"到家业务","parent_cid":0,"status":"normal"},
{"cid":125406001,"is_parent":true,"name":"享淘卡","parent_cid":0,"status":"normal"},
{"cid":126040001,"is_parent":true,"name":"橙运","parent_cid":0,"status":"normal"},
{"cid":126252002,"is_parent":true,"name":"门店O2O","parent_cid":0,"status":"normal"},
{"cid":126488005,"is_parent":true,"name":"天猫零售O2O","parent_cid":0,"status":"normal"},
{"cid":126488008,"is_parent":true,"name":"阿里健康B2B平台","parent_cid":0,"status":"normal"},
{"cid":126602002,"is_parent":true,"name":"生活娱乐充值","parent_cid":0,"status":"normal"},
{"cid":126700003,"is_parent":true,"name":"家装灯饰光源","parent_cid":0,"status":"normal"},
{"cid":126762001,"is_parent":true,"name":"美容美体仪器","parent_cid":0,"status":"normal"},
{"cid":127076003,"is_parent":true,"name":"平台充值活动(仅内部店铺)","parent_cid":0,"status":"normal"},
{"cid":127492006,"is_parent":true,"name":"标准件/零部件/工业耗材","parent_cid":0,"status":"normal"},
{"cid":127484003,"is_parent":true,"name":"润滑/胶粘/试剂/实验室耗材","parent_cid":0,"status":"normal"},
{"cid":127508003,"is_parent":true,"name":"机械设备","parent_cid":0,"status":"normal"},
{"cid":127458007,"is_parent":true,"name":"搬运/仓储/物流设备","parent_cid":0,"status":"normal"},
{"cid":127442006,"is_parent":true,"name":"纺织面料/辅料/配套","parent_cid":0,"status":"normal"},
{"cid":127450004,"is_parent":true,"name":"金属材料及制品","parent_cid":0,"status":"normal"},
{"cid":127452002,"is_parent":true,"name":"橡塑材料及制品","parent_cid":0,"status":"normal"},
{"cid":127588002,"is_parent":true,"name":"阿里云云市场","parent_cid":0,"status":"normal"},
{"cid":127878006,"is_parent":true,"name":"新制造","parent_cid":0,"status":"normal"},
{"cid":127924022,"is_parent":true,"name":"零售通","parent_cid":0,"status":"normal"}
]},"request_id":"s82mq3r0hshh"}}|0

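If category-top.csv is defined like that, the crawler will run for a very long time. Along the way it may receive many rate-limit messages, hit fatal errors while parsing incomplete replies, or run into other network errors, and a crawl like that is hard to keep under control. So let's not be greedy: we crawl the 165 top-level categories above in 165 separate runs. In other words, we redefine the contents of category-top.csv as shown below. Note that the file must be laid out as 3 lines.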

[myth@contoso ~]$ cat /home/myth/taobao/category-top.csv
0|{"itemcats_get_response":{"item_cats":{"item_cat":[
{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"},
]},"request_id":"s82mq3r0hshh"}}|0
[myth@contoso ~]$
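Producing those 3-line files by hand 165 times is tedious, so here is a minimal sketch of how the full listing could be split automatically. It is written for Python 3, and `split_top_level` is a hypothetical helper of my own naming. It assumes the full text has the `0|{...}|0` shape shown earlier and that no category name contains a `|` character.

```python
import json


def split_top_level(full_text):
    """Yield one 3-line category-top.csv body per top-level category."""
    prefix, rest = full_text.split('|', 1)
    body, suffix = rest.rsplit('|', 1)
    response = json.loads(body)['itemcats_get_response']
    request_id = response.get('request_id', '')
    for item in response['item_cats']['item_cat']:
        # sort_keys reproduces the cid, is_parent, name, ... order of the listing.
        entry = json.dumps(item, ensure_ascii=False, sort_keys=True,
                           separators=(',', ':'))
        yield '\n'.join([
            prefix + '|{"itemcats_get_response":{"item_cats":{"item_cat":[',
            entry + ',',
            ']},"request_id":"%s"}}|%s' % (request_id, suffix.strip()),
        ]) + '\n'


# Two entries copied from the listing above stand in for the full file.
full = ('0|{"itemcats_get_response":{"item_cats":{"item_cat":[\n'
        '{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"},\n'
        '{"cid":98,"is_parent":true,"name":"包装","parent_cid":0,"status":"normal"}\n'
        ']},"request_id":"s82mq3r0hshh"}}|0')
bodies = list(split_top_level(full))
```

Each element of `bodies` can be written to category-top.csv in turn before a run.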

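If you debug the crawler, you will see that for the top-level category 女装/女士精品 it retrieves the following data from Taobao's servers, in this recursive order: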

Top-level category:
{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"}

女装/女士精品:
[{u'status': u'normal', u'parent_cid': 16, u'name': u'连衣裙', u'is_parent': False, u'cid': 50010850}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'T恤', u'is_parent': False, u'cid': 50000671}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'衬衫', u'is_parent': False, u'cid': 162104}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'裤子', u'is_parent': True, u'cid': 1622}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'牛仔裤', u'is_parent': False, u'cid': 162205}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'半身裙', u'is_parent': False, u'cid': 1623}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'马夹', u'is_parent': False, u'cid': 50013196}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'蕾丝衫/雪纺衫', u'is_parent': False, u'cid': 162116}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'毛针织衫', u'is_parent': False, u'cid': 50000697}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'短外套', u'is_parent': False, u'cid': 50011277}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'西装', u'is_parent': False, u'cid': 50008897}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'卫衣/绒衫', u'is_parent': False, u'cid': 50008898}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'毛衣', u'is_parent': False, u'cid': 162103}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'风衣', u'is_parent': False, u'cid': 50008901}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'毛呢外套', u'is_parent': False, u'cid': 50013194}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'棉衣/棉服', u'is_parent': False, u'cid': 50008900}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'羽绒服', u'is_parent': False, u'cid': 50008899}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'皮衣', u'is_parent': False, u'cid': 50008904}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'皮草', u'is_parent': False, u'cid': 50008905}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'中老年女装', u'is_parent': False, u'cid': 50000852}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'大码女装', u'is_parent': False, u'cid': 1629}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'套装/学生校服/工作制服', u'is_parent': True, u'cid': 1624}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'婚纱/旗袍/礼服', u'is_parent': True, u'cid': 50011404}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'唐装/民族服装/舞台服装', u'is_parent': True, u'cid': 50008906}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'背心吊带', u'is_parent': False, u'cid': 121412004}, 
{u'status': u'normal', u'parent_cid': 16, u'name': u'抹胸', u'is_parent': False, u'cid': 121434004}]

裤子:
[{u'status': u'normal', u'parent_cid': 1622, u'name': u'休闲裤', u'is_parent': False, u'cid': 162201}, 
{u'status': u'normal', u'parent_cid': 1622, u'name': u'西装裤/正装裤', u'is_parent': False, u'cid': 50022566}, 
{u'status': u'normal', u'parent_cid': 1622, u'name': u'打底裤', u'is_parent': False, u'cid': 50007068}, 
{u'status': u'normal', u'parent_cid': 1622, u'name': u'棉裤/羽绒裤', u'is_parent': False, u'cid': 50026651}]

套装/学生校服/工作制服:
[{u'status': u'normal', u'parent_cid': 1624, u'name': u'学生校服', u'is_parent': False, u'cid': 50008903}, 
{u'status': u'normal', u'parent_cid': 1624, u'name': u'职业女裙套装', u'is_parent': False, u'cid': 162401}, 
{u'status': u'normal', u'parent_cid': 1624, u'name': u'职业女裤套装', u'is_parent': False, u'cid': 162402}, 
{u'status': u'normal', u'parent_cid': 1624, u'name': u'休闲运动套装', u'is_parent': False, u'cid': 162404}, 
{u'status': u'normal', u'parent_cid': 1624, u'name': u'其它制服/套装', u'is_parent': False, u'cid': 162403}, 
{u'status': u'normal', u'parent_cid': 1624, u'name': u'医护制服', u'is_parent': False, u'cid': 50011411}, 
{u'status': u'normal', u'parent_cid': 1624, u'name': u'酒店工作制服', u'is_parent': False, u'cid': 50011412}, 
{u'status': u'normal', u'parent_cid': 1624, u'name': u'时尚套装', u'is_parent': False, u'cid': 123216004}]

婚纱/旗袍/礼服:
[{"cid":162701,"is_parent":false,"name":"婚纱","parent_cid":50011404,"status":"normal"},
{"cid":50005065,"is_parent":false,"name":"旗袍","parent_cid":50011404,"status":"normal"},
{"cid":162702,"is_parent":false,"name":"礼服\/晚装","parent_cid":50011404,"status":"normal"}]

唐装/民族服装/舞台服装:
[{u'status': u'normal', u'parent_cid': 50008906, u'name': u'民族服装/舞台装', u'is_parent': False, u'cid': 162703},
{u'status': u'normal', u'parent_cid': 50008906, u'name': u'唐装/中式服装', u'is_parent': True, u'cid': 1636}]

唐装/中式服装:
[{u'status': u'normal', u'parent_cid': 1636, u'name': u'上衣', u'is_parent': False, u'cid': 50003509}, 
{u'status': u'normal', u'parent_cid': 1636, u'name': u'裤子', u'is_parent': False, u'cid': 50003510}, 
{u'status': u'normal', u'parent_cid': 1636, u'name': u'裙子', u'is_parent': False, u'cid': 50003511}]
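The order of the listings above is a depth-first walk: a leaf category is handled immediately, while a parent category has its children fetched and walked before its next sibling is touched. Here is a small sketch of that order alone, with no network access. `walk`, `TREE`, and the two callbacks are hypothetical names for illustration, and the tree is a trimmed copy of the data above.

```python
def walk(category, fetch_children, visit_leaf):
    """Visit leaf categories depth-first, descending into parents as they appear."""
    if not category['is_parent']:
        visit_leaf(category)
        return
    for child in fetch_children(category['cid']):
        walk(child, fetch_children, visit_leaf)


# A trimmed copy of the category tree, keyed by parent cid.
TREE = {
    16: [
        {'cid': 50010850, 'name': u'连衣裙', 'is_parent': False},
        {'cid': 1622, 'name': u'裤子', 'is_parent': True},
        {'cid': 162205, 'name': u'牛仔裤', 'is_parent': False},
    ],
    1622: [
        {'cid': 162201, 'name': u'休闲裤', 'is_parent': False},
        {'cid': 50022566, 'name': u'西装裤/正装裤', 'is_parent': False},
    ],
}

visited = []
top = {'cid': 16, 'name': u'女装/女士精品', 'is_parent': True}
walk(top, lambda cid: TREE[cid], lambda cat: visited.append(cat['cid']))
```

In this trimmed tree the children of 裤子 are visited before 牛仔裤, because 裤子 comes first among its siblings.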

The crawler runs on Linux, though you can of course run it on Windows as well. The Python version is:

[myth@contoso ~]$ python --version
Python 2.7.5
[myth@contoso ~]$

The crawler code is as follows:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import requests
import sys
import json

reload(sys)
sys.setdefaultencoding('utf-8')
session = requests.Session()

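# Read the 3-line category-top.csv and join it into one "cid|json|id" string.
# Stripping ',' and '\n' from each line drops the trailing comma after the
# single category entry, so the joined text is valid JSON.
with open('category-top.csv', 'r') as f:
    data = [line.strip(',\n') for line in f]
cidStr = ''.join(data)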

def createCidSelect(cidStr):
    # cidStr has the form "<cid>|<JSON reply>|<parent id>".
    cidArr = cidStr.split("|")
    cid = cidArr[0]
    if '' == cid:
        return False

    cidArr = json.loads(cidArr[1])['itemcats_get_response']['item_cats']['item_cat']
    list1 = list()
    # Append every category whose status is "normal" to category-all.csv.
    with open('category-all.csv', 'a') as out:
        for item in cidArr:
            if item['status'] == 'normal':
                out.write('{0},{1},{2},{3},{4};\n'.format(item['status'], item['parent_cid'], item['name'], int(item['is_parent']), item['cid']))
                list1.append(item)

    # Then descend into each category that was just written.
    parentId = cid
    for item in list1:
        childCidList(item, parentId)

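def childCidList(item, parentId):
    # Leaf categories go straight to loadScript(); parent categories have
    # their children fetched and handed back to createCidSelect().
    cid = 0
    try:
        cid = item['cid']
        if not item['is_parent']:
            loadScript(cid)
            return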
        url = 'http://open.taobao.com/apitools/ajax_props.do?_tb_token_=3365b5d353fed&cid='+ str(cid) +'&act=childCid&restBool=false&ua=090%23qCQXNTXpXOVXPvi0XXXXXQkOIr77HU0hzDlo3e5rAGB2zoPlhnG5%2ByiUIr7ejGmnfjLiXXfbC7NK%2BvQXaKZdRva2jrbsXmLiXXfbC7NK24QXrpehnTFfoVM3eeu8iGliXX5dtRJXExTEMiwtXvXQsVW8ZxDiXXF2mp%2F9vQjBXvXzbc9P9lqAxgLAq6anQgwoWawOSBLiXajeGXriHnepAFhnPIj3Ho39h9kvXP73IzgeG%2FXXHYVmV6hnD6u3HoPsH4QXaPjPiq2d7D7bPvQXiHDow1Qg%2FrliXXfMhTQ%2F%2BvQXaKZWvPXMjrY0VBViXi2oemXumVM3oMavtXFjQ7%2Ba2T%3D%3D'
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Connection": "keep-alive",
            "Cookie": "t=bbb6c14edb1c8d0e65996158979a8027; cna=I0NFEysmiWkCAQ6cyYxll3kh; tg=0; lgc=mycarting; tracknick=mycarting; mt=np=; v=0; cookie2=3253fc64aa52d22785ce4a3f5af722d2; _tb_token_=3365b5d353fed; dnk=mycarting; JSESSIONID=B94DBA871F64C03C095830C238F218A9; uc1=cookie14=UoTeNzVRvRsWPg%3D%3D&lng=zh_CN&cookie16=W5iHLLyFPlMGbLDwA%2BdvAGZqLg%3D%3D&existShop=false&cookie21=VFC%2FuZ9ajCbF99I65Qm9gQ%3D%3D&tag=8&cookie15=Vq8l%2BKCLz3%2F65A%3D%3D&pas=0; uc3=nk2=DkmnuVZqM291&id2=UU8OcO9lI45Clg%3D%3D&vt3=F8dBzr2Fa6i4%2Fc9OIz8%3D&lg2=UtASsssmOIJ0bQ%3D%3D; existShop=MTUyOTIxODk0Nw%3D%3D; sg=g43; csg=c98ee665; cookie1=VTrg90saGzeX9ovigm2mqTr%2Fu6w0vPNLI3IgZ0vhu9E%3D; unb=2761447894; skt=ed11d5f665cdb5a8; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; _nk_=mycarting; cookie17=UU8OcO9lI45Clg%3D%3D; apushdf188ec636caeab174aad0f3441beb09=%7B%22ts%22%3A1529221047670%2C%22parentId%22%3A1529220867224%7D; isg=BP7-BDyyKY-LxH2pSg1zscaeTx2Al8izHjVx9qgGccE8S58lEM-zyUVtxx-H87rR",
            "Host": "open.taobao.com",
            "Referer": "http://open.taobao.com/apitools/apiPropTools.htm?spm=0.0.0.0.mlPbbQ",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36",
        }
        res = session.get(url, headers=headers)
        # A successful reply contains the item_cat list. Anything else
        # (for example "App Call Limited") is printed together with its cid
        # so the URL can be resubmitted by hand later.
        if '{"itemcats_get_response":{"item_cats":{"item_cat":' in res.text:
            cidStr = str(cid) + '|' + res.text + '|' + str(parentId)
            createCidSelect(cidStr)
        else:
            print 'childCidList: ' + str(cid) + ' : ' + res.text
    except Exception as err:
        print "cid : " + str(cid)
        print "parentId : " + str(parentId)
        print err

def loadScript(cid):
    try:
        url = 'http://open.taobao.com/apitools/ajax_props.do?_tb_token_=3365b5d353fed&act=props&cid='+ str(cid) +'&restBool=false&ua=090%23qCQXc4XpXODXPXi0XXXXXQkOIr77HUR5flY73eg3AGB3fzQocPf5Aw1OIruEk0Rs24QXQczccXFFoVM3VUVTihPz9JPWqaLiXXB%2B0ydC24QXrpec2XdDoVM3VUpKijLiXXB%2B0ydC24QXrpec2vzsoVM3ebpQinDiXXF2mp%2F9vQjBXvXUM%2Ben9l8BvNoGriLiXajeGXrfHnepFehnPIj3Ho39h9kvXP73IzgeG%2FXXHYVmV6hnD6u3HoPsH4QXaOXTsEIXgwYSPvQXit2CqnY8PmLiXXB%2B0ydC3vQXiPR22amsXvXqzwE6XkFGOYnqq4QXius%2BSbQ%3D'
        headers = {
            "Accept": "*/*",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "en-US,en;q=0.9",
            "Connection": "keep-alive",
            "Cookie": "t=bbb6c14edb1c8d0e65996158979a8027; cna=I0NFEysmiWkCAQ6cyYxll3kh; tg=0; lgc=mycarting; tracknick=mycarting; mt=np=; v=0; cookie2=3253fc64aa52d22785ce4a3f5af722d2; _tb_token_=3365b5d353fed; dnk=mycarting; JSESSIONID=B94DBA871F64C03C095830C238F218A9; uc1=cookie14=UoTeNzVRvRsWPg%3D%3D&lng=zh_CN&cookie16=W5iHLLyFPlMGbLDwA%2BdvAGZqLg%3D%3D&existShop=false&cookie21=VFC%2FuZ9ajCbF99I65Qm9gQ%3D%3D&tag=8&cookie15=Vq8l%2BKCLz3%2F65A%3D%3D&pas=0; uc3=nk2=DkmnuVZqM291&id2=UU8OcO9lI45Clg%3D%3D&vt3=F8dBzr2Fa6i4%2Fc9OIz8%3D&lg2=UtASsssmOIJ0bQ%3D%3D; existShop=MTUyOTIxODk0Nw%3D%3D; sg=g43; csg=c98ee665; cookie1=VTrg90saGzeX9ovigm2mqTr%2Fu6w0vPNLI3IgZ0vhu9E%3D; unb=2761447894; skt=ed11d5f665cdb5a8; _cc_=W5iHLLyFfA%3D%3D; _l_g_=Ug%3D%3D; _nk_=mycarting; cookie17=UU8OcO9lI45Clg%3D%3D; isg=BJubr7m9RASWA7jy73KOygu5KvbF2KV447J0TY3ZbBqxbLlOFUA_wrmuAsRizAdq; apushdf188ec636caeab174aad0f3441beb09=%7B%22ts%22%3A1529221231139%2C%22parentId%22%3A1529220867224%7D",
            "Host": "open.taobao.com",
            "Referer": "http://open.taobao.com/apitools/apiPropTools.htm?spm=0.0.0.0.mlPbbQ",
            "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.119 Safari/537.36",
        }
        res = session.get(url, headers=headers)
        # The reply holds two JavaScript statements, each ending in ';', so
        # splitting on ';' yields three parts (the last is whatever follows
        # the final ';').
        outArr = res.text.split(";")
        if len(outArr) == 3:
            if 'var props={"itemprops_get_response":{"item_props":{"item_prop":' in outArr[0]:
                with open('props.csv', 'a') as file1:
                    file1.write('{0}\n'.format(outArr[0]))
            else:
                # Print the cid so the URL can be resubmitted by hand later.
                print str(cid) + " : " + outArr[0]
            if 'var propvalues={"itempropvalues_get_response":{"last_modified":' in outArr[1]:
                with open('propvalues.csv', 'a') as file2:
                    file2.write('{0}\n'.format(outArr[1]))
            else:
                print str(cid) + " : " + outArr[1]
        else:
            print outArr
    except Exception as err:
        print cid
        print err


createCidSelect(cidStr)

How do we run the crawler?

First we need to log in to the Taobao Open Platform: http://open.taobao.com/apitools/apiPropTools.htm?spm=0.0.0.0.mlPbbQ





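To be safe, we should confirm once more that the following URL returns data successfully. It must be sent from the Google Chrome browser that is already logged in above: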

Request URL:http://open.taobao.com/apitools/ajax_props.do?_tb_token_=3365b5d353fed&cid=16&act=childCid&restBool=false&ua=090%23qCQXU4XvXpXXPXi0XXXXXQkOIr7EkU9szQ4bI%2B5rAGB3fovZcnGnGDkIOrgyTU5nq4QXi6W21dwWXvXBV7Vhihc3oVMCx5QuYk3G4k9sXvXq2CCyOmlXKotK%2BvQXaBVRozUEXudBmmLiXXfbC7NK24QXrpecvTFfoVM3ecgeijLiXXfbC7NKH4QXaOXTsEO4%2FBDGPvQXit2CqnY8PCLiXajeGXriHYVCOFhnDXa3HoUmh9kvXP73IzgeG%2FXXHYVmV6hnDXa3Ho64wvQXib%2Fc2viqUjp%2FXvXuCVHkRwiP3vQXi3e7PUasXvXq2C9LOMVXKym324QXQW3c6vF6oVM37RMEihPz9JPRq4QXi6W21dw%3D



Likewise, confirm that the following URL returns data successfully, again from the Google Chrome browser that is already logged in:

Request URL:http://open.taobao.com/apitools/ajax_props.do?_tb_token_=3365b5d353fed&act=props&cid=50010850&restBool=false&ua=090%23qCQXt4XOX6TXPXi0XXXXXQkOIr7EkU7GDQToIeg3AGBvDrxmhYZhGQ86OzpMHU0nq4QXiP%2B0gzfsXvXqtKoQXP7PI0%2Fk%2BvQXaKZWwTXkjr56WYDiXXF2mp%2F9vQjBXvXzFZEaXQDA2dbv5I106NXHuL%2BkykLiXajeGXriHYVCOFhnP353HoUmh9kvXP73IzgeG%2FXXHYVmV6hnDXa3Ho64H4QXa67Mf8dkHqJwPvQXit2CqnY8PmLiXXfUZ8wl3vQXiXXXXXfsXvXqtKy3XPZPPrBC24QXQczccXFzoVM3aTd4ihPz9JPWqX%3D%3D


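For a single request to the URL above, Taobao's server returns 2 records of JavaScript data (in other words, 2 very long lines of string data, each ending with a semicolon). The data is large: these 2 lines come to 7.5 MB. To check that it really is 2 lines, paste the page source into Notepad++ and it is obvious at a glance.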
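Splitting such a reply on `;` works as long as no value contains a semicolon. A more defensive sketch is to locate each `var NAME=` prefix and let the JSON decoder find where the value ends. `parse_js_vars` is a hypothetical helper, not part of the crawler. The sample reply is made up and far smaller than a real one; its outer keys follow the checks in the crawler code, while the inner `pid` and `name` fields are placeholders.

```python
import json
import re

VAR_PATTERN = re.compile(r'var\s+(\w+)\s*=\s*')


def parse_js_vars(text):
    """Map each 'var NAME=' in the reply to its decoded JSON value.

    Raises ValueError if the text after a 'var NAME=' is not valid JSON.
    """
    decoder = json.JSONDecoder()
    result = {}
    pos = 0
    while True:
        match = VAR_PATTERN.search(text, pos)
        if match is None:
            break
        # raw_decode returns the value and the index just past it.
        value, pos = decoder.raw_decode(text, match.end())
        result[match.group(1)] = value
    return result


reply = ('\r\nvar props={"itemprops_get_response":{"item_props":'
         '{"item_prop":[{"pid":1,"name":"a;b"}]}}};'
         '\r\nvar propvalues={"itempropvalues_get_response":'
         '{"last_modified":"2018-06-18"}};')
parsed = parse_js_vars(reply)
```

The value `"a;b"` in the sample would trip up a plain split on `;` but is decoded intact here.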

If you would like to look at the data for the 连衣裙 (dresses) category, it is available here:

https://pan.baidu.com/s/1geos-Z-tYcdSOCpXGxtMkw

Start the crawler:

[myth@contoso ~]$ cd /home/myth/taobao

[myth@contoso taobao]$ python taobao.py

Watch the crawled data as it is written:

tail -f /home/myth/taobao/category-all.csv
tail -f /home/myth/taobao/props.csv

tail -f /home/myth/taobao/propvalues.csv



[myth@contoso ~]$ cd /home/myth/taobao
[myth@contoso taobao]$ ls
category-all.csv  category-top.csv  props.csv  propvalues.csv  taobao.py  venv
[myth@contoso taobao]$ ls -lht
total 94M
-rw-rw-r-- 1 myth myth  94M Jun 18 23:20 propvalues.csv
-rw-rw-r-- 1 myth myth 186K Jun 18 23:20 props.csv
-rw-rw-r-- 1 myth myth 1.7K Jun 18 23:18 category-all.csv
-rw-rw-r-- 1 myth myth 5.8K Jun 18 20:49 taobao.py
-rw-rw-r-- 1 myth myth  178 Jun 17 15:16 category-top.csv
drwxrwxr-x 5 myth myth   82 Jun 16 04:53 venv
[myth@contoso taobao]$

To go on crawling the remaining top-level categories, we may want to save category-top.csv, props.csv and propvalues.csv as category-top1.csv, props1.csv and propvalues1.csv respectively, and then clear all the data crawled so far:

cat /dev/null > /home/myth/taobao/category-all.csv && cat /dev/null > /home/myth/taobao/props.csv && cat /dev/null > /home/myth/taobao/propvalues.csv
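The save-and-clear step above can also be scripted. Here is a minimal sketch: `archive_and_reset` is a hypothetical helper, the file lists mirror the ones named above, and the demonstration runs in a throwaway directory with stand-in contents rather than in /home/myth/taobao.

```python
import os
import shutil
import tempfile


def archive_and_reset(directory, archive_names, clear_names, suffix):
    """Copy each archive file to <stem><suffix><ext>, then empty the clear files."""
    for name in archive_names:
        stem, ext = os.path.splitext(name)
        shutil.copy(os.path.join(directory, name),
                    os.path.join(directory, stem + suffix + ext))
    for name in clear_names:
        # Opening in 'w' mode truncates the file, like `cat /dev/null > file`.
        open(os.path.join(directory, name), 'w').close()


# Demonstration in a temporary directory with stand-in file contents.
demo_dir = tempfile.mkdtemp()
for name in ('category-top.csv', 'category-all.csv', 'props.csv', 'propvalues.csv'):
    with open(os.path.join(demo_dir, name), 'w') as handle:
        handle.write('data for ' + name + '\n')

archive_and_reset(
    demo_dir,
    archive_names=['category-top.csv', 'props.csv', 'propvalues.csv'],
    clear_names=['category-all.csv', 'props.csv', 'propvalues.csv'],
    suffix='1',
)
```

Passing a different suffix on each run (for example the run number) keeps the archives of earlier categories from being overwritten.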

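We can manually replace the first top-level category defined in category-top.csv, "女装/女士精品":

{"cid":16,"is_parent":true,"name":"女装/女士精品","parent_cid":0,"status":"normal"},

with the following top-level category:

{"cid":120886001,"is_parent":true,"name":"公益","parent_cid":0,"status":"normal"},

and go on to crawl the second top-level category, "公益", and so on. In this way we can finish crawling all of Taobao's product category data, key property data, sales property data, and non-key property data.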


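Why are props.csv and propvalues.csv not written in the same format as category-all.csv? Because we want the data returned when a URL is submitted by hand in the browser to have the same format as the data the crawler saves. That way, once the missing data has been filled in, everything can be handed together to a data-formatting tool (which you write yourself) for further processing.

For example, you could assemble the data into SQL INSERT statements and import it into a database, or assemble it into Redis's data import format.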
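As an illustration of such a formatting tool, here is a minimal sketch that loads category-all.csv lines into a database. It uses the simpler category-all.csv format (`status,parent_cid,name,is_parent,cid;`) rather than the props files. SQLite, the `category` table layout, and the helper names are my own assumptions for the example, and parameterized inserts are used instead of hand-assembled SQL strings. The sketch is written for Python 3.

```python
import sqlite3


def parse_category_line(line):
    """Split one 'status,parent_cid,name,is_parent,cid;' line into a row tuple."""
    body = line.strip().rstrip(';')
    # Split from both ends so a comma inside the name does not break parsing.
    status, parent_cid, rest = body.split(',', 2)
    name, is_parent, cid = rest.rsplit(',', 2)
    return (int(cid), int(parent_cid), name, int(is_parent), status)


def load_categories(lines, conn):
    """Insert every non-empty line into the category table; return the row count."""
    conn.execute('CREATE TABLE IF NOT EXISTS category ('
                 'cid INTEGER PRIMARY KEY, parent_cid INTEGER, name TEXT, '
                 'is_parent INTEGER, status TEXT)')
    rows = [parse_category_line(line) for line in lines if line.strip()]
    conn.executemany('INSERT OR REPLACE INTO category VALUES (?, ?, ?, ?, ?)', rows)
    conn.commit()
    return len(rows)


# Lines in the format the crawler writes, using categories from this post.
sample_lines = [
    'normal,16,连衣裙,0,50010850;\n',
    'normal,16,裤子,1,1622;\n',
    'normal,1622,休闲裤,0,162201;\n',
]
conn = sqlite3.connect(':memory:')
inserted = load_categories(sample_lines, conn)
children = conn.execute(
    'SELECT cid FROM category WHERE parent_cid = 16 ORDER BY cid').fetchall()
```

With the rows in a table, finding the children of a category becomes a single query on `parent_cid`.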


Launch the development tool:

/home/myth/pycharm-community-2018.1.4/bin/pycharm.sh

Clear all the data crawled so far:

cat /dev/null > /home/myth/taobao/category-all.csv && cat /dev/null > /home/myth/taobao/props.csv && cat /dev/null > /home/myth/taobao/propvalues.csv

View the top-level category currently being crawled:
cat /home/myth/taobao/category-top.csv

