1. Scraping information from a Tmall shop

URL = "https://detail.tmall.com/item.htm?spm=a230r.1.14.8.4a1a115fb1rHn5&id=617806269122&cm_id=140105335569ed55e27b&abbucket=3&sku_properties=154362399:30930041"
[Target]:
Scrape: product id, product title, main image URL, product price, shop name, shopkeeper name, shop address.
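Of the fields listed, the product id also appears in the page URL itself, as the `id` query parameter. The following is a minimal sketch of reading it with the standard library, without fetching the page; the helper name `goods_id_from_url` is hypothetical and not part of the original script.

```python
from urllib.parse import urlparse, parse_qs


def goods_id_from_url(url):
    """Return the value of the `id` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("id")
    return values[0] if values else None


url = ("https://detail.tmall.com/item.htm?spm=a230r.1.14.8.4a1a115fb1rHn5"
       "&id=617806269122&cm_id=140105335569ed55e27b&abbucket=3"
       "&sku_properties=154362399:30930041")
print(goods_id_from_url(url))  # 617806269122
```

Parsing the query string matches the parameter name exactly, so `cm_id` is never mistaken for `id`, which a loose regular expression could do.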
The code is as follows:
import re

import requests
from lxml import etree
from parsel import Selector

# Send a browser-like User-Agent with the request.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'
}


def get_goods_id_by_url(response):
    # The canonical link of the page carries the product id as a query parameter.
    res_html = etree.HTML(response)
    goods_url = res_html.xpath('.//link[@rel="canonical"]/@href')[0].strip()
    # Match only the digits of the "id" parameter (not "cm_id" or anything after it).
    goods_id = re.findall(r'[?&]id=(\d+)', goods_url)[0]
    print('Product id: ' + goods_id)


def get_goods_title(sel):
    # The alt text of the main image holds the product title.
    goods_title = sel.xpath('//img[@id="J_ImgBooth"]/@alt').extract()[0]
    print('Product title: ' + goods_title)


def get_goods_mainimages_adress(sel):
    goods_mainimages_adress = sel.xpath('//img[@id="J_ImgBooth"]/@src').extract()[0]
    print('Main image URL: ' + goods_mainimages_adress)


def get_goods_price(sel):
    # extract_first() returns a string or None; extract() returns a list,
    # which cannot be concatenated to a string.
    goods_price = sel.xpath('//dl[@id="J_PromoPrice"]/div[@class="tm-promo-price"]/span[@class="tm-price"]/text()').extract_first()
    print('Product price: ' + str(goods_price))


def get_goods_shopname(sel):
    goods_shopname = sel.xpath('//div[@id="shopExtra"]/div[@class="slogo"]/a/strong/text()').extract()[0]
    print('Shop name: ' + goods_shopname)


def get_goods_shopkeeper(sel):
    goods_shopkeeper = sel.xpath('//div[@class="extend"]/ul/li[@class="shopkeeper"]/div[@class="right"]/a/text()').extract()[0]
    print('Shopkeeper name: ' + goods_shopkeeper)


def get_goods_shopadress(sel):
    goods_shopadress = sel.xpath('//div[@class="extend"]/ul/li[@class="locus"]/div[@class="right"]/text()').extract()[0].strip()
    print('Shop address: ' + goods_shopadress)


url = "https://detail.tmall.com/item.htm?spm=a230r.1.14.8.4a1a115fb1rHn5&id=617806269122&cm_id=140105335569ed55e27b&abbucket=3&sku_properties=154362399:30930041"
response = requests.get(url, headers=headers).text
sel = Selector(text=response)
get_goods_id_by_url(response)
get_goods_title(sel)
get_goods_mainimages_adress(sel)
# get_goods_price(sel)  # Tmall encrypts the product price, so a simple scraper like this one cannot retrieve it
get_goods_shopname(sel)
get_goods_shopkeeper(sel)
get_goods_shopadress(sel)
Output:
Product id: 617806269122
Product title: 得力83650儿童智能闹钟语音控制学生用多功能床头语音提醒器卡通
Main image URL: //img.alicdn.com/imgextra/i4/407910984/O1CN01NZmiXe1J8iKRcNCIC_!!407910984.jpg_430x430q90.jpg
Shop name: 得力官方旗舰店
Shopkeeper name: 得力官方旗舰店
Shop address: 浙江, 宁波
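The main image address in the output above begins with `//`, that is, it is a protocol-relative URL and cannot be passed to `requests.get` as it stands. The following is a minimal sketch of turning it into an absolute URL with the standard library; the helper name `absolute_url` and the shortened `page_url` are illustrative assumptions, not part of the original script.

```python
from urllib.parse import urljoin


def absolute_url(page_url, src):
    """Resolve src against page_url; a leading '//' inherits the page's scheme."""
    return urljoin(page_url, src)


page_url = "https://detail.tmall.com/item.htm?id=617806269122"
src = "//img.alicdn.com/imgextra/i4/407910984/O1CN01NZmiXe1J8iKRcNCIC_!!407910984.jpg_430x430q90.jpg"
print(absolute_url(page_url, src))
# https://img.alicdn.com/imgextra/i4/407910984/O1CN01NZmiXe1J8iKRcNCIC_!!407910984.jpg_430x430q90.jpg
```

Resolving against the page URL rather than hard-coding `https:` also handles ordinary relative paths, should the page ever use them.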
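This article describes a way to scrape product details from a Tmall shop: the product id, title, main image, shop name, shopkeeper name, and shop location. It also includes a price extractor, although the price cannot be retrieved with this simple approach. The page data is fetched and parsed with the Python libraries requests, lxml, and parsel.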
