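A Python Scraper for Amazon France Product Data

Watching friends who run bulk-listing stores on Amazon spend a great deal of time collecting product information, I wrote a script to help. Their day-to-day work is mostly scraping product prices, product images, product descriptions and so on.

Product images are the hardest part to get. The full-size product images can be found in the page's JavaScript.

This article is mainly based on a post from the 二爷记 blog: https://blog.csdn.net/minge89/article/details/106417047/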

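1. Getting the product title

Taking the <title> tag directly would probably be simpler; here I take the title from the page content instead.

HTML <title> of the Amazon product page: <title>Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite: Amazon.fr</title>

Getting the product title: req.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')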
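
As a minimal sketch of the two options mentioned above, the snippet below first tries the productTitle span and falls back to the <title> tag. The function name extract_title, the SAMPLE_HTML stand-in page and the rule for stripping the ": Amazon.fr" suffix are assumptions made for illustration, not part of the original script.

```python
from lxml import etree

# Trimmed-down stand-in for a product page; real pages are far larger.
SAMPLE_HTML = """
<html><head>
<title>Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite: Amazon.fr</title>
</head><body>
<h1 id="title"><span id="productTitle">
    Echo Dot (3ème génération), Enceinte connectée avec Alexa, Tissu anthracite
</span></h1>
</body></html>
"""


def extract_title(html):
    """Return the product title, preferring #productTitle over <title>."""
    tree = etree.HTML(html)
    parts = tree.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')
    if parts:
        return parts[0].strip()
    # Fallback (assumption): drop the trailing ": Amazon.fr" site suffix.
    page_title = tree.xpath('//title/text()')
    if page_title:
        return page_title[0].rsplit(':', 1)[0].strip()
    return None


title = extract_title(SAMPLE_HTML)
print(title)
```

Returning None instead of indexing an empty list avoids an IndexError on pages where neither element is present, for example when Amazon serves a captcha page.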
             

2. Getting product attributes

 

<ul class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro">

<li class="a-spacing-small videoCountTemplate aok-hidden"><span class="a-list-item">
<span id="videoCount_template" class="a-size-mini a-color-secondary video-count a-text-bold a-nowrap"> <hza:string id=""></hza:string></span>
</span></li>
<li class="a-spacing-small 360IngressTemplate pos-360 aok-hidden"><span class="a-list-item">
<span class="a-declarative" data-action="thumb-action" data-thumb-action="{}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-3"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-3-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-3-announce">
<img alt="" src="https://images-na.ssl-images-amazon.com/images/G/08/HomeCustomProduct/360_icon_73x73v2._CB485971279_SS40_FMpng_RI_.png">
</span></span></span>
</span>
</span></li>

<li class="a-spacing-small template"><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-4"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-4-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-4-announce">
<span class="placeHolder"></span>
</span></span></span>
</span></li>
<li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle a-button-selected a-button-focus" id="a-autoid-5"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-5-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-5-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-6"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-6-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-6-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41hX%2B2Es%2BvL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-7"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-7-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-7-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51I5TLQy-JL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-8"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-8-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-8-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51b2EY6IdsL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-9"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-9-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-9-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41F9DlWvsrL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-10"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-10-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-10-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/51C-rk6qlOL._AC_US40_.jpg">
</span></span></span>
</span></li><li class="a-spacing-small item imageThumbnail a-declarative" data-ux-click=""><span class="a-list-item">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-11"><span class="a-button-inner"><input class="a-button-input" type="submit" aria-labelledby="a-autoid-11-announce"><span class="a-button-text" aria-hidden="true" id="a-autoid-11-announce">
<img src="https://images-na.ssl-images-amazon.com/images/I/41PZZf1xU6L._AC_US40_.jpg">
</span></span></span>
</span></li></ul>

 

First extract all the list items of the image carousel. The class value changes depending on the product category:

req.xpath('//ul[@class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro"]/li')
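
Because the full class string differs between product categories, matching on a single class token may be more robust than matching the whole attribute. The sketch below assumes that the thumbnail items always carry the imageThumbnail class, as in the sample HTML above; thumbnail_urls and SAMPLE_HTML are hypothetical names, and the sample is a shortened copy of that HTML.

```python
from lxml import etree

# Shortened copy of the thumbnail list shown above.
SAMPLE_HTML = """
<ul class="a-unordered-list a-nostyle a-button-list a-vertical a-spacing-top-micro">
  <li class="a-spacing-small template"><span class="placeHolder"></span></li>
  <li class="a-spacing-small item imageThumbnail a-declarative">
    <img src="https://images-na.ssl-images-amazon.com/images/I/51sWJTvgBfL._AC_US40_.jpg">
  </li>
  <li class="a-spacing-small item imageThumbnail a-declarative">
    <img src="https://images-na.ssl-images-amazon.com/images/I/51I5TLQy-JL._AC_US40_.jpg">
  </li>
</ul>
"""


def thumbnail_urls(html):
    """Collect thumbnail image URLs from the carousel list items."""
    tree = etree.HTML(html)
    # contains() is a substring test on the class attribute, so it keeps
    # working when the other classes on the <li> or <ul> change.
    return tree.xpath('//li[contains(@class, "imageThumbnail")]//img/@src')


thumbs = thumbnail_urls(SAMPLE_HTML)
for src in thumbs:
    print(src)
```

The template and placeholder items are skipped because they do not carry the imageThumbnail class.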
             

Getting the product colour attributes

<ul class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesSquare imageSwatches" role="radiogroup" data-action="a-button-group" data-a-button-group="{&quot;name&quot;:&quot;twister_color_name&quot;}">

<li id="color_name_0" title="Cliquez pour sélectionner Tissu anthracite" data-defaultasin="B07PHPXHQS" data-dp-url="" class="swatchAvailable"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:0}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-12" aria-checked="false"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-12-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61sD09wyFML._SS36_.jpg" alt="Tissu anthracite" style="height:36px; width:36px" class="imgSwatch">
</div>

<div class=" " style="">

</div>

</div>


</button></span></span>
</span>
</div>

</span></li>

<li id="color_name_1" title="Cliquez pour sélectionner Tissu prune" data-defaultasin="B07WLTKTXY" data-dp-url="/dp/B07WLTKTXY/ref=twister_B07H61CQCM?_encoding=UTF8&amp;psc=1" class="swatchSelect"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:1}">
<span class="a-button a-button-thumbnail a-button-toggle a-button-selected" id="a-autoid-13" aria-checked="true"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-13-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61mROAfn-NL._SS36_.jpg" alt="Tissu prune" style="height:36px; width:36px" class="imgSwatch">
</div>

<div class=" " style="">
</div>
</div>
</button></span></span>
</span>
</div>
</span></li>

<li id="color_name_2" title="Cliquez pour sélectionner Tissu sable" data-defaultasin="B07PDHSPXT" data-dp-url="/dp/B07PDHSPXT/ref=twister_B07H61CQCM?_encoding=UTF8&amp;psc=1" class="swatchAvailable"><span class="a-list-item">
<div class="tooltip">
<span class="a-declarative" data-action="swatchthumb-action" data-swatchthumb-action="{&quot;dimIndex&quot;:1,&quot;dimValueIndex&quot;:2}">
<span class="a-button a-button-thumbnail a-button-toggle" id="a-autoid-14" aria-checked="false"><span class="a-button-inner"><button class="a-button-text" type="button" id="a-autoid-14-announce">

<span class="xoverlay"></span>
<div class="">
<div class="">
<img src="https://m.media-amazon.com/images/I/61FlVonHYyL._SS36_.jpg" alt="Tissu sable" style="height:36px; width:36px" class="imgSwatch">
</div>
<div class=" " style="">
</div>
</div>
</button></span></span>
</span>
</div>
</span></li>

</ul>

 

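With some simple formatting applied. The li ids carry a numeric suffix (color_name_0, color_name_1, ...), so they are matched with starts-with, and the colour name is read from the alt attribute of each swatch image:

productColors=req.xpath('//li[starts-with(@id,"color_name_")]//img/@alt')
productColor=','.join(productColors)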
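
The swatch list above also carries the ASIN of every colour variant in data-defaultasin. As a sketch under the assumption that other product pages use the same attributes as this sample, the snippet below collects name, ASIN and selection state per colour. extract_colors and SAMPLE_HTML are hypothetical names, and the sample is a reduced copy of the HTML above.

```python
from lxml import etree

# Reduced copy of the colour swatch list shown above.
SAMPLE_HTML = """
<ul role="radiogroup">
  <li id="color_name_0" data-defaultasin="B07PHPXHQS" class="swatchAvailable">
    <img src="https://m.media-amazon.com/images/I/61sD09wyFML._SS36_.jpg" alt="Tissu anthracite" class="imgSwatch">
  </li>
  <li id="color_name_1" data-defaultasin="B07WLTKTXY" class="swatchSelect">
    <img src="https://m.media-amazon.com/images/I/61mROAfn-NL._SS36_.jpg" alt="Tissu prune" class="imgSwatch">
  </li>
  <li id="color_name_2" data-defaultasin="B07PDHSPXT" class="swatchAvailable">
    <img src="https://m.media-amazon.com/images/I/61FlVonHYyL._SS36_.jpg" alt="Tissu sable" class="imgSwatch">
  </li>
</ul>
"""


def extract_colors(html):
    """Return one dict per colour swatch: name, ASIN, selected flag."""
    tree = etree.HTML(html)
    colors = []
    # The ids are color_name_0, color_name_1, ... so match on the prefix.
    for li in tree.xpath('//li[starts-with(@id, "color_name_")]'):
        alt = li.xpath('.//img[@class="imgSwatch"]/@alt')
        colors.append({
            "name": alt[0].strip() if alt else None,
            "asin": li.get("data-defaultasin"),
            "selected": li.get("class") == "swatchSelect",
        })
    return colors


colors = extract_colors(SAMPLE_HTML)
for color in colors:
    print(color)
```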


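Getting the product images

Finding the image links took a lot of effort. They are written into the page's JavaScript, so the only option is to pull them out with regular expressions.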

imgs_text=re.findall(r'ImageBlockATF(.+?)return data;',html,re.S)[0]
imgs=re.findall(r'"large":"(.+?)","main":',imgs_text,re.S)
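
The two regular expressions above can be wrapped in a function that does not crash when the script block is missing. This is only a sketch: the SAMPLE_HTML below is an invented stand-in shaped to match what those two patterns expect (the example.com URLs are placeholders), not a copy of Amazon's real markup, and extract_large_images is a hypothetical name.

```python
import re

# Invented stand-in for the inline script; only its shape matters here.
SAMPLE_HTML = '''
<script>
P.when('A').register("ImageBlockATF", function(A){
    var data = {
        'colorImages': { 'initial': [
            {"thumb":"https://example.com/I/aaa._AC_US40_.jpg","large":"https://example.com/I/aaa._AC_.jpg","main":{},"variant":"MAIN"},
            {"thumb":"https://example.com/I/bbb._AC_US40_.jpg","large":"https://example.com/I/bbb._AC_.jpg","main":{},"variant":"PT01"}
        ]}
    };
    return data;
});
</script>
'''


def extract_large_images(html):
    """Return the "large" image URLs found inside the ImageBlockATF script."""
    block = re.search(r'ImageBlockATF(.+?)return data;', html, re.S)
    if block is None:
        # Script block not found, e.g. a captcha page or a changed layout.
        return []
    urls = re.findall(r'"large":"(.+?)","main":', block.group(1), re.S)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(urls))


images = extract_large_images(SAMPLE_HTML)
print(images)
```

Using re.search with a None check replaces the [0] index on re.findall, which raises IndexError whenever the block is absent.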
             

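There are two kinds of images: the carousel images and the large images shown on mouse hover.

Images on the product detail page

A single page has more than 30,000 lines of code, so digging out the data you need takes patient analysis. The image data is probably the most troublesome part.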

 

Source code attached, for reference, study and discussion only:

# Amazon France product scraper
# 20200524 by WeChat: huguo00289
# https://www.amazon.fr/dp/B07CNJTCBB/ref=twister_B07RVPW2GT?_encoding=UTF8&th=1

# -*- coding: utf-8 -*-
import re
import time

import requests
from fake_useragent import UserAgent
from lxml import etree


def ua():
    # Build request headers with a random User-Agent
    ua = UserAgent()
    headers = {"User-Agent": ua.random}
    return headers


def get_data(url):
    # The product id is the path segment that follows "dp/"
    id = re.findall(r'dp/(.+?)/', url, re.S)[0]
    print(f'>>> Product id from the URL you entered: {id}, scraping, please wait..')
    response = requests.get(url, headers=ua(), timeout=8)
    time.sleep(2)
    if response.status_code == 200:
        print(">>> Page data fetched successfully!")
        html = response.content.decode('utf-8')
        # Keep a copy of the raw page for later analysis
        with open(f'{id}.html', 'w', encoding='utf-8') as f:
            f.write(html)
        req = etree.HTML(html)
        h1 = req.xpath('//h1[@id="title"]/span[@id="productTitle"]/text()')
        print(h1)
        h1 = h1[0].strip()
        print(f'Product title: {h1}')
        productDescriptions = req.xpath('//div[@id="productDescription"]//text()')
        productDescription = ''.join(productDescriptions)
        print(f'Product description: {productDescription}')
        # The large image URLs live inside an inline script, so use regexes
        imgs_text = re.findall(r'ImageBlockATF(.+?)return data;', html, re.S)[0]
        imgs = re.findall(r'"large":"(.+?)","main":', imgs_text, re.S)
        print(imgs)
        text = f'Product title: {h1}\nProduct description: {productDescription}\nProduct images: {imgs}'
        with open(f'{id}.txt', 'w', encoding='utf-8') as f:
            f.write(text)
        print(f">>> Product data saved as {id}.txt")
        # Variant swatches; the class value varies with the product category
        lis = req.xpath('//ul[@class="a-unordered-list a-nostyle a-button-list a-declarative a-button-toggle-group a-horizontal a-spacing-top-micro swatches swatchesSquare"]/li')
        if len(lis) > 1:
            print(f">>> The product has variant attributes, {len(lis)} variants in total!")
            spans = req.xpath('//div[@class="twisterTextDiv text"]/span[@class="a-size-base"]/text()')
            print(spans)


if __name__ == '__main__':
    print("Amazon scraping tool - by WeChat official account: 二爷记")
    print("Bug reports, WeChat: huguo00289")
    url = input("Enter the URL to scrape and press Enter to run: ")

    try:
        get_data(url)
    except Exception as e:
        # An exception object is not a string, so convert it before searching
        if "port=443" in str(e):
            print("Timed out fetching the page, retrying..")
            get_data(url)
        else:
            print(f"Scraping failed: {e}")
    print("Scraping finished!")
    print("The program will close in 8s. Bug reports, WeChat: huguo00289")
    time.sleep(8)
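
The script above retries once when the error message mentions port=443. A bounded retry loop is one possible alternative; the sketch below is not the author's method. fetch_with_retry, flaky_fetch and the retries, delay and sleep parameters are hypothetical names, and the fetch function is passed in so that the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=2.0, sleep=time.sleep):
    """Call fetch(url); on failure wait and try again, up to `retries` attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as e:
            # In real use, narrow this to requests.exceptions.RequestException.
            last_error = e
            print(f"attempt {attempt}/{retries} failed: {e}")
            if attempt < retries:
                sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error


# Demo with a stand-in fetch function that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("HTTPSConnectionPool(host='www.amazon.fr', port=443): Read timed out.")
    return "<html>ok</html>"


result = fetch_with_retry(
    flaky_fetch,
    "https://www.amazon.fr/dp/B07CNJTCBB/",
    delay=0,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(result, calls["n"])
```

In the scraper, fetch could be a small function that wraps requests.get with the headers and timeout used above.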

 

             

 

 

 

Below is reference code for a US Amazon scraper

 

# -*- coding: utf-8 -*-
"""
File Name:     amzone
Description :
Author :       meng_zhihao
mail :       312141830@qq.com
date:          2019/5/8
"""
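# Amazon US
import base64
import datetime
import os
import re
import time
from urllib.parse import quote

import requests
import xlsxwriter
from PIL import Image

from crawl_tool_for_py3 import crawlerTool as ct  # helper module, not shown in this article
from selenium_operate import ChromeOperate  # Selenium wrapper, not shown in this article

# Note: this points at the German site; for the US site use 'https://www.amazon.com'
DOMAIN = 'https://www.amazon.de'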

HEADERS = { 'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_1_1 like Mac OS X) AppleWebKit/602.2.14 (KHTML, like Gecko) Mobile/14B100 MicroMessenger/6.3.22 NetType/WIFI Language/zh_CN'
            }
se = requests.session()

def img_resize(infile,outfile):
    im = Image.open(infile)
    # (x, y) = im.size  # read image size
    x_s = 120  # target width in pixels
    y_s = 160  # target height in pixels (fixed, so the aspect ratio is not preserved)
    # Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter
    out = im.resize((x_s, y_s), Image.LANCZOS)  # resize image with high-quality
    out.save(outfile)


def gen_xls(item_infos):
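    timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    os.makedirs('img', exist_ok=True)  # resized thumbnails are written to this folder
    book = xlsxwriter.Workbook('amazon%s.xlsx'%timestamp)
    worksheet = book.add_worksheet('demo')
    worksheet.write_row(0,0, ['Keyword','Rank','Image','Price','Category','Description','URL'])
    worksheet.set_column('A:D', 15) # column width is roughly 8 pixels, row height roughly 1.37 pixels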
    worksheet.set_column('C:C', 20)
    worksheet.set_column('B:B', 10)
    worksheet.set_column('F:F', 50)
    for i in range(len(item_infos)):
        col = i+1
        try:
            item_info = item_infos[i]
            row =   [item_info['keyword'],item_info['rank'],'',item_info['price'],item_info['cat'],item_info['descriptions'],item_info['item_url']]
            worksheet.write_row(col,0, row)
            worksheet.set_row(col, 120)
            if 'item_pic_base64' in item_info:
                item_pic_base64 = item_info["item_pic_base64"]
                try:
                    if 'https:' in item_pic_base64:
                        data = ct.get(item_pic_base64)
                    else:
                        data = base64.b64decode(item_pic_base64)
                    with open('test.png', 'wb') as f:
                        f.write(data)
                    img_resize('test.png', 'img/tmp%s.png'%i)
                    worksheet.insert_image( col,2, 'img/tmp%s.png'%i) # file names must be different
                except Exception as e:
                    print(str(e))
        except Exception as e:
            print(str(e))
    print('Number of results: %s' % len(item_infos))
    book.close()


def extractor_page(page): # parse a product page
    item_info = {"descriptions":""}
    descriptions = ct.getXpath('//div[@id="productDescription"]/p/text()',page)
    if not descriptions:
        descriptions = ct.getXpath( '//div[@id="aplus"]/div//p//text()', page)
    descriptions= ''.join([description.strip() for description in descriptions])
    item_info["descriptions"] = descriptions
    item_pic_base64 = ct.getXpath1( '//div[@id="imgTagWrapperId"]/img/@src', page).split('base64,')[-1]
    item_info["item_pic_base64"] = item_pic_base64
    price = ct.getXpath1( '//span[@id="priceblock_ourprice"]/text()', page)
    item_info["price"] = price
    cats =  ct.getXpath( '//div[@id="wayfinding-breadcrumbs_container"]//a/text()', page)
    item_info["cat"] = '/'.join([cat.strip() for cat in cats])
    for k in item_info:
        print(k)
    return item_info

if __name__ == '__main__':
    #start_url = 'https://xueqiu.com/v4/statuses/public_timeline_by_category.json?since_id=-1&count=15&category=105'
    csv_rows=[]
    cookie = {}
    item_infos = []
    cop = ChromeOperate(executable_path=r'chromedriver.exe')
    cop.open(DOMAIN)
    with open('keywords.txt','r') as keyword_file:
        for line in keyword_file:
            line = line.strip()
            if not line:
                continue
            urls = [DOMAIN+'/s?k=%s&ref=nb_sb_noss_2'%quote(line),
                    # 'https://www.amazon.com/s?k=%s&ref=nb_sb_noss_2&page=2 ' % quote(line)
                    ]
            rank = 0
            for url in urls:
                # HEADERS.update({"Referer":url,"User-Agent":random.choice(USER_AGENT_POOL)})
                cop.open(url)
                page = cop.open_source()
                item_urls = ct.getXpath('//div[@class="sg-row"]//div[@class="sg-col-inner"]//h2/a/@href',page)
                if not item_urls:
                    print(page)
                for item_url in item_urls:
                    rank += 1
                    try:
                        if not 'qid' in item_url:
                            continue
                        else:
                            item_url = DOMAIN+item_url
                            cop.open(item_url)
                            page = cop.driver.page_source
                            if 'Kindle Edition' in page:
                                continue
                            # Check for the captcha page before parsing, since extraction fails on it
                            if 'Type the characters you see' in page  :
                                print('IP has been blocked',url)
                                time.sleep(10)
                                # print page
                                break
                            item_info = extractor_page(page)
                            item_info['keyword'] = line
                            item_info['rank'] = rank
                            item_info['item_url'] = item_url.split('?')[0]
                            item_infos.append(item_info)
                    except Exception as e:
                        print(str(e))
    gen_xls(item_infos)
    cop.quit()
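
Two small pieces of the main loop above can be tried without a browser: building the search URL from a keyword, and detecting the captcha page. This is a sketch; build_search_url and is_robot_check are hypothetical helper names, the DOMAIN value assumes the US site, and the marker text is the English string used above, so it would presumably not match on non-English sites.

```python
from urllib.parse import quote

DOMAIN = 'https://www.amazon.com'  # assumption: the US site


def build_search_url(keyword, domain=DOMAIN):
    """Build the search results URL for one keyword (URL-encoded)."""
    return domain + '/s?k=%s&ref=nb_sb_noss_2' % quote(keyword)


def is_robot_check(page):
    """Return True if the page looks like the English captcha page."""
    return 'Type the characters you see' in page


search_url = build_search_url('echo dot')
print(search_url)
print(is_robot_check('<p>Type the characters you see in this image:</p>'))
```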
 

 

————————————————
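Copyright notice: This is an original article by CSDN blogger 「二爷记」, licensed under CC 4.0 BY-SA. When reposting, please include the original source link and this notice.
Original link: https://blog.csdn.net/minge89/article/details/106417047/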
