Amazon Hourly Bestseller List

Latest recommended article published on 2024-07-27 12:20:46

Godsshdo

Category column: Some Experiments. Tags: python, http, data mining

Article link: https://blog.csdn.net/Godsshdo/article/details/108338305

Example: Scraping the Amazon Hourly Bestseller List

Background
Hands-On

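Background

The goal is to collect products from the bestseller lists on Amazon Europe (amazon.de).
Requirements: top-level category rank within 7000, fewer than 10 reviews, and a price between 6 € and 15 €.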
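
The three requirements can be written down as one predicate before any scraping is done. This is only a sketch: the `Product` record and its field names are assumptions made for illustration, and so is the reading that the whole price range has to lie inside 6 € to 15 €.

```python
from dataclasses import dataclass


@dataclass
class Product:
    # Hypothetical record; the field names are assumptions for illustration.
    rank: int          # rank within the top-level category
    reviews: int       # number of customer reviews
    price_min: float   # lowest listed price in EUR
    price_max: float   # highest listed price in EUR


def meets_requirements(p: Product) -> bool:
    """Return True if the product satisfies all three requirements."""
    rank_ok = p.rank <= 7000
    reviews_ok = p.reviews < 10
    # Assumption: the whole price range must lie inside [6, 15].
    price_ok = 6.0 <= p.price_min and p.price_max <= 15.0
    return rank_ok and reviews_ok and price_ok


print(meets_requirements(Product(rank=1200, reviews=3, price_min=11.99, price_max=11.99)))
print(meets_requirements(Product(rank=1200, reviews=3, price_min=3.30, price_max=23.52)))
```

If a product should count as soon as any part of its price range overlaps the budget, only the `price_ok` line would need to change.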

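Hands-On

Fetching the page

I had already needed to scrape other sites before, so this time I reuse the driver code from those projects. It consists of a driver_open() function and a get_content() function, shown below: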

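from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from bs4 import BeautifulSoup
import time

def driver_open():
    # Note: webdriver.PhantomJS needs Selenium 3.x; it was removed in Selenium 4.
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    # Send a desktop Firefox User-Agent so the request looks like a normal browser
    dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
    driver = webdriver.PhantomJS(executable_path='D:/somethingforpy/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs.exe', desired_capabilities=dcap)
    return driver

def get_content(driver, url):
    driver.get(url)
    # Wait 1 second; adjust to how long the dynamic page takes to load
    time.sleep(1)
    # Get the page source
    content = driver.page_source.encode('utf-8')
    # The driver is closed here, so call driver_open() again before the next URL
    driver.close()
    soup = BeautifulSoup(content, 'lxml')
    return soup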

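To be honest, I am not very familiar with the first function, driver_open(). I copied it from elsewhere: a site I was scraping back then blocked my crawler, so I had to set a request header (the User-Agent). I will come back and fill in the details once I understand it better.
I also never dug into why the second function waits 1 s; I followed other people's code. Back then I finished the task and forgot to revisit it (let's see whether I remember to come back this time =.=)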
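
On the one-second wait: per the comment in get_content(), it gives a dynamically loaded page time to finish before page_source is read. A fixed sleep can be too short or needlessly long, so a common alternative is to poll for a condition with a timeout. The helper below is a generic sketch in plain Python; `wait_until` and its parameters are hypothetical names, not part of Selenium or of the original script.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With a driver, the condition could be a lambda that checks whether some text you expect on the finished page already appears in driver.page_source. Selenium also ships its own WebDriverWait class (in selenium.webdriver.support.ui) for the same purpose.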

With these two functions in place, basic scraping only requires setting the url:

url="https://www.amazon.de/gp/bestsellers/?ref_=nav_cs_bestsellers"
driver = driver_open()
soup = get_content(driver,url)
print(soup)

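Getting the bestseller category information

[Figure: location of the information to scrape]
The left-hand column lists the top-level bestseller categories. After browsing through them, you can see that each category's URL follows the same pattern:

https://www.amazon.de/gp/bestsellers/+category+/ref=zg_bs_nav_0

So the first step is to obtain the code of each category. The code is as follows: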

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import requests
import time
import xlrd
from bs4 import BeautifulSoup
import re
import pandas as pd
from lxml import etree

def driver_open():
    dcap = dict(DesiredCapabilities.PHANTOMJS)
    dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0")
    driver = webdriver.PhantomJS(executable_path='D:/somethingforpy/phantomjs-2.1.1-windows/phantomjs-2.1.1-windows/bin/phantomjs.exe', desired_capabilities=dcap)
    return driver

def get_content(driver,url):
    driver.get(url)
    # Wait 1 second; adjust to how long the dynamic page takes to load
    time.sleep(1)
    # Get the page source
    content = driver.page_source.encode('utf-8')
    driver.close()
    soup = BeautifulSoup(content, 'lxml')
    return soup

i = 0
# Fetch the page structure
url="https://www.amazon.de/gp/bestsellers/?ref_=nav_cs_bestsellers"
driver = driver_open()
soup = get_content(driver,url)
print(soup)

Copy the printed output into Notepad++, open any category from the bestseller list, copy its URL, and search for it in Notepad++. Taking Launchpad as an example, its URL is:

https://www.amazon.de/gp/bestsellers/boost/ref=zg_bs_nav_0

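Its position in soup is:
[Figure: the corresponding position in soup]
It sits in the href of an a tag (obviously).
The codes of the other categories are right next to it, and all of the links start with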

https://www.amazon.de/gp/bestsellers/

so the following statements can filter out these links:

recorda = soup.find_all('a')
for item in recorda:
    # href may be missing (None), so convert to str before testing the prefix
    s = str(item.get("href"))
    if s.startswith("https://www.amazon.de/gp/bestsellers/"):
        # After splitting on '/', the category code is the sixth piece
        l = s.split('/')
        print(l[5])

After splitting on '/', the code is the sixth element, so I take l[5] directly.
This finally gives the code of each category:

boost
amazon-devices
mobile-apps
audible
automotive
baby
diy
beauty
…

Record the codes of the top-level categories first; afterwards they can be used directly without fetching them again.
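
Indexing l[5] works because every matching link shares the same prefix, but it would also accept a link with no category segment, such as the start URL itself, whose sixth piece is the query string. A slightly more defensive variant parses the URL path instead. This is a sketch using only the standard library; `category_slug` is a hypothetical helper name.

```python
from urllib.parse import urlparse

PREFIX = "/gp/bestsellers/"


def category_slug(href):
    """Return the category code from a bestsellers link, or None if there is none."""
    if not href:
        return None  # <a> tags without an href give None
    parsed = urlparse(href)
    if parsed.netloc != "www.amazon.de" or not parsed.path.startswith(PREFIX):
        return None
    # Path segments after the prefix, e.g. ["boost", "ref=zg_bs_nav_0"]
    segments = [s for s in parsed.path[len(PREFIX):].split("/") if s]
    if not segments or segments[0].startswith("ref="):
        return None
    return segments[0]


links = [
    "https://www.amazon.de/gp/bestsellers/boost/ref=zg_bs_nav_0",
    "https://www.amazon.de/gp/bestsellers/?ref_=nav_cs_bestsellers",
    None,
]
print([category_slug(h) for h in links])
```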

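Getting the subcategories

After the top-level categories come the subcategories. Since we already know how to build the URL of any top-level category, there is no need to visit each one here; picking one at random and looking for its subcategories is enough.
The approach is the same as for the top-level categories (the pages share the same layout), so I won't repeat it here.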

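Getting the product information

Once inside a specific subcategory, we can start scraping product information. Note that when we click through the links, the URL we land on has this format: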

https://www.amazon.de/gp/bestsellers/boost/9418396031/ref=zg_bs_nav_1_boost

However, each subcategory's list spans two pages. Clicking into page 1 and page 2 again shows a new URL:

https://www.amazon.de/gp/bestsellers/boost/9418396031/ref=zg_bs_pg_1?ie=UTF8&pg=1

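Experimenting shows that pg=2 at the end gives the second page, so this link format is worth remembering for direct use later. The general format is:

https://www.amazon.de/gp/bestsellers/category/subcategory/ref=zg_bs_pg_1?ie=UTF8&pg=page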
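
The general format can be wrapped in a small helper so that both pages of a subcategory are generated rather than typed by hand. A minimal sketch; `page_url` is a hypothetical name, and the example values are the ones from the URLs above.

```python
def page_url(category, subcategory, page):
    """Build the URL of one page of a subcategory's bestseller list."""
    return (
        "https://www.amazon.de/gp/bestsellers/"
        f"{category}/{subcategory}/ref=zg_bs_pg_1?ie=UTF8&pg={page}"
    )


# Each subcategory has two pages, per the observation above.
urls = [page_url("boost", "9418396031", page) for page in (1, 2)]
for u in urls:
    print(u)
```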

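Next comes scraping the product information. As before, extract the page content, put it into Notepad++, and use Ctrl+F to search for any product's name. For example, I picked the product below:

It appears in many places, but one of them stands out:

[Figure: the distinctive position]
The class of its a tag looks fairly distinctive, which makes it easy to locate, and searching for other products shows that the format is the same.
The link here is incomplete, though; it only needs the following prepended: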

amazon.de

With that, the link of each product can be obtained. The code is as follows:

i = 0
# Scrape the products
url = "https://www.amazon.de/gp/bestsellers/boost/9418396031/ref=zg_bs_pg_2?ie=UTF8&pg=1"
driver = driver_open()
soup = get_content(driver, url)
# Collect the link of each item via the distinctive class on its a tag
temp = soup.find_all("a", attrs={'class': 'a-link-normal a-text-normal'})
for item in temp:
    print("===", i, "===")
    # The href is a relative path, so prepend the domain
    s = "amazon.de" + str(item.get('href'))
    print(s)
    i += 1

The output is as follows:
[Figure: output]
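
Because the href values are relative paths, another way to complete them is urljoin from the standard library, which also adds the scheme so the result can be passed straight to driver.get(). A small sketch; `absolute_link` is a hypothetical helper, and the sample path is made up for illustration rather than taken from the actual output.

```python
from urllib.parse import urljoin

BASE = "https://www.amazon.de"


def absolute_link(href):
    """Turn a relative product href into a full URL on amazon.de."""
    return urljoin(BASE, href)


# Hypothetical relative href, only to show the shape of the result.
print(absolute_link("/dp/EXAMPLE/ref=zg_bs_1"))
```

An href that is already absolute is returned unchanged, so the helper is safe to apply to every link.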
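Next is finding the price range.
The price is located here:

[Figure: price]

I originally wanted to locate it with 'p13n-sc-price', but that turned out to match other odd things as well, so I use 'a-size-base a-color-price' instead.
Locating it this way means I still need to get the child nodes under the span. Once soup finds the parent node,
i.e.: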

<span class="a-size-base a-color-price">

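the resulting object is a bs4.element.Tag.
A Tag object can be searched further with .children, and the return value of children is a list_iterator object, so the price range can be found like this: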

# Locate the price spans by class, as described above
price = soup.find_all("span", attrs={'class': 'a-size-base a-color-price'})
i = 0
for item in price:
    print("===", i, "===")
    # item.children is a list_iterator; turn it into a list so it can be indexed
    l = list(item.children)
    # The first child holds the low price and the last child the high price;
    # strip the euro sign and convert the decimal comma to a dot
    p1 = float(l[0].string.replace(" €", "").replace(",", "."))
    p2 = float(l[-1].string.replace(" €", "").replace(",", "."))
    print(p1)
    print(p2)
    i += 1

The output is as follows:

=== 0 ===
3.3
23.52
=== 1 ===
11.99
11.99
=== 2 ===
19.95
24.99
=== 3 ===
4.99
21.22
=== 4 ===
30.84
30.84
=== 5 ===
8.95
23.95
=== 6 ===
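
The same conversion can also be done on the text of the price span as a whole, which avoids depending on how many child nodes it has. The sketch below assumes the text looks like `3,30 € - 23,52 €` or `11,99 €` (decimal comma, euro sign after the amount), in line with the replace calls in the loop; `parse_price_range` is a hypothetical helper name.

```python
import re

# One amount with a decimal comma, e.g. "3,30" or "1.299,00".
AMOUNT = re.compile(r"\d[\d.]*,\d{2}")


def parse_price_range(text):
    """Return (low, high) as floats, or None if no price is found."""
    amounts = [
        # Drop thousands dots, then turn the decimal comma into a dot
        float(m.replace(".", "").replace(",", "."))
        for m in AMOUNT.findall(text)
    ]
    if not amounts:
        return None
    return min(amounts), max(amounts)


print(parse_price_range("3,30 € - 23,52 €"))
print(parse_price_range("11,99 €"))
```

A single price comes back with low equal to high, matching the way the loop prints the same value twice for such items.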

What remains is checking whether a product has fewer than 10 reviews and whether its top-level category rank is within 7000.
