Scraping all product information from Huawei Vmall with Scrapy -- technology, one step ahead

Latest recommended article published on 2022-03-26 12:11:03

孔丘闻言

Categories: web scraping, python. Tags: python, scrapy

Article link: https://blog.csdn.net/xiaodsadwwq/article/details/93796326


Huawei Vmall: https://www.vmall.com/index.html

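Goal: collect the product information listed on Huawei Vmall

Group products by the categories in the left-hand menu of the homepage: phones, laptops & tablets, wearables, ...
For every subcategory under each category, collect:
- Product title
- Product price
Specifications:
- Main parameters
- Body
- ...
- Product code
Write the results to Excel
Set up the Excel worksheet, then analyze the data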
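
The fields listed above can be pictured as one flat record per product. The following is a minimal sketch of that layout using only the standard library, not the `HuaweiItem` class from the post (its definition is not shown). The names `ProductRecord`, `build_header`, and `to_row`, and every field other than `classify_A`, `classify_B`, and `title`, are assumptions made for illustration. Placing the product code in the first column matches the spider, which reads column 0 of the workbook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


# Hypothetical record layout. Only classify_A, classify_B and title appear
# in the spider; the remaining field names are assumptions.
@dataclass
class ProductRecord:
    code: str        # product code, kept in the first column
    classify_A: str  # top-level category from the homepage menu
    classify_B: str  # subcategory under classify_A
    title: str
    price: str
    specs: Dict[str, str] = field(default_factory=dict)  # spec name -> value


FIXED_COLUMNS = ["code", "classify_A", "classify_B", "title", "price"]


def build_header(records: List[ProductRecord]) -> List[str]:
    """Fixed columns first, then every spec name in first-seen order."""
    header = list(FIXED_COLUMNS)
    for record in records:
        for name in record.specs:
            if name not in header:
                header.append(name)
    return header


def to_row(record: ProductRecord, header: List[str]) -> List[str]:
    """Flatten one record into a row aligned with the header."""
    fixed = {name: getattr(record, name) for name in FIXED_COLUMNS}
    # Specs that a product does not have are left as empty cells.
    return [fixed.get(col, record.specs.get(col, "")) for col in header]
```

Because different products expose different specification names, the header is built from all records first and each row is padded with empty cells, which keeps the sheet rectangular for later analysis.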

The code (Scrapy) is as follows:

# -*- coding: utf-8 -*-
import os
import re
import urllib.request
from copy import deepcopy

import scrapy
import xlrd
import xlwt
from ..items import HuaweiItem


class HuaWei(scrapy.Spider):
    name = 'huawei'
    allowed_domains = ['vmall.com', 'vmallres.com']
    start_urls = ['http://vmall.com/']

    def parse(self, response):
        self.new_xls()
        # Homepage
        print("divider-----------------------homepage------------------------divider")
        classify_list_A = response.xpath('//div[@id="category-block"]/div/ol/li')
        print("Number of top-level categories:", len(classify_list_A))
        for category in classify_list_A:
            # print("Current position:", classify_list_A)
            item = HuaweiItem()
            item['classify_A'] = category.xpath('.//input[2]/@value').extract_first()
            classify_list = category.xpath('.//div[2]//li[not(@class="subcate-btn")]')
            # classify_list = category.xpath('.//div[2]//li[last()]')
            # Use a distinct loop variable so the outer one is not shadowed
            for sub in classify_list:
                item['classify_B'] = sub.xpath('.//input[1]/@value').extract_first()
                link = sub.xpath('.//a/@href').extract_first()
                if link is None:
                    # Skip entries without a link instead of requesting ".../None-1-3-0"
                    continue
                href = "https://www.vmall.com" + link + '-1-3-0'
                # print("href:", href)
                yield scrapy.Request(
                    href,
                    callback=self.parse_A,
                    meta={"item": deepcopy(item)},
                )
        rb = xlrd.open_workbook('华为商城.xls')
        # Get the first sheet via sheet_by_index()
        rs = rb.sheet_by_index(0)
        print("Number of products already scraped:", rs.nrows - 1)

    def parse_A(self, response):
        # Intermediate (listing) page
        print("divider-----------------------listing page------------------------divider")
        li_list = response.xpath('//div[@class="layout"]/div[@class="channel-list"]/div[@class="pro-list clearfix"]/ul/li')
        if li_list:
            print("Scraping page:", response.request.url)
            print("Number of products on this page:", len(li_list))
            for i in li_list:
                # Copy the item so each product gets its own instance
                item = deepcopy(response.meta["item"])
                rb = xlrd.open_workbook('华为商城.xls')
                # Get the first sheet via sheet_by_index()
                rs = rb.sheet_by_index(0)
                cods = rs.col_values(0, start_rowx=0, end_rowx=None)
                item['title']
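
In `parse`, the listing URL for each subcategory is assembled by string concatenation from the site root, the extracted `href`, and the suffix `-1-3-0`. Below is a small standalone sketch of the same step using `urllib.parse.urljoin` from the standard library. The helper name `build_list_url` and the constants are hypothetical, and the meaning of the `-1-3-0` suffix is not explained in the post, so it is copied unchanged.

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.vmall.com"
LIST_SUFFIX = "-1-3-0"  # copied from the spider; its meaning is not documented in the post


def build_list_url(href: Optional[str]) -> Optional[str]:
    """Return the absolute listing URL, or None when the href is missing or empty."""
    if not href:
        return None
    # urljoin resolves the extracted href against the site root
    return urljoin(BASE_URL, href.strip()) + LIST_SUFFIX
```

Returning `None` for a missing `href` lets the caller skip the entry, whereas wrapping the value in `str()` would turn a missing link into the literal text `None` inside the URL.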
