Python Crawler Project: 58.com Second-Hand Goods Crawler

Nicolas Acci

Article link: https://blog.csdn.net/qq_41635501/article/details/96854301

Python Crawler in Practice: 58.com Second-Hand Goods

Target URL: http://bj.58.com/sale.shtml

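Crawler task: scrape the product URLs from the first-level (listing) page, follow each one to the second-level (detail) page, scrape the product information there, and save the data.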

Step 1: Page parsing

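First, the product URLs on the first-level page have to be scraped. The listings on that page are laid out as li elements, so I used XPath Helper to inspect the front-end markup and work out the XPath.

Problem: during testing, only the first value could be scraped.

**Solution:** fetching the page with Selenium + Chrome returns the complete page.

Problem: when following links to the second-level pages, only one URL was crawled.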

Error output:
//bj.58.com/shouji/38674092781339x.shtml
2019-07-13 09:58:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/sale.shtml>
{'goods_url': '//bj.58.com/shouji/38674092781339x.shtml'}
2019-07-13 09:58:39 [scrapy.core.scraper] ERROR: Spider error processing <GET https://bj.58.com/sale.shtml> (referer: None)

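Solution:
- The URL is wrong. The extracted href is protocol-relative (it starts with //), so it has to be joined into a complete address before it is requested.
- allowed_domains = ['allowed domains']: if the domain of a parsed URL is not in this list, the request for that URL is not sent.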
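
The two points above can be tried out without running the spider. The following is a minimal sketch using only the standard library; the helper names `absolute_url` and `is_allowed` are made up for illustration, and `is_allowed` only approximates the domain check that Scrapy applies, it is not Scrapy's own code.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://bj.58.com/sale.shtml"
ALLOWED_DOMAINS = ["bj.58.com"]  # mirrors the spider's allowed_domains


def absolute_url(page_url, href):
    """Resolve an href (possibly protocol-relative) against the page URL."""
    return urljoin(page_url, href)


def is_allowed(url, allowed_domains):
    """Approximate the offsite check: the host or one of its parents is listed."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


href = "//bj.58.com/shouji/38674092781339x.shtml"  # value from the log above
full = absolute_url(PAGE_URL, href)
print(full)
print(is_allowed(full, ALLOWED_DOMAINS))
```

Inside a Scrapy callback, `response.urljoin(href)` does the same joining against the URL of the current response.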

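Step 2: Data scraping

Two parsing methods are used: bs4 and XPath. After scraping, pay attention to converting the extracted values into strings.
The spider.py code is as follows: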

# -*- coding: utf-8 -*-
import scrapy
from bs4 import BeautifulSoup
from city58_second_goods.items import City58SecondGoodsItem, GoodsItem


class SecondGoodsSpider(scrapy.Spider):
    name = 'second_goods'
    # Requests whose domain is not listed in allowed_domains are filtered out and never sent.
    allowed_domains = ['bj.58.com']
    start_urls = ['https://bj.58.com/sale.shtml']

    def parse(self, response):
        # Collect the links to the second-level (detail) pages.
        hrefs = response.xpath('//div/ul/li/a/@href').extract()
        if hrefs:
            hrefs.pop(0)  # drop the first matched link
        # Only the first four links are followed here.
        for href in hrefs[:4]:
            # Create a fresh item per URL so earlier items are not overwritten.
            item = City58SecondGoodsItem()
            item["goods_url"] = href
            yield item
            # The hrefs are protocol-relative ("//bj.58.com/..."); urljoin adds the scheme.
            yield scrapy.Request(url=response.urljoin(href), callback=self.goods_info)

    def goods_info(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        try:
            # Keep the part of the page title before " - ".
            title = soup.title.get_text().split(" - ")[0]
            # select() returns a list; [0].get_text() extracts the text of the first match.
            temp_time = soup.select("div.detail-title__info > div")[0].get_text()
            post_time = temp_time.split(" ")[0]
            temp_price = soup.select("span.infocard__container__item__main__text--price")[0].get_text()
            price = temp_price.split()[0]
            temp = soup.select("div.infocard__container > div:nth-of-type(2) > div:nth-of-type(2)")[0].get_text()
            # '成新' marks the condition field (how new the item is); it is not always present.
            if '成新' in temp:
                color = temp
                temp_area = soup.select("div.infocard__container > div:nth-of-type(3) > div:nth-of-type(2)")[0]
            else:
                color = None
                temp_area = soup.select("div.infocard__container > div:nth-of-type(2) > div:nth-of-type(2)")[0]
            # Drop the "-" and ">" separator strings and keep only the real values.
            area = [s for s in temp_area.stripped_strings if s.replace("-", '')]
            cate = [s for s in soup.select("div.nav")[0].stripped_strings if s.replace(">", '')]
        except (IndexError, AttributeError):
            # A selector matched nothing, e.g. the listing is gone or the layout changed.
            self.logger.warning("Could not parse goods page: %s", response.url)
            return
        item = GoodsItem()
        item['goods_title'] = title
        item['goods_time'] = post_time
        item['goods_price'] = price
        item['goods_color'] = color
        # Lists are converted to strings so every pipeline can store them.
        item['goods_area'] = str(area)
        item['goods_cate'] = str(cate)
        yield item
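
The area and category clean-up in goods_info keeps only the strings that are still non-empty once the separator character is removed. Here is a small standalone sketch of that idea; the helper name `drop_separators` and the sample lists are hypothetical and only stand in for what `stripped_strings` might return.

```python
def drop_separators(strings, separator):
    """Keep only the strings that are non-empty once the separator is removed."""
    return [s for s in strings if s.replace(separator, "")]


# Hypothetical pieces of text, standing in for stripped_strings output.
area_parts = ["朝阳", "-", "望京"]
cate_parts = ["二手市场", ">", "二手手机"]

print(drop_separators(area_parts, "-"))
print(drop_separators(cate_parts, ">"))
```

A string made up only of the separator becomes empty after `replace`, is falsy, and is therefore dropped; every other string is kept unchanged.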

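Step 3: Saving the data

The data is saved through item pipelines. I wrote four pipelines (MongoDB, MySQL, xlsx, CSV); pick whichever you like. They are enabled in ITEM_PIPELINES in settings.py: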

   'city58_second_goods.pipelines.MongodbSecondGoodsPipeline': 400,  # save to MongoDB
   'city58_second_goods.pipelines.MysqlSecondGoodsPipeline': 400,  # save to MySQL
   'city58_second_goods.pipelines.XslxSecondGoodsPipeline': 400,  # save to xlsx
   'city58_second_goods.pipelines.CsvSecondGoodsPipeline': 400,  # save to csv
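
The numbers are priorities: Scrapy passes each item through the enabled pipelines from the lowest value to the highest. The sketch below shows one way the setting could look with distinct priorities; the specific values 400 to 430 are an assumption for illustration and do not come from the original project, which uses 400 for all four.

```python
# Possible ITEM_PIPELINES setting for settings.py (priority values are illustrative).
ITEM_PIPELINES = {
    'city58_second_goods.pipelines.MongodbSecondGoodsPipeline': 400,
    'city58_second_goods.pipelines.MysqlSecondGoodsPipeline': 410,
    'city58_second_goods.pipelines.XslxSecondGoodsPipeline': 420,
    'city58_second_goods.pipelines.CsvSecondGoodsPipeline': 430,
}

# Lower values run first, so sorting by value gives the processing order.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for path in run_order:
    print(ITEM_PIPELINES[path], path.rsplit('.', 1)[-1])
```

To use only one storage backend, keep just that entry in the dictionary.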

pipelines.py

Saving to MongoDB

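from pymongo import MongoClient
from city58_second_goods.items import GoodsItem


class MongodbSecondGoodsPipeline(object):
    '''Save items to MongoDB.'''
    def __init__(self,
                 databaseIp='127.0.0.1',
                 databasePort=27017,
                 # user="mongo",
                 # password=None,  # no user or password is configured
                 mongodbName='second_goods'):
        client = MongoClient(databaseIp, databasePort)
        self.db = client[mongodbName]
        # self.db.authenticate(user, password)

    def process_item(self, item, spider):
        if isinstance(item, GoodsItem):
            postItem = dict(item)  # convert the item to a dict
            # insert_one replaces Collection.insert, which was removed in PyMongo 4.
            self.db.scrapy.insert_one(postItem)  # insert one record into the "scrapy" collection
        return item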

Saving to MySQL

import pymysql
from city58_second_goods.items import GoodsItem


class MysqlSecondGoodsPipeline(object):
    '''Save items to MySQL.'''

    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',
            'port': 3306,
            'user': 'root',
            'password': '123456',
            'database': 'second_goods',
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Only GoodsItem carries the six fields below; URL items pass through untouched.
        if isinstance(item, GoodsItem):
            self.cursor.execute("""
                insert into goods(goods_title,goods_time,goods_price,goods_color,goods_area,goods_cate) values(%s,%s,%s,%s,%s,%s)
                """, (
                item['goods_title'], item['goods_time'], item['goods_price'], item['goods_color'], item['goods_area'],
                item['goods_cate']))
            self.conn.commit()
        return item

    def close_spider(self, spider):
        # Release the database resources when the spider finishes.
        self.cursor.close()
        self.conn.close()

The MySQL table I created has the following structure:
[Image: MySQL table structure]
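
Since the table structure is only shown as an image, here is a rough sketch of a table with the six columns the insert statement uses, together with a parameterized insert. To keep it runnable without a database server it uses the standard library's sqlite3 instead of MySQL; the TEXT column types, the id column, and the sample row are all assumptions. Note that sqlite3 uses ? placeholders where pymysql uses %s.

```python
import sqlite3

COLUMNS = ["goods_title", "goods_time", "goods_price",
           "goods_color", "goods_area", "goods_cate"]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE goods (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    + ", ".join(f"{c} TEXT" for c in COLUMNS) + ")"
)

# Hypothetical scraped row; goods_color is None when the page has no condition field.
row = {
    "goods_title": "example phone",
    "goods_time": "2019-07-13",
    "goods_price": "1000",
    "goods_color": None,
    "goods_area": "['朝阳']",
    "goods_cate": "['二手手机']",
}

# Values are passed separately from the SQL text, so they are escaped by the driver.
sql = "INSERT INTO goods ({}) VALUES ({})".format(
    ", ".join(COLUMNS), ", ".join("?" * len(COLUMNS)))
conn.execute(sql, [row[c] for c in COLUMNS])
conn.commit()

stored = conn.execute("SELECT goods_title, goods_color FROM goods").fetchall()
print(stored)
```

A None value is stored as NULL, so the condition column should allow NULL in the real table as well.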

Saving to an xlsx file

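from openpyxl import Workbook
from city58_second_goods.items import GoodsItem


class XslxSecondGoodsPipeline(object):
    '''Save items to an xlsx file.'''

    def open_spider(self, spider):
        # Create the Excel workbook and take its active sheet.
        self.wb = Workbook()
        self.ws = self.wb.active
        # Write the header row.
        self.ws.append(['Title', 'Time', 'Price', 'Color', 'Area', 'Category'])

    def process_item(self, item, spider):
        # Only GoodsItem has these fields; URL items pass through untouched.
        if isinstance(item, GoodsItem):
            # The order of the list must match the header row.
            line = [item['goods_title'], item['goods_time'], item['goods_price'], item['goods_color'],
                    item['goods_area'], item['goods_cate']]
            self.ws.append(line)
        return item

    def close_spider(self, spider):
        self.wb.save('goods.xlsx')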

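Problem: creating the xlsx file raised an error:

AttributeError: 'City58SecondGoodsPipeline' object has no attribute 'ws'

Solution: upgrading the package version did not help. What worked was restructuring the pipeline class into the three methods below:

class City58SecondGoodsPipeline(object):
    def open_spider(self, spider):
        ...  # create self.wb and self.ws here

    def process_item(self, item, spider):
        ...  # append one row to self.ws, then return item

    def close_spider(self, spider):
        self.wb.save('goods.xlsx')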

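Problem: some fields could not be written to the xlsx file:

raise ValueError("Cannot convert {0!r} to Excel".format(value))
ValueError: Cannot convert ['朝阳'] to Excel

Solution: a cell cannot hold a list, and two of the fields were lists rather than strings.
Use print(type(project)) to check the type of project.
Then convert it to a string; I used escape(project).encode('utf-8'), and in the spider code above the two fields are converted with str(area) and str(cate).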
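
As an alternative to str(), which keeps the brackets and quotes in the cell (for example "['朝阳']"), the list can be joined into a plain string before it is written. The helper below is a hypothetical sketch, not part of the original project.

```python
def to_cell_value(value, separator=", "):
    """Turn lists and tuples into one string; leave other values unchanged."""
    if isinstance(value, (list, tuple)):
        return separator.join(str(v) for v in value)
    return value


print(to_cell_value(['朝阳']))
print(to_cell_value(['a', 'b']))
print(to_cell_value(None))
```

Applying it to every element of the row before calling `self.ws.append(line)` would avoid the ValueError for list-valued fields.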

Saving to CSV/JSON

import json
from city58_second_goods.items import City58SecondGoodsItem, GoodsItem


class CsvSecondGoodsPipeline(object):
    '''Save items to csv or json.'''

    def process_item(self, item, spider):
        # Each item is written as one JSON object per line; the target file depends on the item type.
        if isinstance(item, City58SecondGoodsItem):  # check which item this is
            json_str = json.dumps(dict(item), ensure_ascii=False)
            with open("url.csv", "a", encoding="utf-8") as f:
                f.write(json_str + '\n')
        elif isinstance(item, GoodsItem):
            json_str = json.dumps(dict(item), ensure_ascii=False)
            with open("goods.csv", "a", encoding="utf-8") as f:
                f.write(json_str + '\n')
        return item
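
The pipeline above writes one JSON object per line into files that carry a .csv extension. If comma-separated output is wanted instead, the standard library's csv module can be used. The following is a minimal sketch with a hypothetical `write_rows` helper and a made-up sample row; it writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io

FIELDS = ["goods_title", "goods_time", "goods_price",
          "goods_color", "goods_area", "goods_cate"]


def write_rows(stream, rows):
    """Write a header line followed by one CSV line per row dict."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buffer = io.StringIO()
write_rows(buffer, [{
    "goods_title": "example phone",
    "goods_time": "2019-07-13",
    "goods_price": "1000",
    "goods_color": None,          # None is written as an empty field
    "goods_area": "朝阳",
    "goods_cate": "二手手机",
}])
lines = buffer.getvalue().splitlines()
print(lines)
```

When writing to a real file with the csv module, open it with `newline=""` so that line endings are not doubled on Windows.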

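That is all for today's crawler project. There are many other ways to implement parts of this code, and I will not go through them one by one. The source code is here: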
https://github.com/NicolasAcci/Python-Spider/tree/master/city58_second_goods
