前言:通过实例学习了解Scrapy爬虫框架的使用,并把爬取到的数据保存到数据库中和保存成一个Json格式的文件。
项目分析:
项目名:phone 爬虫名:getphone 爬取的网址:http://www.jihaoba.com/escrow/ 集号吧
分析爬取的字段:
//div[@class="numbershow"]/ul
手机号码:
./li[@class="number hmzt"]/a/@href
需要使用正则匹配获取电话号码
re("\\d{11}") # 11位电话号码
价格:
./li[@class="price"]/span/text()
存入数据库需要对人民币符号进行格式处理
replace("¥", "")
归属地:
./li[@class="brand"]/text()
项目流程:
-
创建项目
scrapy startproject phone
-
创建爬虫器
cd phone && scrapy genspider getphone jihaoba.com
-
设计爬虫器
from ..items import PhoneItem class GetphoneSpider(scrapy.Spider): name = 'getphone' allowed_domains = ['jihaoba.com'] start_urls = ['http://www.jihaoba.com/escrow/'] def parse(self, response): # print(response) lists = response.xpath('//div[@class="numbershow"]/ul') for list in lists: phoneitem = PhoneItem() # 电话号码 phoneitem['number'] = list.xpath('li[@class="number hmzt"]/a/@href').re("\\d{11}")[0] # 价格 price = list.xpath('li[@class="price"]/span/text()').extract()[0] phoneitem['price'] = price.replace("¥", "") # 归属地 phoneitem['brand'] = list.xpath('li[@class="brand"]/text()').extract()[0] # print(number, price, brand) yield phoneitem # 继续下一页 # next = "http://www.jihaoba.com" + response.xpath('//a[@class="m-pages-next"]/@href').extract_first() # yield scrapy.Request(next)
-
设置项目 items
class PhoneItem(scrapy.Item): # define the fields for your item here like: # name = scrapy.Field() number = scrapy.Field() price = scrapy.Field() brand = scrapy.Field()
-
设置管道 pipelines
import MySQLdb from .settings import mysql_host, mysql_user, mysql_passwd, mysql_db, mysql_port # 保存到数据库 class putMySQLPipeline: def __init__(self): host = mysql_host user = mysql_user passwd = mysql_passwd db = mysql_db port = mysql_port # 连接数据库 self.connection = MySQLdb.connect(host, user, passwd, db, port, charset='utf8') # 获取游标 self.cursor = self.connection.cursor() def process_item(self, item, spider): sql = "insert into phonetable(number,price,brand) values(%s,%s,%s)" param = list() # 创建列表存放数据 param.append(item['number']) param.append(item['price']) param.append(item['brand']) self.cursor.execute(sql, tuple(param)) # 执行操作,tuple()表示将列表转换为元组 self.connection.commit() # 提交事务 return item def __del__(self): self.cursor.close() self.connection.close() class JsonWithEncodingPipeline(object): def __init__(self): self.file = codecs.open("phone.json", "a", encoding="utf-8") def process_item(self, item, spider): lines = json.dumps(dict(item), ensure_ascii=False) + "\n" # dict() 创建一个字典 self.file.write(lines) return item def __del__(self): self.file.close()
-
设置 settings
# Configure item pipelines # See https://docs.scrapy.org/en/latest/topics/item-pipeline.html # 管道优先级(数字越小优先级越高) ITEM_PIPELINES = { # 'phone.pipelines.PhonePipeline': 300, 'phone.pipelines.putMySQLPipeline': 300, 'phone.pipelines.JsonWithEncodingPipeline': 400 } mysql_host = '192.168.142.200' # 数据库IP地址 mysql_user = 'root' # 数据库用户名 mysql_passwd = '123456' # 数据库密码 mysql_db = 'db01' # 数据库名 mysql_tb = 'phonetable' # 数据表名 mysql_port = 3306 # 数据库端口号
-
创建 start.py
from scrapy import cmdline cmdline.execute("scrapy crawl getphone".split())
-
数据库
-
建立数据库
create database db01;
-
数据库授权
grant all privileges on db01.* 'root'@'%' identified by '123456';
-
建数据表
create table phone ( number bigint, price int, brand varchar(50) );
-
设置字符集编码
-
查看数据库字符集
show create database db01;
-
修改数据库字符编码
alter database db01 character set utf8;
-
查看数据表字符集
show create table phone;
-
修改数据表字符编码
alter table phone character set utf8;
-
-
-
运行项目
直接运行 start.py