Scrapy crawler example: scraping Ganji rental listings and saving them to a database

Beeman_xia

Category: python. Tags: scrapy, crawler

Article link: https://blog.csdn.net/Beeman_xia/article/details/78527264

This example consists of the following steps:

1. Using scrapy shell

2. Creating the Scrapy project

3. Creating the database

4. Writing the spider

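I. Using scrapy shell

Install PyCharm Professional (its database tool is used later) and IPython.

Open the Windows Command Prompt and type scrapy; it prints the available commands.

You can also use PyCharm's built-in Terminal, found in the bottom bar once PyCharm is open.

To scrape rental listings, run scrapy shell url. In this example the url is http://tj.ganji.com/fang1/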

scrapy shell http://tj.ganji.com/fang1/

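The shell then prints the response status and the objects available in the session; 200 means the request succeeded.

If the status is hard to spot in the output, simply type response to see it.

Another option is view(response), which opens the browser and displays a temporary copy of the page cached locally.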

view(response)

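Next, use XPath to pick out the content to scrape.

To work out an XPath you can use the Firebug and FirePath add-ons for Firefox together (both are legacy add-ons; the built-in developer tools of current browsers can also copy an element's XPath).

Choose an XPath that matches every price on the page.

Then enter it at the scrapy shell prompt: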

response.xpath(".//div[@class='price']/span[1]/text()").extract()

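In XPath, text() selects the text content of a node. extract() serializes the matched nodes to unicode strings and returns them as a list.

That is how scrapy shell retrieves all the prices; other fields can be fetched the same way by changing the XPath.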
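
To see what the price expression selects without hitting the live site, here is a minimal offline sketch. It is not Scrapy itself: it uses the standard library's ElementTree, which supports only a subset of XPath (no text() step, so the text is read from each element instead). The markup and the helper name first_span_texts are made up for illustration; the real Ganji page is far larger and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified markup standing in for the listing page.
SAMPLE = """
<html><body>
  <div class="f-list-item">
    <div class="price"><span>1500</span><span>per month</span></div>
  </div>
  <div class="f-list-item">
    <div class="price"><span>2300</span><span>per month</span></div>
  </div>
</body></html>
"""


def first_span_texts(markup):
    """Return the text of the first <span> inside every <div class='price'>."""
    root = ET.fromstring(markup)
    # Same path as in the shell session, minus the trailing /text() step,
    # which ElementTree's limited XPath does not support.
    spans = root.findall(".//div[@class='price']/span[1]")
    return [span.text for span in spans]


prices = first_span_texts(SAMPLE)
print(prices)
```

The [1] predicate keeps only the first span in each price block, which is why the unit text in the second span is left out.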

II. Creating the Scrapy project

The command list printed by scrapy earlier includes startproject, which creates a Scrapy project. In the command prompt, enter:

scrapy startproject XXX

where XXX is the project name, for example:

scrapy startproject alalala

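It is best not to create the project on the C: drive, because that can run into permission problems. I ended up re-creating the project elsewhere (the settings shown later refer to it as zufang).

In PyCharm, go to File --> Open and select the new project; the project structure is then displayed:

__init__.py: keep the default, do not modify

items.py: containers for the scraped data

middlewares.py: middleware configuration

settings.py: project settings, such as download delay

pipelines.py: the item pipeline, which cleans the data in the items passed to it and writes it to the database

spiders directory: initially contains only __init__.py; spider classes are defined here and inherit from scrapy.Spider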

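III. Creating the database

PyCharm ships with graphical tools for many databases; here we use SQLite.

For convenience, we work directly in PyCharm's Terminal.

In the Terminal, change into the project directory and type ipython. This is an interactive Python shell that is more comfortable than the stock one.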

ipython

Then enter:

In [1]: import sqlite3

In [2]: zufang=sqlite3.connect('zufang.sqlite')

In [3]: create_table='create table zufang(title varchar(512),price varchar(128))'

In [4]: zufang.execute(create_table)

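That creates the database.

The zufang.sqlite file now appears in the project tree.

Drag this file into the Database tool window on the right.

The Database window sits on the right by default. If it is missing, either PyCharm is not the Professional edition or the window is hidden:

1. Professional edition activation (server license): http://idea.imsxm.com/

2. The window is hidden: open it from View > Tool Windows > Database.

Alternatively, create the database directly with the graphical tool.

Once created, the Database window shows the following:

main is the default database

sqlite_master is the system table

The database is now ready.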
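
Without the Database tool window, the same check can be done from Python, since sqlite_master stores the CREATE statement of every table. The sketch below is one way to do it; table_sql is a hypothetical helper name, and an in-memory database is used so it runs anywhere. Pass 'zufang.sqlite' to sqlite3.connect to inspect the real file.

```python
import sqlite3


def table_sql(con, table):
    """Return the stored CREATE statement for `table`, or None if it does not exist."""
    row = con.execute(
        "select sql from sqlite_master where type = 'table' and name = ?",
        (table,),
    ).fetchone()
    return row[0] if row else None


# In-memory database for the demonstration only.
con = sqlite3.connect(":memory:")
con.execute("create table zufang(title varchar(512),price varchar(128))")

print(table_sql(con, "zufang"))   # the CREATE statement as SQLite stored it
print(table_sql(con, "missing"))  # None, because no such table exists
con.close()
```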

IV. Writing the spider

Create a Python file in the spiders directory. Any file name works; mine is called spider.

spider.py:

import scrapy
from ..items import ZufangItem


class GanJi(scrapy.Spider):  # The spider class; it inherits from scrapy.Spider
    # This spider needs all three of name, start_urls and parse
    # name identifies the spider and must be unique within the project;
    # running `scrapy list` in the terminal prints the names of the project's spiders
    name = "zufang"
    start_urls = ["http://tj.ganji.com/fang1/"]  # URLs the spider starts crawling from

    def parse(self, response):  # Parse the downloaded page
        print(response)
        # Listing titles
        title_list = response.xpath(".//div[@class='f-list-item ershoufang-list']/dl/dd/a/text()").extract()
        # Listing prices
        price_list = response.xpath(".//div[@class='price']/span[1]/text()").extract()
        for title, price in zip(title_list, price_list):
            zf = ZufangItem()  # A fresh item per listing, so yielded items do not share state
            zf['title'] = title
            zf['price'] = price
            yield zf
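
One thing to keep in mind about this loop: titles and prices are extracted as two separate lists and paired by position, and zip silently stops at the shorter list. If a listing on the page lacked a price, later rows could be misaligned or dropped without any error. The sketch below shows one possible guard in plain Python; pair_listings is a hypothetical helper and the sample values are made up.

```python
def pair_listings(titles, prices):
    """Pair titles with prices by position, refusing lists of different length."""
    if len(titles) != len(prices):
        raise ValueError(
            "got {} titles but {} prices; the page layout may have changed".format(
                len(titles), len(prices)
            )
        )
    # Strip the surrounding whitespace that scraped text nodes often carry.
    return [
        {"title": title.strip(), "price": price.strip()}
        for title, price in zip(titles, prices)
    ]


# Made-up values standing in for the lists returned by extract().
listings = pair_listings(["  Two-bedroom near the metro ", "Studio"], ["1500", " 2300"])
print(listings)
```

Whether to raise, log a warning, or skip the page on a mismatch is a design choice; raising is simply the loudest option while developing the spider.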

items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ZufangItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    price = scrapy.Field()
    # pass

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import sqlite3

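class ZufangPipeline(object):
    def open_spider(self, spider):
        # The path is relative to the directory the crawl is started from
        self.con = sqlite3.connect("zufang.sqlite")
        self.cu = self.con.cursor()

    def process_item(self, item, spider):
        print(spider.name, 'pipelines')
        # Bind values with ? placeholders so quotes in scraped text cannot break the statement
        insert_sql = "insert into main.zufang(title,price) values(?,?)"
        values = (item['title'], item['price'])
        print(insert_sql, values)
        self.cu.execute(insert_sql, values)
        self.con.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; release the database connection
        self.con.close()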
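
Scraped titles can contain quote characters, so how the INSERT statement is built matters. The standalone sketch below compares formatting values into the SQL string with binding them through ? placeholders. It uses an in-memory database and a made-up title; insert_listing is a hypothetical helper, not part of the project.

```python
import sqlite3


def insert_listing(con, title, price):
    """Insert one row, letting sqlite3 bind the values."""
    con.execute("insert into zufang(title, price) values (?, ?)", (title, price))
    con.commit()


con = sqlite3.connect(":memory:")
con.execute("create table zufang(title varchar(512), price varchar(128))")

tricky_title = "Landlord's flat near the metro"  # made-up value containing a quote

# Formatting the value into the statement produces broken SQL here,
# because the apostrophe ends the string literal early.
formatted_sql = "insert into zufang(title, price) values('{}','{}')".format(tricky_title, "1500")
try:
    con.execute(formatted_sql)
    formatted_ok = True
except sqlite3.OperationalError:
    formatted_ok = False

# The placeholder version stores the same value without trouble.
insert_listing(con, tricky_title, "1500")
rows = con.execute("select title, price from zufang").fetchall()
print(formatted_ok, rows)
```

Placeholders also keep scraped text from being interpreted as SQL, which matters whenever the data comes from a page you do not control.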

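settings.py

A few things need changing in this file, around line 67: uncomment the ITEM_PIPELINES block and point it at the pipeline class.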

# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'zufang.pipelines.ZufangPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html

Finally, run the spider.

The command is scrapy crawl XXX, where XXX is the spider's name attribute:

scrapy crawl zufang --nolog  # --nolog suppresses log output

Then take another look at the database.
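
If the Database tool window is not available, the result can also be checked with a few lines of Python. This is a sketch under assumptions: preview_listings is a hypothetical helper, and the demonstration fills a throwaway database with made-up rows so that it runs on its own. For the real data, call preview_listings("zufang.sqlite") from the directory where the crawl was started.

```python
import os
import sqlite3
import tempfile


def preview_listings(db_path, limit=5):
    """Return (total row count, first `limit` rows) from the zufang table."""
    con = sqlite3.connect(db_path)
    try:
        total = con.execute("select count(*) from zufang").fetchone()[0]
        sample = con.execute(
            "select title, price from zufang limit ?", (limit,)
        ).fetchall()
    finally:
        con.close()
    return total, sample


# Throwaway database with made-up rows, for the demonstration only.
demo_path = os.path.join(tempfile.mkdtemp(), "zufang.sqlite")
demo_con = sqlite3.connect(demo_path)
demo_con.execute("create table zufang(title varchar(512),price varchar(128))")
demo_con.executemany(
    "insert into zufang(title,price) values(?,?)",
    [("Room A", "1500"), ("Room B", "2300")],
)
demo_con.commit()
demo_con.close()

total, sample = preview_listings(demo_path, limit=1)
print(total, sample)
```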

All done~
