Python and Scrapy: My First Spider

networksu

Category: scrapy. Tags: scrapy

Article link: https://blog.csdn.net/networksu/article/details/85053750

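My goal in learning Python has always been to move into data analysis, but data analysis is the kind of work that you may want to learn yet that really needs the right environment. Studying it on your own at home requires a lot of data. So why not start with what Python is best known for: web crawling.

I started by scraping data I actually need from close to home, and settled on the Shenzhen real estate information system, a public system for looking up property information. It covers building numbers, floor areas, and ownership information. The data is both close to everyday life and worth analyzing. OK, let's start scraping.

The site is fairly old, probably a system from around 2009. It was built with ASP.NET; I recognized this because I used to write ASP.NET myself, and many of its controls are the stock .NET ones. Much of the front-end code is auto-generated, the pagination for example.

The Python framework used is Scrapy, a well-known crawling framework. The methods you implement in it are all callbacks, so the whole crawl runs asynchronously. Scrapy is built on the Twisted event loop, so requests overlap without the spider code itself being multithreaded.

First install Scrapy, which is as simple as pip install scrapy. You may hit some problems along the way; deal with them as they come. I won't go into them here; search Baidu when you run into one.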

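1) Create the Scrapy project. Scraping one website can be treated as one project.

In the target directory, run scrapy startproject rishome

The generated project is structured as follows.

a) items is similar to the object classes of an ORM system, corresponding to database tables

b) the spiders directory holds the spiders; here I created one named itcast

c) pipelines handles writing items to the database

d) middlewares is not used here, so I will skip it and add notes when I need it

e) settings holds the settings for this crawler

Execution starts in itcast, which requests a page and receives the HTML back. It parses the HTML and packs the data into items objects, and the items are then pushed to pipelines for processing. The whole process is asynchronous: it is driven by callbacks on an event loop rather than by one thread per request.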
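As a rough illustration of callback-driven concurrency on a single thread, here is a minimal sketch. It uses asyncio from the standard library purely as a stand-in (Scrapy itself runs on Twisted), and fetch, parse, and crawl are hypothetical names invented for this example, not Scrapy APIs.

```python
import asyncio
import threading


async def fetch(url):
    # Hypothetical stand-in for a network request: it only waits briefly.
    await asyncio.sleep(0.01)
    return {"url": url, "thread": threading.get_ident()}


def parse(response, results):
    # Callback invoked once a "response" is ready, like a spider callback.
    results.append(response)


async def crawl(urls):
    results = []
    # All requests are in flight at once, yet everything runs on one thread.
    for done in asyncio.as_completed([fetch(u) for u in urls]):
        parse(await done, results)
    return results


results = asyncio.run(crawl(["page-%d" % i for i in range(5)]))
threads = {r["thread"] for r in results}
print(len(results), len(threads))
```

Five responses are handled, and the set of thread ids contains a single entry, which is the point: overlapping requests do not require multiple threads.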

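2) Structure analysis

First analyze how the page data is organized. What we see first is a list of projects; clicking a project shows several branches, each branch has several buildings, and each building has many units.

Project list:

Building branches:

Block names:

Per-unit information:

The relationship is project 1:N building 1:N unit (the block is just one attribute of a unit)

This takes more than looking at the pages; the URL structure needs to be analyzed as well

When entering a project, the URL is http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=37813 and id is clearly the project id

When entering a building of that project, the URL is http://ris.szpl.gov.cn/bol/building.aspx?id=33043&presellid=37813 where id is the building id and presellid is the project id

When entering a single unit, the URL is http://ris.szpl.gov.cn/bol/housedetail.aspx?id=1634258 where id is the unit id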
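The ids in these URLs can be pulled out with the standard library instead of slicing strings by hand. The sketch below is one possible approach; query_ids is a hypothetical helper name, and the URLs are the ones quoted above.

```python
from urllib.parse import urlparse, parse_qs


def query_ids(url):
    """Return the query parameters of a URL as a dict of single values."""
    query = parse_qs(urlparse(url).query)
    return {key: values[0] for key, values in query.items()}


building = query_ids("http://ris.szpl.gov.cn/bol/building.aspx?id=33043&presellid=37813")
house = query_ids("http://ris.szpl.gov.cn/bol/housedetail.aspx?id=1634258")
print(building["id"], building["presellid"], house["id"])
```

Parsing the query string this way does not depend on the order of the parameters, which slicing between 'id=' and '&' does.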

Database design based on this analysis:

property:
    id   project id
    name project name

buliding:
    id           building id
    propertyid   project id
    bulidingname building name

house:
    id           unit id
    bulidingid   building id
    name         unit name
    square       unit floor area
    ....
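To make the 1:N relationships concrete, here is a minimal sketch of the three tables. It uses an in-memory SQLite database as a stand-in for the MySQL database used later; the column types, the sample names, and the pairing of the sample ids are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE property (
    id   INTEGER PRIMARY KEY,                    -- project id
    name TEXT                                    -- project name
);
CREATE TABLE buliding (
    id           INTEGER PRIMARY KEY,            -- building id
    propertyid   INTEGER REFERENCES property(id),
    bulidingname TEXT
);
CREATE TABLE house (
    id         INTEGER PRIMARY KEY,              -- unit id
    bulidingid INTEGER REFERENCES buliding(id),
    name       TEXT,
    square     REAL                              -- floor area (assumed numeric)
);
""")

# Illustrative rows only; the names and the area are made up.
conn.execute("INSERT INTO property VALUES (37813, 'demo project')")
conn.execute("INSERT INTO buliding VALUES (33043, 37813, 'demo building')")
conn.execute("INSERT INTO house VALUES (1634258, 33043, 'unit 101', 89.5)")

# Walk from a unit up to its project through the two foreign keys.
row = conn.execute("""
    SELECT p.name, b.bulidingname, h.name
    FROM house h
    JOIN buliding b ON h.bulidingid = b.id
    JOIN property p ON b.propertyid = p.id
""").fetchone()
print(row)
```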

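3) Create the items, based on the database

Because items are not mapped directly to the database, there is no need to follow the database exactly; class names and attribute names do not have to match it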

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy

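class RishomeItem(scrapy.Item):
    # Every key assigned in the spider must be declared here;
    # setting an undeclared key on a scrapy.Item raises KeyError.
    id = scrapy.Field()
    name = scrapy.Field()
    url = scrapy.Field()
    location = scrapy.Field()
    district = scrapy.Field()
    landstartdate = scrapy.Field()
    landyear = scrapy.Field()
    landproperty = scrapy.Field()
    houseproperty = scrapy.Field()


class BulidingItem(scrapy.Item):
    id = scrapy.Field()
    propertyid = scrapy.Field()
    bulidingname = scrapy.Field()
    url = scrapy.Field()


class HouseItem(scrapy.Item):
    id = scrapy.Field()
    bulidingid = scrapy.Field()
    level = scrapy.Field()
    houseno = scrapy.Field()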
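A scrapy.Item only accepts keys that were declared as Field attributes, so each key the spider assigns needs a matching declaration. The sketch below imitates that behavior with a plain dict subclass so it runs without Scrapy installed; SketchItem and ProjectSketch are hypothetical stand-ins, not Scrapy classes, and the error message is only modeled on Scrapy's.

```python
class SketchItem(dict):
    """Simplified stand-in for scrapy.Item: only declared fields may be set."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s" % (type(self).__name__, key))
        super().__setitem__(key, value)


class ProjectSketch(SketchItem):
    fields = ("id", "name")


item = ProjectSketch()
item["id"] = "37813"  # declared, so accepted
try:
    item["url"] = "http://ris.szpl.gov.cn/bol/projectdetail.aspx?id=37813"
    rejected = False
except KeyError:
    rejected = True  # "url" was never declared
print(dict(item), rejected)
```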

Now create the spider itself

# -*- coding: utf-8 -*-
import scrapy
from rishome.items import RishomeItem,BulidingItem,HouseItem
from rishome.pipelines import RishomePipeline

class ItcastSpider(scrapy.Spider):
    name = 'itcast'
    allowed_domains = ['ris.szpl.gov.cn']
    start_urls = ['http://ris.szpl.gov.cn/bol/']

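name is the name of the spider; it is also what you pass on the command line later to run it

allowed_domains lists the domains the spider is allowed to crawl; requests to URLs outside these domains are filtered out by Scrapy's offsite middleware

start_urls lists the starting pages

When we send a GET request to the start page, the result is passed into the def parse(self, response) method; response is what the remote server returned to us.

Next comes an important topic, XPath. Finding exactly what you want among countless HTML tags is a skill in itself.

Search with response.xpath. I will cover it in a separate article later.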

    def parse(self, response):
        context = response.xpath('//tr[@bgcolor="#F5F9FC"]/td[3]')
        for item in context:
            title=item.xpath('a/text()').extract_first()
            idstr=item.xpath('a/@href').extract_first()
            idstr=idstr[idstr.find('=')+1:]
            request=scrapy.Request(url='http://ris.szpl.gov.cn/bol/projectdetail.aspx?id='+idstr, method='GET',callback=self.showdetailpage)
            yield request

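The first page has nothing we need to scrape; we only follow its links to the next pages. Chrome can help us obtain the XPath

What we scrape is this table, where each row's tr is <tr bgcolor="#F5F9FC">. We need to click through the project name, so the link on the project name is our ultimate extraction target. The XPath that selects all of these cells is response.xpath('//tr[@bgcolor="#F5F9FC"]/td[3]')

// selects matching nodes anywhere in the document

//tr[@bgcolor="#F5F9FC"] selects every tr node whose bgcolor attribute is "#F5F9FC"

//tr[@bgcolor="#F5F9FC"]/td[3] selects the third td under every tr node whose bgcolor attribute is "#F5F9FC"

context = response.xpath('//tr[@bgcolor="#F5F9FC"]/td[3]') assigns the result to context, which is itself a list of selectors

Since it is a list, we can of course iterate over it:  for item in context:
Each item is then one td node, and its child node is the link <a href="projectdetail.aspx?id=37873" target="_parent">天安云谷产业园二期(02-08)</a>

To get the text inside the a tag, use title=item.xpath('a/text()').extract_first()

extract_first() returns the first match as a string (or None when nothing matches)

extract() returns a list containing all of the results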
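The same selection can be tried outside Scrapy. The sketch below uses xml.etree.ElementTree from the standard library, which supports only a small XPath subset and needs well-formed markup, so it is a stand-in for response.xpath rather than a replacement. The table fragment is a hypothetical sample shaped like the project list, and the second project name is made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment shaped like the project list table.
page = """<table>
  <tr bgcolor="#F5F9FC">
    <td>1</td><td>x</td>
    <td><a href="projectdetail.aspx?id=37873" target="_parent">天安云谷产业园二期(02-08)</a></td>
  </tr>
  <tr bgcolor="#FFFFFF">
    <td>-</td><td>-</td>
    <td><a href="other.aspx">not selected</a></td>
  </tr>
  <tr bgcolor="#F5F9FC">
    <td>2</td><td>y</td>
    <td><a href="projectdetail.aspx?id=37813" target="_parent">Sample Project B</a></td>
  </tr>
</table>"""

root = ET.fromstring(page)
# Third td of every tr whose bgcolor attribute is #F5F9FC.
cells = root.findall(".//tr[@bgcolor='#F5F9FC']/td[3]")

titles = [td.find("a").text for td in cells]  # all matches, like extract()
first = titles[0] if titles else None         # first match or None, like extract_first()
print(titles, first)
```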

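To get the value of an attribute of a tag, use idstr=item.xpath('a/@href').extract_first(), which gives us the href value of the a tag. Because what we get is a relative address, we take the id from it and send a request to the full URL.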

idstr=idstr[idstr.find('=')+1:]
request=scrapy.Request(url='http://ris.szpl.gov.cn/bol/projectdetail.aspx?id='+idstr, method='GET',callback=self.showdetailpage)
yield request

At this point we have sent a Request to the project URL using the GET method, and the response is handled by the callback self.showdetailpage
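An alternative to cutting the id out of the href and rebuilding the URL by hand is to resolve the relative href against the page URL. A minimal sketch with the standard library follows; inside a Scrapy callback, response.urljoin(href) resolves against the response URL in the same way.

```python
from urllib.parse import urljoin

base = "http://ris.szpl.gov.cn/bol/"   # the start page from start_urls
href = "projectdetail.aspx?id=37873"   # relative link taken from the a tag
full_url = urljoin(base, href)
print(full_url)
```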

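Next we write the callback, where we populate the objects defined in items. We extract content from the page and store it in a RishomeItem object, then hand the populated object to the pipelines. Wherever we need to go one level deeper, we simply yield another request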

    def showdetailpage(self,response):
        item = RishomeItem()
        homeid = response.url[response.url.find('=')+1:]
        item["url"]=response.url
        item["id"]=homeid
        context=response.xpath('//tr[@class="a1"]')
        for it in context:
            title = it.xpath('td[1]/div/text()').extract_first()
            if title=="项目名称":
                content=it.xpath('td[2]/text()').extract_first()
                item['name']=content
            if title=="宗地位置":
                content = it.xpath('td[2]/text()').extract_first()
                item['location']=content
            if title=="受让日期":
                content = it.xpath('td[2]/text()').extract_first()
                item['landstartdate']=content
                content = it.xpath('td[4]/text()').extract_first()
                item['district'] = content.replace('\r\n','').replace(' ','')
            if title=="合同文号":
                content = it.xpath('td[4]/div/text()').extract_first()
                item['landyear'] = content.replace('\r\n','').replace(' ','').replace('年','')
            # 房屋用途 is the building use, 土地用途 is the land use
            if title=="房屋用途":
                content = it.xpath('td[2]/text()').extract_first()
                item['houseproperty'] = content
            if title=="土地用途":
                content = it.xpath('td[2]/text()').extract_first()
                item['landproperty'] = content
        yield item
        projectlist=response.xpath('//*[@id="DataList1"]/tr[@bgcolor="#F5F9FC"]')
        for it in projectlist:
            bi=BulidingItem()
            bulidingname=it.xpath('td[2]/text()').extract_first()
            bulidingurl=it.xpath('td[5]/a/@href').extract_first()
            bulidingurl=self.start_urls[0]+bulidingurl
            biid=bulidingurl[bulidingurl.find('id=')+3:bulidingurl.find('&')]
            bi['id']=biid
            bi['propertyid']=homeid
            bi['bulidingname']=bulidingname
            bi['url']=bulidingurl
            yield bi
            request= scrapy.Request(bulidingurl,method='GET',callback=self.showhousepage)
            yield request

Now let's look at how the pipeline handles the items

Before the pipeline can be used, it has to be enabled in settings

ITEM_PIPELINES = {
    'rishome.pipelines.RishomePipeline': 300,
}

import time

import pymysql.cursors

from rishome.items import RishomeItem, BulidingItem, HouseItem


class RishomePipeline(object):
    def __init__(self):
        # Connect to the database
        self.connect = pymysql.connect(
            host='127.0.0.1',  # database host
            port=3306,  # database port
            db='rishome',  # database name
            user='root',  # database user
            passwd='',  # database password
            charset='utf8',  # character encoding
            use_unicode=True)
        self.cursor = self.connect.cursor()

    # Every yielded item arrives in this method; tell them apart by item type
    def process_item(self, item, spider):
        if isinstance(item, RishomeItem):
            self.saverishome(item)
        elif isinstance(item, BulidingItem):
            self.savebuliding(item)
        elif isinstance(item, HouseItem):
            self.savehouse(item)
        return item  # the item must be returned

    def saverishome(self,item):
        # strftime uses the current local time when no time tuple is given
        timestr=time.strftime('%Y-%m-%d %H:%M:%S')
        sqlstr="""INSERT INTO `rishome`.`property` (`id`,`name`,`location`,`district`,`landstartdate`,`landyear`, `landproperty`,`houseproperty`,`createtime`,`updatetime`,`url`
                ) VALUES(%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s) """
        # Empty date and year strings are stored as NULL
        self.cursor.execute(sqlstr, (
            item['id'], item['name'], item['location'], item['district'],
            None if item['landstartdate']=='' else item['landstartdate'],
            None if item['landyear']=='' else item['landyear'],
            item['landproperty'], item['houseproperty'], timestr, timestr, item['url']))
        self.connect.commit()
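The pipeline pattern above (dispatch on the item type, then a parameterized INSERT that turns empty strings into NULL) can be tried without MySQL or Scrapy. The sketch below is an assumption-laden stand-in: it uses an in-memory SQLite database, whose placeholder is ? rather than PyMySQL's %s, and Project, Building, and SketchPipeline are hypothetical names with a reduced set of columns.

```python
import sqlite3


class Project(dict):
    """Hypothetical stand-in for RishomeItem."""


class Building(dict):
    """Hypothetical stand-in for BulidingItem."""


class SketchPipeline:
    def __init__(self):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE property (id TEXT, name TEXT, landyear TEXT)")
        self.conn.execute("CREATE TABLE buliding (id TEXT, propertyid TEXT)")

    def process_item(self, item):
        # Dispatch on the item class, one table per item type.
        if isinstance(item, Project):
            self.conn.execute(
                "INSERT INTO property VALUES (?, ?, ?)",
                (item["id"], item["name"], item["landyear"] or None))  # '' becomes NULL
        elif isinstance(item, Building):
            self.conn.execute(
                "INSERT INTO buliding VALUES (?, ?)",
                (item["id"], item["propertyid"]))
        self.conn.commit()
        return item


pipeline = SketchPipeline()
pipeline.process_item(Project(id="37813", name="demo project", landyear=""))
pipeline.process_item(Building(id="33043", propertyid="37813"))
projects = pipeline.conn.execute("SELECT * FROM property").fetchall()
buildings = pipeline.conn.execute("SELECT * FROM buliding").fetchall()
print(projects, buildings)
```

Passing values as parameters, as the original pipeline also does, leaves quoting to the database driver instead of building SQL strings by hand.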

Finally, run the spider. Go into the spider directory and run it

E:\scrapy\rishome\rishome\spiders>scrapy crawl itcast

That completes the whole process, from scraping pages and following URLs to storing the data in the database.
