Scraping Lianjia second-hand housing listings: home page and detail page data

I. Prerequisites
1) A Python environment, preferably Python 3.
2) Scrapy installed: pip install scrapy
If you hit ERROR: Failed building wheel for Twisted, download the matching .whl from https://pypi.org/project/Twisted/#files, install it with pip install xxx.whl, and then install Scrapy again. If that still does not work, create a virtual environment and install Scrapy inside it.
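
If you go the virtual-environment route, a typical setup looks like this (the environment name scrapy_env is arbitrary):

python -m venv scrapy_env
source scrapy_env/bin/activate    (on Windows: scrapy_env\Scripts\activate)
pip install scrapy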

II. Code. Since my goal here is just to get the whole flow working, the scraped data is not complete: the listing (home) page is mainly used to reach each detail page and pull its community and district information. If you need other fields, you can extend the code below yourself.

1) First, create the Scrapy project:

scrapy startproject Lianjia

2) Create the spider:

cd Lianjia
scrapy genspider lianjia lianjia.com
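
After these two commands the project layout looks roughly like this (only the files used below are shown):

Lianjia/
    scrapy.cfg
    Lianjia/
        items.py
        pipelines.py
        settings.py
        spiders/
            lianjia.py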

3) In lianjia.py under the spiders directory, write the following code:

import json

import scrapy

from ..items import LianjiaItem, LianjiaDetailItem


class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['lianjia.com']
    start_urls = ['https://dl.lianjia.com/ershoufang/rs/']

    def parse(self, response):
        secondarys = response.xpath('//*[@id="content"]/div[1]/ul/li')

        for sec in secondarys:
            item = LianjiaItem()
            item['title'] = sec.xpath('.//div[1]/div[1]/a/text()').extract_first()  # listing title
            item['detail_url'] = sec.xpath('.//div[1]/div[1]/a/@href').extract_first()  # detail page link

            if item['detail_url']:
                # Request the detail page, passing the listing item along in meta
                yield scrapy.Request(
                    item['detail_url'],
                    callback=self.detail_handle,
                    meta={"house": item}
                )
            yield item

        # Next pages: the page-data attribute holds a small JSON object containing totalPage
        page_data = response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").extract_first()
        # Get the total number of pages
        pages = json.loads(page_data)['totalPage']
        # Page 1 is already covered by start_urls, so request pages 2..totalPage
        for pa in range(2, int(pages) + 1):
            next_url = 'https://dl.lianjia.com/ershoufang/pg{}/'.format(pa)
            yield scrapy.Request(
                url=next_url,
            )


    # Detail page
    def detail_handle(self, response):
        listing = response.meta['house']
        item = LianjiaDetailItem()

        # Extract the detail page fields
        item['title'] = listing['title']  # carry the title over from the listing page
        item['community'] = response.xpath('.//div[2]/div[5]/div[1]/a[1]/text()').extract_first()  # community
        area_1 = response.xpath('.//div[2]/div[5]/div[2]/span[2]/a[1]/text()').extract_first()  # district, part 1
        area_2 = response.xpath('.//div[2]/div[5]/div[2]/span[2]/a[2]/text()').extract_first()  # district, part 2
        if area_1 and area_2:
            item['area'] = area_1 + ' ' + area_2
        else:
            item['area'] = area_1 or area_2

        yield item
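
The XPaths above are tied to the page layout at the time of writing, so they may need adjusting later. One way to check them before editing the spider is Scrapy's interactive shell, for example:

scrapy shell 'https://dl.lianjia.com/ershoufang/rs/'
>>> response.xpath('//*[@id="content"]/div[1]/ul/li')
>>> response.xpath('//div[@class="page-box house-lst-page-box"]/@page-data').extract_first()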

4)items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    title = scrapy.Field()  # listing title
    detail_url = scrapy.Field()  # detail page link


class LianjiaDetailItem(scrapy.Item):

    title = scrapy.Field()  # listing title
    area = scrapy.Field()  # district
    community = scrapy.Field()  # community name

5)pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
import json
from itemadapter import ItemAdapter


# Save every scraped item as one JSON object per line
class LianjiaPipeline(object):

    def __init__(self):
        self.file = open('Lianjia.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        item = dict(item)
        json_data = json.dumps(item, ensure_ascii=False) + ',\n'
        self.file.write(json_data)

        return item

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()
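
Since the spider yields both LianjiaItem and LianjiaDetailItem, this pipeline receives both types and writes them into the same file. If you ever need to treat them differently, a minimal sketch (the class name TypeAwarePipeline is only an illustration, not part of the project above) could check the item class:

from .items import LianjiaItem, LianjiaDetailItem


class TypeAwarePipeline(object):

    def process_item(self, item, spider):
        # Route items based on which Item class produced them
        if isinstance(item, LianjiaDetailItem):
            spider.logger.info('detail item: %s', dict(item))
        elif isinstance(item, LianjiaItem):
            spider.logger.info('listing item: %s', dict(item))
        return item

Alternatively, depending on your Scrapy version, you can skip the custom pipeline entirely and rely on the built-in feed export, e.g. scrapy crawl lianjia -o Lianjia.json.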


6) In settings.py, uncomment ITEM_PIPELINES = {'Lianjia.pipelines.LianjiaPipeline': 300,}. It is also best to uncomment USER_AGENT and set it to your own request header, or copy one from the web. A sketch of the relevant entries is shown below.
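
A minimal sketch of the relevant settings.py entries (the USER_AGENT string is only an example; setting ROBOTSTXT_OBEY = False is an extra assumption you may need so that robots.txt does not block the crawl):

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'Lianjia.pipelines.LianjiaPipeline': 300,
}
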
7) Run the crawl from the project directory:

scrapy crawl lianjia

8) When the crawl finishes, a Lianjia.json file will appear in the current directory. Open it to check the results.
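
Because the pipeline adds a trailing comma after every object, the file is one JSON object per line rather than a single valid JSON document. A minimal sketch for loading it back (assuming the Lianjia.json written by the pipeline above):

import json

records = []
with open('Lianjia.json', encoding='utf-8') as f:
    for line in f:
        line = line.strip().rstrip(',')
        if line:
            records.append(json.loads(line))

print(len(records), 'records loaded')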
