Goal:
Use the Scrapy framework to crawl rental listings for the Hangzhou area, including the estate name, location, price, floor area, room orientation, layout, and so on, then save the scraped data to a local CSV file.
Analysis:
This site is not hard to crawl. Looking at the page, the fields to extract are the estate name, location, price, floor area, orientation, layout, and the listing-maintenance date; after scraping the current page, the spider moves on to the next one. The crawl mainly relies on Scrapy's Request and XPath selectors, and pandas is used to write the data to a local file.
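The pandas save step mentioned above can be sketched as follows. The row values and the output filename here are placeholders for illustration; the real spider would collect one dict per listing before writing.

import pandas as pd

# Placeholder rows standing in for scraped listings; keys mirror the item fields.
rows = [
    {"house_name": "Sample Estate", "quyu": "Xihu", "price": "3500 yuan/month",
     "mianji": "89 sqm", "chaoxiang": "South", "huxing": "2 bedrooms"},
]
df = pd.DataFrame(rows)
# utf-8-sig keeps Chinese text readable when the CSV is opened in Excel.
df.to_csv("zufang.csv", index=False, encoding="utf-8-sig")
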
Code:
items.py
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()       # listing title
    quyu = scrapy.Field()        # district, e.g. Xihu
    ziquyu = scrapy.Field()      # sub-district, e.g. Zhuantang in Xihu
    house_name = scrapy.Field()  # estate (residential complex) name
    mianji = scrapy.Field()      # floor area
    chaoxiang = scrapy.Field()   # room orientation
    huxing = scrapy.Field()      # layout
    price = scrapy.Field()       # rent price
    date = scrapy.Field()        # maintenance date
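A scrapy.Item subclass like LianjiaItem is filled in with dict-style assignment inside the spider. As a minimal sketch (a plain dict stands in for the item here so the snippet runs without Scrapy installed; on a real scrapy.Item, assigning an undeclared key raises KeyError):

# Plain dict standing in for LianjiaItem; values are placeholders.
item = {}
item["house_name"] = "Sample Estate"
item["price"] = "3500 yuan/month"
item["chaoxiang"] = "South"
# On a real scrapy.Item, item["typo_key"] = ... would raise KeyError,
# which catches misspelled field names early.
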
spiders.py
# -*- coding: utf-8 -*-
import scrapy
from lianjia.items import LianjiaItem
import urllib


class LianjiawangSpider(scrapy.Spider):
    name = 'lianjiawang'
    allowed_domains = ['hz.XXXXXX.com']
    start_urls = ['https://hz.XXXXX.com/zufang/']

    def parse(self, response):
        div_list = response.xpath('//div[@class="content__article"]/div[@class="content__list"]/div[@class="content__list--item"]')
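Each element of div_list would then be queried with relative XPath to pull out the individual fields. As a standalone sketch of that extraction (using lxml directly instead of a Scrapy response, with a made-up HTML snippet and assumed class names, so it runs outside a crawl):

from lxml import html

# Made-up listing markup; the class names are assumptions for illustration.
snippet = """
<div class="content__list--item">
  <p class="content__list--item--title"><a> Hangzhou rental title </a></p>
  <span class="content__list--item-price"><em>3500</em> yuan/month</span>
</div>
"""
tree = html.fromstring(snippet)
# Relative field extraction, analogous to div.xpath(...) inside the spider loop.
title = tree.xpath('//p[@class="content__list--item--title"]/a/text()')[0].strip()
price = tree.xpath('//span[@class="content__list--item-price"]/em/text()')[0]
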