I. Prerequisites
1) A Python environment, preferably Python 3.
2) Install Scrapy: pip install scrapy
If this fails with ERROR: Failed building wheel for Twisted, download the matching .whl from https://pypi.org/project/Twisted/#files, run pip install xxx.whl, and then install Scrapy again. If it still fails, create a virtual environment and install Scrapy inside it, as sketched below.
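For the virtual-environment route, a minimal sketch using the standard-library venv module (the environment name scrapy_env is arbitrary):

python -m venv scrapy_env
scrapy_env\Scripts\activate      # Windows; on Linux/macOS: source scrapy_env/bin/activate
pip install scrapy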
II. The code
Since my goal here was only to walk the whole pipeline end to end, the scraped data is deliberately incomplete: the spider mainly follows the listing page through to each detail page and extracts the community and district information there. If you need other fields, extend the code along the same lines.
1) First, create the Scrapy project:
scrapy startproject Lianjia
2) Then generate the spider:
cd Lianjia
scrapy genspider lianjia lianjia.com
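These two commands scaffold Scrapy's standard project layout, which should look roughly like this (the files edited in the following steps are marked):

Lianjia/
    scrapy.cfg
    Lianjia/
        __init__.py
        items.py         <- step 4
        middlewares.py
        pipelines.py     <- step 5
        settings.py      <- step 6
        spiders/
            __init__.py
            lianjia.py   <- step 3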
3) In lianjia.py under the spiders package, write the following code:
import json

import scrapy

from ..items import LianjiaItem, LianjiaDetailItem


class LianjiaSpider(scrapy.Spider):
    name = 'lianjia'
    allowed_domains = ['lianjia.com']
    start_urls = ['https://dl.lianjia.com/ershoufang/rs/']

    def parse(self, response):
        secondarys = response.xpath('//*[@id="content"]/div[1]/ul/li')
        for sec in secondarys:
            item = LianjiaItem()
            item['title'] = sec.xpath('.//div[1]/div[1]/a/text()').extract_first()  # listing title
            item['detail_url'] = sec.xpath('.//div[1]/div[1]/a/@href').extract_first()  # detail-page URL
            if item['detail_url']:
                # Request the detail page, carrying the listing item along in meta
                yield scrapy.Request(
                    item['detail_url'],
                    callback=self.detail_handle,
                    meta={'house': item}
                )
            yield item

        # Pagination: the page-data attribute holds JSON such as {"totalPage":100,"curPage":1},
        # so parse it with json.loads rather than eval
        page_data = response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").extract_first()
        if page_data:
            pages = json.loads(page_data)['totalPage']
            # Page 1 is already covered by start_urls, so begin at page 2
            for pa in range(2, int(pages) + 1):
                next_url = 'https://dl.lianjia.com/ershoufang/pg{}/'.format(pa)
                yield scrapy.Request(url=next_url)  # no callback given, so self.parse handles it

    # Detail page
    def detail_handle(self, response):
        house = response.meta['house']
        item = LianjiaDetailItem()
        item['title'] = house['title']  # carry the title over from the listing page
        item['community'] = response.xpath('.//div[2]/div[5]/div[1]/a[1]/text()').extract_first()  # community
        area_1 = response.xpath('.//div[2]/div[5]/div[2]/span[2]/a[1]/text()').extract_first()  # district, part 1
        area_2 = response.xpath('.//div[2]/div[5]/div[2]/span[2]/a[2]/text()').extract_first()  # district, part 2
        if area_1 and area_2:
            item['area'] = area_1 + ' ' + area_2
        else:
            item['area'] = area_1 or area_2
        yield item
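Lianjia's markup changes from time to time, so it is worth verifying the XPath expressions above interactively before a full crawl. Scrapy's built-in shell makes this easy (the URL is the spider's start URL):

scrapy shell "https://dl.lianjia.com/ershoufang/rs/"

Then, at the >>> prompt:

response.xpath('//*[@id="content"]/div[1]/ul/li')
response.xpath("//div[@class='page-box house-lst-page-box']/@page-data").extract_first()

If the second expression returns None, the pagination XPath needs updating.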
4) items.py:
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class LianjiaItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()       # listing title
    detail_url = scrapy.Field()  # detail-page URL


class LianjiaDetailItem(scrapy.Item):
    title = scrapy.Field()      # listing title
    area = scrapy.Field()       # district / area
    community = scrapy.Field()  # community name
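As an aside, scrapy.Item subclasses behave like dicts restricted to their declared fields, which is exactly what the pipeline in the next step relies on. A minimal sketch (the import path assumes the project layout above; the values are placeholders):

from Lianjia.items import LianjiaItem

item = LianjiaItem()
item['title'] = 'example title'
item['detail_url'] = 'https://dl.lianjia.com/ershoufang/xxx.html'
print(dict(item))  # {'title': 'example title', 'detail_url': '...'}
# item['price'] = 100  # would raise KeyError: 'price' is not a declared field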
5) pipelines.py:
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

# useful for handling different item types with a single interface
import json

from itemadapter import ItemAdapter


# Write each scraped item out as one JSON object per line
class LianjiaPipeline(object):
    def open_spider(self, spider):
        # utf-8 so that the non-ASCII text written with ensure_ascii=False survives
        self.file = open('Lianjia.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        data = ItemAdapter(item).asdict()
        json_data = json.dumps(data, ensure_ascii=False) + ',\n'
        self.file.write(json_data)
        return item

    def close_spider(self, spider):
        # close_spider is more reliable than __del__ for releasing the file handle
        self.file.close()
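Because process_item appends a comma after every object, the output file is a sequence of JSON objects, one per line, rather than a single valid JSON document; each line will look roughly like this (values illustrative):

{"title": "...", "detail_url": "https://dl.lianjia.com/ershoufang/....html"},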
6) In settings.py, uncomment ITEM_PIPELINES = {'Lianjia.pipelines.LianjiaPipeline': 300,} and preferably also uncomment USER_AGENT, filling in your own browser's User-Agent string (or one found online), as shown below.
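After those edits, the relevant lines in settings.py should end up looking roughly like this (the User-Agent string is only an example; substitute your own browser's):

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'

ITEM_PIPELINES = {
    'Lianjia.pipelines.LianjiaPipeline': 300,
}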
7) In the project's root directory, run:
scrapy crawl lianjia
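As an alternative to the custom pipeline, Scrapy's built-in feed export can write the file directly; with Scrapy 2.1 or newer, -O overwrites the output file (older versions only have -o, which appends):

scrapy crawl lianjia -O Lianjia.json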
8) After the crawl finishes, you will find a Lianjia.json file in the current directory. Open it to see the results.