1. Project Background and Requirements
- I watched a tutorial on Bilibili about scraping the Fang.com real-estate site, typed the code out myself, and made some modifications.
- The site lists both new homes and second-hand homes for cities across China. The goal is to scrape the listing details of new homes and of second-hand homes for every city.
- Fields for new homes: province (province), city (city), complex name (name), price (price), number of rooms (rooms), floor area (area), address (address), administrative district (district), sale status (sale), and the URL of each listing's detail page (origin_url).
- Fields for second-hand homes: province (province), city (city), complex name (name), address (address), basic property info (infos), total price (price), unit price (unit), and the URL of each listing's detail page (origin_url).
- All of the fields above are defined in the Scrapy `items.py` file shown below.
2. Writing the Scrapy Spider
2.1. Project Creation and Directory Structure
2.1.1. Project Creation
- Open a cmd window and change into the folder where you want the project to live.
- Run: `scrapy startproject fang`
- Run `cd fang` twice, then `cd spiders`, to enter the spiders folder.
- Run: `scrapy genspider sfw fang.com` to create the spider file.
2.1.2. Directory Structure
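A freshly generated project has the standard Scrapy layout (a sketch; the exact top-level path depends on where you ran the command):

```
fang/
├── scrapy.cfg          # deployment configuration
└── fang/
    ├── __init__.py
    ├── items.py        # item (field) definitions
    ├── middlewares.py  # downloader / spider middlewares
    ├── pipelines.py    # item pipelines
    ├── settings.py     # project settings
    └── spiders/
        ├── __init__.py
        └── sfw.py      # the spider created by genspider
```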
2.2. The settings.py File
- Set `ROBOTSTXT_OBEY = False`
- Set `DOWNLOAD_DELAY = 3`
- Uncomment the block below; the request headers themselves are set in `middlewares.py` (shown later):
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en'
}
- Enable the downloader-middleware and pipeline settings later, once those files are written.
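Putting the changes above together, the relevant part of `settings.py` looks roughly like this (the middleware and pipeline class names are placeholders until those files exist):

```python
# settings.py -- relevant lines only (a sketch of the changes described above)
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 3
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable these once middlewares.py / pipelines.py are written
# (the class names below are hypothetical examples):
# DOWNLOADER_MIDDLEWARES = {'fang.middlewares.UserAgentDownloadMiddleware': 543}
# ITEM_PIPELINES = {'fang.pipelines.FangPipeline': 300}
```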
2.3. The items.py File
The data to be stored is defined in this file.
import scrapy

# New-home listing
class NewHouseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # province
    province = scrapy.Field()
    # city
    city = scrapy.Field()
    # name of the residential complex
    name = scrapy.Field()
    # price
    price = scrapy.Field()
    # number of rooms (this is a list)
    rooms = scrapy.Field()
    # floor area
    area = scrapy.Field()
    # address
    address = scrapy.Field()
    # administrative district
    district = scrapy.Field()
    # sale status
    sale = scrapy.Field()
    # URL of the detail page on Fang.com
    origin_url = scrapy.Field()

# Second-hand-home listing
class ESFHouseItem(scrapy.Item):
    # province
    province = scrapy.Field()
    # city
    city = scrapy.Field()
    # name of the residential complex
    name = scrapy.Field()
    # # rooms and halls
    # rooms = scrapy.Field()
    # # floor
    # floor = scrapy.Field()
    # # orientation
    # toward = scrapy.Field()
    # # build year
    # year = scrapy.Field()
    # address
    address = scrapy.Field()
    # # built-up area
    # # area = scrapy.Field()
    # total price
    price = scrapy.Field()
    # unit price
    unit = scrapy.Field()
    # original URL
    origin_url = scrapy.Field()
    # basic info
    infos = scrapy.Field()
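A `scrapy.Item` behaves like a dict that only accepts the keys declared as `Field` attributes on the class. The following is a minimal pure-Python sketch of that behavior for illustration only; it is not scrapy's actual implementation:

```python
# Illustration of scrapy.Item's dict-like behavior (NOT scrapy's real code).
class Field(dict):
    """Stand-in for scrapy.Field: an empty metadata container."""

class Item:
    def __init__(self, **kwargs):
        # collect the field names declared on the subclass
        self.fields = {name for name, value in type(self).__dict__.items()
                       if isinstance(value, Field)}
        self._values = {}
        for key, value in kwargs.items():
            self[key] = value

    def __setitem__(self, key, value):
        # reject keys that were not declared as fields
        if key not in self.fields:
            raise KeyError(f"{type(self).__name__} does not support field: {key}")
        self._values[key] = value

    def __getitem__(self, key):
        return self._values[key]

class NewHouseItem(Item):
    province = Field()
    city = Field()
    name = Field()

item = NewHouseItem(province="Guangdong", city="Shenzhen")
item["name"] = "Some Complex"
print(item["province"])  # -> Guangdong
```

Assigning to an undeclared key (say, `item["floor"]`) raises a `KeyError`, which is why every field scraped by the spider must first be declared in `items.py`.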
2.4. The sfw.py File (the Spider)
The main crawling logic lives in this file.
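One step in the spider's `parse()` method is worth isolating first: each city link is split on the substring `"fang"` so that the `newhouse` and `esf` subdomains can be spliced in. Using Shenzhen's city link as a hypothetical example:

```python
# Sketch of the URL construction used in parse(): split the city link on
# "fang" and splice in the newhouse / esf subdomains.
city_url = "https://sz.fang.com/"  # hypothetical city link (Shenzhen)
url_split = city_url.split("fang")
url_former = url_split[0]          # "https://sz."
newhouse_url = url_former + "newhouse.fang.com/house/s/"
esf_url = url_former + "esf.fang.com/"
print(newhouse_url)  # -> https://sz.newhouse.fang.com/house/s/
print(esf_url)       # -> https://sz.esf.fang.com/
```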
import scrapy
import re
from fang.items import NewHouseItem, ESFHouseItem

class SfwSpider(scrapy.Spider):
    name = 'sfw'
    allowed_domains = ['fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        # every row of the city table
        trs = response.xpath("//div[@class = 'outCont']//tr")
        province = None
        # iterate over the rows
        for tr in trs:
            # the two <td> tags holding the province and its cities
            tds = tr.xpath(".//td[not(@class)]")
            # province name
            province_text = tds[0]
            # the province's city names and links
            city_info = tds[1]
            # extract the province name
            province_text = province_text.xpath(".//text()").get()
            province_text = re.sub(r"\s", "", province_text)
            if province_text:
                province = province_text
            # skip overseas listings ("其它" = "Other")
            if province == "其它":
                continue
            # extract the city names and links
            city_links = city_info.xpath(".//a")
            for city_link in city_links:
                # city name
                city = city_link.xpath(".//text()").get()
                # city link
                city_url = city_link.xpath(".//@href").get()
                # build the new-home URL
                url_split = city_url.split("fang")
                url_former = url_split[0]
                url_backer = url_split[1]
                newhouse_url = url_former + "newhouse.fang.com/house/s/"
                # build the second-hand-home URL
                esf_url = url_former + "esf.fang.com/"
                # print("++" * 20)
                # print("province:", province)
                # print("city:", city)
                # print("new-home URL:", newhouse_url)
                # print("second-hand URL:", esf_url)
                # print("++" * 20)
                # hand the new-home page to its own parser
                yield scrapy.Request(url=newhouse_url, callback=self.parse_newhouse,
                                     meta={"info": (province, city)})
                # hand the second-hand page to its own parser
                yield scrapy.Request(url=esf_url, callback=self.parse_esf,
                                     meta={"info": (province, city)})

    # parse a new-home listing page
    def parse_newhouse(self, response):
        province, city = response.meta.get("info")
        lis = response.xpath("//div[contains(@class, 'nl_con')]/ul/li[not(@style)]")
        for li in lis:
            # name of the complex
            name = li.xpath(".//div[@class='nlcd_name']/a/text()").get(