Scraping housing listings from the “房天下” (Fang.com) website with the Python Scrapy framework

Latest recommended article published on 2024-05-25 01:07:28

如风过境YD

Category: Python Programming

Article link: https://blog.csdn.net/qq_35649945/article/details/92729815

Contents

Analyzing the pages
Scraping new-home, resale, and rental data
Countering anti-scraping measures
Saving the data to MongoDB

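Analyzing the pages

Home page of the “房天下” (Fang.com) site (link).
[Image: Fang.com home page]
Because the full data set is large, this project only collects listings for the hot cities shown in the figure below (URL link).

Clicking a hot-city entry in the figure above opens that city's home page, which holds the URLs for new homes, resale homes and rentals.
Taking Shanghai as an example:
Shanghai home page: url=https://sh.fang.com/
Shanghai new homes: url=https://sh.newhouse.fang.com/house/s/
Resale homes: url=https://sh.esf.fang.com/
Rentals: url=https://sh.zu.fang.com/
The new-home, resale and rental URLs of a city can therefore be built by splitting the city's home-page URL and concatenating the pieces.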

def parse(self, response):
    # Each <a> in the city list links to one city's home page
    aList = response.xpath("//div[@class='searchcity fblue']/a")
    for a in aList:
        city_url = a.xpath(".//@href").get()
        city = a.xpath(".//text()").get()
        # Skip entries that are not a single city:
        # nationwide, Hong Kong, USA, overseas
        if city in ('全国', '香港', '美国', '海外'):
            continue
        # 'https://sh.fang.com/' -> ['https://sh', 'fang', 'com/']
        url_module = city_url.split('.')
        scheme = url_module[0]
        domain = url_module[1]
        com = url_module[2]
        if city == '北京':
            # Beijing's subdomains carry no city prefix
            newhouse_url = 'https://newhouse.fang.com/house/s/'
            esf_url = 'https://esf.fang.com/house/i32/'
            zufang_url = 'https://zu.fang.com/house/i32/'
        else:
            # New-home URL for the city
            newhouse_url = scheme + '.newhouse.' + domain + '.' + com + 'house/s/'
            # Resale-home URL for the city
            esf_url = scheme + '.esf.' + domain + '.' + com
            # Rental URL for the city
            zufang_url = scheme + '.zu.' + domain + '.' + com
        print('new_house:' + newhouse_url)
        print('esf:' + esf_url)
        print('zufang:' + zufang_url)
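
The string splitting above depends on the exact shape of the home-page URL. As a hedged alternative, the same three URLs can be derived with the standard library's urllib.parse. The function name build_city_urls is hypothetical, and the sketch assumes every city follows the pattern shown for Shanghai; Beijing, which the spider special-cases, is not handled here.

```python
from urllib.parse import urlsplit


def build_city_urls(city_url):
    """Derive new-home, resale and rental URLs from a city home-page URL."""
    parts = urlsplit(city_url)
    # 'sh.fang.com' -> prefix 'sh', base 'fang.com'
    prefix, _, base = parts.netloc.partition(".")
    root = f"{parts.scheme}://{prefix}"
    return {
        "newhouse": f"{root}.newhouse.{base}/house/s/",
        "esf": f"{root}.esf.{base}/",
        "zufang": f"{root}.zu.{base}/",
    }


urls = build_city_urls("https://sh.fang.com/")
print(urls["newhouse"])
print(urls["esf"])
print(urls["zufang"])
```

Parsing the host name separately means a missing trailing slash in the home-page URL no longer changes the result.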

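Scraping new-home, resale, and rental data

New-home data

[Image: new-home listing page]
As the figure shows, the new-home information is stored in a div tag whose class attribute is house_value clearfix, so it can be extracted with XPath. The code is as follows: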

# Requires `import re` and `import scrapy` at the top of the spider module
def parse_newhouse(self, response):
    city = response.meta.get("info")
    # All listing entries on the page
    lis = response.xpath('//div[contains(@class,"nl_con")]/ul/li')
    for li in lis:
        # Skip advertisement <li> elements, which have no name block
        if not li.xpath('.//div[@class="nlcd_name"]'):
            continue
        # Create a fresh item per listing so yielded items never share state
        item = NewHouseItem()
        # Development name
        item["name"] = li.xpath('.//div[@class="nlcd_name"]/a/text()').get(default="").strip()
        house_type_text = li.xpath(".//div[contains(@class,'house_type')]/a//text()").getall()
        # Room counts; endswith() takes a tuple to test several suffixes
        item["rooms"] = "/".join(filter(lambda x: x.endswith(('居', '以上')), house_type_text))
        area = "".join(li.xpath('.//div[contains(@class,"house_type")]/text()').getall())
        # Floor area
        item["area"] = re.sub(r"\s|/|－", "", area)
        # Address
        item["address"] = li.xpath('.//div[@class="address"]/a/@title').get()
        # District, shown in square brackets inside the address text
        district = "".join(li.xpath('.//div[@class="address"]/a//text()').getall())
        match = re.search(r"\[(.+)\]", district)
        # '暂无数据' means "no data yet"
        item["district"] = match.group(1) if match else "暂无数据"
        # Sale status
        item["sale"] = li.xpath('.//div[contains(@class,"fangyuan")]/span/text()').get()
        # Price, with whitespace and the '广告' (advertisement) label removed
        price = "".join(li.xpath(".//div[@class='nhouse_price']//text()").getall())
        item["price"] = re.sub(r'\s|广告', "", price)
        # Detail-page URL
        item["origin_url"] = response.urljoin(li.xpath('.//div[@class="nlcd_name"]/a/@href').get())
        item["city"] = city
        # Yielding hands the item to the pipelines, where it can be saved
        yield item

    # Follow the next page, if there is one, with the same callback
    next_url = response.xpath("//div[@class='page']//a[@class='next']/@href").get()
    if next_url:
        yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse_newhouse, meta={"info": city})
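
Two of the text-cleaning steps in this callback are easy to get subtly wrong, so here is a small standalone sketch of them. The helper names and the sample strings are made up for illustration. Note that an expression such as '居' or '以上' evaluates to just '居' in Python, which is why the suffix test passes a tuple to str.endswith.

```python
import re

NO_DATA = "暂无数据"  # "no data yet", the placeholder used in the spider


def extract_rooms(texts):
    """Keep only fragments that describe a room count, joined with '/'."""
    # str.endswith accepts a tuple, so one call tests both suffixes
    return "/".join(t for t in texts if t.endswith(("居", "以上")))


def extract_district(address_text):
    """Return the text inside the first [...] pair, or the placeholder."""
    match = re.search(r"\[(.+?)\]", address_text)
    return match.group(1) if match else NO_DATA


print(extract_rooms(["2居", "3居", "5居以上", "－"]))
print(extract_district("[浦东]某某路100号"))
print(extract_district("某某路100号"))
```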

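Rental data

[Image: rental listing page]
The HTML source corresponding to the figure above is as follows:

Based on that source, the XPath code for the rental information is as follows: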

def parse_renthouse(self, response):
    city = response.meta.get("info")
    dds = response.xpath("//div[@class='houseList']/dl/dd")
    for dd in dds:
        # Fresh item per listing so yielded items never share state
        item = RenthousescrapyItem(city=city)
        item['title'] = dd.xpath("./p[@class='title']/a/@title").get()
        spans = dd.xpath("./p[@class='font15 mt12 bold']/text()").getall()
        # Strip whitespace from every text fragment
        spans = [re.sub(r"\s", "", x) for x in spans]
        # Skip entries without the expected text fragments
        if len(spans) < 3:
            continue
        item['rooms'] = spans[1]
        item['area'] = spans[2]
        # Orientation is not always listed; '数据暂无' means "no data yet"
        item['direction'] = spans[3] if len(spans) > 3 else '数据暂无'
        # Region is the first fragment of the location line
        a = dd.xpath("./p[@class='gray6 mt12']//text()").getall()
        item['region'] = a[0] if a else '数据暂无'
        # Full address
        item['address'] = ''.join(a)
        # Transport description
        item['traffic'] = "".join(dd.xpath("./div/p[@class='mt12']//text()").getall())
        # Price
        item['price'] = ''.join(dd.xpath("./div[@class='moreInfo']/p//text()").getall())
        yield item

    # hrefs of all <a> tags under the div with class 'fanye' (pagination)
    next_url_list = response.xpath(".//div[@class='fanye']/a/@href").getall()
    next_url = None
    # Pick the next-page link by position, depending on how many links are shown
    if len(next_url_list) == 8:
        next_url = next_url_list[6]
    elif len(next_url_list) == 9:
        next_url = next_url_list[7]
    if next_url:
        yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse_renthouse, meta={"info": city})

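The XPath syntax allows an improvement to the next-page code above. The next-page URL is stored in an a tag under the div whose class attribute is fanye. No attribute distinguishes it from the other a tags in that div, but its position is regular: it is always the second-to-last a tag under the div. It can therefore be selected with the following XPath expression: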

next_url=response.xpath(".//div[@class='fanye']/a[last()-1]/@href").get()

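The URL obtained this way is relative, so response.urljoin() is used to turn it into an absolute URL; the resulting page is then parsed by parse_renthouse() again to collect more rental listings. The code is as follows:

if next_url:        # check whether a next page exists
    yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse_renthouse, meta={"info": city})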
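
To see the a[last()-1] selection and the URL joining in isolation, here is a minimal sketch that runs without Scrapy. It swaps in the standard library's xml.etree.ElementTree, whose limited XPath support includes last()-1, and urllib.parse.urljoin in place of response.urljoin(). The HTML snippet, the page URL and the function name next_page_url are invented for illustration, and the snippet has to be well-formed XML for ElementTree to parse it.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical pagination block: "next page" is the second-to-last link
SAMPLE = """<html><body><div class="fanye">
<a href="/house/i31/">1</a>
<a href="/house/i32/">2</a>
<a href="/house/i33/">3</a>
<a href="/house/i32/">下一页</a>
<a href="/house/i3100/">末页</a>
</div></body></html>"""


def next_page_url(page_url, html):
    """Return the absolute URL of the second-to-last pagination link."""
    root = ET.fromstring(html)
    link = root.find(".//div[@class='fanye']/a[last()-1]")
    if link is None:
        return None
    # Resolve the relative href against the URL of the current page
    return urljoin(page_url, link.get("href"))


print(next_page_url("https://sh.zu.fang.com/house/i31/", SAMPLE))
```

Selecting by position relative to the end keeps working when the number of page links changes, which the length checks on the href list do not.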

Resale-home data

Resale listings are extracted in essentially the same way as new-home and rental listings. The code is as follows:

def parse_esf(self, response):
    city = response.meta.get("info")
    # All <dl> entries in the listing
    dls = response.xpath('//div[contains(@class,"shop_list")]/dl')
    for dl in dls:
        # Fresh item per listing, so fields never leak from the previous one
        item = ESFHouseItem(city=city)
        item["name"] = dl.xpath('.//p[@class="add_shop"]/a/@title').get()
        infos = dl.xpath('.//p[@class="tel_shop"]/text()').getall()
        infos = [re.sub(r"\s", "", x) for x in infos]
        # Assign each fragment to a field by the marker character it contains
        for info in infos:
            if '室' in info:        # rooms
                item["rooms"] = info
            elif '层' in info:      # floor
                item["floor"] = info
            elif '向' in info:      # orientation
                item["toward"] = info
            elif '㎡' in info:      # floor area
                item['area'] = info
            elif '年' in info:      # year built
                item["year"] = info.replace("建筑年代", "")
        # Address
        item['address'] = dl.xpath('.//p[@class="add_shop"]/span/text()').get()
        # Total price: number plus unit
        price_s = dl.xpath('.//dd[@class="price_right"]/span/b/text()').get()
        price_w = dl.xpath('.//dd[@class="price_right"]/span[1]/text()').get()
        if price_s and price_w:
            item['price'] = price_s + price_w
        else:
            item['price'] = '暂无数据'  # "no data yet"
        # Price per square metre
        item['unit'] = dl.xpath('.//dd[@class="price_right"]/span[2]/text()').get()
        # Detail-page URL
        item['origin_url'] = response.urljoin(dl.xpath('.//h4/a/@href').get())
        yield item
    next_url = response.xpath('.//div[@class="page_al"]/p[1]/a/@href').get()
    if next_url:
        yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse_esf, meta={"info": city})
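
The field assignment inside the loop can also be written as a small lookup table, which makes the order of the checks explicit. This is only a sketch: the function name classify_infos and the sample fragments are hypothetical, and the marker characters are the ones used in the callback above.

```python
# (marker character, item field) pairs, checked in this order
MARKERS = [
    ("室", "rooms"),
    ("层", "floor"),
    ("向", "toward"),
    ("㎡", "area"),
    ("年", "year"),
]


def classify_infos(infos):
    """Map each text fragment to the first field whose marker it contains."""
    fields = {}
    for info in infos:
        for marker, field in MARKERS:
            if marker in info:
                fields[field] = info.replace("建筑年代", "")
                break  # a fragment fills at most one field
    return fields


sample = ["3室2厅", "120㎡", "中层(共18层)", "南北向", "2005年建", ""]
print(classify_infos(sample))
```

Fragments that match no marker, such as the empty string in the sample, are ignored instead of being stored as the build year.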

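Countering anti-scraping measures

Most websites take some anti-scraping measures. It is therefore necessary to prepare several request headers and rotate the User-Agent so that the site is less likely to identify the crawler: define a UserAgentMiddleware class in Scrapy's middlewares file that generates a random request header, then enable it in settings, and a random User-Agent is used for requests while the spider runs. Once the number of page visits reaches a certain level, the site may ban the IP. Where necessary, an IP proxy should also be added in the middlewares so that the target pages are fetched through an intermediate proxy server and the IP is not banned. Request speed should also be limited when crawling the “房天下” site: on the one hand, requests that come too fast may overload the other party's server; on the other hand, they make the crawler easy to identify and get the IP banned.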

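Saving the data to MongoDB

This part is mainly practice in saving the scraped data in several forms for easy access, so the data is also saved in CSV and JSON formats.

JSON format

Once the data has been scraped, the most convenient way to save it is as JSON in a local file. JSON data is structured as key-value pairs, {key: value, key: value, …}, which maps naturally onto scraped items, and Python ships a json module for it. A RentHousePipeline class is defined in the pipelines file, and its process_item() method is overridden to save each item. The code is as follows: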

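from scrapy.exporters import JsonLinesItemExporter

class RentHousePipeline(object):
    def __init__(self):
        # Exporters write bytes, so the file is opened in binary mode
        self.renthouse_fp = open('renthouse.json', 'wb')
        self.renthouse_exporter = JsonLinesItemExporter(self.renthouse_fp, ensure_ascii=False)

    def process_item(self, item, spider):
        # Called once per item; writes it as one JSON line
        self.renthouse_exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.renthouse_fp.close()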

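The RentHousePipeline pipeline also has to be enabled in the settings file (ITEM_PIPELINES in settings.py); the data is then saved to renthouse.json. The result is shown in the figure.
[Image: contents of renthouse.json]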
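
For comparison, the same one-object-per-line layout can be produced with the standard json module alone. This is a sketch rather than the article's pipeline: the function name write_json_lines and the sample record are made up, and an in-memory buffer stands in for the output file.

```python
import io
import json


def write_json_lines(fp, records):
    """Write each record as one JSON object per line."""
    for record in records:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        fp.write(json.dumps(record, ensure_ascii=False) + "\n")


buffer = io.StringIO()
write_json_lines(buffer, [
    {"city": "上海", "title": "示例房源", "price": "5000元/月"},
    {"city": "北京", "title": "示例房源二", "price": "6000元/月"},
])
print(buffer.getvalue())
```

Each line can be parsed on its own with json.loads, so a large file can be read back record by record.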

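CSV format

JSON stores data as key-value pairs, which suits scraped data. However, as Figure 4-1 shows, the saved data is not very readable at a glance. Python also has a csv module, and CSV stores the data in a spreadsheet-like layout similar to an Excel table. The code is as follows: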

import csv
# NewHouseItem, ESFHouseItem and RenthousescrapyItem come from the project's items module

class FangCSVPipeline(object):
    def __init__(self):
        print("开始写入...")  # "start writing"

        # Header: city, development name, price, rooms, area, address, district, sale status, detail URL
        self.f1 = open('new_house.csv', 'w', newline='')
        self.write1 = csv.writer(self.f1)
        self.write1.writerow(["城市", "小区名称", "价格", "几居",
                              "面积", "地址", "行政区", "是否在售", "详细url"])

        # Header: city, name, rooms, floor, orientation, year, address, area, total price, unit price, detail URL
        self.f2 = open('esf_house1.csv', 'w', newline='')
        self.write2 = csv.writer(self.f2)
        self.write2.writerow(["城市", "小区的名字", "几居", "层", "朝向",
                              "年代", "地址", "建筑面积", "总价", "单价", "详细的url"])

        # Header: city, title, rooms, area, price, address, transport, region, orientation
        self.f3 = open('rent_house.csv', 'w', newline='')
        self.write3 = csv.writer(self.f3)
        self.write3.writerow(['城市', '标题', '房间数', '平方数',
                              '价格', '地址', '交通描述', '区', '房间朝向'])

    def process_item(self, item, spider):
        print("正在写入...")  # "writing"
        if isinstance(item, NewHouseItem):
            self.write1.writerow([item['city'], item['name'], item['price'],
                                  item['rooms'], item['area'], item['address'],
                                  item['district'], item['sale'], item['origin_url']])
        elif isinstance(item, ESFHouseItem):
            # Some resale fields are only set when present on the page, so use get()
            self.write2.writerow([item.get(k, '') for k in (
                'city', 'name', 'rooms', 'floor', 'toward', 'year',
                'address', 'area', 'price', 'unit', 'origin_url')])
        elif isinstance(item, RenthousescrapyItem):
            self.write3.writerow([item['city'], item['title'], item['rooms'],
                                  item['area'], item['price'], item['address'],
                                  item['traffic'], item['region'], item['direction']])
        return item

    def close_spider(self, spider):
        print("写入完成...")  # "writing finished"
        self.f1.close()
        self.f2.close()
        self.f3.close()

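Note that when writing CSV, the default newline handling can produce an empty line after every row in the output (this typically happens on Windows). The file therefore has to be opened with newline='' so that the csv module controls the line endings.
[Image: CSV output]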
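
The effect of newline='' can be checked by looking at the raw bytes of a written file. The sketch below uses a temporary directory, a made-up two-column table and a hypothetical helper name, write_csv. The csv writer ends each row with \r\n by default; opening the file with newline='' stops Python from translating the \n in that pair a second time, which is what produces \r\r\n and the blank rows on Windows.

```python
import csv
import os
import tempfile


def write_csv(path, header, rows):
    """Write a header and rows, letting the csv module own the line endings."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


path = os.path.join(tempfile.mkdtemp(), "rent_house.csv")
write_csv(path, ["城市", "标题"], [["上海", "示例房源"]])

with open(path, "rb") as f:
    raw = f.read()
# Count proper row endings and doubled carriage returns
print(raw.count(b"\r\n"), raw.count(b"\r\r\n"))
```

The explicit encoding="utf-8" is an addition of this sketch so that the Chinese text is written the same way on every platform.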

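MongoDB database

CSV is easy to read and convenient to store and move around, and with a small data set it is easy to see the characteristics of the data. But once more than forty thousand rental records from seventeen cities sit in such a table, processing becomes unwieldy. Large data sets are therefore better stored in a database, which also makes it easy to append data.
MongoDB is a document-oriented NoSQL database, often described as sitting between relational and non-relational databases. A database can hold multiple collections, so the data for different cities can go into different collections, and unlike a relational database no table has to be created in advance. A collection stores records (documents); the records are independent of each other and may carry different fields, while each record is internally a set of key-value pairs. It combines advantages of both kinds of database and is easy to deploy and operate.
The pymongo package lets Python connect to MongoDB. Every rental record carries a city field, so a getCollection() function can be defined that returns the collection a given city should be stored in, and each record is then inserted into that city's collection with the collection's insert method (insert_one() in current PyMongo; the older insert() was removed in PyMongo 4). The code is as follows: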

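import pymongo

class mongodbPipeline(object):
    def open_spider(self, spider):
        self.client = pymongo.MongoClient(host='localhost', port=27017)
        print("打开数据库...")  # "opening database"

    def close_spider(self, spider):
        print('写入完毕，关闭数据库.')  # "writing finished, closing database"
        self.client.close()

    def process_item(self, item, spider):
        # One database per listing type
        if isinstance(item, NewHouseItem):
            db = self.client.new_house
        elif isinstance(item, ESFHouseItem):
            db = self.client.esf_house1
        elif isinstance(item, RenthousescrapyItem):
            db = self.client.rent_house
        else:
            # Unknown item type: pass it through untouched
            return item
        # getCollection() is the author's helper (not shown in the article)
        # that returns the collection for the item's city
        collection = getCollection(db, item['city'])
        # insert() was removed in PyMongo 4; insert_one() is the current call
        collection.insert_one(dict(item))
        print('正在写入...')  # "writing"
        return item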
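
The helper that picks a collection is not shown in the article, so the following is a guess at its simplest possible form: use the city name itself as the collection name. To keep the sketch runnable without a MongoDB server, FakeDatabase and FakeCollection are in-memory stand-ins written for this example; they mimic only the db[name] lookup and the insert_one() call that a pymongo database and collection provide. The names get_collection and save_item are hypothetical as well.

```python
class FakeCollection:
    """In-memory stand-in exposing the one method used here: insert_one()."""

    def __init__(self):
        self.docs = []

    def insert_one(self, doc):
        self.docs.append(doc)


class FakeDatabase(dict):
    """db[name] creates the collection on first access."""

    def __missing__(self, name):
        self[name] = FakeCollection()
        return self[name]


def get_collection(db, city):
    # Assumption: one collection per city, named after the city
    return db[city]


def save_item(db, item):
    """Store a copy of the item in the collection for its city."""
    get_collection(db, item["city"]).insert_one(dict(item))


db = FakeDatabase()
save_item(db, {"city": "上海", "title": "示例房源"})
save_item(db, {"city": "上海", "title": "示例房源二"})
save_item(db, {"city": "北京", "title": "示例房源三"})
print({city: len(coll.docs) for city, coll in db.items()})
```

With a real client, db would be one of the pymongo databases selected in process_item(), and the same two functions could be used unchanged.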

With the steps above, the data is saved into MongoDB collections, which essentially completes the storage of the rental data.
