Python Crawler: Spring Recruitment Is Around the Corner. Are You Ready to Rent a Place?

Article link: https://blog.csdn.net/qq_42768234/article/details/105183022


1. Introduction

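I recently finished a data analysis project that looks at the hiring market, the job prospects for Python, and apartment rentals. The project is done but cannot be published yet. Since the rental analysis needed data, today I am sharing the rental crawler code first.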

2. Source Code

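Target site: 房天下 (fang.com)
Note: sending a request to a listing URL directly with requests does not return the data we want, because the site answers with a redirect page. We therefore first extract the real address from that page, then request the real address and parse the response.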
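
To make the redirect step concrete, here is a minimal sketch using only the standard library. It assumes the redirect page exposes the real address as the href of a link with class `btn-redir`, which is the selector the full code in this post relies on. The sample HTML and the names `RedirectLinkParser` and `find_redirect_url` are made up for illustration.

```python
from html.parser import HTMLParser


class RedirectLinkParser(HTMLParser):
    """Collect the href of every <a> whose class list contains 'btn-redir'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # A valueless attribute is reported as None, so fall back to "".
        classes = (attr_map.get("class") or "").split()
        if "btn-redir" in classes and attr_map.get("href"):
            self.links.append(attr_map["href"])


def find_redirect_url(html_text):
    """Return the first redirect link in the page, or None if there is none."""
    parser = RedirectLinkParser()
    parser.feed(html_text)
    return parser.links[0] if parser.links else None


# Hypothetical redirect page, invented for this example.
sample = (
    '<html><body>'
    '<a class="btn-redir" href="https://example.com/real-list/">continue</a>'
    '</body></html>'
)
print(find_redirect_url(sample))
```

Returning None instead of indexing `[0]` blindly lets the caller decide what to do when the site serves something other than the redirect page, for instance a CAPTCHA.
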
Usage:

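First, go to the target site (房天下), choose rentals, and set the filters you need, such as a metro line or a price range.
Then click through to page 2, note the total number of pages (or however many pages you want to crawl), and copy the URL (not the URL of page 1). For example, this is page 2 of the Shenzhen rental listings under 6000 yuan: https://sz.zu.fang.com/house1-j079/i32/ Notice the trailing part, /i32/. Clicking page 3 gives https://sz.zu.fang.com/house1-j079/i33/, which ends in /i33/. The pattern is clear: the number after /i3 is the page number. We can then substitute it using Python's %s string formatting, for example: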
```
url = "https://sz.zu.fang.com/house1-j034/i3%s/"
page_num = 10
print([url % x for x in range(1, page_num+1)])
```
The output is:

['https://sz.zu.fang.com/house1-j034/i31/',
'https://sz.zu.fang.com/house1-j034/i32/',
'https://sz.zu.fang.com/house1-j034/i33/',
'https://sz.zu.fang.com/house1-j034/i34/',
'https://sz.zu.fang.com/house1-j034/i35/',
'https://sz.zu.fang.com/house1-j034/i36/',
'https://sz.zu.fang.com/house1-j034/i37/',
'https://sz.zu.fang.com/house1-j034/i38/',
'https://sz.zu.fang.com/house1-j034/i39/',
'https://sz.zu.fang.com/house1-j034/i310/']

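So the page number in /i3**/ is replaced with %s, and a list comprehension builds the full URLs. Why go to this trouble? The site asks for a CAPTCHA as soon as a URL is malformed, so the URLs need to be exact. There are ways around this, but we do not need much data, so it is not worth reworking the code: just prepare the URL template and the page count yourself. The URL list could also be turned into a queue. Note that the site asks for a CAPTCHA after roughly 3000 records have been crawled from the same IP address. For example: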

https://sz.zu.fang.com/house1-j079/i32/ # original URL of page 2
https://sz.zu.fang.com/house1-j079/i3%s/ # modified format string
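
Since a malformed URL triggers a CAPTCHA, the manual edit can be replaced by a small helper that refuses anything not ending in /i3 followed by a page number. This is only a sketch: the function names `to_page_template` and `build_page_urls` are hypothetical, and it assumes the copied URL contains no other `%` characters.

```python
import re


def to_page_template(page_url):
    """Turn a copied listing URL such as .../i32/ into a template ending in /i3%s/."""
    template, count = re.subn(r"/i3\d+/$", "/i3%s/", page_url)
    if count != 1:
        # Fail early instead of sending a malformed URL to the site.
        raise ValueError("URL does not end with /i3<page>/: %r" % page_url)
    return template


def build_page_urls(page_url, page_num):
    """Build the URLs for pages 1..page_num from any copied page URL."""
    template = to_page_template(page_url)
    return [template % page for page in range(1, page_num + 1)]


urls = build_page_urls("https://sz.zu.fang.com/house1-j079/i32/", 3)
print(urls)
```

The resulting template can be passed as the url argument of the crawler class unchanged.
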

Once the URL and the page count are configured, the crawler can be run with the following code:

import re
import requests
import pandas as pd
from lxml import etree


class SpiderApartment(object):
    def __init__(self, file_path, url, page_num):
        self.file_path = file_path
        self.url = [url % x for x in range(1, page_num+1)]

    @staticmethod
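    def get_true_url(url):
        """
        Get the URL that the listing page redirects to.
        :param url: URL of a listing page
        :return: the real URL taken from the redirect link
        """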
        response = requests.get(url)
        result = etree.HTML(response.text)
        temp_url = result.xpath('//a[@class="btn-redir"]/@href')[0]
        return temp_url

    @staticmethod
    def get_title(etree_obj):
        title = etree_obj.xpath("//div['houseList xh-highlight']/dl//p['title'][1]/a")
        return [i.text for i in title]

    @staticmethod
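    def get_address(etree_obj):
        """
        Extract the detailed address of each listing.
        :param etree_obj: parsed HTML tree of a listing page
        :return: list of addresses, each made of three parts joined with "-"
        """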
        temp_list = list()
        result2 = etree_obj.xpath("//div['houseList xh-highlight']/dl//p['title'][3]//span")
        for x in range(0, len(result2), 3):
            temp_list.append("-".join([i.text for i in result2[x: x + 3]]))
        return temp_list

    @staticmethod
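    def get_line(etree_obj):
        """
        Get the metro line name of each listing.
        :param etree_obj: parsed HTML tree of a listing page
        :return: list of metro line descriptions
        """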
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p['title'][1]/span[@class='note subInfor']")
        return [str(i.text).replace("距", "") for i in result]

    @staticmethod
    def get_room_type(etree_obj):
        """
        Get the rental type of each listing.
        :param etree_obj: parsed HTML tree of a listing page
        :return: list of rental types
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='font15 mt12 bold']/text()[1]")
        return [re.sub("\n\t", "", i).strip() for i in result]

    @staticmethod
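    def get_room_scale(etree_obj):
        """
        Get the layout of each listing (number of bedrooms and living rooms).
        :param etree_obj: parsed HTML tree of a listing page
        :return: list of layout strings
        """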
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='font15 mt12 bold']/text()[2]")
        return [re.sub("\n\t", "", i).strip() for i in result]

    @staticmethod
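    def get_room_size(etree_obj):
        """
        Get the floor area of each listing.
        :param etree_obj: parsed HTML tree of a listing page
        :return: list of floor areas with the unit removed
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='font15 mt12 bold']/text()[3]")
        # Strip the unit, assuming the area text ends with "㎡".
        return [str(re.sub("\n\t", "", i).strip()).split("㎡")[0] for i in result]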

    @staticmethod
    def get_room_direction(etree_obj):
        """
        Get the orientation of each listing.
        :param etree_obj: parsed HTML tree of a listing page
        :return: list of orientations
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='font15 mt12 bold']/text()[4]")
        return [str(re.sub("\n\t", "", i).strip()).split("�")[0] for i in result]

    @staticmethod
    def get_price(etree_obj):
        """
        Get the price of each listing.
        :param etree_obj: parsed HTML tree of a listing page
        :return: list of prices as text
        """
        result = etree_obj.xpath("//div['houseList xh-highlight']/dl//p[@class='mt5 alingC']/span")
        return [i.text for i in result]

    def run(self):
        for x in self.url:
            print(x)
            response = requests.get(self.get_true_url(x))
            result = etree.HTML(response.text)
            df = pd.DataFrame()
            df["title"] = self.get_title(result)  # title
            df["address"] = self.get_address(result)  # address
            df["line"] = self.get_line(result)  # metro line
            df["room_type"] = self.get_room_type(result)  # rental type
            df["room_scale"] = self.get_room_scale(result)  # layout
            df["room_size"] = self.get_room_size(result)  # floor area
            # df["room_direction"] = self.get_room_direction(result)  # orientation
            df["price"] = self.get_price(result)  # price
            # mode="a" creates a missing file rather than raising FileNotFoundError,
            # so check for the file explicitly and write the header only once.
            try:
                with open(self.file_path, encoding="utf_8_sig"):
                    write_header = False
            except FileNotFoundError:
                write_header = True
            df.to_csv(self.file_path, encoding="utf_8_sig", mode="a", header=write_header)


if __name__ == '__main__':
    name = "./深圳.csv"
    url = "https://sz.zu.fang.com/house1-j034/i3%s/"
    page = 100
    spider = SpiderApartment(name, url, page)
    spider.run()
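
The remark about switching to a queue, together with the roughly 3000-record limit per IP, could be sketched as follows. The function `crawl_with_queue`, its parameters, and the delay between requests are assumptions made for illustration and are not part of the code above. `fetch_page` stands for any caller-supplied function that takes a URL and returns the records of that page.

```python
import time
from collections import deque


def crawl_with_queue(urls, fetch_page, max_records=3000, delay_seconds=1.0):
    """Pop URLs from a queue until it is empty or the record budget is reached.

    Returns the collected records and the URLs that were not visited, so a
    later run (for example from another IP) can pick up where this one stopped.
    """
    queue = deque(urls)
    records = []
    # The budget is checked before each page, so the last page may overshoot it.
    while queue and len(records) < max_records:
        url = queue.popleft()
        records.extend(fetch_page(url))
        if queue:
            time.sleep(delay_seconds)  # pause between requests
    return records, list(queue)


def fake_fetch(url):
    """Stand-in for a real page fetch: pretends every page holds two records."""
    return ["%s#%d" % (url, i) for i in range(2)]


done, remaining = crawl_with_queue(
    ["u1", "u2", "u3"], fake_fetch, max_records=4, delay_seconds=0
)
print(len(done), remaining)
```

With the real crawler, `fetch_page` would wrap the redirect lookup and the parsing done in `run`, and the leftover URLs can be saved for the next session.
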