【编程】逐行解析之爬取北上广深租房信息

最新推荐文章于 2021-01-06 15:14:09 发布

food_for_thought

最新推荐文章于 2021-01-06 15:14:09 发布

阅读量635

点赞数

分类专栏： python 文章标签：数据库 python mongodb

本文链接：https://blog.csdn.net/food_for_thought/article/details/105009215

版权

python 专栏收录该内容

19 篇文章 5 订阅

订阅专栏

俗话说读万卷书不如行万里路，提升编程能力的很好途径之一是自己去写，但当自己不会的时候去解析别人写的代码也是效率不错的学习途径。
本文将逐行解读大神Alfred数据室里的一项爬取链家租房网站的代码，因小弟并非计算机专业，只是略有兴趣，遗漏之处，必为不会，错误不足之处，请不吝指点。
完整代码在文末。
北上广深租房图鉴

本文主要接触mongodb数据库的使用，因为本次爬取的数据量很大，同时进一步学习class的使用。

文章目录

代码分析

import os
import re
import time
import requests
from pymongo import MongoClient
from info import rent_type, city_info

os 模块提供了用来处理文件和目录的函数。
re, requests 之前介绍过。
time 顾名思义。
pymongo 是用来连接 mongdb 数据库的，MongoDB 是目前最流行的 NoSQL 数据库之一，使用的数据类型 BSON（类似 JSON），建议不了解的同学打开链接认真学习。
info 是作者事先搜集好的北上广深区位信息，用来爬取时导入所需类别，可在文末查看具体内容。

class Rent(object):
    """
    初始化函数，获取租房类型（整租、合租）、要爬取的城市分区信息以及连接mongodb数据库
    """

如注释所言。

    def __init__(self):
        self.rent_type = rent_type
        self.city_info = city_info

        host = os.environ.get('MONGODB_HOST', '127.0.0.1')  # 本地数据库
        port = os.environ.get('MONGODB_PORT', '27017')  # 数据库端口
        mongo_url = 'mongodb://{}:{}'.format(host, port)
        mongo_db = os.environ.get('MONGODB_DATABASE', 'Lianjia')
        client = MongoClient(mongo_url)
        self.db = client[mongo_db]
        self.db['zufang'].create_index('m_url', unique=True)  # 以m端链接为主键进行去重

该函数主要用来创建mongo数据库。
首先引入之前搜集好的租房类别和城市信息。
os.environ.get（）是python中os模块获取环境变量的一个方法，os.environ是一个字典，是环境变量的字典。
MongoClient(mongo_url) 即连接上文的数据库。

    def get_data(self):
        """
        爬取不同租房类型、不同城市各区域的租房信息
        :return: None
        """
        for ty, type_code in self.rent_type.items():  # 整租、合租
            for city, info in self.city_info.items():  # 城市、城市各区的信息
                for dist, dist_py in info[2].items():  # 各区及其拼音
                    res_bc = requests.get('https://m.lianjia.com/chuzu/{}/zufang/{}/'.format(info[1], dist_py))
                    pa_bc = r"data-type=\"bizcircle\" data-key=\"(.*)\" class=\"oneline \">"
                    bc_list = re.findall(pa_bc, res_bc.text)
                    self._write_bc(bc_list)
                    bc_list = self._read_bc()  # 先爬取各区的商圈，最终以各区商圈来爬数据，如果按区爬，每区最多只能获得2000条数据

函数作用如注释所言。
self.rent_type.items() 打印出来的结果是dict_items([(‘整租’, 200600000001), (‘合租’, 200600000002)])，因此ty描述的是前面的整租、合租，type_code指的是后面的代码。
self.city_info.items() 同上，前者是城市，后者是具体信息。
我们要爬取的手机端链家网站网址还是比较简单的，各个类别造成网址的变化还是比较容易发现，即城市和各区拼音。
re.findall(pa_bc, res_bc.text)，对获取的网址后面加.text可以直接获取该网址的源码，re.findall即获得所有符合前面正则表达式设定的内容，并通过后面写的函数来存入并读取改内容。

if len(bc_list) > 0:
   for bc_name in bc_list:
   idx = 0
   has_more = 1
   while has_more:
         try:
             url = 'https://app.api.lianjia.com/Rentplat/v1/house/list?city_id={}&condition={}' \
                   '/rt{}&limit=30&offset={}&request_ts={}&scene=list'.format(info[0],
                                                                       bc_name,
                                                                       type_code,
                                                                       idx*30,
                                                                       int(time.time()))
             res = requests.get(url=url, timeout=10)
             print('成功爬取{}市{}-{}的{}第{}页数据！'.format(city, dist, bc_name, ty, idx+1))
             item = {'city': city, 'type': ty, 'dist': dist}
             self._parse_record(res.json()['data']['list'], item)

             total = res.json()['data']['total']
             idx += 1
             if total/30 <= idx:
               has_more = 0
             # time.sleep(random.random())
         except:
             print('链接访问不成功，正在重试！')
             time.sleep(5)

如果商圈的值长度大于0时，设定两个值，分析寻找网页api数据接口，这个地方的知识点不了解，不知道具体怎么做到的，望高手指点。
随后每次爬取一页就调用后面两个函数，当每页显示数量小于30后，就进行下一次循环，同时最好有休息时间，以防访问次数过多而被禁止访问。

    @staticmethod
    def _write_bc(bc_list):
        """
        把爬取的商圈写入txt，为了整个爬取过程更加可控
        :param bc_list: 商圈list
        :return: None
        """
        with open('bc_list.txt', 'w') as f:
            for bc in bc_list:
                f.write(bc + '\n')

之前的代码主要是解析json数据。
with open as 是用来打开文本的，前面的参数是所要打开的文件，后面的是以何种方式打开，w代表写入模式，其他模式参照附录2.

知识总结

with open('bc_list.txt', 'w') as f:  # 内置读写文本的方式

完整代码

import os
import re
import time
import requests
from pymongo import MongoClient
from info import rent_type, city_info


class Rent(object):
    """
    初始化函数，获取租房类型（整租、合租）、要爬取的城市分区信息以及连接mongodb数据库
    """
    def __init__(self):
        self.rent_type = rent_type
        self.city_info = city_info

        host = os.environ.get('MONGODB_HOST', '127.0.0.1')  # 本地数据库
        port = os.environ.get('MONGODB_PORT', '27017')  # 数据库端口
        mongo_url = 'mongodb://{}:{}'.format(host, port)
        mongo_db = os.environ.get('MONGODB_DATABASE', 'Lianjia')
        client = MongoClient(mongo_url)
        self.db = client[mongo_db]
        self.db['zufang'].create_index('m_url', unique=True)  # 以m端链接为主键进行去重

    def get_data(self):
        """
        爬取不同租房类型、不同城市各区域的租房信息
        :return: None
        """
        for ty, type_code in self.rent_type.items():  # 整租、合租
            for city, info in self.city_info.items():  # 城市、城市各区的信息
                for dist, dist_py in info[2].items():  # 各区及其拼音
                    res_bc = requests.get('https://m.lianjia.com/chuzu/{}/zufang/{}/'.format(info[1], dist_py))
                    pa_bc = r"data-type=\"bizcircle\" data-key=\"(.*)\" class=\"oneline \">"
                    bc_list = re.findall(pa_bc, res_bc.text)
                    self._write_bc(bc_list)
                    bc_list = self._read_bc()  # 先爬取各区的商圈，最终以各区商圈来爬数据，如果按区爬，每区最多只能获得2000条数据

                    if len(bc_list) > 0:
                        for bc_name in bc_list:
                            idx = 0
                            has_more = 1
                            while has_more:
                                try:
                                    url = 'https://app.api.lianjia.com/Rentplat/v1/house/list?city_id={}&condition={}' \
                                          '/rt{}&limit=30&offset={}&request_ts={}&scene=list'.format(info[0],
                                                                                                     bc_name,
                                                                                                     type_code,
                                                                                                     idx*30,
                                                                                                     int(time.time()))
                                    res = requests.get(url=url, timeout=10)
                                    print('成功爬取{}市{}-{}的{}第{}页数据！'.format(city, dist, bc_name, ty, idx+1))
                                    item = {'city': city, 'type': ty, 'dist': dist}
                                    self._parse_record(res.json()['data']['list'], item)

                                    total = res.json()['data']['total']
                                    idx += 1
                                    if total/30 <= idx:
                                        has_more = 0
                                    # time.sleep(random.random())
                                except:
                                    print('链接访问不成功，正在重试！')
                                    time.sleep(5)

    def _parse_record(self, data, item):
        """
        解析函数，用于解析爬回来的response的json数据
        :param data: 一个包含房源数据的列表
        :param item: 传递字典
        :return: None
        """
        if len(data) > 0:
            for rec in data:
                item['bedroom_num'] = rec.get('frame_bedroom_num')
                item['hall_num'] = rec.get('frame_hall_num')
                item['bathroom_num'] = rec.get('frame_bathroom_num')
                item['rent_area'] = rec.get('rent_area')
                item['house_title'] = rec.get('house_title')
                item['resblock_name'] = rec.get('resblock_name')
                item['bizcircle_name'] = rec.get('bizcircle_name')
                item['layout'] = rec.get('layout')
                item['rent_price_listing'] = rec.get('rent_price_listing')
                item['house_tag'] = self._parse_house_tags(rec.get('house_tags'))
                item['frame_orientation'] = rec.get('frame_orientation')
                item['m_url'] = rec.get('m_url')
                item['rent_price_unit'] = rec.get('rent_price_unit')

                try:
                    res2 = requests.get(item['m_url'], timeout=5)
                    pa_lon = r"longitude: '(.*)',"
                    pa_lat = r"latitude: '(.*)'"
                    pa_distance = r"<span class=\"fr\">(\d*)米</span>"
                    item['longitude'] = re.findall(pa_lon, res2.text)[0]
                    item['latitude'] = re.findall(pa_lat, res2.text)[0]
                    distance = re.findall(pa_distance, res2.text)
                    if len(distance) > 0:
                        item['distance'] = distance[0]
                    else:
                        item['distance'] = None
                except:
                    item['longitude'] = None
                    item['latitude'] = None
                    item['distance'] = None

                self.db['zufang'].update_one({'m_url': item['m_url']}, {'$set': item}, upsert=True)
                print('成功保存数据:{}!'.format(item))

    @staticmethod
    def _parse_house_tags(house_tag):
        """
        处理house_tags字段，相当于数据清洗
        :param house_tag: house_tags字段的数据
        :return: 处理后的house_tags
        """
        if len(house_tag) > 0:
            st = ''
            for tag in house_tag:
                st += tag.get('name') + ' '
            return st.strip()

    @staticmethod
    def _write_bc(bc_list):
        """
        把爬取的商圈写入txt，为了整个爬取过程更加可控
        :param bc_list: 商圈list
        :return: None
        """
        with open('bc_list.txt', 'w') as f:
            for bc in bc_list:
                f.write(bc+'\n')

    @staticmethod
    def _read_bc():
        """
        读入商圈
        :return: None
        """
        with open('bc_list.txt', 'r') as f:
            return [bc.strip() for bc in f.readlines()]


if __name__ == '__main__':
    rent = Rent()
    rent.get_data()

附录

info

rent_type = {'整租': 200600000001, '合租': 200600000002}

city_info = {'北京': [110000, 'bj', {'东城': 'dongcheng', '西城': 'xicheng', '朝阳': 'chaoyang', '海淀': 'haidian',
                                   '丰台': 'fengtai', '石景山': 'shijingshan', '通州': 'tongzhou', '昌平': 'changping',
                                   '大兴': 'daxing', '亦庄开发区': 'yizhuangkaifaqu', '顺义': 'shunyi', '房山': 'fangshan',
                                   '门头沟': 'mentougou', '平谷': 'pinggu', '怀柔': 'huairou', '密云': 'miyun',
                                   '延庆': 'yanqing'}],
             '上海': [310000, 'sh', {'静安': 'jingan', '徐汇': 'xuhui', '黄浦': 'huangpu', '长宁': 'changning',
                                   '普陀': 'putuo', '浦东': 'pudong', '宝山': 'baoshan', '闸北': 'zhabei',
                                   '虹口': 'hongkou','杨浦': 'yangpu', '闵行': 'minhang', '金山': 'jinshan',
                                   '嘉定': 'jiading','崇明': 'chongming', '奉贤': 'fengxian', '松江': 'songjiang',
                                   '青浦': 'qingpu'}],
             '广州': [440100, 'gz', {'天河': 'tianhe', '越秀': 'yuexiu', '荔湾': 'liwan', '海珠': 'haizhu', '番禺': 'panyu',
                                   '白云': 'baiyun', '黄埔': 'huangpu', '从化': 'conghua', '增城': 'zengcheng',
                                   '花都': 'huadu', '南沙': 'nansha'}],
             '深圳': [440300, 'sz', {'罗湖区': 'luohuqu', '福田区': 'futianqu', '南山区': 'nanshanqu',
                                   '盐田区': 'yantianqu', '宝安区': 'baoanqu', '龙岗区': 'longgangqu',
                                   '龙华区': 'longhuaqu', '光明区': 'guangmingqu', '坪山区': 'pingshanqu',
                                   '大鹏新区': 'dapengxinqu'}]}

附录2

w：以写方式打开，
a：以追加模式打开 (从 EOF 开始, 必要时创建新文件)
r+：以读写模式打开
w+：以读写模式打开 (参见 w )
a+：以读写模式打开 (参见 a )
rb：以二进制读模式打开
wb：以二进制写模式打开 (参见 w )
ab：以二进制追加模式打开 (参见 a )
rb+：以二进制读写模式打开 (参见 r+ )
wb+：以二进制读写模式打开 (参见 w+ )
ab+：以二进制读写模式打开 (参见 a+ )fp.read([size])