[Python] Getting Lianjia Residential Community Price Data and Its POI Data with Python

陈修一

Last modified: 2022-03-23 23:00:26

Categories: Python, Big Data. Tags: web scraping, Lianjia, housing prices, python, big data

First published: 2020-09-07 16:02:33

Article link: https://blog.csdn.net/Leaze932822995/article/details/108449531

Contents

1 Introduction
2 Results
3 Analyzing the page
4 Code walkthrough
5 Complete code
6 Related articles

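1 Introduction

I had meant to publish this on CSDN first, but a junior schoolmate had been asking me to write something for his WeChat official account, so this post first appeared there. Now that I have some time, I am moving it back here. Feel free to follow his official account.
To begin, open the relevant page (the Lianjia community listings for Beijing).
Note: the code in this post is meant for scraping community housing prices for a single city. To scrape other information, adapt the code; the basic logic for getting data from Lianjia is much the same.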

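2 Results

This is the output I showed in my previous post, before I added the POI lookup and the CSV export.
[image]
Below is the exported result after the later improvements. Columns A to D come from Lianjia; columns E to H come from Baidu Maps.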

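3 Analyzing the page

Open the Lianjia community listings for Beijing in Chrome; for another city, just substitute that city's page.
First, look at what we want to scrape. This time I mainly collect the community name, its location, and the price.
[image]

On the page, press Ctrl+U to view the page source, then press Ctrl+F and search for a community name (any other field works too) to locate the relevant markup.
A quick look is enough to find everything we need (Lianjia is fairly easy to scrape...).
As the figure below shows, all the fields we need sit together within one list entry.
[image]
The Python code below fetches the full source text of the page using the requests library. The variable html holds the entire page source.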

    url = str(url)
    html = requests.get(url).text  # fetch the page source

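Next, copy the markup that surrounds each value we want and turn it into regular expressions (using the re library). The code below locates each value in html (the page source) by the markup before and after it, and stores the matches in the variables name, price, district, and bizcircle.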

    # Regular expressions
    # Community name
    name_rule = r'lianjia\.com/xiaoqu/[0-9]*/" target="_blank">(.*?)</a>'  # [0-9]* matches any number of digits; .*? is a non-greedy match
    name = re.findall(name_rule, html)
    # Price
    price_rule = r'<div class="totalPrice"><span>(.*?)</span>'
    price = re.findall(price_rule, html)
    # District the community belongs to
    district_rule = r'class="district" title=".*?">(.*?)</a>'
    district = re.findall(district_rule, html)
    # Business area the community belongs to
    bizcircle_rule = r'class="bizcircle" title=".*?">(.*?)</a>&nbsp'
    bizcircle = re.findall(bizcircle_rule, html)
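
To see what these patterns return without touching the network, here is a minimal sketch that runs them on a hand-written HTML fragment. The fragment is an assumption shaped to fit the patterns above; it is not a copy of Lianjia's real markup, and the community name and price in it are made up.

```python
import re

# Hypothetical fragment shaped like one list entry; not real Lianjia markup.
sample_html = (
    '<a href="https://bj.lianjia.com/xiaoqu/1111027377595/" target="_blank">Sample Garden</a>'
    '<a href="https://bj.lianjia.com/xiaoqu/chaoyang/" class="district" title="Chaoyang communities">Chaoyang</a>'
    '&nbsp;<a href="https://bj.lianjia.com/xiaoqu/wangjing/" class="bizcircle" title="Wangjing communities">Wangjing</a>&nbsp;'
    '<div class="totalPrice"><span>85000</span>'
)

name_rule = r'lianjia\.com/xiaoqu/[0-9]*/" target="_blank">(.*?)</a>'
price_rule = r'<div class="totalPrice"><span>(.*?)</span>'
district_rule = r'class="district" title=".*?">(.*?)</a>'
bizcircle_rule = r'class="bizcircle" title=".*?">(.*?)</a>&nbsp'

# re.findall returns a list with one captured group per match.
name = re.findall(name_rule, sample_html)
price = re.findall(price_rule, sample_html)
district = re.findall(district_rule, sample_html)
bizcircle = re.findall(bizcircle_rule, sample_html)

print(name, price, district, bizcircle)
```

Each variable is a list, so a page with 30 entries should give four lists of 30 items each. If the site's markup changes, the lists come back empty or with different lengths, which is worth checking before using them.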

That completes the basic analysis and the extraction code. What remains is to organize the extracted data and fetch pages in bulk.

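4 Code walkthrough

The page analysis is done and the core scraping approach is in place.
What still needs to be done:

1. Automate the scraper, for example by turning pages automatically. We can hardly stop at one page, since a single page seems to hold only 30 listings...
2. Combine the n scraped pages into one data set.
3. Look up POI data by community name.
4. Merge the POI data with the price data.
5. Write the price data and POI data to a CSV file.

Import the required libraries first.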

import requests
import re
import time
import csv
import datetime

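We start by implementing single-page scraping, based on the code from the page analysis, and wrap it in a function.
The function takes the URL of one page of a city's community price listings on Lianjia, reads the page source into the variable html, uses regular expressions to locate the values we need and store them in name, price, district, and bizcircle (each as a list), and then arranges them into a dictionary.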

# 1. Scrape one page and return its data as a dictionary
def get_housing_price(url):
    url = str(url)
    html = requests.get(url).text  # fetch the page source
    # Regular expressions
    # Community name
    name_rule = r'lianjia\.com/xiaoqu/[0-9]*/" target="_blank">(.*?)</a>'  # [0-9]* matches any number of digits; .*? is a non-greedy match
    name = re.findall(name_rule, html)
    # Price
    price_rule = r'<div class="totalPrice"><span>(.*?)</span>'
    price = re.findall(price_rule, html)
    # District the community belongs to
    district_rule = r'class="district" title=".*?">(.*?)</a>'
    district = re.findall(district_rule, html)
    # Business area the community belongs to
    bizcircle_rule = r'class="bizcircle" title=".*?">(.*?)</a>&nbsp'
    bizcircle = re.findall(bizcircle_rule, html)
    # Build a dictionary of community records
    housing_price_dict = {}
    if len(name) == len(price) == len(district) == len(bizcircle):
        for i in range(len(name)):
            infor = []  # list holding one community's fields
            if price[i] != '暂无':  # '暂无' means no price is listed; convert every other price to float
                floated = float(price[i])
            else:
                floated = '暂无'
            infor.append(name[i])
            infor.append(district[i])
            infor.append(bizcircle[i])
            infor.append(floated)
            housing_price_dict[str(url)+'-'+str(i)+'-'+name[i]] = infor  # key: page URL, index, and name
    else:
        print('Field counts do not match')
    return housing_price_dict
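
Storing either a float or the string '暂无' in the same field can make later numeric work awkward. Below is a minimal sketch of one alternative, assuming you would rather represent a missing price as None. `parse_price` is a hypothetical helper name, not part of the original script.

```python
def parse_price(raw):
    """Convert a scraped price string to float, or None when no price is listed."""
    raw = raw.strip()
    if raw == '暂无':  # the placeholder the listing shows when no price is available
        return None
    try:
        return float(raw)
    except ValueError:
        # Unexpected text (for example after a page layout change): treat as missing.
        return None


prices = [parse_price(p) for p in ['85000', '暂无', ' 43210 ']]
print(prices)
```

With None for missing values, the csv module writes an empty cell, and numeric code can filter with `p is not None` instead of comparing against a string.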

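The second step uses the function from step one to fetch n pages in a loop and combine the results into a single dictionary.
First, here is a function that merges two dictionaries; we will need it shortly.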

# 2.1. Merge two dictionaries
def merge_dict(dict1, dict2):
    merged = {**dict1, **dict2}
    return merged
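
One detail of `{**dict1, **dict2}` is worth spelling out: when both dictionaries contain the same key, the value from the second one wins. This is presumably why get_housing_price builds its keys from the page URL, the index, and the name rather than from the name alone. A small sketch with made-up entries (the short `pg1-0-...` keys are simplified stand-ins for the real keys):

```python
def merge_dict(dict1, dict2):
    return {**dict1, **dict2}


# Keyed by name only: two communities with the same name collide.
page1 = {'Sample Garden': [1.0]}
page2 = {'Sample Garden': [2.0]}
by_name = merge_dict(page1, page2)  # the second value overwrites the first

# Keyed by page, index, and name: both entries survive the merge.
page1_keyed = {'pg1-0-Sample Garden': [1.0]}
page2_keyed = {'pg2-0-Sample Garden': [2.0]}
by_full_key = merge_dict(page1_keyed, page2_keyed)

print(by_name)
print(by_full_key)
```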

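Now go back to the site and look at how paging works on the Lianjia community listings: appending pg plus a page number to the base URL is all it takes. So we loop over the page numbers, build the URL of every page between the given start page and end page, and merge each page's result with the dictionary-merging function. That gives automatic paging and a combined price dictionary.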

# 2.2. Merge the price dictionaries of all pages
def merge_price_dict(url, PageStart, PageEnd):
    initial = {}
    for pg in range(PageStart, PageEnd+1):  # from the start page to the end page, inclusive
        url_new = str(url).rstrip('/') + '/pg' + str(pg) + '/'  # page URL; rstrip avoids a double slash
        prices = get_housing_price(url_new)
        time.sleep(5)
        print(f'===================Fetched price data for page {pg}===================')
        initial = merge_dict(initial, prices)
    return initial
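
The URL-building step can be isolated and checked without any network access. A minimal sketch, where `page_urls` is a hypothetical helper name; it strips a trailing slash from the base URL so that the result never contains `//pg`.

```python
def page_urls(base_url, page_start, page_end):
    """Return the list-page URLs from page_start to page_end, inclusive."""
    base = str(base_url).rstrip('/')  # avoid a double slash before 'pg'
    return [f'{base}/pg{pg}/' for pg in range(page_start, page_end + 1)]


urls = page_urls('https://bj.lianjia.com/xiaoqu/', 1, 3)
for u in urls:
    print(u)
```

Printing the URLs before a long run is a cheap way to confirm the page range is the one you meant.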

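At this point the basic price information has been collected. The third step is to look up POI data by community name.
I use Baidu's API here. Amap (Gaode) or Tencent would work as well, but I have used Baidu the most, so that is the method I describe.
Before that, you need an ak key from the Baidu Maps Open Platform (http://lbsyun.baidu.com/): go to My Applications and create an application. I got mine long ago and no longer remember the exact steps; if you cannot find where to get one, a quick search on Baidu or Google will help.
With the ak in hand, you can query POI data through the Baidu Maps API, and the function below will work.
The function takes a POI keyword, Keyword, and a region, District, and searches within District for POIs related to Keyword. In some POI tasks we want n results for a keyword (for example, searching for "bank" within Beijing). Here, however, we are searching for one specific community, so the first result is almost certainly the right one (I choose to trust Baidu... mostly out of laziness). The function returns the name Name, the region Region, the longitude Longitude, and the latitude Latitude.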

# 3. Get POI data
def get_POI(Keyword, District):
    ak = 'YOUR_AK'  # replace with your own ak, as a quoted string
    url = f'http://api.map.baidu.com/place/v2/search?query={Keyword}&region={District}&page_size=20&page_num=0&output=json&ak={ak}'
    html = requests.get(url)    # send the request
    data = html.json()          # parse the response body as JSON
    item = data['results'][0]   # take the first element of results, which is probably the best match
    Name = item["name"]
    # Province = item['province']
    City = item['city']
    Area = item['area']
    # Address = item['address']
    Region = City + Area
    ###### Coordinates ######
    Longitude = item['location']['lng']  # longitude
    Latitude = item['location']['lat']   # latitude
    return Name, Region, Longitude, Latitude
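
get_POI raises an exception whenever results is empty or a field is missing. The sketch below shows one way to separate URL building from response parsing so that the parsing can be checked offline. `build_poi_url` and `first_poi` are hypothetical helper names. The sample payloads are hand-written, and the `status` field with 0 meaning success is an assumption about the response format that should be checked against the platform documentation.

```python
from urllib.parse import urlencode


def build_poi_url(keyword, region, ak):
    """Build the place-search URL with URL-encoded query parameters."""
    params = {
        'query': keyword, 'region': region,
        'page_size': 20, 'page_num': 0,
        'output': 'json', 'ak': ak,
    }
    return 'http://api.map.baidu.com/place/v2/search?' + urlencode(params)


def first_poi(data):
    """Return (name, region, lng, lat) for the first result, or None if there is none."""
    results = data.get('results') or []
    # Assumption: a 'status' of 0 signals success.
    if data.get('status') != 0 or not results:
        return None
    item = results[0]
    region = item.get('city', '') + item.get('area', '')
    loc = item['location']
    return item['name'], region, loc['lng'], loc['lat']


# Hand-written sample payloads; field names follow the ones used in get_POI.
sample_ok = {
    'status': 0,
    'results': [{
        'name': 'Sample Garden', 'city': '北京市', 'area': '朝阳区',
        'location': {'lng': 116.47, 'lat': 39.99},
    }],
}
sample_empty = {'status': 0, 'results': []}

print(first_poi(sample_ok))
print(first_poi(sample_empty))
```

Returning None for "nothing found" lets the caller decide whether to retry with a broader region, instead of relying on an IndexError.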

The fourth step adds the POI data to the price dictionary. This function ties all the functions above together.

# 4. Match price data with POI data and build the combined dictionary
def match_price_POI(url, PageStart, PageEnd, City):
    count = 1
    hp_infor = merge_price_dict(url, PageStart, PageEnd)  # price dictionary
    for key, value in hp_infor.items():  # iterate over the keys and values of the price dictionary
        if count % 50 == 0:
            print(f'===================Fetching POI record {count}===================')
            time.sleep(1)
        count += 1
        try:
            name, region, lon, lat = get_POI(value[0], value[1])  # search for the community name within its district
        except Exception:
            try:
                name, region, lon, lat = get_POI(value[0], str(City))  # fall back to the whole city
            except Exception:
                try:
                    name, region, lon, lat = get_POI(value[1]+value[0], str(City))  # district + name as the keyword
                except Exception:
                    print(f'!!!!!! POI lookup for {key} failed on every attempt !!!!!!')
                    name = ''
                    region = ''
                    lon = ''
                    lat = ''
        time.sleep(2)
        value.append(name)
        value.append(region)
        value.append(lon)
        value.append(lat)
        hp_infor[key] = value  # update the dictionary
    return hp_infor
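
The three nested try blocks express a simple idea: try progressively broader queries and give up with empty fields if all of them fail. A minimal sketch of the same idea as a loop, assuming the lookup function is passed in as a parameter so it can be replaced by a stand-in. `lookup_with_fallbacks` and `fake_lookup` are hypothetical names, and this is an alternative arrangement rather than the code used in this post.

```python
def lookup_with_fallbacks(name, district, city, lookup):
    """Try progressively broader queries; return four empty strings if all fail."""
    attempts = [
        (name, district),         # community name within its district
        (name, city),             # community name within the whole city
        (district + name, city),  # district + name as the keyword
    ]
    for keyword, region in attempts:
        try:
            return lookup(keyword, region)
        except Exception:
            continue  # this attempt failed; try the next one
    return '', '', '', ''


# A stand-in for get_POI that only succeeds for city-wide searches in 'Beijing'.
def fake_lookup(keyword, region):
    if region != 'Beijing':
        raise IndexError('no results')
    return keyword, region, 116.4, 39.9


found = lookup_with_fallbacks('Sample Garden', 'Chaoyang', 'Beijing', fake_lookup)
missing = lookup_with_fallbacks('Sample Garden', 'Chaoyang', 'Shanghai', fake_lookup)
print(found)
print(missing)
```

In the real script the call would pass get_POI as the `lookup` argument. Adding a fourth fallback then means adding one tuple to the list.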

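Finally, the last step, step five: write all the data to a CSV file. The benefit of a CSV file is that you do not need to scrape again later, and for geographic analysis or similar work you can import it straight into ArcGIS.
This function calls the step-four function and then writes the dictionary to CSV by iterating over it (writing CSV is simple, so I will not go into detail). To monitor the scraper, I import the datetime library to see how long a run takes.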

# 5. Write price data and POI data to CSV
def write_CSV(url, PageStart, PageEnd, City, FileName):  # url is the base list URL; FileName is the CSV file name without extension
    startime = datetime.datetime.now()
    print(f'Current time   {startime}')

    data = match_price_POI(url, PageStart, PageEnd, City)  # scrape and merge price and POI data
    # utf-8-sig lets Excel read the Chinese text correctly; the with block closes the file
    with open(f'{FileName}.csv', 'a', newline='', encoding='utf-8-sig') as Mycsv:
        csv_write = csv.writer(Mycsv, dialect='excel')
        title = ('Community', 'District', 'Business area', 'Price', 'POI name', 'POI region', 'Longitude', 'Latitude')  # header row
        csv_write.writerow(title)  # write the header
        count = 1
        for key, value in data.items():
            content = (value[0], value[1], value[2], value[3], value[4], value[5], value[6], value[7])
            csv_write.writerow(content)
            if count % 10 == 0:
                print(f'===================Writing price record {count}===================')
            count += 1
    print('Finished writing data')

    endtime = datetime.datetime.now()
    print(f'Current time   {endtime}')
    print('')
    print(f'Total time   {endtime - startime}')
    print('')
    print('')
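
The CSV-writing part can be tried on its own with an in-memory buffer, which makes it easy to check the output format before running a full scrape. A minimal sketch with one made-up record; `write_rows` and `HEADER` are hypothetical names, and the column labels are only an example.

```python
import csv
import io

HEADER = ('community', 'district', 'bizcircle', 'price',
          'poi_name', 'poi_region', 'lng', 'lat')


def write_rows(fileobj, records, write_header=True):
    """Write the merged records to an already opened text file object."""
    writer = csv.writer(fileobj, dialect='excel')
    if write_header:
        writer.writerow(HEADER)
    for value in records.values():
        writer.writerow(value[:8])  # the eight fields, in column order
    return len(records)


# One made-up record in the same shape as the merged dictionary.
records = {
    'pg1-0-Sample Garden': ['Sample Garden', 'Chaoyang', 'Wangjing', 85000.0,
                            'Sample Garden', '北京市朝阳区', 116.47, 39.99],
}

buffer = io.StringIO(newline='')
written = write_rows(buffer, records)
lines = buffer.getvalue().splitlines()
print(lines)

# With a real file, a context manager closes it even if an error occurs:
# with open('prices.csv', 'w', newline='', encoding='utf-8-sig') as f:
#     write_rows(f, records)
```

The `utf-8-sig` encoding in the commented lines writes a byte order mark, which helps Excel recognize the file as UTF-8. Passing `write_header=False` when appending to an existing file avoids repeating the header row.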

Last but not least, how to run it. As an example, take Beijing and scrape the price listings on pages 1 to 20.

write_CSV(r'https://bj.lianjia.com/xiaoqu/', 1, 20, '北京', '北京房价')

As another example, take Harbin and scrape the price listings on pages 50 to 60.

write_CSV(r'https://hrb.lianjia.com/xiaoqu/', 50, 60, '哈尔滨', '哈尔滨房价')

5 Complete code

# -*- coding: UTF-8 -*-
import requests
import json
import re
import time
import csv
import datetime




# 1. Scrape one page and return its data as a dictionary
def get_housing_price(url):
    url = str(url)
    html = requests.get(url).text  # fetch the page source
    # Regular expressions
    # Community name
    name_rule = r'lianjia\.com/xiaoqu/[0-9]*/" target="_blank">(.*?)</a>'  # [0-9]* matches any number of digits; .*? is a non-greedy match
    name = re.findall(name_rule, html)
    # Price
    price_rule = r'<div class="totalPrice"><span>(.*?)</span>'
    price = re.findall(price_rule, html)
    # District the community belongs to
    district_rule = r'class="district" title=".*?">(.*?)</a>'
    district = re.findall(district_rule, html)
    # Business area the community belongs to
    bizcircle_rule = r'class="bizcircle" title=".*?">(.*?)</a>&nbsp'
    bizcircle = re.findall(bizcircle_rule, html)
    # Build a dictionary of community records
    housing_price_dict = {}
    if len(name) == len(price) == len(district) == len(bizcircle):
        for i in range(len(name)):
            infor = []  # list holding one community's fields
            if price[i] != '暂无':  # '暂无' means no price is listed; convert every other price to float
                floated = float(price[i])
            else:
                floated = '暂无'
            infor.append(name[i])
            infor.append(district[i])
            infor.append(bizcircle[i])
            infor.append(floated)
            housing_price_dict[str(url)+'-'+str(i)+'-'+name[i]] = infor  # key: page URL, index, and name
    else:
        print('Field counts do not match')
    return housing_price_dict



# 2.1. Merge two dictionaries
def merge_dict(dict1, dict2):
    merged = {**dict1, **dict2}
    return merged


# 2.2. Merge the price dictionaries of all pages
def merge_price_dict(url, PageStart, PageEnd):
    initial = {}
    for pg in range(PageStart, PageEnd+1):  # from the start page to the end page, inclusive
        url_new = str(url).rstrip('/') + '/pg' + str(pg) + '/'  # page URL; rstrip avoids a double slash
        prices = get_housing_price(url_new)
        time.sleep(5)
        print(f'===================Fetched price data for page {pg}===================')
        initial = merge_dict(initial, prices)
    return initial



# 3. Get POI data
def get_POI(Keyword, District):
    ak = 'YOUR_AK'  # replace with your own ak, as a quoted string
    url = f'http://api.map.baidu.com/place/v2/search?query={Keyword}&region={District}&page_size=20&page_num=0&output=json&ak={ak}'
    html = requests.get(url)    # send the request
    data = html.json()          # parse the response body as JSON
    item = data['results'][0]   # take the first element of results, which is probably the best match
    Name = item["name"]
    # Province = item['province']
    City = item['city']
    Area = item['area']
    # Address = item['address']
    Region = City + Area
    ###### Coordinates ######
    Longitude = item['location']['lng']  # longitude
    Latitude = item['location']['lat']   # latitude
    return Name, Region, Longitude, Latitude



# 4. Match price data with POI data and build the combined dictionary
def match_price_POI(url, PageStart, PageEnd, City):
    count = 1
    hp_infor = merge_price_dict(url, PageStart, PageEnd)  # price dictionary
    for key, value in hp_infor.items():  # iterate over the keys and values of the price dictionary
        if count % 50 == 0:
            print(f'===================Fetching POI record {count}===================')
            time.sleep(1)
        count += 1
        try:
            name, region, lon, lat = get_POI(value[0], value[1])  # search for the community name within its district
        except Exception:
            try:
                name, region, lon, lat = get_POI(value[0], str(City))  # fall back to the whole city
            except Exception:
                try:
                    name, region, lon, lat = get_POI(value[1]+value[0], str(City))  # district + name as the keyword
                except Exception:
                    print(f'!!!!!! POI lookup for {key} failed on every attempt !!!!!!')
                    name = ''
                    region = ''
                    lon = ''
                    lat = ''
        time.sleep(2)
        value.append(name)
        value.append(region)
        value.append(lon)
        value.append(lat)
        hp_infor[key] = value  # update the dictionary
    return hp_infor



# 5. Write price data and POI data to CSV
def write_CSV(url, PageStart, PageEnd, City, FileName):  # url is the base list URL; FileName is the CSV file name without extension
    startime = datetime.datetime.now()
    print(f'Current time   {startime}')

    data = match_price_POI(url, PageStart, PageEnd, City)  # scrape and merge price and POI data
    # utf-8-sig lets Excel read the Chinese text correctly; the with block closes the file
    with open(f'{FileName}.csv', 'a', newline='', encoding='utf-8-sig') as Mycsv:
        csv_write = csv.writer(Mycsv, dialect='excel')
        title = ('Community', 'District', 'Business area', 'Price', 'POI name', 'POI region', 'Longitude', 'Latitude')  # header row
        csv_write.writerow(title)  # write the header
        count = 1
        for key, value in data.items():
            content = (value[0], value[1], value[2], value[3], value[4], value[5], value[6], value[7])
            csv_write.writerow(content)
            if count % 10 == 0:
                print(f'===================Writing price record {count}===================')
            count += 1
    print('Finished writing data')

    endtime = datetime.datetime.now()
    print(f'Current time   {endtime}')
    print('')
    print(f'Total time   {endtime - startime}')
    print('')
    print('')