Python in Practice: Scraping Second-Hand Housing Listings and Visualizing Them on a Map (Source Code Included)

Project Recap

In this project we use Python to scrape second-hand housing listings from the Lianjia listing site (ke.com) and visualize them on a map. Nanchang is the target city: the AMap (Gaode) geocoding API provides the longitude and latitude of each listing, and the Folium library plots the listings on a map.

The workflow covers web scraping, data cleaning, geocoding, and visualization. We first send HTTP requests to fetch the pages and parse the HTML to extract the fields we need. A thread pool then scrapes the result pages concurrently to speed things up. The scraped records are assembled into a Pandas DataFrame and cleaned, and finally the cleaned data is plotted on the map.

We also add logging to monitor the program's progress and make debugging easier, and we handle exceptions and user interruption so that the network session is closed properly and the data scraped so far is saved even when an error occurs or the run is cancelled.

Technical Summary

  1. Web scraping: use the requests library to send HTTP requests and fetch the pages, then use the lxml library with XPath expressions to extract the required fields from the HTML.

    session = requests.Session()
    response = session.get(f'https://{city}.ke.com{pathname}pg1/', cookies=cookies, headers=headers)
    html_text1 = etree.HTML(response.text)
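
     XPath expressions then pull the individual fields out of the parsed tree; for example, the listing titles can be extracted like this (a condensed sketch of what fetch_page_data does in the full listing below):

    ullist = html_text1.xpath('//ul[@class="sellListContent"]//li[@class="clear"]')
    titles = [li.xpath('.//div[@class="title"]/a/text()')[0] for li in ullist]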
  2. Concurrency: use ThreadPoolExecutor to create a thread pool and submit tasks to it for concurrent execution, which significantly speeds up the scraping.

    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(fetch_page_data, session, city, areaname, pathname, i, cookies, headers) for i in range(1, pageTotal + 1)]
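
     The results are gathered as the futures complete; pages that failed return None and are skipped (this mirrors getSinglePageInfo in the full listing below):

    for future in as_completed(futures):
        result = future.result()
        if not result:  # skip pages that failed to download or parse
            continue
        all_titles.extend(result["title"])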
  3. Data handling: use Pandas to assemble the scraped records into a DataFrame and clean them. The AMap geocoding API is used to look up the longitude and latitude of each listing, and the coordinates are added to the DataFrame.

    df = pd.DataFrame({
        '行政区域': [areaname] * len(all_titles),
        '名称': all_titles,
        '小区名': all_positions,
        '房屋信息': all_houses,
        '发布时间': all_follows,
        '总价(万)': all_totalPrices,
        '单价(元/平)': all_unitPrices,
        '地址': all_urls,
        '经度': lngs,
        '纬度': lats
    })
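
     Each 小区名 is geocoded through the AMap REST endpoint (https://restapi.amap.com/v3/geocode/geo), with results cached in a dict to avoid repeated requests. The cleaning step then converts prices to numbers and drops duplicates and rows without coordinates (taken from getSinglePageInfo in the full listing below):

    df['总价(万)'] = df['总价(万)'].astype(float)
    df = df.drop_duplicates(subset=['地址'], keep='first')
    df = df.dropna(subset=['经度', '纬度'])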
  4. Visualization: use Folium to plot the cleaned data on a map, creating a marker for each listing with a custom popup and tooltip.

    for index, row in df.iterrows():
        popup_html = f"""
        <div style="width:200px; font-size: 14px;">
            <strong>{row['小区名']}</strong><br>
            {row['名称']}<br>
            总价:{row['总价(万)']}万
        </div>
        """
        popup = folium.Popup(popup_html, max_width=300)
        folium.Marker(
            location=[float(row['纬度']), float(row['经度'])],
            popup=popup,
            tooltip=row['小区名'],  # text shown on hover
            icon=folium.Icon(color='blue')
        ).add_to(m)
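
     The base map itself is an AMap tile layer added on top of an empty folium.Map (see generate_map_amap in the full listing below):

    folium.TileLayer(
        tiles='http://webrd02.is.autonavi.com/appmaptile?lang=zh_cn&size=1&scale=1&style=7&x={x}&y={y}&z={z}&ltype=11',
        attr='高德地图',
        name='高德街道图',
        overlay=False,
        control=True,
    ).add_to(m)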
  5. Logging: use the logging library. A logger is created with two handlers, one writing to the console and the other to a file, and a shared format and level are configured so the logs are easy to read and analyze.

    logger = logging.getLogger(__name__)
    logger.setLevel(logging.INFO)  # set the log level
    fh = logging.FileHandler('web_scraping.log', encoding='utf-8')
    fh.setLevel(logging.INFO)
    ch = logging.StreamHandler()
    ch.setLevel(logging.INFO)
    formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    fh.setFormatter(formatter)
    ch.setFormatter(formatter)
    logger.addHandler(fh)
    logger.addHandler(ch)
  6. Exception handling: wrap the main steps in try-except and log any errors that occur. User interruption is also handled, so the network session is closed properly and the data scraped so far is saved even if the run is cancelled.

    try:
        logger.info(f"Start scraping data for {city}")
        getSalesData(city) 
        logger.info(f"Finish scraping data for {city}")
    except Exception as e:
        logger.error(f"Error occurred during scraping or map generation: {e}")

Complete code example:

import requests
from lxml import etree
import traceback
import requests.exceptions
import os
import folium
import time
import logging
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

# Create a logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)  # set the log level

# Create a handler that writes to a log file
fh = logging.FileHandler('web_scraping.log', encoding='utf-8')
fh.setLevel(logging.INFO)

# Create a handler that writes to the console
ch = logging.StreamHandler()
ch.setLevel(logging.INFO)

# Define the output format for both handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)

# Attach the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)

# Use the logger
logger.info('This is a log info')
logger.debug('Debugging')
logger.warning('Warning exists')
logger.info('Finish')

cookies = {
    'select_city': '360100',
    'lianjia_ssid': '75f5926e-b623-420d-9fbf-518a5d4d74a6',
    'lianjia_uuid': '0e30ff27-caba-4dda-98f1-757a94403866',
    'sajssdk_2015_cross_new_user': '1',
    'sensorsdata2015jssdkcross': '%7B%22distinct_id%22%3A%2218ea28c3da0420-044e7de481d7e7-4c657b58-921600-18ea28c3da1ea9%22%2C%22%24device_id%22%3A%2218ea28c3da0420-044e7de481d7e7-4c657b58-921600-18ea28c3da1ea9%22%2C%22props%22%3A%7B%22%24latest_traffic_source_type%22%3A%22%E7%9B%B4%E6%8E%A5%E6%B5%81%E9%87%8F%22%2C%22%24latest_referrer%22%3A%22%22%2C%22%24latest_referrer_host%22%3A%22%22%2C%22%24latest_search_keyword%22%3A%22%E6%9C%AA%E5%8F%96%E5%88%B0%E5%80%BC_%E7%9B%B4%E6%8E%A5%E6%89%93%E5%BC%80%22%7D%7D'
}
 
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'Connection': 'keep-alive',
    'Referer': 'https://nc.ke.com/ershoufang/',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/123.0.0.0 Safari/537.36 Edg/123.0.0.0',
    'sec-ch-ua': '"Microsoft Edge";v="123", "Not:A-Brand";v="8", "Chromium";v="123"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
}

geocode_cache = {}  # in-memory cache of geocoding results

def get_geocode_amap(address, city, key):
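    """Geocode an address with the AMap REST geocoding API.

    Results are cached in geocode_cache to avoid repeated requests;
    returns (lng, lat) on success and (None, None) on failure.
    """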
    if address in geocode_cache:  # check whether this address was already geocoded
        logger.info(f"Address {address} found in cache.")
        return geocode_cache[address]
    
    url = 'https://restapi.amap.com/v3/geocode/geo'
    params = {
        'address': address,
        'city': city,
        'output': 'json',
        'key': key
    }
    try:
        response = requests.get(url, params=params)
        logger.info(f"Sending request to {url} with params {params}")
        data = response.json()
        logger.info(f"Geocoding response for {address}: {data}") 
        if data['status'] == '1' and data['geocodes']:
            # keep only results located in the expected city (AMap returns the
            # Chinese city name, so we compare against '南昌市', i.e. Nanchang)
            filtered_geocodes = [geocode for geocode in data['geocodes'] if geocode['city'] == '南昌市']
            if filtered_geocodes:
                location = filtered_geocodes[0]['location'].split(',')
                lng, lat = location[0], location[1]  # longitude, latitude
                geocode_cache[address] = (lng, lat)  # cache the result
                logger.info(f"Geocoded {address} successfully: {lng}, {lat}")
                return lng, lat
            else:
                logger.warning(f"No matching results for {address} in the expected city.")
                return None, None
        else:
            logger.warning(f"Failed to geocode address {address}. Response: {data}")
            return None, None
    except Exception as e:
        logger.error(f"An error occurred while geocoding {address}: {e}\n{traceback.format_exc()}")
        return None, None

# Get the district names and their URL paths
def getAreasInfo(city):
    responseinit = requests.get(
        f'https://{city}.ke.com/ershoufang', cookies=cookies, headers=headers)
    html_text_init = etree.HTML(responseinit.text)
    districts = [z for z in zip(html_text_init.xpath('//a[@class=" CLICKDATA"]/text()'),
                                html_text_init.xpath('//a[@class=" CLICKDATA"]/@href'))]
    return districts
    
def fetch_page_data(session, city, areaname, pathname, page, cookies, headers):
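    """Fetch one result page for a district and extract the listing fields.

    Returns a dict of parallel lists (title, position, house, follow,
    totalPrice, unitPrice, url), or None if the request or parsing fails.
    """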
    url = f'https://{city}.ke.com{pathname}pg{page}/'
    try:
        logger.info(f"Fetching data from {url}")
        response = session.get(url, cookies=cookies, headers=headers, timeout=10)
        if response.status_code == 200:
            logger.info(f"Successfully fetched data for page {page} in {areaname}")
            html_text = etree.HTML(response.text)
    
            title = []
            position = []
            house = []
            follow = []
            totalPrice = []
            unitPrice = []
            url = []
            
            ullist = html_text.xpath('//ul[@class="sellListContent"]//li[@class="clear"]')
            for li in ullist:
                liChildren = li.getchildren()[1]
                # listing title
                title.append(liChildren.xpath('./div[@class="title"]/a/text()')[0])
                # listing URL
                url.append(liChildren.xpath('./div[@class="title"]/a/@href')[0])
                # community (小区) name
                position.append(liChildren.xpath('./div/div/div[@class="positionInfo"]/a/text()')[0])
                # house info
                houselis = liChildren.xpath('./div/div[@class="houseInfo"]/text()')
                house.append([x.replace('\n', '').replace(' ', '') for x in houselis][1])
                # follow info / posting time
                followlis = liChildren.xpath('./div/div[@class="followInfo"]/text()')
                follow.append([x.replace('\n', '').replace(' ', '') for x in followlis][1])
                # total price
                totalPrice.append(liChildren.xpath('./div/div[@class="priceInfo"]/div[@class="totalPrice totalPrice2"]/span/text()')[0].strip())
                # unit price
                unit_price_elements = liChildren.xpath('./div/div[@class="priceInfo"]/div[@class="unitPrice"]/span/text()')
                if unit_price_elements:
                    # only process when a unit price was found
                    unit_price_str = unit_price_elements[0].strip()
                    # strip '元/平' and commas, then convert to int
                    unit_price_int = int(unit_price_str.replace('元/平', '').replace(',', ''))
                    unitPrice.append(unit_price_int)
                else:
                    unitPrice.append(None) 

            return {
                'title': title,
                'position': position,
                'house': house,
                'follow': follow,
                'totalPrice': totalPrice,
                'unitPrice': unitPrice,
                'url': url
            }
        else:
            logger.error(f"Failed to fetch data from {url}. Status code: {response.status_code}")
    except requests.exceptions.RequestException as e:
        logger.error(f"Request to {url} failed due to network error: {e}\n{traceback.format_exc()}")
    except Exception as e:
        logger.error(f"An unexpected error occurred while fetching data from {url}: {e}\n{traceback.format_exc()}")

# Fetch, geocode, and assemble the listing data for one district
def getSinglePageInfo(session, city, areaname, pathname, geocode=True):  
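    """Concurrently scrape the first pageTotal result pages for one district,
    optionally geocode each 小区名 via AMap, and return a cleaned DataFrame."""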
    response = session.get(f'https://{city}.ke.com{pathname}pg1/', cookies=cookies, headers=headers)
    html_text1 = etree.HTML(response.text)
    pageInfo = html_text1.xpath('//div[@class="page-box house-lst-page-box"]/@page-data')
    pageTotal = 5  # number of result pages to fetch per district (pageInfo holds the site's own page count if needed)
    
    all_titles = []
    all_positions = []
    all_houses = []
    all_follows = []
    all_totalPrices = []
    all_unitPrices = []
    all_urls = []

    with ThreadPoolExecutor(max_workers=5) as executor:
        futures = [executor.submit(fetch_page_data, session, city, areaname, pathname, i, cookies, headers) for i in range(1, pageTotal + 1)]
        
        for future in as_completed(futures):
            result = future.result()
            if not result:  # skip pages that failed to download or parse
                continue
            all_titles.extend(result["title"])
            all_positions.extend(result["position"])
            all_houses.extend(result["house"])
            all_follows.extend(result["follow"])
            all_totalPrices.extend(result["totalPrice"])
            all_unitPrices.extend(result["unitPrice"])
            all_urls.extend(result["url"])

    if geocode:
        key = '高德地图API密钥'  # placeholder: replace with your own AMap Web-service API key
        lngs, lats = [], []
        for p in all_positions:
            lng, lat = get_geocode_amap(p + "小区", city, key)
            lngs.append(lng)
            lats.append(lat)
    else:
        lngs, lats = ['未知'] * len(all_positions), ['未知'] * len(all_positions)

    df = pd.DataFrame({
        '行政区域': [areaname] * len(all_titles),
        '名称': all_titles,
        '小区名': all_positions,
        '房屋信息': all_houses,
        '发布时间': all_follows,
        '总价(万)': all_totalPrices,
        '单价(元/平)': all_unitPrices,
        '地址': all_urls,
        '经度': lngs,
        '纬度': lats
    })

    # Clean and convert the data
    df['名称'] = df['名称'].str.strip()
    df['小区名'] = df['小区名'].str.strip()
    df['房屋信息'] = df['房屋信息'].str.replace('\n', '').str.replace(' ', '')
    df['发布时间'] = df['发布时间'].str.replace('\n', '').str.replace(' ', '')
    df['总价(万)'] = df['总价(万)'].astype(float)
    df = df.drop_duplicates(subset=['地址'], keep='first')
    df = df.dropna(subset=['经度', '纬度'])
    logger.info("Dropped duplicates based on column '地址'")
    return df

def get_batch_data(session, city, districts, geocode=True):
    # Check whether a history file already exists and load it if so
    if os.path.exists(f'{city}_data.csv'):
        history_data = pd.read_csv(f'{city}_data.csv')
    else:
        history_data = pd.DataFrame()

    for district in districts:
        logger.info(f"Start scraping {district[0]}")
        retries = 3
        while retries > 0:
            try:
                # getSinglePageInfo already fetches the first pageTotal result pages
                # concurrently, so the district path is passed through unchanged
                dfInfo = getSinglePageInfo(session, city, district[0], district[1], geocode=geocode)
                # Merge the new data with the history
                history_data = pd.concat([history_data, dfInfo])
                # Drop duplicate listings
                history_data.drop_duplicates(subset=['地址'], keep='first', inplace=True)
                # Persist to file
                history_data.to_csv(f'{city}_data.csv', index=False, encoding='utf-8-sig')
                break  # data fetched successfully, stop retrying
            except Exception as e:
                logger.error(f"Error fetching {city}, {district[0]}: {e}")
                retries -= 1
                sleep_time = 2 ** (3 - retries)  # exponential backoff
                logger.info(f"Retrying after {sleep_time} seconds...")
                time.sleep(sleep_time)
    

def getSalesData(city):
    session = requests.Session()
    try:
        districts = getAreasInfo(city)
        for district in districts:
            get_batch_data(session, city, [district])
    except KeyboardInterrupt:
        logger.info("User interrupted the process. Exiting gracefully.")
    finally:
        session.close()  # make sure the Session is closed properly
        logger.info("Session closed.")

def generate_map_amap(df):
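    """Plot every geocoded listing as a Folium marker on an AMap tile layer and save the map to map.html."""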
    map_center = [28.686295, 115.915317]
    m = folium.Map(location=map_center, zoom_start=12, tiles=None)

    # Add AMap (Gaode) tiles as the base layer
    folium.TileLayer(
        tiles='http://webrd02.is.autonavi.com/appmaptile?lang=zh_cn&size=1&scale=1&style=7&x={x}&y={y}&z={z}&ltype=11',
        attr='高德地图',
        name='高德街道图',
        overlay=False,
        control=True,
    ).add_to(m)

    # Drop rows whose coordinates could not be geocoded
    df = df[(df['经度'] != '未知') & (df['纬度'] != '未知')]

    for index, row in df.iterrows():
        # Create a marker with a custom popup
        popup_html = f"""
        <div style="width:200px; font-size: 14px;">
            <strong>{row['小区名']}</strong><br>
            {row['名称']}<br>
            总价:{row['总价(万)']}万
        </div>
        """
        popup = folium.Popup(popup_html, max_width=300)
        folium.Marker(
            location=[float(row['纬度']), float(row['经度'])],
            popup=popup,
            tooltip=row['小区名'],  # text shown on hover
            icon=folium.Icon(color='blue')
        ).add_to(m)

    # Add a layer control
    folium.LayerControl().add_to(m)

    # Save the map as an HTML file
    m.save('map.html')

# Main entry point: scrape the data, then call generate_map_amap
if __name__ == '__main__':
    city = 'nc'
    try:
        logger.info(f"Start scraping data for {city}")
        getSalesData(city)  # use the city variable rather than a hard-coded string
        logger.info(f"Finish scraping data for {city}")

        if os.path.exists(f'{city}_data.csv'):
            df = pd.read_csv(f'{city}_data.csv')
            generate_map_amap(df)  # generate the map
            logger.info("Map has been generated successfully.")
    except Exception as e:
        logger.error(f"Error occurred during scraping or map generation: {e}")
