Scraping Lianjia Second-Hand Housing Listings from Scratch and Saving Them to MongoDB and MySQL for Visual Analysis

啥都会一点的差不多先生

Last modified 2024-08-28 22:06:51


Categories: Web Scraping, python, Zero-Basics
Tags: artificial intelligence, python, plotly, matplotlib, pandas, pip

First published 2024-08-28 22:05:25

Article link: https://blog.csdn.net/yaokk1/article/details/141650982


Scraping Lianjia Second-Hand Housing Listings and Saving Them to MongoDB and MySQL for Visual Analysis

1. Environment Setup

Installing Dependencies

You need to install the following Python libraries:
- requests: sends HTTP requests. Install with: pip install requests
- chardet: detects the encoding of a web page. Install with: pip install chardet
- lxml: parses HTML. Install with: pip install lxml
- pymongo: talks to MongoDB. Install with: pip install pymongo
- pymysql: talks to MySQL. Install with: pip install pymysql
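
Before running the scraper, it can help to confirm that the five libraries above are importable. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and not part of the original code.

```python
from importlib.util import find_spec

# Import names of the libraries listed above.
REQUIRED = ["requests", "chardet", "lxml", "pymongo", "pymysql"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing, install with: pip install " + " ".join(missing))
    else:
        print("All dependencies are installed.")
```
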
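
Preparing MongoDB and MySQL

Make sure MongoDB and MySQL are installed and running in your local environment. In this example, MongoDB uses the default localhost:27017 connection, and MySQL uses the default localhost:3306 connection with user root, password root, and database house.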
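
Hard-coding root/root is fine for a local experiment, but the settings are easier to change when they live in one place. The sketch below is one possible approach, not the article's own code: load_db_settings and the HOUSE_* environment variable names are hypothetical, and the defaults are the values stated above.

```python
import os


def load_db_settings(env=None):
    """Build connection settings, falling back to the defaults used in this article."""
    env = os.environ if env is None else env
    return {
        "mongo": {
            "host": env.get("HOUSE_MONGO_HOST", "localhost"),
            "port": int(env.get("HOUSE_MONGO_PORT", "27017")),
        },
        "mysql": {
            "host": env.get("HOUSE_MYSQL_HOST", "localhost"),
            "port": int(env.get("HOUSE_MYSQL_PORT", "3306")),
            "user": env.get("HOUSE_MYSQL_USER", "root"),
            "password": env.get("HOUSE_MYSQL_PASSWORD", "root"),
            "db": env.get("HOUSE_MYSQL_DB", "house"),
        },
    }
```

The two sub-dicts use the same keyword names as the connection calls in the storage code, so they could be passed along as MongoClient(**settings["mongo"]) and pymysql.connect(**settings["mysql"]).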

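2. Implementation

We will walk through the code step by step.

Sending the Request and Parsing the List Page

In this part, we use the requests library to send an HTTP request to a Lianjia list page, fetch the page content, and use lxml to extract the link to each listing's detail page.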

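import chardet
import requests
from lxml import etree


def get_detail_url(url):
    retries = 10
    for i in range(retries):
        try:
            resp = requests.get(url, timeout=60)
            # Detect the page encoding, then parse the HTML with lxml.
            resp.encoding = chardet.detect(resp.content)['encoding']
            html = etree.HTML(resp.text)
            try:
                # Each listing title links to its detail page.
                url_lis = html.xpath('//div[@class="info clear"]/div[@class="title"]/a/@href')
                get_detail_resource(url_lis)
            except Exception as e:
                print(e)
            return  # The list page was fetched, so stop retrying.
        except requests.exceptions.RequestException as e:
            print(f'Request failed: {e}')
            if i < retries - 1:
                print(f'Retrying ({i + 1}/{retries})...')
                continue
            else:
                print(f'Failed after {retries} retries.')
                raise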

Here, an XPath expression locates the link to each listing's detail page, and get_detail_resource is then called to process those links further.
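
The retry loop above can also be separated from the HTTP call, which makes it testable without a network. This is a minimal sketch under stated assumptions: call_with_retries is a hypothetical helper, and the pause between attempts is an addition (the original retries immediately).

```python
import time


def call_with_retries(func, retries=10, delay=1.0, exceptions=(Exception,)):
    """Call func() until it succeeds or `retries` attempts have failed."""
    for attempt in range(1, retries + 1):
        try:
            return func()
        except exceptions as e:
            print(f"Attempt {attempt}/{retries} failed: {e}")
            if attempt == retries:
                raise  # Out of attempts: surface the last error.
            time.sleep(delay)
```

With requests, a call might look like call_with_retries(lambda: requests.get(url, timeout=60), exceptions=(requests.exceptions.RequestException,)), so that only network errors trigger a retry.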

Fetching Listing Details

For each detail page, we send another request to fetch the full listing information.

def get_detail_resource(url_list):
    for detail_url in url_list:
        detail_resp = requests.get(detail_url, timeout=30)
        # Let chardet guess the encoding from the raw bytes so that .text decodes correctly.
        detail_resp.encoding = chardet.detect(detail_resp.content)['encoding']
        get_detail_info(detail_resp)

This code simply iterates over all detail-page links and passes each response to get_detail_info, which parses the listing information.
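
Because a single failed request inside that loop would abort the whole batch, it may be worth isolating failures per URL and pausing between requests. The following is a minimal sketch, where process_all, handler, and delay are hypothetical names that do not appear in the original code.

```python
import time


def process_all(urls, handler, delay=1.0):
    """Call handler(url) for each URL; collect failures instead of stopping."""
    failed = []
    for url in urls:
        try:
            handler(url)
        except Exception as e:
            print(f"Skipping {url}: {e}")
            failed.append(url)
        time.sleep(delay)  # Pause between requests to avoid hammering the site.
    return failed
```

The returned list holds the URLs that raised an error, so they can be retried later or logged.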

Parsing Listing Information

Here we use XPath to extract the details of each listing and save them to CSV, MongoDB, and MySQL.

def get_detail_info(resource):
    tree = etree.HTML(resource.text)
    try:
        # Absolute XPaths tied to the page layout; [0] raises IndexError when nothing matches.
        title = tree.xpath('/html/body/div[3]/div/div/div[1]/h1/text()')[0]
        total_price = tree.xpath('/html/body/div[5]/div[2]/div[3]/div/span[1]/text()')[0]
        # The remaining fields (unit_price, size, tier, ...) are elided in the original.

        # Field dict passed to MongoDB and MySQL; its contents are elided in the original.
        data = {.....}

        save_to_csv((title, total_price, unit_price, size, tier, total_area, size_structure, area, type, direction,
                     build_structure, condition, rating, have_elevator, release_time, transaction_nature, last_relese,
                     application, limit_time, prossess_nature, pledge_info, note))

        save_to_mongo(data)
        save_to_mysql(data)
        print('ok!')
    except Exception as e:
        # Any parsing or storage error skips this listing.
        print(f'Error occurred: {e}')

This function not only parses the listing information but also saves it to three different storage targets through save_to_csv, save_to_mongo, and save_to_mysql.
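
Indexing an XPath result with [0] raises IndexError when nothing matches, and the broad except then drops the whole listing. One way to soften this is a small accessor that returns a default instead. This is a sketch rather than the author's approach; first_text is a hypothetical helper that works on any list, such as the one tree.xpath(...) returns for a text() query.

```python
def first_text(nodes, default=""):
    """Return the first item of an XPath result as stripped text, or `default`."""
    if not nodes:
        return default
    return str(nodes[0]).strip()
```

For example, title = first_text(tree.xpath('/html/body/div[3]/div/div/div[1]/h1/text()')) would yield an empty string for a page without a matching h1, and the other fields could still be saved.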

Data Storage

We define three functions, save_to_csv, save_to_mongo, and save_to_mysql, which store the data in a CSV file, MongoDB, and MySQL respectively.

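import csv

import pymysql
from pymongo import MongoClient


def save_to_csv(data, header=None):
    with open('二手房信息链家.csv', 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # Write the header once, while the file is still empty. The column
        # names are elided in the original, so they are passed in via `header`.
        if header and f.tell() == 0:
            writer.writerow(header)
        writer.writerow(data)
        print(data)


def save_to_mongo(data):
    client = MongoClient(host='localhost', port=27017)
    db = client['house']
    collection = db['houseInfo']
    # insert_one adds an '_id' key to the dict it receives, so insert a copy
    # to keep `data` unchanged for save_to_mysql.
    collection.insert_one(dict(data))
    client.close()


def save_to_mysql(data):
    conn = pymysql.connect(host='localhost', port=3306, user='root', password='root', db='house')
    try:
        cursor = conn.cursor()
        # The column list and VALUES placeholders are elided in the original.
        sql = "INSERT INTO houseInfo"
        cursor.execute(sql, data)
        conn.commit()
    finally:
        conn.close()  # Close the connection even if the insert fails.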
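
The INSERT statement above is shown without its column list. One way to complete it, assuming the table columns are named after the keys of data, is to generate the statement from the dict itself. This is a sketch, not the author's SQL: build_insert is a hypothetical helper, the column names in the example comment are made up, and it relies on the %(name)s placeholder style that pymysql accepts when execute receives a dict.

```python
def build_insert(table, data):
    """Build an INSERT statement with one named placeholder per dict key."""
    columns = ", ".join(data)
    placeholders = ", ".join(f"%({key})s" for key in data)
    return f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"


# Example with hypothetical column names:
# build_insert("houseInfo", {"title": "a", "total_price": "1"})
# returns "INSERT INTO houseInfo (title, total_price) VALUES (%(title)s, %(total_price)s)"
```

It could then be used as cursor.execute(build_insert('houseInfo', data), data). The values go through placeholders, but the table and column names are interpolated directly, so they should come from your own code and never from scraped text.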