Analyzing Malaysian House Price Data with Clustering and Correlation Analysis

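Side note: I have been very busy lately, so updates have been slow. Happy Lunar New Year in advance, and best wishes to everyone! The dataset is available at its download address, and readers can download it themselves.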

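1. Project Background

This project is a first-pass data analysis of the Malaysian real estate market, exploring the characteristics of the property market in each state. It visualizes indicators such as median property price, price per square foot, and transaction counts, and combines them with cluster analysis and point-biserial correlation analysis to try to uncover how different property types relate to market trends. The analysis helps build a basic understanding of the market and provides data support for follow-up research or decisions.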

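2. Data Description

| Field | Description |
| --- | --- |
| Township | Name of the township where the property is located |
| Area | Name of the area where the property is located |
| State | State where the property is located |
| Tenure | Land tenure (e.g. Freehold or Leasehold) |
| Type | Property type (e.g. Terrace House, Cluster House) |
| Median_Price | Median property price (in Malaysian ringgit) |
| Median_PSF | Median price per square foot (in Malaysian ringgit) |
| Transactions | Number of property transactions in the area |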

3. Importing Python Libraries and Reading the Data

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from scipy.stats import pointbiserialr
data = pd.read_csv("/home/mw/input/01209371/malaysia_house_price_data_2025.csv")

4. Data Preview and Cleaning

data.head()
                      Township           Area   State     Tenure                          Type  Median_Price  Median_PSF  Transactions
0          SCIENTEX SUNGAI DUA  Tasek Gelugor  Penang   Freehold                 Terrace House      331800.0       304.0           593
1                 BANDAR PUTRA          Kulai   Johor   Freehold  Cluster House, Terrace House      590900.0       322.0           519
2  TAMAN LAGENDA TROPIKA TAPAH    Chenderiang   Perak   Freehold                 Terrace House      229954.0       130.0           414
3       SCIENTEX JASIN MUTIARA         Bemban  Melaka   Freehold                 Terrace House      255600.0       218.0           391
4           TAMAN LAGENDA AMAN          Tapah   Perak  Leasehold                 Terrace House      219300.0       168.0           363
print('Data info:')
data.info()
Data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Township      2000 non-null   object 
 1   Area          2000 non-null   object 
 2   State         2000 non-null   object 
 3   Tenure        2000 non-null   object 
 4   Type          2000 non-null   object 
 5   Median_Price  2000 non-null   float64
 6   Median_PSF    2000 non-null   float64
 7   Transactions  2000 non-null   int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 125.1+ KB
characteristic = data.select_dtypes(include=['object']).columns
print('Unique values of the categorical variables:')
for i in characteristic:
    print(f'{i}:')
    print(f'Unique values: {len(data[i].unique())}')
    print(data[i].unique())
    print('-'*50)
Unique values of the categorical variables:
Township:
Unique values: 1946
['SCIENTEX SUNGAI DUA' 'BANDAR PUTRA' 'TAMAN LAGENDA TROPIKA TAPAH' ...
 'TAMAN PUNCAK JELAPANG MAJU' 'TAMAN TIONG UNG SIEW'
 'TAMAN DESA RISHAH INDAH']
--------------------------------------------------
Area:
Unique values: 303
['Tasek Gelugor' 'Kulai' 'Chenderiang' 'Bemban' 'Tapah' 'Tebrau'
 'Pasir Gudang' 'Teluk Intan' 'Jasin' 'Kapar' 'Iskandar Puteri (Nusajaya)'
 'Nilai' 'Skudai' 'Johor Bahru' 'Batu Pahat' 'Kluang' 'Ulu Tiram'
 'Permas Jaya' 'Serendah' 'Chemor' 'Pagoh' 'Lenggeng' 'Perling' 'Penaga'
 'Klang' 'Sungei Petai' 'Shah Alam' 'Padang Serai' 'Sitiawan' 'Ijok'
 'Bukit Katil' 'Masai' 'Tampoi' 'Setia Alam' 'Kota Tinggi' 'Banting'
 'Cheras' 'Tanjong Duabelas' 'Jalan Klang Lama (Old Klang Road)'
 'Bukit Bintang' 'Rawang' 'Bandar Utama' 'Seri Iskandar' 'Ipoh' 'Senai'
 'Tanjong Minyak' 'Labu' 'Jimah' 'Sungai Bakap' 'Batu Caves'
 'Damansara Perdana' 'Kuantan' 'Kampar' 'Seremban 2' 'Pasir Panjang'
 'Kijal' 'Segamat' 'Seri Kembangan' 'Rantau' 'Gurun' 'Dengkil'
 'Subang Jaya' 'Sungai Petani' 'Puchong' 'Kajang' 'Lunas' 'Gopeng'
 'Kepala Batas' 'Bandar Puncak Alam' 'Bangi' 'Bandar Sri Damansara'
 'Kuala Pilah' 'Linggi' 'Bandar Sunway' 'Telok Panglima Garang' 'Beranang'
 'Taman Tun Dr Ismail' 'Kamunting' 'Bandar Sungai Long' 'Sri Petaling'
 'Pengerang' 'Jalan Kuching' 'Batu Arang' 'Kota Samarahan' 'Miri'
 'Semenyih' 'Petaling Jaya' 'Simpang Ampat' 'Krubong' 'Taiping'
 'Kota Damansara' 'Nibong Tebal' 'Bukit Mertajam' 'Wangsa Maju' 'Papar'
 'Telok Kemang' 'Hulu Terengganu' 'Lumut' 'Merlimau' 'Ulu Klang'
 'Damansara Heights' 'Seremban' 'Sabak Bernam' 'Parit Buntar'
 'Seberang Jaya' 'Kepong' 'Bandar Kinrara' 'Gerisek' 'Tampin' 'Ayer Itam'
 'Bahau' 'KLCC' 'Lukut' 'Batang Kali' 'Sungai Jawi' 'Selayang'
 'Simpang Pulai' 'Batu Gajah' 'Sepang' 'Kuching' 'Kulim' 'Kerteh' 'Ampang'
 'Ayer Molek' 'Bidor' 'Bayan Baru' 'Senggarang' 'Ampangan' 'Gelang Patah'
 'Sungai Siput' 'Kota Kinabalu' 'Penampang' 'Ara Damansara' 'Segambut'
 'Tawau' 'Bukit Rambai' 'Rasa' 'Simpang Rengam' 'Duyong' 'Mont Kiara'
 'Durian Tunggal' 'Sibu' 'Alor Setar' 'Bachang' 'Balai Panjang'
 'Hulu Lepar' 'Saujana Utama' 'Bukit Jalil' 'Georgetown' 'Juasseh'
 'Sungai Rambai' 'Arau' 'Jelutong' 'Perai' 'Sungai Karang' 'Ulu Bernam'
 'Paya Rumput' 'Sandakan' 'Tronoh' 'Kuala Terengganu' 'Setapak' 'Rasah'
 'Kuala Selangor' 'Tanjong Tualang' 'Raub' 'Cyberjaya' 'Butterworth'
 'Port Klang' 'Menglembu' 'Cheng' 'Sungai Buloh' 'Kuala Ibai'
 'Bandar Sri Sendayan' 'Bayan Lepas' 'Bandar Tasik Selatan' 'Labuan'
 'Senawang' 'Bukit Baru' 'Tanjong Tokong'
 'Kampung Kerinchi (Bangsar South)' 'Kuala Kubu Baru' 'Bandar Enstek'
 'Batu Berendam' 'Alor Gajah' 'Pandamaran' 'Sungai Ara' 'Dutamas'
 'Sikamat' 'Tambun' 'Glenmarie' 'Muar' 'Mentakab' 'Pusing' 'Jenjarom'
 'Pontian' 'Sungai Dua' 'Kuala Lipis' 'Batu Kawan' 'Lahat' 'Besut'
 'Bukit Kepayang' 'Kuala Kedah' 'Port Dickson' 'KL Sentral' 'Sentul'
 'Kuala Kangsar' 'Tangkak' 'Klebang' 'Kuchai Lama' 'Seri Manjong'
 'Melaka City' 'Gelugor' 'Mantin' 'Bintulu' 'Hutan Melintang' 'Bangsar'
 'Bagan Serai' 'Masjid Tanah' 'Balakong' 'Tuaran' 'Batu Ferringhi'
 'Rompin' 'Tropicana' 'Bandar Menjalara' 'Bakri' 'Putrajaya' 'Taman Desa'
 'Bukit Jambul' 'Desa Petaling' 'Kemaman' 'Kuala Ketil' 'Merbok'
 'Sungai Lalang' 'Sungai Besi' 'Relau' 'Tanjung Bungah' 'Damansara Damai'
 'Bedong' 'Jitra' 'Bentong' 'Teloi Kiri' 'Jementah' 'Paloh' 'Setiawangsa'
 'Kubang Semang' 'Pantai' 'Genting Highlands' 'Jalan Ipoh' 'Bukit Minyak'
 'Padang Enggang' 'Teras' 'Salak Selatan' 'Lahad Datu' 'Dungun'
 'Sungai Udang' 'Triang' 'Desa ParkCity' 'KL City' 'Hulu Langat' 'Gurney'
 'Kelana Jaya' 'Pengkalan Hulu' 'Kuah' 'Kota Marudu' 'Padang Rengas'
 'Yong Peng' 'Tanjong Kling' 'Permatang Pauh' 'Jinjang' 'Labis' 'Umbai'
 'Cherang Ruku' 'Saujana' 'Pekan' 'Simpang Pertang' 'Limbang' 'Paroi'
 'Hulu Selangor' 'Brickfields' 'Balik Pulau' 'Bertam' 'Batu Kurau'
 'Chenor' 'Ulu Langat' 'Simpang' 'Mersing' 'Juru' 'Pokok Sena' 'Rembau'
 'Mutiara Damansara' 'Sungei Baru Tengah' 'Ampang Hilir' 'City Centre'
 'Sri Hartamas' 'Kota Sarang Semut' 'Teluk Kumbar' 'Bandar Baharu' 'Gemas'
 'Sungai Besar' 'Gerik' 'Sri Aman' 'Kuala Sungai Baru' 'Machang']
--------------------------------------------------
State:
Unique values: 16
['Penang' 'Johor' 'Perak' 'Melaka' 'Selangor' 'Negeri Sembilan' 'Kedah'
 'Kuala Lumpur' 'Pahang' 'Terengganu' 'Sarawak' 'Sabah' 'Perlis' 'Labuan'
 'Putrajaya' 'Kelantan']
--------------------------------------------------
Tenure:
Unique values: 4
['Freehold' 'Leasehold' 'Freehold, Leasehold' 'Leasehold, Freehold']
--------------------------------------------------
Type:
Unique values: 46
['Terrace House' 'Cluster House, Terrace House' 'Terrace House, Semi D'
 'Terrace House, Cluster House' 'Cluster House' 'Bungalow, Terrace House'
 'Service Residence' 'Semi D, Terrace House' 'Flat'
 'Cluster House, Semi D' 'Terrace House, Semi D, Town House'
 'Semi D, Cluster House' 'Apartment' 'Semi D, Terrace House, Bungalow'
 'Semi D' 'Terrace House, Town House' 'Semi D, Town House, Terrace House'
 'Condominium' 'Semi D, Bungalow' 'Bungalow, Semi D' 'Bungalow'
 'Town House, Terrace House, Semi D' 'Terrace House, Semi D, Bungalow'
 'Bungalow, Terrace House, Semi D' 'Terrace House, Bungalow'
 'Cluster House, Terrace House, Semi D'
 'Semi D, Terrace House, Cluster House' 'Semi D, Cluster House, Bungalow'
 'Bungalow, Town House' 'Town House, Terrace House' 'Town House, Semi D'
 'Terrace House, Bungalow, Semi D' 'Town House' 'Flat, Condominium'
 'Town House, Bungalow, Terrace House' 'Apartment, Flat'
 'Semi D, Bungalow, Terrace House' 'Town House, Semi D, Terrace House'
 'Town House, Bungalow' 'Condominium, Service Residence'
 'Cluster House, Bungalow' 'Flat, Apartment'
 'Bungalow, Semi D, Terrace House' 'Semi D, Terrace House, Town House'
 'Cluster House, Town House, Terrace House' 'Semi D, Town House']
--------------------------------------------------

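Township and Area have so many unique values (1946 and 303 respectively, out of 2000 rows) that they add little to the later analysis, so I plan to drop them and keep only State as the geographic information. The Type column contains many combined values, such as 'Semi D, Terrace House', so let's split them and see how many distinct types are actually involved.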
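The first half of that plan, dropping Township and Area while keeping State, might look like the sketch below. It runs on a small toy frame (`sample`) rather than the real dataset, and the helper name `drop_high_cardinality` is an assumption made for illustration, not something taken from the original notebook.

```python
import pandas as pd

# Toy rows that mimic the dataset's layout (values copied from the preview above).
sample = pd.DataFrame({
    "Township": ["SCIENTEX SUNGAI DUA", "BANDAR PUTRA"],
    "Area": ["Tasek Gelugor", "Kulai"],
    "State": ["Penang", "Johor"],
    "Median_Price": [331800.0, 590900.0],
})

def drop_high_cardinality(df, columns=("Township", "Area")):
    """Return a copy of df without the given columns (hypothetical helper)."""
    # errors="ignore" lets the call succeed even if a column is already gone.
    return df.drop(columns=list(columns), errors="ignore")

reduced = drop_high_cardinality(sample)
print(reduced.columns.tolist())
```

Because `drop` returns a new frame, `sample` itself is left untouched, which makes it easy to compare the data before and after.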

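# Collect every property type, splitting combined entries into individual labels
all_types = []

# Split each Type value on ", "
for property_type in data['Type']:
    types = property_type.split(", ")  # split on comma plus space
    all_types.extend(types)

# Deduplicate to get the set of distinct types
unique_types = set(all_types)

# Show all distinct property types
unique_types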
{'Apartment',
 'Bungalow',
 'Cluster House',
 'Condominium',
 'Flat',
 'Semi D',
 'Service Residence',
 'Terrace House',
 'Town House'}
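The set above only tells us which types exist, not how often each one appears. As a rough sketch, the frequencies could be tallied with `collections.Counter`. The `type_col` series here is a toy stand-in for `data['Type']`, built from a few of the values listed earlier, so the counts it prints are not the real dataset's counts.

```python
from collections import Counter

import pandas as pd

# Toy Type column; the strings are taken from the unique values listed above.
type_col = pd.Series([
    "Terrace House",
    "Cluster House, Terrace House",
    "Semi D, Terrace House",
    "Condominium",
])

# Split each entry on ", " and tally the individual labels.
type_counts = Counter(
    label for entry in type_col for label in entry.split(", ")
)
print(type_counts.most_common())
```

Running the same tally on the full column would show which of the nine types dominate, which is useful context before encoding them.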

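As you can see, there are only nine real types. I convert them into 0/1 binary variables that indicate whether a record includes a given type, which avoids the dimensionality explosion that encoding all 46 combined values would cause. The Tenure feature gets the same treatment.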

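# Create a 0/1 indicator column for each property type
for property_type in unique_types:
    # NOTE: the source text is cut off after "lambda x:"; the membership test
    # below is an assumed completion based on the description above.
    data[property_type] = data['Type'].apply(lambda x: 1 if property_type in x.split(", ") else 0)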
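The text says Tenure is handled the same way. One possible way to do that, sketched here on a toy series rather than the real column, is pandas' `Series.str.get_dummies`, which splits on a separator and emits one 0/1 column per label. This is an assumed alternative to the loop-based approach, not necessarily what the original notebook does.

```python
import pandas as pd

# Toy Tenure column covering the four raw values seen in the dataset.
tenure = pd.Series([
    "Freehold",
    "Leasehold",
    "Freehold, Leasehold",
    "Leasehold, Freehold",
])

# str.get_dummies splits on the separator and emits one 0/1 column per label,
# so the two mixed orderings end up with identical rows.
tenure_flags = tenure.str.get_dummies(sep=", ")
print(tenure_flags)
```

A side benefit of this encoding is that 'Freehold, Leasehold' and 'Leasehold, Freehold' collapse into the same representation, so the ordering difference in the raw data no longer matters.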