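Side note: I have been very busy lately, so updates have been slow. Happy Lunar New Year in advance, and best wishes to everyone! The dataset used here is available for download, and readers can fetch it themselves.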
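1. Project Background
This project is a preliminary data analysis of the Malaysian property market that explores the market characteristics of each state. It visualizes indicators such as median price, median price per square foot, and transaction count, and combines them with cluster analysis and point-biserial correlation analysis to examine how property types relate to market trends. The analysis gives a clearer picture of the basic state of the market and provides data support for follow-up research or decisions.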
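2. Data Description
Field | Description |
---|---|
Township | Name of the township where the property is located |
Area | Name of the area where the property is located |
State | State where the property is located |
Tenure | Land tenure type (e.g. Freehold or Leasehold) |
Type | Property type (e.g. Terrace House, Cluster House) |
Median_Price | Median property price (in Malaysian ringgit) |
Median_PSF | Median price per square foot (in Malaysian ringgit) |
Transactions | Number of property transactions in the area |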
3. Importing Python Libraries and Reading the Data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
from scipy.stats import pointbiserialr
data = pd.read_csv("/home/mw/input/01209371/malaysia_house_price_data_2025.csv")
4. Data Preview and Cleaning
data.head()
| | Township | Area | State | Tenure | Type | Median_Price | Median_PSF | Transactions |
---|---|---|---|---|---|---|---|---|
0 | SCIENTEX SUNGAI DUA | Tasek Gelugor | Penang | Freehold | Terrace House | 331800.0 | 304.0 | 593 |
1 | BANDAR PUTRA | Kulai | Johor | Freehold | Cluster House, Terrace House | 590900.0 | 322.0 | 519 |
2 | TAMAN LAGENDA TROPIKA TAPAH | Chenderiang | Perak | Freehold | Terrace House | 229954.0 | 130.0 | 414 |
3 | SCIENTEX JASIN MUTIARA | Bemban | Melaka | Freehold | Terrace House | 255600.0 | 218.0 | 391 |
4 | TAMAN LAGENDA AMAN | Tapah | Perak | Leasehold | Terrace House | 219300.0 | 168.0 | 363 |
print('Dataset info:')
data.info()
Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Township 2000 non-null object
1 Area 2000 non-null object
2 State 2000 non-null object
3 Tenure 2000 non-null object
4 Type 2000 non-null object
5 Median_Price 2000 non-null float64
6 Median_PSF 2000 non-null float64
7 Transactions 2000 non-null int64
dtypes: float64(2), int64(1), object(5)
memory usage: 125.1+ KB
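The `info()` output above reports 2000 non-null values in every column, so no imputation looks necessary. As a complementary check during cleaning, a minimal sketch like the following counts missing values per column and fully duplicated rows. The `toy` frame is a made-up stand-in for `data` (its values are illustrative only), and checking for duplicates is an assumption on my part rather than a step taken in the original analysis.

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only.
toy = pd.DataFrame({
    "Township": ["A", "B", "B", "C"],
    "Median_Price": [331800.0, 590900.0, 590900.0, None],
    "Transactions": [593, 519, 519, 414],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(toy.duplicated().sum())

print(missing_per_column)
print(f"duplicate rows: {duplicate_rows}")
```

On the real dataset the same two calls would be applied to `data` directly.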
characteristic = data.select_dtypes(include=['object']).columns
print('Unique values of the categorical variables:')
for i in characteristic:
    print(f'{i}:')
    print(f'{len(data[i].unique())} unique values in total')
    print(data[i].unique())
    print('-'*50)
Unique values of the categorical variables:
Township:
1946 unique values in total
['SCIENTEX SUNGAI DUA' 'BANDAR PUTRA' 'TAMAN LAGENDA TROPIKA TAPAH' ...
'TAMAN PUNCAK JELAPANG MAJU' 'TAMAN TIONG UNG SIEW'
'TAMAN DESA RISHAH INDAH']
--------------------------------------------------
Area:
303 unique values in total
['Tasek Gelugor' 'Kulai' 'Chenderiang' 'Bemban' 'Tapah' 'Tebrau'
'Pasir Gudang' 'Teluk Intan' 'Jasin' 'Kapar' 'Iskandar Puteri (Nusajaya)'
'Nilai' 'Skudai' 'Johor Bahru' 'Batu Pahat' 'Kluang' 'Ulu Tiram'
'Permas Jaya' 'Serendah' 'Chemor' 'Pagoh' 'Lenggeng' 'Perling' 'Penaga'
'Klang' 'Sungei Petai' 'Shah Alam' 'Padang Serai' 'Sitiawan' 'Ijok'
'Bukit Katil' 'Masai' 'Tampoi' 'Setia Alam' 'Kota Tinggi' 'Banting'
'Cheras' 'Tanjong Duabelas' 'Jalan Klang Lama (Old Klang Road)'
'Bukit Bintang' 'Rawang' 'Bandar Utama' 'Seri Iskandar' 'Ipoh' 'Senai'
'Tanjong Minyak' 'Labu' 'Jimah' 'Sungai Bakap' 'Batu Caves'
'Damansara Perdana' 'Kuantan' 'Kampar' 'Seremban 2' 'Pasir Panjang'
'Kijal' 'Segamat' 'Seri Kembangan' 'Rantau' 'Gurun' 'Dengkil'
'Subang Jaya' 'Sungai Petani' 'Puchong' 'Kajang' 'Lunas' 'Gopeng'
'Kepala Batas' 'Bandar Puncak Alam' 'Bangi' 'Bandar Sri Damansara'
'Kuala Pilah' 'Linggi' 'Bandar Sunway' 'Telok Panglima Garang' 'Beranang'
'Taman Tun Dr Ismail' 'Kamunting' 'Bandar Sungai Long' 'Sri Petaling'
'Pengerang' 'Jalan Kuching' 'Batu Arang' 'Kota Samarahan' 'Miri'
'Semenyih' 'Petaling Jaya' 'Simpang Ampat' 'Krubong' 'Taiping'
'Kota Damansara' 'Nibong Tebal' 'Bukit Mertajam' 'Wangsa Maju' 'Papar'
'Telok Kemang' 'Hulu Terengganu' 'Lumut' 'Merlimau' 'Ulu Klang'
'Damansara Heights' 'Seremban' 'Sabak Bernam' 'Parit Buntar'
'Seberang Jaya' 'Kepong' 'Bandar Kinrara' 'Gerisek' 'Tampin' 'Ayer Itam'
'Bahau' 'KLCC' 'Lukut' 'Batang Kali' 'Sungai Jawi' 'Selayang'
'Simpang Pulai' 'Batu Gajah' 'Sepang' 'Kuching' 'Kulim' 'Kerteh' 'Ampang'
'Ayer Molek' 'Bidor' 'Bayan Baru' 'Senggarang' 'Ampangan' 'Gelang Patah'
'Sungai Siput' 'Kota Kinabalu' 'Penampang' 'Ara Damansara' 'Segambut'
'Tawau' 'Bukit Rambai' 'Rasa' 'Simpang Rengam' 'Duyong' 'Mont Kiara'
'Durian Tunggal' 'Sibu' 'Alor Setar' 'Bachang' 'Balai Panjang'
'Hulu Lepar' 'Saujana Utama' 'Bukit Jalil' 'Georgetown' 'Juasseh'
'Sungai Rambai' 'Arau' 'Jelutong' 'Perai' 'Sungai Karang' 'Ulu Bernam'
'Paya Rumput' 'Sandakan' 'Tronoh' 'Kuala Terengganu' 'Setapak' 'Rasah'
'Kuala Selangor' 'Tanjong Tualang' 'Raub' 'Cyberjaya' 'Butterworth'
'Port Klang' 'Menglembu' 'Cheng' 'Sungai Buloh' 'Kuala Ibai'
'Bandar Sri Sendayan' 'Bayan Lepas' 'Bandar Tasik Selatan' 'Labuan'
'Senawang' 'Bukit Baru' 'Tanjong Tokong'
'Kampung Kerinchi (Bangsar South)' 'Kuala Kubu Baru' 'Bandar Enstek'
'Batu Berendam' 'Alor Gajah' 'Pandamaran' 'Sungai Ara' 'Dutamas'
'Sikamat' 'Tambun' 'Glenmarie' 'Muar' 'Mentakab' 'Pusing' 'Jenjarom'
'Pontian' 'Sungai Dua' 'Kuala Lipis' 'Batu Kawan' 'Lahat' 'Besut'
'Bukit Kepayang' 'Kuala Kedah' 'Port Dickson' 'KL Sentral' 'Sentul'
'Kuala Kangsar' 'Tangkak' 'Klebang' 'Kuchai Lama' 'Seri Manjong'
'Melaka City' 'Gelugor' 'Mantin' 'Bintulu' 'Hutan Melintang' 'Bangsar'
'Bagan Serai' 'Masjid Tanah' 'Balakong' 'Tuaran' 'Batu Ferringhi'
'Rompin' 'Tropicana' 'Bandar Menjalara' 'Bakri' 'Putrajaya' 'Taman Desa'
'Bukit Jambul' 'Desa Petaling' 'Kemaman' 'Kuala Ketil' 'Merbok'
'Sungai Lalang' 'Sungai Besi' 'Relau' 'Tanjung Bungah' 'Damansara Damai'
'Bedong' 'Jitra' 'Bentong' 'Teloi Kiri' 'Jementah' 'Paloh' 'Setiawangsa'
'Kubang Semang' 'Pantai' 'Genting Highlands' 'Jalan Ipoh' 'Bukit Minyak'
'Padang Enggang' 'Teras' 'Salak Selatan' 'Lahad Datu' 'Dungun'
'Sungai Udang' 'Triang' 'Desa ParkCity' 'KL City' 'Hulu Langat' 'Gurney'
'Kelana Jaya' 'Pengkalan Hulu' 'Kuah' 'Kota Marudu' 'Padang Rengas'
'Yong Peng' 'Tanjong Kling' 'Permatang Pauh' 'Jinjang' 'Labis' 'Umbai'
'Cherang Ruku' 'Saujana' 'Pekan' 'Simpang Pertang' 'Limbang' 'Paroi'
'Hulu Selangor' 'Brickfields' 'Balik Pulau' 'Bertam' 'Batu Kurau'
'Chenor' 'Ulu Langat' 'Simpang' 'Mersing' 'Juru' 'Pokok Sena' 'Rembau'
'Mutiara Damansara' 'Sungei Baru Tengah' 'Ampang Hilir' 'City Centre'
'Sri Hartamas' 'Kota Sarang Semut' 'Teluk Kumbar' 'Bandar Baharu' 'Gemas'
'Sungai Besar' 'Gerik' 'Sri Aman' 'Kuala Sungai Baru' 'Machang']
--------------------------------------------------
State:
16 unique values in total
['Penang' 'Johor' 'Perak' 'Melaka' 'Selangor' 'Negeri Sembilan' 'Kedah'
'Kuala Lumpur' 'Pahang' 'Terengganu' 'Sarawak' 'Sabah' 'Perlis' 'Labuan'
'Putrajaya' 'Kelantan']
--------------------------------------------------
Tenure:
4 unique values in total
['Freehold' 'Leasehold' 'Freehold, Leasehold' 'Leasehold, Freehold']
--------------------------------------------------
Type:
46 unique values in total
['Terrace House' 'Cluster House, Terrace House' 'Terrace House, Semi D'
'Terrace House, Cluster House' 'Cluster House' 'Bungalow, Terrace House'
'Service Residence' 'Semi D, Terrace House' 'Flat'
'Cluster House, Semi D' 'Terrace House, Semi D, Town House'
'Semi D, Cluster House' 'Apartment' 'Semi D, Terrace House, Bungalow'
'Semi D' 'Terrace House, Town House' 'Semi D, Town House, Terrace House'
'Condominium' 'Semi D, Bungalow' 'Bungalow, Semi D' 'Bungalow'
'Town House, Terrace House, Semi D' 'Terrace House, Semi D, Bungalow'
'Bungalow, Terrace House, Semi D' 'Terrace House, Bungalow'
'Cluster House, Terrace House, Semi D'
'Semi D, Terrace House, Cluster House' 'Semi D, Cluster House, Bungalow'
'Bungalow, Town House' 'Town House, Terrace House' 'Town House, Semi D'
'Terrace House, Bungalow, Semi D' 'Town House' 'Flat, Condominium'
'Town House, Bungalow, Terrace House' 'Apartment, Flat'
'Semi D, Bungalow, Terrace House' 'Town House, Semi D, Terrace House'
'Town House, Bungalow' 'Condominium, Service Residence'
'Cluster House, Bungalow' 'Flat, Apartment'
'Bungalow, Semi D, Terrace House' 'Semi D, Terrace House, Town House'
'Cluster House, Town House, Terrace House' 'Semi D, Town House']
--------------------------------------------------
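Township and Area have a very large number of unique values (1946 and 303 respectively) and contribute little to the later analysis, so we consider dropping them and keeping only State as the geographic information. Type contains many combined values such as 'Semi D, Terrace House', so we split them to see how many distinct property types are actually involved.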
# Collect every individual property type from the combined labels
all_types = []
# Split each label on ", " (a comma followed by a space)
for property_type in data['Type']:
    types = property_type.split(", ")
    all_types.extend(types)
# Deduplicate to get the set of distinct types
unique_types = set(all_types)
# Show all distinct property types
unique_types
{'Apartment',
'Bungalow',
'Cluster House',
'Condominium',
'Flat',
'Semi D',
'Service Residence',
'Terrace House',
'Town House'}
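The loop above only collects the distinct types. If the frequency of each individual type is also of interest, a pandas-only sketch could split, flatten, and count in one chain. The `toy_type` series here is a small hypothetical stand-in for `data['Type']`, not the real column.

```python
import pandas as pd

# Hypothetical stand-in for data['Type']; the labels mimic the combined values above.
toy_type = pd.Series([
    "Terrace House",
    "Cluster House, Terrace House",
    "Semi D, Terrace House",
    "Flat",
])

# Split each label on ", ", expand to one row per individual type, then count.
type_counts = toy_type.str.split(", ").explode().value_counts()

print(type_counts)
```

Applied to the real column, `type_counts.index` should contain the same nine types as `unique_types`.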
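As shown, there are only a handful of real property types. We convert them into 0-1 binary variables that indicate whether a sample includes a given type, which avoids a dimensionality blow-up. The Tenure feature is handled in the same way.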
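As a compact illustration of this encoding, the sketch below uses `Series.str.get_dummies` with `sep=", "` on a small hypothetical frame. It is one possible way to build the 0-1 columns, shown on made-up rows, and not necessarily how the rest of this post does it. Note that 'Freehold, Leasehold' and 'Leasehold, Freehold' end up with identical indicator values, so the order inside a combined label no longer matters.

```python
import pandas as pd

# Hypothetical toy rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "Type": ["Terrace House", "Semi D, Terrace House", "Flat, Apartment"],
    "Tenure": ["Freehold", "Leasehold, Freehold", "Freehold, Leasehold"],
})

# One 0-1 column per individual label, splitting combined labels on ", ".
type_flags = toy["Type"].str.get_dummies(sep=", ")
tenure_flags = toy["Tenure"].str.get_dummies(sep=", ")

# Attach the indicator columns to the original frame.
encoded = pd.concat([toy, type_flags, tenure_flags], axis=1)

print(encoded)
```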
# Create a 0-1 binary variable for each property type
for property_type in unique_types:
    data[property_type] = data['Type'].apply(lambda x: