EDA_RentalListingInquiries

Latest recommended article published on 2022-06-17 20:49:03

*Major*

Article link: https://blog.csdn.net/qq_41375318/article/details/104311479

EDA - Rental Listing Inquiries

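Exploring the basic information of the dataset

Knowing the basic properties of a dataset (its size, feature types, and missing values) helps guide modeling decisions.

This post uses the data from the Two Sigma Connect: Rental Listing Inquiries competition, held on Kaggle in 2017, as an example of exploratory data analysis.
More data analysis examples can be found in the competition kernels: https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/kernels
Competition data page: https://www.kaggle.com/c/two-sigma-connect-rental-listing-inquiries/data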

Import the required packages

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
#color = sns.color_palette()

%matplotlib inline

Read the data

# path to where the data lies
dpath = '/Users/qing/desktop/XGBoost/data/'
train = pd.read_json(dpath +"RentListingInquries_train.json")
train.head()

Check the data size
Read the test data as well, then print the shape of both sets.

test = pd.read_json(dpath+"RentListingInquries_test.json")
#test.head().T

print("Train :", train.shape)
print("Test : ", test.shape)
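
Besides the shapes, it can help to confirm which columns the two sets share. The sketch below uses two tiny made-up frames (`train_toy` and `test_toy` are illustrative names, not the competition data) and assumes, as is usual for Kaggle competitions, that the target column is only present in the training set.

```python
import pandas as pd

# Toy stand-ins for the train and test frames (illustrative data only).
train_toy = pd.DataFrame({
    "price": [3000, 5465],
    "bedrooms": [3, 2],
    "interest_level": ["medium", "low"],
})
test_toy = pd.DataFrame({"price": [2950], "bedrooms": [1]})

# Columns that appear in one frame but not in the other.
only_in_train = sorted(set(train_toy.columns) - set(test_toy.columns))
only_in_test = sorted(set(test_toy.columns) - set(train_toy.columns))

print("Only in train:", only_in_train)
print("Only in test :", only_in_test)
```

The same two set differences can be applied to the real `train` and `test` frames once they are loaded.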

Variable Identification

This dataset was chosen because it contains many types of features: numerical, categorical, date, geographic location, text, and image features.

#info method provides information about dataset like 
#total values in each column, null/not null, datatype, memory occupied etc
train.info()

##Describe gives statistical information about numerical columns in the dataset
#train.describe()

### ... check for NAs
#train.isnull().sum()

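Distribution of each variable

Python offers many ways to visualize data, and this variety can make it hard to choose.
Commonly used visualization tools include:
  Pandas
  Seaborn
  ggplot
  Bokeh
  pygal
Matplotlib is very powerful but also complex. It can do almost anything, but it is not easy to learn.
Many tools (especially Pandas and Seaborn) are wrappers around it.
pandas has built-in plotting, so a pandas.DataFrame can draw many kinds of charts directly. It is convenient for quick plots, but customizing them still requires learning matplotlib.
Seaborn provides a higher-level API on top of matplotlib that makes plotting easier. In most cases seaborn alone produces attractive figures, while dropping down to matplotlib allows finer customization.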

http://seaborn.pydata.org/tutorial.html

Target Variable: 'interest_level'

sns.countplot(x='interest_level', data=train, order=['low', 'medium', 'high']);
plt.xlabel('Interest Level');
plt.ylabel('Number of occurrences');

### Quantitative substitute of Interest Level
train['interest'] = np.where(train.interest_level=='low', 0,
                                  np.where(train.interest_level=='medium', 1, 2))

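Most samples have a low interest level, followed by medium, and then high.
LabelEncoder is not used here because it does not let you choose which number each label maps to.
An alternative way to do the conversion:
target_num_map = {'high': 2, 'medium': 1, 'low': 0}
y = train["interest_level"].apply(lambda x: target_num_map[x])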
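
To make the point about label order concrete, here is a minimal sketch on a toy Series (the values are illustrative, not taken from the dataset). It compares an explicit dictionary mapping with an ordered pandas categorical, and shows the alphabetical order that a sort-based encoder would use instead.

```python
import pandas as pd

# Toy labels standing in for train["interest_level"] (illustrative data only).
levels = pd.Series(["low", "high", "medium", "low"])

# Explicit dictionary mapping: the order low < medium < high is stated by hand.
target_num_map = {"low": 0, "medium": 1, "high": 2}
y_map = levels.map(target_num_map)

# Ordered categorical: the category list fixes which code each label receives.
cat = pd.Categorical(levels, categories=["low", "medium", "high"], ordered=True)
y_cat = pd.Series(cat.codes)

# Alphabetical sorting puts "high" first, which loses the natural ordering.
alphabetical = sorted(levels.unique())

print(y_map.tolist())
print(y_cat.tolist())
print(alphabetical)
```

Both explicit approaches give the same codes, while alphabetical sorting would assign 0 to 'high', which is why the mapping is written out by hand here.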

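Next, look at the numerical features:
bathrooms,
bedrooms,
price

bathrooms and bedrooms take only a small set of values, so seaborn.countplot is used to plot their distributions.
price can take many values, so its distribution is plotted with seaborn.distplot (superseded by seaborn.histplot in newer seaborn versions).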

Bathrooms

fig = plt.figure()
### Number of occurrences
sns.countplot(x=train.bathrooms);
plt.xlabel('Number of Bathrooms');
plt.ylabel('Number of occurrences');

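Relationship between bathrooms and the label
A strip plot (stripplot) draws the bathrooms values as a scatter for each interest_level.
Points in such a scatter often overlap, so a small random jitter is added when drawing them.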

order = ['low', 'medium', 'high']
sns.stripplot(x=train["interest_level"], y=train["bathrooms"], jitter=True, order=order)
plt.title("Number of Bathrooms Vs Interest_level");

There is one listing with 10 bathrooms, which can be treated as an outlier. Let's cap the values and plot again.

The count plot above also shows that listings with more than 4 bathrooms are rare.

#ulimit = np.percentile(train.bathrooms.values, 99.5)
ulimit = 4
# .ix was removed from pandas; .loc with a boolean mask caps the values in place
train.loc[train['bathrooms'] > ulimit, 'bathrooms'] = ulimit
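
The capping step can also be written with `Series.clip`. The sketch below runs on a toy Series (illustrative values only) so the effect is easy to see; the limit of 4 is the one chosen above.

```python
import pandas as pd

# Toy bathroom counts with one extreme value (illustrative data only).
bathrooms = pd.Series([1.0, 1.5, 2.0, 1.0, 10.0, 3.0])

ulimit = 4
# clip(upper=...) replaces every value above ulimit with ulimit itself
# and leaves all other values unchanged.
capped = bathrooms.clip(upper=ulimit)

print(capped.tolist())
```

Note that capping keeps the row in the data; only the extreme value is pulled down to the limit.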

fig = plt.figure()
### Number of occurrences
sns.countplot(x=train.bathrooms);
plt.xlabel('Number of Bathrooms');
plt.ylabel('Number of occurrences');

sns.stripplot(y="bathrooms", x="interest_level",data=train,jitter=True,order=order);

sns.countplot(x="bathrooms", hue="interest_level",data=train);

Listings with no bathroom are very rarely high interest.

Bedrooms

fig = plt.figure()
### Number of occurrences
sns.countplot(x=train.bedrooms);
plt.xlabel('Number of Bedrooms');
plt.ylabel('Number of occurrences');

Relationship between bedrooms and the label

order = ['low', 'medium', 'high']
sns.stripplot(x=train["interest_level"], y=train["bedrooms"], jitter=True, order=order)
plt.title("Number of Bedrooms Vs Interest_level");

sns.countplot(x="bedrooms", hue="interest_level",data=train);

Price

plt.scatter(range(train.shape[0]), train["price"].values,color='purple')
plt.title("Distribution of Price");

There appear to be some outliers in this feature, so let's cap them at the 99th percentile and plot again.

ulimit = np.percentile(train.price.values, 99)
train.loc[train['price'] > ulimit, 'price'] = ulimit

# histplot supersedes the deprecated distplot
sns.histplot(train.price.values, bins=50, kde=True)
plt.xlabel('price', fontsize=12)
plt.show()

The plot shows that the distribution is right-skewed, so the log-transformed price is plotted next.

plt.figure(figsize=(13,9))
sns.histplot(np.log1p(train["price"]), kde=True)
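
The effect of the log transform on a right-skewed feature can also be checked numerically. This is a minimal sketch on synthetic log-normal "prices" (an assumption made for illustration; the real price column is not guaranteed to be log-normal), using the sample skewness from `Series.skew`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for the price column.
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=8.0, sigma=0.5, size=5000))

# Sample skewness before and after the log1p transform.
skew_raw = price.skew()
skew_log = np.log1p(price).skew()

print("skewness of price        :", round(skew_raw, 3))
print("skewness of log1p(price) :", round(skew_log, 3))
```

A skewness near zero after the transform indicates a roughly symmetric distribution, which is often easier for models and plots to handle.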

Relationship between price and the label

order = ['low', 'medium', 'high']
sns.stripplot(x=train["interest_level"], y=train["price"], jitter=True, order=order)
plt.title("Price Vs Interest_level");

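For low interest listings the price looks evenly spread over the whole range, while medium and high interest listings are mostly priced between 1500 and 8000.

A violin plot gives more information about how a feature is distributed within each category:
a kernel density estimate (KDE)
the three quartiles (1/4, 1/2, 3/4)
whiskers at 1.5 times the interquartile range (IQR)
IQR: the difference between the third and first quartiles (Q3 - Q1); it measures the spread of a variable and is a more robust statistic than the variance.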
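
The quantities listed above can be computed directly with NumPy. The sketch below uses a small hand-made array (illustrative values only) and the default linear interpolation of `np.percentile`.

```python
import numpy as np

# Small illustrative sample.
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# The three quartiles: 25th, 50th (median), and 75th percentiles.
q1, q2, q3 = np.percentile(values, [25, 50, 75])

# Interquartile range and the 1.5 * IQR limits used for whiskers.
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

print("Q1, median, Q3:", q1, q2, q3)
print("IQR:", iqr)
print("1.5 * IQR limits:", lower_limit, upper_limit)
```

Points outside the two limits are the ones a box plot would typically flag as potential outliers.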

order = ['low', 'medium', 'high']
sns.violinplot(x='interest_level', y='price', data=train, order = order)
plt.xlabel('Interest level', fontsize=12)
plt.ylabel('price', fontsize=12)
plt.show()

listing_id

sns.histplot(train.listing_id.values, bins=50, kde=True)
plt.xlabel('listing_id')
plt.show()

Relationship between listing_id and the label

order = ['low', 'medium', 'high']
sns.stripplot(x=train["interest_level"], y=train["listing_id"], jitter=True, order=order)
plt.title("listing_id Vs Interest_level");

order = ['low', 'medium', 'high']
sns.violinplot(x='interest_level', y='listing_id', data=train, order = order)
plt.xlabel('Interest level', fontsize=12)
plt.ylabel('listing_id', fontsize=12)
plt.show()

Geographic location: Latitude & Longitude

Latitude and longitude are numerical variables, but their physical meaning is the geographic location of the listing.

sns.lmplot(x="longitude", y="latitude", fit_reg=False, hue='interest_level',
           hue_order=['low', 'medium', 'high'], height=9, scatter_kws={'alpha':0.4,'s':30},
           data=train[(train.longitude>train.longitude.quantile(0.005))
                           &(train.longitude<train.longitude.quantile(0.995))
                           &(train.latitude>train.latitude.quantile(0.005))                           
                           &(train.latitude<train.latitude.quantile(0.995))]);
plt.xlabel('Longitude');
plt.ylabel('Latitude');

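The plot above excludes points whose longitude or latitude is extremely large or small. High interest listings are concentrated in a small area; loading the points into Google Earth shows their exact location.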
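
The quantile-based filtering used for the plot can be wrapped in a small helper. This is a sketch on synthetic coordinates; `trim_by_quantile` is a hypothetical helper name introduced here, and the 0.5% and 99.5% cut-offs mirror the ones used in the plotting call.

```python
import numpy as np
import pandas as pd

# Synthetic coordinates clustered around one city (illustrative data only).
rng = np.random.default_rng(0)
coords = pd.DataFrame({
    "longitude": rng.normal(-73.97, 0.03, size=1000),
    "latitude": rng.normal(40.75, 0.03, size=1000),
})
# One obviously wrong point, similar to a listing with missing coordinates.
coords.loc[0, ["longitude", "latitude"]] = [0.0, 0.0]


def trim_by_quantile(df, cols, lower=0.005, upper=0.995):
    """Keep rows whose value lies strictly inside the quantile range for every column."""
    keep = pd.Series(True, index=df.index)
    for col in cols:
        lo, hi = df[col].quantile([lower, upper])
        keep &= (df[col] > lo) & (df[col] < hi)
    return df[keep]


trimmed = trim_by_quantile(coords, ["longitude", "latitude"])
print(len(coords), "rows before,", len(trimmed), "rows after")
```

Because each column is trimmed on both sides, only a small fraction of rows is dropped, including the point at (0, 0).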
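
Either of the two code snippets below displays the data on a map; try whichever you prefer. The corresponding package (basemap or gpxpy) must be installed first.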

from mpl_toolkits.basemap import Basemap
from matplotlib import cm

west, south, east, north = -74.02, 40.64, -73.85, 40.86

fig = plt.figure(figsize=(14,10))
ax = fig.add_subplot(111)
m = Basemap(projection='merc', llcrnrlat=south, urcrnrlat=north,
            llcrnrlon=west, urcrnrlon=east, lat_ts=south, resolution='i')
x, y = m(train['longitude'].values, train['latitude'].values)
m.hexbin(x, y, gridsize=200,
         bins='log', cmap=cm.YlOrRd_r);

import gpxpy.gpx

gpx = gpxpy.gpx.GPX()

for index, row in train.iterrows():
    # Export only high interest listings: exporting every listing makes Google Earth slow
    if row['interest_level'] == 'high':
        gps_waypoint = gpxpy.gpx.GPXWaypoint(row['latitude'], row['longitude'], elevation=10)
        gpx.waypoints.append(gps_waypoint)

# Write the waypoints to a GPX file that Google Earth can open
filename = "GoogleEarth.gpx"
with open(filename, "w") as f:
    f.write(gpx.to_xml())

display_address

cnt_srs = train.groupby('display_address')['display_address'].count()

for i in [2, 10, 50, 100, 500]:
    print('Display_address that appear less than {} times: {}%'.format(i, round((cnt_srs < i).mean() * 100, 2)))

plt.figure()
plt.hist(cnt_srs.values, bins=100, log=True, alpha=0.9)
#sns.distplot(cnt_srs.values, bins=100)
plt.xlabel('Number of times display_address appeared')
plt.ylabel('log(Count)')

### Let's get a list of top 10 display address
top10da = train.display_address.value_counts().nlargest(10).index.tolist()

fig = plt.figure()
ax = sns.countplot(x="display_address", hue="interest_level",
                   data=train[train.display_address.isin(top10da)]);
plt.xlabel('display_address');
plt.ylabel('Number of advert occurrences');
### display_address labels are too long. Let's remove them
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False);

### Adding percents over bars
height = [0 if np.isnan(p.get_height()) else p.get_height() for p in ax.patches]
ncol = int(len(height)/3)
total = [height[i] + height[i + ncol] + height[i + 2*ncol] for i in range(ncol)] * 3
for i, p in enumerate(ax.patches):    
    ax.text(p.get_x()+p.get_width()/2,
            height[i] + 20,
            '{:1.0%}'.format(height[i]/total[i]),
            ha="center")
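
The percentages drawn over the bars can also be computed as a table, which avoids relying on the order of the bar patches. The sketch below uses a toy frame (illustrative addresses and labels only) and `pd.crosstab` with row-wise normalization.

```python
import pandas as pd

# Toy listings (illustrative data only).
df = pd.DataFrame({
    "display_address": ["Broadway", "Broadway", "Broadway", "Broadway",
                        "Wall Street", "Wall Street"],
    "interest_level": ["low", "low", "medium", "high", "low", "high"],
})

# Share of each interest level within every address (each row sums to 1).
shares = pd.crosstab(df["display_address"], df["interest_level"], normalize="index")
# Put the columns in the natural order of the target.
shares = shares[["low", "medium", "high"]]

print(shares)
```

The same call works for building_id or manager_id by swapping the first argument.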

building_id

### Let's get a list of top 10 building id
top10building = train.building_id.value_counts().nlargest(10).index.tolist()
### ...and plot number of different Interest Level rental adverts for each of them
fig = plt.figure()
ax = sns.countplot(x="building_id", hue="interest_level",
                   data=train[train.building_id.isin(top10building)]);
plt.xlabel('Building');
plt.ylabel('Number of advert occurrences');
### building_id labels are too long. Let's remove them
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False);

### Adding percents over bars
height = [0 if np.isnan(p.get_height()) else p.get_height() for p in ax.patches]
ncol = int(len(height)/3)
total = [height[i] + height[i + ncol] + height[i + 2*ncol] for i in range(ncol)] * 3
for i, p in enumerate(ax.patches):    
    ax.text(p.get_x()+p.get_width()/2,
            height[i] + 20,
            '{:1.0%}'.format(height[i]/total[i]),
            ha="center")

manager_id

Handled the same way as building_id

### Let's get a list of top 10 managers
top10managers = train.manager_id.value_counts().nlargest(10).index.tolist()
### ...and plot number of different Interest Level rental adverts for each of them
fig = plt.figure()
ax = sns.countplot(x="manager_id", hue="interest_level",
                   data=train[train.manager_id.isin(top10managers)]);
plt.xlabel('Manager');
plt.ylabel('Number of advert occurrences');
### Manager_ids are too long. Let's remove them
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False);

### Adding percents over bars
height = [0 if np.isnan(p.get_height()) else p.get_height() for p in ax.patches]
ncol = int(len(height)/3)
total = [height[i] + height[i + ncol] + height[i + 2*ncol] for i in range(ncol)] * 3
for i, p in enumerate(ax.patches):    
    ax.text(p.get_x()+p.get_width()/2,
            height[i] + 20,
            '{:1.0%}'.format(height[i]/total[i]),
            ha="center")

created date

Date feature

train['created'] = pd.to_datetime(train['created'])
train['date'] = train['created'].dt.date
train["year"] = train["created"].dt.year
train['month'] = train['created'].dt.month
train['day'] = train['created'].dt.day
train['hour'] = train['created'].dt.hour
train['weekday'] = train['created'].dt.weekday
train['week'] = train['created'].dt.isocalendar().week
train['quarter'] = train['created'].dt.quarter
# A day cannot be both Saturday (5) and Sunday (6), so the weekend flag needs | rather than &
train['weekend'] = ((train['weekday'] == 5) | (train['weekday'] == 6))
train['wd'] = ((train['weekday'] != 5) & (train['weekday'] != 6))
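
The weekday-based flags are easy to get wrong, so here is a small check on four hand-picked timestamps (illustrative values; 2016-04-02 is a Saturday). It assumes a pandas version that provides `Series.dt.isocalendar()`.

```python
import pandas as pd

# Four consecutive days: Friday, Saturday, Sunday, Monday.
created = pd.to_datetime(pd.Series([
    "2016-04-01 10:15:00",
    "2016-04-02 22:30:00",
    "2016-04-03 05:00:00",
    "2016-04-04 12:00:00",
]))

weekday = created.dt.weekday          # Monday=0 ... Sunday=6
weekend = weekday.isin([5, 6])        # True on Saturday or Sunday
wd = ~weekend                         # working day is the complement
week = created.dt.isocalendar().week  # ISO week number

print(pd.DataFrame({"weekday": weekday, "weekend": weekend, "wd": wd, "week": week}))
```

Using `isin([5, 6])` states the intent directly and makes the working-day flag a simple negation.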

created date --> date

cnt_srs = train['date'].value_counts()

plt.figure(figsize=(12,4))
ax = plt.subplot(111)
ax.bar(cnt_srs.index, cnt_srs.values)
ax.xaxis_date()
plt.xticks(rotation='vertical')
plt.show()

All listings in our data were created in the April to July 2016 period.

hour

hourDF = train.groupby(['hour', 'interest_level'])['hour'].count().unstack('interest_level').fillna(0)
hourDF[['low','medium',"high"]].plot(kind='bar', stacked=True);

month

monthDF = train.groupby(['month', 'interest_level'])['month'].count().unstack('interest_level').fillna(0)
monthDF[['low','medium',"high"]].plot(kind='bar', stacked=True);

Photo Numbers

train['num_photos'] = train['photos'].apply(len)
ulimit = np.percentile(train.num_photos.values, 99)
train.loc[train['num_photos'] > ulimit, 'num_photos'] = ulimit

sns.countplot(x=train.num_photos);
plt.xlabel('Number of photos');
plt.ylabel('Number of occurrences');

train.loc[train['num_photos'] > 15, 'num_photos'] = 15
#sns.stripplot(y="num_photos", x="interest_level",data=train,jitter=True,order=order);

plt.figure()
sns.violinplot(x="num_photos", y="interest_level", data=train, order =['low','medium','high'])
plt.xlabel('Number of Photos')
plt.ylabel('Interest Level')
plt.show()

Features Length

train['len_features'] = train['features'].apply(len)

sns.countplot(x=train.len_features);
plt.xlabel('Length of features');
plt.ylabel('Number of occurrences');

train.loc[train['len_features'] > 16, 'len_features'] = 16

plt.figure()
sns.violinplot(x="len_features", y="interest_level", data=train, order =['low','medium','high'])
plt.xlabel('Length of Features')
plt.ylabel('Interest Level')
plt.show()

Description word counts

train['num_description_words'] = train['description'].apply(lambda x: len(x.split(' ')))
train['len_description'] = train['description'].apply(len)
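
One detail worth knowing about the word count: splitting on a single space counts empty strings produced by repeated or leading spaces. The sketch below compares it with whitespace-aware splitting on three toy descriptions (illustrative text only).

```python
import pandas as pd

# Toy descriptions with double, leading, and trailing spaces (illustrative data only).
description = pd.Series(["Sunny  two bedroom", "", "  Renovated kitchen "])

# split(' ') keeps empty strings between consecutive spaces.
n_split_space = description.apply(lambda x: len(x.split(" ")))

# str.split() with no argument splits on any run of whitespace.
n_split_any = description.str.split().str.len()

# Character length, as used for len_description.
n_chars = description.str.len()

print(pd.DataFrame({"split(' ')": n_split_space, "split()": n_split_any, "chars": n_chars}))
```

Either count may be useful as a feature; the point is only that the two are not the same when the text contains irregular spacing.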

#ulimit = np.percentile(train.len_description.values, 99)
#train.loc[train['len_description'] > ulimit, 'len_description'] = ulimit

sns.countplot(x=train.len_description);
plt.xlabel('Length of description');
plt.ylabel('Number of occurrences');

fig = plt.figure()
order = ['low', 'medium', 'high']
#ulimit = np.percentile(train.len_description.values, 99)
#train.loc[train['len_description'] > ulimit, 'len_description'] = ulimit

sns.stripplot(x=train["interest_level"], y=train["len_description"], jitter=True, order=order)
plt.title("Length of description Vs Interest_level");

plt.figure()
sns.violinplot(x="len_description", y="interest_level", data=train, order =['low','medium','high'])
plt.xlabel('Length of Description')
plt.ylabel('Interest Level')
plt.show()

sns.countplot(x=train.num_description_words);
plt.xlabel('Number of words of description');
plt.ylabel('Number of occurrences');

fig = plt.figure()
order = ['low', 'medium', 'high']
#ulimit = np.percentile(train.num_description_words.values, 99)
#ulimit = 500
#train.loc[train['num_description_words'] > ulimit, 'num_description_words'] = ulimit
sns.stripplot(x=train["interest_level"], y=train["num_description_words"], jitter=True, order=order)
plt.title("Number of description words Vs Interest_level");

plt.figure()
sns.violinplot(x="num_description_words", y="interest_level", data=train, order =['low','medium','high'])
plt.xlabel('Number of Description Words')
plt.ylabel('Interest Level')
plt.show()

Word clouds (display_address, street_address, features)

from wordcloud import WordCloud

text = ''
text_da = ''
text_street = ''
#text_desc = ''
for ind, row in train.iterrows():
    for feature in row['features']:
        text = " ".join([text, "_".join(feature.strip().split(" "))])
    text_da = " ".join([text_da,"_".join(row['display_address'].strip().split(" "))])
    text_street = " ".join([text_street,"_".join(row['street_address'].strip().split(" "))])
    #text_desc = " ".join([text_desc, row['description']])
text = text.strip()
text_da = text_da.strip()
text_street = text_street.strip()
#text_desc = text_desc.strip()
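
Growing a string inside a loop can become slow on a large frame. As an alternative sketch, the tokens can be collected first and joined once; `features_column` below is a hypothetical stand-in for `train['features']` with made-up values.

```python
# Toy stand-in for the features column: one list of strings per listing.
features_column = [["Doorman", "Hardwood Floors"], [], ["Laundry in Building "]]

# Turn each multi-word feature into a single underscore-joined token.
tokens = [
    "_".join(feature.strip().split(" "))
    for feature_list in features_column
    for feature in feature_list
]

# Join all tokens once instead of growing a string inside the loop.
text = " ".join(tokens)
print(text)
```

The resulting string has the same form as the `text` variable built above and could be passed to `WordCloud.generate` in the same way.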

plt.figure(figsize=(12,6))
wordcloud = WordCloud(background_color='white', width=600, height=300, max_font_size=50, max_words=40).generate(text)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for features", fontsize=30)
plt.axis("off")
plt.show()

# wordcloud for display address
plt.figure()
wordcloud = WordCloud(background_color='white', width=600, height=300, max_font_size=50, max_words=40).generate(text_da)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for Display Address", fontsize=30)
plt.axis("off")
plt.show()

# wordcloud for street address
plt.figure()
wordcloud = WordCloud(background_color='white', width=600, height=300, max_font_size=50, max_words=40).generate(text_street)
wordcloud.recolor(random_state=0)
plt.imshow(wordcloud)
plt.title("Wordcloud for Street Address", fontsize=30)
plt.axis("off")
plt.show()

Correlation between features

contFeatureslist = []
contFeatureslist.append("bathrooms")
contFeatureslist.append("bedrooms")
contFeatureslist.append("price")

print(contFeatureslist)

correlationMatrix = train[contFeatureslist].corr().abs()

plt.subplots(figsize=(13, 9))
sns.heatmap(correlationMatrix,annot=True)

# Overlay a second heatmap that hides every cell with |corr| < 1, so only the diagonal is redrawn
sns.heatmap(correlationMatrix, mask=correlationMatrix < 1, cbar=False)
plt.show()
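
Since a correlation matrix is symmetric, a common variation is to hide the diagonal and upper triangle so that each pair appears once. The sketch below builds such a mask on synthetic data (the three columns are generated here for illustration and only mimic the real feature names).

```python
import numpy as np
import pandas as pd

# Synthetic features with a built-in dependence on bedrooms (illustrative data only).
rng = np.random.default_rng(0)
bedrooms = rng.integers(0, 5, size=500)
df = pd.DataFrame({
    "bedrooms": bedrooms,
    "bathrooms": 1 + bedrooms // 2,
    "price": 1500 + 800 * bedrooms + rng.normal(0, 300, size=500),
})

# Absolute Pearson correlations, as in the heatmap above.
corr = df.corr().abs()

# True on the diagonal and upper triangle: these cells would be hidden.
mask = np.triu(np.ones(corr.shape, dtype=bool))

# Keep only the lower triangle; hidden cells become NaN.
lower = corr.where(~mask)
print(lower)
```

The boolean array can be passed as the `mask` argument of `sns.heatmap`, which leaves the cells where the mask is True blank.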