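Load the data:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train_df = pd.read_csv("train.csv")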
Inspect the columns:
train_df.columns
Missing data:
total = train_df.isnull().sum().sort_values(ascending=False)
percent = (train_df.isnull().sum()/train_df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(50)
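The same table can be wrapped in a small helper so it can be reused on other frames, for example a test set. This is only a sketch: the name missing_summary and the toy frame are made up for illustration and do not come from the original notes.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and fractions, worst column first."""
    total = df.isnull().sum()
    percent = df.isnull().mean()  # mean of a boolean mask = fraction missing
    out = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    return out.sort_values('Total', ascending=False)

# Toy frame with hypothetical values, just to show the shape of the result.
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan],
                    'b': [1.0, 2.0, 3.0, 4.0],
                    'c': [np.nan, 'x', 'y', 'z']})
summary = missing_summary(toy)
print(summary)
```

Using isnull().mean() gives the same fraction as sum()/count() on the mask, with one pass fewer.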
Proportion of each value:
train_df.product_type.value_counts(normalize=True)
Plotting:
Mostly done with seaborn; for a tutorial see: http://blog.csdn.net/qq_34264472/article/details/53814653
Analyzing a single variable:
price = train_df['price_doc']
plt.figure(figsize=(8,4))
sns.distplot(price, kde=False)  # deprecated since seaborn 0.11; sns.histplot(price) is the replacement
For how to use distplot, see https://zhuanlan.zhihu.com/p/24464836
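Correlation among the most important variables:
corrmat = train_df.select_dtypes(include='number').corr()  # numeric columns only; recent pandas raises on string columns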
n = 15
cols = corrmat.nlargest(n, 'price_doc')['price_doc'].index
cm_df = train_df[cols].corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(cm_df, square=True, annot=True, fmt='.2f', annot_kws={'size':10}, cbar=True)
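To make the nlargest step concrete, here is a small sketch on synthetic data. The column names (target, strong, weak, noise) and the random data are invented for illustration. It also shows a variant that ranks by absolute correlation, because nlargest on the signed values would leave out strongly negative correlations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'target': x,
    'strong': x + rng.normal(scale=0.1, size=200),  # nearly a copy of target
    'weak': x + rng.normal(scale=5.0, size=200),    # mostly noise
    'noise': rng.normal(size=200),                  # unrelated to target
})

corr = toy.corr()
# Same pattern as above: the n rows with the largest correlation to the target.
# The target itself always comes first because its self-correlation is 1.
top = corr.nlargest(2, 'target')['target'].index
print(list(top))

# Variant: rank by magnitude so strong negative correlations are kept too.
top_abs = corr['target'].abs().nlargest(2).index
print(list(top_abs))
```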
Relationship between two variables, scatter plot:
var = 'full_sq'
data = pd.concat([train_df['price_doc'], train_df[var]], axis=1)
data.plot.scatter(x=var, y='price_doc')
Plotting a categorical variable by category:
Two categories:
g_p_type = train_df.groupby('product_type')['price_doc'].mean()
plt.figure(figsize=(8,4))
sns.barplot(x=g_p_type.index, y=g_p_type.values)
plt.ylabel('price_doc')
plt.show()
g_p_type = train_df['product_type'].value_counts()
plt.figure(figsize=(8,4))
sns.barplot(x=g_p_type.index, y=g_p_type.values)
plt.ylabel('Number of Occurrences')
plt.show()
Many categories:
sub_area_list = train_df.groupby('sub_area')['price_doc'].mean().sort_values(ascending=False)[:15]
plt.figure(figsize=(8,4))
sns.barplot(x=sub_area_list.index, y=sub_area_list.values)
plt.ylabel('price_doc')
plt.xticks(rotation=70)
plt.show()
Preprocessing:
Cap overly large values:
ulimit = np.percentile(train_df.price_doc.values, 99)
train_df.loc[train_df['price_doc'] > ulimit, 'price_doc'] = ulimit  # .ix was removed from pandas; use .loc
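The capping step can also be written with Series.clip, which returns a new Series instead of assigning in place. A minimal sketch on made-up numbers (the toy prices and the 90th percentile are chosen only so the effect is visible on ten values):

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one extreme outlier.
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], dtype=float)

ulimit = np.percentile(prices.values, 90)
capped = prices.clip(upper=ulimit)  # values above ulimit are replaced by ulimit

print(ulimit)
print(capped.tolist())
```

Rows are kept rather than dropped, so the frame keeps its length; only the tail of the distribution is flattened.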
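Dummy variables work well for nominal attributes with a small set of values, such as E. For nominal attributes with many distinct values, dummies become unwieldy. pandas provides the factorize() function, which maps each string value of a nominal attribute to an integer, with identical strings mapped to the same integer.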
df_numeric = df_all.select_dtypes(exclude=['object'])
df_obj = df_all.select_dtypes(include=['object']).copy()
for c in df_obj:
    df_obj[c] = pd.factorize(df_obj[c])[0]
df_values = pd.concat([df_numeric, df_obj], axis=1)
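To see the difference between the two encodings side by side, here is a small sketch on a toy Series (the values 'a' and 'b' are placeholders, not taken from the dataset). Note that factorize() encodes missing values as -1 by default, which may need handling before modeling.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'a', None])

# One indicator column per distinct value; the missing row is all zeros/False.
dummies = pd.get_dummies(s)

# One integer code per distinct value, in order of first appearance.
codes, uniques = pd.factorize(s)

print(dummies.shape)
print(codes)
print(list(uniques))
```

The integer codes carry an arbitrary order. Tree-based models usually tolerate that, while linear models may read it as a real ordering, so the choice of encoding likely depends on the model used afterwards.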
Clean up values that do not make logical sense