Data Analysis 06
pandas Visualization
Basic Plotting
Visualizing a Series
A Series provides a plot method that uses the index as the x-axis and the values as the y-axis:
ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
Visualizing a DataFrame
A DataFrame provides a plot method that lets you choose one column as x and another as y:
df3 = pd.DataFrame(np.random.randn(1000, 2),
                   columns=['B', 'C']).cumsum()
df3['A'] = np.arange(len(df3))
df3.plot(x='A', y='B')
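Advanced Plotting
The plot() method accepts a kind keyword argument that selects the chart type, including:
kind           Description
bar or barh    Bar chart (vertical or horizontal)
hist           Histogram
box            Box plot
scatter        Scatter plot
pie            Pie chart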
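As a minimal sketch of how the kind argument relates to the accessor methods, the snippet below draws the same bar chart both ways. The Series values are made up for illustration, and the Agg backend is an assumption chosen only so that the code also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, used only for illustration
s = pd.Series([3, 1, 2], index=["a", "b", "c"])

# Style 1: choose the chart type with the kind keyword
ax1 = s.plot(kind="bar")
n_bars_kind = len(ax1.patches)      # one rectangle per value
plt.close(ax1.figure)

# Style 2: the equivalent accessor method
ax2 = s.plot.bar()
n_bars_accessor = len(ax2.patches)
plt.close(ax2.figure)

print(n_bars_kind, n_bars_accessor)
```

Both calls go through the same plotting machinery, so which style to use is mostly a matter of taste; the accessor form has the advantage that editors can autocomplete the chart types.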
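The corresponding APIs are as follows.
Bar chart
series.plot.bar()
dataFrame.plot.bar()
dataFrame.plot.barh()
Histogram
series.plot.hist(alpha=0.5, bins=5)
dataFrame.plot.hist(alpha=0.5, bins=5)
Scatter plot
# col is a column name or a sequence of color values; colormap takes a colormap name such as 'jet'
df.plot.scatter(x='a', y='b', c=col, colormap='jet')
Pie chart
series.plot.pie(figsize=(6, 6))
dataFrame.plot.pie(subplots=True, figsize=(6, 6), layout=(2, 2))
Box plot
df.plot.box()
df.boxplot(by='X')
A box plot shows the central tendency of a data set, and the interquartile range (the difference between the third and first quartiles) shows how spread out it is:
A high median indicates a higher typical level; a low median indicates a lower one.
A short box means the data is concentrated; a long box means the data is spread out.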
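The quantities a box plot draws can be computed directly. The sketch below uses made-up scores (not the course data) to show the median, the quartiles, and the interquartile range that determines the length of the box.

```python
import pandas as pd

# Hypothetical scores, used only for illustration
scores = pd.Series([55, 60, 62, 65, 70, 72, 78, 85, 90])

# First quartile, median, and third quartile (linear interpolation by default)
q1, median, q3 = scores.quantile([0.25, 0.5, 0.75])

# Interquartile range: the height of the box
iqr = q3 - q1

print(q1, median, q3, iqr)
```

A larger iqr corresponds to a longer box, which is what "the data is spread out" means in the description above.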
Code Summary
pandas Visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Basic Plotting
s = pd.Series(np.random.normal(100, 10, 10),
              index=pd.date_range('2020-01-01', periods=10))
s.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1bf39235be0>
ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1bf392a52b0>
data = np.random.normal(0, 1, (10, 2))
df = pd.DataFrame(data, columns=['A', 'B'])
df['C'] = np.arange(10)
df
          A         B  C
0 -1.715010 -1.105532  0
1 -0.059422 -0.444824  1
2 -0.621798  0.653777  2
3 -2.577156 -0.406837  3
4 -2.208147 -0.188947  4
5 -0.120376 -1.299448  5
6 -0.609514  0.611829  6
7 -0.509499  0.682336  7
8  0.873368 -1.808792  8
9 -0.598329  0.618860  9
df.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x1bf3932b080>
df.plot(x='C', y=['A', 'B'])
<matplotlib.axes._subplots.AxesSubplot at 0x1bf393ca2b0>
Advanced Plotting with pandas
s = pd.Series(np.random.normal(100, 10, 10),
              index=pd.date_range('2020-01-01', periods=10))
s.plot.barh(color='dodgerblue')
<matplotlib.axes._subplots.AxesSubplot at 0x1bf394417b8>
data = np.random.normal(80, 3, (10, 2))
df = pd.DataFrame(data, columns=['A', 'B'])
df.plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1bf396ddc50>
Histograms with pandas
s.plot.hist(bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x1bf397e5f98>
Scatter Plots with pandas
df.plot.scatter(x='A', y='B', s=80, c='A', cmap='jet')
<matplotlib.axes._subplots.AxesSubplot at 0x1bf39ab08d0>
Pie Charts with pandas
values = [15, 13.3, 8.5, 7.3, 4.62, 51.28]
labels = ['Java', 'C', 'Python', 'C++', 'VB', 'Other']
s = pd.Series(values, index=labels)
s
Java 15.00
C 13.30
Python 8.50
C++ 7.30
VB 4.62
Other 51.28
dtype: float64
s.plot.pie(figsize=(6, 6), startangle=90, shadow=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1bf3aefa898>
df = pd.DataFrame(s, columns=['A'])
df['B'] = [14.1, 3, 18.2, 8, 2, 30.2]
df.plot.pie(subplots=True, figsize=(8, 4), layout=(1, 2))
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001BF3CAEA390>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x000001BF3CB6DF28>]],
      dtype=object)
Box Plots
data = pd.read_csv('../data/学生考试表现数据/StudentsPerformance.csv')
ms = data['math score']
ms.plot.box()
<matplotlib.axes._subplots.AxesSubplot at 0x1bf3cc080b8>
df = data[['math score', 'writing score', 'reading score']]
df.plot.box()
<matplotlib.axes._subplots.AxesSubplot at 0x1bf3cc6f5c0>
Project resource download:
Available from my resource files. Download link: https://download.csdn.net/download/yegeli/12562286
Project 1: Analyzing the Factors That Affect Student Performance
Analysis of factors affecting student scores
import numpy as np
import pandas as pd
data = pd.read_csv('StudentsPerformance.csv')
# Sum only the numeric score columns; the remaining columns hold strings
data['total score'] = data.sum(axis=1, numeric_only=True)
data.describe(include=['number', 'object'])
        gender race/ethnicity parental level of education     lunch test preparation course
count     1000           1000                        1000      1000                    1000
unique       2              5                           6         2                       2
top     female        group C                some college  standard                    none
freq       518            319                         226       645                     642
mean       NaN            NaN                         NaN       NaN                     NaN
std        NaN            NaN                         NaN       NaN                     NaN
min        NaN            NaN                         NaN       NaN                     NaN
25%        NaN            NaN                         NaN       NaN                     NaN
50%        NaN            NaN                         NaN       NaN                     NaN
75%        NaN            NaN                         NaN       NaN                     NaN
max        NaN            NaN                         NaN       NaN                     NaN

        math score  reading score  writing score  total score
count   1000.00000    1000.000000    1000.000000  1000.000000
unique         NaN            NaN            NaN          NaN
top            NaN            NaN            NaN          NaN
freq           NaN            NaN            NaN          NaN
mean      66.08900      69.169000      68.054000   203.312000
std       15.16308      14.600192      15.195657    42.771978
min        0.00000      17.000000      10.000000    27.000000
25%       57.00000      59.000000      57.750000   175.000000
50%       66.00000      70.000000      69.000000   205.000000
75%       77.00000      79.000000      79.000000   233.000000
max      100.00000     100.000000     100.000000   300.000000
r = data.pivot_table(index='gender')
r
        math score  reading score  total score  writing score
gender
female   63.633205      72.608108   208.708494      72.467181
male     68.728216      65.473029   197.512448      63.311203
r.T.plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x24839422d00>
r.T.plot.pie(subplots=True, figsize=(12, 3))
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002483B50FFA0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B56EE80>],
      dtype=object)
Overall, girls score higher on average (in reading, writing, and total score), while boys do better in math.
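pivot_table with only an index argument averages the numeric columns within each group. The sketch below makes that explicit on a made-up four-row table (not the real StudentsPerformance.csv) by naming the value columns and the aggregation, and compares the result with the equivalent groupby call.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    'gender': ['female', 'female', 'male', 'male'],
    'math score': [60, 70, 80, 90],
    'reading score': [75, 85, 65, 55],
})
score_cols = ['math score', 'reading score']

# Explicit form: which columns to aggregate and how
by_pivot = toy.pivot_table(index='gender', values=score_cols, aggfunc='mean')

# The same averages computed with groupby
by_group = toy.groupby('gender')[score_cols].mean()

print(by_pivot)
```

Naming the value columns explicitly also avoids depending on how a given pandas version treats the non-numeric columns.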
r = data.pivot_table(index='race/ethnicity')
r
                math score  reading score  total score  writing score
race/ethnicity
group A          61.629213      64.674157   188.977528      62.674157
group B          63.452632      67.352632   196.405263      65.600000
group C          64.463950      69.103448   201.394984      67.827586
group D          67.362595      70.030534   207.538168      70.145038
group E          73.821429      73.028571   218.257143      71.407143
Ranking of the race/ethnicity groups by average score, from highest to lowest: E, D, C, B, A
r.T.plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x2483b5f6370>
r = data.pivot_table(index='parental level of education')
r.sort_values(by='total score')
                             math score  reading score  total score  writing score
parental level of education
high school                   62.137755      64.704082   189.290816      62.448980
some high school              63.497207      66.938547   195.324022      64.888268
some college                  67.128319      69.460177   205.429204      68.840708
associate's degree            67.882883      70.927928   208.707207      69.896396
bachelor's degree             69.389831      73.000000   215.771186      73.381356
master's degree               69.745763      75.372881   220.796610      75.677966
r.T.plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x2483b672c70>
In general, the higher the parents' level of education, the better the students' scores. The one exception in this table is that "some high school" averages slightly above "high school".
r = data.pivot_table(index='lunch')
r.sort_values(by='total score', ascending=False)
              math score  reading score  total score  writing score
lunch
standard       70.034109      71.654264   212.511628      70.823256
free/reduced   58.921127      64.653521   186.597183      63.022535
r.plot.pie(subplots=True, figsize=(12, 3))
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002483B738310>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B75CFA0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B7891C0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B7B71F0>],
      dtype=object)
r = data.pivot_table(index='test preparation course')
r.sort_values(by='total score', ascending=False)
                         math score  reading score  total score  writing score
test preparation course
completed                 69.695531      73.893855   218.008380      74.418994
none                      64.077882      66.534268   195.116822      64.504673
r.plot.pie(subplots=True, figsize=(12, 3))
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002483B57B280>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B768DC0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B67F9D0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B8A43D0>],
      dtype=object)
r = data.pivot_table(index=['gender', 'test preparation course'])
r
                                math score  reading score  total score  writing score
gender test preparation course
female completed                 67.195652      77.375000   223.364130      78.793478
       none                      61.670659      69.982036   200.634731      68.982036
male   completed                 72.339080      70.212644   212.344828      69.793103
       none                      66.688312      62.795455   189.133117      59.649351
Comparing the top 100 and bottom 100 students
r = data.sort_values(by='total score', ascending=False)
top100 = r.head(100)
tail100 = r.tail(100)
r1 = pd.DataFrame({'top100': top100['gender'].value_counts(),
                   'tail100': tail100['gender'].value_counts()})
r1
        top100  tail100
female      66       38
male        34       62
r1.plot.pie(subplots=True, figsize=(8, 4))
array([<matplotlib.axes._subplots.AxesSubplot object at 0x000002483B93EA30>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x000002483B963BE0>],
      dtype=object)
# Use a separate name so the full data set in `data` is not overwritten
edu_counts = data['parental level of education'].value_counts()
edu_counts.plot.pie(figsize=(6, 6))
<matplotlib.axes._subplots.AxesSubplot at 0x2483c98cf70>
edu_counts.plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x2483c9daf70>
r2 = pd.DataFrame({'top100': top100['parental level of education'].value_counts(),
                   'tail100': tail100['parental level of education'].value_counts()})
r2
                    top100  tail100
associate's degree      29       17
bachelor's degree       20        8
high school              6       32
master's degree         15        1
some college            21       14
some high school         9       28
r2.plot.barh()
<matplotlib.axes._subplots.AxesSubplot at 0x2483ca2e550>
Project 2: Titanic Survival Data Analysis and Visualization
Kaggle case study: Titanic survival prediction
Inspecting the Data
Loading the data with pandas
import pandas as pd
import numpy as np
data_train = pd.read_csv('train.csv')
data_train.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
data_train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500   NaN        S
The data has the following fields:
PassengerId => passenger ID
Survived => survival (1 = survived)
Pclass => passenger class (1st/2nd/3rd)
Name => passenger name
Sex => sex
Age => age
SibSp => number of siblings/spouses aboard
Parch => number of parents/children aboard
Ticket => ticket information
Fare => fare
Cabin => cabin
Embarked => port of embarkation
Simple Descriptive Analysis
data_train. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The training data contains 891 passengers in total, but some attributes are incomplete. For example:
Age is recorded for only 714 passengers.
Cabin is known for only 204 passengers.
To see how the numeric columns are distributed, we can use describe():
data_train.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200
The mean row tells us that about 38.4% of the passengers (0.383838) survived, and that the average passenger age is about 29.7.
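The mean of Survived can be read as a survival rate because the column only holds 0 and 1. A small sketch with made-up values (not the Titanic data) shows that the mean of such a column equals the share of ones.

```python
import pandas as pd

# Hypothetical 0/1 column: 1 = survived, 0 = did not survive
survived = pd.Series([1, 0, 0, 1, 0, 0, 0, 1])

# Mean of a 0/1 column
rate_from_mean = survived.mean()

# Share of rows equal to 1, computed explicitly
rate_from_counts = (survived == 1).sum() / len(survived)

print(rate_from_mean, rate_from_counts)
```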
Exploring the Data Through Visualization
Number of survivors
import matplotlib.pyplot as plt
from pylab import mpl
# Use a font with CJK glyphs so the Chinese labels below render correctly
mpl.rcParams['font.sans-serif'] = ['Simhei']
# Keep the minus sign rendering correctly with this font
mpl.rcParams['axes.unicode_minus'] = False
data_train.Survived.value_counts().plot(kind='bar')
plt.title('获救情况(1为获救)')
plt.ylabel('人数')
plt.legend()
plt.show()
Passenger class distribution
data_train.Pclass.value_counts().plot(kind='bar')
plt.ylabel('人数')
plt.xlabel('乘客等级')
plt.title('乘客等级分布情况')
print(data_train.Pclass.value_counts())
3 491
1 216
2 184
Name: Pclass, dtype: int64
Survival by age
data_train['Age'].plot.kde()
<matplotlib.axes._subplots.AxesSubplot at 0x1da647b4358>
plt.scatter(data_train.Survived, data_train.Age)
plt.ylabel('年龄')
plt.grid(axis='y')
plt.title('按照年龄看获救分布可视化(1为获救)')
plt.show()
Age distribution within each passenger class
data_train.Age[data_train.Pclass == 1].plot(kind='kde')
data_train.Age[data_train.Pclass == 2].plot(kind='kde')
data_train.Age[data_train.Pclass == 3].plot(kind='kde')
plt.xlabel('年龄')
plt.ylabel('密度')
plt.title('各等级的乘客年龄分布')
plt.legend(('一等舱', '二等舱', '三等舱'))
plt.show()
Number of passengers boarding at each port
data_train.Embarked.value_counts().plot(kind='bar')
plt.title('各登船港口上船人数')
plt.ylabel('人数')
plt.show()
From these charts we can see:
Just over 300 people were rescued, fewer than half.
Third-class passengers are by far the most numerous.
Both victims and survivors span a wide range of ages.
The age distributions of the three classes have a broadly similar shape; in second and third class, passengers in their early twenties are the most common, while in first class those around 40 are the most common.
Boarding counts decrease in the order S, C, Q, and S has far more passengers than the other two ports.
Visualizing Each Attribute Against Survival
Survival by passenger class
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
df = pd.DataFrame({'获救': Survived_1, '未获救': Survived_0})
df.plot(kind='bar')
plt.title('各乘客等级的获救情况可视化')
plt.xlabel('乘客等级')
plt.ylabel('人数')
plt.show()
Survival by port of embarkation
Survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()
Survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts()
df = pd.DataFrame({'获救': Survived_1, '未获救': Survived_0})
df.plot(kind='bar')
plt.title('各登船港口的获救情况可视化')
plt.xlabel('登船港口')
plt.ylabel('人数')
plt.show()
Survival by sex
Survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
Survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
df = pd.DataFrame({'男性': Survived_m, '女性': Survived_f})
df.plot(kind='bar')
plt.title('按照性别看获救情况')
plt.xlabel('获救')
plt.ylabel('人数')
plt.show()
More women than men were rescued.
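Counts alone can mislead when the groups differ in size, so it can help to look at survival rates per group as well. The sketch below is one possible way to do that with pd.crosstab; the six rows are made up and only stand in for the Sex and Survived columns.

```python
import pandas as pd

# Hypothetical toy data standing in for the Sex and Survived columns
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 0],
})

# Raw counts per sex and outcome
counts = pd.crosstab(toy['Sex'], toy['Survived'])

# Row-normalized: each row sums to 1, so column 1 is the survival rate
rates = pd.crosstab(toy['Sex'], toy['Survived'], normalize='index')

print(counts)
print(rates)
```

In this toy table both sexes have one survivor, yet the rate is 0.5 for women and 0.25 for men, which is the kind of difference a count-only bar chart hides.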
Effect of the SibSp and Parch fields on survival
data_train.pivot_table(index=['SibSp', 'Survived'], values='PassengerId', aggfunc='count')
                PassengerId
SibSp Survived
0     0                 398
      1                 210
1     0                  97
      1                 112
2     0                  15
      1                  13
3     0                  12
      1                   4
4     0                  15
      1                   3
5     0                   5
8     0                   7
data_train.pivot_table(index=['Parch', 'Survived'], values='PassengerId', aggfunc='count')
                PassengerId
Parch Survived
0     0                 445
      1                 233
1     0                  53
      1                  65
2     0                  40
      1                  40
3     0                   2
      1                   3
4     0                   4
5     0                   4
      1                   1
6     0                   1
Ticket is the ticket number. It acts as an identifier and has little bearing on the outcome, so it is not considered as a feature.
Cabin has a value for only 204 passengers; let's first look at its distribution.
data_train.Cabin.value_counts()
G6 4
C23 C25 C27 4
B96 B98 4
F2 3
E101 3
..
D30 1
A7 1
D47 1
E31 1
C99 1
Name: Cabin, Length: 147, dtype: int64
survival_cabin = data_train.Survived[pd.notnull(data_train.Cabin)].value_counts()
survival_cabin
survival_nocabin = data_train.Survived[pd.isnull(data_train.Cabin)].value_counts()
df = pd.DataFrame({'有': survival_cabin, '无': survival_nocabin})
df.plot(kind='bar')
plt.title('按照Cabin有无去看获救情况')
plt.xlabel('获救情况')
# The y-axis shows passenger counts; the legend distinguishes with/without Cabin
plt.ylabel('人数')
plt.show()
Passengers with a Cabin record appear to have a somewhat higher chance of survival.
Data Preprocessing
Handling Missing Values
data_train. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
# Fill missing ages with the mean age
data_train['Age'] = data_train['Age'].fillna(data_train['Age'].mean())
def set_cabin(df):
    # Replace Cabin with a Yes/No flag for whether a value was recorded
    df.loc[df.Cabin.notnull(), 'Cabin'] = 'Yes'
    df.loc[df.Cabin.isnull(), 'Cabin'] = 'No'
    return df
data_train = set_cabin(data_train)
# Fill the two missing Embarked values with 'S', the most common port
data_train['Embarked'] = data_train['Embarked'].fillna('S')
data_train.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    No        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   Yes        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    No        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   Yes        S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    No        S
data_train. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 891 non-null object
Embarked 891 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
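One-Hot Encoding
Logistic regression requires all input features to be numeric, so categorical features are usually factorized (one-hot encoded) first.
What is factorization/one-hot encoding? Take Embarked as an example. It is originally a single attribute whose value can be one of ['S', 'C', 'Q']; one-hot encoding expands it into three attributes: 'Embarked_C', 'Embarked_S', and 'Embarked_Q'.
A row whose Embarked was S gets 1 under 'Embarked_S' and 0 under 'Embarked_C' and 'Embarked_Q'.
A row whose Embarked was C gets 1 under 'Embarked_C' and 0 under 'Embarked_S' and 'Embarked_Q'.
A row whose Embarked was Q gets 1 under 'Embarked_Q' and 0 under 'Embarked_C' and 'Embarked_S'.
We use pandas' get_dummies to do this and concatenate the result onto the original data_train.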
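The expansion described above can be tried on a tiny Series first. This is a minimal sketch with made-up values; dtype=int is passed only so that the output shows 0/1 regardless of the pandas version (recent versions return booleans by default).

```python
import pandas as pd

# Hypothetical values standing in for the Embarked column
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One new 0/1 column per distinct value, named <prefix>_<value>
dummies = pd.get_dummies(embarked, prefix='Embarked', dtype=int)

print(dummies)
```

Each row has exactly one 1, in the column that matches its original value.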
dummies_Cabin = pd.get_dummies(data_train['Cabin'], prefix='Cabin')
# Use the 'Embarked' prefix so the column names match the test set and the modeling regex
dummies_Embarked = pd.get_dummies(data_train['Embarked'], prefix='Embarked')
dummies_Pclass = pd.get_dummies(data_train['Pclass'], prefix='Pclass')
dummies_Sex = pd.get_dummies(data_train['Sex'], prefix='Sex')
df = pd.concat([data_train, dummies_Cabin, dummies_Embarked, dummies_Pclass, dummies_Sex], axis=1)
df.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
df.head()
   PassengerId  Survived   Age  SibSp  Parch     Fare  Cabin_No  Cabin_Yes  Embarked_C  Embarked_Q  Embarked_S  Pclass_1  Pclass_2  Pclass_3  Sex_female  Sex_male
0            1         0  22.0      1      0   7.2500         1          0           0           0           1         0         0         1           0         1
1            2         1  38.0      1      0  71.2833         0          1           1           0           0         1         0         0           1         0
2            3         1  26.0      0      0   7.9250         1          0           0           0           1         0         0         1           1         0
3            4         1  35.0      1      0  53.1000         0          1           0           0           1         1         0         0           1         0
4            5         0  35.0      0      0   8.0500         1          0           0           0           1         0         0         1           0         1
df.describe()
       PassengerId    Survived         Age       SibSp       Parch        Fare    Cabin_No   Cabin_Yes
count   891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838   29.699118    0.523008    0.381594   32.204208    0.771044    0.228956
std     257.353842    0.486592   13.002015    1.102743    0.806057   49.693429    0.420397    0.420397
min       1.000000    0.000000    0.420000    0.000000    0.000000    0.000000    0.000000    0.000000
25%     223.500000    0.000000   22.000000    0.000000    0.000000    7.910400    1.000000    0.000000
50%     446.000000    0.000000   29.699118    0.000000    0.000000   14.454200    1.000000    0.000000
75%     668.500000    1.000000   35.000000    1.000000    0.000000   31.000000    1.000000    0.000000
max     891.000000    1.000000   80.000000    8.000000    6.000000  512.329200    1.000000    1.000000

       Embarked_C  Embarked_Q  Embarked_S    Pclass_1    Pclass_2    Pclass_3  Sex_female    Sex_male
count  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000  891.000000
mean     0.188552    0.086420    0.725028    0.242424    0.206510    0.551066    0.352413    0.647587
std      0.391372    0.281141    0.446751    0.428790    0.405028    0.497665    0.477990    0.477990
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
50%      0.000000    0.000000    1.000000    0.000000    0.000000    1.000000    0.000000    1.000000
75%      0.000000    0.000000    1.000000    0.000000    0.000000    1.000000    1.000000    1.000000
max      1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000    1.000000
Standardizing the Data
One more step is needed: the Age and Fare values vary over a much wider range than the other features, so we apply z-score standardization (subtract the mean, then divide by the standard deviation).
a = df.Age
df['Age_scaled'] = (a - a.mean()) / a.std()
df = df.drop('Age', axis=1)
b = df.Fare
df['Fare_scaled'] = (b - b.mean()) / b.std()
df = df.drop('Fare', axis=1)
df.head()
   PassengerId  Survived  SibSp  Parch  Cabin_No  Cabin_Yes  Embarked_C  Embarked_Q  Embarked_S  Pclass_1  Pclass_2  Pclass_3  Sex_female  Sex_male  Age_scaled  Fare_scaled
0            1         0      1      0         1          0           0           0           1         0         0         1           0         1   -0.592148    -0.502163
1            2         1      1      0         0          1           1           0           0         1         0         0           1         0    0.638430     0.786404
2            3         1      0      0         1          0           0           0           1         0         0         1           1         0   -0.284503    -0.488580
3            4         1      1      0         0          1           0           0           1         1         0         0           1         0    0.407697     0.420494
4            5         0      0      0         1          0           0           0           1         0         0         1           0         1    0.407697    -0.486064
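To check what the standardization step does, the sketch below applies the same formula to a few made-up fares. After scaling, the column has mean 0 and standard deviation 1; note that pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Hypothetical fares, used only for illustration
fare = pd.Series([7.25, 71.28, 7.93, 53.10, 8.05])

# z-score: subtract the mean, divide by the (sample) standard deviation
fare_scaled = (fare - fare.mean()) / fare.std()

print(fare_scaled.mean(), fare_scaled.std())
```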
Modeling: Logistic Regression
We extract the feature columns we need, convert them to a NumPy array, and build the model with scikit-learn's LogisticRegression.
from sklearn import linear_model
train_df = df.filter(regex='Survived|Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
train_np = train_df.values
# Column 0 is Survived (the label); the remaining columns are the features
y = train_np[:, 0]
X = train_np[:, 1:]
clf = linear_model.LogisticRegression(penalty='l2')
clf.fit(X, y)
# Accuracy on the training data itself
print(clf.score(X, y))
clf
0.8125701459034792
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
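penalty: the penalty term, a str. The options discussed here are l1 and l2, and the default is l2. It specifies the norm used in the penalty. The newton-cg, sag, and lbfgs solvers support only the L2 norm. The L1 norm corresponds to assuming that the model parameters follow a Laplace distribution, and the L2 norm to assuming that they follow a Gaussian distribution. The penalty adds a constraint on the parameters, which makes the model less prone to overfitting.
tol: the stopping criterion, a float, default 1e-4. The solver stops and treats the solution as optimal once it has converged to within this tolerance.
C: the inverse of the regularization strength λ, a float, default 1.0. It must be a positive float. As with SVMs, smaller values mean stronger regularization.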
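The effect of C can be seen by fitting the same data twice. The sketch below uses synthetic data (not the Titanic features) and relies on scikit-learn's default L2 penalty; the specific values 0.01 and 100 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only for illustration
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.randn(200) > 0).astype(int)

# Small C = strong regularization; large C = weak regularization
strong = LogisticRegression(C=0.01).fit(X, y)
weak = LogisticRegression(C=100.0).fit(X, y)

# Stronger regularization shrinks the coefficient vector
norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)

print(norm_strong, norm_weak)
```

The coefficients fitted with the smaller C have a smaller norm, which is what "smaller values mean stronger regularization" looks like in practice.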
Next, we apply the same preprocessing to the test set.
data_test = pd.read_csv('test.csv')
data_test.head()
   PassengerId  Pclass                                          Name     Sex   Age  SibSp  Parch   Ticket     Fare Cabin Embarked
0          892       3                              Kelly, Mr. James    male  34.5      0      0   330911   7.8292   NaN        Q
1          893       3              Wilkes, Mrs. James (Ellen Needs)  female  47.0      1      0   363272   7.0000   NaN        S
2          894       2                     Myles, Mr. Thomas Francis    male  62.0      0      0   240276   9.6875   NaN        Q
3          895       3                              Wirz, Mr. Albert    male  27.0      0      0   315154   8.6625   NaN        S
4          896       3  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0      1      1  3101298  12.2875   NaN        S
data_test. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId 418 non-null int64
Pclass 418 non-null int64
Name 418 non-null object
Sex 418 non-null object
Age 332 non-null float64
SibSp 418 non-null int64
Parch 418 non-null int64
Ticket 418 non-null object
Fare 417 non-null float64
Cabin 91 non-null object
Embarked 418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
# Fill the single missing Fare value with 0
data_test.loc[data_test.Fare.isnull(), 'Fare'] = 0
data_test['Age'] = data_test['Age'].fillna(data_test['Age'].mean())
# Reuse set_cabin, defined earlier for the training set
data_test = set_cabin(data_test)
dummies_Cabin = pd.get_dummies(data_test['Cabin'], prefix='Cabin')
dummies_Embarked = pd.get_dummies(data_test['Embarked'], prefix='Embarked')
dummies_Pclass = pd.get_dummies(data_test['Pclass'], prefix='Pclass')
dummies_Sex = pd.get_dummies(data_test['Sex'], prefix='Sex')
df_test = pd.concat([data_test, dummies_Cabin, dummies_Embarked, dummies_Pclass, dummies_Sex], axis=1)
df_test.drop(['Pclass', 'Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], axis=1, inplace=True)
a = df_test.Age
df_test['Age_scaled'] = (a - a.mean()) / a.std()
df_test = df_test.drop('Age', axis=1)
b = df_test.Fare
df_test['Fare_scaled'] = (b - b.mean()) / b.std()
df_test = df_test.drop('Fare', axis=1)
df_test.head()
   PassengerId  SibSp  Parch  Cabin_No  Cabin_Yes  Embarked_C  Embarked_Q  Embarked_S  Pclass_1  Pclass_2  Pclass_3  Sex_female  Sex_male  Age_scaled  Fare_scaled
0          892      0      0         1          0           0           1           0         0         0         1           0         1    0.334592    -0.496043
1          893      1      0         1          0           0           0           1         0         0         1           1         0    1.323944    -0.510885
2          894      0      0         1          0           0           1           0         0         1         0           0         1    2.511166    -0.462780
3          895      0      0         1          0           0           0           1         0         0         1           0         1   -0.259019    -0.481127
4          896      1      1         1          0           0           0           1         0         0         1           1         0   -0.654760    -0.416242
test = df_test.filter(regex='Age_.*|SibSp|Parch|Fare_.*|Cabin_.*|Embarked_.*|Sex_.*|Pclass_.*')
predictions = clf.predict(test)
result = pd.DataFrame({'PassengerId': data_test['PassengerId'].values,
                       'Survived': predictions.astype(np.int32)})
result.to_csv('logistic_regression_predictions.csv', index=False)
pd.read_csv('logistic_regression_predictions.csv').head(10)
   PassengerId  Survived
0          892         0
1          893         0
2          894         0
3          895         0
4          896         1
5          897         0
6          898         1
7          899         0
8          900         1
9          901         0
Project 3: MovieLens Movie Data Analysis and Visualization
MovieLens movie rating data analysis (part 1)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Reading the Data
users = pd.read_table('users.dat', header=None, names=['UserID', 'Gender', 'Age', 'Occupation', 'Zip-code'], sep='::', engine='python')
print(len(users))
6040
users.head(5)
   UserID Gender  Age  Occupation Zip-code
0       1      F    1          10    48067
1       2      M   56          16    70072
2       3      M   25          15    55117
3       4      M   45           7    02460
4       5      M   25          20    55455
ratings = pd.read_table('ratings.dat', header=None, names=['UserID', 'MovieID', 'Rating', 'Timestamp'], sep='::', engine='python')
print(len(ratings))
print(ratings.head(5))
movies = pd.read_table('movies.dat', header=None, names=['MovieID', 'Title', 'Genres'], sep='::', engine='python')
print(len(movies))
print(movies.head(5))
1000209
UserID MovieID Rating Timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
3883
MovieID Title Genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
Merging the Tables
data = pd.merge(pd.merge(users, ratings), movies)
data.tail(5)
         UserID Gender  Age  Occupation Zip-code  MovieID  Rating  Timestamp                                        Title                Genres
1000204    5949      M   18          17    47901     2198       5  958846401                           Modulations (1998)           Documentary
1000205    5675      M   35          14    30030     2703       3  976029116                        Broken Vessels (1998)                 Drama
1000206    5780      M   18          17    92886     2845       1  958153068                            White Boys (1999)                 Drama
1000207    5851      F   18          20    55410     3607       5  957756608                     One Little Indian (1973)  Comedy|Drama|Western
1000208    5938      M   25           1    35401     2909       4  957273353  Five Wives, Three Secretaries and Me (1998)           Documentary
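pd.merge with no on argument joins on the columns the two tables have in common, which is why the nested call above works: users and ratings share UserID, and the result shares MovieID with movies. The sketch below shows this on three tiny made-up tables (the toy_ names are hypothetical) and compares it with spelling out the keys.

```python
import pandas as pd

# Hypothetical miniature versions of the three tables
toy_users = pd.DataFrame({'UserID': [1, 2], 'Gender': ['F', 'M']})
toy_ratings = pd.DataFrame({'UserID': [1, 1, 2],
                            'MovieID': [10, 20, 10],
                            'Rating': [5, 3, 4]})
toy_movies = pd.DataFrame({'MovieID': [10, 20], 'Title': ['Movie A', 'Movie B']})

# Implicit keys: merge on whatever columns the two frames share
implicit = pd.merge(pd.merge(toy_users, toy_ratings), toy_movies)

# Explicit keys: the same joins with the key columns named
explicit = toy_users.merge(toy_ratings, on='UserID').merge(toy_movies, on='MovieID')

print(implicit)
```

Naming the keys is a little more verbose, but it guards against an accidental join on an unrelated column that happens to have the same name in both tables.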
Initial Descriptive Analysis
data.describe()
             UserID           Age    Occupation       MovieID        Rating     Timestamp
count  1.000209e+06  1.000209e+06  1.000209e+06  1.000209e+06  1.000209e+06  1.000209e+06
mean   3.024512e+03  2.973831e+01  8.036138e+00  1.865540e+03  3.581564e+00  9.722437e+08
std    1.728413e+03  1.175198e+01  6.531336e+00  1.096041e+03  1.117102e+00  1.215256e+07
min    1.000000e+00  1.000000e+00  0.000000e+00  1.000000e+00  1.000000e+00  9.567039e+08
25%    1.506000e+03  2.500000e+01  2.000000e+00  1.030000e+03  3.000000e+00  9.653026e+08
50%    3.070000e+03  2.500000e+01  7.000000e+00  1.835000e+03  4.000000e+00  9.730180e+08
75%    4.476000e+03  3.500000e+01  1.400000e+01  2.770000e+03  4.000000e+00  9.752209e+08
max    6.040000e+03  5.600000e+01  2.000000e+01  3.952000e+03  5.000000e+00  1.046455e+09
data. info( )
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000209 entries, 0 to 1000208
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 UserID 1000209 non-null int64
1 Gender 1000209 non-null object
2 Age 1000209 non-null int64
3 Occupation 1000209 non-null int64
4 Zip-code 1000209 non-null object
5 MovieID 1000209 non-null int64
6 Rating 1000209 non-null int64
7 Timestamp 1000209 non-null int64
8 Title 1000209 non-null object
9 Genres 1000209 non-null object
dtypes: int64(6), object(4)
memory usage: 83.9+ MB
Inspecting the Data
data[data.UserID == 1].head()
      UserID Gender  Age  Occupation Zip-code  MovieID  Rating  Timestamp                                   Title                        Genres
0          1      F    1          10    48067     1193       5  978300760  One Flew Over the Cuckoo's Nest (1975)                         Drama
1725       1      F    1          10    48067      661       3  978302109        James and the Giant Peach (1996)  Animation|Children's|Musical
2250       1      F    1          10    48067      914       3  978301968                     My Fair Lady (1964)               Musical|Romance
2886       1      F    1          10    48067     3408       4  978300275                  Erin Brockovich (2000)                         Drama
4201       1      F    1          10    48067     2355       5  978824291                    Bug's Life, A (1998)   Animation|Children's|Comedy
# The ten zip codes with the most ratings
r = data['Zip-code'].value_counts()
r = r.sort_values(ascending=False).head(10)
r.plot(kind='bar')
plt.xticks(rotation=45)
plt.show()
# Number of ratings per movie
data_rating_num = data.groupby('Title').size()
data_rating_num.head(10)
Title
$1,000,000 Duck (1971) 37
'Night Mother (1986) 70
'Til There Was You (1997) 52
'burbs, The (1989) 303
...And Justice for All (1979) 199
1-900 (1994) 2
10 Things I Hate About You (1999) 700
101 Dalmatians (1961) 565
101 Dalmatians (1996) 364
12 Angry Men (1957) 616
dtype: int64
data_rating_num_sorted = data_rating_num.sort_values(ascending=False)
# Keep movies with more than 300 and fewer than 400 ratings
data_rating_num_sorted = data_rating_num_sorted[(data_rating_num_sorted > 300) & (data_rating_num_sorted < 400)]
data_rating_num_sorted
Title
Yellow Submarine (1968) 399
Anaconda (1997) 399
Snow Falling on Cedars (1999) 398
His Girl Friday (1940) 397
First Blood (1982) 397
...
Godzilla (Gojira) (1954) 301
Rambo III (1988) 301
Zero Effect (1998) 301
Short Cuts (1993) 301
Old Yeller (1957) 301
Length: 256, dtype: int64
Compute each movie's average rating by gender, calculate the absolute difference between the two, and then sort
data_gender = data.pivot_table(index='Title', columns='Gender', values='Rating', aggfunc='mean')
data_gender = data_gender.loc[data_rating_num_sorted.index]
data_gender.head()
Gender                                F         M
Title
Yellow Submarine (1968)        3.714286  3.689286
Anaconda (1997)                2.000000  2.248447
Snow Falling on Cedars (1999)  3.482014  3.374517
His Girl Friday (1940)         4.312500  4.213439
First Blood (1982)             3.285714  3.599448
data_gender['diff'] = np.fabs(data_gender.F - data_gender.M)
data_gender.head()
Gender                                F         M      diff
Title
Yellow Submarine (1968)        3.714286  3.689286  0.025000
Anaconda (1997)                2.000000  2.248447  0.248447
Snow Falling on Cedars (1999)  3.482014  3.374517  0.107497
His Girl Friday (1940)         4.312500  4.213439  0.099061
First Blood (1982)             3.285714  3.599448  0.313733
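The same steps can be followed by hand on a tiny table. This sketch uses five made-up ratings (not MovieLens data) to show how pivot_table spreads Gender into columns and how the absolute difference is then computed and sorted.

```python
import pandas as pd

# Hypothetical ratings, used only for illustration
toy = pd.DataFrame({
    'Title': ['Movie A', 'Movie A', 'Movie A', 'Movie B', 'Movie B'],
    'Gender': ['F', 'M', 'M', 'F', 'M'],
    'Rating': [5, 3, 4, 2, 4],
})

# One row per title, one column per gender, cells hold the mean rating
by_gender = toy.pivot_table(index='Title', columns='Gender',
                            values='Rating', aggfunc='mean')

# Absolute disagreement between the two groups
by_gender['diff'] = (by_gender['F'] - by_gender['M']).abs()

# Most divisive titles first
ranked = by_gender.sort_values(by='diff', ascending=False)

print(ranked)
```

Here Movie A averages 5.0 (F) against 3.5 (M) and Movie B 2.0 against 4.0, so Movie B ranks first with a difference of 2.0.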
data_gender_sorted = data_gender.sort_values(by='diff', ascending=False)
# .copy() gives an independent DataFrame, so adding a column below does not trigger SettingWithCopyWarning
data_gender_sorted_top10 = data_gender_sorted.head(10).copy()
data_gender_sorted_top10
Gender                                     F         M      diff
Title
Kentucky Fried Movie, The (1977)    2.878788  3.555147  0.676359
Jumpin' Jack Flash (1986)           3.254717  2.578358  0.676359
Longest Day, The (1962)             3.411765  4.031447  0.619682
Cable Guy, The (1996)               2.250000  2.863787  0.613787
For a Few Dollars More (1965)       3.409091  3.953795  0.544704
Porky's (1981)                      2.296875  2.836364  0.539489
Fright Night (1985)                 2.973684  3.500000  0.526316
Anastasia (1997)                    3.800000  3.281609  0.518391
French Kiss (1995)                  3.535714  3.056962  0.478752
Little Shop of Horrors, The (1960)  3.650000  3.179688  0.470312
# Look up the genres of these ten titles in the movies table
genres = movies.set_index(movies['Title']).loc[data_gender_sorted_top10.index].Genres
data_gender_sorted_top10['Genres'] = genres
data_gender_sorted_top10
Gender                                     F         M      diff                          Genres
Title
Kentucky Fried Movie, The (1977)    2.878788  3.555147  0.676359                          Comedy
Jumpin' Jack Flash (1986)           3.254717  2.578358  0.676359  Action|Comedy|Romance|Thriller
Longest Day, The (1962)             3.411765  4.031447  0.619682                Action|Drama|War
Cable Guy, The (1996)               2.250000  2.863787  0.613787                          Comedy
For a Few Dollars More (1965)       3.409091  3.953795  0.544704                         Western
Porky's (1981)                      2.296875  2.836364  0.539489                          Comedy
Fright Night (1985)                 2.973684  3.500000  0.526316                   Comedy|Horror
Anastasia (1997)                    3.800000  3.281609  0.518391    Animation|Children's|Musical
French Kiss (1995)                  3.535714  3.056962  0.478752                  Comedy|Romance
Little Shop of Horrors, The (1960)  3.650000  3.179688  0.470312                   Comedy|Horror
算出每部电影平均得分并对其进行排序
data_rating_num = data_rating_num[ data_rating_num> 100 ]
mask = data[ 'Title' ] . apply ( lambda x: True if x in data_rating_num. index else False )
data_mean_rating = data[ mask] . pivot_table( index= 'Title' , values= [ 'Rating' ] )
data_mean_rating
                                      Rating
Title
'burbs, The (1989)                  2.910891
...And Justice for All (1979)       3.713568
10 Things I Hate About You (1999)   3.422857
101 Dalmatians (1961)               3.596460
101 Dalmatians (1996)               3.046703
...                                      ...
Young Guns II (1990)                2.907859
Young Sherlock Holmes (1985)        3.390501
Your Friends and Neighbors (1998)   3.376147
Zero Effect (1998)                  3.750831
eXistenZ (1999)                     3.256098
2006 rows × 1 columns
data_mean_rating_sorted = data_mean_rating.sort_values(by='Rating', ascending=False)
data_mean_rating_sorted.head()
                                                                        Rating
Title
Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)  4.560510
Shawshank Redemption, The (1994)                                     4.554558
Godfather, The (1972)                                                4.524966
Close Shave, A (1995)                                                4.520548
Usual Suspects, The (1995)                                           4.517106
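The filter-then-average step can also be written with groupby. The sketch below is a minimal illustration on a made-up ratings table (the titles, ratings and the threshold of 1 are assumptions chosen to keep the example small; the notebook uses a threshold of 100):

```python
import pandas as pd

# Hypothetical toy ratings; column names mirror the notebook's Title/Rating.
ratings = pd.DataFrame({
    'Title': ['A', 'A', 'A', 'B', 'B', 'C'],
    'Rating': [5, 4, 3, 2, 4, 5],
})

# Number of ratings per title, then keep titles above the threshold.
rating_num = ratings.groupby('Title').size()
popular = rating_num[rating_num > 1].index

# Mean rating of the remaining titles, highest first.
mean_rating = (
    ratings[ratings['Title'].isin(popular)]
    .groupby('Title')['Rating']
    .mean()
    .sort_values(ascending=False)
)
print(mean_rating)
```

Title C has a single rating, so it is excluded before the mean is taken, which is the same effect the mask has in the notebook.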
Top 20 movies by number of ratings
hot_movies_sorted = data_rating_num.sort_values(ascending=False)
hot_movies_sorted[:20]
Title
American Beauty (1999) 3428
Star Wars: Episode IV - A New Hope (1977) 2991
Star Wars: Episode V - The Empire Strikes Back (1980) 2990
Star Wars: Episode VI - Return of the Jedi (1983) 2883
Jurassic Park (1993) 2672
Saving Private Ryan (1998) 2653
Terminator 2: Judgment Day (1991) 2649
Matrix, The (1999) 2590
Back to the Future (1985) 2583
Silence of the Lambs, The (1991) 2578
Men in Black (1997) 2538
Raiders of the Lost Ark (1981) 2514
Fargo (1996) 2513
Sixth Sense, The (1999) 2459
Braveheart (1995) 2443
Shakespeare in Love (1998) 2369
Princess Bride, The (1987) 2318
Schindler's List (1993) 2304
L.A. Confidential (1997) 2288
Groundhog Day (1993) 2278
dtype: int64
Distribution of user ages, visualized as a histogram
import matplotlib.pyplot as plt
users.Age.plot.hist(bins=10, edgecolor='white')
plt.title('users_ages')
plt.xlabel('age')
plt.ylabel('count of age')
# 11 evenly spaced ticks mark the edges of the 10 bins
xticks = np.linspace(users.Age.min(), users.Age.max(), 11)
plt.xticks(xticks)
plt.show()
Age-group distribution of users, intended as one bin per 10 years
# bins=10 gives 10 equal-width bins, which are 10-year intervals only if the ages span 100 years
data['Age'].plot(kind='hist', bins=10)
plt.xticks(rotation=45)
plt.show()
Count how often each genre occurs in the dataset
# Split 'A|B|C' strings into one column per genre, then stack them into a single column
df = pd.DataFrame(movies.Genres.str.split('|').tolist())
df = df.stack().reset_index()
df = df.drop(['level_0', 'level_1'], axis=1)
genres = df.groupby(0).size()
genres.sort_values(ascending=False).plot(kind='bar')
plt.xticks(rotation=45)
plt.show()
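The same count can be obtained more compactly with Series.explode, which turns each list element into its own row. A minimal sketch on a hypothetical three-movie table:

```python
import pandas as pd

# Hypothetical movies table with pipe-separated genres, as in the notebook.
movies = pd.DataFrame({
    'Title': ['M1', 'M2', 'M3'],
    'Genres': ['Comedy', 'Action|Comedy', 'Action|Drama|War'],
})

# split() yields a list per movie; explode() gives one row per genre.
genre_counts = movies['Genres'].str.split('|').explode().value_counts()
print(genre_counts)
```

value_counts already sorts in descending order, so the result can be plotted directly with `genre_counts.plot(kind='bar')`.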
Project 4: Analysis and visualization of second-hand housing listings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Use a font with Chinese glyphs so that plot titles render, and keep the minus sign displayable
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
house = pd.read_csv('house.csv')
house.head(1)
index title community years housetype square floor taxtype totalPrice unitPrice followInfo 0 0 宝星华庭一层带花园,客厅挑高,通透四居室。房主自荐 宝星国际三期 底层(共22层)2010年建板塔结合 4室1厅 298.79平米 底层(共22层)2010年建板塔结合 距离15号线望京东站680米房本满五年 2598 86951 53人关注 / 共44次带看 / 一年前发布
Descriptive analysis
house.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16108 entries, 0 to 16107
Data columns (total 11 columns):
index 16108 non-null int64
title 16108 non-null object
community 16108 non-null object
years 16106 non-null object
housetype 16108 non-null object
square 16108 non-null object
floor 16106 non-null object
taxtype 15361 non-null object
totalPrice 16108 non-null int64
unitPrice 16108 non-null int64
followInfo 16108 non-null object
dtypes: int64(3), object(8)
memory usage: 1.4+ MB
community= pd. read_csv( 'community_describe.csv' )
community. head( )
index id community district bizcircle tagList onsale 0 0 1111000004310 什坊院甲3号院 海淀 田村 NaN 0 1 1 1111027373682 大慧寺6号院 海淀 白石桥 NaN 2 2 2 1111027373683 东花市北里东区 东城 东花市 近地铁1号线王府井站 0 3 3 1111027373684 东花市北里西区 东城 东花市 近地铁7号线广渠门内站 7 4 4 1111027373685 东花市北里中区 东城 东花市 近地铁2号线朝阳门站 9
house_detail= pd. merge( house, community, on= 'community' )
house_detail. head( 1 )
index_x title community years housetype square floor taxtype totalPrice unitPrice followInfo index_y id district bizcircle tagList onsale 0 0 宝星华庭一层带花园,客厅挑高,通透四居室。房主自荐 宝星国际三期 底层(共22层)2010年建板塔结合 4室1厅 298.79平米 底层(共22层)2010年建板塔结合 距离15号线望京东站680米房本满五年 2598 86951 53人关注 / 共44次带看 / 一年前发布 1535 1111027376204 朝阳 望京 近地铁15号线望京东站 7
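pd.merge defaults to an inner join, so listings whose community has no match are dropped silently. The sketch below, on two small made-up tables, shows how `how='left'`, `indicator=True` and `validate` can expose that; the data and district spellings are assumptions for illustration only:

```python
import pandas as pd

# Hypothetical miniature versions of the two tables.
house = pd.DataFrame({
    'community': ['X', 'X', 'Y', 'Z'],
    'totalPrice': [300, 320, 500, 410],
})
community = pd.DataFrame({
    'community': ['X', 'Y'],
    'district': ['Chaoyang', 'Haidian'],
})

# Inner join (the default): the listing in community Z disappears.
inner = pd.merge(house, community, on='community')

# Left join keeps every listing; _merge records where each row came from,
# and validate raises if a community name appears twice on the right side.
merged = pd.merge(house, community, on='community', how='left',
                  validate='many_to_one', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(len(inner), len(merged), unmatched['community'].tolist())
```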
Summary of numeric columns
house.describe()
              index    totalPrice      unitPrice
count  16108.000000  16108.000000   16108.000000
mean    8053.500000    747.983735   77656.823814
std     4650.123403    536.202306   23616.114546
min        0.000000     15.000000    2539.000000
25%     4026.750000    439.000000   60449.500000
50%     8053.500000    600.000000   75094.000000
75%    12080.250000    870.000000   91474.250000
max    16107.000000  12500.000000  159991.000000
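Preprocessing 1: extract numeric values from strings
# The parameter is named str (shadowing the built-in) because later cells pass it by keyword
def data_ad(select_data, str):
    # Return the number that precedes the marker text, or None if the marker is absent
    if str in select_data:
        return float(select_data[0:select_data.find(str)])
    else:
        return None
house['square'] = house['square'].apply(data_ad, str='平米')
house.head(1)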
index title community years housetype square floor taxtype totalPrice unitPrice followInfo 0 0 宝星华庭一层带花园,客厅挑高,通透四居室。房主自荐 宝星国际三期 底层(共22层)2010年建板塔结合 4室1厅 298.79 底层(共22层)2010年建板塔结合 距离15号线望京东站680米房本满五年 2598 86951 53人关注 / 共44次带看 / 一年前发布
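A vectorized alternative to the row-wise helper is Series.str.extract with a regular expression. This is a sketch under the assumption that the area always appears as a decimal number directly before 平米; the third sample value is made up to show the no-match case:

```python
import pandas as pd

# Two values in the notebook's format plus one hypothetical non-matching entry.
square = pd.Series(['298.79平米', '154.62平米', '车位'])

# Capture the digits and dot before the unit; non-matching rows become NaN.
area = square.str.extract(r'([\d.]+)平米', expand=False).astype(float)
print(area)
```

Rows without the unit end up as NaN instead of None, which keeps the column numeric.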
house. describe( )
index square totalPrice unitPrice attention count 16026.000000 16026.000000 16026.000000 16026.000000 16026.000000 mean 8061.290715 95.997246 743.136029 77796.268876 58.154936 std 4648.836720 57.606275 510.155956 23441.070459 68.642351 min 0.000000 15.290000 40.000000 11393.000000 0.000000 25% 4037.250000 61.110000 440.000000 60589.250000 17.000000 50% 8066.500000 81.200000 599.000000 75184.500000 37.000000 75% 12085.750000 112.757500 870.000000 91516.000000 73.000000 max 16107.000000 2623.280000 12000.000000 159991.000000 1401.000000
house[ house[ 'square' ] < 16 ]
index title community years housetype square floor taxtype totalPrice unitPrice followInfo attention layer year 15260 15260 智德北巷(北河沿大街)+小户型一居+南向 智德北巷 中楼层(共6层)1985年建板楼 1室0厅 15.29 中楼层(共6层)1985年建板楼 距离5号线灯市口站1113米 220 143885 56人关注 / 共2次带看 / 8天以前发布 56.0 中楼层 1985年
Floor-plan types (housetype)
house.housetype.value_counts()
2室1厅 6582
3室1厅 2534
1室1厅 2472
3室2厅 1424
2室2厅 1018
1室0厅 620
4室2厅 496
4室1厅 181
2房间1卫 100
5室2厅 92
1房间1卫 87
1室2厅 64
4室3厅 55
3房间1卫 44
3室0厅 35
2室0厅 34
车位 32
6室2厅 29
5室3厅 22
联排别墅 19
1房间0卫 16
5室1厅 15
6室3厅 13
独栋别墅 12
3室3厅 11
4室0厅 10
叠拼别墅 10
双拼别墅 9
4房间2卫 9
4房间1卫 6
2房间2卫 6
6室1厅 5
5室4厅 4
6室4厅 3
7室3厅 3
5室5厅 3
3房间2卫 3
5房间3卫 2
2室3厅 2
9室4厅 2
6房间4卫 2
2房间0卫 2
6房间2卫 2
3房间3卫 2
4房间3卫 2
7室2厅 2
8室2厅 1
5室0厅 1
6室0厅 1
2房间3卫 1
4室4厅 1
5房间2卫 1
7室0厅 1
8房间5卫 1
3室4厅 1
8室4厅 1
6房间3卫 1
7室1厅 1
Name: housetype, dtype: int64
Preprocessing 2: remove parking-space listings
# Rows whose housetype contains '车位' are parking spaces, not homes
car = house[house.housetype.str.contains('车位')]
car.shape[0]
house.drop(car.index, inplace=True)
car.shape
(32, 11)
Analysis 1: the five most expensive villas
villa = house[house.housetype.str.contains('别墅')]
villa.shape[0]
villa.sort_values(by='totalPrice', ascending=False).head(5)
index title community years housetype square floor taxtype totalPrice unitPrice followInfo 8020 8020 香山清琴二期独栋别墅,毛坯房原始户型,花园1200平米 香山清琴 2层2007年建 独栋别墅 NaN 2层2007年建 房本满五年 12500 124681 45人关注 / 共7次带看 / 2个月以前发布 102 102 千尺独栋 北入户 红顶商人金融界入住社区 龙湖颐和原著 2层2010年建 独栋别墅 NaN 2层2010年建 距离4号线西苑站839米房本满五年 12000 112012 231人关注 / 共26次带看 / 一年前发布 2729 2729 临湖独栋别墅 花园半亩 观景湖面和绿化 满五年有车库房主自荐 紫玉山庄 3层2000年建 独栋别墅 NaN 3层2000年建 房本满五年 6000 148618 108人关注 / 共16次带看 / 5个月以前发布 3141 3141 银湖别墅 独栋 望京公园旁 五环里 封闭式社区 银湖别墅 3层1998年建 独栋别墅 NaN 3层1998年建 房本满五年 5000 130348 9人关注 / 共3次带看 / 5个月以前发布 4112 4112 首排别墅 位置好 全景小区绿化和人工湖 有车库 亚运新新家园朗月园一期 1层2003年建 联排别墅 NaN 1层2003年建 房本满五年 3800 82364 0人关注 / 共4次带看 / 4个月以前发布
Preprocessing 3: remove villa listings
house.drop(villa.index, inplace=True)
house.shape[0]
16026
Analysis 2: distribution of floor-plan types among the remaining listings
house.housetype.value_counts()
2室1厅 6582
3室1厅 2534
1室1厅 2472
3室2厅 1424
2室2厅 1018
1室0厅 620
4室2厅 496
4室1厅 181
2房间1卫 100
5室2厅 92
1房间1卫 87
1室2厅 64
4室3厅 55
3房间1卫 44
3室0厅 35
2室0厅 34
6室2厅 29
5室3厅 22
1房间0卫 16
5室1厅 15
6室3厅 13
3室3厅 11
4室0厅 10
4房间2卫 9
2房间2卫 6
4房间1卫 6
6室1厅 5
5室4厅 4
6室4厅 3
3房间2卫 3
7室3厅 3
5室5厅 3
5房间3卫 2
2室3厅 2
9室4厅 2
6房间4卫 2
2房间0卫 2
6房间2卫 2
3房间3卫 2
4房间3卫 2
7室2厅 2
8室2厅 1
5室0厅 1
6室0厅 1
2房间3卫 1
4室4厅 1
5房间2卫 1
7室0厅 1
8房间5卫 1
3室4厅 1
8室4厅 1
6房间3卫 1
7室1厅 1
Name: housetype, dtype: int64
house_type= house. housetype. value_counts( )
house_type. head( 10 ) . plot( kind= 'bar' , title= '户型数量分布' , rot= 30 )
<matplotlib.axes._subplots.AxesSubplot at 0x12dae357a90>
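Analysis 3: the five listings with the most followers
# followInfo looks like '53人关注 / 共44次带看 / ...'; the text before '人关注' is the follower count
house['attention'] = house['followInfo'].apply(data_ad, str='人关注')
house.head(5)
house.sort_values(by='attention', ascending=False).head()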
index title community years housetype square floor taxtype totalPrice unitPrice followInfo attention 47 47 弘善家园南向开间,满两年,免增值税 弘善家园 中楼层(共28层)2009年建塔楼 1室0厅 42.64 中楼层(共28层)2009年建塔楼 距离10号线十里河站698米房本满两年随时看房 265 62149 1401人关注 / 共305次带看 / 一年前发布 1401.0 2313 2313 四惠东 康家园 南向一居室 地铁1号线出行房主自荐 康家园 顶层(共6层)1995年建板楼 1室1厅 41.97 顶层(共6层)1995年建板楼 距离1号线四惠东站974米房本满五年随时看房 262 62426 1005人关注 / 共86次带看 / 6个月以前发布 1005.0 990 990 远见名苑 东南两居 满五年家庭唯一住房 诚心出售房主自荐 远见名苑 中楼层(共24层)2004年建塔楼 2室1厅 90.14 中楼层(共24层)2004年建塔楼 距离7号线达官营站516米房本满五年 811 89972 979人关注 / 共50次带看 / 8个月以前发布 979.0 2331 2331 荣丰二期朝南复式无遮挡全天采光房主自荐 荣丰2008 中楼层(共10层)2005年建塔楼 1室1厅 32.54 中楼层(共10层)2005年建塔楼 距离7号线达官营站1028米房本满五年随时看房 400 122926 972人关注 / 共369次带看 / 6个月以前发布 972.0 915 915 通州万达北苑地铁站 天时名苑 大两居可改3居 天时名苑 顶层(共9层)2009年建板塔结合 2室2厅 121.30 顶层(共9层)2009年建板塔结合 距离八通线通州北苑站602米房本满五年 645 53174 894人关注 / 共228次带看 / 8个月以前发布 894.0
Analysis 4: floor-plan type versus follower count
# Group by the housetype Series so that the column itself can still be counted in agg()
type_interest_group = house.groupby(house['housetype']).agg({'housetype': 'count', 'attention': 'sum'})
# Keep only floor-plan types with more than 50 listings
interest_sort = type_interest_group[type_interest_group['housetype'] > 50]
interest_sort.plot(kind='barh', title='二手房户型和关注人数分布', y='attention')
interest_sort
           housetype  attention
housetype
1室0厅            620    32920.0
1室1厅           2472   141893.0
1室2厅             64     2614.0
1房间1卫           87     2267.0
2室1厅           6582   394987.0
2室2厅           1018    49526.0
2房间1卫          100     3006.0
3室1厅           2534   162205.0
3室2厅           1424    81140.0
4室1厅            181    10667.0
4室2厅            496    30661.0
4室3厅             55     2846.0
5室2厅             92     4703.0
Analysis 5: distribution of floor area
# Bin edges in square meters; each label names the interval between two adjacent edges
area_level = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500]
label_level = ['小于50', '50-100', '100-150', '150-200', '200-250', '250-300', '300-350', '350-400', '400-450', '450-500']
area_cut = pd.cut(house['square'], bins=area_level, labels=label_level)
area_cut.value_counts()[::-1].plot(kind='barh', title='二手房面积分布', fontsize='small')
<matplotlib.axes._subplots.AxesSubplot at 0x12dae064a90>
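pd.cut assigns NaN to values outside the outermost edges, so areas above 500 square meters (the describe output above reports a maximum of 2623.28) fall out of this chart. A short sketch with made-up areas shows the effect and one possible remedy, an open-ended last bin:

```python
import numpy as np
import pandas as pd

# Hypothetical areas; the last one exceeds the largest bin edge.
square = pd.Series([42.6, 81.2, 298.8, 2623.28])
edges = [0, 50, 100, 150, 200, 250, 300, 350, 400, 450, 500]

# With closed edges, 2623.28 belongs to no bin and becomes NaN.
capped = pd.cut(square, bins=edges)

# Appending infinity adds a catch-all bin (500, inf].
open_cut = pd.cut(square, bins=edges + [np.inf])
print(capped.isna().sum(), open_cut.isna().sum())
```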
Analysis 6: mean unit price by district
house_unitPrice = house_detail.groupby('district')['unitPrice'].mean()
house_unitPrice.plot(kind='barh', title='各个行政区房源均价')
<matplotlib.axes._subplots.AxesSubplot at 0x12dae19c7f0>
Box plot of unit price by district
import seaborn as sns
price = house_detail[['district', 'unitPrice']]
price.boxplot(by='district', grid=False)
<matplotlib.axes._subplots.AxesSubplot at 0x12dae1ede80>
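Number of listings on sale by district
# count() tallies the rows with a non-null onsale value in each district; it does not add up the onsale values
house_onsale = house_detail.groupby('district')['onsale'].count()
house_onsale.plot(kind='bar', rot=30, title='各个行政区房源在售数量')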
<matplotlib.axes._subplots.AxesSubplot at 0x12dae3c9fd0>
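Analysis 7: total price compared across districts
price = house_detail[['district', 'totalPrice']]
sns.boxplot(x='district', y='totalPrice', data=price)
# Limit the y-axis to 0-6000 (units of 10,000 yuan) so that the boxes stay readable
plt.ylim((0, 6000))
(0, 6000)
The box plot shows that the median total price in every district is below 10 million yuan (1000 in the data's unit of 10,000 yuan), and that total prices are widely dispersed.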
Analysis 8: rank areas (bizcircle) by mean price per square meter and plot a bar chart
bizcircle_unitPrice = house_detail.groupby('bizcircle')['unitPrice'].mean().sort_values(ascending=False)
bizcircle_unitPrice.head(15).plot(kind='bar', title='各个区域均价分布', rot=30)
plt.legend(['均价'])
<matplotlib.legend.Legend at 0x12daceabc50>
Analysis 9: rank residential communities by mean unit price
community_unitPrice = house_detail.groupby('community')['unitPrice'].mean().sort_values(ascending=False)
community_unitPrice.head(10).plot(kind='bar', title='各个小区均价分布', rot=30)
plt.legend(['均价'])
<matplotlib.legend.Legend at 0x12dadf03898>
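Analysis 10: distribution of floor levels
def data_ads(select_data, str):
    # Return the text before the marker (the floor level), or a placeholder
    # ('no floor information extracted') if the marker is absent
    if str in select_data:
        return select_data[0:select_data.find(str)]
    else:
        return '没有提取到楼层信息'
house['layer'] = house['years'].apply(data_ads, str='(')
house.head(3)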
index title community years housetype square floor taxtype totalPrice unitPrice followInfo attention layer 0 0 宝星华庭一层带花园,客厅挑高,通透四居室。房主自荐 宝星国际三期 底层(共22层)2010年建板塔结合 4室1厅 298.79 底层(共22层)2010年建板塔结合 距离15号线望京东站680米房本满五年 2598 86951 53人关注 / 共44次带看 / 一年前发布 53.0 底层 1 1 三面采光全明南北朝向 正对小区绿地花园 顶秀青溪 中楼层(共11层)2008年建板塔结合 3室2厅 154.62 中楼层(共11层)2008年建板塔结合 距离5号线立水桥站1170米房本满两年随时看房 1000 64675 323人关注 / 共579次带看 / 一年前发布 323.0 中楼层 2 2 沁园公寓 三居室 距离苏州街地铁站383米 沁园公寓 低楼层(共24层)1999年建塔楼 3室2厅 177.36 低楼层(共24层)1999年建塔楼 距离10号线苏州街站383米房本满五年 1200 67659 185人关注 / 共108次带看 / 一年前发布 185.0 低楼层
house[ 'layer' ] . value_counts( ) . plot( kind= 'bar' , rot= 30 )
<matplotlib.axes._subplots.AxesSubplot at 0x12dae010da0>
Analysis 11: mean total price by construction year, 2000 to 2016
def data_adst(select_data, str):
    # Return the 5 characters before the marker, e.g. '2010年' from '...2010年建...'
    if str in select_data:
        return select_data[select_data.find(str) - 5:select_data.find(str)]
    else:
        return None
house['year'] = house['years'].apply(data_adst, str='建')
house.head(4)
index title community years housetype square floor taxtype totalPrice unitPrice followInfo attention layer year 0 0 宝星华庭一层带花园,客厅挑高,通透四居室。房主自荐 宝星国际三期 底层(共22层)2010年建板塔结合 4室1厅 298.79 底层(共22层)2010年建板塔结合 距离15号线望京东站680米房本满五年 2598 86951 53人关注 / 共44次带看 / 一年前发布 53.0 底层 2010年 1 1 三面采光全明南北朝向 正对小区绿地花园 顶秀青溪 中楼层(共11层)2008年建板塔结合 3室2厅 154.62 中楼层(共11层)2008年建板塔结合 距离5号线立水桥站1170米房本满两年随时看房 1000 64675 323人关注 / 共579次带看 / 一年前发布 323.0 中楼层 2008年 2 2 沁园公寓 三居室 距离苏州街地铁站383米 沁园公寓 低楼层(共24层)1999年建塔楼 3室2厅 177.36 低楼层(共24层)1999年建塔楼 距离10号线苏州街站383米房本满五年 1200 67659 185人关注 / 共108次带看 / 一年前发布 185.0 低楼层 1999年 3 3 金星园东南向户型,四居室设计,中间楼层 金星园 中楼层(共28层)2007年建塔楼 4室2厅 245.52 中楼层(共28层)2007年建塔楼 距离机场线三元桥站1153米房本满五年 1650 67205 157人关注 / 共35次带看 / 一年前发布 157.0 中楼层 2007年
data = house.groupby('year').agg({'totalPrice': 'mean'})
data
# Label slice on the sorted string index; both endpoints are included
data['2000年':'2016年'].plot(kind='bar', rot=30)
<matplotlib.axes._subplots.AxesSubplot at 0x12dae301f98>
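Slicing by string labels such as '2000年' works only because the strings happen to sort like numbers. A sketch of an alternative, assuming the year always appears as four digits before 年建, extracts it as an integer first; the `years` strings follow the notebook's format, while the last row and its price are made up:

```python
import pandas as pd

# Sample rows in the notebook's format; the fourth row is hypothetical.
house = pd.DataFrame({
    'years': ['底层(共22层)2010年建板塔结合',
              '中楼层(共11层)2008年建板塔结合',
              '低楼层(共24层)1999年建塔楼',
              '2层2010年建'],
    'totalPrice': [2598, 1000, 1200, 402],
})

# Four digits followed by 年建 give the construction year as a number.
year_text = house['years'].str.extract(r'(\d{4})年建', expand=False)
house['year'] = pd.to_numeric(year_text)

# Mean total price per year, restricted to 2000-2016 (both ends included).
mean_price = house.groupby('year')['totalPrice'].mean()
recent = mean_price.loc[2000:2016]
print(recent)
```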
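Case study: listings near Wangjing subway station with 3 bedrooms and 1 living room, priced between 4 and 5 million yuan, larger than 80 square meters
Step 1: find the listings in the Wangjing area
myhouse = house_detail[house_detail.bizcircle.str.contains('望京')]
len(myhouse)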
896
Step 2: look at the distribution of floor-plan types
house_type = myhouse['housetype'].value_counts()
house_type.head(10).plot(kind='bar', title='户型数量分布', rot=30)
house_type.head(10)
2室1厅 230
3室2厅 155
2室2厅 134
1室1厅 117
3室1厅 108
4室2厅 55
1室0厅 25
4室1厅 25
2房间1卫 13
5室2厅 8
Name: housetype, dtype: int64
Step 3: keep 3室1厅 listings priced between 4 and 5 million yuan and larger than 80 square meters
myhouse = myhouse[myhouse.housetype.str.contains('3室1厅')]
len(myhouse)
108
# totalPrice is in units of 10,000 yuan; copy() makes the later column assignment safe
myhouse = myhouse.loc[(myhouse['totalPrice'] > 400) & (myhouse['totalPrice'] < 500)].copy()
myhouse.head()
len(myhouse)
7
# house_detail was merged before square was converted, so convert it here
myhouse['square'] = myhouse['square'].apply(data_ad, str='平米')
myhouse = myhouse.loc[myhouse.square > 80]
len(myhouse)
myhouse.head()
index_x title community years housetype square floor taxtype totalPrice unitPrice followInfo index_y id district bizcircle tagList onsale 7824 2806 花家地西里一区东西向三居室 中间楼层 带电梯房主自荐 花家地西里一区 中楼层(共12层)1997年建板塔结合 3室1厅 82.10 中楼层(共12层)1997年建板塔结合 房本满五年随时看房 480 58466 245人关注 / 共75次带看 / 5个月以前发布 820 1111027375067 朝阳 望京 近地铁14号线(东段)阜通站 8 14022 8669 经典三居室格局合理社区安静配套成熟房主自荐 中环南路5号院 顶层(共6层)1996年建板楼 3室1厅 88.51 顶层(共6层)1996年建板楼 距离14号线(东段)望京南站701米房本满两年 495 55926 35人关注 / 共0次带看 / 2个月以前发布 4718 1111027382477 朝阳 望京 近地铁14号线(东段)望京南站 1
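The three steps can also be combined into one boolean mask, which avoids reassigning the frame repeatedly. The following is a minimal sketch on a made-up four-row table that reuses the notebook's column names; the rows themselves are assumptions:

```python
import pandas as pd

# Hypothetical listings with the same columns the filter relies on.
listings = pd.DataFrame({
    'bizcircle': ['望京', '望京', '望京', '田村'],
    'housetype': ['3室1厅', '3室1厅', '2室1厅', '3室1厅'],
    'totalPrice': [480, 520, 450, 430],
    'square': ['82.10平米', '95.00平米', '88.00平米', '90.00平米'],
})

# Strip the unit so that the area can be compared numerically.
area = listings['square'].str.replace('平米', '', regex=False).astype(float)

# All four conditions must hold; each comparison is wrapped in parentheses.
mask = (
    listings['bizcircle'].str.contains('望京')
    & (listings['housetype'] == '3室1厅')
    & (listings['totalPrice'] > 400)
    & (listings['totalPrice'] < 500)
    & (area > 80)
)
result = listings[mask]
print(result)
```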
Project 5: Analysis and visualization of telecom customer churn
Predicting mobile customer churn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette('muted')
# read_excel takes no encoding argument in current pandas versions
df = pd.read_excel('CustomerSurvival.xlsx')
df.head()
ID 套餐金额 额外通话时长 额外流量 改变行为 服务合约 关联购买 集团用户 使用月数 流失用户 0 1 1 792.833333 -10.450067 0 0 0 0 25 0 1 2 1 121.666667 -21.141117 0 0 0 0 25 0 2 3 1 -30.000000 -25.655273 0 0 0 0 2 1 3 4 1 241.500000 -288.341254 0 1 0 1 25 0 4 5 1 1629.666667 -23.655505 0 0 0 1 25 0
df. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4975 entries, 0 to 4974
Data columns (total 10 columns):
ID 4975 non-null int64
套餐金额 4975 non-null int64
额外通话时长 4975 non-null float64
额外流量 4975 non-null float64
改变行为 4975 non-null int64
服务合约 4975 non-null int64
关联购买 4975 non-null int64
集团用户 4975 non-null int64
使用月数 4975 non-null int64
流失用户 4975 non-null int64
dtypes: float64(2), int64(8)
memory usage: 388.8 KB
df. columns = [ 'id' , 'pack_type' , 'extra_time' , 'extra_flow' , 'pack_change' ,
'contract' , 'asso_pur' , 'group_user' , 'use_month' , 'loss' ]
df. info( )
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4975 entries, 0 to 4974
Data columns (total 10 columns):
id 4975 non-null int64
pack_type 4975 non-null int64
extra_time 4975 non-null float64
extra_flow 4975 non-null float64
pack_change 4975 non-null int64
contract 4975 non-null int64
asso_pur 4975 non-null int64
group_user 4975 non-null int64
use_month 4975 non-null int64
loss 4975 non-null int64
dtypes: float64(2), int64(8)
memory usage: 388.8 KB
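id – unique identifier of the user
pack_type – amount of the user's monthly plan: 1 = below 96 yuan, 2 = 96 to 225 yuan, 3 = above 225 yuan
extra_time – extra call time per month during the period of use, which the user pays for separately; the value is the monthly average, in minutes
extra_flow – extra data usage per month during the period of use, which the user pays for separately; the value is the monthly average, in megabytes
pack_change – whether the user has ever changed the plan amount: 1 = yes, 0 = no
contract – whether the user has signed a service contract with China Unicom: 1 = yes, 0 = no
asso_pur – whether the user also subscribed to other services while using China Unicom's mobile service: 1 = one other service, 2 = two other services, 0 = none
group_user – whether the user is on a group plan; compared with individual plans, numbers registered as a group get a discount on calls within the group: 1 = yes, 0 = no
use_month – how long the user had used the service by the end of the observation period (2012.1-2014.1), in months
loss – whether the user churned within the 25-month observation period: 1 = yes, 0 = no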
Exploratory data analysis
df.describe()
                id    pack_type    extra_time    extra_flow  pack_change     contract     asso_pur   group_user    use_month         loss
count  4975.000000  4975.000000   4975.000000   4975.000000  4975.000000  4975.000000  4975.000000  4975.000000  4975.000000  4975.000000
mean   2488.000000     1.057688    258.520030    -71.580403     0.021307     0.245226     0.047437     0.227337    14.774271     0.782714
std    1436.303125     0.258527    723.057190    275.557448     0.144419     0.430264     0.278143     0.419154     6.534273     0.412441
min       1.000000     1.000000  -2828.333333  -2189.875986     0.000000     0.000000     0.000000     0.000000     1.000000     0.000000
25%    1244.500000     1.000000   -126.666667    -74.289824     0.000000     0.000000     0.000000     0.000000    13.000000     1.000000
50%    2488.000000     1.000000     13.500000    -59.652734     0.000000     0.000000     0.000000     0.000000    13.000000     1.000000
75%    3731.500000     1.000000    338.658333    -25.795045     0.000000     0.000000     0.000000     0.000000    19.000000     1.000000
max    4975.000000     3.000000   4314.000000   2568.704293     1.000000     1.000000     2.000000     1.000000    25.000000     1.000000
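extra_time and extra_flow take both positive and negative values: a positive value is usage beyond the plan (extra call time or data), and a negative value is the unused call time or data left at the end of the month. The quartiles show that more than half of the users have extra call time (the median of extra_time is 13.5, above zero), whereas only a small share exceed their data allowance (the 75th percentile of extra_flow is still negative). The descriptive statistics of the remaining categorical variables show nothing unusual.
Pay particular attention to use_month. The observation window is 2012.1-2014.1, 25 months in total, and the case defines churn as follows: a user with no activity (calls or data) for more than one month is counted as churned. Almost every user with use_month below 25 has therefore already churned, so the variable largely restates the outcome instead of helping to predict it, and it is dropped before modeling.
2. Variable distributions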
plt. figure( figsize = ( 10 , 5 ) )
plt. subplot( 121 )
df. extra_time. hist( bins = 30 )
plt. subplot( 122 )
df. extra_flow. hist( bins = 30 )
<matplotlib.axes._subplots.AxesSubplot at 0x1594afe2cf8>
extra_time is right-skewed, while extra_flow is roughly normally distributed; both are broadly consistent with the descriptive statistics.
fig, axes = plt. subplots( nrows = 2 , ncols = 3 , figsize = ( 10 , 6 ) )
sns. countplot( x = 'pack_type' , data = df, ax= axes[ 0 , 0 ] )
sns. countplot( x = 'pack_change' , data = df, ax= axes[ 0 , 1 ] )
sns. countplot( x = 'contract' , data = df, ax= axes[ 0 , 2 ] )
sns. countplot( x = 'asso_pur' , data = df, ax= axes[ 1 , 0 ] )
sns. countplot( x = 'group_user' , data = df, ax= axes[ 1 , 1 ] )
sns. countplot( x = 'loss' , data = df, ax= axes[ 1 , 2 ] )
<matplotlib.axes._subplots.AxesSubplot at 0x1594b1aa240>
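The levels of pack_type, pack_change and asso_pur are highly imbalanced. For asso_pur, for example, very few users bought any add-on service, so those levels are poorly represented in the sample, which may affect the final model.
3. Relationship between the independent variables and the dependent variable:
# df.plot.scatter draws on a new figure of its own unless it is given an Axes, so create the axes first and pass them in
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(10, 6))
df.plot.scatter(x='extra_time', y='loss', ax=axes[0])
df.plot.scatter(x='extra_flow', y='loss', ax=axes[1])
<matplotlib.axes._subplots.AxesSubplot at 0x1594b23de80>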
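The scatter plots suggest no obvious relationship between either variable and churn. To show the relationship more clearly, we bin extra_time and extra_flow and draw bar charts: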
bin1 = [ - 3000 , - 2000 , - 500 , 0 , 500 , 2000 , 3000 , 5000 ]
df[ 'time_label' ] = pd. cut( df. extra_time, bins = bin1)
time_amount = df. groupby( 'time_label' ) . id . count( ) . sort_values( ) . reset_index( )
time_amount
time_amount[ 'amount_cumsum' ] = time_amount. id . cumsum( )
time_amount
       time_label    id  amount_cumsum
0  (-3000, -2000]     3              3
1   (-2000, -500]    15             18
2    (3000, 5000]    79             97
3    (2000, 3000]   129            226
4     (500, 2000]   755            981
5        (0, 500]  1634           2615
6       (-500, 0]  2360           4975
sns. countplot( x = 'time_label' , hue = 'loss' , data = df)
<matplotlib.axes._subplots.AxesSubplot at 0x1594b3a1d68>
bin2 = [ - 3000 , - 2000 , - 500 , 0 , 500 , 2000 , 3000 ]
df[ 'flow_label' ] = pd. cut( df. extra_flow, bins = bin2)
flow_amount = df. groupby( 'flow_label' ) . id . count( ) . sort_values( ) . reset_index( )
flow_amount[ 'amount_cumsum' ] = flow_amount. id . cumsum( )
flow_amount
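       flow_label    id  amount_cumsum
0    (2000, 3000]     1              1
1  (-3000, -2000]     3              4
2     (500, 2000]    79             83
3   (-2000, -500]   157            240
4        (0, 500]   827           1067
5       (-500, 0]  3908           4975
The counts for extra_flow show that the range (-500, 500] covers about 95% of users (4735 of 4975) and (-500, 0] alone about 79% (3908 of 4975), so only a small share of users exceed their monthly data allowance.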
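The shares quoted for extra_flow can be checked directly from the bin counts. The sketch below copies the counts from the flow_amount table above into a Series (the Series itself is an illustration, not part of the notebook):

```python
import pandas as pd

# Bin counts copied from the flow_amount table.
flow_counts = pd.Series({
    '(-3000, -2000]': 3,
    '(-2000, -500]': 157,
    '(-500, 0]': 3908,
    '(0, 500]': 827,
    '(500, 2000]': 79,
    '(2000, 3000]': 1,
})

# Share of all users that falls in each bin.
share = flow_counts / flow_counts.sum()
within_allowance = share['(-500, 0]']
near_zero = share['(-500, 0]'] + share['(0, 500]']
print(round(within_allowance, 3), round(near_zero, 3))
```

On the original frame, `df['flow_label'].value_counts(normalize=True)` would give the same shares without copying the counts by hand.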
sns. countplot( x = 'flow_label' , hue = 'loss' , data = df)
<matplotlib.axes._subplots.AxesSubplot at 0x1594b2abd30>
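The plots show that the more call time and data a user consumes, the lower the churn probability. These over-quota users are 'high-value users' with strong stickiness, so the operator should focus on them and take effective measures to prevent their churn.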
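Count plots show absolute numbers, so bins of very different sizes are hard to compare. Because loss is coded 0/1, its mean within each bin is the churn rate of that bin. A minimal sketch on made-up data (the eight rows are assumptions; the bin edges are the ones used for extra_time above):

```python
import pandas as pd

# Hypothetical users: extra call time and churn flag.
df = pd.DataFrame({
    'extra_time': [-600, -30, -10, 120, 240, 800, 1600, 2500],
    'loss':       [1, 1, 1, 1, 0, 0, 0, 0],
})

bins = [-3000, -2000, -500, 0, 500, 2000, 3000, 5000]
df['time_label'] = pd.cut(df['extra_time'], bins=bins)

# Mean of a 0/1 column = share of churned users; observed=True skips empty bins.
churn_rate = df.groupby('time_label', observed=True)['loss'].mean()
print(churn_rate)
```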
fig, axes = plt. subplots( nrows = 2 , ncols = 3 , figsize = ( 12 , 8 ) )
sns. countplot( x = 'pack_type' , hue = 'loss' , data = df, ax = axes[ 0 ] [ 0 ] )
sns. countplot( x = 'pack_change' , hue = 'loss' , data = df, ax = axes[ 0 ] [ 1 ] )
sns. countplot( x = 'contract' , hue = 'loss' , data = df, ax = axes[ 0 ] [ 2 ] )
sns. countplot( x = 'asso_pur' , hue = 'loss' , data = df, ax = axes[ 1 ] [ 0 ] )
sns. countplot( x = 'group_user' , hue = 'loss' , data = df, ax = axes[ 1 ] [ 1 ] )
<matplotlib.axes._subplots.AxesSubplot at 0x1594ad745c0>
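Preliminary conclusions:
1) The higher the plan amount, the less likely the user is to churn; users on expensive plans are also more loyal.
2) Users who have changed their plan are less likely to churn.
3) Users under contract churn at a lower rate. A contract also means that the user will normally not switch operator or number for some time (for example 2 or 3 years), so contract users are comparatively stable.
4) Too few users bought other add-on services to draw a conclusion; this is left for later study.
5) Group users churn at a much lower rate than individual users.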
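Statements about churn ratios per category can be backed by a row-normalized cross-tabulation. The sketch below uses pd.crosstab with `normalize='index'` on a made-up eight-row sample (the values are assumptions, only the column names come from the notebook):

```python
import pandas as pd

# Hypothetical sample: contract flag and churn flag.
df = pd.DataFrame({
    'contract': [0, 0, 0, 0, 1, 1, 1, 1],
    'loss':     [1, 1, 1, 0, 1, 0, 0, 0],
})

# Each row sums to 1: the share of retained (0) and churned (1) users per contract value.
rates = pd.crosstab(df['contract'], df['loss'], normalize='index')
print(rates)
```

The column labeled 1 is the churn rate for each contract value; the same call works for pack_type, pack_change, asso_pur and group_user.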
internal_chars = ['extra_time', 'extra_flow', 'pack_type',
                  'pack_change', 'contract', 'asso_pur', 'group_user', 'loss']
corrmat = df[internal_chars].corr()
f, ax = plt.subplots(figsize=(10, 7))
sns.heatmap(corrmat, square=False, linewidths=.5, annot=True, ax=ax)
# Set the tick rotation after drawing so that it applies to the heatmap's labels
plt.xticks(rotation=0)
<matplotlib.axes._subplots.AxesSubplot at 0x1594b332470>
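Pairwise correlations among the independent variables are low, which suggests that collinearity is not a concern. Among the correlations with the dependent variable, contract and group_user have larger coefficients than the other variables, but they are still not strong.
Modeling
Most of the independent variables are categorical, so a decision tree is a good fit; decision trees are also insensitive to outliers, and the result is easy to interpret.
Dependent variable: 'loss' (whether the user churned), which is the prediction target.
Independent variables fall into three groups:
Continuous: extra_time, extra_flow, use_month
Binary: pack_change, contract, group_user
Multi-class: pack_type, asso_pur
Based on the exploratory analysis above and on business understanding, we select the following features for the model: extra_time, extra_flow, pack_type, pack_change, asso_pur, contract and group_user, all of which have some influence on churn. The two continuous variables, extra_time and extra_flow, are converted to binary variables so that all features are on a common scale.
# 1 if the user exceeded the plan (value > 0), else 0
df['time_tranf'] = (df.extra_time > 0).astype(int)
df['flow_tranf'] = (df.extra_flow > 0).astype(int)
df.head()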
id pack_type extra_time extra_flow pack_change contract asso_pur group_user use_month loss time_label flow_label time_tranf flow_tranf 0 1 1 792.833333 -10.450067 0 0 0 0 25 0 (500, 2000] (-500, 0] 1 0 1 2 1 121.666667 -21.141117 0 0 0 0 25 0 (0, 500] (-500, 0] 1 0 2 3 1 -30.000000 -25.655273 0 0 0 0 2 1 (-500, 0] (-500, 0] 0 0 3 4 1 241.500000 -288.341254 0 1 0 1 25 0 (0, 500] (-500, 0] 1 0 4 5 1 1629.666667 -23.655505 0 0 0 1 25 0 (500, 2000] (-500, 0] 1 0
x = df. loc[ : , [ 'pack_type' , 'time_tranf' , 'flow_tranf' , 'pack_change' , 'contract' , 'asso_pur' , 'group_user' ] ]
x = np. array( x)
x
array([[1, 1, 0, ..., 0, 0, 0],
[1, 1, 0, ..., 0, 0, 0],
[1, 0, 0, ..., 0, 0, 0],
...,
[1, 1, 1, ..., 1, 0, 0],
[1, 1, 1, ..., 1, 0, 0],
[3, 0, 0, ..., 1, 0, 1]], dtype=int64)
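# Convert to a NumPy array before adding an axis; indexing a Series with [:, np.newaxis] fails in pandas 2.x
y = df.loss.to_numpy()
y = y[:, np.newaxis]
y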
array([[0],
[0],
[1],
...,
[0],
[1],
[0]], dtype=int64)
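from sklearn.model_selection import train_test_split
# Hold out 30% of the rows as a test set; random_state fixes the split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123)
from sklearn import tree
clf = tree.DecisionTreeClassifier(criterion='gini',
                                  splitter='best',
                                  max_depth=4,
                                  min_samples_split=10,
                                  min_samples_leaf=5
                                  )
clf = clf.fit(x_train, y_train)
Here the tree splits on Gini impurity (criterion='gini'). scikit-learn's trees implement an optimized version of CART, not ID3, and an entropy-based split would require criterion='entropy'. The maximum depth is 4, an internal node needs at least 10 samples to be split, and a leaf needs at least 5 samples.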
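To make the difference between the two split criteria concrete, the sketch below computes Gini impurity and entropy for a node from its class counts. The function names are illustrative; the counts [1081, 3894] are only approximately implied by the describe output above (mean of loss 0.782714 over 4975 rows) and are used as an assumption:

```python
import math

def gini(counts):
    """Gini impurity: 1 minus the sum of squared class shares."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy in bits; classes with zero count contribute nothing."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

# Assumed root node: about 1081 retained and 3894 churned users.
root = [1081, 3894]
print(gini(root), entropy(root))

# Both measures are 0 for a pure node and largest for an even split.
print(gini([10, 0]), gini([5, 5]), entropy([5, 5]))
```

A split is chosen to reduce the weighted impurity of the child nodes; the two criteria usually lead to similar trees.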
train_score = clf. score( x_train, y_train)
test_score = clf. score( x_test, y_test)
'train_score:{0},test_score:{1}' . format ( train_score, test_score)
'train_score:0.871338311315336,test_score:0.8640321500334897'
Parameter tuning
def cv_score(d):
    # Fit a tree of depth d and return (training score, held-out test score)
    clf2 = tree.DecisionTreeClassifier(max_depth=d)
    clf2 = clf2.fit(x_train, y_train)
    tr_score = clf2.score(x_train, y_train)
    te_score = clf2.score(x_test, y_test)
    return (tr_score, te_score)
depths = range(2, 15)
scores = [cv_score(d) for d in depths]
tr_scores = [s[0] for s in scores]
cv_scores = [s[1] for s in scores]
scores
# Pick the depth with the highest held-out score
best_score_index = np.argmax(cv_scores)
best_score = cv_scores[best_score_index]
best_param = depths[best_score_index]
best_param
4
plt. figure( figsize = ( 4 , 2 ) , dpi= 150 )
plt. grid( )
plt. xlabel( 'max_depth' )
plt. ylabel( 'best_score' )
plt. plot( depths, cv_scores, '.g-' , label = 'cross_validation scores' )
plt. plot( depths, tr_scores, '.r--' , label = 'train scores' )
plt. legend( )
<matplotlib.legend.Legend at 0x1594bb49518>
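The plot shows that at depth 4 the held-out score (labeled cross_validation scores in the legend) is close to the training score and both are high. Beyond depth 5 the gap widens and the held-out score falls, which indicates overfitting.
Model evaluation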
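The tuning loop above scores each depth on the single held-out test set. A k-fold alternative is scikit-learn's cross_val_score, which averages the score over several train/validation splits. The sketch below runs on synthetic binary features (the data, the noise level and the depth range are assumptions, not the notebook's dataset):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 7 binary features, label driven by two of them plus 10% noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(400, 7))
noise = (rng.random(400) < 0.1).astype(int)
y = (X[:, 4] | X[:, 6]) ^ noise

# Mean 5-fold accuracy for each candidate depth.
depths = range(2, 8)
cv_means = [
    cross_val_score(DecisionTreeClassifier(max_depth=d, random_state=0), X, y, cv=5).mean()
    for d in depths
]
best_depth = depths[int(np.argmax(cv_means))]
print(best_depth)
```

Selecting the depth this way leaves the test set untouched, so it can still give an unbiased estimate of the final model.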
from sklearn. metrics import classification_report
y_pre = clf. predict( x_test)
print ( classification_report( y_pre, y_test) )
              precision    recall  f1-score   support

           0       0.65      0.71      0.68       304
           1       0.92      0.90      0.91      1189

    accuracy                           0.86      1493
   macro avg       0.79      0.81      0.80      1493
weighted avg       0.87      0.86      0.87      1493
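Precision = TP/(TP+FP): among the users predicted to churn, the share who actually churned.
Recall = TP/(TP+FN): among the users who actually churned, the share predicted to churn.
The F1 score is the harmonic mean of precision and recall and serves as a combined measure of the two.
Note that classification_report expects its arguments in the order (y_true, y_pred), whereas the call above passes (y_pre, y_test), so the precision and recall columns in the printed report are interchanged. With that correction, for churned users (class 1) precision is 0.90 and recall is 0.92, with an F1 score of 0.91; overall accuracy is 0.86. The model therefore performs reasonably well on churned users, but noticeably worse on retained users (class 0, F1 score 0.68).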
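The effect of the argument order can be seen on a small example. The sketch below computes precision and recall from the definitions with NumPy; the two label arrays and the helper name are made up for illustration:

```python
import numpy as np

# Hypothetical true labels and predictions (1 = churned).
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 1, 1, 0, 0, 1, 1, 1])

def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive class."""
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(y_true, y_pred)
# Passing the arrays in the wrong order swaps the two values.
p_swapped, r_swapped = precision_recall(y_pred, y_true)
print(p, r, p_swapped, r_swapped)
```

The F1 score and the accuracy are symmetric in the two arguments, which is why only the precision and recall columns are affected.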
from IPython.display import Image
from sklearn import tree
import pydotplus
from sklearn.tree import export_graphviz
def TreeShow(dtClass, irisDataSet):
    # irisDataSet is unused; it is kept so that the existing call still works
    # Write an unlabeled version of the tree to tree.pdf
    dot_data = export_graphviz(dtClass, out_file=None)
    graph = pydotplus.graph_from_dot_data(dot_data)
    graph.write_pdf("tree.pdf")
    # class_names follow the sorted class labels: 0 = not loss, 1 = loss
    dot_data = export_graphviz(dtClass, out_file=None,
                               feature_names=['pack_type', 'time_tranf', 'flow_tranf',
                                              'pack_change', 'contract', 'asso_pur', 'group_user'],
                               class_names=['not loss', 'loss'],
                               filled=True, rounded=True,
                               special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    # Return the image so that the notebook displays it
    return Image(graph.create_png())
TreeShow(clf, df)