DataWhale Task 01: Exploratory Data Analysis (EDA)

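This post shows how to use Python to preprocess and analyze the Titanic passenger data, covering sorting, the age distribution, family-size counts, and the relationship between fare and survival. The exploration suggests that passengers who paid higher fares or traveled with more family members were more likely to survive, while young male passengers had a lower survival rate.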
import numpy as np
import pandas as pd
df = pd.read_csv("train_chinese.csv")
df.head()
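|  | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |

The column names stay in Chinese because the code refers to them: 乘客ID (passenger ID), 是否幸存 (survived, 0 or 1), 乘客等级(1/2/3等舱位) (ticket class 1/2/3), 乘客姓名 (name), 性别 (sex), 年龄 (age), 堂兄弟/妹个数 (siblings/spouses aboard, SibSp in the Kaggle data; the Chinese label literally says "cousins"), 父母与小孩个数 (parents/children aboard), 船票信息 (ticket), 票价 (fare), 客舱 (cabin), 登船港口 (port of embarkation).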
df_01 = pd.DataFrame(data=np.arange(8).reshape((2,4)), index=[2,1], columns=['d','a','b','c'])
df_01
|  | d | a | b | c |
| --- | --- | --- | --- | --- |
| 2 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |

Sorting by index

df_01.sort_index()
|  | d | a | b | c |
| --- | --- | --- | --- | --- |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 0 | 1 | 2 | 3 |
df_01.sort_index(axis=1)
|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| 2 | 1 | 2 | 3 | 0 |
| 1 | 5 | 6 | 7 | 4 |
df_01.sort_index(axis=1,ascending=False)
|  | d | c | b | a |
| --- | --- | --- | --- | --- |
| 2 | 0 | 3 | 2 | 1 |
| 1 | 4 | 7 | 6 | 5 |

Sorting by column values

df_01.sort_values(by=['b','c'], ascending=False)
|  | d | a | b | c |
| --- | --- | --- | --- | --- |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 0 | 1 | 2 | 3 |
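`sort_values` also accepts one `ascending` flag per sort key, which helps when the keys should run in opposite directions. Here is a minimal sketch on a toy frame; the frame `toy` and its values are invented for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame(
    {"b": [2, 2, 1, 1], "c": [7, 3, 9, 5]},
    index=["w", "x", "y", "z"],
)

# Sort by 'b' in descending order, then break ties by 'c' in ascending order.
mixed = toy.sort_values(by=["b", "c"], ascending=[False, True])
print(mixed)
```

Passing a single boolean, as in the call above, applies the same direction to every key.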
# Sort by fare, then age, in descending order
df.sort_values(by=['票价','年龄'], ascending=False).head(20)
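|  | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
| 258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
| 737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
| 438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
| 341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
| 742 | 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
| 311 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.0 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
| 299 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
| 118 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.0 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
| 380 | 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.0 | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
| 716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.0 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
| 700 | 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.0 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C |
| 557 | 558 | 0 | 1 | Robbins, Mr. Victor | male | NaN | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
| 527 | 528 | 0 | 1 | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S |
| 377 | 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.0 | 0 | 2 | 113503 | 211.5000 | C82 | C |
| 779 | 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.0 | 0 | 1 | 24160 | 211.3375 | B3 | S |
| 730 | 731 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.0 | 0 | 0 | 24160 | 211.3375 | B5 | S |
| 689 | 690 | 1 | 1 | Madill, Miss. Georgette Alexandra | female | 15.0 | 0 | 1 | 24160 | 211.3375 | B5 | S |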
# Sort by age in descending order
df.sort_values(by='年龄',ascending=False).head(20)
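|  | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 630 | 631 | 1 | 1 | Barkworth, Mr. Algernon Henry Wilson | male | 80.0 | 0 | 0 | 27042 | 30.0000 | A23 | S |
| 851 | 852 | 0 | 3 | Svensson, Mr. Johan | male | 74.0 | 0 | 0 | 347060 | 7.7750 | NaN | S |
| 493 | 494 | 0 | 1 | Artagaveytia, Mr. Ramon | male | 71.0 | 0 | 0 | PC 17609 | 49.5042 | NaN | C |
| 96 | 97 | 0 | 1 | Goldschmidt, Mr. George B | male | 71.0 | 0 | 0 | PC 17754 | 34.6542 | A5 | C |
| 116 | 117 | 0 | 3 | Connors, Mr. Patrick | male | 70.5 | 0 | 0 | 370369 | 7.7500 | NaN | Q |
| 672 | 673 | 0 | 2 | Mitchell, Mr. Henry Michael | male | 70.0 | 0 | 0 | C.A. 24580 | 10.5000 | NaN | S |
| 745 | 746 | 0 | 1 | Crosby, Capt. Edward Gifford | male | 70.0 | 1 | 1 | WE/P 5735 | 71.0000 | B22 | S |
| 33 | 34 | 0 | 2 | Wheadon, Mr. Edward H | male | 66.0 | 0 | 0 | C.A. 24579 | 10.5000 | NaN | S |
| 54 | 55 | 0 | 1 | Ostby, Mr. Engelhart Cornelius | male | 65.0 | 0 | 1 | 113509 | 61.9792 | B30 | C |
| 280 | 281 | 0 | 3 | Duane, Mr. Frank | male | 65.0 | 0 | 0 | 336439 | 7.7500 | NaN | Q |
| 456 | 457 | 0 | 1 | Millet, Mr. Francis Davis | male | 65.0 | 0 | 0 | 13509 | 26.5500 | E38 | S |
| 438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.0 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
| 545 | 546 | 0 | 1 | Nicholson, Mr. Arthur Ernest | male | 64.0 | 0 | 0 | 693 | 26.0000 | NaN | S |
| 275 | 276 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S |
| 483 | 484 | 1 | 3 | Turkula, Mrs. (Hedwig) | female | 63.0 | 0 | 0 | 4134 | 9.5875 | NaN | S |
| 570 | 571 | 1 | 2 | Harris, Mr. George | male | 62.0 | 0 | 0 | S.W./PP 752 | 10.5000 | NaN | S |
| 252 | 253 | 0 | 1 | Stead, Mr. William Thomas | male | 62.0 | 0 | 0 | 113514 | 26.5500 | C87 | S |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0000 | B28 | NaN |
| 555 | 556 | 0 | 1 | Wright, Mr. George | male | 62.0 | 0 | 0 | 113807 | 26.5500 | NaN | S |
| 625 | 626 | 0 | 1 | Sutton, Mr. Frederick | male | 61.0 | 0 | 0 | 36963 | 32.3208 | D50 | S |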

Only 5 of the 20 oldest passengers survived.
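Rather than counting survivors by eye, the same figure can be computed directly. The sketch below is one possible way to do it; the helper `survivors_among_oldest` is a hypothetical name introduced here, and the five-row `sample` frame reuses the ages and survival flags from the `df.head()` output earlier, so it only demonstrates the mechanics.

```python
import pandas as pd

def survivors_among_oldest(frame, n=20, age_col="年龄", survived_col="是否幸存"):
    """Count survivors among the n oldest passengers whose age is known."""
    # Drop rows with a missing age so they can never enter the top n.
    known = frame.dropna(subset=[age_col])
    oldest = known.sort_values(by=age_col, ascending=False).head(n)
    return int(oldest[survived_col].sum())

# Ages and survival flags of the first five passengers (from df.head()).
sample = pd.DataFrame({
    "年龄": [22.0, 38.0, 26.0, 35.0, 35.0],
    "是否幸存": [0, 1, 1, 1, 0],
})

print(survivors_among_oldest(sample, n=3))
```

On the full data, `survivors_among_oldest(df)` should reproduce the count read off the table, assuming `df` was loaded as at the top of the post.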

# Add two DataFrames
frame_a = pd.DataFrame(np.arange(9.).reshape(3,3),
                      index = ['one','two','three'],
                      columns = ['a','b','c'])
frame_b = pd.DataFrame(np.arange(12.).reshape(4,3),
                      index = ['first','one','two','second'],
                      columns = ['a','e','c'])
frame_a+frame_b
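|  | a | b | c | e |
| --- | --- | --- | --- | --- |
| first | NaN | NaN | NaN | NaN |
| one | 3.0 | NaN | 7.0 | NaN |
| second | NaN | NaN | NaN | NaN |
| three | NaN | NaN | NaN | NaN |
| two | 9.0 | NaN | 13.0 | NaN |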
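The `+` operator aligns on both row and column labels, so any cell missing from either frame becomes NaN. If a missing cell should count as zero instead, `DataFrame.add` with `fill_value` is one option. A small sketch using the same two frames:

```python
import numpy as np
import pandas as pd

frame_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                       index=['one', 'two', 'three'],
                       columns=['a', 'b', 'c'])
frame_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                       index=['first', 'one', 'two', 'second'],
                       columns=['a', 'e', 'c'])

# A cell present in only one frame is combined with 0 instead of NaN.
# A cell missing from BOTH frames (for example row 'first', column 'b') stays NaN.
filled = frame_a.add(frame_b, fill_value=0)
print(filled)
```

Whether filling with zero is appropriate depends on what a missing label means in the data; for the family-size column computed next, both inputs share the same index, so no alignment gaps arise.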
df['家族人数'] = df['堂兄弟/妹个数'] + df['父母与小孩个数']
df.head()
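|  | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 乘客姓名 | 性别 | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 船票信息 | 票价 | 客舱 | 登船港口 | 家族人数 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 0 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 1 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 0 |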
df['家族人数'].max()
10
df['家族人数'].value_counts()
0     537
1     161
2     102
3      29
5      22
4      15
6      12
10      7
7       6
Name: 家族人数, dtype: int64
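A natural next step is to relate family size to survival. One way to do that, sketched here under the assumption that 是否幸存 is coded 0/1, is to group by 家族人数 and take the mean of the survival flag, which equals the survival rate of each group. The `sample` frame holds only the first five passengers from `df.head()`, so the numbers it produces say nothing about the full data set.

```python
import pandas as pd

# First five passengers from df.head(); a tiny sample for illustration.
sample = pd.DataFrame({
    "是否幸存": [0, 1, 1, 1, 0],
    "堂兄弟/妹个数": [1, 1, 0, 1, 0],
    "父母与小孩个数": [0, 0, 0, 0, 0],
})
sample["家族人数"] = sample["堂兄弟/妹个数"] + sample["父母与小孩个数"]

# Mean of a 0/1 flag is the survival rate; count shows how many rows back it.
rate_by_family = sample.groupby("家族人数")["是否幸存"].agg(["mean", "count"])
print(rate_by_family)
```

Keeping the `count` column next to the rate matters on the real data, because the larger family sizes have only a handful of passengers each.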
# Gender distribution
df['性别'].value_counts()
male      577
female    314
Name: 性别, dtype: int64
# Basic summary statistics
df.describe()
|  | 乘客ID | 是否幸存 | 乘客等级(1/2/3等舱位) | 年龄 | 堂兄弟/妹个数 | 父母与小孩个数 | 票价 | 家族人数 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 0.904602 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 | 1.613459 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 | 0.000000 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 | 0.000000 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 | 1.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 | 10.000000 |
df['票价'].describe()
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: 票价, dtype: float64

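There are 891 fare values, with a mean of 32.20 and a standard deviation of 49.69. The standard deviation is larger than the mean, so fares vary widely.
75% of passengers paid 31 or less, while the highest fare is about 512. The median (14.45) sits well below the mean, which points to a long right tail.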
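To put numbers on "vary widely", a few ratios can be derived from the `describe()` output. The arithmetic below uses only the statistics printed above; the variable names are my own, and the thresholds one reads into these ratios are a matter of judgment.

```python
# Values copied from df['票价'].describe() above.
mean_fare = 32.204208
std_fare = 49.693429
q1, median_fare, q3 = 7.9104, 14.4542, 31.0

# Coefficient of variation: spread relative to the mean (above 1 here).
cv = std_fare / mean_fare

# Interquartile range: width of the middle 50% of fares.
iqr = q3 - q1

# A mean far above the median suggests a right-skewed distribution.
mean_to_median = mean_fare / median_fare

print(f"CV={cv:.2f}, IQR={iqr:.4f}, mean/median={mean_to_median:.2f}")
```

With the DataFrame loaded, the same quantities could be taken from `df['票价'].std()`, `.mean()`, `.median()`, and `.quantile([0.25, 0.75])` instead of hard-coded constants.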

df['父母与小孩个数'].describe()
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: 父母与小孩个数, dtype: float64