Python Data Analysis: Exploratory Data Analysis

1. Sorting data

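Import the libraries and load the data. The Titanic dataset is used as the running example; its column names are in Chinese, e.g. 票价 (fare), 年龄 (age), 是否幸存 (survived), 兄弟姐妹个数 (number of siblings) and 父母子女个数 (number of parents and children).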

import numpy as np
import pandas as pd
df=pd.read_csv('train_chinese.csv')
df.head(3)
   乘客ID  是否幸存  仓位等级  姓名                                               性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口  Unnamed: 12
0  1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S         NaN
1  2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C         NaN
2  3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S         NaN
# Create a simple DataFrame
frame=pd.DataFrame(np.arange(8).reshape((2,4)),
                  index=['2','1'],
                  columns=['d','a','b','c'])
frame
   d  a  b  c
2  0  1  2  3
1  4  5  6  7

(1) Sorting by a single column

Sort column 'c' in descending order:

frame.sort_values(by='c',ascending=False)
   d  a  b  c
1  4  5  6  7
2  0  1  2  3

(2) Sorting by the row index and the column index

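frame.sort_index()  # sort the row index in ascending order
   d  a  b  c
1  4  5  6  7
2  0  1  2  3
frame.sort_index(axis=1)  # sort the column index in ascending order
   a  b  c  d
2  1  2  3  0
1  5  6  7  4
frame.sort_index(axis=1,ascending=False)  # sort the column index in descending order
   d  c  b  a
2  0  3  2  1
1  4  7  6  5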
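
The row labels in frame are strings ('2', '1'), so sort_index compares them as text. With single-digit labels that makes no difference, but with longer labels it can. The following is a small sketch on a made-up frame (the name labelled and its values are illustrative only); the key argument requires pandas 1.1 or later.

```python
import pandas as pd

# Hypothetical frame whose string labels have different lengths
labelled = pd.DataFrame({'value': [1, 2, 3]}, index=['10', '2', '1'])

# Default: string labels are ordered lexicographically
by_text = labelled.sort_index()

# key= receives the Index and returns the values to sort by
by_number = labelled.sort_index(key=lambda idx: idx.astype(int))

print(list(by_text.index))
print(list(by_number.index))
```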

(3) Sorting by two columns at once

frame.sort_values(by=['a','c'])
   d  a  b  c
2  0  1  2  3
1  4  5  6  7
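
When by is a list, rows are ordered by the first column and ties are broken by the second. The direction can also differ per column by passing a list to ascending. A minimal sketch on toy values (the frame toy_sort is made up for illustration and only borrows the column names of the Titanic file):

```python
import pandas as pd

# Toy data, not taken from train_chinese.csv
toy_sort = pd.DataFrame({'票价': [10.0, 10.0, 5.0, 20.0],
                         '年龄': [30, 20, 40, 25]})

# Highest fare first; within equal fares, youngest first
mixed = toy_sort.sort_values(by=['票价', '年龄'], ascending=[False, True])
print(mixed)
```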

Sort the data from 'train_chinese.csv' by fare (票价) and then age (年龄), both in descending order:

df.sort_values(by=['票价','年龄'],ascending=False).head(20)

     乘客ID  是否幸存  仓位等级  姓名                                               性别    年龄  兄弟姐妹个数  父母子女个数  船票信息  票价      客舱             登船港口  Unnamed: 12
679  680     1         1         Cardeza, Mr. Thomas Drake Martinez                 male    36.0  0             1             PC 17755  512.3292  B51 B53 B55      C         NaN
258  259     1         1         Ward, Miss. Anna                                   female  35.0  0             0             PC 17755  512.3292  NaN              C         NaN
737  738     1         1         Lesurer, Mr. Gustave J                             male    35.0  0             0             PC 17755  512.3292  B101             C         NaN
438  439     0         1         Fortune, Mr. Mark                                  male    64.0  1             4             19950     263.0000  C23 C25 C27      S         NaN
341  342     1         1         Fortune, Miss. Alice Elizabeth                     female  24.0  3             2             19950     263.0000  C23 C25 C27      S         NaN
88   89      1         1         Fortune, Miss. Mabel Helen                         female  23.0  3             2             19950     263.0000  C23 C25 C27      S         NaN
27   28      0         1         Fortune, Mr. Charles Alexander                     male    19.0  3             2             19950     263.0000  C23 C25 C27      S         NaN
742  743     1         1         Ryerson, Miss. Susan Parker "Suzette"              female  21.0  2             2             PC 17608  262.3750  B57 B59 B63 B66  C         NaN
311  312     1         1         Ryerson, Miss. Emily Borie                         female  18.0  2             2             PC 17608  262.3750  B57 B59 B63 B66  C         NaN
299  300     1         1         Baxter, Mrs. James (Helene DeLaudeniere Chaput)    female  50.0  0             1             PC 17558  247.5208  B58 B60          C         NaN
118  119     0         1         Baxter, Mr. Quigg Edmond                           male    24.0  0             1             PC 17558  247.5208  B58 B60          C         NaN
380  381     1         1         Bidois, Miss. Rosalie                              female  42.0  0             0             PC 17757  227.5250  NaN              C         NaN
716  717     1         1         Endres, Miss. Caroline Louise                      female  38.0  0             0             PC 17757  227.5250  C45              C         NaN
700  701     1         1         Astor, Mrs. John Jacob (Madeleine Talmadge Force)  female  18.0  1             0             PC 17757  227.5250  C62 C64          C         NaN
557  558     0         1         Robbins, Mr. Victor                                male    NaN   0             0             PC 17757  227.5250  NaN              C         NaN
527  528     0         1         Farthing, Mr. John                                 male    NaN   0             0             PC 17483  221.7792  C95              S         NaN
377  378     0         1         Widener, Mr. Harry Elkins                          male    27.0  0             2             113503    211.5000  C82              C         NaN
779  780     1         1         Robert, Mrs. Edward Scott (Elisabeth Walton Mc...  female  43.0  0             1             24160     211.3375  B3               S         NaN
730  731     1         1         Allen, Miss. Elisabeth Walton                      female  29.0  0             0             24160     211.3375  B5               S         NaN
689  690     1         1         Madill, Miss. Georgette Alexandra                  female  15.0  0             1             24160     211.3375  B5               S         NaN
df.sort_values(by=['票价','年龄'],ascending=False).tail(20)
     乘客ID  是否幸存  仓位等级  姓名                              性别  年龄  兄弟姐妹个数  父母子女个数  船票信息  票价    客舱         登船港口  Unnamed: 12
818  819     0         3         Holm, Mr. John Fredrik Alexander  male  43.0  0             0             C 7075    6.4500  NaN          S         NaN
843  844     0         3         Lemberopolous, Mr. Peter L        male  34.5  0             0             2683      6.4375  NaN          C         NaN
326  327     0         3         Nysveen, Mr. Johan Hansen         male  61.0  0             0             345364    6.2375  NaN          S         NaN
872  873     0         1         Carlsson, Mr. Frans Olof          male  33.0  0             0             695       5.0000  B51 B53 B55  S         NaN
378  379     0         3         Betros, Mr. Tannous               male  20.0  0             0             2648      4.0125  NaN          C         NaN
597  598     0         3         Johnson, Mr. Alfred               male  49.0  0             0             LINE      0.0000  NaN          S         NaN
263  264     0         1         Harrison, Mr. William             male  40.0  0             0             112059    0.0000  B94          S         NaN
806  807     0         1         Andrews, Mr. Thomas Jr            male  39.0  0             0             112050    0.0000  A36          S         NaN
822  823     0         1         Reuchlin, Jonkheer. John George   male  38.0  0             0             19972     0.0000  NaN          S         NaN
179  180     0         3         Leonard, Mr. Lionel               male  36.0  0             0             LINE      0.0000  NaN          S         NaN
271  272     1         3         Tornquist, Mr. William Henry      male  25.0  0             0             LINE      0.0000  NaN          S         NaN
302  303     0         3         Johnson, Mr. William Cahoone Jr   male  19.0  0             0             LINE      0.0000  NaN          S         NaN
277  278     0         2         Parkes, Mr. Francis "Frank"       male  NaN   0             0             239853    0.0000  NaN          S         NaN
413  414     0         2         Cunningham, Mr. Alfred Fleming    male  NaN   0             0             239853    0.0000  NaN          S         NaN
466  467     0         2         Campbell, Mr. William             male  NaN   0             0             239853    0.0000  NaN          S         NaN
481  482     0         2         Frost, Mr. Anthony Wood "Archie"  male  NaN   0             0             239854    0.0000  NaN          S         NaN
633  634     0         1         Parr, Mr. William Henry Marsh     male  NaN   0             0             112052    0.0000  NaN          S         NaN
674  675     0         2         Watson, Mr. Ennis Hastings        male  NaN   0             0             239856    0.0000  NaN          S         NaN
732  733     0         2         Knight, Mr. Robert J              male  NaN   0             0             239855    0.0000  NaN          S         NaN
815  816     0         1         Fry, Mr. Richard                  male  NaN   0             0             112058    0.0000  B102         S         NaN

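The two sorted views point to a pattern: of the 20 passengers with the highest fares, 14 survived, while of the 20 with the lowest fares only one survived. This suggests, at least to some extent, that fare is related to the chance of survival. Note also that rows with a missing age (NaN) are placed last within each group of equal fares, because sort_values puts NaN at the end by default.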
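
Comparing only the top and bottom 20 rows looks at the extremes. One way to check the same idea across the whole table is to cut the fares into equal-sized bins and compute the survival rate in each. The sketch below is one possible approach, not the original author's method: the helper name survival_rate_by_fare_bin is hypothetical, and the toy frame is made up so that the block runs without the CSV file.

```python
import pandas as pd

def survival_rate_by_fare_bin(df, n_bins=4):
    """Share of survivors (mean of 是否幸存) per fare bin, lowest fares first."""
    # qcut builds bins holding roughly the same number of rows;
    # duplicates='drop' merges bins whose edges coincide (e.g. many equal fares)
    bins = pd.qcut(df['票价'], q=n_bins, duplicates='drop')
    return df.groupby(bins, observed=True)['是否幸存'].mean()

# Toy values for illustration only, not rows from train_chinese.csv
toy_fares = pd.DataFrame({
    '票价': [0.0, 5.0, 7.0, 8.0, 30.0, 80.0, 250.0, 500.0],
    '是否幸存': [0, 0, 0, 1, 0, 1, 1, 1],
})
rates = survival_rate_by_fare_bin(toy_fares, n_bins=2)
print(rates)
```

On the real data the call would be survival_rate_by_fare_bin(df), using the df loaded at the start of the post.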

2. Adding two DataFrames

# As an example, build two DataFrames
a=pd.DataFrame(np.arange(4.).reshape(2,2),
              columns=['a','b'],
              index=['1','2'])
b=pd.DataFrame(np.arange(9.).reshape(3,3),
              columns=['a','b','c'],
              index=['1','2','3'])
a
     a    b
1  0.0  1.0
2  2.0  3.0
b
     a    b    c
1  0.0  1.0  2.0
2  3.0  4.0  5.0
3  6.0  7.0  8.0
a+b  # cells with matching row and column labels are added; cells without a counterpart become NaN
     a    b   c
1  0.0  2.0 NaN
2  5.0  7.0 NaN
3  NaN  NaN NaN
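
If the NaN results are not wanted, DataFrame.add accepts a fill_value that stands in for a cell missing from one of the two frames. This is an addition to the original post; the sketch rebuilds the same a and b.

```python
import numpy as np
import pandas as pd

a = pd.DataFrame(np.arange(4.).reshape(2, 2),
                 columns=['a', 'b'],
                 index=['1', '2'])
b = pd.DataFrame(np.arange(9.).reshape(3, 3),
                 columns=['a', 'b', 'c'],
                 index=['1', '2', '3'])

# A cell present in only one frame is added to fill_value instead of NaN;
# a cell missing from both frames would still come out as NaN
total = a.add(b, fill_value=0)
print(total)
```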

Use the same column-wise addition to find the size of the largest family on board:

max(df['兄弟姐妹个数']+df['父母子女个数'])
10

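The result is 10: the passenger with the most relatives on board was travelling with 10 family members. The sum counts only the relatives, not the passenger, so that family has 11 members once the passenger is included.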
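
The two columns count a passenger's relatives, so a family size that includes the passenger needs one more. A minimal sketch on made-up rows (the frame toy_family and the names in it are illustrative, not from the dataset) that also looks up which row holds the maximum:

```python
import pandas as pd

# Toy rows for illustration only
toy_family = pd.DataFrame({
    '姓名': ['A', 'B', 'C'],
    '兄弟姐妹个数': [1, 0, 3],
    '父母子女个数': [0, 0, 2],
})

relatives = toy_family['兄弟姐妹个数'] + toy_family['父母子女个数']
family_size = relatives + 1          # include the passenger
largest = family_size.max()          # Series.max() instead of the builtin max()
who = toy_family.loc[family_size.idxmax(), '姓名']
print(largest, who)
```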

3. Viewing basic summary statistics

Use the describe function to view basic statistics for fare and age:

df['票价'].describe()
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: 票价, dtype: float64
df['年龄'].describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: 年龄, dtype: float64
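This gives the count, mean, standard deviation, minimum, quartiles, and maximum for fare and age. The count for age is 714 rather than 891 because describe skips missing values, so 177 passengers have no recorded age.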
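
describe can also be called on several columns at once and with custom percentiles, and isna().sum() gives the number of missing values directly. The sketch below uses a small made-up frame (toy_stats) so that it runs without the CSV file; the percentile choices are arbitrary.

```python
import numpy as np
import pandas as pd

# Toy values for illustration only
toy_stats = pd.DataFrame({
    '票价': [7.25, 71.2833, 7.925, 0.0],
    '年龄': [22.0, 38.0, np.nan, np.nan],
})

# One column of statistics per selected column; the rows are labelled
# count, mean, std, min, the requested percentiles, and max
summary = toy_stats[['票价', '年龄']].describe(percentiles=[0.1, 0.5, 0.9])

# Number of missing values per column
missing = toy_stats.isna().sum()

print(summary)
print(missing)
```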