Study Check-in: Hands-on Data Analysis, Task01

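This post covers loading data with Pandas, reading data in chunks, renaming column headers, inspecting data information and missing values, and filtering and sorting data. Worked examples show the DataFrame and Series types and how to read a large file chunk by chunk to save memory. It also looks at deleting, hiding, and filtering columns, and at exploring data with the describe() function.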

1 Loading and inspecting data

1.1 Importing numpy and pandas

import numpy as np
import pandas as pd

1.2 Loading the data

(1) Loading the data with a relative path

data=pd.read_csv('train.csv')   # relative path
data.head()         # head() returns the first 5 rows by default
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

(2) Loading the data with an absolute path

data=pd.read_csv("D:/Jupyter NoteBook/组队学习/hands-on-data-analysis-master/第一单元项目集合/train.csv")
data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

[Hint] If loading with a relative path raises an error, use os.getcwd() to check the current working directory.
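A minimal sketch of that check, assuming the file is named 'train.csv' as in this post. It prints the directory that relative paths are resolved against and whether the file is actually there; the variable names are only illustrative.

```python
import os
from pathlib import Path

# The directory that relative paths such as 'train.csv' are resolved against.
cwd = os.getcwd()
print(cwd)

# 'train.csv' is the file name used in this post; adjust it to your own file.
candidate = Path(cwd) / "train.csv"
print(candidate, candidate.exists())
```

If `exists()` prints False, either move the file into that directory or pass the full path to pd.read_csv().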

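[Think] Now that you know how to load data, try out the difference between pd.read_csv() and pd.read_table(). What do you need to do to make them behave the same? Also look at how '.tsv' and '.csv' files differ, and how to load each kind of file.

1) read_table uses the tab character \t as its default field separator, while read_csv uses the comma ','; both treat each line as one record. Passing sep=',' to read_table (or sep='\t' to read_csv) makes the two behave the same.

2) TSV stands for tab-separated values; CSV stands for comma-separated values.

Reading the data with pd.read_table()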

data=pd.read_table('train.csv',sep=',')
data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S
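The equivalence also works in the other direction. The sketch below uses two tiny made-up in-memory samples (not the Titanic data) to show that read_csv and read_table give the same DataFrame once the separator matches the file.

```python
import io
import pandas as pd

# Tiny made-up samples, one per format (assumed data, not from train.csv).
csv_text = "id,name\n1,Ann\n2,Bob\n"
tsv_text = "id\tname\n1\tAnn\n2\tBob\n"

from_csv = pd.read_csv(io.StringIO(csv_text))                # default sep=','
from_tsv = pd.read_table(io.StringIO(tsv_text))              # default sep='\t'
tsv_via_read_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
csv_via_read_table = pd.read_table(io.StringIO(csv_text), sep=",")

# All four frames hold the same two rows and two columns.
print(from_csv.equals(from_tsv))
print(from_csv.equals(tsv_via_read_csv))
print(from_csv.equals(csv_via_read_table))
```

So a '.tsv' file can be loaded with either function, as long as the separator is the tab character.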

1.3 Reading the data in chunks, 1000 rows per chunk

chunks=pd.read_csv('train.csv',chunksize=1000)
chunks
<pandas.io.parsers.TextFileReader at 0x20b2ef42490>

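[Think] What is chunked reading? Why read in chunks?

[Hint] What type is a chunk? What does each chunk look like when printed in a for loop?

read_csv normally loads the entire file into a single DataFrame, which can use a lot of memory when the file is large. Passing chunksize to read_csv sets the number of rows per chunk; the call then returns an iterable TextFileReader object instead of a DataFrame, and a for loop yields one DataFrame per chunk.

chunks=pd.read_csv('train.csv',chunksize=500)      # the dataset has fewer than 1000 rows (891), so 500 is used here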
for chunk in chunks:
    print(chunk)
     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
495          496         0       3   
496          497         1       1   
497          498         0       3   
498          499         0       1   
499          500         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                                 ...     ...   ...    ...   
495                              Yousseff, Mr. Gerious    male   NaN      0   
496                     Eustis, Miss. Elizabeth Mussey  female  54.0      1   
497                    Shellard, Mr. Frederick William    male   NaN      0   
498    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0      1   
499                                 Svensson, Mr. Olof    male  24.0      0   

     Parch            Ticket      Fare    Cabin Embarked  
0        0         A/5 21171    7.2500      NaN        S  
1        0          PC 17599   71.2833      C85        C  
2        0  STON/O2. 3101282    7.9250      NaN        S  
3        0            113803   53.1000     C123        S  
4        0            373450    8.0500      NaN        S  
..     ...               ...       ...      ...      ...  
495      0              2627   14.4583      NaN        C  
496      0             36947   78.2667      D20        C  
497      0         C.A. 6212   15.1000      NaN        S  
498      2            113781  151.5500  C22 C26        S  
499      0            350035    7.7958      NaN        S  

[500 rows x 12 columns]
     PassengerId  Survived  Pclass                                      Name  \
500          501         0       3                          Calic, Mr. Petar   
501          502         0       3                       Canavan, Miss. Mary   
502          503         0       3            O'Sullivan, Miss. Bridget Mary   
503          504         0       3            Laitinen, Miss. Kristina Sofia   
504          505         1       1                     Maioni, Miss. Roberta   
..           ...       ...     ...                                       ...   
886          887         0       2                     Montvila, Rev. Juozas   
887          888         1       1              Graham, Miss. Margaret Edith   
888          889         0       3  Johnston, Miss. Catherine Helen "Carrie"   
889          890         1       1                     Behr, Mr. Karl Howell   
890          891         0       3                       Dooley, Mr. Patrick   

        Sex   Age  SibSp  Parch      Ticket     Fare Cabin Embarked  
500    male  17.0      0      0      315086   8.6625   NaN        S  
501  female  21.0      0      0      364846   7.7500   NaN        Q  
502  female   NaN      0      0      330909   7.6292   NaN        Q  
503  female  37.0      0      0        4135   9.5875   NaN        S  
504  female  16.0      0      0      110152  86.5000   B79        S  
..      ...   ...    ...    ...         ...      ...   ...      ...  
886    male  27.0      0      0      211536  13.0000   NaN        S  
887  female  19.0      0      0      112053  30.0000   B42        S  
888  female   NaN      1      2  W./C. 6607  23.4500   NaN        S  
889    male  26.0      0      0      111369  30.0000  C148        C  
890    male  32.0      0      0      370376   7.7500   NaN        Q  

[391 rows x 12 columns]
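Printing every chunk shows what a chunk is, but the memory saving only pays off when each chunk is reduced to something small before the next one is read. A minimal sketch under that assumption, using a made-up 10-row in-memory file in place of a large one:

```python
import io
import pandas as pd

# Made-up stand-in for a large file: 10 rows with one numeric column.
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total_rows = 0
total_sum = 0
chunk_sizes = []
chunk_types = set()

# chunksize=4 yields DataFrames of 4, 4 and 2 rows; only one is in memory at a time.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_types.add(type(chunk).__name__)
    chunk_sizes.append(len(chunk))
    total_rows += len(chunk)
    total_sum += int(chunk["value"].sum())

print(chunk_types, chunk_sizes, total_rows, total_sum)
```

Each chunk is an ordinary DataFrame, so any per-chunk aggregation or filtering can go inside the loop.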

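1.4 Renaming the column headers

PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等级(1/2/3等舱位) (ticket class: 1st/2nd/3rd)
Name => 乘客姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 堂兄弟/妹个数 (number of siblings/spouses aboard)
Parch => 父母与小孩个数 (number of parents/children aboard)
Ticket => 船票信息 (ticket number)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)

# View the data information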

data=pd.read_csv('train.csv')
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.columns=['乘客ID','是否幸存', '乘客等级', '乘客姓名','性别','年龄',
              '堂兄弟/妹个数','父母与小孩个数','船票信息','票价','客舱','登船港口']
data_=data.set_index("乘客ID")
data_.head(2)
        是否幸存  乘客等级  乘客姓名                                           性别    年龄  堂兄弟/妹个数  父母与小孩个数  船票信息   票价     客舱  登船港口
乘客ID
1       0         3         Braund, Mr. Owen Harris                            male    22.0  1              0               A/5 21171  7.2500   NaN   S
2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1              0               PC 17599   71.2833  C85   C
data_1= pd.read_csv('train.csv', 
                    names=['乘客ID','是否幸存','乘客等级','乘客姓名','性别',
                           '年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],
                    index_col='乘客ID',header=0)
data_1.head(3)
        是否幸存  乘客等级  乘客姓名                                           性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
乘客ID
1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S
2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C
3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S

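[Think] One way to change the headers to Chinese is to replace the English column names wholesale. Are there other ways?

# Alternative: rename() with a mapping from old names to new names.
# data_1 already carries Chinese headers, so the mapping is applied to a fresh load (data_2 is a name introduced here).
data_2=pd.read_csv('train.csv')
data_2.rename(columns={'PassengerId': '乘客ID', 'Survived': '是否幸存', 
                       'Pclass': '乘客等级(1/2/3等舱位)', 'Name': '乘客姓名', 'Sex': '性别',
                       'Age':'年龄','SibSp':'堂兄弟/妹个数','Parch':'父母与小孩个数',
                       'Ticket':'船票信息','Fare':'票价','Cabin':'客舱','Embarked':'登船港口' }, 
                       inplace=True) 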
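To see what rename() does without the Titanic file, here is a small sketch on a made-up three-column frame. It also shows two behaviors that are easy to miss: keys that match no column are silently ignored by default, and without inplace=True the original frame keeps its old headers.

```python
import pandas as pd

# Small made-up frame that reuses three of the original English headers.
df = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"], "Age": [22.0, 38.0]})

# rename() returns a new DataFrame; 'NoSuchColumn' matches nothing and is ignored.
renamed = df.rename(columns={"PassengerId": "乘客ID", "Sex": "性别", "NoSuchColumn": "x"})

print(list(renamed.columns))
print(list(df.columns))
```

Because unmatched keys are ignored, a rename that appears to "do nothing" usually means the keys do not match the current column names.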

1.5 Viewing basic information about the data

data_1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   是否幸存    891 non-null    int64  
 1   乘客等级    891 non-null    int64  
 2   乘客姓名    891 non-null    object 
 3   性别      891 non-null    object 
 4   年龄      714 non-null    float64
 5   兄弟姐妹个数  891 non-null    int64  
 6   父母子女个数  891 non-null    int64  
 7   船票信息    891 non-null    object 
 8   票价      891 non-null    float64
 9   客舱      204 non-null    object 
 10  登船港口    889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 123.5+ KB

1.6 Viewing the first 10 rows and the last 15 rows of the table

# First 10 rows

data_1.head(10)
        是否幸存  乘客等级  乘客姓名                                           性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
乘客ID
1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S
2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C
3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S
4       1         1         Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1             0             113803            53.1000  C123  S
5       0         3         Allen, Mr. William Henry                           male    35.0  0             0             373450            8.0500   NaN   S
6       0         3         Moran, Mr. James                                   male    NaN   0             0             330877            8.4583   NaN   Q
7       0         1         McCarthy, Mr. Timothy J                            male    54.0  0             0             17463             51.8625  E46   S
8       0         3         Palsson, Master. Gosta Leonard                     male    2.0   3             1             349909            21.0750  NaN   S
9       1         3         Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0             2             347742            11.1333  NaN   S
10      1         2         Nasser, Mrs. Nicholas (Adele Achem)                female  14.0  1             0             237736            30.0708  NaN   C
# Last 15 rows

data_1.tail(15)
        是否幸存  乘客等级  乘客姓名                                       性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
乘客ID
877     0         3         Gustafsson, Mr. Alfred Ossian                  male    20.0  0             0             7534              9.8458   NaN   S
878     0         3         Petroff, Mr. Nedelio                           male    19.0  0             0             349212            7.8958   NaN   S
879     0         3         Laleff, Mr. Kristo                             male    NaN   0             0             349217            7.8958   NaN   S
880     1         1         Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0  0             1             11767             83.1583  C50   C
881     1         2         Shelley, Mrs. William (Imanita Parrish Hall)   female  25.0  0             1             230433            26.0000  NaN   S
882     0         3         Markun, Mr. Johann                             male    33.0  0             0             349257            7.8958   NaN   S
883     0         3         Dahlberg, Miss. Gerda Ulrika                   female  22.0  0             0             7552              10.5167  NaN   S
884     0         2         Banfield, Mr. Frederick James                  male    28.0  0             0             C.A./SOTON 34068  10.5000  NaN   S
885     0         3         Sutehall, Mr. Henry Jr                         male    25.0  0             0             SOTON/OQ 392076   7.0500   NaN   S
886     0         3         Rice, Mrs. William (Margaret Norton)           female  39.0  0             5             382652            29.1250  NaN   Q
887     0         2         Montvila, Rev. Juozas                          male    27.0  0             0             211536            13.0000  NaN   S
888     1         1         Graham, Miss. Margaret Edith                   female  19.0  0             0             112053            30.0000  B42   S
889     0         3         Johnston, Miss. Catherine Helen "Carrie"       female  NaN   1             2             W./C. 6607        23.4500  NaN   S
890     1         1         Behr, Mr. Karl Howell                          male    26.0  0             0             111369            30.0000  C148  C
891     0         3         Dooley, Mr. Patrick                            male    32.0  0             0             370376            7.7500   NaN   Q

1.7 Checking for missing data

# Count the missing values in each column

data_1.isnull().sum()
是否幸存        0
乘客等级        0
乘客姓名        0
性别          0
年龄        177
兄弟姐妹个数      0
父母子女个数      0
船票信息        0
票价          0
客舱        687
登船港口        2
dtype: int64
# notnull() returns False where a value is missing and True elsewhere

data_1.notnull().head()
        是否幸存  乘客等级  乘客姓名  性别   年龄   兄弟姐妹个数  父母子女个数  船票信息  票价   客舱   登船港口
乘客ID
1       True      True      True      True   True   True          True          True      True   False  True
2       True      True      True      True   True   True          True          True      True   True   True
3       True      True      True      True   True   True          True          True      True   False  True
4       True      True      True      True   True   True          True          True      True   True   True
5       True      True      True      True   True   True          True          True      True   False  True
# isnull() checks whether each value is missing: True where it is missing, False elsewhere

data_1.isnull().head()
        是否幸存  乘客等级  乘客姓名  性别   年龄   兄弟姐妹个数  父母子女个数  船票信息  票价   客舱   登船港口
乘客ID
1       False     False     False     False  False  False         False         False     False  True   False
2       False     False     False     False  False  False         False         False     False  False  False
3       False     False     False     False  False  False         False         False     False  True   False
4       False     False     False     False  False  False         False         False     False  False  False
5       False     False     False     False  False  False         False         False     False  True   False
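Since isnull() produces a Boolean table, summing it counts the missing values per column and averaging it gives the missing share. A short sketch on a made-up frame (the column names 'age' and 'cabin' are only illustrative):

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in both columns.
df = pd.DataFrame({"age": [22.0, np.nan, 26.0, np.nan],
                   "cabin": [None, "C85", None, None]})

missing_count = df.isnull().sum()    # True counts as 1, so this is the number of gaps
missing_share = df.isnull().mean()   # fraction of rows that are missing, per column

# notnull() is the element-wise opposite of isnull().
is_complement = bool((df.notnull() == ~df.isnull()).all().all())

print(missing_count)
print(missing_share)
print(is_complement)
```

On the Titanic data the same pattern would express the 177 missing ages and 687 missing cabins as shares of the 891 rows.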

1.8 Saving the data

# Note: depending on the operating system, the saved file may show garbled characters when reopened. Passing encoding='GBK' or encoding='utf-8' can help.

data_1.to_csv('train_chinese.csv',encoding='GBK')
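Whichever encoding is chosen for saving, the same one has to be passed when reading the file back. The sketch below round-trips a made-up frame with Chinese headers through a temporary file; 'utf-8-sig' (UTF-8 with a byte-order mark) is used here as an assumption, since it is commonly recommended when the file will also be opened in Excel. The file name is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Made-up frame with Chinese headers, standing in for data_1.
df = pd.DataFrame({"乘客ID": [1, 2], "性别": ["male", "female"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_chinese.csv")   # hypothetical file name
    # Write and read with the same encoding; index=False keeps the row index out of the file.
    df.to_csv(path, encoding="utf-8-sig", index=False)
    restored = pd.read_csv(path, encoding="utf-8-sig")

print(list(restored.columns))
print(restored["乘客ID"].tolist())
```

Reading a GBK file as UTF-8 (or the reverse) is the usual cause of a UnicodeDecodeError or garbled headers.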

2 pandas basics

Import numpy and pandas before starting

import numpy as np
import pandas as pd

2.1 The DataFrame and Series types

# Creating a Series

name_ages={'张三': 35, '李四': 42, '王二麻子': 25, '李华': 15}
example_1=pd.Series(name_ages)
example_1
张三      35
李四      42
王二麻子    25
李华      15
dtype: int64
# Creating a DataFrame

data = {'name': ['张三', '李四', '王二麻子', '李华'],
        'ages': [35, 42, 25, 15],
        'height': [170,165,175,180]
       }
example_2 = pd.DataFrame(data)
example_2
   name      ages  height
0  张三      35    170
1  李四      42    165
2  王二麻子  25    175
3  李华      15    180
# Delete the height column

example_2=example_2.drop(labels='height',axis=1)
example_2
   name      ages
0  张三      35
1  李四      42
2  王二麻子  25
3  李华      15

2.2 Viewing the column names of a DataFrame

# Load the data

data=pd.read_csv('train_chinese.csv',encoding='GBK')
data.head(3)
   乘客ID  是否幸存  乘客等级  乘客姓名                                           性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
0  1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S
1  2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C
2  3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S
# View the headers

data.columns
Index(['乘客ID', '是否幸存', '乘客等级', '乘客姓名', '性别', '年龄', '兄弟姐妹个数', '父母子女个数', '船票信息',
       '票价', '客舱', '登船港口'],
      dtype='object')

2.3 Viewing the values of a specific column

# View the 客舱 (Cabin) column, method 1

data['客舱'].head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: 客舱, dtype: object
# View the 客舱 (Cabin) column, method 2

data.客舱.head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: 客舱, dtype: object

2.4 Comparing "test_1.csv" with "train.csv" and deleting the extra column from "test_1.csv"

data_1=pd.read_csv('test_1.csv')
data_1.head()
   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked  a
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S         100
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C         100
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S         100
3  3           4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S         100
4  4           5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S         100
# Delete the extra column a

del data_1['a']
data_1.head()
   Unnamed: 0  PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  0           1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  1           2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  2           3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  3           4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  4           5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

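[Think] Are there other ways to delete the extra column?

Answer

# Use drop() in place of `del data_1['a']`; it returns a new DataFrame, so assign the result back.
# If column 'a' has already been deleted, this raises KeyError.
data_1=data_1.drop(labels='a',axis=1)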

2.5 Hiding the ['PassengerId','Name','Age','Ticket'] columns

data.drop(['乘客ID','乘客姓名','年龄','船票信息'],axis=1).head()
   是否幸存  乘客等级  性别    兄弟姐妹个数  父母子女个数  票价     客舱  登船港口
0  0         3         male    1             0             7.2500   NaN   S
1  1         1         female  1             0             71.2833  C85   C
2  1         3         female  0             0             7.9250   NaN   S
3  1         1         female  1             0             53.1000  C123  S
4  0         3         male    0             0             8.0500   NaN   S

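[Think] Compare tasks 5 and 6 (sections 2.4 and 2.5): did they use different methods (functions)? How could the same function meet both requirements?

[Answer]

Both tasks can use drop(). By default drop() returns a new DataFrame and leaves the original untouched, which is enough to hide columns. Passing inplace=True overwrites the original data and really deletes the columns, so it is not used here.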
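The difference between hiding and deleting can be checked on a made-up three-column frame. This is only a sketch of the two call styles, not the post's original data.

```python
import pandas as pd

# Made-up frame; the column names are placeholders.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# Hiding: drop() returns a new frame without 'a'; df itself is unchanged.
hidden = df.drop(columns=["a"])
cols_after_hide = list(df.columns)

# Deleting: inplace=True modifies df itself and returns None.
result = df.drop(columns=["b"], inplace=True)
cols_after_delete = list(df.columns)

print(list(hidden.columns), cols_after_hide, cols_after_delete, result)
```

`drop(columns=[...])` is equivalent to `drop(labels=[...], axis=1)` as used in this post.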

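2.6 Filtering logic

One of the most important capabilities when working with tabular data is filtering: selecting the information you need and discarding the rest.

2.6.1 Filtering passengers whose "Age" is under 10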

data[data.年龄<10].head()
    乘客ID  是否幸存  乘客等级  乘客姓名                                  性别    年龄  兄弟姐妹个数  父母子女个数  船票信息       票价     客舱  登船港口
7   8       0         3         Palsson, Master. Gosta Leonard            male    2.0   3             1             349909         21.0750  NaN   S
10  11      1         3         Sandstrom, Miss. Marguerite Rut           female  4.0   1             1             PP 9549        16.7000  G6    S
16  17      0         3         Rice, Master. Eugene                      male    2.0   4             1             382652         29.1250  NaN   Q
24  25      0         3         Palsson, Miss. Torborg Danira             female  8.0   3             1             349909         21.0750  NaN   S
43  44      1         2         Laroche, Miss. Simonne Marie Anne Andree  female  3.0   1             2             SC/Paris 2123  41.5792  NaN   C

2.6.2 Filtering passengers whose "Age" is over 10 and under 50

midage=data[(data.年龄>10) & (data.年龄<50)]
midage.head(3)
   乘客ID  是否幸存  乘客等级  乘客姓名                                           性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
0  1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S
1  2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C
2  3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S
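Two details of this kind of filter are worth checking on a small made-up column: each comparison needs its own parentheses because `&` binds more tightly than `>` and `<`, and rows with a missing age never pass the filter. The sketch also shows Series.between as an alternative; the string form of `inclusive` assumes pandas 1.3 or later.

```python
import numpy as np
import pandas as pd

# Made-up ages, including both boundary values and one missing value.
ages = pd.DataFrame({"age": [5.0, 10.0, 30.0, 50.0, np.nan]})

# Open interval 10 < age < 50; comparisons with NaN evaluate to False.
mask = (ages["age"] > 10) & (ages["age"] < 50)
strict = ages[mask]

# Same interval with between(); inclusive="neither" excludes both end points.
same = ages[ages["age"].between(10, 50, inclusive="neither")]

print(strict)
print(same)
```

Only the row with age 30.0 survives: 10 and 50 sit on the excluded boundaries, and the NaN row fails both comparisons.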

2.6.3 Showing the "Pclass" and "Sex" data in row 100 of midage

midage.reset_index(drop=True).loc[[100],['乘客等级','性别']]
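     乘客等级  性别
100  2         male

[Hint] When extracting data, we want the rows to keep their relative order. Which function achieves this?

After filtering, midage keeps the original row labels, so label 100 is no longer the row at position 100. Use reset_index to renumber the rows 0, 1, 2, ... without changing their order. With drop=False, the old index is added to the DataFrame as a new column after the reset; with drop=True, the old index is discarded.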
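The effect of the drop parameter is easiest to see on a tiny made-up frame, filtered the same way midage was:

```python
import pandas as pd

# Made-up ages; filtering keeps rows 0 and 2 and their original labels.
df = pd.DataFrame({"age": [22, 4, 35, 8]})
adults = df[df["age"] > 10]

kept = adults.reset_index()              # drop=False (default): old labels move into an 'index' column
dropped = adults.reset_index(drop=True)  # old labels are discarded

print(list(adults.index))
print(kept)
print(dropped)
```

In both results the rows stay in their original relative order; only the labels change.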

2.6.4 Using loc to extract the "Pclass", "Name" and "Sex" data from rows 100, 105 and 108 of midage

midage.reset_index(drop=True).loc[[100,105,108],['乘客等级','乘客姓名','性别']]
     乘客等级  乘客姓名                           性别
100  2         Byles, Rev. Thomas Roussel Davids  male
105  3         Cribb, Mr. John Hatfield           male
108  3         Calic, Mr. Jovo                    male

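2.6.5 Using iloc to extract the "Pclass", "Name" and "Sex" data from rows 100, 105 and 108 of midage

midage.reset_index(drop=True).iloc[[100,105,108],[2,3,4]]   # column positions: 2=乘客等级, 3=乘客姓名, 4=性别
     乘客等级  乘客姓名                           性别
100  2         Byles, Rev. Thomas Roussel Davids  male
105  3         Cribb, Mr. John Hatfield           male
108  3         Calic, Mr. Jovo                    male

[Think] Compare how iloc and loc are alike and how they differ.

Both select rows and columns, but they index differently: loc selects by label (index labels and column names), while iloc selects by integer position. With slices, loc includes the end label, whereas iloc excludes the end position, like ordinary Python slicing.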
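A small sketch makes the difference concrete. The frame is made up, and its index labels (100, 105, 108) are deliberately different from the row positions (0, 1, 2):

```python
import pandas as pd

# Made-up frame whose labels differ from its positions.
df = pd.DataFrame({"x": [10, 20, 30], "y": ["a", "b", "c"]}, index=[100, 105, 108])

by_label = df.loc[[100, 108], ["y"]]   # row labels and column names
by_position = df.iloc[[0, 2], [1]]     # row positions and column positions

# Slices: loc includes the end label, iloc excludes the end position.
loc_slice = df.loc[100:105]
iloc_slice = df.iloc[0:1]

print(by_label.equals(by_position))
print(len(loc_slice), len(iloc_slice))
```

Asking iloc for row 100 on this frame would raise an IndexError, because there are only three positions.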

3 Exploratory data analysis

Import the numpy and pandas packages and load the data

import numpy as np
import pandas as pd
# Load the train_chinese.csv data
data=pd.read_csv('train_chinese.csv',encoding='GBK')
data.head(3)
   乘客ID  是否幸存  乘客等级  乘客姓名                                           性别    年龄  兄弟姐妹个数  父母子女个数  船票信息          票价     客舱  登船港口
0  1       0         3         Braund, Mr. Owen Harris                            male    22.0  1             0             A/5 21171         7.2500   NaN   S
1  2       1         1         Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1             0             PC 17599          71.2833  C85   C
2  3       1         3         Heikkinen, Miss. Laina                             female  26.0  0             0             STON/O2. 3101282  7.9250   NaN   S

3.1 Building a DataFrame

test = pd.DataFrame(np.arange(8).reshape((2, 4)), 
                     index=['2', '1'], 
                     columns=['d', 'a', 'b', 'c'])
test
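   d  a  b  c
2  0  1  2  3
1  4  5  6  7

[Question] Most of the time we want to sort by the values in a column, so sort the DataFrame you built by one column in ascending order.

# Sort by column d in ascending order

test.sort_values(by='d',ascending=True)      # ascending defaults to True, i.e. ascending order
   d  a  b  c
2  0  1  2  3
1  4  5  6  7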

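1. Sort the row index in ascending order

test.sort_index()
   d  a  b  c
1  4  5  6  7
2  0  1  2  3

2. Sort the column index in ascending order

test.sort_index(axis=1)
   a  b  c  d
2  1  2  3  0
1  5  6  7  4

3. Sort the column index in descending order

test.sort_index(axis=1,ascending=False)    # ascending=False sorts in descending order
   d  c  b  a
2  0  3  2  1
1  4  7  6  5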

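4. Sort by two chosen columns together, in descending order

test.sort_values(by=['a','b'],ascending=False)    # ascending=False sorts both columns in descending order
   d  a  b  c
1  4  5  6  7
2  0  1  2  3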
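When sorting by several columns, the first column decides the order and later columns only break ties. `ascending` can also be a list with one flag per column. A sketch on a made-up frame with a tie in the first column (the names 'fare' and 'age' are illustrative):

```python
import pandas as pd

# Made-up frame: rows p and q tie on fare, so age decides their order.
df = pd.DataFrame({"fare": [30.0, 30.0, 7.75], "age": [19.0, 26.0, 32.0]},
                  index=["p", "q", "r"])

# Both columns descending: ties on fare are broken by the larger age first.
both_desc = df.sort_values(by=["fare", "age"], ascending=False)

# fare descending, age ascending: ties on fare are broken by the smaller age first.
mixed = df.sort_values(by=["fare", "age"], ascending=[False, True])

print(list(both_desc.index))
print(list(mixed.index))
```

Row r always comes last because its fare is the lowest, regardless of its age.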

3.2 Sorting by the fare and age columns together (descending)

data.sort_values(by=['票价','年龄'],ascending=False).head()
     乘客ID  是否幸存  乘客等级  乘客姓名                            性别    年龄  兄弟姐妹个数  父母子女个数  船票信息  票价      客舱         登船港口
679  680     1         1         Cardeza, Mr. Thomas Drake Martinez  male    36.0  0             1             PC 17755  512.3292  B51 B53 B55  C
258  259     1         1         Ward, Miss. Anna                    female  35.0  0             0             PC 17755  512.3292  NaN          C
737  738     1         1         Lesurer, Mr. Gustave J              male    35.0  0             0             PC 17755  512.3292  B101         C
438  439     0         1         Fortune, Mr. Mark                   male    64.0  1             4             19950     263.0000  C23 C25 C27  S
341  342     1         1         Fortune, Miss. Alice Elizabeth      female  24.0  3             2             19950     263.0000  C23 C25 C27  S

A few more sorts of the data

# Sort by age and survival in descending order
data.sort_values(by=['年龄','是否幸存'],ascending=False).head(20)
     乘客ID  是否幸存  乘客等级  乘客姓名                                   性别    年龄  兄弟姐妹个数  父母子女个数  船票信息     票价      客舱         登船港口
630  631     1         1         Barkworth, Mr. Algernon Henry Wilson       male    80.0  0             0             27042        30.0000   A23          S
851  852     0         3         Svensson, Mr. Johan                        male    74.0  0             0             347060       7.7750    NaN          S
96   97      0         1         Goldschmidt, Mr. George B                  male    71.0  0             0             PC 17754     34.6542   A5           C
493  494     0         1         Artagaveytia, Mr. Ramon                    male    71.0  0             0             PC 17609     49.5042   NaN          C
116  117     0         3         Connors, Mr. Patrick                       male    70.5  0             0             370369       7.7500    NaN          Q
672  673     0         2         Mitchell, Mr. Henry Michael                male    70.0  0             0             C.A. 24580   10.5000   NaN          S
745  746     0         1         Crosby, Capt. Edward Gifford               male    70.0  1             1             WE/P 5735    71.0000   B22          S
33   34      0         2         Wheadon, Mr. Edward H                      male    66.0  0             0             C.A. 24579   10.5000   NaN          S
54   55      0         1         Ostby, Mr. Engelhart Cornelius             male    65.0  0             1             113509       61.9792   B30          C
280  281     0         3         Duane, Mr. Frank                           male    65.0  0             0             336439       7.7500    NaN          Q
456  457     0         1         Millet, Mr. Francis Davis                  male    65.0  0             0             13509        26.5500   E38          S
438  439     0         1         Fortune, Mr. Mark                          male    64.0  1             4             19950        263.0000  C23 C25 C27  S
545  546     0         1         Nicholson, Mr. Arthur Ernest               male    64.0  0             0             693          26.0000   NaN          S
275  276     1         1         Andrews, Miss. Kornelia Theodosia          female  63.0  1             0             13502        77.9583   D7           S
483  484     1         3         Turkula, Mrs. (Hedwig)                     female  63.0  0             0             4134         9.5875    NaN          S
570  571     1         2         Harris, Mr. George                         male    62.0  0             0             S.W./PP 752  10.5000   NaN          S
829  830     1         1         Stone, Mrs. George Nelson (Martha Evelyn)  female  62.0  0             0             113572       80.0000   B28          NaN
252  253     0         1         Stead, Mr. William Thomas                  male    62.0  0             0             113514       26.5500   C87          S
555  556     0         1         Wright, Mr. George                         male    62.0  0             0             113807       26.5500   NaN          S
170  171     0         1         Van der hoef, Mr. Wyckoff                  male    61.0  0             0             111240       33.5000   B19          S
# Sort by age and survival in ascending order
data.sort_values(by=['年龄','是否幸存']).head(20)
     乘客ID  是否幸存  乘客等级  乘客姓名                           性别    年龄  兄弟姐妹个数  父母子女个数  船票信息         票价      客舱     登船港口
803  804     1         3         Thomas, Master. Assad Alexander    male    0.42  0             1             2625             8.5167    NaN      C
755  756     1         2         Hamalainen, Master. Viljo          male    0.67  1             1             250649           14.5000   NaN      S
469  470     1         3         Baclini, Miss. Helene Barbara      female  0.75  2             1             2666             19.2583   NaN      C
644  645     1         3         Baclini, Miss. Eugenie             female  0.75  2             1             2666             19.2583   NaN      C
78   79      1         2         Caldwell, Master. Alden Gates      male    0.83  0             2             248738           29.0000   NaN      S
831  832     1         2         Richards, Master. George Sibley    male    0.83  1             1             29106            18.7500   NaN      S
305  306     1         1         Allison, Master. Hudson Trevor     male    0.92  1             2             113781           151.5500  C22 C26  S
164  165     0         3         Panula, Master. Eino Viljami       male    1.00  4             1             3101295          39.6875   NaN      S
386  387     0         3         Goodwin, Master. Sidney Leonard    male    1.00  5             2             CA 2144          46.9000   NaN      S
172  173     1         3         Johnson, Miss. Eleanor Ileen       female  1.00  1             1             347742           11.1333   NaN      S
183  184     1         2         Becker, Master. Richard F          male    1.00  2             1             230136           39.0000   F4       S
381  382     1         3         Nakid, Miss. Maria ("Mary")        female  1.00  0             2             2653             15.7417   NaN      C
788  789     1         3         Dean, Master. Bertram Vere         male    1.00  1             2             C.A. 2315        20.5750   NaN      S
827  828     1         2         Mallet, Master. Andre              male    1.00  0             2             S.C./PARIS 2079  37.0042   NaN      C
7    8       0         3         Palsson, Master. Gosta Leonard     male    2.00  3             1             349909           21.0750   NaN      S
16   17      0         3         Rice, Master. Eugene               male    2.00  4             1             382652           29.1250   NaN      Q
119  120     0         3         Andersson, Miss. Ellis Anna Maria  female  2.00  4             2             347082           31.2750   NaN      S
205  206     0         3         Strom, Miss. Telma Matilda         female  2.00  0             1             347054           10.4625   G6       S
297  298     0         1         Allison, Miss. Helen Loraine       female  2.00  1             2             113781           151.5500  C22 C26  S
642  643     0         3         Skoog, Miss. Margit Elizabeth      female  2.00  3             2             347088           27.9000   NaN      S
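"""Age appears to be related to survival: only 5 of the 20 oldest passengers listed above survived, compared with 12 of the 20 youngest."""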

3.3 Computing the sum of two DataFrames

# Create two DataFrames, test1_a and test1_b

test1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                     columns=['a', 'b', 'c'],
                     index=['one', 'two', 'three'])
test1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=['a', 'e', 'c'],
                     index=['first', 'one', 'two', 'second'])
test1_a
       a    b    c
one    0.0  1.0  2.0
two    3.0  4.0  5.0
three  6.0  7.0  8.0
test1_b
        a    e     c
first   0.0  1.0   2.0
one     3.0  4.0   5.0
two     6.0  7.0   8.0
second  9.0  10.0  11.0

Adding test1_a and test1_b

test1_a+test1_b
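        a    b    c     e
first   NaN  NaN  NaN   NaN
one     3.0  NaN  7.0   NaN
second  NaN  NaN  NaN   NaN
three   NaN  NaN  NaN   NaN
two     9.0  NaN  13.0  NaN

[Note] Adding two DataFrames returns a new DataFrame. Cells whose row label and column label exist in both frames are added together; every other cell becomes the missing value NaN.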
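If NaN is not wanted for cells that exist in only one of the two frames, DataFrame.add with fill_value is one option. The sketch below uses two small made-up frames (not test1_a and test1_b) and compares it with the plain + operator.

```python
import numpy as np
import pandas as pd

# Made-up frames that share row 'one' and column 'a' only.
a = pd.DataFrame({"a": [0.0, 3.0], "b": [1.0, 4.0]}, index=["one", "two"])
b = pd.DataFrame({"a": [10.0, 20.0], "e": [1.0, 2.0]}, index=["one", "first"])

plain = a + b                     # NaN wherever either frame lacks the cell
filled = a.add(b, fill_value=0)   # a cell missing on one side counts as 0

print(plain)
print(filled)
```

A cell that is missing in both frames, such as row 'first' and column 'b' here, still comes out as NaN even with fill_value.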

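3.4 Viewing data information with the describe() function

Basic meaning of the values that describe() outputs

'''
count : number of non-missing values
mean : mean of the sample
std : standard deviation of the sample
min : minimum value
25% : 25th percentile (first quartile)
50% : 50th percentile (median)
75% : 75th percentile (third quartile)
max : maximum value
'''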
data.drop(['乘客ID','乘客姓名','船票信息','客舱'],axis=1).describe()
       是否幸存    乘客等级    年龄        兄弟姐妹个数  父母子女个数  票价
count  891.000000  891.000000  714.000000  891.000000    891.000000    891.000000
mean   0.383838    2.308642    29.699118   0.523008      0.381594      32.204208
std    0.486592    0.836071    14.526497   1.102743      0.806057      49.693429
min    0.000000    1.000000    0.420000    0.000000      0.000000      0.000000
25%    0.000000    2.000000    20.125000   0.000000      0.000000      7.910400
50%    0.000000    3.000000    28.000000   0.000000      0.000000      14.454200
75%    1.000000    3.000000    38.000000   1.000000      0.000000      31.000000
max    1.000000    3.000000    80.000000   8.000000      6.000000      512.329200
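Each row of the describe() output can be reproduced with a separate method call, which helps when reading the table. A sketch on a made-up Series with one missing value:

```python
import numpy as np
import pandas as pd

# Made-up values with one gap, to show that count ignores missing values.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
summary = s.describe()

count_matches = summary["count"] == s.count()        # non-missing values only
q25_matches = summary["25%"] == s.quantile(0.25)     # first quartile
median_matches = summary["50%"] == s.median()        # 50% is the median
std_matches = abs(summary["std"] - s.std()) < 1e-12  # sample standard deviation

print(summary)
print(count_matches, q25_matches, median_matches, std_matches)
```

This is why the age column above reports a count of 714 rather than 891: the 177 missing ages are left out of every statistic.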

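Analysis

Looking at age and fare, the oldest passenger was 80 years old and the highest fare was 512.3292.

Looking at the parents/children and siblings/spouses counts, the median of both is 0, which suggests that most passengers traveled without family.

Fewer than half of the passengers survived (the mean of the survival column is about 0.38), the median passenger class is third class, and the mean age is about 30.

The main learning material for this post comes from Datawhale.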
