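Datawhale - Data Analysis - Titanic - Unit 1

1 Chapter 1: Loading the Data and a First Look

1.1 Loading the data

Dataset download: https://www.kaggle.com/c/titanic/overview

1.1.1 Task 1: Import numpy and pandas
# Write your code here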
import numpy as np
import pandas as pd
import os

[Hint] If the import fails, find out how to install numpy and pandas in your Python environment (for example with pip or conda).

1.1.2 Task 2: Load the data

(1) Load the data using a relative path
(2) Load the data using an absolute path

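# Write your code here
# (1) Relative path: resolved against the current working directory
test_data = pd.read_csv('test_1.csv')
# (2) Absolute path: the file is opened explicitly and the handle passed to read_csv;
#     the with block closes the file afterwards
with open('E://study//master3//数据分析//DataWhale//Titanic//hands-on-data-analysis-master//hands-on-data-analysis-master//第一单元项目集合/train.csv') as f:
    train_data = pd.read_csv(f)
# test_data_t = pd.read_table('./test_1.csv')
# os.getcwd()
# test_data_t
train_data.head(5)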


   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
# Write your code here
test_data.head(3)
   Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked    a
0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S  100
1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C  100
2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S  100

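[Hint] If loading via a relative path raises an error, call os.getcwd() to check the current working directory.
[Think] Now that you know how to load data, compare pd.read_csv() with pd.read_table(). What would you change to make them produce the same result? Look up the difference between '.tsv' and '.csv' files and how to load each.
[Summary] Loading the data is the first step of any analysis. You will meet many data formats (e.g. .csv, .tsv, .xlsx), but the approach to loading them is the same. When you hit an unfamiliar problem in later work and projects, look things up (a search engine such as Google helps), understand the business logic, and be clear about what the input and output are.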
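
As one possible answer to the read_csv / read_table question: the two functions differ mainly in their default delimiter (comma versus tab), so passing sep explicitly makes them interchangeable. The sketch below uses tiny made-up samples held in memory with io.StringIO, not the Titanic files.

```python
import io
import pandas as pd

# Tiny made-up samples: the same table as comma-separated and tab-separated text.
csv_text = "id,name,score\n1,Ann,3.5\n2,Bob,4.0\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv defaults to sep=',' and read_table defaults to sep='\t'.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_tsv = pd.read_table(io.StringIO(tsv_text))

# Passing sep explicitly lets either function read either format.
csv_via_table = pd.read_table(io.StringIO(csv_text), sep=",")
tsv_via_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(from_csv.equals(from_tsv))
print(from_csv.equals(csv_via_table))
print(from_csv.equals(tsv_via_csv))
```

All three comparisons should print True, since each call parses the same three columns and two rows.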

1.1.3 Task 3: Read the file in chunks of 1000 rows
# Write your code here
chunker = pd.read_csv('train.csv',chunksize=1000)
for piece in chunker:
    print(type(piece))
    print(len(piece))
    print(piece)

<class 'pandas.core.frame.DataFrame'>
891
     PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
5              6         0       3   
6              7         0       1   
7              8         0       3   
8              9         1       3   
9             10         1       2   
10            11         1       3   
11            12         1       1   
12            13         0       3   
13            14         0       3   
14            15         0       3   
15            16         1       2   
16            17         0       3   
17            18         1       2   
18            19         0       3   
19            20         1       3   
20            21         0       2   
21            22         1       2   
22            23         1       3   
23            24         1       1   
24            25         0       3   
25            26         1       3   
26            27         0       3   
27            28         0       1   
28            29         1       3   
29            30         0       3   
..           ...       ...     ...   
861          862         0       2   
862          863         1       1   
863          864         0       3   
864          865         0       2   
865          866         1       2   
866          867         1       2   
867          868         0       1   
868          869         0       3   
869          870         1       3   
870          871         0       3   
871          872         1       1   
872          873         0       1   
873          874         0       3   
874          875         1       2   
875          876         1       3   
876          877         0       3   
877          878         0       3   
878          879         0       3   
879          880         1       1   
880          881         1       2   
881          882         0       3   
882          883         0       3   
883          884         0       2   
884          885         0       3   
885          886         0       3   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
5                                     Moran, Mr. James    male   NaN      0   
6                              McCarthy, Mr. Timothy J    male  54.0      0   
7                       Palsson, Master. Gosta Leonard    male   2.0      3   
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0   
9                  Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1   
10                     Sandstrom, Miss. Marguerite Rut  female   4.0      1   
11                            Bonnell, Miss. Elizabeth  female  58.0      0   
12                      Saundercock, Mr. William Henry    male  20.0      0   
13                         Andersson, Mr. Anders Johan    male  39.0      1   
14                Vestrom, Miss. Hulda Amanda Adolfina  female  14.0      0   
15                    Hewlett, Mrs. (Mary D Kingcome)   female  55.0      0   
16                                Rice, Master. Eugene    male   2.0      4   
17                        Williams, Mr. Charles Eugene    male   NaN      0   
18   Vander Planke, Mrs. Julius (Emelia Maria Vande...  female  31.0      1   
19                             Masselmani, Mrs. Fatima  female   NaN      0   
20                                Fynney, Mr. Joseph J    male  35.0      0   
21                               Beesley, Mr. Lawrence    male  34.0      0   
22                         McGowan, Miss. Anna "Annie"  female  15.0      0   
23                        Sloper, Mr. William Thompson    male  28.0      0   
24                       Palsson, Miss. Torborg Danira  female   8.0      3   
25   Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...  female  38.0      1   
26                             Emir, Mr. Farred Chehab    male   NaN      0   
27                      Fortune, Mr. Charles Alexander    male  19.0      3   
28                       O'Dwyer, Miss. Ellen "Nellie"  female   NaN      0   
29                                 Todoroff, Mr. Lalio    male   NaN      0   
..                                                 ...     ...   ...    ...   
861                        Giles, Mr. Frederick Edward    male  21.0      1   
862  Swift, Mrs. Frederick Joel (Margaret Welles Ba...  female  48.0      0   
863                  Sage, Miss. Dorothy Edith "Dolly"  female   NaN      8   
864                             Gill, Mr. John William    male  24.0      0   
865                           Bystrom, Mrs. (Karolina)  female  42.0      0   
866                       Duran y More, Miss. Asuncion  female  27.0      1   
867               Roebling, Mr. Washington Augustus II    male  31.0      0   
868                        van Melkebeke, Mr. Philemon    male   NaN      0   
869                    Johnson, Master. Harold Theodor    male   4.0      1   
870                                  Balkic, Mr. Cerin    male  26.0      0   
871   Beckwith, Mrs. Richard Leonard (Sallie Monypeny)  female  47.0      1   
872                           Carlsson, Mr. Frans Olof    male  33.0      0   
873                        Vander Cruyssen, Mr. Victor    male  47.0      0   
874              Abelson, Mrs. Samuel (Hannah Wizosky)  female  28.0      1   
875                   Najib, Miss. Adele Kiamie "Jane"  female  15.0      0   
876                      Gustafsson, Mr. Alfred Ossian    male  20.0      0   
877                               Petroff, Mr. Nedelio    male  19.0      0   
878                                 Laleff, Mr. Kristo    male   NaN      0   
879      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0      0   
880       Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0      0   
881                                 Markun, Mr. Johann    male  33.0      0   
882                       Dahlberg, Miss. Gerda Ulrika  female  22.0      0   
883                      Banfield, Mr. Frederick James    male  28.0      0   
884                             Sutehall, Mr. Henry Jr    male  25.0      0   
885               Rice, Mrs. William (Margaret Norton)  female  39.0      0   
886                              Montvila, Rev. Juozas    male  27.0      0   
887                       Graham, Miss. Margaret Edith  female  19.0      0   
888           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1   
889                              Behr, Mr. Karl Howell    male  26.0      0   
890                                Dooley, Mr. Patrick    male  32.0      0   

     Parch            Ticket      Fare        Cabin Embarked  
0        0         A/5 21171    7.2500          NaN        S  
1        0          PC 17599   71.2833          C85        C  
2        0  STON/O2. 3101282    7.9250          NaN        S  
3        0            113803   53.1000         C123        S  
4        0            373450    8.0500          NaN        S  
5        0            330877    8.4583          NaN        Q  
6        0             17463   51.8625          E46        S  
7        1            349909   21.0750          NaN        S  
8        2            347742   11.1333          NaN        S  
9        0            237736   30.0708          NaN        C  
10       1           PP 9549   16.7000           G6        S  
11       0            113783   26.5500         C103        S  
12       0         A/5. 2151    8.0500          NaN        S  
13       5            347082   31.2750          NaN        S  
14       0            350406    7.8542          NaN        S  
15       0            248706   16.0000          NaN        S  
16       1            382652   29.1250          NaN        Q  
17       0            244373   13.0000          NaN        S  
18       0            345763   18.0000          NaN        S  
19       0              2649    7.2250          NaN        C  
20       0            239865   26.0000          NaN        S  
21       0            248698   13.0000          D56        S  
22       0            330923    8.0292          NaN        Q  
23       0            113788   35.5000           A6        S  
24       1            349909   21.0750          NaN        S  
25       5            347077   31.3875          NaN        S  
26       0              2631    7.2250          NaN        C  
27       2             19950  263.0000  C23 C25 C27        S  
28       0            330959    7.8792          NaN        Q  
29       0            349216    7.8958          NaN        S  
..     ...               ...       ...          ...      ...  
861      0             28134   11.5000          NaN        S  
862      0             17466   25.9292          D17        S  
863      2          CA. 2343   69.5500          NaN        S  
864      0            233866   13.0000          NaN        S  
865      0            236852   13.0000          NaN        S  
866      0     SC/PARIS 2149   13.8583          NaN        C  
867      0          PC 17590   50.4958          A24        S  
868      0            345777    9.5000          NaN        S  
869      1            347742   11.1333          NaN        S  
870      0            349248    7.8958          NaN        S  
871      1             11751   52.5542          D35        S  
872      0               695    5.0000  B51 B53 B55        S  
873      0            345765    9.0000          NaN        S  
874      0         P/PP 3381   24.0000          NaN        C  
875      0              2667    7.2250          NaN        C  
876      0              7534    9.8458          NaN        S  
877      0            349212    7.8958          NaN        S  
878      0            349217    7.8958          NaN        S  
879      1             11767   83.1583          C50        C  
880      1            230433   26.0000          NaN        S  
881      0            349257    7.8958          NaN        S  
882      0              7552   10.5167          NaN        S  
883      0  C.A./SOTON 34068   10.5000          NaN        S  
884      0   SOTON/OQ 392076    7.0500          NaN        S  
885      5            382652   29.1250          NaN        Q  
886      0            211536   13.0000          NaN        S  
887      0            112053   30.0000          B42        S  
888      2        W./C. 6607   23.4500          NaN        S  
889      0            111369   30.0000         C148        C  
890      0            370376    7.7500          NaN        Q  

[891 rows x 12 columns]

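[Think] What is chunked reading, and why use it?
Chunked reading splits the file into blocks of chunksize rows. With chunksize set, pd.read_csv returns a TextFileReader object (called TextParser in older material) instead of a DataFrame; iterating over it yields one block at a time, so per-block statistics can be computed and then merged.
It is useful when the file is too large to load at once, when only part of the data is needed, or when the data has to be processed piece by piece.
[Hint] What type is chunker (the data blocks)? What does each piece look like when printed in a for loop?
chunker itself is the TextFileReader iterator; each piece it yields is a DataFrame. Because train.csv has only 891 rows, fewer than the chunk size of 1000, the loop above runs once and prints the whole table.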
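
To make the "compute per chunk, then merge" idea concrete, here is a minimal sketch. It assumes a made-up ten-row table (columns id and value are invented for illustration) and a chunk size of 4, so several chunks are actually produced.

```python
import io
import pandas as pd

# Made-up data: 10 rows with a numeric column, standing in for a large CSV file.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

# With chunksize set, read_csv returns an iterator instead of a DataFrame.
reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)

total = 0
chunk_sizes = []
for chunk in reader:                 # each chunk is an ordinary DataFrame
    chunk_sizes.append(len(chunk))
    total += chunk["value"].sum()    # merge the per-chunk results as we go

print(type(reader).__name__)
print(chunk_sizes)   # 10 rows split as 4 + 4 + 2
print(total)
```

Only one chunk is held in memory at a time, which is the point of the technique when the file is large.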

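1.1.4 Task 4: Change the column headers to Chinese and use the passenger ID as the index [for English-language material, translating the headers can make the data more intuitive to work with]

PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等级 (ticket class: 1st/2nd/3rd)
Name => 乘客姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 堂兄弟/妹个数 (number of siblings/spouses aboard)
Parch => 父母与小孩个数 (number of parents/children aboard)
Ticket => 船票信息 (ticket number)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)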

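# Write your code here
# header=0 drops the English header row; names supplies the Chinese column names
names = ['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口']
train_data = pd.read_csv('train.csv', names=names, index_col='乘客ID', header=0)
train_data.head(3)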

        是否幸存  仓位等级                                               姓名    性别  年龄  兄弟姐妹个数  父母子女个数          船票信息     票价  客舱  登船港口
乘客ID
1              0         3                            Braund, Mr. Owen Harris    male  22.0             1             0         A/5 21171   7.2500   NaN         S
2              1         1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0             1             0          PC 17599  71.2833   C85         C
3              1         3                             Heikkinen, Miss. Laina  female  26.0             0             0  STON/O2. 3101282   7.9250   NaN         S

[Think] One way to change the headers to Chinese is to replace the English column names while reading the file. Are there other ways?
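
One alternative, sketched here under the assumption that the file has already been loaded with its English headers, is to rename the columns afterwards with DataFrame.rename and then set the index. The two-row frame below is a stand-in for train.csv and keeps only three of its columns.

```python
import pandas as pd

# A two-row stand-in for train.csv with only three of its columns.
df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Sex": ["male", "female"],
})

# Map old column names to new ones; columns not listed would keep their names.
column_map = {"PassengerId": "乘客ID", "Survived": "是否幸存", "Sex": "性别"}
renamed = df.rename(columns=column_map).set_index("乘客ID")

print(renamed)
```

rename returns a new DataFrame, so df keeps its English headers; assigning a full list to df.columns is another option when every column is being renamed.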

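1.2 First look

After loading the data you will usually want an overview of its overall structure and a few sample rows: how large it is, how many columns it has, what type each column is, whether it contains nulls, and so on.

1.2.1 Task 1: View basic information about the data
# Write your code here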
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
是否幸存      891 non-null int64
仓位等级      891 non-null int64
姓名        891 non-null object
性别        891 non-null object
年龄        714 non-null float64
兄弟姐妹个数    891 non-null int64
父母子女个数    891 non-null int64
船票信息      891 non-null object
票价        891 non-null float64
客舱        204 non-null object
登船港口      889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

[Hint] Several functions can do this; try summarizing them.

train_data.describe()
         是否幸存    仓位等级        年龄  兄弟姐妹个数  父母子女个数        票价
count  891.000000  891.000000  714.000000    891.000000    891.000000  891.000000
mean     0.383838    2.308642   29.699118      0.523008      0.381594   32.204208
std      0.486592    0.836071   14.526497      1.102743      0.806057   49.693429
min      0.000000    1.000000    0.420000      0.000000      0.000000    0.000000
25%      0.000000    2.000000   20.125000      0.000000      0.000000    7.910400
50%      0.000000    3.000000   28.000000      0.000000      0.000000   14.454200
75%      1.000000    3.000000   38.000000      1.000000      0.000000   31.000000
max      1.000000    3.000000   80.000000      8.000000      6.000000  512.329200
1.2.2 Task 2: Look at the first 10 rows and the last 15 rows of the table
# Write your code here
train_data.head(10)
        是否幸存  仓位等级                                               姓名    性别  年龄  兄弟姐妹个数  父母子女个数          船票信息     票价  客舱  登船港口
乘客ID
1              0         3                            Braund, Mr. Owen Harris    male  22.0             1             0         A/5 21171   7.2500   NaN         S
2              1         1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0             1             0          PC 17599  71.2833   C85         C
3              1         3                             Heikkinen, Miss. Laina  female  26.0             0             0  STON/O2. 3101282   7.9250   NaN         S
4              1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0             1             0            113803  53.1000  C123         S
5              0         3                           Allen, Mr. William Henry    male  35.0             0             0            373450   8.0500   NaN         S
6              0         3                                   Moran, Mr. James    male   NaN             0             0            330877   8.4583   NaN         Q
7              0         1                            McCarthy, Mr. Timothy J    male  54.0             0             0             17463  51.8625   E46         S
8              0         3                     Palsson, Master. Gosta Leonard    male   2.0             3             1            349909  21.0750   NaN         S
9              1         3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0             0             2            347742  11.1333   NaN         S
10             1         2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0             1             0            237736  30.0708   NaN         C
# Write your code here
train_data.tail(15)
        是否幸存  仓位等级                                           姓名    性别  年龄  兄弟姐妹个数  父母子女个数          船票信息     票价  客舱  登船港口
乘客ID
877            0         3                  Gustafsson, Mr. Alfred Ossian    male  20.0             0             0              7534   9.8458   NaN         S
878            0         3                           Petroff, Mr. Nedelio    male  19.0             0             0            349212   7.8958   NaN         S
879            0         3                             Laleff, Mr. Kristo    male   NaN             0             0            349217   7.8958   NaN         S
880            1         1  Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0             0             1             11767  83.1583   C50         C
881            1         2   Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0             0             1            230433  26.0000   NaN         S
882            0         3                             Markun, Mr. Johann    male  33.0             0             0            349257   7.8958   NaN         S
883            0         3                   Dahlberg, Miss. Gerda Ulrika  female  22.0             0             0              7552  10.5167   NaN         S
884            0         2                  Banfield, Mr. Frederick James    male  28.0             0             0  C.A./SOTON 34068  10.5000   NaN         S
885            0         3                         Sutehall, Mr. Henry Jr    male  25.0             0             0   SOTON/OQ 392076   7.0500   NaN         S
886            0         3           Rice, Mrs. William (Margaret Norton)  female  39.0             0             5            382652  29.1250   NaN         Q
887            0         2                          Montvila, Rev. Juozas    male  27.0             0             0            211536  13.0000   NaN         S
888            1         1                   Graham, Miss. Margaret Edith  female  19.0             0             0            112053  30.0000   B42         S
889            0         3       Johnston, Miss. Catherine Helen "Carrie"  female   NaN             1             2        W./C. 6607  23.4500   NaN         S
890            1         1                          Behr, Mr. Karl Howell    male  26.0             0             0            111369  30.0000  C148         C
891            0         3                            Dooley, Mr. Patrick    male  32.0             0             0            370376   7.7500   NaN         Q
1.2.3 Task 3: Check whether values are null: return True where a value is missing and False elsewhere
# Write your code here
train_data.isnull().head()
        是否幸存  仓位等级   姓名   性别   年龄  兄弟姐妹个数  父母子女个数  船票信息   票价   客舱  登船港口
乘客ID
1          False     False  False  False  False         False         False     False  False   True     False
2          False     False  False  False  False         False         False     False  False  False     False
3          False     False  False  False  False         False         False     False  False   True     False
4          False     False  False  False  False         False         False     False  False  False     False
5          False     False  False  False  False         False         False     False  False   True     False

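[Summary] All of the operations above are ways of inspecting the data itself.

[Think] From what other angles can a dataset be examined? Look for answers; they will help a great deal with the analysis that follows.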
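
A few further angles, offered as a starting point rather than a complete list: shape, column types, missing-value counts, distinct-value counts, and category frequencies. The sketch assumes a small made-up sample that only mimics the kinds of columns in the Titanic data.

```python
import numpy as np
import pandas as pd

# Small made-up sample with the same kinds of columns as the Titanic data.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
})

print(df.shape)                   # (rows, columns)
print(df.dtypes)                  # data type of each column
print(df.isnull().sum())          # missing values per column
print(df.nunique())               # distinct non-null values per column
print(df["Sex"].value_counts())   # frequency of each category
```

On the real data, isnull().sum() condenses the True/False table from Task 3 into one count per column.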

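1.3 Saving the data

1.3.1 Task 1: Save the data you loaded and modified to a new file, train_chinese.csv, in the working directory
# Write your code here
# Note: on some operating systems the saved file may show garbled characters. Try passing encoding='GBK' or encoding='utf-8'.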
train_data.to_csv('train_Chinese.csv',encoding='utf-8')
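
To check that an encoding choice survives a round trip, the file can be written and read back with the same encoding. This is a minimal sketch: the file name train_chinese_demo.csv and the two-row frame are made up for illustration, and 'utf-8-sig' is one option that is not mentioned above.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"乘客ID": [1, 2], "性别": ["male", "female"]}).set_index("乘客ID")

# Write to a temporary folder so the sketch does not touch the working directory.
path = os.path.join(tempfile.mkdtemp(), "train_chinese_demo.csv")  # hypothetical file name

# 'utf-8-sig' writes a byte-order mark, which helps some spreadsheet programs
# detect UTF-8; plain 'utf-8' is also fine when the file is read back with pandas.
df.to_csv(path, encoding="utf-8-sig")

# Read back with the same encoding and restore the index column.
restored = pd.read_csv(path, encoding="utf-8-sig", index_col="乘客ID")
print(restored.equals(df))
```

If the Chinese headers come back intact, the encoding used for writing and reading matches.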

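[Summary] That covers loading data and the basics. Next we turn to computing on the data itself, focusing on how numpy and pandas are used in real work and project settings.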

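1.4 Know what your data is called

We are learning the basic operations of pandas. So what type is the data we loaded with pandas in the previous section?

Before starting, import numpy and pandas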

import numpy as np
import pandas as pd
1.4.1 Task 1: pandas has two main data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of each yourself [open-ended]

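Reference: https://www.cnblogs.com/lavender1221/p/12664641.html#
pandas is built around three core data structures: Series, DataFrame, and Index. The vast majority of operations revolve around them.

A Series is a one-dimensional array object made up of a sequence of values and a matching sequence of index labels. A NumPy one-dimensional array accesses elements through an implicit integer index, whereas a Series associates each element with an explicitly defined index. The explicit index makes a Series more capable: labels need not be integers (they can be strings, for example), need not be consecutive, and may even repeat.

A DataFrame is the central pandas data structure. It represents a two-dimensional table, similar to a table in a relational database, in which each column can hold a different value type (numbers, strings, booleans, and so on). A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share the same index.

There are many ways to create a DataFrame; the most common is from a dictionary of equal-length lists or NumPy arrays. The columns and index attributes show a DataFrame's column labels and row labels.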
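
Two of the points above are easy to verify directly: index labels may repeat, and each DataFrame column is a Series sharing the frame's row index. The following is a small sketch with made-up values and labels (x, y, r1, r2 are arbitrary).

```python
import pandas as pd

# A Series index can hold strings and may repeat labels.
s = pd.Series([10, 20, 30], index=["x", "y", "x"])
print(s["x"])        # a repeated label returns every matching element

# A DataFrame built from a dict of equal-length lists.
df = pd.DataFrame({"city": ["nanjing", "wuxi"], "code": ["001", "002"]},
                  index=["r1", "r2"])
print(df.columns)    # column labels
print(df.index)      # row labels

# Each column is a Series that shares the DataFrame's row index.
city = df["city"]
print(type(city).__name__, city.index.equals(df.index))
```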

# Write your code here
sdata_1 = [7,-2,567,8]
example_1 = pd.Series(sdata_1,index = ['a','b','c','d'])
example_1
a      7
b     -2
c    567
d      8
dtype: int64
sdata_2 = {'a':7,'b':-2,'c':567,'d':8}
example_2 = pd.Series(sdata_2)
example_2
a      7
b     -2
c    567
d      8
dtype: int64
sdata_3 = {'city':['nanjing','wuxi','wuhan','changsha'],
           'code':['001','002','003','004']}
example_3 = pd.DataFrame(sdata_3)
example_3
       city  code
0   nanjing   001
1      wuxi   002
2     wuhan   003
3  changsha   004
'''
# Our example
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
'''
'''
# Our example
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
'''
1.4.2 Task 2: Load the "train.csv" file using the method from the previous lesson
# Write your code here
train_chinese = pd.read_csv('train_Chinese.csv')
train_chinese.head()
train_data = pd.read_csv('train.csv')

You can also load the "train_chinese.csv" file saved in the previous lesson. Having used the translated train_chinese.csv to get familiar with the dataset, we now work on train.csv.

1.4.3 Task 3: View the name of every column in the DataFrame
# Write your code here
train_chinese.columns
Index(['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄', '兄弟姐妹个数', '父母子女个数', '船票信息',
       '票价', '客舱', '登船港口'],
      dtype='object')
train_data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
train_data.head()
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
1.4.4 Task 4: View all the values in the "Cabin" column [there are several ways]
# Write your code here
train_data['Cabin'].head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
# Write your code here
train_data.Cabin.head()
0     NaN
1     C85
2     NaN
3    C123
4     NaN
Name: Cabin, dtype: object
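1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra columns

On inspection, the test set test_1.csv has a redundant column, a, which we need to delete.

# Write your code here
test_data = pd.read_csv('test_1.csv')
test_data.head()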
   Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked    a
0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S  100
1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C  100
2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S  100
3           3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S  100
4           4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S  100
# Write your code here
test_data.pop('a').head()   # pop removes column 'a' in place and returns it
test_data
     Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket      Fare        Cabin  Embarked
0             0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171    7.2500          NaN         S
1             1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599   71.2833          C85         C
2             2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282    7.9250          NaN         S
3             3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803   53.1000         C123         S
4             4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450    8.0500          NaN         S
5             5            6         0       3                                   Moran, Mr. James    male   NaN      0      0            330877    8.4583          NaN         Q
6             6            7         0       1                            McCarthy, Mr. Timothy J    male  54.0      0      0             17463   51.8625          E46         S
7             7            8         0       3                     Palsson, Master. Gosta Leonard    male   2.0      3      1            349909   21.0750          NaN         S
8             8            9         1       3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0      0      2            347742   11.1333          NaN         S
9             9           10         1       2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0      1      0            237736   30.0708          NaN         C
10           10           11         1       3                    Sandstrom, Miss. Marguerite Rut  female   4.0      1      1           PP 9549   16.7000           G6         S
11           11           12         1       1                           Bonnell, Miss. Elizabeth  female  58.0      0      0            113783   26.5500         C103         S
12           12           13         0       3                     Saundercock, Mr. William Henry    male  20.0      0      0         A/5. 2151    8.0500          NaN         S
13           13           14         0       3                        Andersson, Mr. Anders Johan    male  39.0      1      5            347082   31.2750          NaN         S
14           14           15         0       3               Vestrom, Miss. Hulda Amanda Adolfina  female  14.0      0      0            350406    7.8542          NaN         S
15           15           16         1       2                    Hewlett, Mrs. (Mary D Kingcome)  female  55.0      0      0            248706   16.0000          NaN         S
16           16           17         0       3                               Rice, Master. Eugene    male   2.0      4      1            382652   29.1250          NaN         Q
17           17           18         1       2                       Williams, Mr. Charles Eugene    male   NaN      0      0            244373   13.0000          NaN         S
18           18           19         0       3  Vander Planke, Mrs. Julius (Emelia Maria Vande...  female  31.0      1      0            345763   18.0000          NaN         S
19           19           20         1       3                            Masselmani, Mrs. Fatima  female   NaN      0      0              2649    7.2250          NaN         C
20           20           21         0       2                               Fynney, Mr. Joseph J    male  35.0      0      0            239865   26.0000          NaN         S
21           21           22         1       2                              Beesley, Mr. Lawrence    male  34.0      0      0            248698   13.0000          D56         S
22           22           23         1       3                        McGowan, Miss. Anna "Annie"  female  15.0      0      0            330923    8.0292          NaN         Q
23           23           24         1       1                       Sloper, Mr. William Thompson    male  28.0      0      0            113788   35.5000           A6         S
24           24           25         0       3                      Palsson, Miss. Torborg Danira  female   8.0      3      1            349909   21.0750          NaN         S
25           25           26         1       3  Asplund, Mrs. Carl Oscar (Selma Augusta Emilia...  female  38.0      1      5            347077   31.3875          NaN         S
26           26           27         0       3                            Emir, Mr. Farred Chehab    male   NaN      0      0              2631    7.2250          NaN         C
27           27           28         0       1                     Fortune, Mr. Charles Alexander    male  19.0      3      2             19950  263.0000  C23 C25 C27         S
28           28           29         1       3                      O'Dwyer, Miss. Ellen "Nellie"  female   NaN      0      0            330959    7.8792          NaN         Q
29           29           30         0       3                                Todoroff, Mr. Lalio    male   NaN      0      0            349216    7.8958          NaN         S
..          ...          ...       ...     ...                                                ...     ...   ...    ...    ...               ...       ...          ...       ...
861         861          862         0       2                        Giles, Mr. Frederick Edward    male  21.0      1      0             28134   11.5000          NaN         S
862         862          863         1       1  Swift, Mrs. Frederick Joel (Margaret Welles Ba...  female  48.0      0      0             17466   25.9292          D17         S
863         863          864         0       3                  Sage, Miss. Dorothy Edith "Dolly"  female   NaN      8      2          CA. 2343   69.5500          NaN         S
864         864          865         0       2                             Gill, Mr. John William    male  24.0      0      0            233866   13.0000          NaN         S
865         865          866         1       2                           Bystrom, Mrs. (Karolina)  female  42.0      0      0            236852   13.0000          NaN         S
866         866          867         1       2                       Duran y More, Miss. Asuncion  female  27.0      1      0     SC/PARIS 2149   13.8583          NaN         C
867         867          868         0       1               Roebling, Mr. Washington Augustus II    male  31.0      0      0          PC 17590   50.4958          A24         S
868         868          869         0       3                        van Melkebeke, Mr. Philemon    male   NaN      0      0            345777    9.5000          NaN         S
869         869          870         1       3                    Johnson, Master. Harold Theodor    male   4.0      1      1            347742   11.1333          NaN         S
870         870          871         0       3                                  Balkic, Mr. Cerin    male  26.0      0      0            349248    7.8958          NaN         S
871         871          872         1       1   Beckwith, Mrs. Richard Leonard (Sallie Monypeny)  female  47.0      1      1             11751   52.5542          D35         S
872         872          873         0       1                           Carlsson, Mr. Frans Olof    male  33.0      0      0               695    5.0000  B51 B53 B55         S
873         873          874         0       3                        Vander Cruyssen, Mr. Victor    male  47.0      0      0            345765    9.0000          NaN         S
874         874          875         1       2              Abelson, Mrs. Samuel (Hannah Wizosky)  female  28.0      1      0         P/PP 3381   24.0000          NaN         C
875         875          876         1       3                   Najib, Miss. Adele Kiamie "Jane"  female  15.0      0      0              2667    7.2250          NaN         C
876         876          877         0       3                      Gustafsson, Mr. Alfred Ossian    male  20.0      0      0              7534    9.8458          NaN         S
877         877          878         0       3                               Petroff, Mr. Nedelio    male  19.0      0      0            349212    7.8958          NaN         S
878         878          879         0       3                                 Laleff, Mr. Kristo    male   NaN      0      0            349217    7.8958          NaN         S
879         879          880         1       1      Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0      0      1             11767   83.1583          C50         C
880         880          881         1       2       Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0      0      1            230433   26.0000          NaN         S
881         881          882         0       3                                 Markun, Mr. Johann    male  33.0      0      0            349257    7.8958          NaN         S
882         882          883         0       3                       Dahlberg, Miss. Gerda Ulrika  female  22.0      0      0              7552   10.5167          NaN         S
883         883          884         0       2                      Banfield, Mr. Frederick James    male  28.0      0      0  C.A./SOTON 34068   10.5000          NaN         S
884         884          885         0       3                             Sutehall, Mr. Henry Jr    male  25.0      0      0   SOTON/OQ 392076    7.0500          NaN         S
885         885          886         0       3               Rice, Mrs. William (Margaret Norton)  female  39.0      0      5            382652   29.1250          NaN         Q
886         886          887         0       2                              Montvila, Rev. Juozas    male  27.0      0      0            211536   13.0000          NaN         S
887         887          888         1       1                       Graham, Miss. Margaret Edith  female  19.0      0      0            112053   30.0000          B42         S
888         888          889         0       3           Johnston, Miss. Catherine Helen "Carrie"  female   NaN      1      2        W./C. 6607   23.4500          NaN         S
889         889          890         1       1                              Behr, Mr. Karl Howell    male  26.0      0      0            111369   30.0000         C148         C
890         890          891         0       3                                Dooley, Mr. Patrick    male  32.0      0      0            370376    7.7500          NaN         Q

891 rows × 13 columns

[Think] Are there other ways to delete the extra column?

# Answer (an alternative to pop; column 'a' must still be present when this runs)
del test_data['a']
test_data.head()
   Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3           3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4           4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S
1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and view only the remaining columns
# Write your code here
test_data.drop(['PassengerId','Name','Age','Ticket'],axis=1).head()
   Unnamed: 0  Survived  Pclass     Sex  SibSp  Parch     Fare  Cabin  Embarked
0           0         0       3    male      1      0   7.2500    NaN         S
1           1         1       1  female      1      0  71.2833    C85         C
2           2         1       3  female      0      0   7.9250    NaN         S
3           3         1       1  female      1      0  53.1000   C123         S
4           4         0       3    male      0      0   8.0500    NaN         S

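[Think] Compare Task 5 and Task 6: they used different methods (functions). How could the same function meet both requirements?

[Answer]

drop() can do both. To delete the columns from the data for good, pass inplace=True (or assign the result back). Because inplace=True overwrites the original data, it is not used in Task 6, where the columns are only hidden for viewing.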
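
The difference between hiding and deleting with drop() can be seen on a small frame. This sketch assumes a made-up two-row table that mimics test_1.csv, including its extra column a.

```python
import pandas as pd

df = pd.DataFrame({"PassengerId": [1, 2], "Sex": ["male", "female"], "a": [100, 100]})

# Hiding columns: drop() returns a new DataFrame and leaves df untouched.
view = df.drop(columns=["PassengerId"])
print(list(view.columns))
print(list(df.columns))

# Deleting a column for good: modify df itself
# (equivalent to df = df.drop(columns=["a"])).
df.drop(columns=["a"], inplace=True)
print(list(df.columns))
```

Assigning the result back is often preferred over inplace=True because it makes the overwrite explicit.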

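1.5 The logic of filtering

One of the most important things to be able to do with tabular data is filter it: select the information you need and discard what you do not.

As before, we learn this pandas feature through practice.

1.5.1 Task 1: Using "Age" as the filter condition, show the passengers under 10 years old.
# Write your code here
test_data[test_data['Age']<10].head()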
    Unnamed: 0  PassengerId  Survived  Pclass                                      Name     Sex  Age  SibSp  Parch         Ticket     Fare  Cabin  Embarked
7            7            8         0       3            Palsson, Master. Gosta Leonard    male  2.0      3      1         349909  21.0750    NaN         S
10          10           11         1       3           Sandstrom, Miss. Marguerite Rut  female  4.0      1      1        PP 9549  16.7000     G6         S
16          16           17         0       3                      Rice, Master. Eugene    male  2.0      4      1         382652  29.1250    NaN         Q
24          24           25         0       3             Palsson, Miss. Torborg Danira  female  8.0      3      1         349909  21.0750    NaN         S
43          43           44         1       2  Laroche, Miss. Simonne Marie Anne Andree  female  3.0      1      2  SC/Paris 2123  41.5792    NaN         C
1.5.2 Task 2: Using "Age" as the condition, show the passengers older than 10 and younger than 50, and name this data midage
# Write your code here
midage = test_data[(test_data['Age']>10) & (test_data['Age']<50)]
midage.head()
   Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3           3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4           4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S

[Hint] Learn how conditional filtering works in pandas and how to combine conditions as an intersection (AND) or a union (OR).
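
As a small illustration of the hint, boolean masks combine with & (intersection), | (union) and ~ (negation). The five-row frame below is made up; one Age is missing on purpose to show how NaN behaves in comparisons.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C", "D", "E"],
                   "Age": [5.0, 22.0, 49.0, 63.0, np.nan]})

# Intersection (AND): both conditions must hold. The parentheses are required
# because & binds more tightly than the comparison operators.
mid = df[(df["Age"] > 10) & (df["Age"] < 50)]

# Union (OR): either condition may hold.
young_or_old = df[(df["Age"] <= 10) | (df["Age"] >= 50)]

# Negation (NOT): rows with a missing Age fail every comparison,
# so they show up only in the negated selection.
not_mid = df[~((df["Age"] > 10) & (df["Age"] < 50))]

print(list(mid["Name"]))
print(list(young_or_old["Name"]))
print(list(not_mid["Name"]))
```

Python's and / or keywords do not work element-wise on Series, which is why the operators &, | and ~ are used instead.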

1.5.3 Task 3: Show the "Pclass" and "Sex" values of row 100 of midage
# Write your code here
midage = midage.reset_index()
midage.head()
   index  Unnamed: 0  PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare  Cabin  Embarked
0      0           0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500    NaN         S
1      1           1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833    C85         C
2      2           2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250    NaN         S
3      3           3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1      0            113803  53.1000   C123         S
4      4           4            5         0       3                           Allen, Mr. William Henry    male  35.0      0      0            373450   8.0500    NaN         S

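[Hint] When extracting data we want the relative order of the rows to stay unchanged. Which function achieves this?
reset_index(): builds a new DataFrame or Series with a fresh default index (0, 1, 2, ...). The rows keep their relative order, and by default the old index is kept as a new column named 'index'. After the reset, the label 100 refers to the row at position 100 of midage, counting from 0.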
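
A minimal sketch of what reset_index() does to a filtered frame, with and without drop=True. The four-row frame is made up; the filter is the same age condition used for midage.

```python
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 4.0, 38.0, 60.0]}, index=[0, 1, 2, 3])
subset = df[(df["Age"] > 10) & (df["Age"] < 50)]   # keeps original labels 0 and 2

# Default: the old labels move into a new column called 'index'.
kept = subset.reset_index()
# drop=True: the old labels are discarded instead.
dropped = subset.reset_index(drop=True)

print(list(subset.index))
print(list(kept.columns), list(kept.index))
print(list(dropped.columns), list(dropped.index))
```

Whether the extra 'index' column is present matters for position-based column selection with iloc, since it shifts every other column one place to the right.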

midage.loc[[100],['Pclass','Sex']]
     Pclass   Sex
100       2  male
1.5.4 Task 4: Use loc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage
# Write your code here
# Passing lists of labels returns a DataFrame, so the result is displayed as a table
midage.loc[[100,105,108],['Pclass','Name','Sex']]
     Pclass                               Name   Sex
100       2  Byles, Rev. Thomas Roussel Davids  male
105       3           Cribb, Mr. John Hatfield  male
108       3                    Calic, Mr. Jovo  male
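1.5.5 Task 5: Use iloc to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage
# Write your code here
# iloc takes integer positions for both rows and columns; column names are not accepted
midage.iloc[[100,105,108],[4,5,6]]
     Pclass                               Name   Sex
100       2  Byles, Rev. Thomas Roussel Davids  male
105       3           Cribb, Mr. John Hatfield  male
108       3                    Calic, Mr. Jovo  male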

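[Think] Compare iloc and loc: how are they alike and how do they differ?
Both select rows and columns; iloc selects by integer position, whereas loc selects by index label (and column name).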
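
The distinction is easiest to see when row labels and row positions differ. The sketch below assumes a made-up three-row frame whose index is 10, 20, 30.

```python
import pandas as pd

# Row labels deliberately differ from row positions.
df = pd.DataFrame({"Pclass": [3, 1, 2], "Sex": ["male", "female", "male"]},
                  index=[10, 20, 30])

by_label = df.loc[20, "Sex"]     # row labelled 20, column named 'Sex'
by_position = df.iloc[1, 1]      # second row, second column (0-based positions)
print(by_label, by_position)

# Slices also differ: loc includes the end label, iloc excludes the end position.
print(len(df.loc[10:20]))
print(len(df.iloc[0:1]))
```

After reset_index() the labels and positions coincide, which is why loc and iloc return the same rows in Tasks 4 and 5.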

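Review: In the previous sections we learned the basics of pandas and how to read CSV data and add, delete, look up, and modify it. Today's topic is exploratory data analysis: how to use pandas for sorting and arithmetic, and how to use the describe() function for descriptive statistics.

1 Chapter 1: Exploratory Data Analysis

Before starting, import numpy and pandas and load the data
# Load the required libraries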
import numpy as np
import pandas as pd
# Load the train_chinese.csv data saved earlier; we use it for the Titanic tasks
train_data = pd.read_csv('train_Chinese.csv')

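1.6 Do you know your data?

Textbook: Python for Data Analysis, Chapter 5

1.6.1 Task 1: Sort sample data with pandas in ascending order
# For details, see the "Sorting and Ranking" section in Chapter 5 of Python for Data Analysis

# Build your own DataFrame containing only numbers

'''
Our example:
pd.DataFrame(): creates a DataFrame object
np.arange(8).reshape((2, 4)): builds a 2x4 two-dimensional array; first row: 0, 1, 2, 3; second row: 4, 5, 6, 7
index=[2, 1]: the row labels of the DataFrame
columns=['d', 'a', 'b', 'c']: the column labels of the DataFrame
'''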
frame = pd.DataFrame(np.arange(8).reshape(2,4),index=[2,1],columns=['d','a','b','c'])
frame

   d  a  b  c
2  0  1  2  3
1  4  5  6  7

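[Code walkthrough]

pd.DataFrame(): creates a DataFrame object

np.arange(8).reshape((2, 4)): builds a 2x4 two-dimensional array; first row: 0, 1, 2, 3; second row: 4, 5, 6, 7

index=[2, 1]: the row labels of the DataFrame

columns=['d', 'a', 'b', 'c']: the column labels of the DataFrame

[Question] Most of the time we want to sort by the values in a column, so sort the DataFrame you built by one of its columns in ascending order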

# Answer
frame.sort_values(by = 'c',ascending = True)
   d  a  b  c
2  0  1  2  3
1  4  5  6  7

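[Think] From the book, can you name other ways pandas can sort a DataFrame?
sort_index() sorts by the index; with axis=1 it sorts the column labels instead of the row labels.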

frame.sort_index()
   d  a  b  c
1  4  5  6  7
2  0  1  2  3

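[Summary] The different ways of sorting are summarized below

1. Sort the row index in ascending order

# Code
frame.sort_index()
   d  a  b  c
1  4  5  6  7
2  0  1  2  3

2. Sort the column index in ascending order

# Code
frame.sort_index(axis=1)
   a  b  c  d
2  1  2  3  0
1  5  6  7  4

3. Sort the column index in descending order

# Code
frame.sort_index(axis=1,ascending=False)
   d  c  b  a
2  0  3  2  1
1  4  7  6  5

4. Sort by any two columns at once, in descending order

# Code
frame.sort_values(['a','c'],ascending=False)
   d  a  b  c
1  4  5  6  7
2  0  1  2  3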
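
When sorting by two columns, the first column decides the order and the second only breaks ties. The sketch below uses a made-up four-row frame (columns fare and age are invented stand-ins) and also shows that ascending accepts one flag per column.

```python
import pandas as pd

df = pd.DataFrame({"fare": [10.0, 30.0, 10.0, 30.0],
                   "age": [40, 25, 18, 60]})

# Sort by fare first; ties on fare are then ordered by age.
both_desc = df.sort_values(["fare", "age"], ascending=False)

# A list for `ascending` sets the direction per column:
# fare descending, age ascending within equal fares.
mixed = df.sort_values(["fare", "age"], ascending=[False, True])

print(list(both_desc.index))
print(list(mixed.index))
```

The printed index lists show the original row labels in their new order, which makes it easy to see how ties were resolved.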
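1.6.2 Task 2: Sort the Titanic data (train.csv) by fare and age together (in descending order). What can you learn from the result?
'''
The train_chinese.csv data was imported at the start, and the loading process was covered earlier,
so we can sort on the target columns directly.
head(20): shows the first 20 rows
'''
train_data.head(20)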
    乘客ID  是否幸存  仓位等级                                               姓名    性别  年龄  兄弟姐妹个数  父母子女个数          船票信息     票价  客舱  登船港口
0        1         0         3                            Braund, Mr. Owen Harris    male  22.0             1             0         A/5 21171   7.2500   NaN         S
1        2         1         1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0             1             0          PC 17599  71.2833   C85         C
2        3         1         3                             Heikkinen, Miss. Laina  female  26.0             0             0  STON/O2. 3101282   7.9250   NaN         S
3        4         1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0             1             0            113803  53.1000  C123         S
4        5         0         3                           Allen, Mr. William Henry    male  35.0             0             0            373450   8.0500   NaN         S
5        6         0         3                                   Moran, Mr. James    male   NaN             0             0            330877   8.4583   NaN         Q
6        7         0         1                            McCarthy, Mr. Timothy J    male  54.0             0             0             17463  51.8625   E46         S
7        8         0         3                     Palsson, Master. Gosta Leonard    male   2.0             3             1            349909  21.0750   NaN         S
8        9         1         3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0             0             2            347742  11.1333   NaN         S
9       10         1         2                Nasser, Mrs. Nicholas (Adele Achem)  female  14.0             1             0            237736  30.0708   NaN         C
10      11         1         3                    Sandstrom, Miss. Marguerite Rut  female   4.0             1             1           PP 9549  16.7000    G6         S
11      12         1         1                           Bonnell, Miss. Elizabeth  female  58.0             0             0            113783  26.5500  C103         S
12      13         0         3                     Saundercock, Mr. William Henry    male  20.0             0             0         A/5. 2151   8.0500   NaN         S
13      14         0         3                        Andersson, Mr. Anders Johan    male  39.0             1             5            347082  31.2750   NaN         S
14      15         0         3               Vestrom, Miss. Hulda Amanda Adolfina  female  14.0             0             0            350406   7.8542   NaN         S
15      16         1         2                    Hewlett, Mrs. (Mary D Kingcome)  female  55.0             0             0            248706  16.0000   NaN         S
16      17         0         3                               Rice, Master. Eugene    male   2.0             4             1            382652  29.1250   NaN         Q
17      18         1         2                       Williams, Mr. Charles Eugene    male   NaN             0             0            244373  13.0000   NaN         S
18      19         0         3  Vander Planke, Mrs. Julius (Emelia Maria Vande...  female  31.0             1             0            345763  18.0000   NaN         S
19      20         1         3                            Masselmani, Mrs. Fatima  female   NaN             0             0              2649   7.2250   NaN         C
#代码train_data.sort_values(['票价','年龄'],ascending=False)
  乘客ID  是否幸存  仓位等级  姓名  性别  年龄  兄弟姐妹个数  父母子女个数  船票信息  票价  客舱  登船港口
679  680  1  1  Cardeza, Mr. Thomas Drake Martinez  male  36.0  0  1  PC 17755  512.3292  B51 B53 B55  C
258  259  1  1  Ward, Miss. Anna  female  35.0  0  0  PC 17755  512.3292  NaN  C
737  738  1  1  Lesurer, Mr. Gustave J  male  35.0  0  0  PC 17755  512.3292  B101  C
438  439  0  1  Fortune, Mr. Mark  male  64.0  1  4  19950  263.0000  C23 C25 C27  S
341  342  1  1  Fortune, Miss. Alice Elizabeth  female  24.0  3  2  19950  263.0000  C23 C25 C27  S
88  89  1  1  Fortune, Miss. Mabel Helen  female  23.0  3  2  19950  263.0000  C23 C25 C27  S
27  28  0  1  Fortune, Mr. Charles Alexander  male  19.0  3  2  19950  263.0000  C23 C25 C27  S
742  743  1  1  Ryerson, Miss. Susan Parker "Suzette"  female  21.0  2  2  PC 17608  262.3750  B57 B59 B63 B66  C
311  312  1  1  Ryerson, Miss. Emily Borie  female  18.0  2  2  PC 17608  262.3750  B57 B59 B63 B66  C
299  300  1  1  Baxter, Mrs. James (Helene DeLaudeniere Chaput)  female  50.0  0  1  PC 17558  247.5208  B58 B60  C
118  119  0  1  Baxter, Mr. Quigg Edmond  male  24.0  0  1  PC 17558  247.5208  B58 B60  C
380  381  1  1  Bidois, Miss. Rosalie  female  42.0  0  0  PC 17757  227.5250  NaN  C
716  717  1  1  Endres, Miss. Caroline Louise  female  38.0  0  0  PC 17757  227.5250  C45  C
700  701  1  1  Astor, Mrs. John Jacob (Madeleine Talmadge Force)  female  18.0  1  0  PC 17757  227.5250  C62 C64  C
557  558  0  1  Robbins, Mr. Victor  male  NaN  0  0  PC 17757  227.5250  NaN  C
527  528  0  1  Farthing, Mr. John  male  NaN  0  0  PC 17483  221.7792  C95  S
377  378  0  1  Widener, Mr. Harry Elkins  male  27.0  0  2  113503  211.5000  C82  C
779  780  1  1  Robert, Mrs. Edward Scott (Elisabeth Walton Mc...  female  43.0  0  1  24160  211.3375  B3  S
730  731  1  1  Allen, Miss. Elisabeth Walton  female  29.0  0  0  24160  211.3375  B5  S
689  690  1  1  Madill, Miss. Georgette Alexandra  female  15.0  0  1  24160  211.3375  B5  S
856  857  1  1  Wick, Mrs. George Dennick (Mary Hitchcock)  female  45.0  1  1  36928  164.8667  NaN  S
318  319  1  1  Wick, Miss. Mary Natalie  female  31.0  0  2  36928  164.8667  C7  S
268  269  1  1  Graham, Mrs. William Thompson (Edith Junkins)  female  58.0  0  1  PC 17582  153.4625  C125  S
609  610  1  1  Shutes, Miss. Elizabeth W  female  40.0  0  0  PC 17582  153.4625  C125  S
332  333  0  1  Graham, Mr. George Edward  male  38.0  0  1  PC 17582  153.4625  C91  S
498  499  0  1  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female  25.0  1  2  113781  151.5500  C22 C26  S
708  709  1  1  Cleaver, Miss. Alice  female  22.0  0  0  113781  151.5500  NaN  S
297  298  0  1  Allison, Miss. Helen Loraine  female  2.0  1  2  113781  151.5500  C22 C26  S
305  306  1  1  Allison, Master. Hudson Trevor  male  0.92  1  2  113781  151.5500  C22 C26  S
195  196  1  1  Lurette, Miss. Elise  female  58.0  0  0  PC 17569  146.5208  B80  C
...
611  612  0  3  Jardin, Mr. Jose Neto  male  NaN  0  0  SOTON/O.Q. 3101305  7.0500  NaN  S
477  478  0  3  Braund, Mr. Lewis Richard  male  29.0  1  0  3460  7.0458  NaN  S
129  130  0  3  Ekstrom, Mr. Johan  male  45.0  0  0  347061  6.9750  NaN  S
804  805  1  3  Hedman, Mr. Oskar Arvid  male  27.0  0  0  347089  6.9750  NaN  S
825  826  0  3  Flynn, Mr. John  male  NaN  0  0  368323  6.9500  NaN  Q
411  412  0  3  Hart, Mr. Henry  male  NaN  0  0  394140  6.8583  NaN  Q
143  144  0  3  Burke, Mr. Jeremiah  male  19.0  0  0  365222  6.7500  NaN  Q
654  655  0  3  Hegarty, Miss. Hanora "Nora"  female  18.0  0  0  365226  6.7500  NaN  Q
202  203  0  3  Johanson, Mr. Jakob Alfred  male  34.0  0  0  3101264  6.4958  NaN  S
371  372  0  3  Wiklund, Mr. Jakob Alfred  male  18.0  1  0  3101267  6.4958  NaN  S
818  819  0  3  Holm, Mr. John Fredrik Alexander  male  43.0  0  0  C 7075  6.4500  NaN  S
843  844  0  3  Lemberopolous, Mr. Peter L  male  34.5  0  0  2683  6.4375  NaN  C
326  327  0  3  Nysveen, Mr. Johan Hansen  male  61.0  0  0  345364  6.2375  NaN  S
872  873  0  1  Carlsson, Mr. Frans Olof  male  33.0  0  0  695  5.0000  B51 B53 B55  S
378  379  0  3  Betros, Mr. Tannous  male  20.0  0  0  2648  4.0125  NaN  C
597  598  0  3  Johnson, Mr. Alfred  male  49.0  0  0  LINE  0.0000  NaN  S
263  264  0  1  Harrison, Mr. William  male  40.0  0  0  112059  0.0000  B94  S
806  807  0  1  Andrews, Mr. Thomas Jr  male  39.0  0  0  112050  0.0000  A36  S
822  823  0  1  Reuchlin, Jonkheer. John George  male  38.0  0  0  19972  0.0000  NaN  S
179  180  0  3  Leonard, Mr. Lionel  male  36.0  0  0  LINE  0.0000  NaN  S
271  272  1  3  Tornquist, Mr. William Henry  male  25.0  0  0  LINE  0.0000  NaN  S
302  303  0  3  Johnson, Mr. William Cahoone Jr  male  19.0  0  0  LINE  0.0000  NaN  S
277  278  0  2  Parkes, Mr. Francis "Frank"  male  NaN  0  0  239853  0.0000  NaN  S
413  414  0  2  Cunningham, Mr. Alfred Fleming  male  NaN  0  0  239853  0.0000  NaN  S
466  467  0  2  Campbell, Mr. William  male  NaN  0  0  239853  0.0000  NaN  S
481  482  0  2  Frost, Mr. Anthony Wood "Archie"  male  NaN  0  0  239854  0.0000  NaN  S
633  634  0  1  Parr, Mr. William Henry Marsh  male  NaN  0  0  112052  0.0000  NaN  S
674  675  0  2  Watson, Mr. Ennis Hastings  male  NaN  0  0  239856  0.0000  NaN  S
732  733  0  2  Knight, Mr. Robert J  male  NaN  0  0  239855  0.0000  NaN  S
815  816  0  1  Fry, Mr. Richard  male  NaN  0  0  112058  0.0000  B102  S

891 rows × 12 columns

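【Think】After sorting, suppose we focus only on the age and fare columns. Common sense suggests that a higher fare should mean a better cabin, and the sorted table shows that 14 of the 20 passengers with the highest fares survived, which is a high proportion. Could we then go on to analyze the relationship between fare and survival, and between age and survival? Once you start noticing relationships in the data, data analysis has begun.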
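
The survivor count among the highest fares can be checked in code rather than by eye. The following is only a minimal sketch: `sample` is a hypothetical toy DataFrame with made-up values that reuses the column names of train_data, and `n` is an illustrative cutoff. On the real data, `sample` would be replaced by `train_data` and `n` by 20.

```python
import pandas as pd

# Hypothetical toy data (made-up values) with the same column names as train_data.
sample = pd.DataFrame({
    '票价': [512.33, 263.0, 263.0, 7.25, 8.05, 0.0],
    '年龄': [36.0, 64.0, 24.0, 22.0, 35.0, 40.0],
    '是否幸存': [1, 0, 1, 0, 0, 0],
})

n = 3  # how many of the most expensive tickets to inspect
top_n = sample.sort_values(['票价', '年龄'], ascending=False).head(n)

survivors = int(top_n['是否幸存'].sum())  # number of survivors among the top n fares
rate = top_n['是否幸存'].mean()           # share of survivors among the top n fares
print(f"{survivors} of {n} survived ({rate:.0%})")
```

Because 是否幸存 is a 0/1 column, its sum is a count and its mean is a rate, so the same two lines work for any subset of rows.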

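Of course, this is only my own idea. You may have many more, and you are welcome to write them in your study notes.
For example: the relationship between survival and sex.

Try sorting by a few more columns.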

# Code
train_data.sort_values(['兄弟姐妹个数', '父母子女个数', '性别'], ascending=False).head(20)
  乘客ID  是否幸存  仓位等级  姓名  性别  年龄  兄弟姐妹个数  父母子女个数  船票信息  票价  客舱  登船港口
159  160  0  3  Sage, Master. Thomas Henry  male  NaN  8  2  CA. 2343  69.5500  NaN  S
201  202  0  3  Sage, Mr. Frederick  male  NaN  8  2  CA. 2343  69.5500  NaN  S
324  325  0  3  Sage, Mr. George John Jr  male  NaN  8  2  CA. 2343  69.5500  NaN  S
846  847  0  3  Sage, Mr. Douglas Bullen  male  NaN  8  2  CA. 2343  69.5500  NaN  S
180  181  0  3  Sage, Miss. Constance Gladys  female  NaN  8  2  CA. 2343  69.5500  NaN  S
792  793  0  3  Sage, Miss. Stella Anna  female  NaN  8  2  CA. 2343  69.5500  NaN  S
863  864  0  3  Sage, Miss. Dorothy Edith "Dolly"  female  NaN  8  2  CA. 2343  69.5500  NaN  S
59  60  0  3  Goodwin, Master. William Frederick  male  11.0  5  2  CA 2144  46.9000  NaN  S
386  387  0  3  Goodwin, Master. Sidney Leonard  male  1.0  5  2  CA 2144  46.9000  NaN  S
480  481  0  3  Goodwin, Master. Harold Victor  male  9.0  5  2  CA 2144  46.9000  NaN  S
683  684  0  3  Goodwin, Mr. Charles Edward  male  14.0  5  2  CA 2144  46.9000  NaN  S
71  72  0  3  Goodwin, Miss. Lillian Amy  female  16.0  5  2  CA 2144  46.9000  NaN  S
182  183  0  3  Asplund, Master. Clarence Gustaf Hugo  male  9.0  4  2  347077  31.3875  NaN  S
261  262  1  3  Asplund, Master. Edvin Rojj Felix  male  3.0  4  2  347077  31.3875  NaN  S
850  851  0  3  Andersson, Master. Sigvard Harald Elias  male  4.0  4  2  347082  31.2750  NaN  S
68  69  1  3  Andersson, Miss. Erna Alexandra  female  17.0  4  2  3101281  7.9250  NaN  S
119  120  0  3  Andersson, Miss. Ellis Anna Maria  female  2.0  4  2  347082  31.2750  NaN  S
233  234  1  3  Asplund, Miss. Lillian Gertrud  female  5.0  4  2  347077  31.3875  NaN  S
541  542  0  3  Andersson, Miss. Ingeborg Constanzia  female  9.0  4  2  347082  31.2750  NaN  S
542  543  0  3  Andersson, Miss. Sigrid Elisabeth  female  11.0  4  2  347082  31.2750  NaN  S
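# Write down your thoughts
The more siblings a passenger has aboard, the lower the survival rate appears to be; men may also have a lower survival rate than women.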
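
An impression taken from 20 sorted rows can be tested by grouping instead of reading the table. A minimal sketch, again on a hypothetical toy DataFrame (`sample`, made-up values) that reuses the column names of train_data; replacing `sample` with `train_data` would apply the same check to the real data.

```python
import pandas as pd

# Hypothetical toy data (made-up values) with train_data's column names.
sample = pd.DataFrame({
    '性别': ['male', 'male', 'female', 'female', 'male', 'female'],
    '兄弟姐妹个数': [8, 0, 0, 1, 4, 0],
    '是否幸存': [0, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the survival rate within each group.
rate_by_sex = sample.groupby('性别')['是否幸存'].mean()
rate_by_sibsp = sample.groupby('兄弟姐妹个数')['是否幸存'].mean()

print(rate_by_sex)
print(rate_by_sibsp)
```

Groups with very few passengers give unstable rates, so it may be worth looking at the group sizes (for example with `groupby(...).size()`) before drawing conclusions.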
1.6.3 Task 3: Use Pandas for arithmetic and compute the sum of two DataFrames
# See the "Arithmetic and Data Alignment" section in Chapter 5 of Python for Data Analysis (《利用Python进行数据分析》)
# Build two all-numeric DataFrames of your own
"""
Here is an example:
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])
frame1_a
"""
# Code
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])

Add frame1_a and frame1_b

# Code
frame1_a
         a    b    c
one    0.0  1.0  2.0
two    3.0  4.0  5.0
three  6.0  7.0  8.0

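【Reminder】Adding two DataFrames returns a new DataFrame. Values whose row and column labels match in both frames are added; wherever a label is missing from either frame, the result is NaN.

DataFrames support many other arithmetic operations, such as subtraction and division. If you are interested, see the "Arithmetic and Data Alignment" section in Chapter 5 of Python for Data Analysis (《利用Python进行数据分析》) and look for further learning material online.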

frame1_b
          a     e     c
first   0.0   1.0   2.0
one     3.0   4.0   5.0
two     6.0   7.0   8.0
second  9.0  10.0  11.0
frame1_a + frame1_b
          a   b     c   e
first   NaN NaN   NaN NaN
one     3.0 NaN   7.0 NaN
second  NaN NaN   NaN NaN
three   NaN NaN   NaN NaN
two     9.0 NaN  13.0 NaN
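
When the NaN produced by misaligned labels is not wanted, one option is the `add` method with `fill_value`. The sketch below reuses the two frames from this task; `plain_sum` and `filled_sum` are names chosen only for this illustration.

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])

# Plain "+" gives NaN wherever either frame lacks the (row, column) label.
plain_sum = frame1_a + frame1_b

# add(..., fill_value=0) treats a value missing from ONE frame as 0;
# cells missing from BOTH frames (e.g. row 'three', column 'e') stay NaN.
filled_sum = frame1_a.add(frame1_b, fill_value=0)
print(filled_sum)
```

Whether filling with 0 is appropriate depends on what a missing label means in the data; for counts it is often reasonable, for measurements it may hide gaps.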
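1.6.4 Task 4: Using the Titanic data, how many people were in the largest family on board?
'''
We continue with the train_chinese.csv data imported earlier.
If we want to know how large the largest family on board was ('兄弟姐妹个数' + '父母子女个数'), what should we do?
'''
max(train_data['兄弟姐妹个数'] + train_data['父母子女个数'])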
10

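【Reminder】We only need to find the largest value of '兄弟姐妹个数' + '父母子女个数'. Note that this sum counts a passenger's relatives aboard and does not include the passenger. There are many other methods and angles you could take, and you are welcome to share your own view.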
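
The same calculation can also report which passenger reaches the maximum. A minimal sketch on a hypothetical toy DataFrame (`sample`, made-up values and placeholder names) with the column names of train_data; adding 1 for the passenger is an assumption about how "family size" is defined, not something the original task specifies.

```python
import pandas as pd

# Hypothetical toy data (made-up values) with train_data's column names.
sample = pd.DataFrame({
    '姓名': ['A', 'B', 'C', 'D'],
    '兄弟姐妹个数': [1, 8, 0, 4],
    '父母子女个数': [0, 2, 0, 2],
})

relatives = sample['兄弟姐妹个数'] + sample['父母子女个数']  # relatives aboard, per passenger
largest = relatives.max()                      # same result as the built-in max(...)
who = sample.loc[relatives.idxmax(), '姓名']   # first passenger reaching that maximum
family_size = largest + 1                      # assumption: count the passenger too

print(largest, who, family_size)
```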

Try adding a few more columns together and see what you can learn from the results.

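1.6.5 Task 5: Learn to use the Pandas describe() function to view basic statistics
# (1) Work through the key example once (simple data)
# See the "Summarizing and Computing Descriptive Statistics" section in Chapter 5 of Python for Data Analysis (《利用Python进行数据分析》)
# Build a DataFrame of your own that contains both numbers and null values
"""
Here is an example:
frame2 = pd.DataFrame([[1.4, np.nan],
                       [7.1, -4.5],
                       [np.nan, np.nan],
                       [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2
"""
# Code
frame2 = pd.DataFrame([[1.4, np.nan],
                       [7.1, -4.5],
                       [np.nan, np.nan],
                       [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2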
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3

Call the describe function and look at the basic statistics of frame2

# Code
frame2.describe()
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
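
The count row is lower than the number of rows because describe() skips NaN values. A minimal sketch that checks this on the same frame2; the variable names `n_rows`, `non_null`, `missing` and `mean_one` are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1.4, np.nan],
                       [7.1, -4.5],
                       [np.nan, np.nan],
                       [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

n_rows = len(frame2)             # 4 rows in total
non_null = frame2.count()        # non-NaN values per column, the "count" row of describe()
missing = frame2.isna().sum()    # NaN values per column
mean_one = frame2['one'].mean()  # NaN is skipped: (1.4 + 7.1 + 0.75) / 3

print(non_null, missing, mean_one, sep='\n')
```

Comparing `count` with `len` is a quick way to spot columns with missing values before interpreting the other statistics.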
1.6.6 Task 6: Look at the basic statistics of the 票价 (fare) and 父母子女个数 (parents/children) columns of the Titanic dataset. What do you notice?
'''
Look at the basic statistics of the 票价 column of the Titanic dataset
'''
# Code
train_data['票价'].describe()
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: 票价, dtype: float64
train_data['父母子女个数'].describe()
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: 父母子女个数, dtype: float64

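【Think】What can we see from the data above? Try writing down your own view first, then compare it with the answer we give.
【Think】From the data above we can see that:
there are 891 fare values in total,
the mean is about 32.20,
the standard deviation is about 49.69, larger than the mean, which shows that fares vary widely,
25% of passengers paid no more than 7.91, 50% paid no more than 14.45, and 75% paid no more than 31.00,
the maximum fare is about 512.33 and the minimum is 0.

The 75th percentile of 父母子女个数 is 0, so at least 75% of passengers had no parents or children aboard, which suggests that most passengers were not traveling with parents or children.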
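
A percentile of 0 only gives a lower bound on the share of zeros, so the exact share can be computed directly. A minimal sketch on a hypothetical toy Series (`parch`, made-up values) standing in for train_data['父母子女个数'].

```python
import pandas as pd

# Hypothetical toy column (made-up values) standing in for train_data['父母子女个数'].
parch = pd.Series([0, 0, 0, 0, 0, 0, 0, 2], name='父母子女个数')

share_zero = (parch == 0).mean()  # fraction of passengers with no parents/children aboard
q75 = parch.quantile(0.75)        # the "75%" row of describe()

print(share_zero, q75)
```

In this toy column the 75th percentile is 0 while the actual share of zeros is higher, which is why the exact share is worth computing when it matters.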

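Of course, this answer is only my own idea. You may have many more, and you are welcome to write them in your study notes.

Compute statistics for a few more groups of data and see what you can learn from them.

# Write down your other analyses

【Think】If you have more ideas, you are welcome to write them in your study notes.

【Summary】In this section we used some of Pandas' built-in functions to take a first statistical look at the data. The most important part of this process is not memorizing the functions but understanding the numbers they produce and building your own data-analysis mindset. This is also the key point of Chapter 1: we hope that after finishing it you have a basic feel for the data and understand what you are doing and why. In the following chapters we will start cleaning the data and analyzing it further.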
