1 Chapter 1: Data Loading and Initial Observation
1.1 Loading the Data
Dataset download: https://www.kaggle.com/c/titanic/overview
1.1.1 Task 1: Import numpy and pandas
# Write your code here
import numpy as np
import pandas as pd
import os
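【Tip】If the import fails, learn how to install the numpy and pandas libraries in your Python environment.
1.1.2 Task 2: Load the data
(1) Load the data using a relative path
(2) Load the data using an absolute path
# Write your code here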
test_data = pd.read_csv('test_1.csv')
f = open('E://study//master3//数据分析//DataWhale//Titanic//hands-on-data-analysis-master//hands-on-data-analysis-master//第一单元项目集合/train.csv')
train_data = pd.read_csv(f)
# test_data_t = pd.read_table('./test_1.csv')
# os.getcwd()
# test_data_t
train_data.head(5)
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
# Write your code here
test_data.head(3)
 | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | a |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100 |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100 |
【Tip】If loading with a relative path raises an error, use os.getcwd() to check the current working directory.
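When a relative path fails, it often helps to see which absolute path it actually resolves to. The following is a minimal sketch using only the standard library; the file name train.csv comes from the task above and is only checked for existence, not opened.

```python
import os

# Directory that relative paths are resolved against
cwd = os.getcwd()

# 'train.csv' is the file name used in this task; adjust it to your own layout
relative_path = 'train.csv'
absolute_path = os.path.abspath(relative_path)

print('Working directory:', cwd)
print('Resolves to:', absolute_path)
print('File exists:', os.path.exists(absolute_path))
```

If the last line prints False, either move the file into the working directory or pass the full path to pd.read_csv().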
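【Think】Now that you know how to load data, compare pd.read_csv() with pd.read_table(). What would you need to do to make them produce the same result? Find out how '.tsv' and '.csv' files differ and how to load each of them.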
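One possible answer, sketched on a tiny made-up sample held in memory rather than the real Titanic files: the two functions differ mainly in their default separator (a comma for read_csv, a tab for read_table), so passing sep explicitly makes them interchangeable.

```python
import io
import pandas as pd

# Tiny made-up sample in CSV form, and the same rows in TSV form
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
tsv_text = csv_text.replace(",", "\t")

# read_csv defaults to sep=',' while read_table defaults to sep='\t'
df_csv = pd.read_csv(io.StringIO(csv_text))
df_table = pd.read_table(io.StringIO(csv_text), sep=",")

# A .tsv file can be read by either function once the separator matches
df_tsv = pd.read_table(io.StringIO(tsv_text))
df_tsv_via_csv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(df_csv.equals(df_table))
print(df_tsv.equals(df_tsv_via_csv))
```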
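【Summary】Loading the data is the first step of all the work that follows. You will meet many data formats (e.g. .csv, .tsv, .xlsx), but the method and approach for loading them are the same. When you run into a problem you have not seen before, at work or in a project, look things up (for example with Google), understand the business logic, and be clear about what the inputs and outputs are.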
1.1.3 Task 3: Read the data in chunks of 1000 rows each
# Write your code here
chunker = pd.read_csv('train.csv', chunksize=1000)
for piece in chunker:
    print(type(piece))
    print(len(piece))
    print(piece)
<class 'pandas.core.frame.DataFrame'>
891
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
5 6 0 3
6 7 0 1
7 8 0 3
8 9 1 3
9 10 1 2
10 11 1 3
11 12 1 1
12 13 0 3
13 14 0 3
14 15 0 3
15 16 1 2
16 17 0 3
17 18 1 2
18 19 0 3
19 20 1 3
20 21 0 2
21 22 1 2
22 23 1 3
23 24 1 1
24 25 0 3
25 26 1 3
26 27 0 3
27 28 0 1
28 29 1 3
29 30 0 3
.. ... ... ...
861 862 0 2
862 863 1 1
863 864 0 3
864 865 0 2
865 866 1 2
866 867 1 2
867 868 0 1
868 869 0 3
869 870 1 3
870 871 0 3
871 872 1 1
872 873 0 1
873 874 0 3
874 875 1 2
875 876 1 3
876 877 0 3
877 878 0 3
878 879 0 3
879 880 1 1
880 881 1 2
881 882 0 3
882 883 0 3
883 884 0 2
884 885 0 3
885 886 0 3
886 887 0 2
887 888 1 1
888 889 0 3
889 890 1 1
890 891 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
5 Moran, Mr. James male NaN 0
6 McCarthy, Mr. Timothy J male 54.0 0
7 Palsson, Master. Gosta Leonard male 2.0 3
8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0
9 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1
10 Sandstrom, Miss. Marguerite Rut female 4.0 1
11 Bonnell, Miss. Elizabeth female 58.0 0
12 Saundercock, Mr. William Henry male 20.0 0
13 Andersson, Mr. Anders Johan male 39.0 1
14 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0
15 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0
16 Rice, Master. Eugene male 2.0 4
17 Williams, Mr. Charles Eugene male NaN 0
18 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1
19 Masselmani, Mrs. Fatima female NaN 0
20 Fynney, Mr. Joseph J male 35.0 0
21 Beesley, Mr. Lawrence male 34.0 0
22 McGowan, Miss. Anna "Annie" female 15.0 0
23 Sloper, Mr. William Thompson male 28.0 0
24 Palsson, Miss. Torborg Danira female 8.0 3
25 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1
26 Emir, Mr. Farred Chehab male NaN 0
27 Fortune, Mr. Charles Alexander male 19.0 3
28 O'Dwyer, Miss. Ellen "Nellie" female NaN 0
29 Todoroff, Mr. Lalio male NaN 0
.. ... ... ... ...
861 Giles, Mr. Frederick Edward male 21.0 1
862 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0
863 Sage, Miss. Dorothy Edith "Dolly" female NaN 8
864 Gill, Mr. John William male 24.0 0
865 Bystrom, Mrs. (Karolina) female 42.0 0
866 Duran y More, Miss. Asuncion female 27.0 1
867 Roebling, Mr. Washington Augustus II male 31.0 0
868 van Melkebeke, Mr. Philemon male NaN 0
869 Johnson, Master. Harold Theodor male 4.0 1
870 Balkic, Mr. Cerin male 26.0 0
871 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1
872 Carlsson, Mr. Frans Olof male 33.0 0
873 Vander Cruyssen, Mr. Victor male 47.0 0
874 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1
875 Najib, Miss. Adele Kiamie "Jane" female 15.0 0
876 Gustafsson, Mr. Alfred Ossian male 20.0 0
877 Petroff, Mr. Nedelio male 19.0 0
878 Laleff, Mr. Kristo male NaN 0
879 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0
880 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0
881 Markun, Mr. Johann male 33.0 0
882 Dahlberg, Miss. Gerda Ulrika female 22.0 0
883 Banfield, Mr. Frederick James male 28.0 0
884 Sutehall, Mr. Henry Jr male 25.0 0
885 Rice, Mrs. William (Margaret Norton) female 39.0 0
886 Montvila, Rev. Juozas male 27.0 0
887 Graham, Miss. Margaret Edith female 19.0 0
888 Johnston, Miss. Catherine Helen "Carrie" female NaN 1
889 Behr, Mr. Karl Howell male 26.0 0
890 Dooley, Mr. Patrick male 32.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
5 0 330877 8.4583 NaN Q
6 0 17463 51.8625 E46 S
7 1 349909 21.0750 NaN S
8 2 347742 11.1333 NaN S
9 0 237736 30.0708 NaN C
10 1 PP 9549 16.7000 G6 S
11 0 113783 26.5500 C103 S
12 0 A/5. 2151 8.0500 NaN S
13 5 347082 31.2750 NaN S
14 0 350406 7.8542 NaN S
15 0 248706 16.0000 NaN S
16 1 382652 29.1250 NaN Q
17 0 244373 13.0000 NaN S
18 0 345763 18.0000 NaN S
19 0 2649 7.2250 NaN C
20 0 239865 26.0000 NaN S
21 0 248698 13.0000 D56 S
22 0 330923 8.0292 NaN Q
23 0 113788 35.5000 A6 S
24 1 349909 21.0750 NaN S
25 5 347077 31.3875 NaN S
26 0 2631 7.2250 NaN C
27 2 19950 263.0000 C23 C25 C27 S
28 0 330959 7.8792 NaN Q
29 0 349216 7.8958 NaN S
.. ... ... ... ... ...
861 0 28134 11.5000 NaN S
862 0 17466 25.9292 D17 S
863 2 CA. 2343 69.5500 NaN S
864 0 233866 13.0000 NaN S
865 0 236852 13.0000 NaN S
866 0 SC/PARIS 2149 13.8583 NaN C
867 0 PC 17590 50.4958 A24 S
868 0 345777 9.5000 NaN S
869 1 347742 11.1333 NaN S
870 0 349248 7.8958 NaN S
871 1 11751 52.5542 D35 S
872 0 695 5.0000 B51 B53 B55 S
873 0 345765 9.0000 NaN S
874 0 P/PP 3381 24.0000 NaN C
875 0 2667 7.2250 NaN C
876 0 7534 9.8458 NaN S
877 0 349212 7.8958 NaN S
878 0 349217 7.8958 NaN S
879 1 11767 83.1583 C50 C
880 1 230433 26.0000 NaN S
881 0 349257 7.8958 NaN S
882 0 7552 10.5167 NaN S
883 0 C.A./SOTON 34068 10.5000 NaN S
884 0 SOTON/OQ 392076 7.0500 NaN S
885 5 382652 29.1250 NaN Q
886 0 211536 13.0000 NaN S
887 0 112053 30.0000 B42 S
888 2 W./C. 6607 23.4500 NaN S
889 0 111369 30.0000 C148 C
890 0 370376 7.7500 NaN Q
[891 rows x 12 columns]
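【Think】What is chunked reading? Why read in chunks?
Chunked reading splits the file into pieces and processes chunksize rows at a time. With chunksize set, pd.read_csv() returns an iterable TextFileReader object (called TextParser in some older material) instead of a DataFrame. Iterating over it yields the chunks one by one, so statistics can be computed per chunk and then combined.
It is useful when the file is too large to load at once, when only part of the data is needed, or when the data has to be processed piece by piece. Note that train.csv has only 891 rows, so chunksize=1000 yields a single chunk, as the output above shows.
【Tip】What type is chunker (the chunk reader)? What does each piece look like when printed in a for loop?
Each piece is a DataFrame.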
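To see more than one chunk, the chunk size has to be smaller than the file. The sketch below uses a made-up ten-row table in memory and a chunksize of 4 (both are illustrative assumptions, not the tutorial's data) and combines a per-chunk statistic into a total.

```python
import io
import pandas as pd

# Made-up ten-row CSV: passengers 1..10, odd ids marked as survived
csv_text = "PassengerId,Survived\n" + "".join(
    f"{i},{i % 2}\n" for i in range(1, 11)
)

# With chunksize set, read_csv returns an iterator instead of a DataFrame
chunker = pd.read_csv(io.StringIO(csv_text), chunksize=4)
print(type(chunker).__name__)

chunk_sizes = []
survived_total = 0
for piece in chunker:
    # Each piece is an ordinary DataFrame with at most 4 rows
    chunk_sizes.append(len(piece))
    survived_total += int(piece["Survived"].sum())

print(chunk_sizes)
print(survived_total)
```

Only one chunk is held in memory at a time, which is the point of the technique for files that do not fit in memory.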
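1.1.4 Task 4: Change the column headers to Chinese and set the index to the passenger ID [for English-language material, translating the headers can make the data more intuitive to work with]
PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (whether the passenger survived)
Pclass => 仓位等级 (passenger class: 1st/2nd/3rd)
Name => 姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 兄弟姐妹个数 (number of siblings/spouses aboard)
Parch => 父母子女个数 (number of parents/children aboard)
Ticket => 船票信息 (ticket information)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)
# Write your code here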
train_data = pd.read_csv('train.csv',names=['乘客ID','是否幸存','仓位等级','姓名','性别','年龄','兄弟姐妹个数','父母子女个数','船票信息','票价','客舱','登船港口'],index_col='乘客ID',header=0)
train_data.head(3)
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
【Think】One way to change the headers to Chinese is to replace the English column names while loading, as above. Are there other ways?
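One alternative, sketched here on a small made-up DataFrame, is to load the file with its original headers and rename afterwards with DataFrame.rename() and a mapping dictionary, then call set_index(). Only three of the twelve columns are shown to keep the example short.

```python
import pandas as pd

# Small made-up stand-in for the loaded Titanic data
df = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})

# English header -> Chinese header, using the names from the task above
mapping = {"PassengerId": "乘客ID", "Survived": "是否幸存", "Pclass": "仓位等级"}

# rename() returns a new DataFrame; the original df keeps its English headers
renamed = df.rename(columns=mapping).set_index("乘客ID")

print(renamed.columns.tolist())
print(renamed.index.name)
```

Compared with passing names= to read_csv, a mapping does not depend on column order, and columns missing from the mapping are simply left unchanged.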
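1.2 Initial Observation
After loading the data, you will usually want an overview of its overall structure and a few sample rows: for example, how large the data is, how many columns it has, what dtype each column has, and whether it contains nulls.
1.2.1 Task 1: View basic information about the data
# Write your code here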
train_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
是否幸存 891 non-null int64
仓位等级 891 non-null int64
姓名 891 non-null object
性别 891 non-null object
年龄 714 non-null float64
兄弟姐妹个数 891 non-null int64
父母子女个数 891 non-null int64
船票信息 891 non-null object
票价 891 non-null float64
客舱 204 non-null object
登船港口 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
【Tip】Several functions can do this; try summarizing them yourself.
train_data.describe()
 | 是否幸存 | 仓位等级 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 票价 |
---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
1.2.2 Task 2: View the first 10 rows and the last 15 rows of the table
# Write your code here
train_data.head(10)
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
# Write your code here
train_data.tail(15)
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
---|---|---|---|---|---|---|---|---|---|---|---|
877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
1.2.3 Task 3: Check whether the data is null, returning True where a value is null and False elsewhere
# Write your code here
train_data.isnull().head()
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | False | False | False | False | False | False | False | False | False | True | False |
2 | False | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | True | False |
4 | False | False | False | False | False | False | False | False | False | False | False |
5 | False | False | False | False | False | False | False | False | False | True | False |
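【Summary】All of the operations above are ways of observing the data itself before analysis.
【Think】From what other angles can a dataset be observed? Look for answers; they will help a great deal with the analysis that follows.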
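A few further angles, sketched on a small made-up DataFrame (the values are illustrative, not taken from train.csv): the shape, the number of missing values per column, and the frequency of each category.

```python
import numpy as np
import pandas as pd

# Small made-up sample with some missing values
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Sex": ["male", "female", "female", "female"],
})

# (rows, columns)
print(df.shape)

# Missing values per column: isnull() gives booleans, sum() counts the True values
missing = df.isnull().sum()
print(missing)

# How often each category occurs in a column
sex_counts = df["Sex"].value_counts()
print(sex_counts)
```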
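1.3 Saving the Data
1.3.1 Task 1: Save the data you loaded and modified as a new file, train_Chinese.csv, in the working directory
# Write your code here
# Note: depending on the operating system, the saved file may show garbled characters. Try passing encoding='GBK' or encoding='utf-8'.
train_data.to_csv('train_Chinese.csv',encoding='utf-8')
【Summary】That covers loading data and getting started. Next we turn to operating on the data itself, focusing on how numpy and pandas are used in real work and project settings.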
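A quick way to check a saved file is to read it straight back. The sketch below writes a small made-up DataFrame to a temporary directory (an assumption made so the example leaves no files behind) and shows that the saved index comes back as an ordinary column unless index_col is given.

```python
import os
import tempfile
import pandas as pd

# Small made-up frame indexed by passenger ID, like the renamed data above
df = pd.DataFrame({"是否幸存": [0, 1]}, index=pd.Index([1, 2], name="乘客ID"))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_Chinese.csv")
    # Use the same encoding for writing and reading
    df.to_csv(path, encoding="utf-8")

    # Without index_col the saved index is read back as a regular column
    plain = pd.read_csv(path, encoding="utf-8")
    # With index_col the original layout is restored
    indexed = pd.read_csv(path, encoding="utf-8", index_col="乘客ID")

print(plain.columns.tolist())
print(indexed.equals(df))
```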
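1.4 Know What Your Data Is Called
We are learning the basic operations of pandas. So what is the data type of the data we loaded with pandas in the previous section?
Before starting, import numpy and pandas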
import numpy as np
import pandas as pd
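1.4.1 Task 1: pandas has two main data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of each yourself [open-ended]
https://www.cnblogs.com/lavender1221/p/12664641.html#
pandas is built around three core data structures: Series, DataFrame, and Index. Most operations revolve around these three.
A Series is a one-dimensional array object that holds a sequence of values and a corresponding sequence of index labels. A NumPy one-dimensional array accesses its elements through an implicit integer index, whereas a Series associates its elements with an explicitly defined index. The explicit index makes a Series more capable: labels need not be integers (they can be strings, for example), need not be contiguous, and may even repeat.
A DataFrame is the central pandas data structure. It represents a two-dimensional table, similar to a table in a relational database, in which each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be viewed as a dictionary of Series that share the same index.
There are many ways to create a DataFrame; the most common is from a dictionary of equal-length lists or NumPy arrays. You can inspect a DataFrame's columns and index attributes.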
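Two of the points above can be checked directly on made-up values: a Series index may contain repeated string labels, and each column of a DataFrame is itself a Series that shares the DataFrame's row index. The labels 'x' and 'y' below are arbitrary choices for illustration.

```python
import pandas as pd

# A Series whose string index contains a repeated label
s = pd.Series([7, -2, 567], index=["a", "b", "a"])
# Selecting a repeated label returns every matching entry as a Series
repeated = s["a"]
print(repeated.tolist())

# A DataFrame built from a dictionary of equal-length lists
df = pd.DataFrame(
    {"city": ["nanjing", "wuxi"], "code": ["001", "002"]},
    index=["x", "y"],
)
# Each column is a Series that shares the DataFrame's row index
city = df["city"]
print(type(city).__name__)
print(df.columns.tolist())
print(df.index.tolist())
```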
# Write your code here
sdata_1 = [7,-2,567,8]
example_1 = pd.Series(sdata_1,index = ['a','b','c','d'])
example_1
a 7
b -2
c 567
d 8
dtype: int64
sdata_2 = {'a':7,'b':-2,'c':567,'d':8}
example_2 = pd.Series(sdata_2)
example_2
a 7
b -2
c 567
d 8
dtype: int64
sdata_3 = {'city':['nanjing','wuxi','wuhan','changsha'],
'code':['001','002','003','004']}
example_3 = pd.DataFrame(sdata_3)
example_3
 | city | code |
---|---|---|
0 | nanjing | 001 |
1 | wuxi | 002 |
2 | wuhan | 003 |
3 | changsha | 004 |
'''
# Our example
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
example_1 = pd.Series(sdata)
example_1
'''
'''
# Our example
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
example_2 = pd.DataFrame(data)
example_2
'''
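1.4.2 Task 2: Load the "train.csv" file using the method from the previous lesson
# Write your code here
train_chinese = pd.read_csv('train_Chinese.csv')
train_chinese.head()
train_data = pd.read_csv('train.csv')
You can also load the "train_Chinese.csv" file saved in the previous lesson. Having become familiar with the dataset through the translated train_Chinese.csv, we now work on train.csv.
1.4.3 Task 3: View the name of each column of the DataFrame
# Write your code here
train_chinese.columns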
Index(['乘客ID', '是否幸存', '仓位等级', '姓名', '性别', '年龄', '兄弟姐妹个数', '父母子女个数', '船票信息', '票价', '客舱', '登船港口'], dtype='object')
train_data.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
train_data.head()
 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1.4.4 Task 4: View all values in the "Cabin" column [there are several ways]
# Write your code here
train_data['Cabin'].head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
# Write your code here
train_data.Cabin.head()
0 NaN
1 C85
2 NaN
3 C123
4 NaN
Name: Cabin, dtype: object
1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, then delete the extra column
On inspection, the test set test_1.csv has one superfluous column, which we need to delete
# Write your code here
test_data = pd.read_csv('test_1.csv')
test_data.head()
 | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | a |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 100 |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 100 |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 100 |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 100 |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 100 |
# Write your code here
test_data.pop('a').head()
test_data
 | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
10 | 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
11 | 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
12 | 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
14 | 14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
15 | 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
16 | 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
17 | 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
18 | 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
19 | 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
20 | 20 | 21 | 0 | 2 | Fynney, Mr. Joseph J | male | 35.0 | 0 | 0 | 239865 | 26.0000 | NaN | S |
21 | 21 | 22 | 1 | 2 | Beesley, Mr. Lawrence | male | 34.0 | 0 | 0 | 248698 | 13.0000 | D56 | S |
22 | 22 | 23 | 1 | 3 | McGowan, Miss. Anna "Annie" | female | 15.0 | 0 | 0 | 330923 | 8.0292 | NaN | Q |
23 | 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
24 | 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
25 | 25 | 26 | 1 | 3 | Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... | female | 38.0 | 1 | 5 | 347077 | 31.3875 | NaN | S |
26 | 26 | 27 | 0 | 3 | Emir, Mr. Farred Chehab | male | NaN | 0 | 0 | 2631 | 7.2250 | NaN | C |
27 | 27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.0 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
28 | 28 | 29 | 1 | 3 | O'Dwyer, Miss. Ellen "Nellie" | female | NaN | 0 | 0 | 330959 | 7.8792 | NaN | Q |
29 | 29 | 30 | 0 | 3 | Todoroff, Mr. Lalio | male | NaN | 0 | 0 | 349216 | 7.8958 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
861 | 861 | 862 | 0 | 2 | Giles, Mr. Frederick Edward | male | 21.0 | 1 | 0 | 28134 | 11.5000 | NaN | S |
862 | 862 | 863 | 1 | 1 | Swift, Mrs. Frederick Joel (Margaret Welles Ba... | female | 48.0 | 0 | 0 | 17466 | 25.9292 | D17 | S |
863 | 863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
864 | 864 | 865 | 0 | 2 | Gill, Mr. John William | male | 24.0 | 0 | 0 | 233866 | 13.0000 | NaN | S |
865 | 865 | 866 | 1 | 2 | Bystrom, Mrs. (Karolina) | female | 42.0 | 0 | 0 | 236852 | 13.0000 | NaN | S |
866 | 866 | 867 | 1 | 2 | Duran y More, Miss. Asuncion | female | 27.0 | 1 | 0 | SC/PARIS 2149 | 13.8583 | NaN | C |
867 | 867 | 868 | 0 | 1 | Roebling, Mr. Washington Augustus II | male | 31.0 | 0 | 0 | PC 17590 | 50.4958 | A24 | S |
868 | 868 | 869 | 0 | 3 | van Melkebeke, Mr. Philemon | male | NaN | 0 | 0 | 345777 | 9.5000 | NaN | S |
869 | 869 | 870 | 1 | 3 | Johnson, Master. Harold Theodor | male | 4.0 | 1 | 1 | 347742 | 11.1333 | NaN | S |
870 | 870 | 871 | 0 | 3 | Balkic, Mr. Cerin | male | 26.0 | 0 | 0 | 349248 | 7.8958 | NaN | S |
871 | 871 | 872 | 1 | 1 | Beckwith, Mrs. Richard Leonard (Sallie Monypeny) | female | 47.0 | 1 | 1 | 11751 | 52.5542 | D35 | S |
872 | 872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.0 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
873 | 873 | 874 | 0 | 3 | Vander Cruyssen, Mr. Victor | male | 47.0 | 0 | 0 | 345765 | 9.0000 | NaN | S |
874 | 874 | 875 | 1 | 2 | Abelson, Mrs. Samuel (Hannah Wizosky) | female | 28.0 | 1 | 0 | P/PP 3381 | 24.0000 | NaN | C |
875 | 875 | 876 | 1 | 3 | Najib, Miss. Adele Kiamie "Jane" | female | 15.0 | 0 | 0 | 2667 | 7.2250 | NaN | C |
876 | 876 | 877 | 0 | 3 | Gustafsson, Mr. Alfred Ossian | male | 20.0 | 0 | 0 | 7534 | 9.8458 | NaN | S |
877 | 877 | 878 | 0 | 3 | Petroff, Mr. Nedelio | male | 19.0 | 0 | 0 | 349212 | 7.8958 | NaN | S |
878 | 878 | 879 | 0 | 3 | Laleff, Mr. Kristo | male | NaN | 0 | 0 | 349217 | 7.8958 | NaN | S |
879 | 879 | 880 | 1 | 1 | Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) | female | 56.0 | 0 | 1 | 11767 | 83.1583 | C50 | C |
880 | 880 | 881 | 1 | 2 | Shelley, Mrs. William (Imanita Parrish Hall) | female | 25.0 | 0 | 1 | 230433 | 26.0000 | NaN | S |
881 | 881 | 882 | 0 | 3 | Markun, Mr. Johann | male | 33.0 | 0 | 0 | 349257 | 7.8958 | NaN | S |
882 | 882 | 883 | 0 | 3 | Dahlberg, Miss. Gerda Ulrika | female | 22.0 | 0 | 0 | 7552 | 10.5167 | NaN | S |
883 | 883 | 884 | 0 | 2 | Banfield, Mr. Frederick James | male | 28.0 | 0 | 0 | C.A./SOTON 34068 | 10.5000 | NaN | S |
884 | 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.0500 | NaN | S |
885 | 885 | 886 | 0 | 3 | Rice, Mrs. William (Margaret Norton) | female | 39.0 | 0 | 5 | 382652 | 29.1250 | NaN | Q |
886 | 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 13 columns
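【Think】Are there other ways to delete the extra column?
# Answer (an alternative to pop; run it on a freshly loaded test_data, since column 'a' can only be deleted once)
del test_data['a']
test_data.head()
 | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |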
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and view only the remaining columns
# Write your code here
test_data.drop(['PassengerId','Name','Age','Ticket'],axis=1).head()
 | Unnamed: 0 | Survived | Pclass | Sex | SibSp | Parch | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 3 | male | 1 | 0 | 7.2500 | NaN | S |
1 | 1 | 1 | 1 | female | 1 | 0 | 71.2833 | C85 | C |
2 | 2 | 1 | 3 | female | 0 | 0 | 7.9250 | NaN | S |
3 | 3 | 1 | 1 | female | 1 | 0 | 53.1000 | C123 | S |
4 | 4 | 0 | 3 | male | 0 | 0 | 8.0500 | NaN | S |
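【Think】Compare Task 5 and Task 6: they used different methods (functions). How could the same function meet both requirements?
【Answer】
drop() can do both. By default it returns a new DataFrame and leaves the original untouched, which is all Task 6 needs for hiding columns. To remove the columns from the original DataFrame for good, as in Task 5, pass inplace=True. Because inplace=True overwrites the original data, it was not used in Task 6.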
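The difference can be seen on a small made-up DataFrame that mimics the extra column 'a' from test_1.csv. This is a sketch of the behaviour described above, not the tutorial's own solution.

```python
import pandas as pd

# Made-up frame with a superfluous column 'a'
df = pd.DataFrame({"PassengerId": [1, 2], "Age": [22.0, 38.0], "a": [100, 100]})

# Hiding columns: drop() returns a new DataFrame and leaves df unchanged
hidden = df.drop(["PassengerId", "Age"], axis=1)
print(hidden.columns.tolist())
print(df.columns.tolist())

# Deleting for good: inplace=True modifies df itself and returns None
df.drop(["a"], axis=1, inplace=True)
print(df.columns.tolist())
```

Reassigning the result, df = df.drop(['a'], axis=1), has the same effect as inplace=True and is often easier to read in a chain of operations.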
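1.5 The Logic of Filtering
For tabular data, one of the most important capabilities is filtering: selecting the information we need and discarding what we do not.
Below we again learn this pandas feature through hands-on practice.
1.5.1 Task 1: Using "Age" as the filter condition, show the passengers younger than 10.
# Write your code here
test_data[test_data['Age']<10].head()
 | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |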
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
10 | 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
16 | 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
24 | 24 | 25 | 0 | 3 | Palsson, Miss. Torborg Danira | female | 8.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
43 | 43 | 44 | 1 | 2 | Laroche, Miss. Simonne Marie Anne Andree | female | 3.0 | 1 | 2 | SC/Paris 2123 | 41.5792 | NaN | C |
1.5.2 Task 2: Using "Age" as the condition, show the passengers older than 10 and younger than 50, and name this data midage
# Write your code here
midage = test_data[(test_data['Age']>10) & (test_data['Age']<50)]
midage.head()
 | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
【Tip】Learn how conditional filtering works in pandas and how to combine conditions as an intersection (&) or a union (|).
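A minimal sketch of both combinations on made-up ages. Each condition needs its own parentheses because & and | bind more tightly than the comparison operators, and rows with a missing age fail every comparison, so they appear in neither result.

```python
import numpy as np
import pandas as pd

# Made-up ages, including one missing value
df = pd.DataFrame({"Age": [2.0, 22.0, np.nan, 54.0, 35.0]})

# Intersection: both conditions must hold
midage = df[(df["Age"] > 10) & (df["Age"] < 50)]

# Union: at least one condition must hold
outside = df[(df["Age"] <= 10) | (df["Age"] >= 50)]

print(midage.index.tolist())
print(outside.index.tolist())
```

The filtered frames keep their original row labels, which is why the index is no longer consecutive after filtering.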
1.5.3 Task 3: Show the "Pclass" and "Sex" data for row 100 of midage
# Write your code here
midage = midage.reset_index()
midage.head()
 | index | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
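【Tip】When extracting data, we want the relative order of the rows to stay unchanged. Which function achieves this?
reset_index(): resets the index and returns a new DataFrame or Series numbered 0, 1, 2, ... in the existing row order. By default the old index is kept as a column (here named index), so the relative order of the data is preserved.
midage.loc[[100],['Pclass','Sex']]
 | Pclass | Sex |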
---|---|---|
100 | 2 | male |
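1.5.4 Task 4: Using the loc method, show the "Pclass", "Name" and "Sex" data for rows 100, 105 and 108 of midage
# Write your code here
midage.loc[[100,105,108],['Pclass','Name','Sex']]  # several rows are selected, so the result comes back in tabular (DataFrame) form
 | Pclass | Name | Sex |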
---|---|---|---|
100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
105 | 3 | Cribb, Mr. John Hatfield | male |
108 | 3 | Calic, Mr. Jovo | male |
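1.5.5 Task 5: Using the iloc method, show the "Pclass", "Name" and "Sex" data for rows 100, 105 and 108 of midage
# Write your code here
midage.iloc[[100,105,108],[4,5,6]]  # iloc takes integer positions for both rows and columns; column names are not accepted
 | Pclass | Name | Sex |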
---|---|---|---|
100 | 2 | Byles, Rev. Thomas Roussel Davids | male |
105 | 3 | Cribb, Mr. John Hatfield | male |
108 | 3 | Calic, Mr. Jovo | male |
【Think】Compare iloc and loc: how are they alike and how do they differ?
iloc selects by integer position, whereas loc selects by index label.
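The difference only becomes visible when labels and positions disagree, for example after filtering. The sketch below uses a made-up frame whose row labels are 10, 20, 30, 40 (an arbitrary choice) to show the same rows selected both ways.

```python
import pandas as pd

# Made-up frame whose row labels differ from the row positions
df = pd.DataFrame(
    {"Pclass": [3, 1, 3, 2], "Sex": ["male", "female", "female", "male"]},
    index=[10, 20, 30, 40],
)

# loc: row labels and column names
by_label = df.loc[[20, 40], ["Pclass", "Sex"]]

# iloc: integer positions for both rows and columns
by_position = df.iloc[[1, 3], [0, 1]]

print(by_label.equals(by_position))

# After reset_index(drop=True) labels and positions coincide again
renumbered = df.reset_index(drop=True)
print(renumbered.loc[[1, 3]].equals(renumbered.iloc[[1, 3]]))
```

This is why the tasks above call reset_index() on midage first: it makes label 100 and position 100 refer to the same row.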
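Review: earlier we covered the pandas basics and learned how to read CSV data with pandas and how to add, delete, query and modify it. Today we study exploratory data analysis, focusing on how to use pandas for sorting, arithmetic, and the describe() function for descriptive statistics.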
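1 Chapter 1: Exploratory Data Analysis
Before starting, import numpy and pandas and load the data
# Load the required libraries
import numpy as np
import pandas as pd
# Load the train_Chinese.csv data saved earlier; we use this data for the Titanic tasks
train_data = pd.read_csv('train_Chinese.csv')
1.6 Do You Know Your Data?
Textbook: "Python for Data Analysis", Chapter 5
1.6.1 Task 1: Use pandas to sort the sample data in ascending order
# For details see Chapter 5 of "Python for Data Analysis", the section on sorting and ranking
# Build your own all-numeric DataFrame
'''
Here is an example
pd.DataFrame() : creates a DataFrame object
np.arange(8).reshape((2, 4)) : builds a 2x4 two-dimensional array; first row: 0,1,2,3; second row: 4,5,6,7
index=[2,1] : the row labels (index) of the DataFrame
columns=['d', 'a', 'b', 'c'] : the column labels of the DataFrame
'''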
frame = pd.DataFrame(np.arange(8).reshape(2,4),index=[2,1],columns=['d','a','b','c'])
frame
 | d | a | b | c |
---|---|---|---|---|
2 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
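【Code Explanation】
pd.DataFrame() : creates a DataFrame object
np.arange(8).reshape((2, 4)) : builds a 2x4 two-dimensional array; first row: 0,1,2,3; second row: 4,5,6,7
index=[2, 1] : the row labels (index) of the DataFrame
columns=['d', 'a', 'b', 'c'] : the column labels of the DataFrame
【Question】Most of the time we want to sort by the values of a column, so sort the DataFrame you built by one of its columns, in ascending order
# Answer
frame.sort_values(by = 'c',ascending = True)
 | d | a | b | c |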
---|---|---|---|---|
2 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
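【Think】Based on the book, can you name other ways pandas can sort a DataFrame?
sort_index() sorts by index labels: the row index by default, or the column labels with axis=1
frame.sort_index()
 | d | a | b | c |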
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 0 | 1 | 2 | 3 |
【Summary】The different sorting methods are summarized below
1. Sort the row index in ascending order
# Code
frame.sort_index()
 | d | a | b | c |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 0 | 1 | 2 | 3 |
2. Sort the column index in ascending order
# Code
frame.sort_index(axis=1)
 | a | b | c | d |
---|---|---|---|---|
2 | 1 | 2 | 3 | 0 |
1 | 5 | 6 | 7 | 4 |
3. Sort the column index in descending order
# Code
frame.sort_index(axis=1,ascending=False)
 | d | c | b | a |
---|---|---|---|---|
2 | 0 | 3 | 2 | 1 |
1 | 4 | 7 | 6 | 5 |
4. Sort by any two columns, both in descending order
# Code
frame.sort_values(['a','c'],ascending=False)
 | d | a | b | c |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
2 | 0 | 1 | 2 | 3 |
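The two-row frame above has no ties, so the second sort key never comes into play. The sketch below uses made-up fares and ages (the column names Fare and Age are illustrative) with a tie in the first key, and also shows that ascending accepts one flag per column.

```python
import pandas as pd

# Made-up data with a tie in Fare (rows 0 and 2)
df = pd.DataFrame({
    "Fare": [30.0, 512.3, 30.0, 7.25],
    "Age": [26.0, 35.0, 19.0, 22.0],
})

# Both keys descending: ties in Fare are broken by Age, oldest first
both_desc = df.sort_values(["Fare", "Age"], ascending=False)
print(both_desc.index.tolist())

# One flag per key: Fare descending, Age ascending within equal fares
mixed = df.sort_values(["Fare", "Age"], ascending=[False, True])
print(mixed.index.tolist())
```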
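1.6.2 Task 2: Sort the Titanic data (train.csv) by the fare and age columns together (in descending order). What can you learn from the result?
'''
We loaded train_Chinese.csv at the start and have already covered how data is loaded, so following what we learned above we can sort by the target columns directly
head(20) : returns the first 20 rows
'''
train_data.head(20)
 | 乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 |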
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
# Code
train_data.sort_values(['票价', '年龄'], ascending=False)
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.00 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.00 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.00 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
438 | 439 | 0 | 1 | Fortune, Mr. Mark | male | 64.00 | 1 | 4 | 19950 | 263.0000 | C23 C25 C27 | S |
341 | 342 | 1 | 1 | Fortune, Miss. Alice Elizabeth | female | 24.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
88 | 89 | 1 | 1 | Fortune, Miss. Mabel Helen | female | 23.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
27 | 28 | 0 | 1 | Fortune, Mr. Charles Alexander | male | 19.00 | 3 | 2 | 19950 | 263.0000 | C23 C25 C27 | S |
742 | 743 | 1 | 1 | Ryerson, Miss. Susan Parker "Suzette" | female | 21.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
311 | 312 | 1 | 1 | Ryerson, Miss. Emily Borie | female | 18.00 | 2 | 2 | PC 17608 | 262.3750 | B57 B59 B63 B66 | C |
299 | 300 | 1 | 1 | Baxter, Mrs. James (Helene DeLaudeniere Chaput) | female | 50.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
118 | 119 | 0 | 1 | Baxter, Mr. Quigg Edmond | male | 24.00 | 0 | 1 | PC 17558 | 247.5208 | B58 B60 | C |
380 | 381 | 1 | 1 | Bidois, Miss. Rosalie | female | 42.00 | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
716 | 717 | 1 | 1 | Endres, Miss. Caroline Louise | female | 38.00 | 0 | 0 | PC 17757 | 227.5250 | C45 | C |
700 | 701 | 1 | 1 | Astor, Mrs. John Jacob (Madeleine Talmadge Force) | female | 18.00 | 1 | 0 | PC 17757 | 227.5250 | C62 C64 | C |
557 | 558 | 0 | 1 | Robbins, Mr. Victor | male | NaN | 0 | 0 | PC 17757 | 227.5250 | NaN | C |
527 | 528 | 0 | 1 | Farthing, Mr. John | male | NaN | 0 | 0 | PC 17483 | 221.7792 | C95 | S |
377 | 378 | 0 | 1 | Widener, Mr. Harry Elkins | male | 27.00 | 0 | 2 | 113503 | 211.5000 | C82 | C |
779 | 780 | 1 | 1 | Robert, Mrs. Edward Scott (Elisabeth Walton Mc... | female | 43.00 | 0 | 1 | 24160 | 211.3375 | B3 | S |
730 | 731 | 1 | 1 | Allen, Miss. Elisabeth Walton | female | 29.00 | 0 | 0 | 24160 | 211.3375 | B5 | S |
689 | 690 | 1 | 1 | Madill, Miss. Georgette Alexandra | female | 15.00 | 0 | 1 | 24160 | 211.3375 | B5 | S |
856 | 857 | 1 | 1 | Wick, Mrs. George Dennick (Mary Hitchcock) | female | 45.00 | 1 | 1 | 36928 | 164.8667 | NaN | S |
318 | 319 | 1 | 1 | Wick, Miss. Mary Natalie | female | 31.00 | 0 | 2 | 36928 | 164.8667 | C7 | S |
268 | 269 | 1 | 1 | Graham, Mrs. William Thompson (Edith Junkins) | female | 58.00 | 0 | 1 | PC 17582 | 153.4625 | C125 | S |
609 | 610 | 1 | 1 | Shutes, Miss. Elizabeth W | female | 40.00 | 0 | 0 | PC 17582 | 153.4625 | C125 | S |
332 | 333 | 0 | 1 | Graham, Mr. George Edward | male | 38.00 | 0 | 1 | PC 17582 | 153.4625 | C91 | S |
498 | 499 | 0 | 1 | Allison, Mrs. Hudson J C (Bessie Waldo Daniels) | female | 25.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
708 | 709 | 1 | 1 | Cleaver, Miss. Alice | female | 22.00 | 0 | 0 | 113781 | 151.5500 | NaN | S |
297 | 298 | 0 | 1 | Allison, Miss. Helen Loraine | female | 2.00 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.5500 | C22 C26 | S |
195 | 196 | 1 | 1 | Lurette, Miss. Elise | female | 58.00 | 0 | 0 | PC 17569 | 146.5208 | B80 | C |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
611 | 612 | 0 | 3 | Jardin, Mr. Jose Neto | male | NaN | 0 | 0 | SOTON/O.Q. 3101305 | 7.0500 | NaN | S |
477 | 478 | 0 | 3 | Braund, Mr. Lewis Richard | male | 29.00 | 1 | 0 | 3460 | 7.0458 | NaN | S |
129 | 130 | 0 | 3 | Ekstrom, Mr. Johan | male | 45.00 | 0 | 0 | 347061 | 6.9750 | NaN | S |
804 | 805 | 1 | 3 | Hedman, Mr. Oskar Arvid | male | 27.00 | 0 | 0 | 347089 | 6.9750 | NaN | S |
825 | 826 | 0 | 3 | Flynn, Mr. John | male | NaN | 0 | 0 | 368323 | 6.9500 | NaN | Q |
411 | 412 | 0 | 3 | Hart, Mr. Henry | male | NaN | 0 | 0 | 394140 | 6.8583 | NaN | Q |
143 | 144 | 0 | 3 | Burke, Mr. Jeremiah | male | 19.00 | 0 | 0 | 365222 | 6.7500 | NaN | Q |
654 | 655 | 0 | 3 | Hegarty, Miss. Hanora "Nora" | female | 18.00 | 0 | 0 | 365226 | 6.7500 | NaN | Q |
202 | 203 | 0 | 3 | Johanson, Mr. Jakob Alfred | male | 34.00 | 0 | 0 | 3101264 | 6.4958 | NaN | S |
371 | 372 | 0 | 3 | Wiklund, Mr. Jakob Alfred | male | 18.00 | 1 | 0 | 3101267 | 6.4958 | NaN | S |
818 | 819 | 0 | 3 | Holm, Mr. John Fredrik Alexander | male | 43.00 | 0 | 0 | C 7075 | 6.4500 | NaN | S |
843 | 844 | 0 | 3 | Lemberopolous, Mr. Peter L | male | 34.50 | 0 | 0 | 2683 | 6.4375 | NaN | C |
326 | 327 | 0 | 3 | Nysveen, Mr. Johan Hansen | male | 61.00 | 0 | 0 | 345364 | 6.2375 | NaN | S |
872 | 873 | 0 | 1 | Carlsson, Mr. Frans Olof | male | 33.00 | 0 | 0 | 695 | 5.0000 | B51 B53 B55 | S |
378 | 379 | 0 | 3 | Betros, Mr. Tannous | male | 20.00 | 0 | 0 | 2648 | 4.0125 | NaN | C |
597 | 598 | 0 | 3 | Johnson, Mr. Alfred | male | 49.00 | 0 | 0 | LINE | 0.0000 | NaN | S |
263 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.00 | 0 | 0 | 112059 | 0.0000 | B94 | S |
806 | 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.00 | 0 | 0 | 112050 | 0.0000 | A36 | S |
822 | 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.00 | 0 | 0 | 19972 | 0.0000 | NaN | S |
179 | 180 | 0 | 3 | Leonard, Mr. Lionel | male | 36.00 | 0 | 0 | LINE | 0.0000 | NaN | S |
271 | 272 | 1 | 3 | Tornquist, Mr. William Henry | male | 25.00 | 0 | 0 | LINE | 0.0000 | NaN | S |
302 | 303 | 0 | 3 | Johnson, Mr. William Cahoone Jr | male | 19.00 | 0 | 0 | LINE | 0.0000 | NaN | S |
277 | 278 | 0 | 2 | Parkes, Mr. Francis "Frank" | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S |
413 | 414 | 0 | 2 | Cunningham, Mr. Alfred Fleming | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S |
466 | 467 | 0 | 2 | Campbell, Mr. William | male | NaN | 0 | 0 | 239853 | 0.0000 | NaN | S |
481 | 482 | 0 | 2 | Frost, Mr. Anthony Wood "Archie" | male | NaN | 0 | 0 | 239854 | 0.0000 | NaN | S |
633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0000 | NaN | S |
674 | 675 | 0 | 2 | Watson, Mr. Ennis Hastings | male | NaN | 0 | 0 | 239856 | 0.0000 | NaN | S |
732 | 733 | 0 | 2 | Knight, Mr. Robert J | male | NaN | 0 | 0 | 239855 | 0.0000 | NaN | S |
815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0000 | B102 | S |
891 rows × 12 columns
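【Think】After sorting, suppose we focus only on the age and fare columns. Common sense says that a higher fare should mean a better cabin, and the sorted table shows that 14 of the 20 passengers with the highest fares survived, which is a notably high proportion. Could we go on to analyze the relationship between fare and survival, and between age and survival? Once you start noticing relationships in the data, data analysis has begun.
Of course, this is only my view; you may have other ideas, and you are welcome to write them in your study notes.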
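One possible way to follow up on the fare idea is to sort by fare and compare the share of survivors at the two ends. The sketch below is only an illustration under assumptions: `sample` is a hypothetical six-row frame whose values are copied from the sorted table above (index, 是否幸存, 票价), which is far too few rows to draw any conclusion. With the full data you would run the same steps on `train_data`.

```python
import pandas as pd

# Six rows copied from the sorted output above (illustration only)
sample = pd.DataFrame(
    {'是否幸存': [1, 0, 1, 0, 1, 0],
     '票价': [512.3292, 263.0, 262.375, 7.05, 6.975, 0.0]},
    index=[679, 438, 742, 611, 804, 597],
)

by_fare = sample.sort_values('票价', ascending=False)

# The mean of a 0/1 column is the share of survivors
top_rate = by_fare.head(3)['是否幸存'].mean()
bottom_rate = by_fare.tail(3)['是否幸存'].mean()

print(top_rate, bottom_rate)
```

Because 是否幸存 is coded as 0 or 1, taking the mean gives the survival rate directly without a separate count.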
The relationship between survival and sex
Try sorting by a few more columns
# Code
train_data.sort_values(['兄弟姐妹个数', '父母子女个数', '性别'], ascending=False).head(20)
乘客ID | 是否幸存 | 仓位等级 | 姓名 | 性别 | 年龄 | 兄弟姐妹个数 | 父母子女个数 | 船票信息 | 票价 | 客舱 | 登船港口 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
201 | 202 | 0 | 3 | Sage, Mr. Frederick | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
324 | 325 | 0 | 3 | Sage, Mr. George John Jr | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
846 | 847 | 0 | 3 | Sage, Mr. Douglas Bullen | male | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
180 | 181 | 0 | 3 | Sage, Miss. Constance Gladys | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
792 | 793 | 0 | 3 | Sage, Miss. Stella Anna | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
863 | 864 | 0 | 3 | Sage, Miss. Dorothy Edith "Dolly" | female | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | S |
59 | 60 | 0 | 3 | Goodwin, Master. William Frederick | male | 11.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
386 | 387 | 0 | 3 | Goodwin, Master. Sidney Leonard | male | 1.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
480 | 481 | 0 | 3 | Goodwin, Master. Harold Victor | male | 9.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
683 | 684 | 0 | 3 | Goodwin, Mr. Charles Edward | male | 14.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
71 | 72 | 0 | 3 | Goodwin, Miss. Lillian Amy | female | 16.0 | 5 | 2 | CA 2144 | 46.9000 | NaN | S |
182 | 183 | 0 | 3 | Asplund, Master. Clarence Gustaf Hugo | male | 9.0 | 4 | 2 | 347077 | 31.3875 | NaN | S |
261 | 262 | 1 | 3 | Asplund, Master. Edvin Rojj Felix | male | 3.0 | 4 | 2 | 347077 | 31.3875 | NaN | S |
850 | 851 | 0 | 3 | Andersson, Master. Sigvard Harald Elias | male | 4.0 | 4 | 2 | 347082 | 31.2750 | NaN | S |
68 | 69 | 1 | 3 | Andersson, Miss. Erna Alexandra | female | 17.0 | 4 | 2 | 3101281 | 7.9250 | NaN | S |
119 | 120 | 0 | 3 | Andersson, Miss. Ellis Anna Maria | female | 2.0 | 4 | 2 | 347082 | 31.2750 | NaN | S |
233 | 234 | 1 | 3 | Asplund, Miss. Lillian Gertrud | female | 5.0 | 4 | 2 | 347077 | 31.3875 | NaN | S |
541 | 542 | 0 | 3 | Andersson, Miss. Ingeborg Constanzia | female | 9.0 | 4 | 2 | 347082 | 31.2750 | NaN | S |
542 | 543 | 0 | 3 | Andersson, Miss. Sigrid Elisabeth | female | 11.0 | 4 | 2 | 347082 | 31.2750 | NaN | S |
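# Write down your thoughts
Passengers with more siblings appear to have a lower survival rate, and men may have a lower survival rate than women.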
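Sorting only hints at such patterns; a grouped mean is one way to check them more directly. The sketch below is a minimal illustration, assuming a hypothetical frame `first_ten` that copies the 性别 and 是否幸存 values of the first ten rows shown by `train_data.head(20)` earlier. Ten rows are not representative, so the same `groupby` should be run on the full `train_data` before drawing any conclusion.

```python
import pandas as pd

# 性别 / 是否幸存 of the first ten rows displayed earlier (illustration only)
first_ten = pd.DataFrame({
    '性别': ['male', 'female', 'female', 'female', 'male',
           'male', 'male', 'male', 'female', 'female'],
    '是否幸存': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Mean of the 0/1 survival flag within each sex = survival rate per group
rate_by_sex = first_ten.groupby('性别')['是否幸存'].mean()

print(rate_by_sex)
```

The same pattern works for other grouping columns, for example replacing 性别 with 兄弟姐妹个数.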
1.6.3 Task 3: Use pandas for arithmetic: compute the sum of two DataFrames
# For details, see the "Arithmetic and Data Alignment" section in Chapter 5 of Python for Data Analysis
# Build two all-numeric DataFrames yourself
"""
Here is our example:
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])
frame1_a
"""
# Code
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])
Add frame1_a and frame1_b
# Code
frame1_a
a | b | c | |
---|---|---|---|
one | 0.0 | 1.0 | 2.0 |
two | 3.0 | 4.0 | 5.0 |
three | 6.0 | 7.0 | 8.0 |
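【Note】Adding two DataFrames returns a new DataFrame: values whose row and column labels exist in both frames are added together, and every position that lacks a match in either frame becomes NaN.
DataFrames support many other arithmetic operations, such as subtraction and division. If you are interested, see the "Arithmetic and Data Alignment" section in Chapter 5 of Python for Data Analysis and look for more learning material online.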
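If the NaN values produced by alignment are not what you want, `DataFrame.add` accepts a `fill_value` argument. The sketch below reuses the two example frames; the names `plain` and `filled` are only illustrative. With `fill_value=0`, a value missing in one frame is treated as 0, while a position missing in both frames still stays NaN.

```python
import numpy as np
import pandas as pd

frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])

# Plain addition: any label missing from either frame gives NaN
plain = frame1_a + frame1_b

# add() with fill_value: a value missing on one side counts as 0
filled = frame1_a.add(frame1_b, fill_value=0)

print(plain)
print(filled)
```

For example, row 'one' and column 'b' exists only in frame1_a, so it is NaN in `plain` but keeps its value 1.0 in `filled`.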
frame1_b
a | e | c | |
---|---|---|---|
first | 0.0 | 1.0 | 2.0 |
one | 3.0 | 4.0 | 5.0 |
two | 6.0 | 7.0 | 8.0 |
second | 9.0 | 10.0 | 11.0 |
frame1_a + frame1_b
a | b | c | e | |
---|---|---|---|---|
first | NaN | NaN | NaN | NaN |
one | 3.0 | NaN | 7.0 | NaN |
second | NaN | NaN | NaN | NaN |
three | NaN | NaN | NaN | NaN |
two | 9.0 | NaN | 13.0 | NaN |
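1.6.4 Task 4: Using the Titanic data, how do we find out how many people the largest family on board had?
'''
Still using the train_chinese.csv loaded earlier.
To see how many people the largest family on board had
(兄弟姐妹个数 + 父母子女个数), what should we do?
'''
max(train_data['兄弟姐妹个数'] + train_data['父母子女个数'])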
10
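【Note】We only need the largest sum of 兄弟姐妹个数 (siblings/spouses) and 父母子女个数 (parents/children). This sum counts a passenger's relatives on board and does not include the passenger. There are many other methods and angles, and you are welcome to share yours.
Try adding a few other columns together and see what you can learn.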
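As one alternative angle, the sum can be kept as a Series so that we can also see which row holds the maximum. The sketch below is an illustration under assumptions: `sample` is a hypothetical four-row frame whose values are copied from rows shown earlier in this section, and `family` is an illustrative name. With the full data, the same calls would be applied to `train_data`.

```python
import pandas as pd

# Four rows copied from tables shown earlier (index 159, 59, 0, 4)
sample = pd.DataFrame(
    {'兄弟姐妹个数': [8, 5, 1, 0],
     '父母子女个数': [2, 2, 0, 0]},
    index=[159, 59, 0, 4],
)

# Relatives on board for each passenger (the passenger is not counted)
family = sample['兄弟姐妹个数'] + sample['父母子女个数']

largest = family.max()        # largest number of relatives
who = family.idxmax()         # index label of the row holding that maximum
largest_with_self = largest + 1

print(largest, who, largest_with_self)
```

Using `Series.max()` gives the same number as the built-in `max()`, and `idxmax()` adds the row label so the passenger can be looked up.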
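1.6.5 Task 5: Learn to use the pandas describe() function to view basic statistics of the data
# (1) Work through the key example (simple data)
# For details, see the "Summarizing and Computing Descriptive Statistics" section in Chapter 5 of Python for Data Analysis
# Build a DataFrame yourself that contains both numbers and missing values
"""
Here is our example:
frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                       [np.nan, np.nan], [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2
"""
# Code
frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                       [np.nan, np.nan], [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
frame2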
one | two | |
---|---|---|
a | 1.40 | NaN |
b | 7.10 | -4.5 |
c | NaN | NaN |
d | 0.75 | -1.3 |
Call the describe function and look at the basic statistics of frame2
# Code
frame2.describe()
one | two | |
---|---|---|
count | 3.000000 | 2.000000 |
mean | 3.083333 | -2.900000 |
std | 3.493685 | 2.262742 |
min | 0.750000 | -4.500000 |
25% | 1.075000 | -3.700000 |
50% | 1.400000 | -2.900000 |
75% | 4.250000 | -2.100000 |
max | 7.100000 | -1.300000 |
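The count row above is smaller than the number of rows because describe() skips missing values. The sketch below checks this on the same frame2; the names `stats`, `n_missing` and `manual_mean_one` are only illustrative.

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
                       [np.nan, np.nan], [0.75, -1.3]],
                      index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

stats = frame2.describe()

# Number of missing values per column
n_missing = frame2.isna().sum()

# Mean computed by hand on the non-missing values of column 'one'
manual_mean_one = frame2['one'].dropna().mean()

print(stats.loc['count'])
print(n_missing)
print(manual_mean_one)
```

Each column's count plus its number of missing values equals the four rows of the frame, which is a quick way to spot missing data before cleaning.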
1.6.6 Task 6: Look at the basic statistics of the fare (票价) and parents/children (父母子女个数) columns of the Titanic dataset. What do you notice?
'''
Basic statistics of the 票价 (fare) column of the Titanic dataset
'''
# Code
train_data['票价'].describe()
count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: 票价, dtype: float64
train_data['父母子女个数'].describe()
count    891.000000
mean       0.381594
std        0.806057
min        0.000000
25%        0.000000
50%        0.000000
75%        0.000000
max        6.000000
Name: 父母子女个数, dtype: float64
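【Think】What can we see from the data above? Try writing your own observations below, then compare them with our answer.
【Think】From the data above we can see that:
there are 891 fare values in total,
the mean is about 32.20,
the standard deviation is about 49.69, which means fares vary widely,
25% of passengers paid less than 7.91, 50% paid less than 14.45, and 75% paid less than 31.00,
the maximum fare is about 512.33 and the minimum is 0.
At least 75% of passengers had no parents or children on board, which suggests that most people were travelling without them.
Of course, this answer is only my view; you may have other ideas, and you are welcome to write them in your study notes.
Compute statistics for a few more groups of data and see what you can learn.
# Write down your other analyses
【Think】If you have more ideas, you are welcome to write them in your study notes.
【Summary】In this section we used some of pandas' built-in functions to take a first statistical look at the data. The most important part of this process is not memorizing the functions but understanding the numbers they produce and building your own data analysis mindset. This is also the key point of Chapter 1: we hope that after finishing it you have a basic feel for the data and understand what you are doing and why. In the following chapters we will start cleaning the data and analyzing it further.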