task01
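**Review:** The main goal of this course is to learn the data analysis workflow and get comfortable with basic Python data analysis operations by working hands-on with real data. With that goal in mind, we now begin the practical part: completing the Titanic task on Kaggle and walking through the whole analysis process. Two resources are useful here: Chapter 6 of the textbook *Python for Data Analysis*, and baidu.com & google.com (make good use of search engines).
Chapter 1: Loading Data and First Look
1.1 Loading the data
Dataset download: https://www.kaggle.com/c/titanic/overview
1.1.1 Task 1: Import numpy and pandas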
import numpy as np
import pandas as pd
np.__version__, pd.__version__
('1.18.1', '1.0.5')
!pip install pandas==1.0.5
Requirement already satisfied: pandas==1.0.5 in d:\programdata\anaconda3\lib\site-packages (1.0.5)
Requirement already satisfied: python-dateutil>=2.6.1 in d:\programdata\anaconda3\lib\site-packages (from pandas==1.0.5) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in d:\programdata\anaconda3\lib\site-packages (from pandas==1.0.5) (2019.3)
Requirement already satisfied: numpy>=1.13.3 in d:\programdata\anaconda3\lib\site-packages (from pandas==1.0.5) (1.18.1)
Requirement already satisfied: six>=1.5 in d:\programdata\anaconda3\lib\site-packages (from python-dateutil>=2.6.1->pandas==1.0.5) (1.14.0)
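[Hint] If the import fails, learn how to install the numpy and pandas libraries in your Python environment.
1.1.2 Task 2: Load the data
(1) Load the data using a relative path (2) Load the data using an absolute path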
absolute_path = r'D:\Py\hands-on-data-analysis\titanic\train.csv'
relative_path = r'..\titanic\train.csv'
df = pd.read_csv(absolute_path)
df.head()
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
df = pd.read_csv(relative_path)
df.head()
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
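[Hint] If loading with the relative path raises an error, use os.getcwd() to check the current working directory.
[Think] Now that you know how to load data, compare pd.read_csv() and pd.read_table(). What would you need to do to make them behave the same? Look up the difference between '.tsv' and '.csv' files and how to load each.
[Summary] Loading data is the first step of any analysis. You will meet many data formats at work (e.g. .csv, .tsv, .xlsx), but the loading approach is the same. When you hit an unfamiliar problem in later work and projects, look things up, use Google, understand the business logic, and be clear about what the input and output are.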
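One way to explore the read_csv / read_table question is the small sketch below. It uses in-memory text (hypothetical sample rows, not the real train.csv) so that it runs without any files. As far as I know, the only difference between the two functions is the default separator: a comma for read_csv and a tab for read_table.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a .csv and a .tsv file.
csv_text = '''PassengerId,Name,Fare
1,"Braund, Mr. Owen Harris",7.25
2,"Heikkinen, Miss. Laina",7.925
'''
tsv_text = "PassengerId\tName\tFare\n1\tBraund, Mr. Owen Harris\t7.25\n2\tHeikkinen, Miss. Laina\t7.925\n"

# read_csv defaults to sep=','; read_table defaults to sep='\t'.
from_csv = pd.read_csv(io.StringIO(csv_text))
from_table = pd.read_table(io.StringIO(csv_text), sep=',')   # override the tab default
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')      # read a .tsv with read_csv

same_csv = from_csv.equals(from_table)
same_tsv = from_csv.equals(from_tsv)
print(same_csv, same_tsv)
```

Passing sep explicitly makes either function read either format, so the choice between them is mostly a matter of readability.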
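1.1.3 Task 3: Read the data in chunks, with every 1000 rows as one chunk
pd.read_csv??
# The task asks for 1000 rows per chunk; chunksize=100 is used here so that
# the 891-row file yields several chunks to look at.
chunk100 = pd.read_csv('train.csv', chunksize=100)
type(chunk100)
pandas.io.parsers.TextFileReader
for chunk in chunk100:
    display(chunk.head(2))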
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C

     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
100  101  0  3  Petranec, Miss. Matilda  female  28.0  0  0  349245  7.8958  NaN  S
101  102  0  3  Petroff, Mr. Pastcho ("Pentcho")  male  NaN  0  0  349215  7.8958  NaN  S

     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
200  201  0  3  Vande Walle, Mr. Nestor Cyriel  male  28.0  0  0  345770  9.50  NaN  S
201  202  0  3  Sage, Mr. Frederick  male  NaN  8  2  CA. 2343  69.55  NaN  S

     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
300  301  1  3  Kelly, Miss. Anna Katherine "Annie Kate"  female  NaN  0  0  9234  7.75  NaN  Q
301  302  1  3  McCoy, Mr. Bernard  male  NaN  2  0  367226  23.25  NaN  Q

     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
400  401  1  3  Niskanen, Mr. Juha  male  39.0  0  0  STON/O 2. 3101289  7.925  NaN  S
401  402  0  3  Adams, Mr. John  male  26.0  0  0  341826  8.050  NaN  S

     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
500  501  0  3  Calic, Mr. Petar  male  17.0  0  0  315086  8.6625  NaN  S
501  502  0  3  Canavan, Miss. Mary  female  21.0  0  0  364846  7.7500  NaN  Q

     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
600  601  1  2  Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr...  female  24.0  2  1  243847  27.0000  NaN  S
601  602  0  3  Slabenoff, Mr. Petco  male  NaN  0  0  349214  7.8958  NaN  S

     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
700  701  1  1  Astor, Mrs. John Jacob (Madeleine Talmadge Force)  female  18.0  1  0  PC 17757  227.5250  C62 C64  C
701  702  1  1  Silverthorne, Mr. Spencer Victor  male  35.0  0  0  PC 17475  26.2875  E24  S

     PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
800  801  0  2  Ponesell, Mr. Martin  male  34.0  0  0  250647  13.00  NaN  S
801  802  1  2  Collyer, Mrs. Harvey (Charlotte Annie Tate)  female  31.0  1  1  C.A. 31921  26.25  NaN  S
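[Think] What is chunked reading, and why use it?
Answer: For a large file, reading everything at once can be very slow, and it fails outright if the file does not fit in memory. In that case, pass the chunksize parameter to read the file in chunks. With chunksize set, pandas does not load the whole file up front: read_csv returns an iterable reader (the TextFileReader shown above) rather than a DataFrame, and each chunk is parsed only when the loop reaches it, so only one chunk has to be in memory at a time. To combine the chunks into a single DataFrame, collect them and call pd.concat (DataFrame.append also worked in pandas 1.0.5, but it was removed in pandas 2.0). Note that if the total data is larger than memory, combining the chunks will run out of memory as well. One option then is to write the data to disk in HDF5 format, appending each chunk to the .h5 file.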
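The idea can be sketched on a tiny in-memory "file" (10 hypothetical rows with made-up column names id and value). Each chunk is processed on its own, a running total is kept, and only the rows of interest are retained and combined with pd.concat.

```python
import io
import pandas as pd

# Hypothetical stand-in for a large file: 10 rows of CSV text.
csv_text = "id,value\n" + "\n".join(f"{i},{i * 10}" for i in range(10))

reader = pd.read_csv(io.StringIO(csv_text), chunksize=4)  # iterable reader, not a DataFrame

chunk_sizes = []
total = 0
kept = []
for chunk in reader:
    chunk_sizes.append(len(chunk))            # rows in this chunk
    total += chunk["value"].sum()             # aggregate without keeping the chunk
    kept.append(chunk[chunk["value"] >= 50])  # keep only the rows we need

# Combine the (much smaller) filtered pieces into one DataFrame.
result = pd.concat(kept, ignore_index=True)
print(chunk_sizes, total, len(result))
```

Filtering or aggregating inside the loop is what keeps memory use low; concatenating every chunk unfiltered would rebuild the full table.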
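1.1.4 Task 4: Change the column headers to Chinese and set the index to the passenger ID [for English-language material, translating the headers can make the data easier to get familiar with]
PassengerId => 乘客ID (passenger ID)
Survived => 是否幸存 (survived or not)
Pclass => 乘客等级(1/2/3等舱位) (ticket class: 1st/2nd/3rd)
Name => 乘客姓名 (passenger name)
Sex => 性别 (sex)
Age => 年龄 (age)
SibSp => 堂兄弟/妹个数 (number of siblings/spouses aboard; the Chinese label literally says "cousins")
Parch => 父母与小孩个数 (number of parents/children aboard)
Ticket => 船票信息 (ticket information)
Fare => 票价 (fare)
Cabin => 客舱 (cabin)
Embarked => 登船港口 (port of embarkation)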
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
df.columns = ['乘客ID', '是否幸存', '乘客等级(1/2/3等舱位)', '乘客姓名', '性别', '年龄', '堂兄弟/妹个数', '父母与小孩个数', '船票信息', '票价', '客舱', '登船港口']
df.head()
   乘客ID  是否幸存  乘客等级(1/2/3等舱位)  乘客姓名  性别  年龄  堂兄弟/妹个数  父母与小孩个数  船票信息  票价  客舱  登船港口
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
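[Think] One way to change the headers to Chinese is to overwrite the English headers with Chinese ones. Are there other ways?
Answer: Yes. Method 1: when reading the data, pass the names parameter to specify the column names (together with header=0 so that the original header row is replaced rather than read as data). Method 2: call the DataFrame's rename method with columns= set to a dictionary that maps the English column names to the Chinese ones.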
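Both methods can be sketched on a three-column sample (hypothetical rows, with only a subset of the real columns). The sketch also shows set_index, since the task statement asks for the passenger ID to become the index.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt standing in for train.csv.
csv_text = "PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n"
mapping = {
    'PassengerId': '乘客ID',
    'Survived': '是否幸存',
    'Pclass': '乘客等级(1/2/3等舱位)',
}

# Method 2: rename with an English -> Chinese mapping.
renamed = pd.read_csv(io.StringIO(csv_text)).rename(columns=mapping)

# Method 1: supply the names while reading; header=0 discards the original header row.
named = pd.read_csv(io.StringIO(csv_text), names=list(mapping.values()), header=0)

# Use the passenger ID as the index.
indexed = renamed.set_index('乘客ID')
print(indexed)
```

rename with a mapping is the safer choice when the column order is not guaranteed, because it matches by name rather than by position.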
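1.2 First look
After importing the data, you will usually want an overview of its overall structure and a few sample rows: how large it is, how many columns it has, what type each column is, whether it contains nulls, and so on.
1.2.1 Task 1: View the basic information of the data
df.info()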
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 乘客ID 891 non-null int64
1 是否幸存 891 non-null int64
2 乘客等级(1/2/3等舱位) 891 non-null int64
3 乘客姓名 891 non-null object
4 性别 891 non-null object
5 年龄 714 non-null float64
6 堂兄弟/妹个数 891 non-null int64
7 父母与小孩个数 891 non-null int64
8 船票信息 891 non-null object
9 票价 891 non-null float64
10 客舱 204 non-null object
11 登船港口 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
df.describe()
       乘客ID  是否幸存  乘客等级(1/2/3等舱位)  年龄  堂兄弟/妹个数  父母与小孩个数  票价
count  891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   446.000000  0.383838  2.308642  29.699118  0.523008  0.381594  32.204208
std    257.353842  0.486592  0.836071  14.526497  1.102743  0.806057  49.693429
min    1.000000  0.000000  1.000000  0.420000  0.000000  0.000000  0.000000
25%    223.500000  0.000000  2.000000  20.125000  0.000000  0.000000  7.910400
50%    446.000000  0.000000  3.000000  28.000000  0.000000  0.000000  14.454200
75%    668.500000  1.000000  3.000000  38.000000  1.000000  0.000000  31.000000
max    891.000000  1.000000  3.000000  80.000000  8.000000  6.000000  512.329200
[Hint] Several functions can do this; try summarizing them yourself.
1.2.2 Task 2: Look at the first 10 rows and the last 15 rows of the table
df.head(10)
   乘客ID  是否幸存  乘客等级(1/2/3等舱位)  乘客姓名  性别  年龄  堂兄弟/妹个数  父母与小孩个数  船票信息  票价  客舱  登船港口
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
5  6  0  3  Moran, Mr. James  male  NaN  0  0  330877  8.4583  NaN  Q
6  7  0  1  McCarthy, Mr. Timothy J  male  54.0  0  0  17463  51.8625  E46  S
7  8  0  3  Palsson, Master. Gosta Leonard  male  2.0  3  1  349909  21.0750  NaN  S
8  9  1  3  Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)  female  27.0  0  2  347742  11.1333  NaN  S
9  10  1  2  Nasser, Mrs. Nicholas (Adele Achem)  female  14.0  1  0  237736  30.0708  NaN  C
df.tail(15)
     乘客ID  是否幸存  乘客等级(1/2/3等舱位)  乘客姓名  性别  年龄  堂兄弟/妹个数  父母与小孩个数  船票信息  票价  客舱  登船港口
876  877  0  3  Gustafsson, Mr. Alfred Ossian  male  20.0  0  0  7534  9.8458  NaN  S
877  878  0  3  Petroff, Mr. Nedelio  male  19.0  0  0  349212  7.8958  NaN  S
878  879  0  3  Laleff, Mr. Kristo  male  NaN  0  0  349217  7.8958  NaN  S
879  880  1  1  Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)  female  56.0  0  1  11767  83.1583  C50  C
880  881  1  2  Shelley, Mrs. William (Imanita Parrish Hall)  female  25.0  0  1  230433  26.0000  NaN  S
881  882  0  3  Markun, Mr. Johann  male  33.0  0  0  349257  7.8958  NaN  S
882  883  0  3  Dahlberg, Miss. Gerda Ulrika  female  22.0  0  0  7552  10.5167  NaN  S
883  884  0  2  Banfield, Mr. Frederick James  male  28.0  0  0  C.A./SOTON 34068  10.5000  NaN  S
884  885  0  3  Sutehall, Mr. Henry Jr  male  25.0  0  0  SOTON/OQ 392076  7.0500  NaN  S
885  886  0  3  Rice, Mrs. William (Margaret Norton)  female  39.0  0  5  382652  29.1250  NaN  Q
886  887  0  2  Montvila, Rev. Juozas  male  27.0  0  0  211536  13.0000  NaN  S
887  888  1  1  Graham, Miss. Margaret Edith  female  19.0  0  0  112053  30.0000  B42  S
888  889  0  3  Johnston, Miss. Catherine Helen "Carrie"  female  NaN  1  2  W./C. 6607  23.4500  NaN  S
889  890  1  1  Behr, Mr. Karl Howell  male  26.0  0  0  111369  30.0000  C148  C
890  891  0  3  Dooley, Mr. Patrick  male  32.0  0  0  370376  7.7500  NaN  Q
1.2.3 Task 3: Check whether the data is null, returning True where it is null and False elsewhere
df.isnull()
     乘客ID  是否幸存  乘客等级(1/2/3等舱位)  乘客姓名  性别  年龄  堂兄弟/妹个数  父母与小孩个数  船票信息  票价  客舱  登船港口
0    False  False  False  False  False  False  False  False  False  False  True  False
1    False  False  False  False  False  False  False  False  False  False  False  False
2    False  False  False  False  False  False  False  False  False  False  True  False
3    False  False  False  False  False  False  False  False  False  False  False  False
4    False  False  False  False  False  False  False  False  False  False  True  False
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
886  False  False  False  False  False  False  False  False  False  False  True  False
887  False  False  False  False  False  False  False  False  False  False  False  False
888  False  False  False  False  False  True  False  False  False  False  True  False
889  False  False  False  False  False  False  False  False  False  False  False  False
890  False  False  False  False  False  False  False  False  False  False  True  False
891 rows × 12 columns
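A full True/False table is hard to read at a glance, so a common follow-up is to count the missing values per column. The sketch below does this on a small made-up frame (the values are hypothetical; only the column names echo the Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with some missing values.
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.2833, 7.925, 8.05],
})

# True counts as 1, so summing the boolean mask counts the nulls in each column.
missing_counts = sample.isnull().sum()
# The mean of the mask is the share of missing values in each column.
missing_share = sample.isnull().mean()
print(missing_counts)
print(missing_share)
```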
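[Summary] The operations above are all ways of observing the data itself during data analysis.
[Think] From what other angles can a dataset be observed? Look for answers; this will help a great deal with the analysis that follows.
Answer: We can also look at the number of unique values in each column and how often each unique value occurs.
for col in df.columns:
    print(df[col].value_counts())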
891 1
293 1
304 1
303 1
302 1
..
591 1
590 1
589 1
588 1
1 1
Name: 乘客ID, Length: 891, dtype: int64
0 549
1 342
Name: 是否幸存, dtype: int64
3 491
1 216
2 184
Name: 乘客等级(1/2/3等舱位), dtype: int64
Davies, Master. John Morgan Jr 1
Carr, Miss. Helen "Ellen" 1
Sjostedt, Mr. Ernst Adolf 1
Norman, Mr. Robert Douglas 1
Giglio, Mr. Victor 1
..
Baxter, Mr. Quigg Edmond 1
Phillips, Miss. Kate Florence ("Mrs Kate Louise Phillips Marshall") 1
Sivola, Mr. Antti Wilhelm 1
Jensen, Mr. Hans Peder 1
McGowan, Miss. Anna "Annie" 1
Name: 乘客姓名, Length: 891, dtype: int64
male 577
female 314
Name: 性别, dtype: int64
24.00 30
22.00 27
18.00 26
19.00 25
30.00 25
..
55.50 1
70.50 1
66.00 1
23.50 1
0.42 1
Name: 年龄, Length: 88, dtype: int64
0 608
1 209
2 28
4 18
3 16
8 7
5 5
Name: 堂兄弟/妹个数, dtype: int64
0 678
1 118
2 80
5 5
3 5
4 4
6 1
Name: 父母与小孩个数, dtype: int64
1601 7
CA. 2343 7
347082 7
CA 2144 6
347088 6
..
3101276 1
35852 1
237671 1
F.C. 12750 1
PC 17756 1
Name: 船票信息, Length: 681, dtype: int64
8.0500 43
13.0000 42
7.8958 38
7.7500 34
26.0000 31
..
8.4583 1
9.8375 1
8.3625 1
14.1083 1
17.4000 1
Name: 票价, Length: 248, dtype: int64
C23 C25 C27 4
B96 B98 4
G6 4
E101 3
D 3
..
C7 1
C82 1
E49 1
E46 1
E31 1
Name: 客舱, Length: 147, dtype: int64
S 644
C 168
Q 77
Name: 登船港口, dtype: int64
1.3 Saving the data
1.3.1 Task 1: Save the data you loaded and modified as a new file, train_chinese.csv, in the working directory
df.to_csv('train_chinese.csv')
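By default to_csv also writes the row index as an unnamed first column, which is why a column called 'Unnamed: 0' shows up when train_chinese.csv is read back later. The sketch below shows the difference on a two-row made-up frame, writing to in-memory buffers instead of files. Whether to pass index=False is a judgment call; the rest of these notes keep the default.

```python
import io
import pandas as pd

# Hypothetical two-row frame using two of the translated column names.
sample = pd.DataFrame({'乘客ID': [1, 2], '是否幸存': [0, 1]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out, so the round trip returns the same columns.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```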
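[Summary] That covers loading data and getting started. Next we move on to computing with the data itself, focusing on how numpy and pandas are used in work and project settings.
Review: We have finished the first step of data analysis, loading the data. Once the data is in front of us, the first thing to do is get to know it. Today we learn to understand what the fields mean and take a first look at the data.
1 Chapter 1: Loading Data and First Look
1.4 Know what your data is called
Textbook: Python for Data Analysis, Chapter 5
Before starting, import numpy and pandas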
import numpy as np
import pandas as pd
np.__version__, pd.__version__
('1.18.1', '1.0.5')
1.4.1 Task 1: pandas has two main data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of your own using both 🌰 [open-ended]
s = pd.Series(np.random.randn(5))
s.index
RangeIndex(start=0, stop=5, step=1)
s.to_frame()
          0
0 -0.195489
1 -0.507748
2 -0.805316
3  0.196314
4  2.694240
df = pd.DataFrame(np.random.randn(24).reshape((4, 6)))
df
          0         1         2         3         4         5
0 -0.205274 -0.288531  1.345097 -2.155370  1.496531 -0.240410
1 -0.592067 -0.367727 -0.516193  0.170073  0.156564 -0.017960
2 -0.283009 -0.207576  1.127535  1.574086 -0.907149  0.479012
3  0.274194 -0.534830 -0.110950 -0.836543  0.031358  0.175774
1.4.2 Task 2: Using the method from the previous lesson, load the "train.csv" file and the "train_chinese.csv" file saved in the previous lesson
df = pd.read_csv('train.csv')
df.head()
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
Having used the translated train_chinese.csv to get familiar with the dataset, we now work with train.csv.
1.4.3 Task 3: View the column names of the DataFrame
df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
1.4.4 Task 4: View all entries of the "Cabin" column [there are several ways]
df['Cabin']
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
df.Cabin
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
df.iloc[:, 10]
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
df.loc[:, 'Cabin']
0 NaN
1 C85
2 NaN
3 C123
4 NaN
...
886 NaN
887 B42
888 NaN
889 C148
890 NaN
Name: Cabin, Length: 891, dtype: object
On inspection, the test set test_1.csv has an extra column, and we need to delete it.
1.4.5 Task 5: Load the file "test_1.csv", compare it with "train.csv" to see which columns are extra, and then delete the extra columns
test_1 = pd.read_csv('test_1.csv')
test_1.head()
   Unnamed: 0  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked  a
0  0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S  100
1  1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C  100
2  2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S  100
3  3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S  100
4  4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S  100
train = pd.read_csv('train.csv')
train.head()
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
test_1.columns
Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked', 'a'],
dtype='object')
del test_1['a']
test_1.head()
   Unnamed: 0  PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0  0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
1.4.6 Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns
df.drop(['PassengerId', 'Name', 'Age', 'Ticket'], axis=1).head()
   Survived  Pclass     Sex  SibSp  Parch     Fare Cabin Embarked
0         0       3    male      1      0   7.2500   NaN        S
1         1       1  female      1      0  71.2833   C85        C
2         1       3  female      0      0   7.9250   NaN        S
3         1       1  female      1      0  53.1000  C123        S
4         0       3    male      0      0   8.0500   NaN        S
[Think] Compare Task 5 and Task 6: did they use different methods (functions)? How could the same function be used to meet both of these different requirements?
test_1 = pd.read_csv('test_1.csv')
train = pd.read_csv('train.csv')
test_1[train.columns].head()
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
df[[col for col in list(df.columns) if col not in ['PassengerId', 'Name', 'Age', 'Ticket']]].head()
   Survived  Pclass     Sex  SibSp  Parch     Fare Cabin Embarked
0         0       3    male      1      0   7.2500   NaN        S
1         1       1  female      1      0  71.2833   C85        C
2         1       3  female      0      0   7.9250   NaN        S
3         1       1  female      1      0  53.1000  C123        S
4         0       3    male      0      0   8.0500   NaN        S
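Another possible answer to the question is to use drop for both tasks: reassigning the result deletes the columns for good, while calling drop without reassigning only hides them in the displayed copy. The sketch below uses small made-up frames in place of train.csv and test_1.csv (the rows are hypothetical).

```python
import pandas as pd

# Hypothetical stand-ins for train.csv and test_1.csv.
train = pd.DataFrame({
    'PassengerId': [1, 2],
    'Name': ['A', 'B'],
    'Sex': ['male', 'female'],
    'Fare': [7.25, 71.2833],
})
test_1 = train.copy()
test_1.insert(0, 'Unnamed: 0', [0, 1])  # leftover index column
test_1['a'] = 100                        # extra column

# Task 5 with drop: find the columns train does not have and remove them for good.
extra = [c for c in test_1.columns if c not in train.columns]
test_1 = test_1.drop(columns=extra)

# Task 6 with drop: the result is a new frame, so train itself is unchanged.
view = train.drop(columns=['PassengerId', 'Name'])

print(extra)
print(list(view.columns), list(train.columns))
```

Note that comparing against train.columns flags both 'Unnamed: 0' and 'a' as extra, whereas del test_1['a'] removes only one of them.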
1.5 The logic of axes
1.5.1 Task 1: Using "Age" as the filter condition, show the passengers younger than 10.
df[df.Age < 10].head()
    PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
7   8  0  3  Palsson, Master. Gosta Leonard  male  2.0  3  1  349909  21.0750  NaN  S
10  11  1  3  Sandstrom, Miss. Marguerite Rut  female  4.0  1  1  PP 9549  16.7000  G6  S
16  17  0  3  Rice, Master. Eugene  male  2.0  4  1  382652  29.1250  NaN  Q
24  25  0  3  Palsson, Miss. Torborg Danira  female  8.0  3  1  349909  21.0750  NaN  S
43  44  1  2  Laroche, Miss. Simonne Marie Anne Andree  female  3.0  1  2  SC/Paris 2123  41.5792  NaN  C
1.5.2 Task 2: Using "Age" as the condition, show the passengers older than 10 and younger than 50, and name this data midage
midage = df[(df.Age > 10) & (df.Age < 50)]
midage.head()
   PassengerId  Survived  Pclass  Name  Sex  Age  SibSp  Parch  Ticket  Fare  Cabin  Embarked
0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
[Hint] Learn how pandas conditional filtering works and how to combine conditions with intersection (&) and union (|).
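The intersection and union operators can be tried on a small made-up Age column (hypothetical values, including one missing age). Each condition needs its own parentheses because & and | bind more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including a boundary value on each side and one NaN.
people = pd.DataFrame({'Age': [2.0, 10.0, 35.0, 50.0, 64.0, np.nan]})

# Intersection: both conditions must hold.
both = people[(people.Age > 10) & (people.Age < 50)]
# Union: at least one condition must hold.
either = people[(people.Age <= 10) | (people.Age >= 50)]
# Negation with ~: everything the intersection did not select.
negated = people[~((people.Age > 10) & (people.Age < 50))]

print(len(both), len(either), len(negated))
```

Comparisons with NaN are always False, so the row with a missing age appears in neither `both` nor `either`, but it does appear in `negated`.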
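1.5.3 Task 3: Show the "Pclass" and "Sex" values of row 100 of midage
midage.iloc[100, :].loc[['Pclass', 'Sex']]
Pclass 2
Sex male
Name: 149, dtype: object
midage.loc[[100], ['Pclass', 'Sex']]
[Note] These two calls select different rows. iloc selects by position, so iloc[100] is the 101st row of midage, whose index label is 149. loc selects by index label, so loc[[100]] is the row labelled 100. They differ because midage keeps the index labels of the original df after filtering.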
1.5.4 Task 4: Show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage
midage.loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]
     Pclass                     Name     Sex
100       3  Petranec, Miss. Matilda  female
105       3    Mionoff, Mr. Stoytcho    male
108       3          Rekic, Mr. Tido    male
[Hint] Use the simple approach pandas provides; take a look at the loc method.
1.5.5 Task 5: Use the iloc method to show the "Pclass", "Name" and "Sex" values of rows 100, 105 and 108 of midage
midage.iloc[[100, 105, 108], [2, 3, 4]]
     Pclass                               Name   Sex
149       2  Byles, Rev. Thomas Roussel Davids  male
160       3           Cribb, Mr. John Hatfield  male
163       3                    Calic, Mr. Jovo  male
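The loc and iloc results above differ because the filtered frame keeps its original index labels. The sketch below reproduces the effect on six made-up rows and shows that reset_index(drop=True) makes labels and positions agree again. Whether to reset the index is a design choice; it is an assumption here, not something the tasks above do.

```python
import pandas as pd

# Hypothetical six-passenger sample.
df = pd.DataFrame({
    'Age': [22, 5, 38, 60, 26, 35],
    'Sex': ['male', 'male', 'female', 'male', 'female', 'female'],
})

# Filtering keeps the original labels: the surviving rows are labelled 0, 2, 4, 5.
midage = df[(df.Age > 10) & (df.Age < 50)]

by_position = midage.iloc[1]  # second row of midage, whose label is 2
by_label = midage.loc[2]      # the row labelled 2

# After resetting the index, labels run 0..n-1 and match positions.
midage_reset = midage.reset_index(drop=True)

print(midage.index.tolist(), by_position.name)
print(midage_reset.loc[1, 'Age'], midage_reset.iloc[1]['Age'])
```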
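Review: So far we have covered the pandas basics and know how to read CSV data with pandas and add, delete, look up, and modify it. Today's topic is exploratory data analysis: how to use pandas for sorting and arithmetic, and how to use describe() to compute descriptive statistics.
1 Chapter 1: Exploratory Data Analysis
Before starting, import numpy and pandas and load the data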
import numpy as np
import pandas as pd
np.__version__, pd.__version__
('1.18.1', '1.0.5')
text = pd.read_csv('train_chinese.csv')
text.head()
   Unnamed: 0  乘客ID  是否幸存  乘客等级(1/2/3等舱位)  乘客姓名  性别  年龄  堂兄弟/妹个数  父母与小孩个数  船票信息  票价  客舱  登船港口
0  0  1  0  3  Braund, Mr. Owen Harris  male  22.0  1  0  A/5 21171  7.2500  NaN  S
1  1  2  1  1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1  0  PC 17599  71.2833  C85  C
2  2  3  1  3  Heikkinen, Miss. Laina  female  26.0  0  0  STON/O2. 3101282  7.9250  NaN  S
3  3  4  1  1  Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0  1  0  113803  53.1000  C123  S
4  4  5  0  3  Allen, Mr. William Henry  male  35.0  0  0  373450  8.0500  NaN  S
1.6 Do you understand your data?
Textbook: Python for Data Analysis, Chapter 5
1.6.1 Task 1: Sort the sample data with pandas in ascending order
df = pd.DataFrame(np.arange(24).reshape((4, 6)),
                  index=[5, 6, 4, 9],
                  columns=['d', 'f', 'b', 'c', 'g', 'a'])
df
    d   f   b   c   g   a
5   0   1   2   3   4   5
6   6   7   8   9  10  11
4  12  13  14  15  16  17
9  18  19  20  21  22  23
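# The task asks for ascending order, so ascending=True (the default) is used;
# ascending=False would sort in descending order instead.
df.sort_values(by='c', ascending=True)
    d   f   b   c   g   a
5   0   1   2   3   4   5
6   6   7   8   9  10  11
4  12  13  14  15  16  17
9  18  19  20  21  22  23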
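[Think] From the book, can you say how many sorting functions pandas offers for DataFrame data? How is each used, and how do they differ?
[Summary] The different ways of sorting are summarized below
# Sort the rows by index label (ascending by default)
df.sort_index()
    d   f   b   c   g   a
4  12  13  14  15  16  17
5   0   1   2   3   4   5
6   6   7   8   9  10  11
9  18  19  20  21  22  23
# Sort the columns by column name
df.sort_index(axis=1)
    a   b   c   d   f   g
5   5   2   3   0   1   4
6  11   8   9   6   7  10
4  17  14  15  12  13  16
9  23  20  21  18  19  22
# Sort the columns by column name in descending order
df.sort_index(axis=1, ascending=False)
# Sort the rows by the values of column 'a', breaking ties with column 'c'
df.sort_values(by=['a', 'c'])
    d   f   b   c   g   a
5   0   1   2   3   4   5
6   6   7   8   9  10  11
4  12  13  14  15  16  17
9  18  19  20  21  22  23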
1.6.2 Task 2: Sort the Titanic data (train.csv) by fare and age together (descending). What can you find in the data?
text.sort_values(by=['票价', '年龄'], ascending=False)
     Unnamed: 0  乘客ID  是否幸存  乘客等级(1/2/3等舱位)  乘客姓名  性别  年龄  堂兄弟/妹个数  父母与小孩个数  船票信息  票价  客舱  登船港口
679  679  680  1  1  Cardeza, Mr. Thomas Drake Martinez  male  36.0  0  1  PC 17755  512.3292  B51 B53 B55  C
258  258  259  1  1  Ward, Miss. Anna  female  35.0  0  0  PC 17755  512.3292  NaN  C
737  737  738  1  1  Lesurer, Mr. Gustave J  male  35.0  0  0  PC 17755  512.3292  B101  C
438  438  439  0  1  Fortune, Mr. Mark  male  64.0  1  4  19950  263.0000  C23 C25 C27  S
341  341  342  1  1  Fortune, Miss. Alice Elizabeth  female  24.0  3  2  19950  263.0000  C23 C25 C27  S
...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...  ...
481  481  482  0  2  Frost, Mr. Anthony Wood "Archie"  male  NaN  0  0  239854  0.0000  NaN  S
633  633  634  0  1  Parr, Mr. William Henry Marsh  male  NaN  0  0  112052  0.0000  NaN  S
674  674  675  0  2  Watson, Mr. Ennis Hastings  male  NaN  0  0  239856  0.0000  NaN  S
732  732  733  0  2  Knight, Mr. Robert J  male  NaN  0  0  239855  0.0000  NaN  S
815  815  816  0  1  Fry, Mr. Richard  male  NaN  0  0  112058  0.0000  B102  S
891 rows × 13 columns
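[Think] After sorting, if we look beyond just the age and fare columns, we would expect a higher fare to mean a better cabin. We can see that 14 of the 20 passengers with the highest fares survived, which is a fairly high proportion. So could we go on to analyze the relationship between fare and survival, and between age and survival? Of course, this is only my idea; you may have more of your own, and you are welcome to write them in your study notes.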
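A count such as "survivors among the top-N fares" can be computed rather than read off the table. The sketch below does this on six made-up rows (hypothetical values, English column names) and compares the survival rate of the highest-fare group with the overall rate.

```python
import pandas as pd

# Hypothetical sample: fares and survival flags for six passengers.
sample = pd.DataFrame({
    'Fare': [512.3, 263.0, 7.25, 8.05, 30.0, 0.0],
    'Survived': [1, 0, 0, 0, 1, 0],
})

# Take the N highest fares, then average the 0/1 survival flag to get a rate.
top = sample.sort_values(by='Fare', ascending=False).head(3)
top_rate = top['Survived'].mean()
overall_rate = sample['Survived'].mean()

print(top['Survived'].sum(), top_rate, overall_rate)
```

On the real data the same pattern would use head(20) and the translated column names '票价' and '是否幸存'.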
1.6.3 Task 3: Do arithmetic with pandas: compute the sum of two DataFrames
df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['a', 'b', 'c'],
                   index=['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   columns=['a', 'e', 'c'],
                   index=['first', 'one', 'two', 'second'])
df1
       a  b  c
one    0  1  2
two    3  4  5
three  6  7  8
df2
        a   e   c
first   0   1   2
one     3   4   5
two     6   7   8
second  9  10  11
df1 + df2
          a   b     c   e
first   NaN NaN   NaN NaN
one     3.0 NaN   7.0 NaN
second  NaN NaN   NaN NaN
three   NaN NaN   NaN NaN
two     9.0 NaN  13.0 NaN
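[Note] Adding two DataFrames returns a new DataFrame. Values whose row and column labels exist in both frames are added; every position that is missing from either frame becomes NaN. DataFrame supports many other arithmetic operations, such as subtraction and division. If you are interested, see the "Arithmetic and Data Alignment" section in Chapter 5 of Python for Data Analysis, and look for more learning material online.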
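If the NaN values are not wanted, one option is the add method with fill_value, which treats a value missing from one frame as the given number. The sketch below rebuilds the same df1 and df2 and compares the two behaviours; using fill_value=0 is an illustrative choice here, not something the task requires.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(9).reshape(3, 3),
                   columns=['a', 'b', 'c'],
                   index=['one', 'two', 'three'])
df2 = pd.DataFrame(np.arange(12).reshape(4, 3),
                   columns=['a', 'e', 'c'],
                   index=['first', 'one', 'two', 'second'])

# Plain addition: only labels present in both frames produce a number.
plain = df1 + df2

# add with fill_value=0: a value missing from one frame is treated as 0.
# Cells missing from both frames (e.g. row 'first', column 'b') are still NaN.
filled = df1.add(df2, fill_value=0)

print(plain)
print(filled)
```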
1.6.4 Task 4: Using the Titanic data, how do we work out the size of the largest family on board?
text.columns
Index(['Unnamed: 0', '乘客ID', '是否幸存', '乘客等级(1/2/3等舱位)', '乘客姓名', '性别', '年龄',
'堂兄弟/妹个数', '父母与小孩个数', '船票信息', '票价', '客舱', '登船港口'],
dtype='object')
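'''
Still using the train_chinese.csv loaded earlier: if we want to see how many people
the largest family on board has ('number of siblings' + 'number of parents and children'),
what should we do?
'''
max(text['堂兄弟/妹个数'] + text['父母与小孩个数'])
10
Yes, as shown above, it is simple: we only need the largest sum of the siblings count and the parents-and-children count. Adding the two columns returns a Series, and max then gives its largest value. Note that this sum counts a passenger's relatives on board and does not include the passenger. You can surely think of other methods and angles, and you are welcome to share your views.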
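A small variation on the same idea, sketched on three made-up rows: keep the sum as a Series, use idxmax to see which passenger it belongs to, and add 1 if the passenger should be counted as part of the family. The names and values below are hypothetical, and counting the passenger is an interpretation rather than part of the original answer.

```python
import pandas as pd

# Hypothetical sample using the original English column names.
sample = pd.DataFrame({
    'Name': ['A', 'B', 'C'],
    'SibSp': [1, 8, 0],
    'Parch': [0, 2, 0],
})

# Adding two columns gives a Series aligned on the row index.
relatives = sample['SibSp'] + sample['Parch']

largest = relatives.max()        # most relatives any passenger has on board
family_size = largest + 1        # including the passenger (an interpretation)
who = sample.loc[relatives.idxmax(), 'Name']  # passenger with the most relatives

print(type(relatives).__name__, largest, family_size, who)
```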
1.6.5 Task 5: Learn to use the pandas describe() function to view basic statistics of the data
text.describe()
       Unnamed: 0  乘客ID  是否幸存  乘客等级(1/2/3等舱位)  年龄  堂兄弟/妹个数  父母与小孩个数  票价
count  891.000000  891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean   445.000000  446.000000  0.383838  2.308642  29.699118  0.523008  0.381594  32.204208
std    257.353842  257.353842  0.486592  0.836071  14.526497  1.102743  0.806057  49.693429
min    0.000000  1.000000  0.000000  1.000000  0.420000  0.000000  0.000000  0.000000
25%    222.500000  223.500000  0.000000  2.000000  20.125000  0.000000  0.000000  7.910400
50%    445.000000  446.000000  0.000000  3.000000  28.000000  0.000000  0.000000  14.454200
75%    667.500000  668.500000  1.000000  3.000000  38.000000  1.000000  0.000000  31.000000
max    890.000000  891.000000  1.000000  3.000000  80.000000  8.000000  6.000000  512.329200
1.6.6 Task 6: Look at the basic statistics of the fare column and the parents/children column of the Titanic dataset separately. What can you find?
text['票价'].describe()
count 891.000000
mean 32.204208
std 49.693429
min 0.000000
25% 7.910400
50% 14.454200
75% 31.000000
max 512.329200
Name: 票价, dtype: float64
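[Think] From the data above we can see that there are 891 fare values in total, with a mean of about 32.20 and a standard deviation of about 49.69, so fares vary widely. 25% of passengers paid less than 7.91, 50% paid less than 14.45, and 75% paid less than 31.00. The maximum fare is about 512.33 and the minimum is 0. Of course, this is only my idea; you may have more of your own, and you are welcome to write them in your study notes.
text['父母与小孩个数'].describe()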
count 891.000000
mean 0.381594
std 0.806057
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 6.000000
Name: 父母与小孩个数, dtype: float64
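[Think] If you have more ideas, you are welcome to write them in your study notes.
[Summary] In this section we used some of pandas' built-in functions to take a first statistical look at the data. The most important thing is not memorizing these functions but being able to read the numbers they produce and building your own data analysis mindset. This is the key point of Chapter 1. We hope that after finishing it you have a basic feel for the data and understand what you are doing and why. In the following chapters we will start cleaning the data and analyzing it further.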