复习:在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了数据的清洗,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。
开始之前,导入numpy、pandas包和数据
# 导入基本库
import numpy as np
import pandas as pd
# 载入data文件中的:train-left-up.csv
train_left_up_csv=pd.read_csv('./data/train-left-up.csv')
train_left_up_csv
PassengerId | Survived | Pclass | Name | |
---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
4 | 5 | 0 | 3 | Allen, Mr. William Henry |
... | ... | ... | ... | ... |
434 | 435 | 0 | 1 | Silvey, Mr. William Baird |
435 | 436 | 1 | 1 | Carter, Miss. Lucile Polk |
436 | 437 | 0 | 3 | Ford, Miss. Doolina Margaret "Daisy" |
437 | 438 | 1 | 2 | Richards, Mrs. Sidney (Emily Hocking) |
438 | 439 | 0 | 1 | Fortune, Mr. Mark |
439 rows × 4 columns
2 第二章:数据重构
2.4 数据的合并
2.4.1 任务一:将data文件夹里面的所有数据都载入,观察数据的之间的关系
#写入代码
train_left_down_csv=pd.read_csv('./data/train-left-down.csv')
train_left_down_csv.head(20)
PassengerId | Survived | Pclass | Name | |
---|---|---|---|---|
0 | 440 | 0 | 2 | Kvillner, Mr. Johan Henrik Johannesson |
1 | 441 | 1 | 2 | Hart, Mrs. Benjamin (Esther Ada Bloomfield) |
2 | 442 | 0 | 3 | Hampe, Mr. Leon |
3 | 443 | 0 | 3 | Petterson, Mr. Johan Emil |
4 | 444 | 1 | 2 | Reynaldo, Ms. Encarnacion |
5 | 445 | 1 | 3 | Johannesen-Bratthammer, Mr. Bernt |
6 | 446 | 1 | 1 | Dodge, Master. Washington |
7 | 447 | 1 | 2 | Mellinger, Miss. Madeleine Violet |
8 | 448 | 1 | 1 | Seward, Mr. Frederic Kimber |
9 | 449 | 1 | 3 | Baclini, Miss. Marie Catherine |
10 | 450 | 1 | 1 | Peuchen, Major. Arthur Godfrey |
11 | 451 | 0 | 2 | West, Mr. Edwy Arthur |
12 | 452 | 0 | 3 | Hagland, Mr. Ingvald Olai Olsen |
13 | 453 | 0 | 1 | Foreman, Mr. Benjamin Laventall |
14 | 454 | 1 | 1 | Goldenberg, Mr. Samuel L |
15 | 455 | 0 | 3 | Peduzzi, Mr. Joseph |
16 | 456 | 1 | 3 | Jalsevac, Mr. Ivan |
17 | 457 | 0 | 1 | Millet, Mr. Francis Davis |
18 | 458 | 1 | 1 | Kenyon, Mrs. Frederick R (Marion) |
19 | 459 | 1 | 2 | Toomey, Miss. Ellen |
#写入代码
train_right_down_csv=pd.read_csv('./data/train-right-down.csv')
train_right_down_csv.head(20)
Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|
0 | male | 31.0 | 0 | 0 | C.A. 18723 | 10.5000 | NaN | S |
1 | female | 45.0 | 1 | 1 | F.C.C. 13529 | 26.2500 | NaN | S |
2 | male | 20.0 | 0 | 0 | 345769 | 9.5000 | NaN | S |
3 | male | 25.0 | 1 | 0 | 347076 | 7.7750 | NaN | S |
4 | female | 28.0 | 0 | 0 | 230434 | 13.0000 | NaN | S |
5 | male | NaN | 0 | 0 | 65306 | 8.1125 | NaN | S |
6 | male | 4.0 | 0 | 2 | 33638 | 81.8583 | A34 | S |
7 | female | 13.0 | 0 | 1 | 250644 | 19.5000 | NaN | S |
8 | male | 34.0 | 0 | 0 | 113794 | 26.5500 | NaN | S |
9 | female | 5.0 | 2 | 1 | 2666 | 19.2583 | NaN | C |
10 | male | 52.0 | 0 | 0 | 113786 | 30.5000 | C104 | S |
11 | male | 36.0 | 1 | 2 | C.A. 34651 | 27.7500 | NaN | S |
12 | male | NaN | 1 | 0 | 65303 | 19.9667 | NaN | S |
13 | male | 30.0 | 0 | 0 | 113051 | 27.7500 | C111 | C |
14 | male | 49.0 | 1 | 0 | 17453 | 89.1042 | C92 | C |
15 | male | NaN | 0 | 0 | A/5 2817 | 8.0500 | NaN | S |
16 | male | 29.0 | 0 | 0 | 349240 | 7.8958 | NaN | C |
17 | male | 65.0 | 0 | 0 | 13509 | 26.5500 | E38 | S |
18 | female | NaN | 1 | 0 | 17464 | 51.8625 | D21 | S |
19 | female | 50.0 | 0 | 0 | F.C.C. 13531 | 10.5000 | NaN | S |
train_left_up_csv=pd.read_csv('./data/train-left-up.csv')
train_left_up_csv.head(20)
PassengerId | Survived | Pclass | Name | |
---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) |
4 | 5 | 0 | 3 | Allen, Mr. William Henry |
5 | 6 | 0 | 3 | Moran, Mr. James |
6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J |
7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard |
8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) |
9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) |
10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut |
11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth |
12 | 13 | 0 | 3 | Saundercock, Mr. William Henry |
13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan |
14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina |
15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) |
16 | 17 | 0 | 3 | Rice, Master. Eugene |
17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene |
18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... |
19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima |
train_right_up_csv=pd.read_csv('./data/train-right-up.csv')
train_right_up_csv.head(20)
Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|
0 | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
10 | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
11 | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
12 | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
14 | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
15 | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
16 | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
17 | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
18 | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
19 | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
【提示】结合之前我们加载的train.csv数据,大致预测一下上面的数据是什么
2.4.2:任务二:使用concat方法:将数据train-left-up.csv和train-right-up.csv横向合并为一张表,并保存这张表为result_up
#写入代码 当axis = 1的时候,concat就是行对齐,然后将不同列名称的两张表合并
# 加上join参数的属性,如果为’inner’得到的是两表的交集,如果是outer,得到的是两表的并集。
result_up_csv=pd.concat([train_left_up_csv,train_right_up_csv],axis=1,join='inner')
result_up_csv.to_csv('./data/result_up.csv')
2.4.3 任务三:使用concat方法:将train-left-down和train-right-down横向合并为一张表,并保存这张表为result_down。然后将上边的result_up和result_down纵向合并为result。
#写入代码
result_down_csv=pd.concat([train_left_down_csv,train_right_down_csv],axis=1,join='inner')
result_down_csv.to_csv('./data/result_down.csv')
result=pd.concat([result_up_csv,result_down_csv],axis=0)
result
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
2.4.4 任务四:使用DataFrame自带的方法join方法和append:完成任务二和任务三的任务
#写入代码 join中没有axis参数,所以该方法只能实现按行拼接。
# append 向DataFrame中加入新的行,如果列名不在对象中,则加入新的列。
result_up = train_left_up_csv.join(train_right_up_csv)
result_down = train_left_down_csv.join(train_right_down_csv)
result = result_up.append(result_down)
result
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
2.4.5 任务五:使用Panads的merge方法和DataFrame的append方法:完成任务二和任务三的任务
#写入代码
# merge()(通常以left.merge(right)方式使用)方法和join()功能上有些类似,但merge()方法中caller可以和right中的任意index或column进行匹配
result_up=train_left_up_csv.merge(train_right_up_csv,left_index=True,right_index=True)
result_down=train_left_down_csv.merge(train_right_down_csv,left_index=True,right_index=True)
result=resul_up.append(result_down)
result
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 12 columns
【思考】对比merge、join以及concat的方法的不同以及相同。思考一下在任务四和任务五的情况下,为什么都要求使用DataFrame的append方法,如何只要求使用merge或者join可不可以完成任务四和任务五呢?
2.4.6 任务六:完成的数据保存为result.csv
#写入代码
result.to_csv('./data/result.csv')
2.5 换一种角度看数据
2.5.1 任务一:将我们的数据变为Series类型的数据
#写入代码
train_result=pd.read_csv('./data/result.csv')
train_result
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 13 columns
#写入代码
unit_result=train_result.stack()
type(unit_result.head(20))
pandas.core.series.Series
unit_result.to_csv('./data/unit_result.csv')
unit_result_csv=pd.read_csv('./data/unit_result.csv')
unit_result_csv
Unnamed: 0 | Unnamed: 1 | 0 | |
---|---|---|---|
0 | 0 | Unnamed: 0 | 0 |
1 | 0 | PassengerId | 1 |
2 | 0 | Survived | 0 |
3 | 0 | Pclass | 3 |
4 | 0 | Name | Braund, Mr. Owen Harris |
... | ... | ... | ... |
10712 | 890 | SibSp | 0 |
10713 | 890 | Parch | 0 |
10714 | 890 | Ticket | 370376 |
10715 | 890 | Fare | 7.75 |
10716 | 890 | Embarked | Q |
10717 rows × 3 columns
# 笔记 stack和unstack两个函数。stack的意思是堆叠,堆积,unstack即“不要堆叠”,
unit_result_csv_csv=unit_result.unstack()
unit_result_csv_csv.to_csv('./data/unit_result_csv_csv.csv')
unit_result_csv_csv=pd.read_csv('./data/unit_result_csv_csv.csv')
unit_result_csv_csv.head(20)
Unnamed: 0 | Unnamed: 0.1 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
5 | 5 | 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
6 | 6 | 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
7 | 7 | 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.0750 | NaN | S |
8 | 8 | 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
9 | 9 | 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
10 | 10 | 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
11 | 11 | 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
12 | 12 | 12 | 13 | 0 | 3 | Saundercock, Mr. William Henry | male | 20.0 | 0 | 0 | A/5. 2151 | 8.0500 | NaN | S |
13 | 13 | 13 | 14 | 0 | 3 | Andersson, Mr. Anders Johan | male | 39.0 | 1 | 5 | 347082 | 31.2750 | NaN | S |
14 | 14 | 14 | 15 | 0 | 3 | Vestrom, Miss. Hulda Amanda Adolfina | female | 14.0 | 0 | 0 | 350406 | 7.8542 | NaN | S |
15 | 15 | 15 | 16 | 1 | 2 | Hewlett, Mrs. (Mary D Kingcome) | female | 55.0 | 0 | 0 | 248706 | 16.0000 | NaN | S |
16 | 16 | 16 | 17 | 0 | 3 | Rice, Master. Eugene | male | 2.0 | 4 | 1 | 382652 | 29.1250 | NaN | Q |
17 | 17 | 17 | 18 | 1 | 2 | Williams, Mr. Charles Eugene | male | NaN | 0 | 0 | 244373 | 13.0000 | NaN | S |
18 | 18 | 18 | 19 | 0 | 3 | Vander Planke, Mrs. Julius (Emelia Maria Vande... | female | 31.0 | 1 | 0 | 345763 | 18.0000 | NaN | S |
19 | 19 | 19 | 20 | 1 | 3 | Masselmani, Mrs. Fatima | female | NaN | 0 | 0 | 2649 | 7.2250 | NaN | C |
复习:在前面我们已经学习了Pandas基础,第二章我们开始进入数据分析的业务部分,在第二章第一节的内容中,我们学习了数据的清洗,这一部分十分重要,只有数据变得相对干净,我们之后对数据的分析才可以更有力。而这一节,我们要做的是数据重构,数据重构依旧属于数据理解(准备)的范围。
开始之前,导入numpy、pandas包和数据
# 导入基本库
import numpy as np
import pandas as pd
# 载入上一个任务人保存的文件中:result.csv,并查看这个文件
result_csv=pd.read_csv('./data/result.csv')
result_csv
Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
886 | 447 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.0000 | NaN | S |
887 | 448 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.0000 | B42 | S |
888 | 449 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
889 | 450 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
890 | 451 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
891 rows × 13 columns
2 第二章:数据重构
第一部分:数据聚合与运算
2.6 数据运用
2.6.1 任务一:通过教材《Python for Data Analysis》P303、Google or anything来学习了解GroupBy机制
#写入心得
2.4.2:任务二:计算泰坦尼克号男性与女性的平均票价
# 写入代码
result_csv_Sex=result_csv['Fare'].groupby(result_csv['Sex'])
result_csv_Sex=result_csv_Sex.mean()
result_csv_Sex
Sex
female 44.479818
male 25.523893
Name: Fare, dtype: float64
在了解GroupBy机制之后,运用这个机制完成一系列的操作,来达到我们的目的。
下面通过几个任务来熟悉GroupBy机制。
2.4.3:任务三:统计泰坦尼克号中男女的存活人数
# 写入代码
result_Group=result_csv['Survived'].groupby(result_csv['Sex'])
result_Group=result_Group.sum()
result_Group
Sex
female 233
male 109
Name: Survived, dtype: int64
2.4.4:任务四:计算客舱不同等级的存活人数
# 写入代码
result_Group_Ticket=result_csv['Survived'].groupby(result_csv['Pclass'])
result_Group_Ticket.sum()
Pclass
1 136
2 87
3 119
Name: Survived, dtype: int64
【提示:】表中的存活那一栏,可以发现如果还活着记为1,死亡记为0
【思考】从数据分析的角度,上面的统计结果可以得出那些结论
#思考心得
【思考】从任务二到任务三中,这些运算可以通过agg()函数来同时计算。并且可以使用rename函数修改列名。你可以按照提示写出这个过程吗?
#思考心得 agg()函数是聚合函数 如果要不同列用求不同统计量,则用字典{‘行名/列名’,‘函数名’}指定。
result_csv.groupby('Sex').agg({'Fare': 'mean', 'Pclass': 'sum'}).rename(columns=
{'Fare': 'mean_fare', 'Pclass': 'count_pclass'})
mean_fare | count_pclass | |
---|---|---|
Sex | ||
female | 44.479818 | 678 |
male | 25.523893 | 1379 |
2.4.5:任务五:统计在不同等级的票中的不同年龄的船票花费的平均值
# 写入代码
result_Group_Ticket_age=result_csv['Fare'].groupby([result_csv['Pclass'],result_csv['Age']])
result_Group_Ticket_age.mean()
Pclass Age
1 0.92 151.5500
2.00 151.5500
4.00 81.8583
11.00 120.0000
14.00 120.0000
...
3 61.00 6.2375
63.00 9.5875
65.00 7.7500
70.50 7.7500
74.00 7.7750
Name: Fare, Length: 182, dtype: float64
2.4.6:任务六:将任务二和任务三的数据合并,并保存到sex_fare_survived.csv
# 写入代码
result = pd.merge(result_csv_Sex,result_Group,on='Sex')
result
Fare | Survived | |
---|---|---|
Sex | ||
female | 44.479818 | 233 |
male | 25.523893 | 109 |
2.4.7:任务七:得出不同年龄的总的存活人数,然后找出存活人数最多的年龄段,最后计算存活人数最高的存活率(存活人数/总人数)
# 写入代码
survived_age = result_csv['Survived'].groupby(result_csv['Age']).sum()
survived_age.head()
Age
0.42 1
0.67 1
0.75 2
0.83 2
0.92 1
Name: Survived, dtype: int64
# 写入代码
survived_age[survived_age.values==survived_age.max()]
Age
24.0 15
Name: Survived, dtype: int64
# 写入代码
_sum = result_csv['Survived'].sum()
print(_sum)
342
# 写入代码
precetn =survived_age.max()/_sum
precetn
0.043859649122807015