Data Analysis (3): Data Restructuring - CSDN Blog

import pandas as pd

# Load the four quadrants of the training data
text_left_up = pd.read_csv("data2/data/train-left-up.csv")
text_left_down = pd.read_csv("data2/data/train-left-down.csv")
text_right_up = pd.read_csv("data2/data/train-right-up.csv")
text_right_down = pd.read_csv("data2/data/train-right-down.csv")
# Preview each piece (in a notebook, run each head() in its own cell to see all four)
text_left_up.head()
text_left_down.head()
text_right_up.head()
text_right_down.head()

Task 2: Use the concat method to merge train-left-up.csv and train-right-up.csv horizontally into one table, and save that table as result_up

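The concat method:

result = pd.concat([df1, df2], axis=0)  # axis is 0 or 1
① axis=0: concatenate vertically (default)
② axis=1: concatenate horizontally
df_concat = pd.concat([df1, df2], ignore_index=True)
# Vertical concatenation usually needs a freshly generated index.
# Alternatively, keys=['one', 'two'] labels each input with an outer index level; ignore_index=True would discard those keys.
df_concat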
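
To make the two axis values concrete, here is a minimal sketch on two small made-up frames (df1 and df2 below are illustrative data, not the Titanic files). It shows vertical concatenation with a regenerated index, horizontal concatenation, and what keys does when ignore_index is not set.

```python
import pandas as pd

# Two small illustrative frames (hypothetical data)
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})

# axis=0 (default): rows of df2 go below df1; ignore_index=True renumbers them
rows = pd.concat([df1, df2], ignore_index=True)

# axis=1: the frames are placed side by side, aligned on the row index
cols = pd.concat([df1, df2], axis=1)

# keys= adds an outer index level naming the source of each row
keyed = pd.concat([df1, df2], keys=["one", "two"])

print(rows.shape)            # 4 rows, 2 columns
print(cols.shape)            # 2 rows, 4 columns
print(keyed.loc["two"])      # only the rows that came from df2
```

With axis=1 the rows are matched by index label rather than by position, so the pieces being glued together should share the same index.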

list_up = [text_left_up,text_right_up]
result_up = pd.concat(list_up,axis=1)
result_up.head()

Task 3: Use the concat method to merge train-left-down and train-right-down horizontally into one table, and save it as result_down. Then merge result_up from above and result_down vertically into result.

list_down = [text_left_down,text_right_down]
result_down = pd.concat(list_down,axis=1)
result = pd.concat([result_up, result_down], ignore_index=True)  # regenerate the index
result

Task 4: Use DataFrame's own join and append methods to complete Task 2 and Task 3

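The join method:

DataFrame's built-in join method is a quick way to merge tables; by default it aligns on the index.

join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False)

The parameters above have the following meanings:

on: the name of the column (or index level) in the calling DataFrame to match against the index of other. If omitted, the two tables are joined index to index.
how: one of {'left', 'right', 'outer', 'inner'}; the default is 'left'.
lsuffix: a string appended to overlapping column names from the left table.
rsuffix: a string appended to overlapping column names from the right table.
sort: a boolean; whether to sort the merged data by the join key. Defaults to False.
join() uses a left join by default: the left table is the base, and all of its rows appear in the merged result.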
Reference: "The join() method for merging data in Pandas" (KJ.JK's blog, CSDN)
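
The following sketch illustrates the on, how, lsuffix and rsuffix parameters on made-up data (the frames left and right and their columns are assumptions for illustration only). Both frames contain a column named value, so suffixes are needed to keep the two apart.

```python
import pandas as pd

# Hypothetical frames: `right` is indexed by the values found in left["key"]
left = pd.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
right = pd.DataFrame({"value": [10, 20]}, index=["a", "b"])

# Index-on-index left join (the default): every row of the left table is kept
by_index = left.set_index("key").join(right, lsuffix="_left", rsuffix="_right")

# on="key": match the "key" column of `left` against the index of `right`;
# how="inner" keeps only keys present in both tables
by_column = left.join(right, on="key", how="inner",
                      lsuffix="_left", rsuffix="_right")

print(by_index)
print(by_column)
```

In by_index the row for "c" has no match in the right table, so its value_right is NaN; the inner join in by_column drops that row instead.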

result_up = text_left_up.join(text_right_up)  # left join by default
result_down = text_left_down.join(text_right_down)
# DataFrame.append only exists in pandas < 2.0 (it was removed in 2.0)
result = result_up.append(result_down, ignore_index=True)
result
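
DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append call in this task fails on current versions. A minimal sketch of the usual replacement, pd.concat, on two made-up frames (top and bottom are illustrative names, not the tutorial's data):

```python
import pandas as pd

# Hypothetical upper and lower halves of a table
top = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
bottom = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 0]})

# Equivalent of top.append(bottom, ignore_index=True) on pandas >= 2.0
combined = pd.concat([top, bottom], ignore_index=True)
print(combined)
```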

Task 5: Use Pandas' merge method and DataFrame's append method to complete Task 2 and Task 3

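The merge method:

left: the left table
right: the right table
how: the join type, one of inner, left, right, outer; defaults to inner
on: name of the column to join on (it must exist in both tables)
left_on: name of the join column in the left table
right_on: name of the join column in the right table
left_index: whether to use the left table's row index as the join key; defaults to False
right_index: whether to use the right table's row index as the join key; defaults to False
sort: whether to sort the merged data by the join keys; defaults to False
copy: defaults to True, which always copies the data into the result; setting it to False can improve performance
suffixes: suffixes appended to column names that appear in both tables; defaults to ('_x', '_y')
indicator: if True, adds a _merge column showing which table each row of the result came from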

Reference: "The pandas merge method explained in detail" (trayvontang's blog, CSDN)
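
As a sketch of how how, left_on, right_on and indicator interact, the example below merges two made-up tables whose key columns have different names (passengers, fares and the column pid are assumptions for illustration).

```python
import pandas as pd

# Hypothetical tables; the key is called PassengerId on one side and pid on the other
passengers = pd.DataFrame({"PassengerId": [1, 2, 3],
                           "Sex": ["male", "female", "female"]})
fares = pd.DataFrame({"pid": [2, 3, 4], "Fare": [71.28, 7.92, 53.10]})

# Outer join keeps unmatched rows from both sides;
# indicator=True records the origin of each row in a "_merge" column
merged = pd.merge(passengers, fares, how="outer",
                  left_on="PassengerId", right_on="pid", indicator=True)

# The default inner join keeps only keys present in both tables
inner = pd.merge(passengers, fares, left_on="PassengerId", right_on="pid")

print(merged)
print(len(inner))
```

The _merge column takes the values left_only, right_only or both, which makes it easy to check which rows failed to match.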

result_up = pd.merge(text_left_up, text_right_up, left_index=True, right_index=True)
result_down = pd.merge(text_left_down, text_right_down, left_index=True, right_index=True)
# Setting left_index and right_index to True joins the tables on their indexes.
result = result_up.append(result_down, ignore_index=True)  # requires pandas < 2.0
result

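Task: Convert our data into Series-type data

The stack function turns data from a "table structure" into a "curly-brace structure" (a Series with a hierarchical index): it moves the column labels into the innermost level of the row index. Conversely, the unstack function turns the "curly-brace structure" back into a "table structure" by moving one level of the row index back into the columns.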
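
A small round-trip sketch of stack and unstack, using a made-up two-row frame (the values are illustrative, not read from result.csv):

```python
import pandas as pd

# Hypothetical two-row table
df = pd.DataFrame({"Survived": [0, 1], "Pclass": [3, 1]})

# stack: column labels become the inner level of the row index -> a Series
stacked = df.stack()

# unstack: the inner index level goes back to being the columns -> a DataFrame
restored = stacked.unstack()

print(type(stacked).__name__)      # Series
print(stacked.index.nlevels)       # 2
print(stacked[(0, "Pclass")])      # 3
print(restored.shape)              # (2, 2)
```

Each value in the stacked Series is addressed by a (row label, column label) pair, which is why the output shown for the Titanic data lists the column names under row 0.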

text = pd.read_csv('result.csv')
text.head()
# write your code here
unit_result=text.stack().head(20)
unit_result.head()

-----------------------------------------------------------------------------------------
'''Output
0  Unnamed: 0                           0
   PassengerId                          1
   Survived                             0
   Pclass                               3
   Name           Braund, Mr. Owen Harris
dtype: object'''

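Data Aggregation and Computation

Working with the Data

The GroupBy Mechanism

Reference: "Python Data Analysis: The GroupBy Mechanism", Python Data Analysis Tutorial, 炫意HTML5

Reference: "Understanding and Using the Pandas GroupBy Mechanism in Depth: Understanding"

Reference: "Python Data Analysis | (28) The GroupBy Mechanism" (CoreJT's blog, CSDN)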

Task 2: Compute the average fare for men and for women on the Titanic

grouped = text['Fare'].groupby(text['Sex'])  # group Fare by Sex, then take the mean of each group
mf=grouped.mean()
mf

Task 3: Count the number of male and female survivors on the Titanic

survival  = text['Survived'].groupby(text['Sex'])
survival_sex = survival.sum()
survival_sex

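Task 4: Compute the number of survivors in each cabin class

groupby: returns a GroupBy object (a DataFrameGroupBy when called on a DataFrame, a SeriesGroupBy when called on a single column, as in the tasks above). This object only describes the grouping; an aggregation function such as sum or mean has to be called on it before any result is produced, and for a single column that result is a Series.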
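
No code is given for Task 4; the sketch below shows one way to do it, following the same pattern as Tasks 2 and 3. It runs on a small made-up table (toy is a stand-in, not the real data); on the tutorial's data the corresponding call would presumably be text['Survived'].groupby(text['Pclass']).sum(), assuming the Pclass and Survived columns shown earlier.

```python
import pandas as pd

# Toy stand-in for the Titanic table (hypothetical rows)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# groupby alone only builds the grouping; nothing is computed yet
grouped = toy["Survived"].groupby(toy["Pclass"])
print(type(grouped).__name__)   # SeriesGroupBy

# Survived is 0/1, so summing each group counts the survivors per class
survived_pclass = grouped.sum()
print(survived_pclass)
```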

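[Think about it] The computations in Task 2 and Task 3 can be done in a single pass with the agg() function, and the rename function can be used to change the resulting column names.

agg: a method available on both DataFrame and GroupBy objects; applied to a grouped DataFrame it returns a DataFrame. Much of what it does can also be achieved with sum, mean and so on, but agg is more concise, and the functions passed to it can be given as strings or as custom functions. A custom function receives each group's values for the column as a Series.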

Reference: "How to use the agg function for grouped aggregation" (Zhihu)
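
As a sketch of the idea in the note above, the example below computes the mean fare and the survivor count per sex in one agg call, then applies a custom function. The table toy and the output column names are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the Titanic table (hypothetical rows)
toy = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male"],
    "Fare":     [10.0, 30.0, 50.0, 20.0],
    "Survived": [0, 1, 1, 1],
})

# One pass: a different aggregation per column, named by string
summary = (
    toy.groupby("Sex")
       .agg({"Fare": "mean", "Survived": "sum"})
       .rename(columns={"Fare": "mean_fare", "Survived": "survived_count"})
)

# A custom function receives each group's column as a Series
fare_range = toy.groupby("Sex")["Fare"].agg(lambda s: s.max() - s.min())

print(summary)
print(fare_range)
```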

text.groupby('Sex').agg({'Fare': 'mean', 'Pclass': 'count'}).rename(columns=
                            {'Fare': 'mean_fare', 'Pclass': 'count_pclass'})
                            # aggregate by passing a dictionary

Task 5: Compute the average ticket fare for each age within each ticket class

text.groupby(['Pclass','Age'])['Fare'].mean()

Task 6: Merge the results of Task 2 and Task 3, and save them to sex_fare_survived.csv

result = pd.merge(mf, survival_sex, on='Sex')  # on: the name to join on (here the shared index name, Sex)
result
result.to_csv('sex_fare_survived.csv')

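Task 7: Find the total number of survivors at each age, then find the age with the most survivors, and finally compute the highest survival rate (number of survivors / total number of people)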

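# number of survivors at each age
survived_age = text['Survived'].groupby(text['Age']).sum()
survived_age.head()

# find the age(s) with the largest number of survivors
survived_age[survived_age.values == survived_age.max()]

# total number of survivors
_sum = text['Survived'].sum()
print(_sum)

# total number of people: the task defines the rate as survivors / total people,
# so the denominator is the row count, not the survivor count
total = len(text)
print("sum of person:" + str(total))
percent = survived_age.max() / total
print("Highest survival rate: " + str(percent))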
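
The steps of Task 7 can also be sketched end to end on made-up data, using idxmax to read off the age directly and taking the rate as survivors divided by the total number of people, as the task states. The table toy is hypothetical.

```python
import pandas as pd

# Toy stand-in for the Titanic table (hypothetical rows)
toy = pd.DataFrame({
    "Age":      [22, 22, 24, 24, 24, 30],
    "Survived": [1, 0, 1, 1, 0, 1],
})

# Survivors per age
survived_age = toy["Survived"].groupby(toy["Age"]).sum()

# idxmax returns the index label (the age) of the largest value
best_age = survived_age.idxmax()

# Rate as defined in the task: survivors at that age / total number of people
rate = survived_age.max() / len(toy)

print(best_age)
print(rate)
```

Note that idxmax returns only the first label when several ages tie for the maximum, whereas the boolean filter used in the task returns all of them.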