03 Exploring Data

玛卡巴卡的手推车

Article link: https://blog.csdn.net/leileixiang/article/details/108109256

Import the numpy and pandas packages and load the data

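import numpy as np  # needed for np.arange and np.nan in the examples below
import pandas as pd
# Load the Titanic training data (the CSV uses Chinese column headers)
text = pd.read_csv('train_chinese.csv')
print(text.head())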
#    乘客ID  是否幸存  仓位等级  ...       票价    客舱  登船港口
# 0     1     0     3  ...   7.2500   NaN     S
# 1     2     1     1  ...  71.2833   C85     C
# 2     3     1     3  ...   7.9250   NaN     S
# 3     4     1     1  ...  53.1000  C123     S
# 4     5     0     3  ...   8.0500   NaN     S
# 
# [5 rows x 12 columns]

Sort the sample data with pandas, in ascending order

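pd.DataFrame(): creates a DataFrame object
np.arange(8).reshape((2, 4)): builds a 2x4 two-dimensional array; first row: 0, 1, 2, 3; second row: 4, 5, 6, 7
index=['2', '1']: the row labels (index) of the DataFrame
columns=['d', 'a', 'b', 'c']: the column labels of the DataFrame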

frame = pd.DataFrame(np.arange(8).reshape((2, 4)),
                     index=['2', '1'],
                     columns=['d', 'a', 'b', 'c'])
print(frame)
#    d  a  b  c
# 2  0  1  2  3
# 1  4  5  6  7

print(frame.sort_values(by='c', ascending=False))
#    d  a  b  c
# 1  4  5  6  7
# 2  0  1  2  3
print(frame)
#    d  a  b  c
# 2  0  1  2  3
# 1  4  5  6  7

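As the output shows, the by parameter of sort_values names the column to sort on, and the ascending parameter sets the direction (ascending or descending). sort_values returns a new DataFrame; the original frame is left unchanged, as the second print confirms.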
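
Both by and ascending also accept lists, so each column can be given its own direction. The following is a minimal sketch on a small made-up frame; the name demo and its values are illustrative assumptions, not part of the Titanic data.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
demo = pd.DataFrame({'a': [1, 1, 2], 'c': [5, 9, 3]},
                    index=['x', 'y', 'z'])

# Sort by 'a' ascending, then break ties on 'c' in descending order
result = demo.sort_values(by=['a', 'c'], ascending=[True, False])
print(result)

# The original frame keeps its row order; pass inplace=True to change it in place
print(demo)
```

Within the two rows where a equals 1, the row with the larger c comes first, because the second entry of the ascending list is False.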

# Sort the row index in ascending order
print(frame.sort_index())
#    d  a  b  c
# 1  4  5  6  7
# 2  0  1  2  3

# Sort the column index in ascending order
print(frame.sort_index(axis=1))
#    a  b  c  d
# 2  1  2  3  0
# 1  5  6  7  4

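# Sort the column index in descending order
print(frame.sort_index(axis=1, ascending=False))
#    d  c  b  a
# 2  0  3  2  1
# 1  4  7  6  5

# Sort by two chosen columns together, in descending order
print(frame.sort_values(by=['a', 'c'], ascending=False))
#    d  a  b  c
# 1  4  5  6  7
# 2  0  1  2  3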

Sort the Titanic data (train.csv) by the fare (票价) and age (年龄) columns together, in descending order. What can you find in the data?

print(text.sort_values(by=['票价', '年龄'], ascending=False).head(3))
#      乘客ID  是否幸存  仓位等级  ...        票价           客舱  登船港口
# 679   680     1     1  ...  512.3292  B51 B53 B55     C
# 258   259     1     1  ...  512.3292          NaN     C
# 737   738     1     1  ...  512.3292         B101     C
#
# [3 rows x 12 columns]

Do arithmetic with pandas: compute the sum of two DataFrames

# Arithmetic with pandas: add two DataFrames
frame1_a = pd.DataFrame(np.arange(9.).reshape(3, 3),
                        columns=['a', 'b', 'c'],
                        index=['one', 'two', 'three'])
print(frame1_a)
#          a    b    c
# one    0.0  1.0  2.0
# two    3.0  4.0  5.0
# three  6.0  7.0  8.0

frame1_b = pd.DataFrame(np.arange(12.).reshape(4, 3),
                        columns=['a', 'e', 'c'],
                        index=['first', 'one', 'two', 'second'])
print(frame1_b)
#           a     e     c
# first   0.0   1.0   2.0
# one     3.0   4.0   5.0
# two     6.0   7.0   8.0
# second  9.0  10.0  11.0

print(frame1_a + frame1_b)
#           a   b     c   e
# first   NaN NaN   NaN NaN
# one     3.0 NaN   7.0 NaN
# second  NaN NaN   NaN NaN
# three   NaN NaN   NaN NaN
# two     9.0 NaN  13.0 NaN

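Adding two DataFrames returns a new DataFrame. Values are added where the row label and the column label exist in both frames; every other position becomes the missing value NaN.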
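
When NaN is not wanted for positions that exist in only one of the frames, the add method with its fill_value parameter is one option. The sketch below rebuilds the two frames under the assumed names left and right so that it runs on its own; treating a missing value as 0 is an assumption that suits sums, not every analysis.

```python
import numpy as np
import pandas as pd

# Same shapes and labels as frame1_a and frame1_b above
left = pd.DataFrame(np.arange(9.).reshape(3, 3),
                    columns=['a', 'b', 'c'],
                    index=['one', 'two', 'three'])
right = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=['a', 'e', 'c'],
                     index=['first', 'one', 'two', 'second'])

# A value present in only one frame is added to fill_value (0 here)
filled = left.add(right, fill_value=0)
print(filled)
```

Positions that are missing from both frames, such as row first in column b, are still NaN, because there is nothing to add the fill value to.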

Use the Titanic data to work out how many people were in the largest family on board

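# Largest family aboard: siblings (兄弟姐妹个数) plus parents and children (父母子女个数)
print((text['兄弟姐妹个数'] + text['父母子女个数']).max())
# 10 (relatives travelling with one passenger, not counting that passenger)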
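
The sum above counts a passenger's relatives but not the passenger. A small sketch of the difference follows, on made-up rows; the sample frame and its values are hypothetical, and only the two column names are taken from train_chinese.csv.

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({'兄弟姐妹个数': [1, 0, 8],
                       '父母子女个数': [0, 0, 2]})

# Relatives aboard for each passenger
relatives = sample['兄弟姐妹个数'] + sample['父母子女个数']

# Family size including the passenger
family_size = relatives + 1

print(relatives.max())     # 10
print(family_size.max())   # 11
print(relatives.idxmax())  # 2, the row label of the largest family
```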

Learn to use the pandas describe() function to view basic summary statistics

# Use describe() to view basic summary statistics
frame2 = pd.DataFrame([[1.4, np.nan],
                       [7.1, -4.5],
                       [np.nan, np.nan],
                       [0.75, -1.3]
                      ], index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
print(frame2)
#     one  two
# a  1.40  NaN
# b  7.10 -4.5
# c   NaN  NaN
# d  0.75 -1.3
print(frame2.describe())
#             one       two
# count  3.000000  2.000000
# mean   3.083333 -2.900000
# std    3.493685  2.262742
# min    0.750000 -4.500000
# 25%    1.075000 -3.700000
# 50%    1.400000 -2.900000
# 75%    4.250000 -2.100000
# max    7.100000 -1.300000
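
Note that count is 3 and 2 in the output above, not 4: describe() leaves NaN out of every statistic. The sketch below, which rebuilds the same data under the assumed name stats_demo so that it runs on its own, checks that and also requests other percentiles through the percentiles parameter.

```python
import numpy as np
import pandas as pd

# Same values as frame2 above
stats_demo = pd.DataFrame({'one': [1.4, 7.1, np.nan, 0.75],
                           'two': [np.nan, -4.5, np.nan, -1.3]},
                          index=['a', 'b', 'c', 'd'])

# count reports non-missing values only, so it can be smaller than len()
print(len(stats_demo), stats_demo['one'].count())

# Ask for the 10th and 90th percentiles instead of the default quartiles
summary = stats_demo.describe(percentiles=[0.1, 0.9])
print(summary)
```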

Look at the basic summary statistics of the fare (票价) and parents/children (父母子女个数) columns in the Titanic dataset

# Summary statistics for the fare and parents/children columns
print(text['票价'].describe())
# count    891.000000
# mean      32.204208
# std       49.693429
# min        0.000000
# 25%        7.910400
# 50%       14.454200
# 75%       31.000000
# max      512.329200
# Name: 票价, dtype: float64
print(text['父母子女个数'].describe())
# count    891.000000
# mean       0.381594
# std        0.806057
# min        0.000000
# 25%        0.000000
# 50%        0.000000
# 75%        0.000000
# max        6.000000
# Name: 父母子女个数, dtype: float64
