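Data Analysis (Day 2)

Data Types

After we read data with pandas, what type of object do we get? Let's look at that next.

Task 1: pandas has two core data types, DataFrame and Series. Look them up to get a basic understanding, then write a small example of your own for each. [Open-ended]

A Series is a one-dimensional array-like object. It contains a sequence of values (of types similar to NumPy's) together with data labels, called the index. The simplest Series can be built from nothing more than an array.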

import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
obj.index   # get the index
obj.values  # get the values
# comparable to a one-dimensional array
array([ 4,  7, -5,  3], dtype=int64)
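
As a small extension of the example above, the sketch below builds a Series with explicit labels instead of the default 0..3. The labels 'd', 'b', 'a', 'c' and the name obj2 are made up for illustration.

```python
import pandas as pd

# Hypothetical labels, chosen only for illustration
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

print(obj2.index.tolist())    # the labels
print(obj2.values.tolist())   # the underlying values
print(obj2['a'])              # look up one value by its label

positive = obj2[obj2 > 0]     # boolean filtering keeps the labels
print(positive)
```

The point of the sketch is that the index travels with the values: filtering drops rows but never relabels the ones that remain.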

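A DataFrame represents a rectangular table of data. It contains an ordered collection of columns, and each column can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a dict of Series that all share the same index.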

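# Method 1: initialize row by row from a nested list ([[...]])

data = pd.DataFrame([['C Programming', 80],
                     ['Data Structures', 90],
                     ['Computer Organization', 88],
                     ['Computer Networks', 92],
                     ['Database Systems', 86]],
                    columns=['Course', 'Score'])
data.index        # row index
data.columns      # column index: Index(['Course', 'Score'], dtype='object')
data

# Method 2: initialize column by column from a dict
data = pd.DataFrame({'Course': ['C Programming', 'Data Structures', 'Computer Organization', 'Computer Networks', 'Database Systems'],
                     'Score': [80, 90, 88, 92, 86]})
data
   Course                 Score
0  C Programming          80
1  Data Structures        90
2  Computer Organization  88
3  Computer Networks      92
4  Database Systems       86
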
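
To make the "dict of Series sharing one index" description concrete, here is a minimal sketch. The column names course and score and the values are illustrative assumptions, not data from this post.

```python
import pandas as pd

# Illustrative data only
courses = pd.Series(['Math', 'Physics', 'Chemistry'])
scores = pd.Series([80, 90, 88])

# Each Series becomes one column; both share the index 0, 1, 2
table = pd.DataFrame({'course': courses, 'score': scores})

col = table['score']           # selecting one column gives back a Series
print(type(col).__name__)
print(table.dtypes)            # each column keeps its own value type
```
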
Task 2: Load the "train.csv" file using the method from the previous lesson
data = pd.read_csv(r'data\train.csv')
data.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Task 4: View all values in the "Cabin" column [there are several ways]
data['Cabin']
0       NaN
1       C85
2       NaN
3      C123
4       NaN
       ... 
886     NaN
887     B42
888     NaN
889    C148
890     NaN
Name: Cabin, Length: 891, dtype: object
data.loc[:, 'Cabin']    # all rows, one column; note that loc selects by label (here the column name)
0       NaN
1       C85
2       NaN
3      C123
4       NaN
       ... 
886     NaN
887     B42
888     NaN
889    C148
890     NaN
Name: Cabin, Length: 891, dtype: object
data.iloc[:, 10]    # all rows, one column; note that iloc selects by integer position (Cabin is column 10)
0       NaN
1       C85
2       NaN
3      C123
4       NaN
       ... 
886     NaN
887     B42
888     NaN
889    C148
890     NaN
Name: Cabin, Length: 891, dtype: object
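
The three selections above return the same column. The sketch below checks that on a tiny stand-in DataFrame (the values are made up) and also shows how to look up a column's position with columns.get_loc instead of hard-coding a number such as 10.

```python
import pandas as pd

# Tiny stand-in for the Titanic data; values are made up
df = pd.DataFrame({'Name': ['A', 'B', 'C'],
                   'Cabin': ['C85', None, 'C123']})

by_bracket = df['Cabin']             # column name in brackets
by_attr = df.Cabin                   # attribute access (valid identifiers only)
by_loc = df.loc[:, 'Cabin']          # label-based
pos = df.columns.get_loc('Cabin')    # position of the column
by_iloc = df.iloc[:, pos]            # position-based

print(by_bracket.equals(by_attr), by_bracket.equals(by_loc), by_bracket.equals(by_iloc))
```
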
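Task 5: Load the file "test.csv", compare it with "train.csv" to see which extra columns it has, and then delete those extra columns
# There are no extra columns in this data, so we add a column first and then delete it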
data = pd.read_csv(r'data\test.csv')
df = data['Embarked']
new_df = pd.concat([data, df], axis=1)
new_df

# Delete the duplicated column from new_df
del new_df['Embarked']      # modifies new_df in place
new_df.head()  # every column with that name is deleted
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN
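
Because del removes every column that carries the given name, both copies of 'Embarked' disappear. If the goal is to keep the first copy and drop only the repeats, one possible approach is to filter on columns.duplicated(). This is a sketch on made-up rows, not the method used in this post.

```python
import pandas as pd

df = pd.DataFrame({'Pclass': [3, 2], 'Embarked': ['Q', 'S']})   # made-up rows
dup = pd.concat([df, df['Embarked']], axis=1)    # 'Embarked' now appears twice

# columns.duplicated() marks the second and later occurrences of a name
deduped = dup.loc[:, ~dup.columns.duplicated()]

print(list(dup.columns))
print(list(deduped.columns))
```
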
new_df.drop('Sex', axis=1)  # does not modify new_df; returns a new object with the column removed
new_df

Note: the output below does not match the code above, which would leave new_df with 418 rows and 10 columns. It appears to come from an earlier run that concatenated the column row-wise.

     0    Age   Cabin  Fare     Name                                          Parch  PassengerId  Pclass  Sex     SibSp  Ticket
0    NaN  34.5  NaN    7.8292   Kelly, Mr. James                              0.0    892.0        3.0     male    0.0    330911
1    NaN  47.0  NaN    7.0000   Wilkes, Mrs. James (Ellen Needs)              0.0    893.0        3.0     female  1.0    363272
2    NaN  62.0  NaN    9.6875   Myles, Mr. Thomas Francis                     0.0    894.0        2.0     male    0.0    240276
3    NaN  27.0  NaN    8.6625   Wirz, Mr. Albert                              0.0    895.0        3.0     male    0.0    315154
4    NaN  22.0  NaN    12.2875  Hirvonen, Mrs. Alexander (Helga E Lindqvist)  1.0    896.0        3.0     female  1.0    3101298
...
413  S    NaN   NaN    NaN      NaN                                           NaN    NaN          NaN     NaN     NaN    NaN
414  C    NaN   NaN    NaN      NaN                                           NaN    NaN          NaN     NaN     NaN    NaN
415  S    NaN   NaN    NaN      NaN                                           NaN    NaN          NaN     NaN     NaN    NaN
416  S    NaN   NaN    NaN      NaN                                           NaN    NaN          NaN     NaN     NaN    NaN
417  C    NaN   NaN    NaN      NaN                                           NaN    NaN          NaN     NaN     NaN    NaN

2090 rows × 11 columns
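
For the comparison step of this task, a minimal sketch is shown below. The two frames are made-up stand-ins for train.csv and test.csv, and the column 'Extra' is hypothetical; Index.difference lists the names present in one frame but not in the other.

```python
import pandas as pd

# Made-up frames standing in for train.csv and test.csv
train = pd.DataFrame({'PassengerId': [1], 'Survived': [0], 'Pclass': [3]})
test = pd.DataFrame({'PassengerId': [892], 'Pclass': [3], 'Extra': ['x']})

extra_cols = test.columns.difference(train.columns)     # in test but not in train
missing_cols = train.columns.difference(test.columns)   # in train but not in test

test_clean = test.drop(columns=extra_cols)              # returns a new DataFrame
print(list(extra_cols), list(missing_cols), list(test_clean.columns))
```
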

Task 6: Hide the columns ['PassengerId', 'Name', 'Age', 'Ticket'] and look only at the remaining columns
data = pd.read_csv(r'data\test.csv')
data.head()
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S
data.drop(['PassengerId','Name','Age','Ticket'], axis=1).head()
   Pclass  Sex     SibSp  Parch  Fare     Cabin  Embarked
0  3       male    0      0      7.8292   NaN    Q
1  3       female  1      0      7.0000   NaN    S
2  2       male    0      0      9.6875   NaN    Q
3  3       male    0      0      8.6625   NaN    S
4  3       female  1      1      12.2875  NaN    S
data
     PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
0    892          3       Kelly, Mr. James                              male    34.5  0      0      330911              7.8292    NaN    Q
1    893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272              7.0000    NaN    S
2    894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276              9.6875    NaN    Q
3    895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154              8.6625    NaN    S
4    896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298             12.2875   NaN    S
...
413  1305         3       Spector, Mr. Woolf                            male    NaN   0      0      A.5. 3236           8.0500    NaN    S
414  1306         1       Oliva y Ocana, Dona. Fermina                  female  39.0  0      0      PC 17758            108.9000  C105   C
415  1307         3       Saether, Mr. Simon Sivertsen                  male    38.5  0      0      SOTON/O.Q. 3101262  7.2500    NaN    S
416  1308         3       Ware, Mr. Frederick                           male    NaN   0      0      359309              8.0500    NaN    S
417  1309         3       Peter, Master. Michael J                      male    NaN   1      1      2668                22.3583   NaN    C

418 rows × 11 columns

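As the output shows, drop() left data unchanged. To remove the columns from the original DataFrame itself, pass inplace=True. Because inplace=True overwrites the original data, it is not used here.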
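
The difference between the two behaviours can be sketched on a small made-up DataFrame. The names dropped and trimmed are illustrative; a copy is used for the inplace call so the original stays available for comparison.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['A', 'B'], 'Age': [30, 40], 'Fare': [7.25, 8.05]})   # made-up

dropped = df.drop(['Name'], axis=1)     # new object; df is untouched
print(list(df.columns))                 # still includes 'Name'

trimmed = df.copy()
result = trimmed.drop(['Name'], axis=1, inplace=True)   # modifies trimmed itself
print(result)                           # inplace=True returns None
print(list(trimmed.columns))
```

Reassigning the result (df = df.drop(...)) has the same effect as inplace=True and is often easier to follow.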

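Filtering Logic

One of the most important things you can do with tabular data is filter it: pick out the information you need and discard what you don't.

As before, we will learn this pandas feature through hands-on practice.

Task 1: Using "Age" as the filter condition, display the passengers under 10 years old.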
data = pd.read_csv(r'data\test.csv')
data.head()
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S
data[data['Age'] < 10]
     PassengerId  Pclass  Name                                     Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
21   913          3       Olsen, Master. Artur Karl                male    9.00  0      1      C 17368             3.1708    NaN    S
80   972          3       Boulos, Master. Akar                     male    6.00  1      1      2678                15.2458   NaN    C
89   981          2       Wells, Master. Ralph Lester              male    2.00  1      1      29103               23.0000   NaN    S
117  1009         3       Sandstrom, Miss. Beatrice Irene          female  1.00  1      1      PP 9549             16.7000   G6     S
161  1053         3       Touma, Master. Georges Youssef           male    7.00  1      1      2650                15.2458   NaN    C
194  1086         2       Drew, Master. Marshall Brines            male    8.00  0      2      28220               32.5000   NaN    S
196  1088         1       Spedden, Master. Robert Douglas          male    6.00  0      2      16966               134.5000  E34    C
201  1093         3       Danbom, Master. Gilbert Sigvard Emanuel  male    0.33  0      2      347080              14.4000   NaN    S
203  1095         2       Quick, Miss. Winifred Vera               female  8.00  1      1      26360               26.0000   NaN    S
250  1142         2       West, Miss. Barbara J                    female  0.92  1      2      C.A. 34651          27.7500   NaN    S
263  1155         3       Klasen, Miss. Gertrud Emilia             female  1.00  1      1      350405              12.1833   NaN    S
281  1173         3       Peacock, Master. Alfred Edward           male    0.75  1      1      SOTON/O.Q. 3101315  13.7750   NaN    S
283  1175         3       Touma, Miss. Maria Youssef               female  9.00  1      1      2650                15.2458   NaN    C
284  1176         3       Rosblom, Miss. Salli Helena              female  2.00  1      1      370129              20.2125   NaN    S
296  1188         2       Laroche, Miss. Louise                    female  1.00  1      2      SC/Paris 2123       41.5792   NaN    C
307  1199         3       Aks, Master. Philip Frank                male    0.83  0      1      392091              9.3500    NaN    S
354  1246         3       Dean, Miss. Elizabeth Gladys "Millvina"  female  0.17  1      2      C.A. 2315           20.5750   NaN    S
379  1271         3       Asplund, Master. Carl Edgar              male    5.00  4      2      347077              31.3875   NaN    S
389  1281         3       Palsson, Master. Paul Folke              male    6.00  3      1      349909              21.0750   NaN    S
409  1301         3       Peacock, Miss. Treasteall                female  3.00  1      1      SOTON/O.Q. 3101315  13.7750   NaN    S
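
One detail worth knowing about this kind of filter: a comparison against a missing age (NaN) evaluates to False, so passengers with unknown age never appear in the result. The sketch below shows this on made-up rows and uses isna() to list the rows that were skipped for that reason.

```python
import numpy as np
import pandas as pd

# Made-up rows; 'B' has no recorded age
df = pd.DataFrame({'Name': ['A', 'B', 'C', 'D'],
                   'Age': [9.0, np.nan, 34.5, 0.33]})

mask = df['Age'] < 10                  # NaN < 10 evaluates to False
children = df[mask]
unknown_age = df[df['Age'].isna()]     # rows the filter could not decide on

print(children)
print(unknown_age)
```
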

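Task 2: Using "Age" as the condition, display the passengers who are older than 10 and younger than 50, and name the result midage

# Method 1:
midage = data[(data['Age']>10) & (data['Age']<50)]   # both conditions must hold, so they are combined with &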
midage
     PassengerId  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
0    892          3       Kelly, Mr. James                                 male    34.5  0      0      330911              7.8292    NaN    Q
1    893          3       Wilkes, Mrs. James (Ellen Needs)                 female  47.0  1      0      363272              7.0000    NaN    S
3    895          3       Wirz, Mr. Albert                                 male    27.0  0      0      315154              8.6625    NaN    S
4    896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)     female  22.0  1      1      3101298             12.2875   NaN    S
5    897          3       Svensson, Mr. Johan Cervin                       male    14.0  0      0      7538                9.2250    NaN    S
...
406  1298         2       Ware, Mr. William Jeffery                        male    23.0  1      0      28666               10.5000   NaN    S
411  1303         1       Minahan, Mrs. William Edward (Lillian E Thorpe)  female  37.0  1      0      19928               90.0000   C78    Q
412  1304         3       Henriksson, Miss. Jenny Lovisa                   female  28.0  0      0      347086              7.7750    NaN    S
414  1306         1       Oliva y Ocana, Dona. Fermina                     female  39.0  0      0      PC 17758            108.9000  C105   C
415  1307         3       Saether, Mr. Simon Sivertsen                     male    38.5  0      0      SOTON/O.Q. 3101262  7.2500    NaN    S

274 rows × 11 columns

data['Age']>10
0       True
1       True
2       True
3       True
4       True
       ...  
413    False
414     True
415     True
416    False
417    False
Name: Age, Length: 418, dtype: bool
data['Age']<50
0       True
1       True
2      False
3       True
4       True
       ...  
413    False
414     True
415     True
416    False
417    False
Name: Age, Length: 418, dtype: bool
(data['Age']>10) & (data['Age']<50)  # Note: in pandas, element-wise AND is &, OR is |, NOT is ~; using and / or / not raises an error
0       True
1       True
2      False
3       True
4       True
       ...  
413    False
414     True
415     True
416    False
417    False
Name: Age, Length: 418, dtype: bool
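
The operator rules in the comment above can be checked with a short sketch on a made-up Series of ages. The parentheses around each comparison are required because & and | bind more tightly than > and <.

```python
import pandas as pd

ages = pd.Series([5.0, 14.0, 34.5, 62.0])     # made-up ages

both = (ages > 10) & (ages < 50)       # element-wise AND
either = (ages <= 10) | (ages >= 50)   # element-wise OR
negated = ~both                        # element-wise NOT

# Python's `and` asks for a single True/False, which a Series cannot give
try:
    (ages > 10) and (ages < 50)
    raised = False
except ValueError:
    raised = True

print(both.tolist(), negated.equals(either), raised)
```
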
Task 3: Display the "Pclass" and "Sex" values of row 100 of midage

[Hint] When extracting data, we want the relative order of the rows to stay the same. Which function achieves this?

midage = data[(data['Age']>10) & (data['Age']<50)]
midage
     PassengerId  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
0    892          3       Kelly, Mr. James                                 male    34.5  0      0      330911              7.8292    NaN    Q
1    893          3       Wilkes, Mrs. James (Ellen Needs)                 female  47.0  1      0      363272              7.0000    NaN    S
3    895          3       Wirz, Mr. Albert                                 male    27.0  0      0      315154              8.6625    NaN    S
4    896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)     female  22.0  1      1      3101298             12.2875   NaN    S
5    897          3       Svensson, Mr. Johan Cervin                       male    14.0  0      0      7538                9.2250    NaN    S
...
406  1298         2       Ware, Mr. William Jeffery                        male    23.0  1      0      28666               10.5000   NaN    S
411  1303         1       Minahan, Mrs. William Edward (Lillian E Thorpe)  female  37.0  1      0      19928               90.0000   C78    Q
412  1304         3       Henriksson, Miss. Jenny Lovisa                   female  28.0  0      0      347086              7.7750    NaN    S
414  1306         1       Oliva y Ocana, Dona. Fermina                     female  39.0  0      0      PC 17758            108.9000  C105   C
415  1307         3       Saether, Mr. Simon Sivertsen                     male    38.5  0      0      SOTON/O.Q. 3101262  7.2500    NaN    S

274 rows × 11 columns

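Looking at the data above, midage is a filtered subset, so its index is no longer contiguous. As a result, loc (which selects by label) and iloc (which selects by position) can return different rows for the same number, and some labels, such as 2, no longer exist. To get labels that run from 0 again, we reset the index with reset_index(). The reindex() method, also tried below, is not a reset: it looks up the labels you pass in and fills any label that is missing with NaN.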

When we reset the index, the old index is added as a column, and a new sequential index is used:

midage.reset_index() # the old index becomes a column; returns a new object and leaves midage unchanged
     index  PassengerId  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
0    0      892          3       Kelly, Mr. James                                 male    34.5  0      0      330911              7.8292    NaN    Q
1    1      893          3       Wilkes, Mrs. James (Ellen Needs)                 female  47.0  1      0      363272              7.0000    NaN    S
2    2      895          3       Wirz, Mr. Albert                                 male    27.0  0      0      315154              8.6625    NaN    S
3    3      896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)     female  22.0  1      1      3101298             12.2875   NaN    S
4    4      897          3       Svensson, Mr. Johan Cervin                       male    14.0  0      0      7538                9.2250    NaN    S
...
269  269    1298         2       Ware, Mr. William Jeffery                        male    23.0  1      0      28666               10.5000   NaN    S
270  270    1303         1       Minahan, Mrs. William Edward (Lillian E Thorpe)  female  37.0  1      0      19928               90.0000   C78    Q
271  271    1304         3       Henriksson, Miss. Jenny Lovisa                   female  28.0  0      0      347086              7.7750    NaN    S
272  272    1306         1       Oliva y Ocana, Dona. Fermina                     female  39.0  0      0      PC 17758            108.9000  C105   C
273  273    1307         3       Saether, Mr. Simon Sivertsen                     male    38.5  0      0      SOTON/O.Q. 3101262  7.2500    NaN    S

274 rows × 12 columns

We can use the drop parameter to avoid the old index being added as a column:

midage.reset_index(drop=True)  # discard the old index instead of keeping it as a column
     PassengerId  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket              Fare      Cabin  Embarked
0    892          3       Kelly, Mr. James                                 male    34.5  0      0      330911              7.8292    NaN    Q
1    893          3       Wilkes, Mrs. James (Ellen Needs)                 female  47.0  1      0      363272              7.0000    NaN    S
2    895          3       Wirz, Mr. Albert                                 male    27.0  0      0      315154              8.6625    NaN    S
3    896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)     female  22.0  1      1      3101298             12.2875   NaN    S
4    897          3       Svensson, Mr. Johan Cervin                       male    14.0  0      0      7538                9.2250    NaN    S
...
269  1298         2       Ware, Mr. William Jeffery                        male    23.0  1      0      28666               10.5000   NaN    S
270  1303         1       Minahan, Mrs. William Edward (Lillian E Thorpe)  female  37.0  1      0      19928               90.0000   C78    Q
271  1304         3       Henriksson, Miss. Jenny Lovisa                   female  28.0  0      0      347086              7.7750    NaN    S
272  1306         1       Oliva y Ocana, Dona. Fermina                     female  39.0  0      0      PC 17758            108.9000  C105   C
273  1307         3       Saether, Mr. Simon Sivertsen                     male    38.5  0      0      SOTON/O.Q. 3101262  7.2500    NaN    S

274 rows × 11 columns
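
The two methods are easy to confuse, so here is a minimal side-by-side sketch on made-up ages. reset_index(drop=True) renumbers the rows that survived the filter, while reindex looks up the labels it is given in the existing index and produces NaN for labels that are not there.

```python
import pandas as pd

df = pd.DataFrame({'Age': [34.5, 5.0, 27.0, 62.0, 14.0]})   # made-up ages
sub = df[(df['Age'] > 10) & (df['Age'] < 50)]               # keeps labels 0, 2, 4

renumbered = sub.reset_index(drop=True)          # same rows, labels 0, 1, 2
realigned = sub.reindex(range(sub.shape[0]))     # asks for labels 0, 1, 2 in sub

print(renumbered)
print(realigned)    # label 1 is not in sub, so that row is NaN; label 4 is lost
```
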

midage.reindex(index=range(midage.shape[0])) # not a reset: reindex looks up these labels in the old index, and labels missing from midage become all-NaN rows
     PassengerId  Pclass  Name                                           Sex     Age   SibSp  Parch  Ticket   Fare      Cabin  Embarked
0    892.0        3.0     Kelly, Mr. James                               male    34.5  0.0    0.0    330911   7.8292    NaN    Q
1    893.0        3.0     Wilkes, Mrs. James (Ellen Needs)               female  47.0  1.0    0.0    363272   7.0000    NaN    S
2    NaN          NaN     NaN                                            NaN     NaN   NaN    NaN    NaN      NaN       NaN    NaN
3    895.0        3.0     Wirz, Mr. Albert                               male    27.0  0.0    0.0    315154   8.6625    NaN    S
4    896.0        3.0     Hirvonen, Mrs. Alexander (Helga E Lindqvist)   female  22.0  1.0    1.0    3101298  12.2875   NaN    S
...
270  1162.0       1.0     McCaffry, Mr. Thomas Francis                   male    46.0  0.0    0.0    13050    75.2417   C6     C
271  NaN          NaN     NaN                                            NaN     NaN   NaN    NaN    NaN      NaN       NaN    NaN
272  1164.0       1.0     Clark, Mrs. Walter Miller (Virginia McDowell)  female  26.0  1.0    0.0    13508    136.7792  C89    C
273  NaN          NaN     NaN                                            NaN     NaN   NaN    NaN    NaN      NaN       NaN    NaN
274  NaN          NaN     NaN                                            NaN     NaN   NaN    NaN    NaN      NaN       NaN    NaN

275 rows × 11 columns

midage = midage.reset_index(drop=True)  # keep the renumbered rows for the lookups below
midage.loc[[99], ['Pclass', 'Sex']]
    Pclass  Sex
99  3       male
midage.iloc[[99], [1, 3]]
    Pclass  Sex
99  3       male

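Note the difference between iloc[rows, columns] and loc[rows, columns]: iloc accepts only integer positions for both rows and columns, whereas loc selects by row labels (index) and column names (columns).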
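
The difference only becomes visible when the index is not 0, 1, 2, ..., as on a filtered DataFrame. The following sketch uses made-up rows; after filtering, the label 2 and the position 2 refer to different passengers.

```python
import pandas as pd

# Made-up rows standing in for the Titanic data
df = pd.DataFrame({'Pclass': [3, 3, 2, 1, 3],
                   'Sex': ['male', 'female', 'male', 'female', 'male'],
                   'Age': [34.5, 5.0, 27.0, 62.0, 14.0]})
sub = df[(df['Age'] > 10) & (df['Age'] < 50)]    # keeps labels 0, 2, 4

by_label = sub.loc[[2], ['Pclass', 'Sex']]       # the row whose label is 2
by_position = sub.iloc[[2], [0, 1]]              # the third row, first two columns

print(by_label)
print(by_position)
```
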

Task 4: Use loc to display the "Pclass", "Name", and "Sex" values of rows 100, 105, and 108 of midage
midage.loc[[100, 105, 108], ['Pclass', 'Name', 'Sex']]
     Pclass  Name                                 Sex
100  2       Lahtinen, Rev. William               male
105  1       Bird, Miss. Ellen                    female
108  3       Peacock, Mrs. Benjamin (Edith Nile)  female
Task 5: Use iloc to display the "Pclass", "Name", and "Sex" values of rows 100, 105, and 108 of midage
midage.iloc[[100, 105, 108], [1, 2, 3]]
     Pclass  Name                                 Sex
100  2       Lahtinen, Rev. William               male
105  1       Bird, Miss. Ellen                    female
108  3       Peacock, Mrs. Benjamin (Edith Nile)  female
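
Hard-coded column positions such as [1, 2, 3] break silently if the column order changes. One possible way around this, sketched here on made-up rows, is to translate column names into positions with columns.get_indexer and pass the result to iloc.

```python
import pandas as pd

# Made-up rows; only the column layout mirrors the Titanic data
df = pd.DataFrame({'PassengerId': [892, 893, 894],
                   'Pclass': [3, 3, 2],
                   'Name': ['A', 'B', 'C'],
                   'Sex': ['male', 'female', 'male']})

cols = ['Pclass', 'Name', 'Sex']
positions = df.columns.get_indexer(cols)    # integer position of each name

via_iloc = df.iloc[[0, 2], positions]       # position-based selection
via_loc = df.loc[[0, 2], cols]              # label-based selection

print(positions.tolist())
print(via_iloc.equals(via_loc))
```
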

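Day 2 tasks complete!

Review: so far we have covered the basics of pandas and how to read CSV data and add, delete, query, and modify it. Next we will look at exploratory data analysis: how to use pandas for sorting and arithmetic, and how to use describe() to compute descriptive statistics.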
