第二章 pandas基础

第二章 pandas基础

一、文件的读取和写入

1. 文件读取

【a】pandas读取csv, excel, txt文件:

import numpy as np
import pandas as pd
df_csv = pd.read_csv('../data/my_csv.csv')
df_csv
col1col2col3col4col5
02a1.4apple2020/1/1
13b3.4banana2020/1/2
26c2.5orange2020/1/5
35d3.2lemon2020/1/7
df_txt = pd.read_table('../data/my_table.txt')
df_txt
col1col2col3col4
02a1.4apple 2020/1/1
13b3.4banana 2020/1/2
26c2.5orange 2020/1/5
35d3.2lemon 2020/1/7
df_excel = pd.read_excel('../data/my_excel.xlsx')
df_excel
col1col2col3col4col5
02a1.4apple2020/1/1
13b3.4banana2020/1/2
26c2.5orange2020/1/5
35d3.2lemon2020/1/7

【b】常用参数:headerusecolsnrows,对 txt、csv、excel 均适用。

header=None表示第一行不作为列名:

pd.read_table('../data/my_table.txt', header=None)
0123
0col1col2col3col4
12a1.4apple 2020/1/1
23b3.4banana 2020/1/2
36c2.5orange 2020/1/5
45d3.2lemon 2020/1/7

usecols表示读取列的集合,默认读取所有的列:

pd.read_table('../data/my_table.txt', usecols=['col1', 'col2'])
col1col2
02a
13b
26c
35d

nrows表示读取的数据行数:

pd.read_table('../data/my_table.txt', nrows=2)
col1col2col3col4
02a1.4apple 2020/1/1
13b3.4banana 2020/1/2

【c】对于分隔符非空格的txt文件,分割参数sep可以自定义分隔符号。

pd.read_table('../data/my_table_special_sep.txt')
col1 |||| col2
0TS |||| This is an apple.
1GQ |||| My name is Bob.
2WT |||| Well done!
3PT |||| May I help you?

使用sep设置分隔符,同时指定引擎为python

PS:使用read_table的时,参数sep中使用的是正则表达式,因此需要对|进行转义变成\|,否则无法读取到正确的结果。

pd.read_table('../data/my_table_special_sep.txt', sep=' \|\|\|\| ', engine='python')
col1col2
0TSThis is an apple.
1GQMy name is Bob.
2WTWell done!
3PTMay I help you?

2. 文件写入

【a】index=False,在保存文件的时候去除索引:

df_csv.to_csv('../data/my_csv_saved.csv', index=True)
df_excel.to_excel('../data/my_excel_saved.xlsx', index=False)
无索引

在这里插入图片描述

没有索引

在这里插入图片描述

【b】pandas中没有定义to_table函数,txt文件也使用to_csv进行保存:

df_txt.to_csv('../data/my_txt_saved.txt', sep='\t', index=False)

【c】表格转markdownlatex

df_csv
col1col2col3col4col5
02a1.4apple2020/1/1
13b3.4banana2020/1/2
26c2.5orange2020/1/5
35d3.2lemon2020/1/7
print(df_csv.to_markdown())
|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |
print(df_csv.to_latex())
\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

二、基本数据结构

1. Series

【a】Series存储一维values,由序列的值data、索引index、存储类型dtype、序列的名字name组成:

s = pd.Series(data = [['dawang','yiyi'],{'dawang':21},'yiyi_19'],
              index = ['list','dict','str'],
              # index = pd.Index(['list','dict','str'], name='index_name'),
              dtype = 'object',
              name = 'dawang_yiyi')  # `object`是一种混合类型,正如次例中存储了列表、字典和字符串。
s
list    [dawang, yiyi]
dict    {'dawang': 21}
str            yiyi_19
Name: dawang_yiyi, dtype: object

【b】获取 Series 属性:

s.values
array([list(['dawang', 'yiyi']), {'dawang': 21}, 'yiyi_19'], dtype=object)
s.index
Index(['list', 'dict', 'str'], dtype='object')
s.dtype
dtype('O')
s.name
'dawang_yiyi'
s.shape  # 获取序列的长度
(3,)

2. DataFrame

DataFrame存储二维values,在Series的基础上增加了列索引columns

【a】二维数组 + 行、列索引构造DataFrame

data = [['dawang',21,'image identification'], ['yiyi',19,'clinical medicine']]
df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(2)],
                  columns=['col_0', 'col_1', 'col_2'])
df
col_0col_1col_2
row_0dawang21image identification
row_1yiyi19clinical medicine

【b】列索引名→数据,映射构造DataFrame,同时加上行索引:

df = pd.DataFrame(data = {'col_0': ['dawang','yiyi'],
                          'col_1': [21, 19],
                          'col_2': ['imageidentification', 'clinicalmedicine']},
                  index = ['row_%d'%i for i in range(2)])
df
col_0col_1col_2
row_0dawang21imageidentification
row_1yiyi19clinicalmedicine

【c】DataFrame + [col_name]取出相应列得到 Series

df['col_0']
row_0    dawang
row_1      yiyi
Name: col_0, dtype: object

【d】DataFrame + [col_list]取出多列得到 DataFrame

df[['col_0', 'col_1']]
col_0col_1
row_0dawang21
row_1yiyi19

【e】获取 DataFrame 属性:

df.values
array([['dawang', 21, 'imageidentification'],
       ['yiyi', 19, 'clinicalmedicine']], dtype=object)
df.index
Index(['row_0', 'row_1'], dtype='object')
df.columns
Index(['col_0', 'col_1', 'col_2'], dtype='object')
df.dtypes  # 返回的是值为相应列数据类型的Series
col_0    object
col_1     int64
col_2    object
dtype: object
df.shape
(2, 3)
df.T  # `DataFrame`转置
row_0row_1
col_0dawangyiyi
col_12119
col_2imageidentificationclinicalmedicine

三、常用基本函数

为了举例说明,下面使用一份learn_pandas.csv的虚拟数据集,它记录了四所学校学生的体测个人信息。

df = pd.read_csv('../data/learn_pandas.csv')
df.columns
Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
       'Test_Number', 'Test_Date', 'Time_Record'],
      dtype='object')

上述列名依次代表学校、年级、姓名、性别、身高、体重、是否为转系生、体测场次、测试时间、1000米成绩,本章仅使用其中的前七列。

df = df[df.columns[:7]]
df
SchoolGradeNameGenderHeightWeightTransfer
0Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale158.946.0N
1Peking UniversityFreshmanChangqiang YouMale166.570.0N
2Shanghai Jiao Tong UniversitySeniorMei SunMale188.989.0N
3Fudan UniversitySophomoreXiaojuan SunFemaleNaN41.0N
4Fudan UniversitySophomoreGaojuan YouMale174.074.0N
........................
195Fudan UniversityJuniorXiaojuan SunFemale153.946.0N
196Tsinghua UniversitySeniorLi ZhaoFemale160.950.0N
197Shanghai Jiao Tong UniversitySeniorChengqiang ChuFemale153.945.0N
198Shanghai Jiao Tong UniversitySeniorChengmei ShenMale175.371.0N
199Tsinghua UniversitySophomoreChunpeng LvMale155.751.0N

200 rows × 7 columns

1. 汇总函数

【a】head, tail函数分别返回表的前n行和后n行,其中n默认为5:

df.head(2)
SchoolGradeNameGenderHeightWeightTransfer
0Shanghai Jiao Tong UniversityFreshmanGaopeng YangFemale158.946.0N
1Peking UniversityFreshmanChangqiang YouMale166.570.0N
df.tail(2)
SchoolGradeNameGenderHeightWeightTransfer
198Shanghai Jiao Tong UniversitySeniorChengmei ShenMale175.371.0N
199Tsinghua UniversitySophomoreChunpeng LvMale155.751.0N

【b】info返回表的信息概况:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB

【c】describe返回表中 “数值” 列对应的主要统计量,此处 “数值” 列仅有 “Height” 和 “Weight” :

df.describe()
HeightWeight
count183.000000189.000000
mean163.21803355.015873
std8.60887912.824294
min145.40000034.000000
25%157.15000046.000000
50%161.90000051.000000
75%167.50000065.000000
max193.90000089.000000

2. 特征统计函数

sum, mean, median, var, std, max, min 为例:

df_demo = df[['Height', 'Weight']]
df_demo
HeightWeight
0158.946.0
1166.570.0
2188.989.0
3NaN41.0
4174.074.0
.........
195153.946.0
196160.950.0
197153.945.0
198175.371.0
199155.751.0

200 rows × 2 columns

df_demo.mean()  # 平均值
Height    163.218033
Weight     55.015873
dtype: float64
df_demo.max()  # 最大值
Height    193.9
Weight     89.0
dtype: float64
df_demo.quantile(0.75)  # 75%分位数
Height    167.5
Weight     65.0
Name: 0.75, dtype: float64
df_demo.count()  # 非缺失值个数
Height    183
Weight    189
dtype: int64
df_demo.idxmax() # 最大值对应的索引值
Height    193
Weight      2
dtype: int64

上述函数都有一个聚合参数axisaxis = 0代表逐列聚合,axis = 1代表逐行聚合,axis默认为0:

df_demo.mean(axis=1).head(2) # 表示计算前两行的平均值
0    102.45
1    118.25
dtype: float64

3. 唯一值函数

【a】 unique:找出单个属性唯一值,nunique:统计唯一值个数。

df['School']  # 所有人的 School 属性 
0      Shanghai Jiao Tong University
1                  Peking University
2      Shanghai Jiao Tong University
3                   Fudan University
4                   Fudan University
                   ...              
195                 Fudan University
196              Tsinghua University
197    Shanghai Jiao Tong University
198    Shanghai Jiao Tong University
199              Tsinghua University
Name: School, Length: 200, dtype: object
df['School'].unique()  # 找出 School 属性的唯一值
array(['Shanghai Jiao Tong University', 'Peking University',
       'Fudan University', 'Tsinghua University'], dtype=object)
df['School'].nunique()  # 统计 School 属性唯一值的个数
4

【b】value_counts:得到唯一值和其对应出现的频数。

df['School'].value_counts()
Tsinghua University              69
Shanghai Jiao Tong University    57
Fudan University                 40
Peking University                34
Name: School, dtype: int64

【c】 drop_duplicates找出单个属性唯一值或者多个属性组合的唯一值。

关键参数是keepfirst表示每个组合保留第一次出现的所在行,last表示保留最后一次出现的所在行,False表示把所有重复组合所在的行剔除。

对于单个属性唯一值:

df['School'].drop_duplicates(keep='first')
0    Shanghai Jiao Tong University
1                Peking University
3                 Fudan University
5              Tsinghua University
Name: School, dtype: object

对于多个属性组合的唯一值:

df_demo = df[['Gender','Transfer','Name']]
df_demo
GenderTransferName
0FemaleNGaopeng Yang
1MaleNChangqiang You
2MaleNMei Sun
3FemaleNXiaojuan Sun
4MaleNGaojuan You
............
195FemaleNXiaojuan Sun
196FemaleNLi Zhao
197FemaleNChengqiang Chu
198MaleNChengmei Shen
199MaleNChunpeng Lv

200 rows × 3 columns

df_demo.drop_duplicates(['Gender', 'Transfer'], keep='first')
GenderTransferName
0FemaleNGaopeng Yang
1MaleNChangqiang You
12FemaleNaNPeng You
21MaleNaNXiaopeng Shen
36MaleYXiaojuan Qin
43FemaleYGaoli Feng
df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')
GenderTransferName
147MaleNaNJuan You
150MaleYChengpeng You
169FemaleYChengquan Qin
194FemaleNaNYanmei Qian
197FemaleNChengqiang Chu
199MaleNChunpeng Lv
df_demo.drop_duplicates(['Gender', 'Transfer'], keep=False)
GenderTransferName

4. 替换函数

替换操作通常针对单个属性进行,因此下面的例子都以Series举例。

【a】映射替换:replace

df['Grade'].head()
0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: object
df['Grade'].replace({'Freshman':1, 'Sophomore':2, 'Junior':3, 'Senior':4}).head()  # 字典传参
0    1
1    1
2    4
3    2
4    2
Name: Grade, dtype: int64
df['Grade'].replace(['Freshman','Sophomore','Junior','Senior'], [1,2,3,4]).head()  # 列表传参
0    1
1    1
2    4
3    2
4    2
Name: Grade, dtype: int64

特殊参数:method

s = pd.Series(['dawang','man','yiyi','woman','woman','dawang'])  # 下面展示用 'dawang'和'yiyi'替换'man'和'woman'
s.replace(['man','woman'], method='ffill')  # `method` = `ffill`,则为用前面一个最近的未被替换的值进行替换
0    dawang
1    dawang
2      yiyi
3      yiyi
4      yiyi
5    dawang
dtype: object
s.replace(['man','woman'], method='bfill')  # `method` = `bfill`,则为使用后面最近的未被替换的值进行替换
0    dawang
1      yiyi
2      yiyi
3    dawang
4    dawang
5    dawang
dtype: object

【b】逻辑替换:wheremask

s = pd.Series([-1,1,-1,1,1,-1])

where:在传入条件为False的对应行进行替换

s.where(s<0)  # 不指定替换值,替换为缺失值NaN
0   -1.0
1    NaN
2   -1.0
3    NaN
4    NaN
5   -1.0
dtype: float64
s.where(s<0,'此处大于0')  # 指定替换值
0       -1
1    此处大于0
2       -1
3    此处大于0
4    此处大于0
5       -1
dtype: object

mask:在传入条件为True的对应行进行替换

s.mask(s<0)
0    NaN
1    1.0
2    NaN
3    1.0
4    1.0
5    NaN
dtype: float64
s.mask(s<0, '此处小于0')
0    此处小于0
1        1
2    此处小于0
3        1
4        1
5    此处小于0
dtype: object

【c】数值替换:roundabsclip

s = 0.666 * pd.Series(range(-2,3))
s
0   -1.332
1   -0.666
2    0.000
3    0.666
4    1.332
dtype: float64
s.clip(-0.5, 1) # 小于-0.5的值替换为0.5,大于1的值替换为1,在-0.5→1之间的值不变
0   -0.500
1   -0.500
2    0.000
3    0.666
4    1.000
dtype: float64
s.round(2)  # 保留两位小数
0   -1.33
1   -0.67
2    0.00
3    0.67
4    1.33
dtype: float64
s.abs()  # 取绝对值
0    1.332
1    0.666
2    0.000
3    0.666
4    1.332
dtype: float64

5. 排序函数

排序函数常包括值排序和索引排序,下面先用set_index方法把年级和姓名两列作为索引:

df_demo = df[['Grade', 'Name', 'Height', 'Weight']].set_index(['Grade','Name'])
df_demo.head()
HeightWeight
GradeName
FreshmanGaopeng Yang158.946.0
Changqiang You166.570.0
SeniorMei Sun188.989.0
SophomoreXiaojuan SunNaN41.0
Gaojuan You174.074.0

【a】值排序:sort_values

df_demo.sort_values('Height').head()  # 对身高进行升序排序
HeightWeight
GradeName
JuniorXiaoli Chu145.434.0
SeniorGaomei Lv147.334.0
SophomorePeng Han147.834.0
SeniorChangli Lv148.741.0
SophomoreChangjuan You150.540.0
df_demo.sort_values('Height', ascending=False).head()  # 对身高进行降序排序
HeightWeight
GradeName
SeniorXiaoqiang Qin193.979.0
Mei Sun188.989.0
Gaoli Zhao186.583.0
FreshmanQiang Han185.387.0
SeniorQiang Zheng183.987.0

多列排序:体重不同时,按体重降序排序;体重相同时,按身高升序排列。

df_demo.sort_values(['Weight','Height'],ascending=[False,True]).head()
HeightWeight
GradeName
SeniorMei Sun188.989.0
Qiang Zheng183.987.0
FreshmanQiang Han185.387.0
Chunli Zhao180.283.0
Changpeng Zhao181.383.0

【b】索引排序:sort_index

索引排序,元素的值在索引中,用参数level指定索引层的名字或者层号。PS:字符串的排列顺序由英文字母顺序决定。

df_demo.sort_index(level=['Grade','Name'],ascending=[False,True]).head()  # 'Grade'降序排序(z→a),'Name'升序排序(a→z)
HeightWeight
GradeName
SophomoreChangjuan You150.540.0
Changmei Xu151.643.0
Changqiang Qian167.664.0
Chengli You164.157.0
Chengqiang Lv166.853.0

6. apply方法

使用 apply 方法,可以同时应用多种运算:

df_demo.apply(lambda x:(x-x.mean()).std())  # 先取均值,然后与原来值相减,最后求标准差
Height     8.608879
Weight    12.824294
dtype: float64

若指定axis=1,那么每次传入函数的就是行元素组成的Series,其结果与之前的逐行均值结果一致。

df_demo.apply(lambda x:(x-x.mean()).std(), axis=1).head()
0    79.832356
1    68.235804
2    70.639967
3          NaN
4    70.710678
dtype: float64

四、窗口对象

pandas中有3类窗口,用得较多的是滑动窗口rolling和扩张窗口expanding

1. 滑窗对象

.rolling:得到滑窗对象,参数 window 为窗口大小。

s = pd.Series([1,2,3,4,5])
roller = s.rolling(window = 2)
roller
Rolling [window=2,center=False,axis=0]

在得到了滑窗对象后,能够使用相应的聚合函数进行计算。每次选取元素的个数即为window的值。下面以求均值的步骤举例:

第一个位置,窗口还未完全进入,故为缺失值

第二个位置,窗口包含第一、二个元素,求均值为(1+2)/2=1.5

第三个位置,窗口包含第二、三个元素,求均值为(2+3)/2=2.5

以此类推。

roller.mean()
0    NaN
1    1.5
2    2.5
3    3.5
4    4.5
dtype: float64
roller.apply(lambda x:x.mean())  # 等价于求均值
0    NaN
1    1.5
2    2.5
3    3.5
4    4.5
dtype: float64
roller.sum()  # 与求均值的流程一致
0    NaN
1    3.0
2    5.0
3    7.0
4    9.0
dtype: float64

2. 扩张窗口

扩张窗口又称累计窗口,其窗口的大小就是从序列开始处到具体操作的对应位置,其使用的聚合函数会作用于这些逐步扩张的窗口上。下面仍以求均值的步骤举例:

第一个位置求均值,窗口大小为1均值为:1/1=1

第二个位置求均值,窗口大小为2,均值为:(1+3)/2=2

第三个位置求均值,窗口大小为3,均值为:(1+3+6)/3=3.3333

以此类推。

s = pd.Series([1, 3, 6, 10])
s.expanding().mean()
0    1.000000
1    2.000000
2    3.333333
3    5.000000
dtype: float64
s.ewm(alpha=0.2).mean().head()
0   -1.000000
1   -1.000000
2   -1.409836
3   -1.609756
4   -1.725845
dtype: float64

本文参考datawhale组最学习相关资料!!!

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值