第二章 pandas基础
一、文件的读取和写入
1. 文件读取
【a】pandas
读取csv, excel, txt
文件:
import numpy as np
import pandas as pd
df_csv = pd.read_csv('../data/my_csv.csv')
df_csv
col1 | col2 | col3 | col4 | col5 | |
---|---|---|---|---|---|
0 | 2 | a | 1.4 | apple | 2020/1/1 |
1 | 3 | b | 3.4 | banana | 2020/1/2 |
2 | 6 | c | 2.5 | orange | 2020/1/5 |
3 | 5 | d | 3.2 | lemon | 2020/1/7 |
df_txt = pd.read_table('../data/my_table.txt')
df_txt
col1 | col2 | col3 | col4 | |
---|---|---|---|---|
0 | 2 | a | 1.4 | apple 2020/1/1 |
1 | 3 | b | 3.4 | banana 2020/1/2 |
2 | 6 | c | 2.5 | orange 2020/1/5 |
3 | 5 | d | 3.2 | lemon 2020/1/7 |
df_excel = pd.read_excel('../data/my_excel.xlsx')
df_excel
col1 | col2 | col3 | col4 | col5 | |
---|---|---|---|---|---|
0 | 2 | a | 1.4 | apple | 2020/1/1 |
1 | 3 | b | 3.4 | banana | 2020/1/2 |
2 | 6 | c | 2.5 | orange | 2020/1/5 |
3 | 5 | d | 3.2 | lemon | 2020/1/7 |
【b】常用参数:header
、usecols
、nrows
,对 txt、csv、excel 均适用。
header=None
表示第一行不作为列名:
pd.read_table('../data/my_table.txt', header=None)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | col1 | col2 | col3 | col4 |
1 | 2 | a | 1.4 | apple 2020/1/1 |
2 | 3 | b | 3.4 | banana 2020/1/2 |
3 | 6 | c | 2.5 | orange 2020/1/5 |
4 | 5 | d | 3.2 | lemon 2020/1/7 |
usecols
表示读取列的集合,默认读取所有的列:
pd.read_table('../data/my_table.txt', usecols=['col1', 'col2'])
col1 | col2 | |
---|---|---|
0 | 2 | a |
1 | 3 | b |
2 | 6 | c |
3 | 5 | d |
nrows
表示读取的数据行数:
pd.read_table('../data/my_table.txt', nrows=2)
col1 | col2 | col3 | col4 | |
---|---|---|---|---|
0 | 2 | a | 1.4 | apple 2020/1/1 |
1 | 3 | b | 3.4 | banana 2020/1/2 |
【c】对于分隔符非空格的txt文件,分割参数sep
可以自定义分隔符号。
pd.read_table('../data/my_table_special_sep.txt')
col1 |||| col2 | |
---|---|
0 | TS |||| This is an apple. |
1 | GQ |||| My name is Bob. |
2 | WT |||| Well done! |
3 | PT |||| May I help you? |
使用sep
设置分隔符,同时指定引擎为python
:
PS:使用read_table
的时,参数sep
中使用的是正则表达式,因此需要对|
进行转义变成\|
,否则无法读取到正确的结果。
pd.read_table('../data/my_table_special_sep.txt', sep=' \|\|\|\| ', engine='python')
col1 | col2 | |
---|---|---|
0 | TS | This is an apple. |
1 | GQ | My name is Bob. |
2 | WT | Well done! |
3 | PT | May I help you? |
2. 文件写入
【a】index
=False
,在保存文件的时候去除索引:
df_csv.to_csv('../data/my_csv_saved.csv', index=True)
df_excel.to_excel('../data/my_excel_saved.xlsx', index=False)
【b】pandas
中没有定义to_table
函数,txt
文件也使用to_csv
进行保存:
df_txt.to_csv('../data/my_txt_saved.txt', sep='\t', index=False)
【c】表格转markdown
、latex
:
df_csv
col1 | col2 | col3 | col4 | col5 | |
---|---|---|---|---|---|
0 | 2 | a | 1.4 | apple | 2020/1/1 |
1 | 3 | b | 3.4 | banana | 2020/1/2 |
2 | 6 | c | 2.5 | orange | 2020/1/5 |
3 | 5 | d | 3.2 | lemon | 2020/1/7 |
print(df_csv.to_markdown())
| | col1 | col2 | col3 | col4 | col5 |
|---:|-------:|:-------|-------:|:-------|:---------|
| 0 | 2 | a | 1.4 | apple | 2020/1/1 |
| 1 | 3 | b | 3.4 | banana | 2020/1/2 |
| 2 | 6 | c | 2.5 | orange | 2020/1/5 |
| 3 | 5 | d | 3.2 | lemon | 2020/1/7 |
print(df_csv.to_latex())
\begin{tabular}{lrlrll}
\toprule
{} & col1 & col2 & col3 & col4 & col5 \\
\midrule
0 & 2 & a & 1.4 & apple & 2020/1/1 \\
1 & 3 & b & 3.4 & banana & 2020/1/2 \\
2 & 6 & c & 2.5 & orange & 2020/1/5 \\
3 & 5 & d & 3.2 & lemon & 2020/1/7 \\
\bottomrule
\end{tabular}
二、基本数据结构
1. Series
【a】Series
存储一维values
,由序列的值data
、索引index
、存储类型dtype
、序列的名字name
组成:
s = pd.Series(data = [['dawang','yiyi'],{'dawang':21},'yiyi_19'],
index = ['list','dict','str'],
# index = pd.Index(['list','dict','str'], name='index_name'),
dtype = 'object',
name = 'dawang_yiyi') # `object`是一种混合类型,正如次例中存储了列表、字典和字符串。
s
list [dawang, yiyi]
dict {'dawang': 21}
str yiyi_19
Name: dawang_yiyi, dtype: object
【b】获取 Series
属性:
s.values
array([list(['dawang', 'yiyi']), {'dawang': 21}, 'yiyi_19'], dtype=object)
s.index
Index(['list', 'dict', 'str'], dtype='object')
s.dtype
dtype('O')
s.name
'dawang_yiyi'
s.shape # 获取序列的长度
(3,)
2. DataFrame
DataFrame
存储二维values
,在Series
的基础上增加了列索引columns
。
【a】二维数组 + 行、列索引构造DataFrame
:
data = [['dawang',21,'image identification'], ['yiyi',19,'clinical medicine']]
df = pd.DataFrame(data = data,
index = ['row_%d'%i for i in range(2)],
columns=['col_0', 'col_1', 'col_2'])
df
col_0 | col_1 | col_2 | |
---|---|---|---|
row_0 | dawang | 21 | image identification |
row_1 | yiyi | 19 | clinical medicine |
【b】列索引名→数据,映射构造DataFrame
,同时加上行索引:
df = pd.DataFrame(data = {'col_0': ['dawang','yiyi'],
'col_1': [21, 19],
'col_2': ['imageidentification', 'clinicalmedicine']},
index = ['row_%d'%i for i in range(2)])
df
col_0 | col_1 | col_2 | |
---|---|---|---|
row_0 | dawang | 21 | imageidentification |
row_1 | yiyi | 19 | clinicalmedicine |
【c】DataFrame
+ [col_name]
取出相应列得到 Series
:
df['col_0']
row_0 dawang
row_1 yiyi
Name: col_0, dtype: object
【d】DataFrame
+ [col_list]
取出多列得到 DataFrame
:
df[['col_0', 'col_1']]
col_0 | col_1 | |
---|---|---|
row_0 | dawang | 21 |
row_1 | yiyi | 19 |
【e】获取 DataFrame
属性:
df.values
array([['dawang', 21, 'imageidentification'],
['yiyi', 19, 'clinicalmedicine']], dtype=object)
df.index
Index(['row_0', 'row_1'], dtype='object')
df.columns
Index(['col_0', 'col_1', 'col_2'], dtype='object')
df.dtypes # 返回的是值为相应列数据类型的Series
col_0 object
col_1 int64
col_2 object
dtype: object
df.shape
(2, 3)
df.T # `DataFrame`转置
row_0 | row_1 | |
---|---|---|
col_0 | dawang | yiyi |
col_1 | 21 | 19 |
col_2 | imageidentification | clinicalmedicine |
三、常用基本函数
为了举例说明,下面使用一份learn_pandas.csv
的虚拟数据集,它记录了四所学校学生的体测个人信息。
df = pd.read_csv('../data/learn_pandas.csv')
df.columns
Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
'Test_Number', 'Test_Date', 'Time_Record'],
dtype='object')
上述列名依次代表学校、年级、姓名、性别、身高、体重、是否为转系生、体测场次、测试时间、1000米成绩,本章仅使用其中的前七列。
df = df[df.columns[:7]]
df
School | Grade | Name | Gender | Height | Weight | Transfer | |
---|---|---|---|---|---|---|---|
0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
2 | Shanghai Jiao Tong University | Senior | Mei Sun | Male | 188.9 | 89.0 | N |
3 | Fudan University | Sophomore | Xiaojuan Sun | Female | NaN | 41.0 | N |
4 | Fudan University | Sophomore | Gaojuan You | Male | 174.0 | 74.0 | N |
... | ... | ... | ... | ... | ... | ... | ... |
195 | Fudan University | Junior | Xiaojuan Sun | Female | 153.9 | 46.0 | N |
196 | Tsinghua University | Senior | Li Zhao | Female | 160.9 | 50.0 | N |
197 | Shanghai Jiao Tong University | Senior | Chengqiang Chu | Female | 153.9 | 45.0 | N |
198 | Shanghai Jiao Tong University | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N |
199 | Tsinghua University | Sophomore | Chunpeng Lv | Male | 155.7 | 51.0 | N |
200 rows × 7 columns
1. 汇总函数
【a】head, tail
函数分别返回表的前n
行和后n
行,其中n
默认为5:
df.head(2)
School | Grade | Name | Gender | Height | Weight | Transfer | |
---|---|---|---|---|---|---|---|
0 | Shanghai Jiao Tong University | Freshman | Gaopeng Yang | Female | 158.9 | 46.0 | N |
1 | Peking University | Freshman | Changqiang You | Male | 166.5 | 70.0 | N |
df.tail(2)
School | Grade | Name | Gender | Height | Weight | Transfer | |
---|---|---|---|---|---|---|---|
198 | Shanghai Jiao Tong University | Senior | Chengmei Shen | Male | 175.3 | 71.0 | N |
199 | Tsinghua University | Sophomore | Chunpeng Lv | Male | 155.7 | 51.0 | N |
【b】info
返回表的信息概况:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 School 200 non-null object
1 Grade 200 non-null object
2 Name 200 non-null object
3 Gender 200 non-null object
4 Height 183 non-null float64
5 Weight 189 non-null float64
6 Transfer 188 non-null object
dtypes: float64(2), object(5)
memory usage: 11.1+ KB
【c】describe
返回表中 “数值” 列对应的主要统计量,此处 “数值” 列仅有 “Height” 和 “Weight” :
df.describe()
Height | Weight | |
---|---|---|
count | 183.000000 | 189.000000 |
mean | 163.218033 | 55.015873 |
std | 8.608879 | 12.824294 |
min | 145.400000 | 34.000000 |
25% | 157.150000 | 46.000000 |
50% | 161.900000 | 51.000000 |
75% | 167.500000 | 65.000000 |
max | 193.900000 | 89.000000 |
2. 特征统计函数
以 sum, mean, median, var, std, max, min
为例:
df_demo = df[['Height', 'Weight']]
df_demo
Height | Weight | |
---|---|---|
0 | 158.9 | 46.0 |
1 | 166.5 | 70.0 |
2 | 188.9 | 89.0 |
3 | NaN | 41.0 |
4 | 174.0 | 74.0 |
... | ... | ... |
195 | 153.9 | 46.0 |
196 | 160.9 | 50.0 |
197 | 153.9 | 45.0 |
198 | 175.3 | 71.0 |
199 | 155.7 | 51.0 |
200 rows × 2 columns
df_demo.mean() # 平均值
Height 163.218033
Weight 55.015873
dtype: float64
df_demo.max() # 最大值
Height 193.9
Weight 89.0
dtype: float64
df_demo.quantile(0.75) # 75%分位数
Height 167.5
Weight 65.0
Name: 0.75, dtype: float64
df_demo.count() # 非缺失值个数
Height 183
Weight 189
dtype: int64
df_demo.idxmax() # 最大值对应的索引值
Height 193
Weight 2
dtype: int64
上述函数都有一个聚合参数axis
,axis
= 0代表逐列聚合,axis
= 1代表逐行聚合,axis
默认为0:
df_demo.mean(axis=1).head(2) # 表示计算前两行的平均值
0 102.45
1 118.25
dtype: float64
3. 唯一值函数
【a】 unique
:找出单个属性唯一值,nunique
:统计唯一值个数。
df['School'] # 所有人的 School 属性
0 Shanghai Jiao Tong University
1 Peking University
2 Shanghai Jiao Tong University
3 Fudan University
4 Fudan University
...
195 Fudan University
196 Tsinghua University
197 Shanghai Jiao Tong University
198 Shanghai Jiao Tong University
199 Tsinghua University
Name: School, Length: 200, dtype: object
df['School'].unique() # 找出 School 属性的唯一值
array(['Shanghai Jiao Tong University', 'Peking University',
'Fudan University', 'Tsinghua University'], dtype=object)
df['School'].nunique() # 统计 School 属性唯一值的个数
4
【b】value_counts
:得到唯一值和其对应出现的频数。
df['School'].value_counts()
Tsinghua University 69
Shanghai Jiao Tong University 57
Fudan University 40
Peking University 34
Name: School, dtype: int64
【c】 drop_duplicates
找出单个属性唯一值或者多个属性组合的唯一值。
关键参数是keep
,first
表示每个组合保留第一次出现的所在行,last
表示保留最后一次出现的所在行,False
表示把所有重复组合所在的行剔除。
对于单个属性唯一值:
df['School'].drop_duplicates(keep='first')
0 Shanghai Jiao Tong University
1 Peking University
3 Fudan University
5 Tsinghua University
Name: School, dtype: object
对于多个属性组合的唯一值:
df_demo = df[['Gender','Transfer','Name']]
df_demo
Gender | Transfer | Name | |
---|---|---|---|
0 | Female | N | Gaopeng Yang |
1 | Male | N | Changqiang You |
2 | Male | N | Mei Sun |
3 | Female | N | Xiaojuan Sun |
4 | Male | N | Gaojuan You |
... | ... | ... | ... |
195 | Female | N | Xiaojuan Sun |
196 | Female | N | Li Zhao |
197 | Female | N | Chengqiang Chu |
198 | Male | N | Chengmei Shen |
199 | Male | N | Chunpeng Lv |
200 rows × 3 columns
df_demo.drop_duplicates(['Gender', 'Transfer'], keep='first')
Gender | Transfer | Name | |
---|---|---|---|
0 | Female | N | Gaopeng Yang |
1 | Male | N | Changqiang You |
12 | Female | NaN | Peng You |
21 | Male | NaN | Xiaopeng Shen |
36 | Male | Y | Xiaojuan Qin |
43 | Female | Y | Gaoli Feng |
df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')
Gender | Transfer | Name | |
---|---|---|---|
147 | Male | NaN | Juan You |
150 | Male | Y | Chengpeng You |
169 | Female | Y | Chengquan Qin |
194 | Female | NaN | Yanmei Qian |
197 | Female | N | Chengqiang Chu |
199 | Male | N | Chunpeng Lv |
df_demo.drop_duplicates(['Gender', 'Transfer'], keep=False)
Gender | Transfer | Name |
---|
4. 替换函数
替换操作通常针对单个属性进行,因此下面的例子都以Series
举例。
【a】映射替换:replace
df['Grade'].head()
0 Freshman
1 Freshman
2 Senior
3 Sophomore
4 Sophomore
Name: Grade, dtype: object
df['Grade'].replace({'Freshman':1, 'Sophomore':2, 'Junior':3, 'Senior':4}).head() # 字典传参
0 1
1 1
2 4
3 2
4 2
Name: Grade, dtype: int64
df['Grade'].replace(['Freshman','Sophomore','Junior','Senior'], [1,2,3,4]).head() # 列表传参
0 1
1 1
2 4
3 2
4 2
Name: Grade, dtype: int64
特殊参数:method
s = pd.Series(['dawang','man','yiyi','woman','woman','dawang']) # 下面展示用 'dawang'和'yiyi'替换'man'和'woman'
s.replace(['man','woman'], method='ffill') # `method` = `ffill`,则为用前面一个最近的未被替换的值进行替换
0 dawang
1 dawang
2 yiyi
3 yiyi
4 yiyi
5 dawang
dtype: object
s.replace(['man','woman'], method='bfill') # `method` = `bfill`,则为使用后面最近的未被替换的值进行替换
0 dawang
1 yiyi
2 yiyi
3 dawang
4 dawang
5 dawang
dtype: object
【b】逻辑替换:where
、mask
s = pd.Series([-1,1,-1,1,1,-1])
where
:在传入条件为False
的对应行进行替换
s.where(s<0) # 不指定替换值,替换为缺失值NaN
0 -1.0
1 NaN
2 -1.0
3 NaN
4 NaN
5 -1.0
dtype: float64
s.where(s<0,'此处大于0') # 指定替换值
0 -1
1 此处大于0
2 -1
3 此处大于0
4 此处大于0
5 -1
dtype: object
mask
:在传入条件为True
的对应行进行替换
s.mask(s<0)
0 NaN
1 1.0
2 NaN
3 1.0
4 1.0
5 NaN
dtype: float64
s.mask(s<0, '此处小于0')
0 此处小于0
1 1
2 此处小于0
3 1
4 1
5 此处小于0
dtype: object
【c】数值替换:round
、abs
、clip
s = 0.666 * pd.Series(range(-2,3))
s
0 -1.332
1 -0.666
2 0.000
3 0.666
4 1.332
dtype: float64
s.clip(-0.5, 1) # 小于-0.5的值替换为0.5,大于1的值替换为1,在-0.5→1之间的值不变
0 -0.500
1 -0.500
2 0.000
3 0.666
4 1.000
dtype: float64
s.round(2) # 保留两位小数
0 -1.33
1 -0.67
2 0.00
3 0.67
4 1.33
dtype: float64
s.abs() # 取绝对值
0 1.332
1 0.666
2 0.000
3 0.666
4 1.332
dtype: float64
5. 排序函数
排序函数常包括值排序和索引排序,下面先用set_index
方法把年级和姓名两列作为索引:
df_demo = df[['Grade', 'Name', 'Height', 'Weight']].set_index(['Grade','Name'])
df_demo.head()
Height | Weight | ||
---|---|---|---|
Grade | Name | ||
Freshman | Gaopeng Yang | 158.9 | 46.0 |
Changqiang You | 166.5 | 70.0 | |
Senior | Mei Sun | 188.9 | 89.0 |
Sophomore | Xiaojuan Sun | NaN | 41.0 |
Gaojuan You | 174.0 | 74.0 |
【a】值排序:sort_values
df_demo.sort_values('Height').head() # 对身高进行升序排序
Height | Weight | ||
---|---|---|---|
Grade | Name | ||
Junior | Xiaoli Chu | 145.4 | 34.0 |
Senior | Gaomei Lv | 147.3 | 34.0 |
Sophomore | Peng Han | 147.8 | 34.0 |
Senior | Changli Lv | 148.7 | 41.0 |
Sophomore | Changjuan You | 150.5 | 40.0 |
df_demo.sort_values('Height', ascending=False).head() # 对身高进行降序排序
Height | Weight | ||
---|---|---|---|
Grade | Name | ||
Senior | Xiaoqiang Qin | 193.9 | 79.0 |
Mei Sun | 188.9 | 89.0 | |
Gaoli Zhao | 186.5 | 83.0 | |
Freshman | Qiang Han | 185.3 | 87.0 |
Senior | Qiang Zheng | 183.9 | 87.0 |
多列排序:体重不同时,按体重降序排序;体重相同时,按身高升序排列。
df_demo.sort_values(['Weight','Height'],ascending=[False,True]).head()
Height | Weight | ||
---|---|---|---|
Grade | Name | ||
Senior | Mei Sun | 188.9 | 89.0 |
Qiang Zheng | 183.9 | 87.0 | |
Freshman | Qiang Han | 185.3 | 87.0 |
Chunli Zhao | 180.2 | 83.0 | |
Changpeng Zhao | 181.3 | 83.0 |
【b】索引排序:sort_index
索引排序,元素的值在索引中,用参数level
指定索引层的名字或者层号。PS:字符串的排列顺序由英文字母顺序决定。
df_demo.sort_index(level=['Grade','Name'],ascending=[False,True]).head() # 'Grade'降序排序(z→a),'Name'升序排序(a→z)
Height | Weight | ||
---|---|---|---|
Grade | Name | ||
Sophomore | Changjuan You | 150.5 | 40.0 |
Changmei Xu | 151.6 | 43.0 | |
Changqiang Qian | 167.6 | 64.0 | |
Chengli You | 164.1 | 57.0 | |
Chengqiang Lv | 166.8 | 53.0 |
6. apply方法
使用 apply
方法,可以同时应用多种运算:
df_demo.apply(lambda x:(x-x.mean()).std()) # 先取均值,然后与原来值相减,最后求标准差
Height 8.608879
Weight 12.824294
dtype: float64
若指定axis=1
,那么每次传入函数的就是行元素组成的Series
,其结果与之前的逐行均值结果一致。
df_demo.apply(lambda x:(x-x.mean()).std(), axis=1).head()
0 79.832356
1 68.235804
2 70.639967
3 NaN
4 70.710678
dtype: float64
四、窗口对象
pandas
中有3类窗口,用得较多的是滑动窗口rolling
和扩张窗口expanding
。
1. 滑窗对象
.rolling
:得到滑窗对象,参数 window
为窗口大小。
s = pd.Series([1,2,3,4,5])
roller = s.rolling(window = 2)
roller
Rolling [window=2,center=False,axis=0]
在得到了滑窗对象后,能够使用相应的聚合函数进行计算。每次选取元素的个数即为window
的值。下面以求均值的步骤举例:
第一个位置,窗口还未完全进入,故为缺失值
第二个位置,窗口包含第一、二个元素,求均值为(1+2)/2=1.5
第三个位置,窗口包含第二、三个元素,求均值为(2+3)/2=2.5
以此类推。
roller.mean()
0 NaN
1 1.5
2 2.5
3 3.5
4 4.5
dtype: float64
roller.apply(lambda x:x.mean()) # 等价于求均值
0 NaN
1 1.5
2 2.5
3 3.5
4 4.5
dtype: float64
roller.sum() # 与求均值的流程一致
0 NaN
1 3.0
2 5.0
3 7.0
4 9.0
dtype: float64
2. 扩张窗口
扩张窗口又称累计窗口,其窗口的大小就是从序列开始处到具体操作的对应位置,其使用的聚合函数会作用于这些逐步扩张的窗口上。下面仍以求均值的步骤举例:
第一个位置求均值,窗口大小为1均值为:1/1=1
第二个位置求均值,窗口大小为2,均值为:(1+3)/2=2
第三个位置求均值,窗口大小为3,均值为:(1+3+6)/3=3.3333
以此类推。
s = pd.Series([1, 3, 6, 10])
s.expanding().mean()
0 1.000000
1 2.000000
2 3.333333
3 5.000000
dtype: float64
s.ewm(alpha=0.2).mean().head()
0 -1.000000
1 -1.000000
2 -1.409836
3 -1.609756
4 -1.725845
dtype: float64