第二章 pandas基础

最新推荐文章于 2024-07-24 14:05:30 发布

怡颜悦色

最新推荐文章于 2024-07-24 14:05:30 发布

阅读量286

点赞数

分类专栏： Pandas 文章标签：机器学习

本文链接：https://blog.csdn.net/qq_42967904/article/details/111400074

版权

Pandas 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

第二章 pandas基础

一、文件的读取和写入

1. 文件读取

【a】pandas读取csv, excel, txt文件：

import numpy as np
import pandas as pd
df_csv = pd.read_csv('../data/my_csv.csv')
df_csv

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

df_txt = pd.read_table('../data/my_table.txt')
df_txt

	col1	col2	col3	col4
0	2	a	1.4	apple 2020/1/1
1	3	b	3.4	banana 2020/1/2
2	6	c	2.5	orange 2020/1/5
3	5	d	3.2	lemon 2020/1/7

df_excel = pd.read_excel('../data/my_excel.xlsx')
df_excel

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

【b】常用参数：header、usecols、nrows，对 txt、csv、excel 均适用。

header=None表示第一行不作为列名：

pd.read_table('../data/my_table.txt', header=None)

	0	1	2	3
0	col1	col2	col3	col4
1	2	a	1.4	apple 2020/1/1
2	3	b	3.4	banana 2020/1/2
3	6	c	2.5	orange 2020/1/5
4	5	d	3.2	lemon 2020/1/7

usecols表示读取列的集合，默认读取所有的列：

pd.read_table('../data/my_table.txt', usecols=['col1', 'col2'])

	col1	col2
0	2	a
1	3	b
2	6	c
3	5	d

nrows表示读取的数据行数：

pd.read_table('../data/my_table.txt', nrows=2)

	col1	col2	col3	col4
0	2	a	1.4	apple 2020/1/1
1	3	b	3.4	banana 2020/1/2

【c】对于分隔符非空格的txt文件，分割参数sep可以自定义分隔符号。

pd.read_table('../data/my_table_special_sep.txt')

	col1 \|\|\|\| col2
0	TS \|\|\|\| This is an apple.
1	GQ \|\|\|\| My name is Bob.
2	WT \|\|\|\| Well done!
3	PT \|\|\|\| May I help you?

使用sep设置分隔符，同时指定引擎为python：

PS：使用read_table的时，参数sep中使用的是正则表达式，因此需要对|进行转义变成\|，否则无法读取到正确的结果。

pd.read_table('../data/my_table_special_sep.txt', sep=' \|\|\|\| ', engine='python')

	col1	col2
0	TS	This is an apple.
1	GQ	My name is Bob.
2	WT	Well done!
3	PT	May I help you?

2. 文件写入

【a】index=False，在保存文件的时候去除索引：

df_csv.to_csv('../data/my_csv_saved.csv', index=True)
df_excel.to_excel('../data/my_excel_saved.xlsx', index=False)

无索引

在这里插入图片描述

没有索引

在这里插入图片描述

【b】pandas中没有定义to_table函数，txt文件也使用to_csv进行保存：

df_txt.to_csv('../data/my_txt_saved.txt', sep='\t', index=False)

【c】表格转markdown、latex：

df_csv

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

print(df_csv.to_markdown())

|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |

print(df_csv.to_latex())

\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

二、基本数据结构

1. Series

【a】Series存储一维values，由序列的值data、索引index、存储类型dtype、序列的名字name组成：

s = pd.Series(data = [['dawang','yiyi'],{'dawang':21},'yiyi_19'],
              index = ['list','dict','str'],
              # index = pd.Index(['list','dict','str'], name='index_name'),
              dtype = 'object',
              name = 'dawang_yiyi')  # `object`是一种混合类型，正如次例中存储了列表、字典和字符串。
s

list    [dawang, yiyi]
dict    {'dawang': 21}
str            yiyi_19
Name: dawang_yiyi, dtype: object

【b】获取 Series 属性：

s.values

array([list(['dawang', 'yiyi']), {'dawang': 21}, 'yiyi_19'], dtype=object)

s.index

Index(['list', 'dict', 'str'], dtype='object')

s.dtype

dtype('O')

s.name

'dawang_yiyi'

s.shape  # 获取序列的长度

(3,)

2. DataFrame

DataFrame存储二维values，在Series的基础上增加了列索引columns。

【a】二维数组 + 行、列索引构造DataFrame：

data = [['dawang',21,'image identification'], ['yiyi',19,'clinical medicine']]
df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(2)],
                  columns=['col_0', 'col_1', 'col_2'])
df

	col_0	col_1	col_2
row_0	dawang	21	image identification
row_1	yiyi	19	clinical medicine

【b】列索引名→数据，映射构造DataFrame，同时加上行索引：

df = pd.DataFrame(data = {'col_0': ['dawang','yiyi'],
                          'col_1': [21, 19],
                          'col_2': ['imageidentification', 'clinicalmedicine']},
                  index = ['row_%d'%i for i in range(2)])
df

	col_0	col_1	col_2
row_0	dawang	21	imageidentification
row_1	yiyi	19	clinicalmedicine

【c】DataFrame + [col_name]取出相应列得到 Series：

df['col_0']

row_0    dawang
row_1      yiyi
Name: col_0, dtype: object

【d】DataFrame + [col_list]取出多列得到 DataFrame：

df[['col_0', 'col_1']]

	col_0	col_1
row_0	dawang	21
row_1	yiyi	19

【e】获取 DataFrame 属性：

df.values

array([['dawang', 21, 'imageidentification'],
       ['yiyi', 19, 'clinicalmedicine']], dtype=object)

df.index

Index(['row_0', 'row_1'], dtype='object')

df.columns

Index(['col_0', 'col_1', 'col_2'], dtype='object')

df.dtypes  # 返回的是值为相应列数据类型的Series

col_0    object
col_1     int64
col_2    object
dtype: object

df.shape

(2, 3)

df.T  # `DataFrame`转置

	row_0	row_1
col_0	dawang	yiyi
col_1	21	19
col_2	imageidentification	clinicalmedicine

三、常用基本函数

为了举例说明，下面使用一份learn_pandas.csv的虚拟数据集，它记录了四所学校学生的体测个人信息。

df = pd.read_csv('../data/learn_pandas.csv')
df.columns

Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
       'Test_Number', 'Test_Date', 'Time_Record'],
      dtype='object')

上述列名依次代表学校、年级、姓名、性别、身高、体重、是否为转系生、体测场次、测试时间、1000米成绩，本章仅使用其中的前七列。

df = df[df.columns[:7]]
df

	School	Grade	Name	Gender	Height	Weight	Transfer
0	Shanghai Jiao Tong University	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Peking University	Freshman	Changqiang You	Male	166.5	70.0	N
2	Shanghai Jiao Tong University	Senior	Mei Sun	Male	188.9	89.0	N
3	Fudan University	Sophomore	Xiaojuan Sun	Female	NaN	41.0	N
4	Fudan University	Sophomore	Gaojuan You	Male	174.0	74.0	N
...	...	...	...	...	...	...	...
195	Fudan University	Junior	Xiaojuan Sun	Female	153.9	46.0	N
196	Tsinghua University	Senior	Li Zhao	Female	160.9	50.0	N
197	Shanghai Jiao Tong University	Senior	Chengqiang Chu	Female	153.9	45.0	N
198	Shanghai Jiao Tong University	Senior	Chengmei Shen	Male	175.3	71.0	N
199	Tsinghua University	Sophomore	Chunpeng Lv	Male	155.7	51.0	N

200 rows × 7 columns

1. 汇总函数

【a】head, tail函数分别返回表的前n行和后n行，其中n默认为5：

df.head(2)

	School	Grade	Name	Gender	Height	Weight	Transfer
0	Shanghai Jiao Tong University	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Peking University	Freshman	Changqiang You	Male	166.5	70.0	N

df.tail(2)

	School	Grade	Name	Gender	Height	Weight	Transfer
198	Shanghai Jiao Tong University	Senior	Chengmei Shen	Male	175.3	71.0	N
199	Tsinghua University	Sophomore	Chunpeng Lv	Male	155.7	51.0	N

【b】info返回表的信息概况：

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB

【c】describe返回表中 “数值” 列对应的主要统计量，此处 “数值” 列仅有 “Height” 和 “Weight” ：

df.describe()

	Height	Weight
count	183.000000	189.000000
mean	163.218033	55.015873
std	8.608879	12.824294
min	145.400000	34.000000
25%	157.150000	46.000000
50%	161.900000	51.000000
75%	167.500000	65.000000
max	193.900000	89.000000

2. 特征统计函数

以 sum, mean, median, var, std, max, min 为例：

df_demo = df[['Height', 'Weight']]
df_demo

	Height	Weight
0	158.9	46.0
1	166.5	70.0
2	188.9	89.0
3	NaN	41.0
4	174.0	74.0
...	...	...
195	153.9	46.0
196	160.9	50.0
197	153.9	45.0
198	175.3	71.0
199	155.7	51.0

200 rows × 2 columns

df_demo.mean()  # 平均值

Height    163.218033
Weight     55.015873
dtype: float64

df_demo.max()  # 最大值

Height    193.9
Weight     89.0
dtype: float64

df_demo.quantile(0.75)  # 75%分位数

Height    167.5
Weight     65.0
Name: 0.75, dtype: float64

df_demo.count()  # 非缺失值个数

Height    183
Weight    189
dtype: int64

df_demo.idxmax() # 最大值对应的索引值

Height    193
Weight      2
dtype: int64

上述函数都有一个聚合参数axis，axis = 0代表逐列聚合，axis = 1代表逐行聚合，axis默认为0：

df_demo.mean(axis=1).head(2) # 表示计算前两行的平均值

0    102.45
1    118.25
dtype: float64

3. 唯一值函数

【a】 unique：找出单个属性唯一值，nunique：统计唯一值个数。

df['School']  # 所有人的 School 属性

0      Shanghai Jiao Tong University
1                  Peking University
2      Shanghai Jiao Tong University
3                   Fudan University
4                   Fudan University
                   ...              
195                 Fudan University
196              Tsinghua University
197    Shanghai Jiao Tong University
198    Shanghai Jiao Tong University
199              Tsinghua University
Name: School, Length: 200, dtype: object

df['School'].unique()  # 找出 School 属性的唯一值

array(['Shanghai Jiao Tong University', 'Peking University',
       'Fudan University', 'Tsinghua University'], dtype=object)

df['School'].nunique()  # 统计 School 属性唯一值的个数

【b】value_counts：得到唯一值和其对应出现的频数。

df['School'].value_counts()

Tsinghua University              69
Shanghai Jiao Tong University    57
Fudan University                 40
Peking University                34
Name: School, dtype: int64

【c】 drop_duplicates找出单个属性唯一值或者多个属性组合的唯一值。

关键参数是keep，first表示每个组合保留第一次出现的所在行，last表示保留最后一次出现的所在行，False表示把所有重复组合所在的行剔除。

对于单个属性唯一值：

df['School'].drop_duplicates(keep='first')

0    Shanghai Jiao Tong University
1                Peking University
3                 Fudan University
5              Tsinghua University
Name: School, dtype: object

对于多个属性组合的唯一值：

df_demo = df[['Gender','Transfer','Name']]
df_demo

	Gender	Transfer	Name
0	Female	N	Gaopeng Yang
1	Male	N	Changqiang You
2	Male	N	Mei Sun
3	Female	N	Xiaojuan Sun
4	Male	N	Gaojuan You
...	...	...	...
195	Female	N	Xiaojuan Sun
196	Female	N	Li Zhao
197	Female	N	Chengqiang Chu
198	Male	N	Chengmei Shen
199	Male	N	Chunpeng Lv

200 rows × 3 columns

df_demo.drop_duplicates(['Gender', 'Transfer'], keep='first')

	Gender	Transfer	Name
0	Female	N	Gaopeng Yang
1	Male	N	Changqiang You
12	Female	NaN	Peng You
21	Male	NaN	Xiaopeng Shen
36	Male	Y	Xiaojuan Qin
43	Female	Y	Gaoli Feng

df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')

	Gender	Transfer	Name
147	Male	NaN	Juan You
150	Male	Y	Chengpeng You
169	Female	Y	Chengquan Qin
194	Female	NaN	Yanmei Qian
197	Female	N	Chengqiang Chu
199	Male	N	Chunpeng Lv

df_demo.drop_duplicates(['Gender', 'Transfer'], keep=False)

	Gender	Transfer	Name

4. 替换函数

替换操作通常针对单个属性进行，因此下面的例子都以Series举例。

【a】映射替换：replace

df['Grade'].head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: object

df['Grade'].replace({'Freshman':1, 'Sophomore':2, 'Junior':3, 'Senior':4}).head()  # 字典传参

0    1
1    1
2    4
3    2
4    2
Name: Grade, dtype: int64

df['Grade'].replace(['Freshman','Sophomore','Junior','Senior'], [1,2,3,4]).head()  # 列表传参

0    1
1    1
2    4
3    2
4    2
Name: Grade, dtype: int64

特殊参数：method

s = pd.Series(['dawang','man','yiyi','woman','woman','dawang'])  # 下面展示用 'dawang'和'yiyi'替换'man'和'woman'

s.replace(['man','woman'], method='ffill')  # `method` = `ffill`，则为用前面一个最近的未被替换的值进行替换

0    dawang
1    dawang
2      yiyi
3      yiyi
4      yiyi
5    dawang
dtype: object

s.replace(['man','woman'], method='bfill')  # `method` = `bfill`，则为使用后面最近的未被替换的值进行替换

0    dawang
1      yiyi
2      yiyi
3    dawang
4    dawang
5    dawang
dtype: object

【b】逻辑替换：where、mask

s = pd.Series([-1,1,-1,1,1,-1])

where：在传入条件为False的对应行进行替换

s.where(s<0)  # 不指定替换值，替换为缺失值NaN

0   -1.0
1    NaN
2   -1.0
3    NaN
4    NaN
5   -1.0
dtype: float64

s.where(s<0,'此处大于0')  # 指定替换值

0       -1
1    此处大于0
2       -1
3    此处大于0
4    此处大于0
5       -1
dtype: object

mask：在传入条件为True的对应行进行替换

s.mask(s<0)

0    NaN
1    1.0
2    NaN
3    1.0
4    1.0
5    NaN
dtype: float64

s.mask(s<0, '此处小于0')

0    此处小于0
1        1
2    此处小于0
3        1
4        1
5    此处小于0
dtype: object

【c】数值替换：round、abs、clip

s = 0.666 * pd.Series(range(-2,3))
s

0   -1.332
1   -0.666
2    0.000
3    0.666
4    1.332
dtype: float64

s.clip(-0.5, 1) # 小于-0.5的值替换为0.5，大于1的值替换为1，在-0.5→1之间的值不变

0   -0.500
1   -0.500
2    0.000
3    0.666
4    1.000
dtype: float64

s.round(2)  # 保留两位小数

0   -1.33
1   -0.67
2    0.00
3    0.67
4    1.33
dtype: float64

s.abs()  # 取绝对值

0    1.332
1    0.666
2    0.000
3    0.666
4    1.332
dtype: float64

5. 排序函数

排序函数常包括值排序和索引排序，下面先用set_index方法把年级和姓名两列作为索引：

df_demo = df[['Grade', 'Name', 'Height', 'Weight']].set_index(['Grade','Name'])
df_demo.head()

		Height	Weight
Grade	Name
Freshman	Gaopeng Yang	158.9	46.0
Freshman	Changqiang You	166.5	70.0
Senior	Mei Sun	188.9	89.0
Sophomore	Xiaojuan Sun	NaN	41.0
Sophomore	Gaojuan You	174.0	74.0

【a】值排序：sort_values

df_demo.sort_values('Height').head()  # 对身高进行升序排序

		Height	Weight
Grade	Name
Junior	Xiaoli Chu	145.4	34.0
Senior	Gaomei Lv	147.3	34.0
Sophomore	Peng Han	147.8	34.0
Senior	Changli Lv	148.7	41.0
Sophomore	Changjuan You	150.5	40.0

df_demo.sort_values('Height', ascending=False).head()  # 对身高进行降序排序

		Height	Weight
Grade	Name
Senior	Xiaoqiang Qin	193.9	79.0
	Mei Sun	188.9	89.0
	Gaoli Zhao	186.5	83.0
Freshman	Qiang Han	185.3	87.0
Senior	Qiang Zheng	183.9	87.0

多列排序：体重不同时，按体重降序排序；体重相同时，按身高升序排列。

df_demo.sort_values(['Weight','Height'],ascending=[False,True]).head()

		Height	Weight
Grade	Name
Senior	Mei Sun	188.9	89.0
Senior	Qiang Zheng	183.9	87.0
Freshman	Qiang Han	185.3	87.0
	Chunli Zhao	180.2	83.0
	Changpeng Zhao	181.3	83.0

【b】索引排序：sort_index

索引排序，元素的值在索引中，用参数level指定索引层的名字或者层号。PS：字符串的排列顺序由英文字母顺序决定。

df_demo.sort_index(level=['Grade','Name'],ascending=[False,True]).head()  # 'Grade'降序排序（z→a），'Name'升序排序（a→z）

		Height	Weight
Grade	Name
Sophomore	Changjuan You	150.5	40.0
	Changmei Xu	151.6	43.0
	Changqiang Qian	167.6	64.0
	Chengli You	164.1	57.0
	Chengqiang Lv	166.8	53.0

6. apply方法

使用 apply 方法，可以同时应用多种运算：

df_demo.apply(lambda x:(x-x.mean()).std())  # 先取均值，然后与原来值相减，最后求标准差

Height     8.608879
Weight    12.824294
dtype: float64

若指定axis=1，那么每次传入函数的就是行元素组成的Series，其结果与之前的逐行均值结果一致。

df_demo.apply(lambda x:(x-x.mean()).std(), axis=1).head()

0    79.832356
1    68.235804
2    70.639967
3          NaN
4    70.710678
dtype: float64

四、窗口对象

pandas中有3类窗口，用得较多的是滑动窗口rolling和扩张窗口expanding。

1. 滑窗对象

.rolling：得到滑窗对象，参数 window 为窗口大小。

s = pd.Series([1,2,3,4,5])
roller = s.rolling(window = 2)
roller

Rolling [window=2,center=False,axis=0]

在得到了滑窗对象后，能够使用相应的聚合函数进行计算。每次选取元素的个数即为window的值。下面以求均值的步骤举例：

第一个位置，窗口还未完全进入，故为缺失值

第二个位置，窗口包含第一、二个元素，求均值为（1+2）/2=1.5

第三个位置，窗口包含第二、三个元素，求均值为（2+3）/2=2.5

以此类推。

roller.mean()

0    NaN
1    1.5
2    2.5
3    3.5
4    4.5
dtype: float64

roller.apply(lambda x:x.mean())  # 等价于求均值

0    NaN
1    1.5
2    2.5
3    3.5
4    4.5
dtype: float64

roller.sum()  # 与求均值的流程一致

0    NaN
1    3.0
2    5.0
3    7.0
4    9.0
dtype: float64

2. 扩张窗口

扩张窗口又称累计窗口，其窗口的大小就是从序列开始处到具体操作的对应位置，其使用的聚合函数会作用于这些逐步扩张的窗口上。下面仍以求均值的步骤举例：

第一个位置求均值，窗口大小为1均值为：1/1=1

第二个位置求均值，窗口大小为2，均值为：（1+3）/2=2

第三个位置求均值，窗口大小为3，均值为：（1+3+6）/3=3.3333

以此类推。

s = pd.Series([1, 3, 6, 10])
s.expanding().mean()

0    1.000000
1    2.000000
2    3.333333
3    5.000000
dtype: float64

s.ewm(alpha=0.2).mean().head()

0   -1.000000
1   -1.000000
2   -1.409836
3   -1.609756
4   -1.725845
dtype: float64

本文参考datawhale组最学习相关资料！！！

怡颜悦色

关注

0
点赞
踩
0

收藏

觉得还不错? 一键收藏
1
评论
第二章 pandas基础

第二章 pandas基础一、文件的读取和写入1. 文件读取【a】pandas读取csv, excel, txt文件：import numpy as npimport pandas as pddf_csv = pd.read_csv('../data/my_csv.csv')df_csv col1 col2 col3 col4 col5 0 2
复制链接

扫一扫