Chapter 2: pandas Basics

Latest recommended article published on 2024-01-07 10:25:41

Shudddd

Categories: Chapter 2, Study Notes. Tags: python

Article link: https://blog.csdn.net/Shudddd/article/details/111398150

1 File reading and writing
- 1.1 Reading files
- 1.2 Writing data
2 Basic data structures

File reading and writing

Reading files

This section covers how to read csv, Excel, and txt files.

import numpy as np
import pandas as pd

# Run `conda update pandas` to update pandas to the latest version
# Check the pandas version
pd.__version__

'1.1.5'

df_csv = pd.read_csv('my_csv.csv')

df_txt = pd.read_table('my_table.txt')

df_excel = pd.read_excel('my_excel.xlsx')
df_excel

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

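【header=None】 the first row is not used as column names
【index_col】 the column or columns to use as the index
【usecols】 the set of columns to read; all columns are read by default
【parse_dates】 the columns to convert to datetime
【nrows】 the number of data rows to read

# 【header=None】 the first row is not used as column names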
pd.read_table('my_table.txt',header=None)

	0	1	2	3
0	col1	col2	col3	col4
1	2	a	1.4	apple 2020/1/1
2	3	b	3.4	banana 2020/1/2
3	6	c	2.5	orange 2020/1/5
4	5	d	3.2	lemon 2020/1/7

# 【index_col】 the column or columns to use as the index
pd.read_csv('my_csv.csv',index_col=['col1','col2'])

		col3	col4	col5
col1	col2
2	a	1.4	apple	2020/1/1
3	b	3.4	banana	2020/1/2
6	c	2.5	orange	2020/1/5
5	d	3.2	lemon	2020/1/7

# 【usecols】 the set of columns to read; all columns are read by default
pd.read_csv('my_csv.csv',usecols=['col1','col2'])

	col1	col2
0	2	a
1	3	b
2	6	c
3	5	d

# 【parse_dates】 the columns to convert to datetime
pd.read_csv('my_csv.csv',parse_dates=['col5'])

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020-01-01
1	3	b	3.4	banana	2020-01-02
2	6	c	2.5	orange	2020-01-05
3	5	d	3.2	lemon	2020-01-07

# 【nrows】 the number of data rows to read
pd.read_csv('my_csv.csv',nrows=3)
# same result as pd.read_csv('my_csv.csv').head(3), but only three rows are parsed

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
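The parameters above can be combined in one call. The following is a minimal sketch, assuming an in-memory copy of the table shown earlier (csv_text is a hypothetical stand-in for my_csv.csv, and df_combo is an illustrative name), so it runs without the file.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for my_csv.csv (an assumption, not the real file).
csv_text = """col1,col2,col3,col4,col5
2,a,1.4,apple,2020/1/1
3,b,3.4,banana,2020/1/2
6,c,2.5,orange,2020/1/5
5,d,3.2,lemon,2020/1/7
"""

df_combo = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['col1', 'col3', 'col5'],  # read only these columns
    index_col='col1',                  # use col1 as the index
    parse_dates=['col5'],              # convert col5 to datetime
    nrows=3,                           # read only the first three data rows
)
print(df_combo)
print(df_combo.dtypes)
```

Note that a column named in index_col also has to appear in usecols, otherwise it is never read.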

Custom separators in txt files

pd.read_table('my_table_special_sep.txt')

	col1 |||| col2
0	TS |||| This is an apple.
1	GQ |||| My name is Bob.
2	WT |||| Well done!
3	PT |||| May I help you?

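# sep is interpreted as a regular expression, so each | must be escaped as \| (regular expressions are covered in Chapter 8)
# Set the separator to |||| and specify the python engine; the raw string keeps the backslashes intact
pd.read_table('my_table_special_sep.txt',sep=r'\|\|\|\|',engine='python')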

	col1	col2
0	TS	This is an apple.
1	GQ	My name is Bob.
2	WT	Well done!
3	PT	May I help you?
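Because the separator in this file is surrounded by spaces, a pattern that matches only the four pipes may leave those spaces attached to the column names and values. The sketch below assumes an in-memory excerpt of the file (sep_text, df_plain, and df_trim are illustrative names) and lets the pattern absorb the surrounding whitespace as well.

```python
import io
import pandas as pd

# Hypothetical excerpt of my_table_special_sep.txt, held in memory.
sep_text = "col1 |||| col2\nTS |||| This is an apple.\nGQ |||| My name is Bob.\n"

# Pattern from the text: matches only the four pipe characters.
df_plain = pd.read_table(io.StringIO(sep_text), sep=r'\|\|\|\|', engine='python')

# Variant: \s* also consumes any whitespace on either side of the pipes.
df_trim = pd.read_table(io.StringIO(sep_text), sep=r'\s*\|\|\|\|\s*', engine='python')

print([repr(c) for c in df_plain.columns])
print([repr(c) for c in df_trim.columns])
```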

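Writing data

When writing data, the most common step is to set index to False.
This drops the index from the saved file, which is useful when the index carries no particular meaning.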

df_csv.to_csv('my_csv_saved.csv', index=False)

pandas does not define a to_table function, but to_csv can save to a txt file and accepts a custom separator; the tab character \t is a common choice.

df_txt.to_csv('my_txt_saved.txt', sep='\t', index=False)
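A quick way to check the tab-separated output is a round trip through memory. This is a minimal sketch, assuming a small made-up table (df_small) and an io.StringIO buffer in place of a file on disk.

```python
import io
import pandas as pd

# Small made-up table used only for this round trip.
df_small = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b']})

# Write tab-separated text without the index into an in-memory buffer.
buf = io.StringIO()
df_small.to_csv(buf, sep='\t', index=False)
saved_text = buf.getvalue()

# read_table uses the tab character as its default separator.
df_back = pd.read_table(io.StringIO(saved_text))
print(saved_text)
print(df_back)
```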

To quickly convert a table to markdown or latex, use the to_markdown and to_latex functions; to_markdown requires the tabulate package.

# Run `conda install tabulate` to install the tabulate package
print(df_csv.to_markdown())

|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |

print(df_csv.to_latex())

\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

Basic data structures

Series

A Series generally has four parts: the values data, the index index, the storage type dtype, and the name of the series name.

s = pd.Series(data = [100, 'a', {'dic1':5}], 
              index = pd.Index(['id1', 20, 'third'], name='my_idx'),
              dtype = 'object', name = 'my_name')
s

my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object

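object denotes a mixed data type (integers, strings, dictionaries, and so on)
Attributes are accessed with dot notation
s.values
s.index
s.dtype
s.name
【s.shape】 returns the shape of the series as a tuple, here (3,); its single entry is the length
【s[index_item]】 retrieves the value for a single index label

# 【s[index_item]】 retrieves the value for a single index label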
s['third']

{'dic1': 5}
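The attributes listed above can be inspected side by side. The sketch below rebuilds the same series under the illustrative name s_demo; using .loc for the integer label 20 is a choice made here (not something the notes use) so that the lookup is unambiguously label-based.

```python
import pandas as pd

s_demo = pd.Series(data=[100, 'a', {'dic1': 5}],
                   index=pd.Index(['id1', 20, 'third'], name='my_idx'),
                   dtype='object', name='my_name')

print(s_demo.shape)        # shape is a tuple with one entry
print(len(s_demo))         # the length as a plain integer
print(s_demo.index.name)   # the name attached to the index
print(s_demo['third'])     # lookup by a string label
print(s_demo.loc[20])      # explicit label-based lookup for the integer label
```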

DataFrame

DataFrame adds a column index on top of Series

Data for a 3×3 two-dimensional table:
data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
data

[[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]

df = pd.DataFrame(data = data,
                  index = ['row_%d'%i for i in range(3)],
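                  # in 'row_%d'%i the first % starts the %d placeholder; the second % is the formatting operator that substitutes i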
                  columns=['col_0', 'col_1', 'col_2'])
df

	col_0	col_1	col_2
row_0	1	a	1.2
row_1	2	b	2.2
row_2	3	c	3.2
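The row labels above come from Python's %-style string formatting. As a small sketch (the variable names are illustrative), the same labels can be produced with an f-string, which some readers find easier to follow.

```python
# %-style formatting: %d is the integer placeholder, the trailing % applies the value.
labels_percent = ['row_%d' % i for i in range(3)]

# Equivalent f-string form.
labels_fstring = [f'row_{i}' for i in range(3)]

print(labels_percent)
print(labels_fstring)
```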

More often, a DataFrame is built from a mapping of column names to data, with the row index added separately

df = pd.DataFrame(data = {'col_0': [1,2,3], 'col_1':list('abc'),
                          'col_2': [1.2, 2.2, 3.2]},
                  index = ['row_%d'%i for i in range(3)])
df

	col_0	col_1	col_2
row_0	1	a	1.2
row_1	2	b	2.2
row_2	3	c	3.2

# Select a single column as a Series
df['col_0']

row_0    1
row_1    2
row_2    3
Name: col_0, dtype: int64

# Select multiple columns as a DataFrame
df[['col_0', 'col_1']]

	col_0	col_1
row_0	1	a
row_1	2	b
row_2	3	c

# Access the corresponding attributes
df.values
df.index
df.columns
df.dtypes
df.shape

(3, 3)
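The difference between the two selection forms is the type of the result: a single column name gives a Series, while a list of names gives a DataFrame even when the list has one entry. A minimal sketch, with df_small as an illustrative name:

```python
import pandas as pd

df_small = pd.DataFrame({'col_0': [1, 2, 3], 'col_1': list('abc')},
                        index=['row_%d' % i for i in range(3)])

one_col = df_small['col_0']          # single name: Series
one_col_frame = df_small[['col_0']]  # list with one name: DataFrame

print(type(one_col).__name__, one_col.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
```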

Common basic functions

df = pd.read_csv('learn_pandas.csv')
df.columns
# school, grade, name, gender, height, weight, transfer-student flag, test number, test date, 1000 m time

Index(['School', 'Grade', 'Name', 'Gender', 'Height', 'Weight', 'Transfer',
       'Test_Number', 'Test_Date', 'Time_Record'],
      dtype='object')

# This chapter only uses the first seven columns
df = df[df.columns[:7]]
df[:3]

	School	Grade	Name	Gender	Height	Weight	Transfer
0	Shanghai Jiao Tong University	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Peking University	Freshman	Changqiang You	Male	166.5	70.0	N
2	Shanghai Jiao Tong University	Senior	Mei Sun	Male	188.9	89.0	N

Summary functions

The head and tail functions return the first n and last n rows of a table or series, where n defaults to 5

df.head(2)

	School	Grade	Name	Gender	Height	Weight	Transfer
0	Shanghai Jiao Tong University	Freshman	Gaopeng Yang	Female	158.9	46.0	N
1	Peking University	Freshman	Changqiang You	Male	166.5	70.0	N

# tail returns the last n rows of a table or series
df.tail(3)

	School	Grade	Name	Gender	Height	Weight	Transfer
197	Shanghai Jiao Tong University	Senior	Chengqiang Chu	Female	153.9	45.0	N
198	Shanghai Jiao Tong University	Senior	Chengmei Shen	Male	175.3	71.0	N
199	Tsinghua University	Sophomore	Chunpeng Lv	Male	155.7	51.0	N

info and describe return an overview of the table and the main statistics of its numeric columns, respectively

# 【.info()】 is short for information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   School    200 non-null    object 
 1   Grade     200 non-null    object 
 2   Name      200 non-null    object 
 3   Gender    200 non-null    object 
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object 
dtypes: float64(2), object(5)
memory usage: 11.1+ KB

df.describe()

	Height	Weight
count	183.000000	189.000000
mean	163.218033	55.015873
std	8.608879	12.824294
min	145.400000	34.000000
25%	157.150000	46.000000
50%	161.900000	51.000000
75%	167.500000	65.000000
max	193.900000	89.000000

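info and describe show only a limited amount of information.
For a thorough and effective look at a dataset, especially one with many columns, the pandas-profiling package (since renamed ydata-profiling) is recommended.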

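Statistical functions

Series and DataFrame define many statistical functions; the most common are
sum
mean
median
var
std
max
min
【.quantile(0.75)】 the 0.75 quantile
【.count()】 the number of non-missing values
【.idxmax()】 the index of the maximum value
【.idxmin()】 the index of the minimum value
Because each of these functions returns a scalar for a series, they are also called aggregation functions.
They share a common parameter axis: the default 0 aggregates column by column, and 1 aggregates row by row.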

df_demo = df[['Height', 'Weight']]
df_demo.mean(axis=1).head()

0    102.45
1    118.25
2    138.95
3     41.00
4    124.00
dtype: float64
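The value 41.00 in the fourth row above is a reminder that these aggregations skip missing values by default. The sketch below uses a made-up two-row table (df_hw) to show both axis settings and the effect of a missing height.

```python
import numpy as np
import pandas as pd

# Made-up heights and weights; the second height is missing.
df_hw = pd.DataFrame({'Height': [158.9, np.nan], 'Weight': [46.0, 41.0]})

col_means = df_hw.mean()          # axis=0: one value per column
row_means = df_hw.mean(axis=1)    # axis=1: one value per row

print(col_means)
print(row_means)
```

In the second row only Weight is available, so the row mean is just that weight rather than NaN.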

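Unique-value functions

【unique】 the array of unique values
【nunique】 the number of unique values
These return, respectively, the unique values of a series and how many there are.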

df['School'].unique()

array(['Shanghai Jiao Tong University', 'Peking University',
       'Fudan University', 'Tsinghua University'], dtype=object)

df['School'].nunique()

【value_counts】 the unique values and their frequencies

df['School'].value_counts()

Tsinghua University              69
Shanghai Jiao Tong University    57
Fudan University                 40
Peking University                34
Name: School, dtype: int64

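【drop_duplicates】 finds the unique combinations across multiple columns
Parameter 【keep】: default first
first: keep the row where the combination first appears
last: keep the row where the combination last appears
False: drop every row whose combination is duplicated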

df_demo = df[['Gender','Transfer','Name']]
df_demo.drop_duplicates(['Gender', 'Transfer'])
# keep one row for each (Gender, Transfer) combination and drop the later repeats

	Gender	Transfer	Name
0	Female	N	Gaopeng Yang
1	Male	N	Changqiang You
12	Female	NaN	Peng You
21	Male	NaN	Xiaopeng Shen
36	Male	Y	Xiaojuan Qin
43	Female	Y	Gaoli Feng

df_demo.drop_duplicates(['Gender', 'Transfer'], keep='last')

	Gender	Transfer	Name
147	Male	NaN	Juan You
150	Male	Y	Chengpeng You
169	Female	Y	Chengquan Qin
194	Female	NaN	Yanmei Qian
197	Female	N	Chengqiang Chu
199	Male	N	Chunpeng Lv

df_demo.drop_duplicates(['Name', 'Gender'],
                        keep=False).head() 
# keep=False drops every row whose (Name, Gender) combination repeats, leaving only combinations that appear exactly once

	Gender	Transfer	Name
0	Female	N	Gaopeng Yang
1	Male	N	Changqiang You
2	Male	N	Mei Sun
4	Male	N	Gaojuan You
5	Female	N	Xiaoli Qian

df['School'].drop_duplicates() 
# also works on a Series

0    Shanghai Jiao Tong University
1                Peking University
3                 Fudan University
5              Tsinghua University
Name: School, dtype: object

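【duplicated】 returns a boolean Series marking which rows are duplicates
It takes the same keep parameter; duplicated elements are True and the rest are False.
【drop_duplicates】 is equivalent to dropping the rows where 【duplicated】 is True.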

df_demo.duplicated(['Gender', 'Transfer']).head()

0    False
1    False
2     True
3     True
4     True
dtype: bool

df['School'].duplicated().head() 
# also works on a Series

0    False
1    False
2     True
3    False
4     True
Name: School, dtype: bool
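The stated relationship between duplicated and drop_duplicates can be checked directly. A minimal sketch on a made-up four-row table (df_dup and the other names are illustrative):

```python
import pandas as pd

df_dup = pd.DataFrame({'Gender': ['Female', 'Male', 'Male', 'Female'],
                       'Transfer': ['N', 'N', 'N', 'Y'],
                       'Name': ['A', 'B', 'C', 'D']})
cols = ['Gender', 'Transfer']

# Dropping duplicates is the same as keeping rows where duplicated is False.
via_drop = df_dup.drop_duplicates(cols)
via_mask = df_dup[~df_dup.duplicated(cols)]

# keep=False flags every member of a repeated combination.
only_once = df_dup[~df_dup.duplicated(cols, keep=False)]

print(via_drop)
print(only_once)
```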

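Replacement functions

Replacement usually targets a single column, so the examples use Series
There are three kinds of replacement: mapping replacement, logical replacement, and numeric replacement
Mapping replacement: 【replace】【str.replace】【cat.codes】
【replace】 accepts either a dictionary or two lists

# dictionary replacement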
df['Gender'].replace({'Female':0, 'Male':1}).head()

0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

# two-list replacement
df['Gender'].replace(['Female', 'Male'], [0, 1]).head()

0    0
1    1
2    1
3    0
4    1
Name: Gender, dtype: int64

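【replace】 also supports a special directional replacement
【method = 'ffill'】 replaces with the nearest preceding value that was not itself replaced
【method = 'bfill'】 replaces with the nearest following value that was not itself replaced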

s = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])
s

0    a
1    1
2    b
3    2
4    1
5    1
6    a
dtype: object

s.replace([1, 2], method='ffill')

0    a
1    a
2    b
3    b
4    b
5    b
6    a
dtype: object

s.replace([1, 2], method='bfill')

0    a
1    b
2    b
3    a
4    a
5    a
6    a
dtype: object
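The directional replacement above runs on the pandas 1.1.5 used in these notes; to my knowledge the method argument of replace is deprecated in newer releases. The following sketch shows one alternative that does not rely on it (not necessarily the officially recommended one): blank out the targets with where and isin, then fill with ffill or bfill. The names s_mix, blanked, forward, and backward are illustrative.

```python
import pandas as pd

s_mix = pd.Series(['a', 1, 'b', 2, 1, 1, 'a'])

# Turn every 1 and 2 into a missing value, leaving the other entries untouched.
blanked = s_mix.where(~s_mix.isin([1, 2]))

forward = blanked.ffill()    # fill from the nearest earlier surviving value
backward = blanked.bfill()   # fill from the nearest later surviving value

print(forward.tolist())
print(backward.tolist())
```

This treats existing missing values the same way as the replaced ones, so it only matches the directional replace when the original series has no NaN.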

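【str.replace】 performs regular-expression replacement on string data (Chapter 8)
Logical replacement: 【where】【mask】
【where】 replaces the rows where the condition is False
【mask】 replaces the rows where the condition is True
If no replacement value is given, missing values are used.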

s = pd.Series([-1, 1.2345, 100, -50])
s.where(s<0)

0    -1.0
1     NaN
2     NaN
3   -50.0
dtype: float64

s.where(s<0, 100)

0     -1.0
1    100.0
2    100.0
3    -50.0
dtype: float64

s.mask(s<0)

0         NaN
1      1.2345
2    100.0000
3         NaN
dtype: float64

s.mask(s<0, -50)

0    -50.0000
1      1.2345
2    100.0000
3    -50.0000
dtype: float64
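Since where and mask differ only in which side of the condition they replace, one can be written in terms of the other by negating the condition. A minimal sketch (s_num, cond, and the result names are illustrative):

```python
import pandas as pd

s_num = pd.Series([-1, 1.2345, 100, -50])
cond = s_num < 0

# where keeps values where cond is True; mask keeps values where cond is False.
kept_negative = s_num.where(cond, 100)
same_with_mask = s_num.mask(~cond, 100)

print(kept_negative.tolist())
print(same_with_mask.tolist())
```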

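Numeric replacement: 【round】【abs】【clip】
【round】 rounds to a given number of decimal places
【abs】 takes the absolute value
【clip】 truncates to a range

s = pd.Series([-1, 1.2345, 100, -50])
s.round(2)
# keep 2 decimal places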

0     -1.00
1      1.23
2    100.00
3    -50.00
dtype: float64

s.clip(0, 2)
# 0 and 2 are the lower and upper clipping bounds; values outside them are set to the nearest bound

0    0.0000
1    1.2345
2    2.0000
3    0.0000
dtype: float64

# 【clip】 can only set out-of-range values to the bounds themselves
# To replace out-of-range values with custom values, chain a replacement function
s.clip(1, 2).replace([1, 2], [0, 100])

0      0.0000
1      1.2345
2    100.0000
3      0.0000
dtype: float64
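One caveat with chaining clip and replace is that values already equal to a bound are rewritten too. The sketch below, on a made-up series that contains the bounds 1 and 2 themselves (s_edge), compares that approach with two mask calls that test the original values directly; treat it as one possible alternative rather than the intended answer.

```python
import pandas as pd

# Made-up series that includes the bounds 1 and 2 as legitimate values.
s_edge = pd.Series([-1, 1, 1.2345, 2, 100])

# clip then replace: the in-range 1 and 2 are also changed.
via_clip = s_edge.clip(1, 2).replace([1, 2], [0, 100])

# mask on the original conditions: only values strictly outside [1, 2] change.
via_mask = s_edge.mask(s_edge < 1, 0).mask(s_edge > 2, 100)

print(via_clip.tolist())
print(via_mask.tolist())
```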

Sorting functions

【sort_values】 sorts by value
【sort_index】 sorts by index
1. 【sort_values】 sorting by value

# use set_index to make the Grade and Name columns the index
df_demo = df[['Grade', 'Name', 'Height','Weight']].set_index(['Grade','Name'])
df_demo

		Height	Weight
Grade	Name
Freshman	Gaopeng Yang	158.9	46.0
Freshman	Changqiang You	166.5	70.0
Senior	Mei Sun	188.9	89.0
Sophomore	Xiaojuan Sun	NaN	41.0
Sophomore	Gaojuan You	174.0	74.0
...	...	...	...
Junior	Xiaojuan Sun	153.9	46.0
Senior	Li Zhao	160.9	50.0
	Chengqiang Chu	153.9	45.0
	Chengmei Shen	175.3	71.0
Sophomore	Chunpeng Lv	155.7	51.0

200 rows × 2 columns

# sort by height; the default ascending=True sorts in ascending order
df_demo.sort_values('Height').head()

		Height	Weight
Grade	Name
Junior	Xiaoli Chu	145.4	34.0
Senior	Gaomei Lv	147.3	34.0
Sophomore	Peng Han	147.8	34.0
Senior	Changli Lv	148.7	41.0
Sophomore	Changjuan You	150.5	40.0

# ascending=False sorts in descending order
df_demo.sort_values('Height', ascending=False).head()

		Height	Weight
Grade	Name
Senior	Xiaoqiang Qin	193.9	79.0
	Mei Sun	188.9	89.0
	Gaoli Zhao	186.5	83.0
Freshman	Qiang Han	185.3	87.0
Senior	Qiang Zheng	183.9	87.0

# multi-key sort: ascending by weight overall, and descending by height among rows with equal weight
df_demo.sort_values(['Weight','Height'],ascending=[True,False]).head()

		Height	Weight
Grade	Name
Sophomore	Peng Han	147.8	34.0
Senior	Gaomei Lv	147.3	34.0
Junior	Xiaoli Chu	145.4	34.0
Sophomore	Qiang Zhou	150.5	36.0
Freshman	Yanqiang Xu	152.4	38.0
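The multi-key ordering is easier to see on a few rows. This sketch uses a made-up table with placeholder labels A to D (df_sort is an illustrative name); three rows share a weight, so the height key decides their order.

```python
import pandas as pd

df_sort = pd.DataFrame({'Height': [147.3, 147.8, 145.4, 150.5],
                        'Weight': [34.0, 34.0, 34.0, 36.0]},
                       index=['A', 'B', 'C', 'D'])

# Weight ascending first; ties on Weight are broken by Height descending.
sorted_df = df_sort.sort_values(['Weight', 'Height'], ascending=[True, False])
print(sorted_df)
```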

2. 【sort_index】 sorting by index
When the sort keys are in the index, specify the index level names or numbers with the level parameter.
Strings are sorted alphabetically.

df_demo.sort_index(level=['Grade','Name'],ascending=[True,False])

		Height	Weight
Grade	Name
Freshman	Yanquan Wang	163.5	55.0
	Yanqiang Xu	152.4	38.0
	Yanqiang Feng	162.3	51.0
	Yanpeng Lv	NaN	65.0
	Yanli Zhang	165.1	52.0
...	...	...	...
Sophomore	Chengqiang Lv	166.8	53.0
	Chengli You	164.1	57.0
	Changqiang Qian	167.6	64.0
	Changmei Xu	151.6	43.0
	Changjuan You	150.5	40.0

200 rows × 2 columns

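The apply method

【apply】 takes a custom function
It iterates over the rows or the columns of a DataFrame
Its argument is usually a function that takes a Series as input
Consider apply only when there is a genuine need for custom logic; the built-in methods are usually faster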

df_demo = df[['Height', 'Weight']]
def my_mean(x):
    res = x.mean()
    return res
df_demo.apply(my_mean)

Height    163.218033
Weight     55.015873
dtype: float64

A 【lambda】 expression makes this more concise; x stands for each Series of df_demo passed in one at a time:

df_demo.apply(lambda x:x.mean())

Height    163.218033
Weight     55.015873
dtype: float64

With 【axis=1】, each call receives a Series made up of one row's elements
The result matches the row-wise mean computed earlier

df_demo.apply(lambda x:x.mean(), axis=1).head()

0    102.45
1    118.25
2    138.95
3     41.00
4    124.00
dtype: float64

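The 【mad】 function returns the mean absolute deviation of a series from its mean
For example, the series 1, 3, 7, 10 has mean 5.25
The absolute deviations are 4.25, 2.25, 1.75, 4.75, and their mean is 3.25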

df_demo.apply(lambda x:(x-x.mean()).abs().mean())

Height     6.707229
Weight    10.391870
dtype: float64

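# matches the built-in mad function (mad was deprecated in pandas 1.5 and removed in 2.0, so this call needs an older version such as the 1.1.5 used here)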
df_demo.mad()

Height     6.707229
Weight    10.391870
dtype: float64
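The worked example for the series 1, 3, 7, 10 can be reproduced step by step without the built-in mad. A minimal sketch (s_mad, deviations, and mad_value are illustrative names):

```python
import pandas as pd

s_mad = pd.Series([1, 3, 7, 10])

series_mean = s_mad.mean()                  # mean of the series
deviations = (s_mad - series_mean).abs()    # absolute deviation of each element
mad_value = deviations.mean()               # mean absolute deviation

print(series_mean)
print(deviations.tolist())
print(mad_value)
```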

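Window objects

【rolling】 rolling window
【expanding】 expanding window
【ewm】 exponentially weighted window

Rolling window objects

To use rolling functions, first call .rolling on a series to obtain a rolling object; its most important parameter is the window size window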

s = pd.Series([1,2,3,4,5])
roller = s.rolling(window = 3)
roller

Rolling [window=3,center=False,axis=0]

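Once the rolling object is available, the usual aggregation functions can be applied to it; note that each window includes the element in the current row.
For example, the mean at the fourth position is (2+3+4)/3, not (1+2+3)/3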

roller.mean()

0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
dtype: float64

roller.sum()

0     NaN
1     NaN
2     6.0
3     9.0
4    12.0
dtype: float64
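To make the window boundaries concrete, the rolling mean can be rebuilt by slicing. The sketch below is an illustration only (manual_means and the loop are not how pandas computes it): each window ends at the current position and reaches back window - 1 elements, and positions without a full window get NaN.

```python
import pandas as pd

s_roll = pd.Series([1, 2, 3, 4, 5])
window = 3

manual = []
for i in range(len(s_roll)):
    if i + 1 < window:
        manual.append(float('nan'))  # not enough elements yet
    else:
        # slice covers positions i-window+1 .. i, so the current row is included
        manual.append(s_roll.iloc[i + 1 - window:i + 1].mean())

manual_means = pd.Series(manual)
builtin_means = s_roll.rolling(window).mean()
print(manual_means.tolist())
print(builtin_means.tolist())
```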

Rolling covariance
Rolling correlation coefficient

s2 = pd.Series([1,2,6,16,30])
roller.cov(s2)

0     NaN
1     NaN
2     2.5
3     7.0
4    12.0
dtype: float64

roller.corr(s2)

0         NaN
1         NaN
2    0.944911
3    0.970725
4    0.995402
dtype: float64

apply accepts a custom function
The value passed in is the Series for the corresponding window

roller.apply(lambda x:x.mean())

0    NaN
1    NaN
2    2.0
3    3.0
4    4.0
dtype: float64

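shift, diff, and pct_change are a group of rolling-like functions
They share the parameter periods=n, which defaults to 1

【shift】 takes the value n positions earlier
【diff】 subtracts the value n positions earlier (unlike in NumPy, where n means the n-th order difference)
【pct_change】 computes the growth rate relative to the value n positions earlier

n can be negative, which performs the same operation in the opposite direction

s = pd.Series([1,3,6,10,15])
s.shift(2)
# .shift(2) takes the value 2 positions earlier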

0    NaN
1    NaN
2    1.0
3    3.0
4    6.0
dtype: float64

s = pd.Series([1,3,6,10,15])
s.diff(3)
# .diff(3) subtracts the value 3 positions earlier

0     NaN
1     NaN
2     NaN
3     9.0
4    12.0
dtype: float64

s = pd.Series([1,3,6,10,15])
s.pct_change()
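# growth rate relative to the value 1 position earlier
# the result is a fraction, so 2.0 means an increase of 200% over the previous value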

0         NaN
1    2.000000
2    1.000000
3    0.666667
4    0.500000
dtype: float64

s = pd.Series([1,3,6,10,15])
s.shift(-1)
# n can be negative, which performs the same operation in the opposite direction
# .shift(-1) takes the value 1 position later

0     3.0
1     6.0
2    10.0
3    15.0
4     NaN
dtype: float64

They count as rolling-like functions because
each can be reproduced by a rolling call with window size n+1

# rolling equivalent of s.shift(2)
s.rolling(3).apply(lambda x:list(x)[0])

0    NaN
1    NaN
2    1.0
3    3.0
4    6.0
dtype: float64

# rolling equivalent of s.diff(3)
s.rolling(4).apply(lambda x:list(x)[-1]-list(x)[0])
# list(x)[-1] is the last element of the window
# list(x)[0] is the first element of the window

0     NaN
1     NaN
2     NaN
3     9.0
4    12.0
dtype: float64

# rolling equivalent of s.pct_change()
def my_pct(x):
    L = list(x)
    return L[-1]/L[0]-1
s.rolling(2).apply(my_pct) 
# L[-1]/L[0]-1==(L[-1]-L[0])/L[0]

0         NaN
1    2.000000
2    1.000000
3    0.666667
4    0.500000
dtype: float64
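Besides the rolling equivalents, diff and pct_change can also be written in terms of shift, which also makes the contrast with NumPy's n-th order difference visible. A minimal sketch (s_inc and the result names are illustrative):

```python
import numpy as np
import pandas as pd

s_inc = pd.Series([1, 3, 6, 10, 15])

# diff(n) is the series minus itself shifted by n positions.
diff_manual = s_inc - s_inc.shift(2)

# pct_change(n) is the ratio to the value n positions earlier, minus 1.
pct_manual = s_inc / s_inc.shift(2) - 1

# In NumPy, n=2 means differencing twice, which is a different operation.
second_order = np.diff(s_inc.to_numpy(), n=2)

print(diff_manual.tolist())
print(pct_manual.tolist())
print(second_order.tolist())
```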

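Expanding windows

An expanding window, also called a cumulative window, is a window of dynamic length
Its size runs from the start of the series to the position being processed
The aggregation function is applied to these progressively growing windows.
For a series a1, a2, a3, a4,
the windows at each position are [a1], [a1, a2], [a1, a2, a3], [a1, a2, a3, a4].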

s = pd.Series([1, 3, 6, 10])
s.expanding().mean()

0    1.000000
1    2.000000
2    3.333333
3    5.000000
dtype: float64
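Because each expanding window starts at the beginning of the series, the expanding mean is the running sum divided by the number of elements seen so far. The sketch below checks this on the same values (s_exp, counts, and manual_exp are illustrative names).

```python
import pandas as pd

s_exp = pd.Series([1, 3, 6, 10])

# Number of elements in each window: 1, 2, 3, 4.
counts = pd.Series(range(1, len(s_exp) + 1))

manual_exp = s_exp.cumsum() / counts       # running sum over running count
builtin_exp = s_exp.expanding().mean()

print(manual_exp.tolist())
print(builtin_exp.tolist())
```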
