Module study: data analysis with pandas

Latest recommended article published on 2024-01-19 16:01:54

所念非欢

Tags: python, data analysis, pandas

Article link: https://blog.csdn.net/m0_61357071/article/details/121527671


Python environment: 3.10.0, with the third-party library pandas installed

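I. pandas overview

pandas is mainly used to read tabular (two-dimensional) data and then analyze it:

Data type	Description	pandas reader
csv, tsv, txt	Plain-text files delimited by commas, tabs, or another separator	pandas.read_csv()
excel	Microsoft xls or xlsx files	pandas.read_excel()
mysql	Relational database tables	pandas.read_sql()

Taking pandas.read_csv as an example, the commonly used parameters are:

read_csv(file, index_col = None, encoding = 'utf-8')

1. file: the path to the data file;
2. index_col: to use one of the columns in the data as the row index, set this to that column's label;
3. encoding: the encoding of the data file.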
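As a minimal sketch of the three read_csv parameters (file path, index_col, encoding), the snippet below reads a small in-memory CSV instead of a file on disk. The sample rows and the use of io.BytesIO are assumptions made only so the example runs on its own.

```python
import io
import pandas as pd

# Hypothetical sample data, encoded as UTF-8 bytes to stand in for a file on disk.
raw = "code,name,lasttrade\nMMM,3M,160.09\nAXP,American Express,134.46\n".encode("utf-8")

# index_col="code" uses the 'code' column as the row index;
# encoding tells pandas how to decode the bytes.
df = pd.read_csv(io.BytesIO(raw), index_col="code", encoding="utf-8")

print(df.index.tolist())            # row labels taken from the 'code' column
print(df.loc["MMM", "lasttrade"])   # look up one value by row and column label
```

With a real file, the first argument would simply be the file path.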

II. Reading data from files (methods)

1. Reading a csv file

1.1 Load the file with fobj = read_csv(file path); the result is a DataFrame.

from pandas import *
fobj = read_csv('D://桌面//stock30.csv')

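1.2 Use fobj.head() to view the first five rows. Note that the leftmost column of numbers (0, 1, 2, ...) is the row index that pandas adds automatically; it is not part of the original file. The Unnamed: 0 column, by contrast, comes from a first column in the file that has no header name.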

>>> fobj.head()

   Unnamed: 0  code              name  lasttrade
0           1   MMM                3M     160.09
1           2   AXP  American Express     134.46
2           3  AAPL             Apple     326.12
3           4    BA            Boeing     345.02
4           5   CAT       Caterpillar     140.47
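The Unnamed: 0 column in the output above typically appears when the first header field of the CSV is empty. The sketch below reproduces that situation with hypothetical in-memory data (the real stock30.csv is not available here) and shows how index_col=0 turns that column into the row index instead.

```python
import io
import pandas as pd

# Hypothetical CSV whose first header field is empty.
text = ",code,name\n1,MMM,3M\n2,AXP,American Express\n"

default = pd.read_csv(io.StringIO(text))               # pandas names the column 'Unnamed: 0'
indexed = pd.read_csv(io.StringIO(text), index_col=0)  # use that column as the row index

print(list(default.columns))   # ['Unnamed: 0', 'code', 'name']
print(list(indexed.columns))   # ['code', 'name']
```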

1.3 To see the shape of the data (number of rows and columns), use the fobj.shape attribute, which returns a tuple (rows, columns).

>>> fobj.shape
(30, 4)

1.4 View the column names with fobj.columns

>>> fobj.columns
Index(['Unnamed: 0', 'code', 'name', 'lasttrade'], dtype='object')

1.5 View the row index with fobj.index

>>> fobj.index
RangeIndex(start=0, stop=30, step=1)

1.6 View the data type of each column with fobj.dtypes

>>> fobj.dtypes
Unnamed: 0      int64
code           object
name           object
lasttrade     float64
dtype: object

2. Reading a txt file (custom data format)

2.1 Read the file with fobj = read_csv()

>>> fobj = read_csv(
...     'D://桌面//01.txt',                    # file path
...     sep = '/',                             # field separator
...     header = None,                         # the file has no header row
...     names = ['date', 'member', 'number'])  # custom column names
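To see what sep, header and names produce, here is a self-contained sketch that uses hypothetical in-memory text in place of 01.txt; the sample rows are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical contents of a '/'-separated text file without a header row.
text = "2021-11-01/alice/3\n2021-11-02/bob/5\n"

df = pd.read_csv(
    io.StringIO(text),
    sep="/",                              # field separator
    header=None,                          # the data has no header row
    names=["date", "member", "number"],   # column names to assign
)
print(df)
```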

3. Reading an excel file

3.1 Similar to reading a csv file

4. Reading from MySQL

To be added later
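As a rough sketch of what a read_sql call looks like, the example below uses an in-memory SQLite database from the standard library as a stand-in for MySQL, so that it runs without a database server. The table name and rows are hypothetical. For MySQL the connection object would be different (typically a SQLAlchemy connection), which is not shown here.

```python
import sqlite3
import pandas as pd

# SQLite stands in for MySQL here so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock (code TEXT, name TEXT, lasttrade REAL)")
conn.executemany(
    "INSERT INTO stock VALUES (?, ?, ?)",
    [("MMM", "3M", 160.09), ("AXP", "American Express", 134.46)],
)
conn.commit()

# read_sql runs the query and returns the result set as a DataFrame.
df = pd.read_sql("SELECT * FROM stock", conn)
conn.close()
print(df)
```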

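III. pandas data structures (DataFrame & Series)

Accessing data elements works much like a dictionary: use a key (the index label) to get a value.

1. Series

A one-dimensional labeled array.

Here are several ways to create one:

1.1 A list alone is enough to create the simplest Series; the index defaults to 0, 1, 2, ... up to the length minus one.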

>>> s1 = Series([1, '2', 'a', '我是数据'])
>>> s1
0       1
1       2
2       a
3    我是数据
dtype: object

1.2 The index labels can be customized.

>>> s1 = Series([1, '2', 'a', '我是数据'], index = [1, '52', 'a', '我是索引值'])
>>> s1
1           1
52          2
a           a
我是索引值    我是数据
dtype: object

1.3 Create a Series from a dictionary

>>> s1 = Series({1:25, '48':56, '我是索引':'我是数据'})
>>> s1
1         25
48        56
我是索引    我是数据
dtype: object

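Also:

1. To get only the index, use the .index attribute; to get only the data, use the .values attribute. The former returns an Index object and the latter a NumPy array.

2. To access several values at once, pass a list of index labels: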

>>> s1[[1, '48']]
1     25
48    56
dtype: object

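Note that the labels for batch access are given as a list.

3. Accessing a single value returns that value in its own type; accessing several values at once still returns a Series.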
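The three notes above can be checked with a short sketch. It uses import pandas as pd rather than the star import used elsewhere in this post, and the Series contents are illustrative.

```python
import pandas as pd

s = pd.Series({1: 25, "48": 56, "key": "value"})

print(type(s.index).__name__)    # an Index object
print(type(s.values).__name__)   # a NumPy ndarray

single = s["48"]        # one label -> the stored value itself
batch = s[[1, "48"]]    # a list of labels -> a Series

print(single)                    # 56
print(type(batch).__name__)      # Series
```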

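2. DataFrame

A two-dimensional table of data.

It has both a row index (index) and a column index (columns), and can be viewed as a dictionary of Series.

It can be created from a dictionary of lists: the keys become the column labels, and the row index defaults to 0, 1, 2, ...

Tips:

1. obj.dtypes returns the data type of each column (object, int64, float64, ...).

2. obj.index returns the row index; obj.columns returns the column index.

3. Column labels usually need to be defined explicitly, and obj.columns returns them as an Index object. The row index usually defaults to the natural numbers, in which case obj.index is a RangeIndex(start=, stop=, step=); a custom row index is returned as an Index object.

4. To access data by row label, use obj.loc[label]. This returns a Series whose index is the original column labels.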
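A minimal sketch of the tips above, using a small hypothetical DataFrame built from a dictionary of lists:

```python
import pandas as pd

df = pd.DataFrame({
    "one": [1, 2, 3],
    "two": [1.5, 2.5, 3.5],
    "three": ["a", "b", "c"],
})

print(df.dtypes)     # one data type per column
print(df.index)      # RangeIndex(start=0, stop=3, step=1)
print(df.columns)    # an Index holding the column labels

row = df.loc[1]              # a single row comes back as a Series
print(row.index.tolist())    # ['one', 'two', 'three']
```

The row Series is indexed by the column labels, so row["two"] gives the value from column two in that row.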

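3. Selecting a Series from a DataFrame

A single row or a single column of a DataFrame is a Series.

The lookup works much like using a dictionary key to get a value.

4. Selecting a DataFrame from a DataFrame

To select: 1. slicing is the usual approach;

2. alternatively, pass a list of labels, with the same syntax as batch access on a Series.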

>>> s2 = DataFrame({
...     'one':[1, 2, 3, 4, 5],
...     'two':[11, 12, 13, 14, 15],
...     'three':[21, 22, 23, 24, 25]})
>>> s2
   one  two  three
0    1   11     21
1    2   12     22
2    3   13     23
3    4   14     24
4    5   15     25
>>> s2['two']  # access a single column by its label
0    11
1    12
2    13
3    14
4    15
Name: two, dtype: int64
>>> s2.loc[2:4]  # access several rows by row label
   one  two  three
2    3   13     23
3    4   14     24
4    5   15     25

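Note that slicing a Python list excludes the end value, but label-based slicing with .loc on these two data structures includes it. (Position-based slicing with .iloc still excludes the end value.)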
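The difference in endpoint handling can be seen by comparing label-based and position-based slices on the same data. This is a small sketch with a hypothetical one-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3, 4, 5]})

by_label = df.loc[2:4]       # label slice: rows 2, 3 and 4
by_position = df.iloc[2:4]   # positional slice: rows 2 and 3 only

print(len(by_label))      # 3
print(len(by_position))   # 2
```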

2021/11/30

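IV. Querying data with pandas (mainly the .loc indexer)

This section covers five ways to query data: by single value, by list, by range, by condition, and by function.

1. Query with a single label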

>>> fobj
                  Unnamed: 0  code  lasttrade
name                                         
3M                         1   MMM     160.09
American Express           2   AXP     134.46
Apple                      3  AAPL     326.12
Boeing                     4    BA     345.02
Caterpillar                5   CAT     140.47

>>> fobj.loc['3M', 'code']
'MMM'

# The row label comes first, then the column label

2. Batch query with lists of labels

>>> fobj.loc[['Apple', '3M'], ['code', 'lasttrade']]
       code  lasttrade
name                  
Apple  AAPL     326.12
3M      MMM     160.09

3. Range query with label slices (both the start and the end are included)

>>> fobj.loc['3M':'Boeing', 'code':'lasttrade']
                  code  lasttrade
name                             
3M                 MMM     160.09
American Express   AXP     134.46
Apple             AAPL     326.12
Boeing              BA     345.02

4. Query with a conditional expression

>>> fobj.loc[(fobj['lasttrade']<200) & (fobj['lasttrade'] > 150), : ]
      Unnamed: 0 code  lasttrade
name                            
3M             1  MMM     160.09

'''
This kind of query is called boolean indexing.

>>> (fobj['lasttrade']<200) & (fobj['lasttrade'] > 150)
name
3M                   True
American Express    False
Apple               False
Boeing              False
Caterpillar         False
Name: lasttrade, dtype: bool
'''

5. Query with a function

# 1. Using a lambda function
>>> fobj.loc[lambda fobj:(fobj['lasttrade']<200) & (fobj['lasttrade'] > 150)]
      Unnamed: 0 code  lasttrade
name                            
3M             1  MMM     160.09
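# Note: the lambda's parameter fobj here is the whole DataFrame, not a single row; the condition that follows returns a boolean Series, which .loc uses to filter the rows

# 2. Using a custom function

# Example omitted

'''
Notes:
1. When using a custom function, pass only the function name (without parentheses);
2. The return statement of the custom function should express the query condition.
'''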
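For the custom-function case, a minimal sketch might look like the following. The function name between_150_and_200 and the sample data are hypothetical; the key point is that .loc calls the function with the whole DataFrame and expects something it can index with, such as a boolean mask.

```python
import pandas as pd

df = pd.DataFrame(
    {"code": ["MMM", "AXP", "AAPL"], "lasttrade": [160.09, 134.46, 326.12]},
    index=pd.Index(["3M", "American Express", "Apple"], name="name"),
)

def between_150_and_200(frame):
    """Return a boolean mask; .loc calls this with the whole DataFrame."""
    return (frame["lasttrade"] > 150) & (frame["lasttrade"] < 200)

# Pass the function itself, without parentheses.
result = df.loc[between_150_and_200]
print(result)
```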

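Note: the query methods above apply to both rows and columns.

There is also .iloc:

it takes the default integer positions of the rows and columns.

Other options include methods such as .where() and .query().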
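A brief sketch of these three alternatives on hypothetical data; it only scratches the surface of each one.

```python
import pandas as pd

df = pd.DataFrame(
    {"code": ["MMM", "AXP", "AAPL"], "lasttrade": [160.09, 134.46, 326.12]},
    index=["3M", "American Express", "Apple"],
)

# .iloc selects by integer position: first row, second column.
print(df.iloc[0, 1])                   # 160.09

# .query filters rows with a condition written as a string.
cheap = df.query("lasttrade < 200")
print(cheap.index.tolist())            # ['3M', 'American Express']

# .where keeps the original shape and replaces non-matching values with NaN.
masked = df["lasttrade"].where(df["lasttrade"] < 200)
print(masked.isna().tolist())          # [False, False, True]
```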

V. Four ways to add a new column in pandas

0. Data preparation

>>> from pandas import *
>>> fobj = read_csv('D:/桌面/01.csv', index_col = 'Unnamed: 0')
>>> fobj.head(3)
# The columns are a company's closing price, high, low, opening price, trading volume, and month over a period of time
               close        high         low        open   volume  month
2019/4/1  111.699997  112.040001  110.080002  110.290001  5125100      4
2019/4/2  111.000000  111.459999  110.370003  111.209999  3568900      4
2019/4/3  110.559998  112.000000  110.349998  111.820000  3719700      4

1. Direct assignment

Example: compute the difference between the closing and opening price for each day, and name the new column increase:

>>> fobj['increase'] = fobj.close - fobj.open
>>> fobj.head(3)
               close        high         low  ...   volume  month  increase
2019/4/1  111.699997  112.040001  110.080002  ...  5125100      4  1.409996
2019/4/2  111.000000  111.459999  110.370003  ...  3568900      4 -0.209999
2019/4/3  110.559998  112.000000  110.349998  ...  3719700      4 -1.260002

[3 rows x 7 columns]

2. The .apply method
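The original gives no example here, so the following is only a sketch of one common pattern: DataFrame.apply with axis=1 calls a function once per row and the results become the new column. The trend function and the rounded sample prices are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"open": [110.29, 111.21, 111.82],
                   "close": [111.70, 111.00, 110.56]})

def trend(row):
    """Label one row; with axis=1, apply passes each row as a Series."""
    return "up" if row["close"] > row["open"] else "down"

df["trend"] = df.apply(trend, axis=1)
print(df["trend"].tolist())   # ['up', 'down', 'down']
```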

3. The .assign method
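No example is given in the original; as a sketch, DataFrame.assign can add several columns at once and returns a new DataFrame rather than modifying the original. The column names spread and mid are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"high": [112.04, 111.46], "low": [110.08, 110.37]})

# assign returns a new DataFrame; the original is left unchanged.
wider = df.assign(
    spread=lambda d: d["high"] - d["low"],
    mid=lambda d: (d["high"] + d["low"]) / 2,
)
print(list(wider.columns))   # ['high', 'low', 'spread', 'mid']
print(list(df.columns))      # ['high', 'low']
```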

4. Selecting groups by condition and assigning values to each
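The original leaves this method without an example. One way to read it, sketched below under that assumption, is to create the column first and then fill it group by group with .loc and a boolean condition. The direction column and sample prices are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"open": [110.29, 111.21, 111.82],
                   "close": [111.70, 111.00, 110.56]})

df["direction"] = ""   # create the column first
df.loc[df["close"] > df["open"], "direction"] = "up"      # rows that rose
df.loc[df["close"] <= df["open"], "direction"] = "down"   # rows that did not
print(df["direction"].tolist())   # ['up', 'down', 'down']
```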

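VI. Common methods in detail

1. DataFrame.set_index

(self, keys, drop=True, inplace=False, append=False, verify_integrity=False):

Purpose: set the index of the data.

Parameters:

keys: the column label(s) to use as the index; may be a single label, an array, or a list;

drop: whether to delete the column used as the index from the data; the default is True, i.e. the column is removed;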

>>> fobj
   Unnamed: 0  code              name  lasttrade
0           1   MMM                3M     160.09
1           2   AXP  American Express     134.46
2           3  AAPL             Apple     326.12
3           4    BA            Boeing     345.02
4           5   CAT       Caterpillar     140.47
>>> fobj.set_index('name', drop = False)  # keep the 'name' column in the data
                  Unnamed: 0  code              name  lasttrade
name                                                           
3M                         1   MMM                3M     160.09
American Express           2   AXP  American Express     134.46
Apple                      3  AAPL             Apple     326.12
Boeing                     4    BA            Boeing     345.02
Caterpillar                5   CAT       Caterpillar     140.47
>>> fobj.set_index('name')  # drop the 'name' column from the data
                  Unnamed: 0  code  lasttrade
name                                         
3M                         1   MMM     160.09
American Express           2   AXP     134.46
Apple                      3  AAPL     326.12
Boeing                     4    BA     345.02
Caterpillar                5   CAT     140.47

inplace: whether to replace the original data with the result of the call; the default is False, i.e. the original is not replaced;

>>> fobj
   Unnamed: 0  code              name  lasttrade
0           1   MMM                3M     160.09
1           2   AXP  American Express     134.46
2           3  AAPL             Apple     326.12
3           4    BA            Boeing     345.02
4           5   CAT       Caterpillar     140.47

>>> fobj.set_index('name')  # by default the original data is not modified
>>> fobj
   Unnamed: 0  code              name  lasttrade
0           1   MMM                3M     160.09
1           2   AXP  American Express     134.46
2           3  AAPL             Apple     326.12
3           4    BA            Boeing     345.02
4           5   CAT       Caterpillar     140.47

>>> fobj.set_index('name', inplace = True)  # modify the original data
>>> fobj
                  Unnamed: 0  code  lasttrade
name                                         
3M                         1   MMM     160.09
American Express           2   AXP     134.46
Apple                      3  AAPL     326.12
Boeing                     4    BA     345.02
Caterpillar                5   CAT     140.47

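append: whether to add the column(s) to the existing index, forming a multi-level index; the default is False, i.e. the existing index is replaced;
verify_integrity: whether to check the new index for duplicates. The default is False.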
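The last two parameters have no example above, so here is a small sketch on hypothetical data that contains a deliberate duplicate in the code column.

```python
import pandas as pd

df = pd.DataFrame({"code": ["MMM", "AXP", "MMM"],
                   "name": ["3M", "American Express", "3M"]})

# append=True keeps the existing index and adds 'name' as a second level.
multi = df.set_index("name", append=True)
print(multi.index.nlevels)   # 2

# verify_integrity=True raises ValueError when the new index has duplicates.
try:
    df.set_index("code", verify_integrity=True)
    duplicate_found = False
except ValueError:
    duplicate_found = True
print(duplicate_found)       # True
```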
