DataWhale_Pandas Task02 Pandas基础

最新推荐文章于 2024-09-18 09:40:08 发布

那一夜见红

最新推荐文章于 2024-09-18 09:40:08 发布

阅读量230

点赞数

分类专栏： DataWhale_Pandas 文章标签： python 数据分析

本文链接：https://blog.csdn.net/weixin_42044546/article/details/111411293

版权

本文介绍了Pandas库的基础知识，包括文件的读取和写入，如read_csv、to_csv等，以及Series和DataFrame的基本数据结构。还讲解了常用的基本函数，如汇总、统计和排序，以及窗口对象的应用。文中提供了具体的代码示例和练习，适合初学者学习。

摘要由CSDN通过智能技术生成

第二章 pandas基础

In [1]: import numpy as np

In [2]: import pandas as pd

在开始学习前，请保证 pandas 的版本号不低于1.1.5，否则请务必升级！请确认已经安装了 xlrd, xlwt, openpyxl 这三个包，其中xlrd版本不得高于 2.0.0 。

所用数据集下载可以在该项目的github地址：https://github.com/datawhalechina/joyful-pandas

一、文件的读取和写入

1. 文件读取

三个文件读取函数，分别读取csv，excel和txt文件。

In [4]: df_csv = pd.read_csv('data/my_csv.csv')

In [5]: df_csv
Out[5]: 
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2
2     6    c   2.5  orange  2020/1/5
3     5    d   3.2   lemon  2020/1/7

In [6]: df_txt = pd.read_table('data/my_table.txt')

In [7]: df_txt
Out[7]: 
   col1 col2  col3             col4
0     2    a   1.4   apple 2020/1/1
1     3    b   3.4  banana 2020/1/2
2     6    c   2.5  orange 2020/1/5
3     5    d   3.2   lemon 2020/1/7

In [8]: df_excel = pd.read_excel('data/my_excel.xlsx')

In [9]: df_excel
Out[9]: 
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2
2     6    c   2.5  orange  2020/1/5
3     5    d   3.2   lemon  2020/1/7

其中read_csv之前有看过很详细的介绍，链接放这里https://www.cnblogs.com/traditional/p/12514914.html

一些常用的公共参数， header=None 表示第一行不作为列名， index_col 表示把某一列或几列作为索引，索引的内容将会在第三章进行详述， usecols 表示读取列的集合，默认读取所有的列， parse_dates 表示需要转化为时间的列，关于时间序列的有关内容将在第十章讲解， nrows 表示读取的数据行数。上面这些参数在上述的三个函数里都可以使用。

In [10]: pd.read_table('data/my_table.txt', header=None)
Out[10]: 
      0     1     2                3
0  col1  col2  col3             col4
1     2     a   1.4   apple 2020/1/1
2     3     b   3.4  banana 2020/1/2
3     6     c   2.5  orange 2020/1/5
4     5     d   3.2   lemon 2020/1/7

In [11]: pd.read_csv('data/my_csv.csv', index_col=['col1', 'col2'])
Out[11]: 
           col3    col4      col5
col1 col2                        
2    a      1.4   apple  2020/1/1
3    b      3.4  banana  2020/1/2
6    c      2.5  orange  2020/1/5
5    d      3.2   lemon  2020/1/7

In [12]: pd.read_table('data/my_table.txt', usecols=['col1', 'col2'])
Out[12]: 
   col1 col2
0     2    a
1     3    b
2     6    c
3     5    d

In [13]: pd.read_csv('data/my_csv.csv', parse_dates=['col5'])
Out[13]: 
   col1 col2  col3    col4       col5
0     2    a   1.4   apple 2020-01-01
1     3    b   3.4  banana 2020-01-02
2     6    c   2.5  orange 2020-01-05
3     5    d   3.2   lemon 2020-01-07

In [14]: pd.read_excel('data/my_excel.xlsx', nrows=2)
Out[14]: 
   col1 col2  col3    col4      col5
0     2    a   1.4   apple  2020/1/1
1     3    b   3.4  banana  2020/1/2

在读取 txt 文件时，经常遇到分隔符非空格的情况， read_table 有一个分割参数 sep ，它使得用户可以自定义分割符号，进行 txt 数据的读取。一般数据集里遇到用‘\t’比较多(这里是个人经验)。例如，下面的读取的表以 |||| 为分割：

In [15]: pd.read_table('data/my_table_special_sep.txt')
Out[15]: 
              col1 |||| col2
0  TS |||| This is an apple.
1    GQ |||| My name is Bob.
2         WT |||| Well done!
3    PT |||| May I help you?
'''在使用 read_table 的时候需要注意，参数 sep 中使用的是正则表达式，因此需要对 | 进行转义变成 \| ，否则无法读取到正确的结果。有关正则表达式的基本内容可以参考第八章或者其他相关资料。'''

2. 数据写入

一般在数据写入中，最常用的操作是把 index 设置为 False ，特别当索引没有特殊意义的时候，这样的行为能把索引在保存的时候去除。

In [17]: df_csv.to_csv('data/my_csv_saved.csv', index=False)

In [18]: df_excel.to_excel('data/my_excel_saved.xlsx', index=False)

pandas 中没有定义 to_table 函数，但是 to_csv 可以保存为 txt 文件，并且允许自定义分隔符，常用制表符 \t 分割：

In [19]: df_txt.to_csv('data/my_txt_saved.txt', sep='\t', index=False)

如果想要把表格快速转换为 markdown 和 latex 语言，可以使用 to_markdown 和 to_latex 函数，此处需要安装 tabulate 包。

In [20]: print(df_csv.to_markdown())
|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |

In [21]: print(df_csv.to_latex())
\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

二、基本数据结构

pandas 中具有两种基本的数据存储结构，存储一维 values 的 Series 和存储二维 values 的 DataFrame ，在这两种结构上定义了很多的属性和方法。

1. Series

Series 一般由四个部分组成，分别是序列的值 data 、索引 index 、存储类型 dtype 、序列的名字 name 。其中，索引也可以指定它的名字，默认为空。

In [22]: s = pd.Series(data = [100, 'a', {'dic1':5}],
   ....:               index = pd.Inde

x(['id1', 20, 'third'], name='my_idx'),
   ....:               dtype = 'object',
   ....:               name = 'my_name')
   ....: 

In [23]: s
Out[23]: 
my_idx
id1              100
20