[DW Team Study: Pandas] 02. pandas Basics

0_×

Categories: Pandas, DW Team Study, Python. Tags: pandas

Article link: https://blog.csdn.net/sinat_33209811/article/details/111519312

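Summary (auto-generated by CSDN): This article covers the basics of the pandas library: reading and writing files such as csv, excel, and txt; the common data structures Series and DataFrame; and basic statistical functions such as summaries, statistics, and unique values. It also touches on rolling and expanding window objects.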

import numpy as np
import pandas as pd

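Before you start, make sure your pandas version is no lower than the one shown below; otherwise, upgrade it. Also confirm that the xlrd, xlwt, and openpyxl packages are installed, and that the xlrd version is below 2.0.0 (xlrd 2.0.0 and later read only .xls files).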

pd.__version__

'1.1.5'

pip install -U pandas==1.1.5 # upgrade if your version is older; restart the kernel for it to take effect

I. Reading and Writing Files

1. Reading files

pandas can read many file formats. This section covers csv, excel, and txt files.

df_csv = pd.read_csv('data/my_csv.csv')
df_csv

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

df_txt = pd.read_table('data/my_table.txt')
df_txt

	col1	col2	col3	col4
0	2	a	1.4	apple 2020/1/1
1	3	b	3.4	banana 2020/1/2
2	6	c	2.5	orange 2020/1/5
3	5	d	3.2	lemon 2020/1/7

df_excel = pd.read_excel('data/my_excel.xlsx')
df_excel

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
2	6	c	2.5	orange	2020/1/5
3	5	d	3.2	lemon	2020/1/7

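Some commonly used shared parameters:

header=None means the first row is not used as column names
index_col specifies one or more columns to use as the index; indexes are covered in detail in Chapter 3
usecols specifies the set of columns to read; all columns are read by default
parse_dates specifies the columns to convert to datetimes; time series are covered in Chapter 10
nrows specifies the number of data rows to read.

All of these parameters can be used with the three functions above.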

pd.read_table('data/my_table.txt', header = None) # do not use the first row as column names

	0	1	2	3
0	col1	col2	col3	col4
1	2	a	1.4	apple 2020/1/1
2	3	b	3.4	banana 2020/1/2
3	6	c	2.5	orange 2020/1/5
4	5	d	3.2	lemon 2020/1/7

pd.read_csv('data/my_csv.csv', index_col = ['col1', 'col2']) # use the first and second columns as the index

		col3	col4	col5
col1	col2
2	a	1.4	apple	2020/1/1
3	b	3.4	banana	2020/1/2
6	c	2.5	orange	2020/1/5
5	d	3.2	lemon	2020/1/7

pd.read_table('data/my_table.txt', usecols = ['col1', 'col2']) # read only the first and second columns

	col1	col2
0	2	a
1	3	b
2	6	c
3	5	d

pd.read_csv('data/my_csv.csv', parse_dates = ['col5']) # convert the fifth column to datetimes

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020-01-01
1	3	b	3.4	banana	2020-01-02
2	6	c	2.5	orange	2020-01-05
3	5	d	3.2	lemon	2020-01-07

pd.read_excel('data/my_excel.xlsx', nrows = 2) # read two rows of data

	col1	col2	col3	col4	col5
0	2	a	1.4	apple	2020/1/1
1	3	b	3.4	banana	2020/1/2
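
The shared parameters can also be combined in one call. The sketch below is only an illustration: it uses an in-memory string (the hypothetical csv_text) as a stand-in for data/my_csv.csv so that it runs without the data folder, and the name df_combined is made up for this example.

```python
import io
import pandas as pd

# In-memory stand-in for data/my_csv.csv (same rows as the table shown earlier).
csv_text = """col1,col2,col3,col4,col5
2,a,1.4,apple,2020/1/1
3,b,3.4,banana,2020/1/2
6,c,2.5,orange,2020/1/5
5,d,3.2,lemon,2020/1/7
"""

df_combined = pd.read_csv(
    io.StringIO(csv_text),
    index_col='col1',                  # use col1 as the row index
    usecols=['col1', 'col3', 'col5'],  # read only these columns
    parse_dates=['col5'],              # convert col5 to datetimes
    nrows=3,                           # read only the first three data rows
)
print(df_combined)
print(df_combined.dtypes)
```

Note that a column named in index_col must also appear in usecols, otherwise pandas cannot find it.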

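When reading txt files, the delimiter is often something other than the default tab. read_table has a sep parameter that lets you specify a custom delimiter. For example, the table read below is delimited by |||| :
[VC's note] That is, sep specifies the delimiter.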

pd.read_table('data/my_table_special_sep.txt')

	col1 |||| col2
0	TS |||| This is an apple.
1	GQ |||| My name is Bob.
2	WT |||| Well done!
3	PT |||| May I help you?

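This result is clearly not what we want. Use sep here, and set the engine to python :

pd.read_table('data/my_table_special_sep.txt', sep = r'\|\|\|\|', engine = 'python') # raw string, so the backslashes reach the regex engine unchanged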

	col1	col2
0	TS	This is an apple.
1	GQ	My name is Bob.
2	WT	Well done!
3	PT	May I help you?

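[Note] sep is a regular expression and needs escaping
When using read_table , keep in mind that a multi-character sep is interpreted as a regular expression, so each | must be escaped as \| ; otherwise the file is not parsed correctly. For the basics of regular expressions, see Chapter 8 or other references.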
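
As a small sketch of the escaping rule, re.escape can build the escaped pattern so the backslashes do not have to be typed by hand. The sample text raw below is a shortened, in-memory stand-in for data/my_table_special_sep.txt, and the extra \s* around the delimiter is an assumption added here to trim the surrounding spaces; it is not part of the original call.

```python
import io
import re
import pandas as pd

# Shortened in-memory stand-in for data/my_table_special_sep.txt.
raw = (
    "col1 |||| col2\n"
    "TS |||| This is an apple.\n"
    "GQ |||| My name is Bob.\n"
)

# re.escape turns '||||' into '\|\|\|\|'; \s* also swallows the spaces around it.
sep_pattern = r'\s*' + re.escape('||||') + r'\s*'

# A regex separator requires the python parsing engine.
df_sep = pd.read_table(io.StringIO(raw), sep=sep_pattern, engine='python')
print(df_sep)
print(df_sep.columns.tolist())
```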

2. Writing data

When writing data, the most common option is index=False , especially when the index carries no meaning; this drops the index from the saved file.

df_csv.to_csv('data/my_csv_saved.csv', index = False)
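
To see what index=False changes, the sketch below writes a small made-up frame to a string instead of a file (to_csv returns the text when no path is given) and reads the indexed version back. The frame and variable names are illustrative only.

```python
import io
import pandas as pd

df = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b']})

with_index = df.to_csv()                # index=True is the default
without_index = df.to_csv(index=False)  # drop the index when saving

print(with_index)
print(without_index)

# Reading the indexed version back yields an extra, unnamed column.
reloaded = pd.read_csv(io.StringIO(with_index))
print(reloaded.columns.tolist())
```

The extra column on reload is the usual reason for dropping a meaningless index at save time.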

# If you hit xlrd-related errors, try installing the xlrd, xlwt, and tabulate packages:
pip install xlrd xlwt tabulate

Requirement already satisfied: xlrd in c:\programdata\anaconda3\envs\py37\lib\site-packages (1.2.0)
Collecting xlwt
  Downloading xlwt-1.3.0-py2.py3-none-any.whl (99 kB)
Collecting tabulate
  Downloading tabulate-0.8.7-py3-none-any.whl (24 kB)
Installing collected packages: xlwt, tabulate
Successfully installed tabulate-0.8.7 xlwt-1.3.0
Note: you may need to restart the kernel to use updated packages.

pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.0.5-py2.py3-none-any.whl (242 kB)
Collecting et-xmlfile
  Downloading et_xmlfile-1.0.1.tar.gz (8.4 kB)
Collecting jdcal
  Downloading jdcal-1.4.1-py2.py3-none-any.whl (9.5 kB)
Building wheels for collected packages: et-xmlfile
  Building wheel for et-xmlfile (setup.py): started
  Building wheel for et-xmlfile (setup.py): finished with status 'done'
  Created wheel for et-xmlfile: filename=et_xmlfile-1.0.1-py3-none-any.whl size=8919 sha256=cdc60ba1dee65fbab03e098bc4d92e1221d06fd80800a626fba7fa5fe4429865
  Stored in directory: c:\users\viochan\appdata\local\pip\cache\wheels\e2\bd\55\048b4fd505716c4c298f42ee02dffd9496bb6d212b266c7f31
Successfully built et-xmlfile
Installing collected packages: et-xmlfile, jdcal, openpyxl
Successfully installed et-xmlfile-1.0.1 jdcal-1.4.1 openpyxl-3.0.5
Note: you may need to restart the kernel to use updated packages.

[VC's note] If a ModuleNotFoundError occurs, install the missing library named in the message.

df_excel.to_excel('data/my_excel_saved.xlsx', index = False)

pandas does not define a to_table function, but to_csv can save to a txt file and accepts a custom delimiter; the tab character \t is a common choice:

df_txt.to_csv('data/my_txt_saved.txt', sep = "\t", index = False)
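
A quick round trip shows that a tab-separated file written by to_csv can be read back with read_table, whose default separator is the tab. This sketch writes to an in-memory buffer rather than data/my_txt_saved.txt, and the small frame is made up for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'col1': [2, 3], 'col2': ['a', 'b'], 'col3': [1.4, 3.4]})

buffer = io.StringIO()
df.to_csv(buffer, sep='\t', index=False)  # write tab-separated text
buffer.seek(0)                            # rewind before reading

df_back = pd.read_table(buffer)           # read_table splits on tabs by default
print(df_back)
```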

To quickly convert a table to Markdown or LaTeX, use the to_markdown and to_latex functions; to_markdown requires the tabulate package.

Series/DataFrame.to_markdown(buf=None, mode=None, index=True, **kwargs)
Print the object in Markdown format
Parameters

buf: str, optional, default None. The buffer to write to; by default the output is returned as a string

mode: str, optional. The mode in which the file is opened

index: bool, optional, default True. Whether to add the index (row) labels

DataFrame.to_latex(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, bold_rows=False, column_format=None, longtable=None, escape=None, encoding=None, decimal='.', multicolumn=None, multicolumn_format=None, multirow=None, caption=None, label=None)
Render the object to a LaTeX tabular, longtable, or nested table/tabular

print(df_csv.to_markdown())

|    |   col1 | col2   |   col3 | col4   | col5     |
|---:|-------:|:-------|-------:|:-------|:---------|
|  0 |      2 | a      |    1.4 | apple  | 2020/1/1 |
|  1 |      3 | b      |    3.4 | banana | 2020/1/2 |
|  2 |      6 | c      |    2.5 | orange | 2020/1/5 |
|  3 |      5 | d      |    3.2 | lemon  | 2020/1/7 |

print(df_csv.to_latex())

\begin{tabular}{lrlrll}
\toprule
{} &  col1 & col2 &  col3 &    col4 &      col5 \\
\midrule
0 &     2 &    a &   1.4 &   apple &  2020/1/1 \\
1 &     3 &    b &   3.4 &  banana &  2020/1/2 \\
2 &     6 &    c &   2.5 &  orange &  2020/1/5 \\
3 &     5 &    d &   3.2 &   lemon &  2020/1/7 \\
\bottomrule
\end{tabular}

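II. Basic Data Structures

pandas has two basic data structures: Series , which stores one-dimensional values, and DataFrame , which stores two-dimensional values. Many attributes and methods are defined on both.

1. Series

A Series generally consists of four parts: the values ( data ), the index ( index ), the storage type ( dtype ), and the name of the series ( name ). The index can also be given a name, which is empty by default.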

s = pd.Series(data = [100, 'a', {'dic1': 5}],
              index = pd.Index(['id1', 20, 'third'], name = 'my_idx'),
              dtype = 'object',
              name = 'my_name')
s

my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object

pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
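Creates a one-dimensional ndarray with axis labels (including time series)
Parameters

data: array-like, iterable, dict, or scalar value. The data stored in the Series

index: array-like or Index

(1) Values must be hashable and have the same length as data;

(2) Non-unique index values are allowed; if not provided, the index defaults to RangeIndex (0, 1, 2, ..., n);

(3) If both a dict and an index sequence are given, the index overrides the keys found in the dict.

dtype: optional. The data type of the output; if not specified, it is inferred from data.

name: str, optional. The name of the Series

copy: bool, default False. Whether to copy the input data

[VC's note] This is a class constructor: it creates a concrete object that comes with predefined attributes and methods.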
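
The three index rules above can be checked with a short sketch. The variable names are made up for this example.

```python
import pandas as pd

# No index given: pandas falls back to a RangeIndex.
s_default = pd.Series([10, 20, 30])
print(s_default.index)

# Non-unique labels are allowed; selecting 'a' returns both matching rows.
s_dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])
print(s_dup['a'])

# dict plus index: the index decides which keys are kept and in what order.
# 'a' is dropped, and 'd' has no value in the dict, so it becomes NaN.
s_dict = pd.Series({'a': 1, 'b': 2, 'c': 3}, index=['b', 'c', 'd'])
print(s_dict)
```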

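[Note] The object type
object represents a mixed type, as in the example above, which stores an integer, a string, and a Python dict. In addition, pandas currently treats a pure string series as object by default, although it can also be stored with the string type; text series are discussed in Chapter 8.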
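
A minimal sketch of the difference, with made-up variable names. The default dtype for text described above is that of the pandas 1.x release used in this article; newer releases may infer a different default, so no output is asserted for that line.

```python
import pandas as pd

s_mixed = pd.Series([100, 'a', {'dic1': 5}])  # mixed Python objects
s_text = pd.Series(['apple', 'banana'])       # text only
s_string = s_text.astype('string')            # explicit string extension dtype

print(s_mixed.dtype)   # object
print(s_text.dtype)    # object on pandas 1.x; may differ on newer releases
print(s_string.dtype)  # string
```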
These attributes can be accessed with . :

s.values # the values of the series

array([100, 'a', {'dic1': 5}], dtype=object)

s.index # the index of the series

Index(['id1', 20, 'third'], dtype='object', name='my_idx')

s.dtype # the dtype of the series

dtype('O')

s.name # the name of the series

'my_name'

Use .shape to get the shape of the series, a one-element tuple holding its length:

s.shape

(3,)

The index is one of the most important concepts in pandas and is discussed in detail in Chapter 3. To get the value for a single index label, use [index_item] :

s['third']

{'dic1': 5}

2. DataFrame

DataFrame adds a column index on top of Series . A DataFrame can be constructed from two-dimensional data together with row and column indexes.
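
A minimal sketch of such a construction, assuming made-up values and hypothetical labels (row_0, col_0, and so on) chosen only for illustration:

```python
import pandas as pd

# Two-dimensional data: one inner list per row.
data = [[1, 'a', 1.2],
        [2, 'b', 2.2],
        [3, 'c', 3.2]]

df = pd.DataFrame(data=data,
                  index=['row_%d' % i for i in range(3)],     # row labels
                  columns=['col_0', 'col_1', 'col_2'])        # column labels
print(df)

# Each column of a DataFrame is itself a Series sharing the row index.
print(type(df['col_0']))
```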
