Pandas学习（3）DataFream操作2

最新推荐文章于 2023-11-06 15:26:36 发布

Sk8er-boi

最新推荐文章于 2023-11-06 15:26:36 发布

阅读量292

点赞数

分类专栏： Pandas 文章标签： python 深度学习

原文链接：https://www.pypandas.cn/

版权

Pandas 专栏收录该内容

4 篇文章 0 订阅

订阅专栏

分组（Grouping），适用于有重复数据可统计、可分组的DataFrame
“group by” 指的是涵盖下列一项或多项步骤的处理流程：
分割：按条件把数据分割成多组；
应用：为每组单独应用函数；
组合：将处理结果组合成一个数据结构。

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                           'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})

df

	A	B	C	D
0	foo	one	0.433992	-1.333293
1	bar	one	0.194936	-0.391367
2	foo	two	1.199633	0.119350
3	bar	three	-0.786895	0.729469
4	foo	two	0.450172	0.418308
5	bar	two	-0.256733	-0.802991
6	foo	one	0.202326	0.758224
7	foo	three	-1.381662	0.517538

df1=df.groupby('A').sum()#计算

df1#统计A组中所有bar、foo的和

	C	D
A
bar	-0.848692	-0.464888
foo	0.904460	0.480128

也可进行多列分组：

df.groupby(['A', 'B']).sum()#同时统计A、B列可分组数据和

		C	D
A	B
bar	one	0.194936	-0.391367
	three	-0.786895	0.729469
	two	-0.256733	-0.802991
foo	one	0.636317	-0.575069
	three	-1.381662	0.517538
	two	1.649805	0.537658

多层索引与重塑：
有时候为了更好的查看DataFrame数据，我们可以使用多层索引，并会使用到重塑方法。
首先需要创建一个MultiIndex（层次索引）对象
MultiIndex可以从阵列（使用的列表来创建 MultiIndex.from_arrays()），
元组（使用的阵列 MultiIndex.from_tuples()），
一个交叉组iterables（使用的 MultiIndex.from_product()），
或者一个DataFrame（使用 MultiIndex.from_frame()）创建。
下列实例使用一个元组列表和Index构造函数创建

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                      'foo', 'foo', 'qux', 'qux'],
                     ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])  
df = pd.DataFrame(np.random.randn(8, 4), index=index, columns=['A', 'B','C','D'])
df

		A	B	C	D
first	second
bar	one	0.271837	-0.997848	-2.000626	-0.034044
bar	two	-0.779253	-1.503962	-1.244141	0.337168
baz	one	1.060027	-1.522924	-0.400959	0.813439
baz	two	1.896187	-0.525100	0.341734	0.120456
foo	one	-0.549181	0.474325	0.605349	0.161852
foo	two	-0.732263	0.756273	-0.986397	0.810937
qux	one	-0.755619	-1.457063	0.626315	0.405198
qux	two	0.377285	0.494047	-1.161364	-0.772654

stack()方法把 DataFrame 多列压缩至一层：

df1=df.stack()
df1

first  second   
bar    one     A    0.271837
               B   -0.997848
               C   -2.000626
               D   -0.034044
       two     A   -0.779253
               B   -1.503962
               C   -1.244141
               D    0.337168
baz    one     A    1.060027
               B   -1.522924
               C   -0.400959
               D    0.813439
       two     A    1.896187
               B   -0.525100
               C    0.341734
               D    0.120456
foo    one     A   -0.549181
               B    0.474325
               C    0.605349
               D    0.161852
       two     A   -0.732263
               B    0.756273
               C   -0.986397
               D    0.810937
qux    one     A   -0.755619
               B   -1.457063
               C    0.626315
               D    0.405198
       two     A    0.377285
               B    0.494047
               C   -1.161364
               D   -0.772654
dtype: float64

stack() 的逆操作是 unstack()

df1.unstack()

		A	B	C	D
first	second
bar	one	0.271837	-0.997848	-2.000626	-0.034044
bar	two	-0.779253	-1.503962	-1.244141	0.337168
baz	one	1.060027	-1.522924	-0.400959	0.813439
baz	two	1.896187	-0.525100	0.341734	0.120456
foo	one	-0.549181	0.474325	0.605349	0.161852
foo	two	-0.732263	0.756273	-0.986397	0.810937
qux	one	-0.755619	-1.457063	0.626315	0.405198
qux	two	0.377285	0.494047	-1.161364	-0.772654

数据输入 / 输出
文本文件读写
CSV read_csv to_csv
Fixed-WidthTextFile read_fwf
JSON read_json to_json
HTML read_html to_html
Local clipboard read_clipboard to_clipboard
MS Excel read_excel to_excel

二进制文件
OpenDocument read_excel
HDF5 Format read_hdf to_hdf
Feather Format read_feather to_feather
Parquet Format read_parquet to_parquet
ORC Format read_orc
Msgpack read_msgpack to_msgpack
Stata read_stata to_stata
SAS read_sas
SPSS read_spss
Python Pickle Format read_pickle to_pickle

SQL
SQL read_sql to_sql
Google Big Queryread_gbq to_gbq

df

		A	B	C	D
first	second
bar	one	0.271837	-0.997848	-2.000626	-0.034044
bar	two	-0.779253	-1.503962	-1.244141	0.337168
baz	one	1.060027	-1.522924	-0.400959	0.813439
baz	two	1.896187	-0.525100	0.341734	0.120456
foo	one	-0.549181	0.474325	0.605349	0.161852
foo	two	-0.732263	0.756273	-0.986397	0.810937
qux	one	-0.755619	-1.457063	0.626315	0.405198
qux	two	0.377285	0.494047	-1.161364	-0.772654

#将df写入csv文件。
df.to_csv('file.csv')

#打开创建好的csv文件夹
pd.read_csv('file.csv')

	first	second	A	B	C	D
0	bar	one	0.271837	-0.997848	-2.000626	-0.034044
1	bar	two	-0.779253	-1.503962	-1.244141	0.337168
2	baz	one	1.060027	-1.522924	-0.400959	0.813439
3	baz	two	1.896187	-0.525100	0.341734	0.120456
4	foo	one	-0.549181	0.474325	0.605349	0.161852
5	foo	two	-0.732263	0.756273	-0.986397	0.810937
6	qux	one	-0.755619	-1.457063	0.626315	0.405198
7	qux	two	0.377285	0.494047	-1.161364	-0.772654

Sk8er-boi

关注

0
点赞
踩
1

收藏

觉得还不错? 一键收藏
0
评论
Pandas学习（3）DataFream操作2

分组（Grouping），适用于有重复数据可统计、可分组的DataFrame“group by” 指的是涵盖下列一项或多项步骤的处理流程：分割：按条件把数据分割成多组；应用：为每组单独应用函数；组合：将处理结果组合成一个数据结构。import numpy as npimport pandas as pddf = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'fo
复制链接

扫一扫