Basic pandas Operations

Tomorrowave

Article link: https://blog.csdn.net/m0_58381606/article/details/125775087

Contents

Basic data structures
- Series
- DataFrame
Basic functions

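Basic data structures

Series

A Series generally consists of four parts: the values (data), the index (index), the storage type (dtype), and the name of the series (name). The index can also be given a name of its own, which is empty by default.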

import pandas as pd

# A Series holding mixed values, with a named index and a series name
s = pd.Series(data=[100, 'a', {'dic1': 5}],
              index=pd.Index(['id1', 20, 'third'], name='my_idx'),
              dtype='object',
              name='my_name')
'''
In [23]: s
Out[23]:
my_idx
id1              100
20                 a
third    {'dic1': 5}
Name: my_name, dtype: object
'''
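
The four parts listed above can be read back from the object as attributes. The following is a minimal sketch that rebuilds the same series and inspects it; the label lookup on the last line is an added illustration, not something the original text shows.

```python
import pandas as pd

# Same construction as in the text: mixed values, a named index, a series name.
s = pd.Series(
    data=[100, "a", {"dic1": 5}],
    index=pd.Index(["id1", 20, "third"], name="my_idx"),
    dtype="object",
    name="my_name",
)

print(s.values)      # the underlying array holding the three values
print(s.index)       # the Index object with the labels and the index name
print(s.dtype)       # storage type of the values
print(s.name)        # name of the series
print(s.index.name)  # name of the index
print(s.shape)       # length of the series as a 1-tuple

# Illustration only: label-based lookup returns the stored element.
third_value = s["third"]
print(third_value)
```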

DataFrame

A DataFrame adds a column index on top of a Series. A data frame can be constructed from two-dimensional data together with row and column indexes:

import pandas as pd

# Three rows of mixed data with explicit row and column labels
data = [[1, 'a', 1.2], [2, 'b', 2.2], [3, 'c', 3.2]]
df = pd.DataFrame(data=data,
                  index=['row_%d' % i for i in range(3)],
                  columns=['col_0', 'col_1', 'col_2'])
'''
In [32]: df
Out[32]:
       col_0 col_1  col_2
row_0      1     a    1.2
row_1      2     b    2.2
row_2      3     c    3.2
'''
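
To make the relationship between the two structures concrete, here is a small sketch built on the same data. It assumes nothing beyond standard pandas indexing: selecting one column yields a Series that shares the row index, while selecting a list of columns yields a DataFrame.

```python
import pandas as pd

data = [[1, "a", 1.2], [2, "b", 2.2], [3, "c", 3.2]]
df = pd.DataFrame(
    data=data,
    index=[f"row_{i}" for i in range(3)],
    columns=["col_0", "col_1", "col_2"],
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.dtypes)         # one dtype per column

# A single column label returns a Series named after the column.
col = df["col_0"]
print(type(col).__name__, col.name)

# A list of column labels returns a DataFrame.
sub = df[["col_0", "col_1"]]
print(type(sub).__name__, sub.shape)
```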

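Basic functions

head and tail return the first n and last n rows of a table or series, respectively, where n defaults to 5. In the examples from here on, df is no longer the 3-row frame built above but a 200-row table of student records (its loading code is not shown in this post):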

In [46]: df.head(2)
Out[46]:
                          School     Grade            Name  Gender  Height  Weight Transfer
0  Shanghai Jiao Tong University  Freshman    Gaopeng Yang  Female   158.9    46.0        N
1              Peking University  Freshman  Changqiang You    Male   166.5    70.0        N
In [47]: df.tail(3)
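
Since the student table itself is not available here, the behaviour of head and tail can be checked on a small stand-in. The sketch below uses a hypothetical 8-row frame named toy; the negative-n call is an extra illustration of documented pandas behaviour, not something the original post covers.

```python
import pandas as pd

# Hypothetical 8-row table, used only for illustration.
toy = pd.DataFrame({"x": range(8), "y": list("abcdefgh")})

first_default = toy.head()       # n defaults to 5
last_three = toy.tail(3)         # last 3 rows
all_but_last_two = toy.head(-2)  # negative n: all rows except the last 2
first_two_x = toy["x"].head(2)   # the same methods work on a Series

print(first_default)
print(last_three)
print(all_but_last_two)
print(first_two_x)
```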

info prints an overview of the table, and describe returns the main summary statistics for its numeric columns:

In [48]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   School    200 non-null    object
 1   Grade     200 non-null    object
 2   Name      200 non-null    object
 3   Gender    200 non-null    object
 4   Height    183 non-null    float64
 5   Weight    189 non-null    float64
 6   Transfer  188 non-null    object
dtypes: float64(2), object(5)
memory usage: 11.1+ KB
In [49]: df.describe()
Out[49]:
           Height      Weight
count  183.000000  189.000000
mean   163.218033   55.015873
std      8.608879   12.824294
min    145.400000   34.000000
25%    157.150000   46.000000
50%    161.900000   51.000000
75%    167.500000   65.000000
max    193.900000   89.000000
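
The same two calls can be tried on a tiny frame to see how missing values and non-numeric columns are treated. This is a sketch on a hypothetical table named toy; the include="all" call is an added illustration of an existing describe parameter and is not part of the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one text column and one missing height.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "height": [150.0, 160.0, np.nan, 170.0],
    "weight": [40.0, 50.0, 60.0, 70.0],
})

# info() prints its report to stdout and returns None.
info_result = toy.info()

# By default describe() summarises only the numeric columns,
# and count excludes missing values.
stats = toy.describe()
print(stats)

# include="all" adds the non-numeric columns to the summary as well.
full = toy.describe(include="all")
print(full)
```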

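A more comprehensive data summary
info and describe show only a limited amount of information. For a thorough and effective look at a dataset, especially one with many columns,
the pandas-profiling package (since renamed ydata-profiling) is recommended; it is mentioned again in Chapter 11.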

Series and DataFrame define many statistical functions; the most common are sum, mean, median, var, std, max, and min. For example, select the height and weight columns for a demonstration:

In [50]: df_demo = df[['Height', 'Weight']]
In [51]: df_demo.mean()
Out[51]:
Height    163.218033
Weight     55.015873
dtype: float64
In [52]: df_demo.max()
Out[52]:
Height    193.9
Weight     89.0
dtype: float64
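
Two details of these statistical functions are easy to miss: missing values are skipped by default, and the axis argument switches from per-column to per-row aggregation. The sketch below shows both on a hypothetical four-row table; the numbers are made up for illustration and are not taken from the student data.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing height.
toy = pd.DataFrame({
    "Height": [150.0, 160.0, np.nan, 170.0],
    "Weight": [40.0, 50.0, 60.0, 70.0],
})

col_means = toy.mean()      # one value per column, NaN skipped
col_medians = toy.median()
col_var = toy.var()         # sample variance (ddof=1 by default)
col_std = toy.std()         # sample standard deviation
row_max = toy.max(axis=1)   # one value per row

print(col_means)
print(col_std)
print(row_max)
```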

Calling unique and nunique on a series returns the array of its unique values and the number of unique values, respectively:

In [57]: df['School'].unique()
Out[57]:
array(['Shanghai Jiao Tong University', 'Peking University',
       'Fudan University', 'Tsinghua University'], dtype=object)
In [58]: df['School'].nunique()
Out[58]: 4
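
A short sketch on made-up data shows the same two calls, plus one point the text does not mention: by default nunique ignores missing values while unique keeps them. The series names schools and with_nan are hypothetical.

```python
import pandas as pd

# Hypothetical series of school labels.
schools = pd.Series(["A", "B", "A", "C", "B", "A"], name="School")

uniq = schools.unique()   # unique values in order of first appearance
n = schools.nunique()     # number of unique values
print(list(uniq), n)

# With a missing value: unique() keeps it, nunique() skips it by default.
with_nan = pd.Series([1.0, None, 1.0, 2.0])
n_unique_values = len(with_nan.unique())
n_default = with_nan.nunique()
n_with_nan = with_nan.nunique(dropna=False)
print(n_unique_values, n_default, n_with_nan)
```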

value_counts returns the unique values together with the frequency of each:

In [59]: df['School'].value_counts()
Out[59]:
Tsinghua University              69
Shanghai Jiao Tong University    57
Fudan University                 40
Peking University                34
Name: School, dtype: int64
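
The following sketch repeats value_counts on a hypothetical series and adds the normalize option, which returns proportions instead of raw counts. Note that the printed header can differ from the output above: as far as I know, pandas 2.0 and later label the result count (or proportion) rather than reusing the series name.

```python
import pandas as pd

# Hypothetical series of school labels.
schools = pd.Series(["A", "B", "A", "C", "B", "A"], name="School")

counts = schools.value_counts()                # sorted by frequency, descending
shares = schools.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares)
```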
