Introduction to pandas Data Structures: Constructing a DataFrame and Its Attributes

Latest recommended article published on 2024-08-21 16:45:43

峡谷的小鱼

Categories: data analysis, pandas. Tags: python, data analysis, machine learning, pytorch

Article link: https://blog.csdn.net/weixin_43276033/article/details/124030844

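This article explains how to use the pandas DataFrame object: how to construct one from a dict, an ndarray, or a Series, and how to inspect its attributes, data types, and index, along with commonly used methods such as info() and select_dtypes(). The focus is on worked construction examples and the core concepts.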

The DataFrame Data Structure

I. Constructing a `DataFrame` Object

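class pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None):
	"""
		Two-dimensional, size-mutable tabular data;
		the structure carries both row and column labels.
		Parameters:
			data: ndarray, iterable, dict, or DataFrame
				A dict can contain Series, arrays, or list-like objects.
			index: Index or array-like
				Row index of the resulting frame; defaults to RangeIndex if not provided.
			columns: Index or array-like
				Column labels of the resulting frame; defaults to RangeIndex if not provided.
			dtype: dtype, default None
				Data type to force; only a single dtype is allowed.
			copy: bool or None, default None
				Whether to copy data from the inputs.
	"""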

Construction examples:

# 1. Construct a DataFrame from a dict
>>> import numpy as np
>>> import pandas as pd
>>> d = {'col1': [1, 2, 3], 'col2': [3, 4, 5]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4
2     3     5

# Data type of each column
>>> df.dtypes
col1    int64
col2    int64
dtype: object

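# 2. Construct a DataFrame from a dict that contains a Series
# The Series is aligned on its own labels (2 and 3); none of them
# match 'a'..'d', so col2 is filled with NaN
>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=list('abcd'))
   col1  col2
a     0   NaN
b     1   NaN
c     2   NaN
d     3   NaN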
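
For contrast, here is a minimal sketch of the same construction where the Series labels do overlap with the frame's index. The labels `'c'` and `'d'` are chosen for illustration; only the overlapping rows receive values.

```python
import pandas as pd

# Series labels 'c' and 'd' overlap with the frame's index 'a'..'d'.
s = pd.Series([2, 3], index=["c", "d"])
d = {"col1": [0, 1, 2, 3], "col2": s}

aligned = pd.DataFrame(data=d, index=list("abcd"))
print(aligned)

# Rows 'a' and 'b' have no matching label in the Series, so they stay NaN.
print(aligned["col2"].isna().tolist())
```

Because NaN is a floating-point value, `col2` ends up as float64 even though the Series held integers.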

# 3. Construct a DataFrame from a NumPy ndarray
>>> pd.DataFrame(data=np.arange(20).reshape(4, 5), index=list('abcd'))
    0   1   2   3   4
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
# From a structured array; `columns` selects a subset of the named fields
>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> pd.DataFrame(data=data, columns=['a', 'c'])
   a  c
0  1  3
1  4  6
2  7  9
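
The role of `columns` differs between the two ndarray cases above. The sketch below illustrates this with made-up data: for a structured array it selects and orders existing fields, while for a plain 2-D array it only supplies labels.

```python
import numpy as np
import pandas as pd

records = np.array([(1, 2, 3), (4, 5, 6)],
                   dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])

# Structured array: `columns` picks existing fields, in the order given.
picked = pd.DataFrame(records, columns=["c", "a"])
print(picked)

# Plain 2-D array: `columns` only names the columns, so its length
# has to match the number of columns in the array.
plain = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["x", "y", "z"])
print(plain)
```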

II. DataFrame Attributes

# Build a frame with mixed column dtypes; 'b' is NumPy's code for a signed byte (int8)
>>> df = pd.DataFrame(data=np.array(
...     [(x, x+1, x+2, x+3, x+4) for x in range(0, 25, 5)],
...     dtype=[('col1', 'i4'), ('col2', 'i8'), ('col3', 'i4'), ('col4', 'f8'), ('col5', 'b')]))
>>> df
   col1  col2  col3  col4  col5
0     0     1     2   3.0     4
1     5     6     7   8.0     9
2    10    11    12  13.0    14
3    15    16    17  18.0    19
4    20    21    22  23.0    24

# DataFrame attributes
# Row index of the DataFrame
>>> df.index
RangeIndex(start=0, stop=5, step=1)

# Column labels of the DataFrame
>>> df.columns
Index(['col1', 'col2', 'col3', 'col4', 'col5'], dtype='object')

# Data type of each column
>>> df.dtypes
col1      int32
col2      int64
col3      int32
col4    float64
col5       int8
dtype: object

# Concise summary of the DataFrame (info() is a method, not an attribute)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   col1    5 non-null      int32  
 1   col2    5 non-null      int64  
 2   col3    5 non-null      int32  
 3   col4    5 non-null      float64
 4   col5    5 non-null      int8   
dtypes: float64(1), int32(2), int64(1), int8(1)
memory usage: 253.0 bytes

# Return the subset of columns whose dtype matches the given types
>>> df.select_dtypes([np.int32])
   col1  col3
0     0     2
1     5     7
2    10    12
3    15    17
4    20    22
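
`select_dtypes` also accepts `include` and `exclude` by keyword. The following sketch uses a small made-up frame (column names `i`, `f`, `s` are illustrative) to show both, including the generic `"number"` selector.

```python
import numpy as np
import pandas as pd

# Small illustrative frame with one int32, one float64 and one string column.
df = pd.DataFrame({
    "i": np.array([1, 2], dtype="int32"),
    "f": np.array([1.5, 2.5], dtype="float64"),
    "s": ["a", "b"],
})

ints = df.select_dtypes(include=["int32"])         # exact dtype match
numeric = df.select_dtypes(include="number")       # every numeric column
non_float = df.select_dtypes(exclude=["float64"])  # everything but float64

print(list(ints.columns))
print(list(numeric.columns))
print(list(non_float.columns))
```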

# Values as a NumPy array; the mixed column dtypes are upcast to a common dtype (float64 here)
>>> df.values
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 5.,  6.,  7.,  8.,  9.],
       [10., 11., 12., 13., 14.],
       [15., 16., 17., 18., 19.],
       [20., 21., 22., 23., 24.]])
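
The upcasting seen above can be checked directly. This sketch uses `DataFrame.to_numpy()`, which the pandas documentation recommends over `.values`; the frame itself is made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": np.array([1, 2], dtype="int32"),
    "b": np.array([0.5, 1.5], dtype="float64"),
})

# Mixed int32/float64 columns are converted to one common dtype.
arr = df.to_numpy()
print(arr.dtype)

# Selecting columns that share a dtype keeps that dtype.
ints_only = df[["a"]].to_numpy()
print(ints_only.dtype)
```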

# Both axes as a list: the row index first, then the column labels
>>> df.axes
[RangeIndex(start=0, stop=5, step=1),
 Index(['col1', 'col2', 'col3', 'col4', 'col5'], dtype='object')]

# Number of dimensions; always 2 for a DataFrame
>>> df.ndim
2

# Number of elements (rows × columns)
>>> df.size
25

# Shape as a (rows, columns) tuple
>>> df.shape
(5, 5)

# Whether the DataFrame is empty (boolean)
>>> df.empty
False
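
The size-related attributes are tied together, and `empty` has one edge case worth knowing. A minimal sketch with made-up frames:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 3)))

# size is always the product of the two entries of shape.
rows, cols = df.shape
assert df.size == rows * cols
assert df.ndim == 2

# A frame with columns but no rows is empty.
no_rows = pd.DataFrame(columns=["a", "b"])
print(no_rows.empty)

# A frame that holds only NaN still has a row, so it is not empty.
only_nan = pd.DataFrame({"a": [np.nan]})
print(only_nan.empty)
```

In other words, `empty` checks whether any axis has length 0, not whether the data are missing.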

# Memory usage of the index and of each column, in bytes (memory_usage() is a method)
>>> df.memory_usage()
Index    128
col1      20
col2      40
col3      20
col4      40
col5       5
dtype: int64
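
For fixed-width numeric columns, the per-column figures are the item size multiplied by the row count. The sketch below checks this on a made-up frame and uses `index=False` to leave the index out of the result.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "i4": np.arange(5, dtype="int32"),
    "f8": np.arange(5, dtype="float64"),
})

# Bytes per column, with the index excluded.
per_column = df.memory_usage(index=False)
print(per_column)

# Each value equals itemsize * number of rows.
assert per_column["i4"] == 4 * len(df)
assert per_column["f8"] == 8 * len(df)

# Total footprint of the data columns.
total = per_column.sum()
print(total)
```

For object (for example string) columns, passing `deep=True` asks pandas to inspect the stored Python objects instead of counting only the pointers.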