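Theory:
DataFrame data structure:
- A two-dimensional, table-like structure, similar to a multidimensional array
- Each column can hold a different data type
- It has two indexes: a row index (index) and a column index (label/column)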
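The three properties above can be checked on a small frame. This is a minimal sketch; the column names, row labels, and values are made up for illustration and do not come from the lesson.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different type.
df = pd.DataFrame(
    {
        "name": ["Ann", "Bob", "Cid"],   # strings
        "age": [31, 25, 47],             # integers
        "score": [88.5, 92.0, 79.25],    # floats
    },
    index=["r1", "r2", "r3"],            # explicit row index
)

row_labels = list(df.index)        # row index labels
column_labels = list(df.columns)   # column index labels
column_dtypes = df.dtypes          # one dtype per column
print(column_dtypes)
```

`df.dtypes` returns a Series keyed by column label, which is a quick way to confirm that columns are typed independently.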
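Building and operating on a DataFrame:
1. Building a DataFrame
- From an ndarray or a list
- From a dict
2. Getting column data (returns a Series)
df_obj[label] or df_obj.label
3. Adding column data
df_obj[new_label] = data
4. Deleting columns
- df_obj.drop(columns=[], inplace=False): with inplace=True the original data is modified in place; otherwise the original is left unchanged and a new DataFrame is returned.
- del df_obj[col_idx]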
Lab:
Lesson 5: Data Analysis Tool Pandas Basics
Section 2: Data Structures - DataFrame
Data Structures - DataFrame
In [1]:
import pandas as pd
import numpy as np
Building a DataFrame
In [3]:
# From an ndarray
array = np.random.randn(5,4)
print(array)
[[-0.12816457  0.27077416 -0.61990247  0.4035906 ]
 [-0.26341507  0.04203491  0.77618217  1.37930502]
 [ 1.48660347  0.2923378  -0.54919946 -0.30086526]
 [-0.53059414 -1.56234117 -0.33324783  0.41363335]
 [ 0.12839676 -0.96041499 -0.67103782 -1.01363347]]
In [4]:
df_obj = pd.DataFrame(array)
df_obj
Out[4]:
| | 0 | 1 | 2 | 3 |
---|---|---|---|---|
0 | -0.128165 | 0.270774 | -0.619902 | 0.403591 |
1 | -0.263415 | 0.042035 | 0.776182 | 1.379305 |
2 | 1.486603 | 0.292338 | -0.549199 | -0.300865 |
3 | -0.530594 | -1.562341 | -0.333248 | 0.413633 |
4 | 0.128397 | -0.960415 | -0.671038 | -1.013633 |
In [5]:
print(df_obj)
          0         1         2         3
0 -0.128165  0.270774 -0.619902  0.403591
1 -0.263415  0.042035  0.776182  1.379305
2  1.486603  0.292338 -0.549199 -0.300865
3 -0.530594 -1.562341 -0.333248  0.413633
4  0.128397 -0.960415 -0.671038 -1.013633
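When no labels are given, both indexes default to 0..n-1, as in the output above. The sketch below shows how labels could be supplied at construction time, and how a plain list of lists is handled the same way as an ndarray. The labels `x`, `y`, `a`, `b`, `c` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(3, 2)  # 3 rows, 2 columns: [[0, 1], [2, 3], [4, 5]]

# From an ndarray, with hypothetical labels instead of the default 0..n-1
df_from_array = pd.DataFrame(values, columns=["x", "y"], index=["a", "b", "c"])

# From a list of lists; each inner list becomes one row
df_from_list = pd.DataFrame([[0, 1], [2, 3], [4, 5]], columns=["x", "y"])

print(df_from_array)
print(df_from_list)
```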
In [6]:
# From a dict
dict_data = {'A': 1.,
'B': pd.Timestamp('20200318'),
'C': pd.Series(1,index=list(range(4)),dtype='float32'),
'D': np.array([3] * 4,dtype='int32'),
'E': ['Python','Java','C++','C#'],
'F': 'ChinaHadoop'}
print(dict_data)
{'A': 1.0, 'B': Timestamp('2020-03-18 00:00:00'), 'C': 0    1.0
1    1.0
2    1.0
3    1.0
dtype: float32, 'D': array([3, 3, 3, 3]), 'E': ['Python', 'Java', 'C++', 'C#'], 'F': 'ChinaHadoop'}
In [7]:
df_obj2 = pd.DataFrame(dict_data)
df_obj2
Out[7]:
| | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 1.0 | 2020-03-18 | 1.0 | 3 | Python | ChinaHadoop |
1 | 1.0 | 2020-03-18 | 1.0 | 3 | Java | ChinaHadoop |
2 | 1.0 | 2020-03-18 | 1.0 | 3 | C++ | ChinaHadoop |
3 | 1.0 | 2020-03-18 | 1.0 | 3 | C# | ChinaHadoop |
Getting the row index, column index, and values
In [9]:
df_obj2.columns
Out[9]:
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
In [10]:
df_obj2.index
Out[10]:
Int64Index([0, 1, 2, 3], dtype='int64')
In [11]:
df_obj2.values
Out[11]:
array([[1.0, Timestamp('2020-03-18 00:00:00'), 1.0, 3, 'Python', 'ChinaHadoop'],
       [1.0, Timestamp('2020-03-18 00:00:00'), 1.0, 3, 'Java', 'ChinaHadoop'],
       [1.0, Timestamp('2020-03-18 00:00:00'), 1.0, 3, 'C++', 'ChinaHadoop'],
       [1.0, Timestamp('2020-03-18 00:00:00'), 1.0, 3, 'C#', 'ChinaHadoop']],
      dtype=object)
Previewing data
In [12]:
df_obj2.head(3)
Out[12]:
| | A | B | C | D | E | F |
---|---|---|---|---|---|---|
0 | 1.0 | 2020-03-18 | 1.0 | 3 | Python | ChinaHadoop |
1 | 1.0 | 2020-03-18 | 1.0 | 3 | Java | ChinaHadoop |
2 | 1.0 | 2020-03-18 | 1.0 | 3 | C++ | ChinaHadoop |
In [13]:
df_obj2.tail(3)
Out[13]:
| | A | B | C | D | E | F |
---|---|---|---|---|---|---|
1 | 1.0 | 2020-03-18 | 1.0 | 3 | Java | ChinaHadoop |
2 | 1.0 | 2020-03-18 | 1.0 | 3 | C++ | ChinaHadoop |
3 | 1.0 | 2020-03-18 | 1.0 | 3 | C# | ChinaHadoop |
Getting column data
In [14]:
df_obj2['E']
Out[14]:
0    Python
1      Java
2       C++
3        C#
Name: E, dtype: object
In [16]:
df_obj2.E
Out[16]:
0    Python
1      Java
2       C++
3        C#
Name: E, dtype: object
In [15]:
type(df_obj2['E'])
Out[15]:
pandas.core.series.Series
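The two access forms are equivalent for a column such as `E`, but attribute access depends on the column name being a valid Python identifier that does not collide with an existing DataFrame attribute. A minimal sketch, using hypothetical column names `count` and `unit price` chosen to trigger both cases:

```python
import pandas as pd

# Hypothetical frame: "count" collides with DataFrame.count,
# and "unit price" contains a space.
df = pd.DataFrame({"E": ["Python", "Java"], "count": [3, 5], "unit price": [1.5, 2.0]})

same = df["E"].equals(df.E)     # both forms return the same Series
is_method = callable(df.count)  # attribute access finds the method, not the column
counts = df["count"]            # bracket access always returns the column
prices = df["unit price"]       # only bracket access can express this name
```

For this reason bracket access is usually the safer default in reusable code, with attribute access kept for interactive exploration.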
Adding column data
In [17]:
df_obj2['G'] = range(4)
In [18]:
df_obj2
Out[18]:
| | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|
0 | 1.0 | 2020-03-18 | 1.0 | 3 | Python | ChinaHadoop | 0 |
1 | 1.0 | 2020-03-18 | 1.0 | 3 | Java | ChinaHadoop | 1 |
2 | 1.0 | 2020-03-18 | 1.0 | 3 | C++ | ChinaHadoop | 2 |
3 | 1.0 | 2020-03-18 | 1.0 | 3 | C# | ChinaHadoop | 3 |
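The assigned data does not have to be a range of matching length. The sketch below, on a small hypothetical frame, shows two other cases: a scalar is broadcast to every row, and a Series is aligned on the row index rather than by position. The column names `H` and `I` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"E": ["Python", "Java", "C++"]})  # default index 0, 1, 2

df["H"] = 0                                  # a scalar is broadcast to every row
df["I"] = pd.Series([10, 30], index=[0, 2])  # aligned on the index; row 1 has no match
# df["J"] = [1, 2]  # would raise ValueError: the list length differs from the row count
print(df)
```

Row 1 of column `I` ends up as NaN because the assigned Series has no entry for index 1.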
Deleting columns
In [21]:
df_obj3 = df_obj2.drop(columns=['B','G'])
In [22]:
df_obj3
Out[22]:
| | A | C | D | E | F |
---|---|---|---|---|---|
0 | 1.0 | 1.0 | 3 | Python | ChinaHadoop |
1 | 1.0 | 1.0 | 3 | Java | ChinaHadoop |
2 | 1.0 | 1.0 | 3 | C++ | ChinaHadoop |
3 | 1.0 | 1.0 | 3 | C# | ChinaHadoop |
In [23]:
df_obj2
Out[23]:
| | A | B | C | D | E | F | G |
---|---|---|---|---|---|---|---|
0 | 1.0 | 2020-03-18 | 1.0 | 3 | Python | ChinaHadoop | 0 |
1 | 1.0 | 2020-03-18 | 1.0 | 3 | Java | ChinaHadoop | 1 |
2 | 1.0 | 2020-03-18 | 1.0 | 3 | C++ | ChinaHadoop | 2 |
3 | 1.0 | 2020-03-18 | 1.0 | 3 | C# | ChinaHadoop | 3 |
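The cells above show that `drop` with the default `inplace=False` returns a new DataFrame and leaves `df_obj2` untouched. The two in-place variants from the theory section can be sketched on a small hypothetical frame as follows:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4], "G": [5, 6]})

result = df.drop(columns=["B"], inplace=True)  # modifies df itself and returns None
del df["G"]                                    # also removes the column in place

print(result)            # None
print(list(df.columns))  # ['A']
```

Because `inplace=True` returns `None`, assigning its result back to the same variable would discard the frame.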