2.5 DataFrame Operations
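A DataFrame is a two-dimensional data structure, similar to an Excel sheet or a database table. Its vertical lines are called columns, and each column corresponds to a Series as introduced earlier. Its rows are labeled by the index, so a column label together with an index label identifies the position of a single element. The example below creates a DataFrame from a dictionary: the dictionary keys ("UserName", "Age", "Sex", "Salary") become the column labels of the DataFrame, and the value under each key (here a list) becomes a Series holding the data of that column. The shape attribute reports the size of the DataFrame as (rows, columns), here (3, 4). Because no index is specified in the definition below, the index defaults to integers starting from 0.
Example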
import pandas as pd
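# Note: the name `dict` shadows Python's built-in dict type; it is kept because later examples reuse it.
dict = {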
"UserName": ["Braund",
"Allen",
"Bonnell"],
"Age": [22, 35, 58],
"Sex": ["male", "male", "female"],
"Salary": [5000, 6000, 10000]
}
df = pd.DataFrame(dict)
print(df)
print(df.shape)
print(type(df),type(df['UserName']))
print('Series from DataFrame :\n', df['UserName'])
# Output:
   UserName  Age     Sex  Salary
0    Braund   22    male    5000
1     Allen   35    male    6000
2   Bonnell   58  female   10000
(3, 4)
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.series.Series'>
Series from DataFrame :
0 Braund
1 Allen
2 Bonnell
Name: UserName, dtype: object
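As a minimal sketch of how a column label and an index label together locate a single element, the snippet below rebuilds the same table and reads individual cells with `.at` and `.loc`. The variable names `data`, `age_of_first`, and `name_of_second` are chosen here for illustration and do not appear in the original example.

```python
import pandas as pd

# Same data as the example above; `data` is an illustrative name
# that avoids shadowing the built-in `dict`.
data = {
    "UserName": ["Braund", "Allen", "Bonnell"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
    "Salary": [5000, 6000, 10000],
}
df = pd.DataFrame(data)

# .at reads one cell, addressed by (index label, column label).
age_of_first = df.at[0, "Age"]

# .loc accepts the same pair of labels and also supports slices and lists.
name_of_second = df.loc[1, "UserName"]

print(age_of_first, name_of_second)
```

`.at` is meant for single cells, while `.loc` is the more general label-based accessor; either works for this lookup.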
In a DataFrame, the order of the columns can be specified explicitly, and, as with a Series, the index can also be customized, as shown below.
Example
df2 = pd.DataFrame(dict,columns=['UserName','Salary','Age', 'Sex'],index=['a','b','c'])
print(df2)
# Output:
   UserName  Salary  Age     Sex
a    Braund    5000   22    male
b     Allen    6000   35    male
c   Bonnell   10000   58  female
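To make the effect of the `columns` and `index` arguments easier to inspect, the following sketch rebuilds the table with a custom column order and index, then reads the labels back and looks up one cell by its custom labels. The names `data` and `salary_b` are illustrative assumptions.

```python
import pandas as pd

# Illustrative copy of the data used above.
data = {
    "UserName": ["Braund", "Allen", "Bonnell"],
    "Age": [22, 35, 58],
    "Sex": ["male", "male", "female"],
    "Salary": [5000, 6000, 10000],
}

# `columns` fixes the column order; `index` replaces the default 0, 1, 2 labels.
df2 = pd.DataFrame(
    data,
    columns=["UserName", "Salary", "Age", "Sex"],
    index=["a", "b", "c"],
)

print(list(df2.columns))  # column labels in the requested order
print(list(df2.index))    # custom index labels

# A cell is now addressed with the custom index label instead of an integer.
salary_b = df2.loc["b", "Salary"]
print(salary_b)
```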
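Besides the approach above, a DataFrame can also be defined from a "dictionary of dictionaries": the keys of the outer dictionary become the columns, the keys of the inner dictionaries become the index, and the inner values become the cell values. Cells defined in the dictionaries receive those values, and any cell left undefined is filled with a missing value (NaN). The example below demonstrates three different cases.
Example
# The two columns have different numbers of index labels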
dictIndict1 = {'lang':{'first':'python','second':'java'},'price':{'first':5000,'second':2000,'third':3000}}
print(pd.DataFrame(dictIndict1))
# The two columns differ in both the names and the number of index labels
dictIndict2 = {'lang':{'first':'python','second':'java'},'price':{'1st':5000,'2nd':2000,'3rd':3000}}
print(pd.DataFrame(dictIndict2))
# A dictionary nested three levels deep
dictIndict3 = {'lang':{'first':'python','second':'java'},'price':{'1st':5000,'2nd':2000,'3rd':{'x':3000}}}
print(pd.DataFrame(dictIndict3))
# Output:
          lang  price
first   python   5000
second    java   2000
third      NaN   3000
          lang   price
first   python     NaN
second    java     NaN
1st        NaN  5000.0
2nd        NaN  2000.0
3rd        NaN  3000.0
          lang        price
first   python          NaN
second    java          NaN
1st        NaN         5000
2nd        NaN         2000
3rd        NaN  {'x': 3000}
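The undefined cells can also be checked programmatically. The sketch below, which uses the illustrative names `nested`, `mixed`, and `missing_per_column`, counts the missing values per column with `isna()` and shows that an integer column containing NaN is stored as `float64`, which is why the second output above prints 5000.0 rather than 5000.

```python
import pandas as pd

# First case: "lang" has no entry for the index label "third".
nested = {
    "lang": {"first": "python", "second": "java"},
    "price": {"first": 5000, "second": 2000, "third": 3000},
}
df_nested = pd.DataFrame(nested)

# isna() marks the cells that were never defined; sum() counts them per column.
missing_per_column = df_nested.isna().sum()
print(missing_per_column)

# Second case: the index labels of the two columns do not overlap at all.
mixed = {
    "lang": {"first": "python", "second": "java"},
    "price": {"1st": 5000, "2nd": 2000, "3rd": 3000},
}
df_mixed = pd.DataFrame(mixed)

# NaN is a floating-point value, so the integer prices are upcast to float64.
print(df_mixed["price"].dtype)
```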
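The greatest advantage of a DataFrame is that its data can be manipulated easily. The following example adds a new column and then assigns new values to it through a Series.
Example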
df3 = pd.DataFrame(dict,columns=['UserName','Salary','Age', 'Sex', 'Marriage'])
# Assign the same value to every row
df3['Marriage'] = 'married'
print(df3)
# Assign one value per row through a Series
MarriageStatus = pd.Series(['married', 'single', 'single'])
df3['Marriage'] = MarriageStatus
print(df3)
# Output:
   UserName  Salary  Age     Sex  Marriage
0    Braund    5000   22    male   married
1     Allen    6000   35    male   married
2   Bonnell   10000   58  female   married
   UserName  Salary  Age     Sex  Marriage
0    Braund    5000   22    male   married
1     Allen    6000   35    male    single
2   Bonnell   10000   58  female    single
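One detail worth knowing when assigning a Series to a column: pandas matches the values by index label, not by position. The example above works positionally only because both objects use the default index 0, 1, 2. The sketch below uses a small hypothetical table and illustrative values to show what happens when the labels differ.

```python
import pandas as pd

# A small illustrative table with a custom index.
df = pd.DataFrame(
    {"UserName": ["Braund", "Allen", "Bonnell"]},
    index=["a", "b", "c"],
)

# This Series only covers the labels "c" and "a", in a different order.
status = pd.Series(["married", "single"], index=["c", "a"])

# Values are aligned by label: "a" gets "single", "c" gets "married",
# and "b", which is absent from the Series, becomes a missing value.
df["Marriage"] = status
print(df)
```

If positional assignment is intended, passing a plain list of the same length as the DataFrame avoids the label alignment.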