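I. Objects in pandas
1. Series objects
A Series consists of two associated arrays, values and index. values (the main array) stores the data; index stores the label associated with each element of values.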
import numpy as np
import pandas as pd
s1 = pd.Series([1, 3, 5, 7])
print(s1)
→0 1
1 3
2 5
3 7
dtype: int64
print(s1.values)
→[1 3 5 7]
print(s1.index)
→RangeIndex(start=0, stop=4, step=1)
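Creating a Series from a NumPy array, and from another Series (the int32 dtype shown below is platform-dependent; it is what Windows builds of NumPy before 2.0 typically produce, while other platforms usually show int64):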
arr1 = np.array([1, 3, 5, 7])
s2 = pd.Series(arr1, index=['a', 'b', 'c', 'd'])
s2_ = pd.Series(s2)
print(s2)
→a 1
b 3
c 5
d 7
dtype: int32
print(s2_)
→a 1
b 3
c 5
d 7
dtype: int32
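When a dict is passed together with an index, every index label that matches a dict key takes that key's value. Labels with no matching key get NaN, which also promotes the dtype to float64.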
dict1 = {"a": 3, "b": 4, "c": 5}
s3 = pd.Series(dict1, index=["a", "b", "c", "d"])
print(s3)
→a 3.0
b 4.0
c 5.0
d NaN
dtype: float64
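As a small sketch of how the missing label above might be detected and handled, the snippet below uses isna() and fillna(). The variable names and the choice of 0 as the fill value are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical data: the index asks for a label "d" that the dict lacks.
data = {"a": 3, "b": 4, "c": 5}
s = pd.Series(data, index=["a", "b", "c", "d"])

# Boolean Series marking the labels whose value is NaN.
missing = s.isna()

# Replace NaN with 0 (an arbitrary choice here); returns a new Series.
filled = s.fillna(0)

print(missing)
print(filled)
```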
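2. DataFrame objects
A DataFrame extends the Series to two dimensions. It is an ordered collection of columns, each of which may hold a different type, and it has two index arrays: index (row labels) and columns (column labels).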
dict2 = {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8], "c": [9, 10, 11, 12]}
df1 = pd.DataFrame(dict2)
print(df1)
→  a  b   c
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
df2 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   index=["one", "two", "three", "four"],
                   columns=["ball", "pen", "pencil", "paper"])
print(df2)
→      ball  pen  pencil  paper
one       0    1       2      3
two       4    5       6      7
three     8    9      10     11
four     12   13      14     15
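II. Basic pandas operations
1. Importing and exporting data
(1) Importing csv files
Prototype (abridged): read_csv(filepath, sep, names, encoding). The arguments are the path of the csv file, the delimiter (comma by default), a list of column names to assign (when omitted, the first row of the file is used as the header), and the file encoding (typically utf-8). To import only some of the columns, pass usecols.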
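The parameters described above can be tried without a file on disk. The following is a minimal sketch that reads from an in-memory string; the semicolon-separated text and the column names x, y, z are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: no header row, semicolon-delimited.
csv_text = "1;2;3\n4;5;6\n"

# names assigns the column labels; passing it without header= means
# the first line is treated as data, not as a header.
# encoding only matters when reading from a path or a bytes stream,
# so it is omitted here.
df = pd.read_csv(io.StringIO(csv_text), sep=";", names=["x", "y", "z"])

print(df)
```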
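(2) Importing txt files
read_table() accepts the same parameters as read_csv(), but its default delimiter is a tab. Because txt files have no fixed delimiter, sep must be set to match the file exactly.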
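To illustrate why the delimiter matters for txt files, here is a hedged sketch with two made-up inputs: one separated by single spaces, and one with irregular whitespace, for which a regular-expression separator is one possible choice.

```python
import io
import pandas as pd

# Hypothetical contents: values separated by exactly one space.
txt = "1 2\n3 4\n5 6\n"
df = pd.read_table(io.StringIO(txt), sep=" ", header=None)

# Hypothetical contents with mixed spaces and tabs; r"\s+" matches
# any run of whitespace, so both lines split into two columns.
messy = "1   2\n3\t4\n"
df_ws = pd.read_table(io.StringIO(messy), sep=r"\s+", header=None)

print(df)
print(df_ws)
```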
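(3) Importing Excel files
read_excel() takes many parameters, but the most commonly used are the path, sheet_name (which sheet to read), and usecols (which columns to read). Usually only the path is needed.
The examples below read three files: data.csv, data.txt, and data.xlsx. Their contents were shown as screenshots in the original; they can be inferred from the printed output.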
df3 = pd.read_csv(r"D:\Pycharm professional\pythonProject\test_pandas_files\data.csv")
print(df3)
→   0   1   2
0   1   2   3
1   4   5   6
2   7   8   9
3  10  11  12
df4 = pd.read_table(r"D:\Pycharm professional\pythonProject\test_pandas_files\data.txt", sep=' ', header=None)
print(df4)
→  0   1
0  1   2
1  3   4
2  5   6
3  7   8
4  9  10
df5 = pd.read_excel(r"D:\Pycharm professional\pythonProject\test_pandas_files\data.xlsx")
print(df5)
→  0  1  2  3
0  a  b  c  d
1  e  f  g  h
2  i  j  k  l
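(4) Exporting data
Prototype (abridged): to_csv(path_or_buf, sep, index, encoding). The arguments are the output path of the csv file, the delimiter (comma by default), whether to write the row index (True by default), and the file encoding (typically utf-8). The header argument used below controls whether the column names are written.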
df3.to_csv(r"D:\Pycharm professional\pythonProject\test_pandas_files\data1.csv", index=True, header=True)
df3.to_csv(r"D:\Pycharm professional\pythonProject\test_pandas_files\data2.csv", index=False, header=True)
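data1.csv contains the row index as its first column, because it was written with index=True.
data2.csv contains only the column names and the data, because it was written with index=False.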
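The effect of the index argument can also be checked without writing files: when no path is given, to_csv() returns the csv text as a string. The small DataFrame below is a hypothetical stand-in for df3.

```python
import pandas as pd

# Hypothetical stand-in data.
df = pd.DataFrame({"x": [1, 4], "y": [2, 5]})

# With no path, to_csv() returns the text instead of writing a file.
with_index = df.to_csv(index=True)      # first column is the row index
without_index = df.to_csv(index=False)  # row index is left out

print(with_index)
print(without_index)
```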
2. Viewing and inspecting data
(1) Series objects
print(s1[2])
→5
print(s2['c'])
→5
print(s2[0:2])
→a 1
b 3
dtype: int32
print(s2[['a', 'b']])
→a 1
b 3
dtype: int32
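The square-bracket lookups above mix label-based and position-based access. As a sketch of a more explicit alternative, .loc selects by label and .iloc by integer position; note that a label slice includes its end point while a positional slice does not. The Series here mirrors s2 but is redefined so the snippet runs on its own.

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7], index=["a", "b", "c", "d"])

by_label = s.loc["c"]          # select by index label
by_position = s.iloc[2]        # select by integer position

label_slice = s.loc["a":"b"]   # label slice: end label "b" is included
pos_slice = s.iloc[0:2]        # positional slice: position 2 is excluded

print(by_label, by_position)
print(label_slice)
print(pos_slice)
```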
(2) DataFrame objects
print(df2.columns)
→Index(['ball', 'pen', 'pencil', 'paper'], dtype='object')
print(type(df2.columns))
→<class 'pandas.core.indexes.base.Index'>
print(df2.index)
→Index(['one', 'two', 'three', 'four'], dtype='object')
print(type(df2.index))
→<class 'pandas.core.indexes.base.Index'>
print(df2.values)
→[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
print(type(df2.values))
→<class 'numpy.ndarray'>
print(df2["pencil"])
→one 2
two 6
three 10
four 14
Name: pencil, dtype: int32
print(df2.pen)
→one 1
two 5
three 9
four 13
Name: pen, dtype: int32
print(df2[0:2])
→    ball  pen  pencil  paper
one     0    1       2      3
two     4    5       6      7
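Beyond selecting whole columns or row slices, rows and individual cells can be addressed with .loc (labels) and .iloc (positions). The following sketch rebuilds the same 4x4 frame as df2 so that it runs independently.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=["one", "two", "three", "four"],
                  columns=["ball", "pen", "pencil", "paper"])

row = df.loc["two"]            # one row, returned as a Series
cell = df.loc["three", "pen"]  # one cell, by row and column label
block = df.iloc[0:2, 1:3]      # first two rows, columns at positions 1 and 2

print(row)
print(cell)
print(block)
```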
3. Adding, deleting, looking up, and modifying data
Create a Series as follows:
s4 = pd.Series([1, 3, 5, 7], index=['a', 'b', 'c', 'd'])
(1) Adding
s4['e'] = 9
print(s4)
→a 1
b 3
c 5
d 7
e 9
dtype: int64
(2) Deleting
s4.pop('e')
print(s4)
→a 1
b 3
c 5
d 7
dtype: int64
print(s4.drop('c'))
→a 1
b 3
d 7
dtype: int64
print(s4)
→a 1
b 3
c 5
d 7
dtype: int64
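The output above shows the difference between the two calls: pop() removes the element from the Series itself and returns its value, while drop() returns a new Series and leaves the original untouched. A compact sketch of that contrast, using a freshly built Series:

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9], index=["a", "b", "c", "d", "e"])

removed = s.pop("e")    # returns the removed value; s is changed in place
shorter = s.drop("c")   # returns a new Series; s still contains "c"

print(removed)
print(s.index.tolist())
print(shorter.index.tolist())
```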
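(3) Lookup and modification
s4.iloc[2] = 6  # positional assignment; s4[2] on a label index is deprecated in recent pandas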
s4['a'] = 0
print(s4)
→a 0
b 3
c 6
d 7
dtype: int64
print(s4[s4 > 4])
→c 6
d 7
dtype: int64
df2.loc["two", "pencil"] = 12  # a single .loc call avoids chained assignment such as df2["pencil"][1]
print(df2)
→      ball  pen  pencil  paper
one       0    1       2      3
two       4    5      12      7
three     8    9      10     11
four     12   13      14     15
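Combining the boolean filtering and the cell modification shown in this section, .loc can also assign to every row that satisfies a condition. The sketch below is only an illustration; the condition and the replacement value 0 are arbitrary choices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=["one", "two", "three", "four"],
                  columns=["ball", "pen", "pencil", "paper"])

# Modify a single cell, addressed by row label and column label.
df.loc["two", "pencil"] = 12

# Modify the "paper" column only in rows where "ball" is greater than 4.
df.loc[df["ball"] > 4, "paper"] = 0

print(df)
```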
4. Basic uses of pandas
(1) Statistics
Create a DataFrame as follows:
arr2 = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(4, 2)
df6 = pd.DataFrame(arr2, index=['a', 'b', 'c', 'd'], columns=['one', 'two'])
print(df6)
→  one  two
a    1    2
b    3    4
c    5    6
d    7    8
① Sum (per column by default; axis=1 sums across each row)
print(df6.sum())
→one 16
two 20
dtype: int64
print(df6.sum(axis=1))
→a 3
b 7
c 11
d 15
dtype: int64
② Cumulative sum
print(df6.cumsum())
→  one  two
a    1    2
b    4    6
c    9   12
d   16   20
③ Row label of the maximum and minimum value in each column
print(df6.idxmax())
→one d
two d
dtype: object
print(df6.idxmin())
→one a
two a
dtype: object
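④ Removing duplicates
unique() returns a NumPy array of the distinct values. value_counts() returns a Series whose index holds the distinct values and whose values are their counts.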
s5 = pd.Series([1, 3, 5, 7, 2, 4, 3, 5, 7, 6, 7])
print(s5.unique())
→[1 3 5 7 2 4 6]
print(type(s5.unique()))
→<class 'numpy.ndarray'>
print(s5.value_counts())
→7 3
3 2
5 2
1 1
2 1
4 1
6 1
dtype: int64
print(type(s5.value_counts()))
→<class 'pandas.core.series.Series'>
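⑤ Filtering data
isin() tests, element by element, whether each value of the Series is contained in the given list. It returns a boolean Series, which can then be used as a filter.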
print(s5.isin([2, 4]))
→0 False
1 False
2 False
3 False
4 True
5 True
6 False
7 False
8 False
9 False
10 False
dtype: bool
print(s5[s5.isin([2, 4])])
→4 2
5 4
dtype: int64
(2) Arithmetic operations
s6 = pd.Series([20, 40, 60, 80])
print(s6 / 2)
→0 10.0
1 20.0
2 30.0
3 40.0
dtype: float64
print(np.log(s6))
→0 2.995732
1 3.688879
2 4.094345
3 4.382027
dtype: float64
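(3) Data alignment
Alignment is an important part of data cleaning. Arithmetic between two Series is carried out label by label, regardless of the order of the labels, and any label present in only one of the two yields NaN in the result.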
s7 = pd.Series({"b": 4, "c": 5, "a": 3})
s8 = pd.Series({"a": 1, "b": 7, "c": 2, "d": 11})
print(s7 + s8)
→a 4.0
b 11.0
c 7.0
d NaN
dtype: float64
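If NaN is not wanted for labels that exist in only one Series, one option is the add() method with fill_value, which substitutes a default for the missing side before adding. The sketch below reuses the data of s7 and s8; using 0 as the fill value is an assumption that suits addition but may not suit other data.

```python
import pandas as pd

s7 = pd.Series({"b": 4, "c": 5, "a": 3})
s8 = pd.Series({"a": 1, "b": 7, "c": 2, "d": 11})

# Plain operator: label "d" is missing from s7, so the result is NaN there.
plain = s7 + s8

# add() with fill_value=0 treats the missing s7["d"] as 0.
filled = s7.add(s8, fill_value=0)

print(plain)
print(filled)
```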