Importing pandas
import pandas as pd  # import pandas under its conventional alias
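1. The Series data structure
pandas.Series(data, index, dtype, name, copy)
- data: the values to store (a list, ndarray, dict, or scalar).
- index: the index labels; if omitted, an integer index starting at 0 is used.
- dtype: the data type; inferred from the data if omitted.
- name: a name for the Series.
- copy: whether to copy the input data; defaults to False.
# Create a Series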
a = [1, 2, 3]
data = pd.Series(a)
print(data)
print(data[1])  # access a value by its index label
outputs:
0 1
1 2
2 3
dtype: int64
2
a = [1, 2, 3]
data = pd.Series(a, index = ["x", "y", "z"])
print(data)
print(data["x"])  # access a value by its index label
outputs:
x 1
y 2
z 3
dtype: int64
1
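The remaining constructor parameters (name and dtype) and dict input can be illustrated with a minimal sketch. The variable name prices and its values are made up for this example.

```python
import pandas as pd

# Build a Series from a dict: the keys become the index labels.
# name and dtype are set explicitly; the values are illustrative only.
prices = pd.Series({"x": 1, "y": 2, "z": 3}, name="price", dtype="float64")

print(prices.name)            # price
print(prices.dtype)           # float64
print(prices["y"])            # 2.0
print(prices.index.tolist())  # ['x', 'y', 'z']
```

Because dtype was given as float64, the integer inputs are stored as floats.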
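2. Data types
The Series above have dtype int64.
The main pandas dtypes and what they are used for:
- object: text (or mixed Python objects)
- int64: integers
- float64: floating-point numbers
- bool: booleans
- datetime64[ns]: dates and times
- timedelta64[ns]: differences between two datetimes
- category: categorical data
# Create Series of different types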
a = ["a", "b", "c"]
data = pd.Series(a)
print(data.dtype)
a = ['2022-01-01', '2023-01-01', '2022-09-08']
data = pd.Series(a)
data = pd.to_datetime(data)
print(data.dtype)
a = [1.2, 2.4, 5.6]
data = pd.Series(a)
print(data.dtype)
outputs:
object
datetime64[ns]
float64
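The examples above cover object, datetime64, and float64. The sketch below shows the other dtypes from the list (bool, category, and timedelta), using made-up sample values.

```python
import pandas as pd

# A list of Python booleans is inferred as bool.
flags = pd.Series([True, False, True])
print(flags.dtype)  # bool

# astype converts an existing Series to another dtype.
grades = pd.Series(["low", "high", "low"]).astype("category")
print(grades.dtype)                    # category
print(grades.cat.categories.tolist())  # ['high', 'low']

# Subtracting two datetime Series yields a timedelta Series.
start = pd.to_datetime(pd.Series(["2022-01-01", "2022-09-08"]))
end = pd.to_datetime(pd.Series(["2022-01-03", "2022-09-09"]))
gap = end - start
print(gap.dt.days.tolist())  # [2, 1]
```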
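3. The DataFrame data structure
pandas.DataFrame(data, index, columns, dtype, copy)
A DataFrame is a two-dimensional labeled table, similar to a 2-D array.
- data: the values to store (ndarray, Series, list, dict, and so on).
- index: the row labels; defaults to a RangeIndex (0, 1, 2, …, n).
- columns: the column labels; defaults to a RangeIndex (0, 1, 2, …, n).
- dtype: the data type; inferred from the data if omitted.
- copy: whether to copy the input data (False in older pandas versions; recent versions default to None and decide based on the input type).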
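As a minimal sketch of the index and columns parameters, the example below builds a DataFrame from a NumPy array. The labels r1 to r3 and the column names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A 3x2 array; row and column labels are supplied explicitly.
values = np.array([[1, 2], [3, 4], [5, 6]])
frame = pd.DataFrame(values, index=["r1", "r2", "r3"], columns=["a", "b"])

print(frame.shape)            # (3, 2)
print(frame.loc["r2", "b"])   # 4
print(frame.index.tolist())   # ['r1', 'r2', 'r3']
```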
3.1 Creating a DataFrame from a list
data = [['a',10],['b',12],['c',13]]
data = pd.DataFrame(data, columns=['character','num'])
print(data)
print(data.dtypes)
outputs:
character num
0 a 10
1 b 12
2 c 13
character object
num int64
dtype: object
3.2 Creating a DataFrame from a list of dicts
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
data = pd.DataFrame(data)
print(data)
print(data.dtypes)
outputs:
a b c
0 1 2 NaN
1 5 10 20.0
a int64
b int64
c float64
dtype: object
Keys missing from a record are filled with NaN, which is why column c is float64 rather than int64.
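The constructor also accepts a single dict that maps column names to lists of values. The sketch below shows that form next to the list-of-dicts form and checks where the missing value landed. The variable names table and records are illustrative.

```python
import pandas as pd

# Dict of columns: each key is a column name, each list holds that column's values.
table = pd.DataFrame({"character": ["a", "b", "c"], "num": [10, 12, 13]})
print(table.columns.tolist())  # ['character', 'num']
print(table["num"].sum())      # 35

# List of records: a key missing from one record becomes NaN in that row.
records = pd.DataFrame([{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}])
print(records["c"].isna().tolist())  # [True, False]
```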
3.3 Reading a DataFrame from a file
data = pd.read_csv("data/Online_Retail_Fake.csv")  # read the CSV file
data.head()  # show the first 5 rows
outputs:
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
---|---|---|---|---|---|---|---|---|
0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 2010/12/1 8:26 | 2.55 | 17850.0 | United Kingdom |
1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 2010/12/1 8:26 | NaN | 17850.0 | United Kingdom |
2 | 536365 | 84406B | NaN | 8 | 2010/12/1 8:26 | 2.75 | 17850.0 | United Kingdom |
3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 2010/12/1 8:26 | 3.39 | 17850.0 | United Kingdom |
4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 2010/12/1 8:26 | 3.39 | 17850.0 | United Kingdom |
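Problems encountered and how to fix them
- UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 22: invalid start byte
  The file is not UTF-8 encoded; pass its actual encoding to read_csv, for example encoding='gbk'.
- ImportError: Missing optional dependency 'openpyxl'. Use pip or conda to install openpyxl.
  Install the packages needed for reading Excel files: pip install openpyxl, pip install xlrd
- UserWarning: Unknown extension is not supported and will be removed warn(msg)
  This warning has not affected the loaded data so far and can be ignored.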
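The encoding fix can be tried without the original data file. The sketch below writes a small GBK-encoded CSV to a temporary directory and reads it back. The file name and contents are invented for the example.

```python
import os
import tempfile

import pandas as pd

# Write a small GBK-encoded CSV so the example needs no external file.
path = os.path.join(tempfile.mkdtemp(), "sample_gbk.csv")
with open(path, "w", encoding="gbk") as f:
    f.write("name,qty\n苹果,3\n香蕉,5\n")

# read_csv assumes UTF-8 unless told otherwise, so GBK bytes may fail to decode.
try:
    pd.read_csv(path)
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False
print(decoded_as_utf8)

# Passing the file's real encoding reads it correctly.
sample = pd.read_csv(path, encoding="gbk")
print(sample["qty"].tolist())  # [3, 5]
```

If the encoding of a file is unknown, trying the likely candidates one at a time (for example utf-8, then gbk) is a common approach.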
type(data)  # check the type of the object
# the result shows that data is a DataFrame
outputs:
pandas.core.frame.DataFrame
print(data.index)  # row labels
outputs:
RangeIndex(start=0, stop=541910, step=1)
print(data.columns)  # column labels
outputs:
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'UnitPrice', 'CustomerID', 'Country'],
dtype='object')
print(data.values)  # the underlying data as a NumPy array
outputs:
[['536365' '85123A' 'WHITE HANGING HEART T-LIGHT HOLDER' ... 2.55 17850.0
'United Kingdom']
['536365' '71053' 'WHITE METAL LANTERN' ... nan 17850.0 'United Kingdom']
['536365' '84406B' nan ... 2.75 17850.0 'United Kingdom']
...
['581587' '23255' 'CHILDRENS CUTLERY CIRCUS PARADE' ... 4.15 12680.0
'France']
['581587' '22138' 'BAKING SET 9 PIECE RETROSPOT ' ... 4.95 12680.0
'France']
['581587' '22138' 'Wrong booking' ... 4.95 12680.0 'France']]
print(data.dtypes)  # dtype of each column
outputs:
InvoiceNo object
StockCode object
Description object
Quantity int64
InvoiceDate object
UnitPrice float64
CustomerID float64
Country object
dtype: object
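InvoiceDate is read as object (plain strings) in the output above. A minimal sketch of converting such a column with pd.to_datetime follows. It uses two rows copied from the head() output rather than the full file, and the format string is an assumption based on how those rows look.

```python
import pandas as pd

# Two sample rows in the same shape as the head() output above.
orders = pd.DataFrame({
    "InvoiceNo": ["536365", "536365"],
    "InvoiceDate": ["2010/12/1 8:26", "2010/12/1 8:26"],
    "CustomerID": [17850.0, 17850.0],
})

# Parse the strings into datetimes; the format is assumed from the sample rows.
orders["InvoiceDate"] = pd.to_datetime(orders["InvoiceDate"], format="%Y/%m/%d %H:%M")

print(orders["InvoiceDate"].dt.year.tolist())  # [2010, 2010]
print(orders["InvoiceDate"].dt.hour.tolist())  # [8, 8]
```

Once the column is a datetime type, the .dt accessor exposes components such as year, month, and hour.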
4. Indexing and slicing
print(data[["StockCode", "Quantity"]])  # select columns by name
outputs:
StockCode Quantity
0 85123A 6
1 71053 6
2 84406B 8
3 84029G 6
4 84029E 6
... ... ...
541905 22899 6
541906 23254 4
541907 23255 4
541908 22138 3
541909 22138 3
[541910 rows x 2 columns]
print(data.loc[[0, 1]])  # select rows by label
outputs:
InvoiceNo StockCode Description Quantity \
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6
1 536365 71053 WHITE METAL LANTERN 6
InvoiceDate UnitPrice CustomerID Country
0 2010/12/1 8:26 2.55 17850.0 United Kingdom
1 2010/12/1 8:26 NaN 17850.0 United Kingdom
print(data.loc[[1, 2], ["StockCode", "Quantity"]])  # select rows and columns in one step
outputs:
StockCode Quantity
1 71053 6
2 84406B 8
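The examples above select by label with loc. The sketch below contrasts label-based loc with position-based iloc and adds a boolean filter. It uses a small made-up frame with a non-default index so the difference is visible.

```python
import pandas as pd

# Row labels 10, 11, 12 differ from the positions 0, 1, 2.
items = pd.DataFrame(
    {"StockCode": ["85123A", "71053", "84406B"], "Quantity": [6, 6, 8]},
    index=[10, 11, 12],
)

print(items.loc[10, "StockCode"])             # by label: 85123A
print(items.iloc[0, 0])                       # by position: 85123A
print(items.loc[11:12, "Quantity"].tolist())  # label slice includes the end: [6, 8]
print(items.iloc[1:2]["Quantity"].tolist())   # position slice excludes the end: [6]

# Boolean mask: keep only the rows where the condition holds.
print(items[items["Quantity"] > 6]["StockCode"].tolist())  # ['84406B']
```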
5. Adding, deleting, and modifying
5.1 Filling missing values
data = [{"a": 1, "b": 2}, {"a": 3, "b": 4, "c": 5}]
data = pd.DataFrame(data)
print(data)
print("------------------")
data = data.fillna(value=10)
print(data)
outputs:
a b c
0 1 2 NaN
1 3 4 5.0
------------------
a b c
0 1 2 10.0
1 3 4 5.0
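fillna also accepts a dict, so each column can get its own fill value. A minimal sketch with invented column names:

```python
import pandas as pd

frame = pd.DataFrame({"qty": [1.0, None, 3.0], "label": ["a", None, "c"]})

# Count the missing values per column before filling.
print(frame.isna().sum().tolist())  # [1, 1]

# A dict maps each column name to its own fill value.
filled = frame.fillna({"qty": 0.0, "label": "unknown"})
print(filled["qty"].tolist())    # [1.0, 0.0, 3.0]
print(filled["label"].tolist())  # ['a', 'unknown', 'c']
```

fillna returns a new DataFrame here; frame itself is left unchanged.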
5.2 Adding a row
s = pd.DataFrame([[4, 5, 6]], columns=data.columns)
print(s)
print("------------------")
data = data.append(s)  # DataFrame.append: deprecated in pandas 1.4, removed in 2.0
print(data)
outputs:
a b c
0 4 5 6
------------------
a b c
0 1 2 10.0
1 3 4 5.0
0 4 5 6.0
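Problem encountered
FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
DataFrame.append still works in the pandas version used here, but it was removed in pandas 2.0, so use the pandas.concat function instead: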
data = pd.concat([data, s], axis=0)
print(data)
# axis=0 appends rows; axis=1 appends columns
outputs:
a b c
0 1 2 10.0
1 3 4 5.0
0 4 5 6.0
0 4 5 6.0
data.loc[0]
outputs:
| | a | b | c |
---|---|---|---|
0 | 1 | 2 | 10.0 |
0 | 4 | 5 | 6.0 |
0 | 4 | 5 | 6.0 |
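s has row label 0, so after appending it twice (once with append, once with concat) three rows share the label 0. Resetting the index makes the rows easier to work with later.
data = data.reset_index(drop=True)
print(data)
# drop=True discards the old row labels instead of keeping them as a column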
outputs:
a b c
0 1 2 10.0
1 3 4 5.0
2 4 5 6.0
3 4 5 6.0
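The duplicate labels can also be avoided at concat time by passing ignore_index=True, which renumbers the rows in the result. A minimal sketch that rebuilds a small frame shaped like the one above:

```python
import pandas as pd

base = pd.DataFrame({"a": [1, 3], "b": [2, 4], "c": [10.0, 5.0]})
row = pd.DataFrame([[4, 5, 6]], columns=base.columns)

# ignore_index=True discards the incoming labels and numbers the rows 0..n-1.
combined = pd.concat([base, row], ignore_index=True)

print(combined.index.tolist())  # [0, 1, 2]
print(combined["a"].tolist())   # [1, 3, 4]
```

This gives the same result as concat followed by reset_index(drop=True).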
5.3 Adding a column
s = pd.Series(["a", "b", "c"])
data = pd.concat([data, s], axis=1)
print(data)
# axis=1 appends a column; s has only three values, so row 3 is filled with NaN
outputs:
a b c 0
0 1 2 10.0 a
1 3 4 5.0 b
2 4 5 6.0 c
3 4 5 6.0 NaN
The new column is labeled 0; rename it:
data.rename(columns={0:'d'},inplace=True)
print(data)
outputs:
a b c d
0 1 2 10.0 a
1 3 4 5.0 b
2 4 5 6.0 c
3 4 5 6.0 NaN
s = pd.DataFrame([1.3, 5.4, 7.6, 9.8], columns=["e"])
print(s)
print("------------------------------")
data = pd.concat([data, s], axis=1)
print(data)
outputs:
e
0 1.3
1 5.4
2 7.6
3 9.8
------------------------------
a b c d e
0 1 2 10.0 a 1.3
1 3 4 5.0 b 5.4
2 4 5 6.0 c 7.6
3 4 5 6.0 NaN 9.8
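Besides concat, a column can be added by plain assignment, which sets the column name in the same step. A minimal sketch with illustrative data:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 3, 4], "b": [2, 4, 5]})

# Assign a list: its length must match the number of rows.
frame["d"] = ["x", "y", "z"]

# Assign an expression: the new column is computed from existing ones.
frame["total"] = frame["a"] + frame["b"]

print(frame.columns.tolist())   # ['a', 'b', 'd', 'total']
print(frame["total"].tolist())  # [3, 7, 9]
```

Unlike the concat approach, assigning a list of the wrong length raises an error instead of padding with NaN.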
5.4 Modifying values
data.loc[1, "a"] = 10
print(data)
# set the value in row 1, column a to 10
outputs:
a b c d e
0 1 2 10.0 a 1.3
1 10 4 5.0 b 5.4
2 4 5 6.0 c 7.6
3 4 5 6.0 NaN 9.8
data.loc[data["a"] == 4, "a"] = 9  # .loc with a boolean mask avoids chained assignment
print(data)
# replace every 4 in column a with 9
outputs:
a b c d e
0 1 2 10.0 a 1.3
1 10 4 5.0 b 5.4
2 9 5 6.0 c 7.6
3 9 5 6.0 NaN 9.8
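Value-based updates like the one above can also be written with Series.replace or Series.mask. The sketch below applies both to a small frame with made-up values.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 10, 4, 4], "b": [2, 4, 5, 5]})

# replace swaps one specific value for another.
frame["a"] = frame["a"].replace(4, 9)
print(frame["a"].tolist())  # [1, 10, 9, 9]

# mask overwrites the entries where the condition is True.
frame["b"] = frame["b"].mask(frame["b"] > 4, 0)
print(frame["b"].tolist())  # [2, 4, 0, 0]
```

Both methods return a new Series, so the result is assigned back to the column.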
5.5 Deleting a row
data = data.drop(3, axis=0)
print(data)
outputs:
a b c d e
0 1 2 10.0 a 1.3
1 10 4 5.0 b 5.4
2 9 5 6.0 c 7.6
5.6 Deleting a column
data = data.drop("e", axis=1)
print(data)
outputs:
a b c d
0 1 2 10.0 a
1 10 4 5.0 b
2 9 5 6.0 c
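drop also accepts the index and columns keywords, which name the axis explicitly and take lists of labels. A minimal sketch on a frame shaped like the one above:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 10, 9], "b": [2, 4, 5], "c": [10.0, 5.0, 6.0]})

# Drop rows 0 and 2 and columns b and c in a single call.
trimmed = frame.drop(index=[0, 2], columns=["b", "c"])

print(trimmed.shape)        # (1, 1)
print(trimmed.loc[1, "a"])  # 10
print(frame.shape)          # (3, 3), the original is unchanged
```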