Today I put together some notes on using DataFrame in pandas. If they help, a like or a follow is appreciated.
[可与Python] A Detailed Guide to DataFrame Usage
0. Setup
import pandas as pd
import numpy as np
1. Creating one-dimensional arrays
x=pd.Series([1,2,3,np.nan])
timelist = pd.date_range(start='20200101',end='20201231',freq='W')
print(timelist)
DatetimeIndex(['2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26',
'2020-02-02', '2020-02-09', '2020-02-16', '2020-02-23',
'2020-03-01', '2020-03-08', '2020-03-15', '2020-03-22',
'2020-03-29', '2020-04-05', '2020-04-12', '2020-04-19',
'2020-04-26', '2020-05-03', '2020-05-10', '2020-05-17',
'2020-05-24', '2020-05-31', '2020-06-07', '2020-06-14',
'2020-06-21', '2020-06-28', '2020-07-05', '2020-07-12',
'2020-07-19', '2020-07-26', '2020-08-02', '2020-08-09',
'2020-08-16', '2020-08-23', '2020-08-30', '2020-09-06',
'2020-09-13', '2020-09-20', '2020-09-27', '2020-10-04',
'2020-10-11', '2020-10-18', '2020-10-25', '2020-11-01',
'2020-11-08', '2020-11-15', '2020-11-22', '2020-11-29',
'2020-12-06', '2020-12-13', '2020-12-20', '2020-12-27'],
dtype='datetime64[ns]', freq='W-SUN')
2. Creating a DataFrame
a1=pd.DataFrame(np.random.randn(12,4),index=list(range(5,17)),columns=list('ABCD'))
print(a1)
a2=pd.DataFrame([np.random.randint(1,100,4) for i in range(12)],columns=list('ABCD'))
print(a2)
a3= pd.DataFrame({'A':np.random.randint(1,100,4),'B':pd.date_range(start='20130101',periods=4,freq='D'),
'C': pd.Series([1,2,3,4],dtype='float64'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(['test','train']*2),
'F':'f00'})
print(a3)
pd.set_option('display.max_rows',5)
pd.set_option('display.max_columns',5)
a3= pd.DataFrame({'A':np.random.randint(1,100,4),'B':pd.date_range(start='20130101',periods=4,freq='D'),
'C': pd.Series([1,2,3,4],dtype='float64'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(['test','train']*2),
'F':'f00'})
print(a3)
A B C D
5 0.704611 0.431369 -0.828452 0.169185
6 0.780436 0.938247 -1.656147 -0.243374
.. ... ... ... ...
15 0.333885 1.823517 0.483210 -0.536295
16 -0.424291 0.305974 0.587208 -0.639165
[12 rows x 4 columns]
A B C D
0 13 45 74 32
1 60 43 45 50
.. .. .. .. ..
10 61 98 95 7
11 74 8 24 44
[12 rows x 4 columns]
A B ... E F
0 56 2013-01-01 ... test f00
1 86 2013-01-02 ... train f00
2 12 2013-01-03 ... test f00
3 45 2013-01-04 ... train f00
[4 rows x 6 columns]
A B ... E F
0 71 2013-01-01 ... test f00
1 74 2013-01-02 ... train f00
2 13 2013-01-03 ... test f00
3 92 2013-01-04 ... train f00
[4 rows x 6 columns]
3. Inspecting two-dimensional data
3.1 Viewing rows
df= pd.DataFrame({'A':np.random.randint(1,100,4),'B':pd.date_range(start='20130101',periods=4,freq='D'),
'C': pd.Series([1,2,3,4],dtype='float64'),
'D':np.array([3]*4,dtype='int32'),
'E':pd.Categorical(['test','train']*2),
'F':'f00'})
df.head()
| A | B | ... | E | F |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
1 | 6 | 2013-01-02 | ... | train | f00 |
---|
2 | 27 | 2013-01-03 | ... | test | f00 |
---|
3 | 97 | 2013-01-04 | ... | train | f00 |
---|
4 rows × 6 columns
df.head(3)
| A | B | ... | E | F |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
1 | 6 | 2013-01-02 | ... | train | f00 |
---|
2 | 27 | 2013-01-03 | ... | test | f00 |
---|
3 rows × 6 columns
df.tail(2)
| A | B | ... | E | F |
---|
2 | 27 | 2013-01-03 | ... | test | f00 |
---|
3 | 97 | 2013-01-04 | ... | train | f00 |
---|
2 rows × 6 columns
3.2 Viewing the index, column names, and data
df.index
RangeIndex(start=0, stop=4, step=1)
df.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
df.values
array([[81, Timestamp('2013-01-01 00:00:00'), 1.0, 3, 'test', 'f00'],
[6, Timestamp('2013-01-02 00:00:00'), 2.0, 3, 'train', 'f00'],
[27, Timestamp('2013-01-03 00:00:00'), 3.0, 3, 'test', 'f00'],
[97, Timestamp('2013-01-04 00:00:00'), 4.0, 3, 'train', 'f00']],
dtype=object)
3.3 Viewing summary statistics
df.describe()
| A | C | D |
---|
count | 4.00 | 4.00 | 4.0 |
---|
mean | 52.75 | 2.50 | 3.0 |
---|
... | ... | ... | ... |
---|
75% | 85.00 | 3.25 | 3.0 |
---|
max | 97.00 | 4.00 | 3.0 |
---|
8 rows × 3 columns
4. Operating on two-dimensional data
4.1 Transposing
df.T
| 0 | 1 | 2 | 3 |
---|
A | 81 | 6 | 27 | 97 |
---|
B | 2013-01-01 00:00:00 | 2013-01-02 00:00:00 | 2013-01-03 00:00:00 | 2013-01-04 00:00:00 |
---|
... | ... | ... | ... | ... |
---|
E | test | train | test | train |
---|
F | f00 | f00 | f00 | f00 |
---|
6 rows × 4 columns
4.2 Sorting
df.sort_index(axis=0,ascending=False)
| A | B | ... | E | F |
---|
3 | 97 | 2013-01-04 | ... | train | f00 |
---|
2 | 27 | 2013-01-03 | ... | test | f00 |
---|
1 | 6 | 2013-01-02 | ... | train | f00 |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
4 rows × 6 columns
df.sort_index(axis=1,ascending=False)
| F | E | ... | B | A |
---|
0 | f00 | test | ... | 2013-01-01 | 81 |
---|
1 | f00 | train | ... | 2013-01-02 | 6 |
---|
2 | f00 | test | ... | 2013-01-03 | 27 |
---|
3 | f00 | train | ... | 2013-01-04 | 97 |
---|
4 rows × 6 columns
df.sort_values(by='A')
| A | B | ... | E | F |
---|
1 | 6 | 2013-01-02 | ... | train | f00 |
---|
2 | 27 | 2013-01-03 | ... | test | f00 |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
3 | 97 | 2013-01-04 | ... | train | f00 |
---|
4 rows × 6 columns
df.sort_values(by=['A','B'],ascending=[True,False])
| A | B | ... | E | F |
---|
1 | 6 | 2013-01-02 | ... | train | f00 |
---|
2 | 27 | 2013-01-03 | ... | test | f00 |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
3 | 97 | 2013-01-04 | ... | train | f00 |
---|
4 rows × 6 columns
4.3 Selecting data
4.3.1 Selecting columns
df.A
0 81
1 6
2 27
3 97
Name: A, dtype: int32
df['A']
0 81
1 6
2 27
3 97
Name: A, dtype: int32
6 in df['A']
False
6 in df['A'].values
True
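The two results above differ because `in` on a Series tests the index labels, not the stored values. Here is a minimal sketch of that behaviour, using a small hypothetical Series `s` (not the `df` above) so the values are fixed:

```python
import pandas as pd

# Hypothetical Series for illustration; its index labels are 0, 1, 2, 3.
s = pd.Series([81, 6, 27, 97])

label_hit = 3 in s                  # 3 is an index label
value_as_label = 6 in s             # 6 is a value, not an index label
value_hit = 6 in s.values           # membership test on the underlying array
isin_hit = bool(s.isin([6]).any())  # vectorised alternative that also tests values

print(label_hit, value_as_label, value_hit, isin_hit)
```

To test for a value, go through `.values` or `isin()`; a bare `in` is only the right tool when you mean the label.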
df.loc[:,['A','C']]
| A | C |
---|
0 | 81 | 1.0 |
---|
1 | 6 | 2.0 |
---|
2 | 27 | 3.0 |
---|
3 | 97 | 4.0 |
---|
4.3.2 Selecting rows
df[0:2]
| A | B | ... | E | F |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
1 | 6 | 2013-01-02 | ... | train | f00 |
---|
2 rows × 6 columns
df.iloc[3]
A 97
B 2013-01-04 00:00:00
...
E train
F f00
Name: 3, Length: 6, dtype: object
4.3.3 Selecting rows and columns
df.loc[[0,2],['A','B']]
| A | B |
---|
0 | 81 | 2013-01-01 |
---|
2 | 27 | 2013-01-03 |
---|
df.iloc[0:3,0:4]
| A | B | C | D |
---|
0 | 81 | 2013-01-01 | 1.0 | 3 |
---|
1 | 6 | 2013-01-02 | 2.0 | 3 |
---|
2 | 27 | 2013-01-03 | 3.0 | 3 |
---|
df.iloc[[1,3],[2,4]]
4.3.4 Looking up a single value
df.at[0,'A']
81
df.iloc[0,0]
81
4.3.5 Querying by condition
4.3.5.1 Simple queries
df[df.A>50]
| A | B | ... | E | F |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
3 | 97 | 2013-01-04 | ... | train | f00 |
---|
2 rows × 6 columns
df[df['E']=='test']
| A | B | ... | E | F |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
2 | 27 | 2013-01-03 | ... | test | f00 |
---|
2 rows × 6 columns
df[df['A'].isin([20,81])]
| A | B | ... | E | F |
---|
0 | 81 | 2013-01-01 | ... | test | f00 |
---|
1 rows × 6 columns
pd.set_option('display.max_columns',10)
df.nlargest(3,['C'])
| A | B | C | D | E | F |
---|
3 | 97 | 2013-01-04 | 4.0 | 3 | train | f00 |
---|
2 | 27 | 2013-01-03 | 3.0 | 3 | test | f00 |
---|
1 | 6 | 2013-01-02 | 2.0 | 3 | train | f00 |
---|
4.3.5.2 Querying by row sum
dff = pd.DataFrame({'A':[1,2,3,4],'B':[10,20,8,40]})
dff
dff[dff.sum(axis=1)==11]
5. Modifying data
5.1 Simple in-place modification
df
| A | B | C | D | E | F |
---|
0 | 81 | 2013-01-01 | 3.0 | 50 | test | f00 |
---|
1 | 6 | 2013-01-02 | 2.0 | 56 | train | f00 |
---|
2 | 27 | 2013-01-03 | 3.0 | 58 | test | f00 |
---|
3 | 97 | 2013-01-04 | 4.0 | 50 | train | f00 |
---|
df.iat[0,2]=3
df.loc[:,'D']=np.random.randint(50,60,4)
df['C']=-df['C']
df
| A | B | C | D | E | F |
---|
0 | 81 | 2013-01-01 | -3.0 | 58 | test | f00 |
---|
1 | 6 | 2013-01-02 | -2.0 | 58 | train | f00 |
---|
2 | 27 | 2013-01-03 | -3.0 | 55 | test | f00 |
---|
3 | 97 | 2013-01-04 | -4.0 | 54 | train | f00 |
---|
from copy import deepcopy
dff = deepcopy(df)
dff
| A | B | C | D | E | F |
---|
0 | 81 | 2013-01-01 | -3.0 | 58 | test | f00 |
---|
1 | 6 | 2013-01-02 | -2.0 | 58 | train | f00 |
---|
2 | 27 | 2013-01-03 | -3.0 | 55 | test | f00 |
---|
3 | 97 | 2013-01-04 | -4.0 | 54 | train | f00 |
---|
dff["C"]=dff['C']**2
dff
| A | B | C | D | E | F |
---|
0 | 81 | 2013-01-01 | 9.0 | 58 | test | f00 |
---|
1 | 6 | 2013-01-02 | 4.0 | 58 | train | f00 |
---|
2 | 27 | 2013-01-03 | 9.0 | 55 | test | f00 |
---|
3 | 97 | 2013-01-04 | 16.0 | 54 | train | f00 |
---|
dff.loc[dff.C==9,'D']=100
dff
| A | B | C | D | E | F |
---|
0 | 81 | 2013-01-01 | 9.0 | 100 | test | f00 |
---|
1 | 6 | 2013-01-02 | 4.0 | 58 | train | f00 |
---|
2 | 27 | 2013-01-03 | 9.0 | 100 | test | f00 |
---|
3 | 97 | 2013-01-04 | 16.0 | 54 | train | f00 |
---|
5.2 Replacing values
data = pd.DataFrame({"k1":['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
pd.set_option('display.max_rows',10)
data.replace(1,5)
| k1 | k2 |
---|
0 | one | 5 |
---|
1 | one | 5 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
data.replace([1,2],[5,6])
| k1 | k2 |
---|
0 | one | 5 |
---|
1 | one | 5 |
---|
2 | one | 6 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
data.replace({1:5,'one':'ONE'})
6. Deleting data
data = pd.DataFrame({"k1":['one']*3+['two']*4,'k2':[1,1,2,3,3,4,4]})
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
data.drop(5,axis=0)
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
6 | two | 4 |
---|
data.drop(3,inplace=True)
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
data.drop('k1',axis=1)
7. Adding data
data = pd.DataFrame({'姓名':['张三','李四','王五','赵六','刘七','孙八'],
'age':np.random.randint(20,50,6),
'成绩':[86,92,86,60,78,78]})
7.1 Adding rank columns
data['rank'] = data['age'].rank()
data
| 姓名 | age | 成绩 | rank |
---|
0 | 张三 | 27 | 86 | 1.0 |
---|
1 | 李四 | 42 | 92 | 3.0 |
---|
2 | 王五 | 36 | 86 | 2.0 |
---|
3 | 赵六 | 49 | 60 | 6.0 |
---|
4 | 刘七 | 48 | 78 | 5.0 |
---|
5 | 孙八 | 46 | 78 | 4.0 |
---|
data['倒数排名'] = data['成绩'].rank(method='min')
data
| 姓名 | age | 成绩 | rank | 倒数排名 |
---|
0 | 张三 | 27 | 86 | 1.0 | 4.0 |
---|
1 | 李四 | 42 | 92 | 3.0 | 6.0 |
---|
2 | 王五 | 36 | 86 | 2.0 | 4.0 |
---|
3 | 赵六 | 49 | 60 | 6.0 | 1.0 |
---|
4 | 刘七 | 48 | 78 | 5.0 | 2.0 |
---|
5 | 孙八 | 46 | 78 | 4.0 | 2.0 |
---|
data['正数排名'] = data['成绩'].rank(method='min',ascending=False)
data
| 姓名 | age | 成绩 | rank | 倒数排名 | 正数排名 |
---|
0 | 张三 | 27 | 86 | 1.0 | 4.0 | 2.0 |
---|
1 | 李四 | 42 | 92 | 3.0 | 6.0 | 1.0 |
---|
2 | 王五 | 36 | 86 | 2.0 | 4.0 | 2.0 |
---|
3 | 赵六 | 49 | 60 | 6.0 | 1.0 | 6.0 |
---|
4 | 刘七 | 48 | 78 | 5.0 | 2.0 | 4.0 |
---|
5 | 孙八 | 46 | 78 | 4.0 | 2.0 | 4.0 |
---|
data['正数排名2'] = data['成绩'].rank(method='max',ascending=False)
data
| 姓名 | age | 成绩 | rank | 倒数排名 | 正数排名 | 正数排名2 |
---|
0 | 张三 | 27 | 86 | 1.0 | 4.0 | 2.0 | 3.0 |
---|
1 | 李四 | 42 | 92 | 3.0 | 6.0 | 1.0 | 1.0 |
---|
2 | 王五 | 36 | 86 | 2.0 | 4.0 | 2.0 | 3.0 |
---|
3 | 赵六 | 49 | 60 | 6.0 | 1.0 | 6.0 | 6.0 |
---|
4 | 刘七 | 48 | 78 | 5.0 | 2.0 | 4.0 | 5.0 |
---|
5 | 孙八 | 46 | 78 | 4.0 | 2.0 | 4.0 | 5.0 |
---|
data['排名3'] = data['成绩'].rank(method='average')
data
| 姓名 | age | 成绩 | rank | 倒数排名 | 正数排名 | 正数排名2 | 排名3 |
---|
0 | 张三 | 27 | 86 | 1.0 | 4.0 | 2.0 | 3.0 | 4.5 |
---|
1 | 李四 | 42 | 92 | 3.0 | 6.0 | 1.0 | 1.0 | 6.0 |
---|
2 | 王五 | 36 | 86 | 2.0 | 4.0 | 2.0 | 3.0 | 4.5 |
---|
3 | 赵六 | 49 | 60 | 6.0 | 1.0 | 6.0 | 6.0 | 1.0 |
---|
4 | 刘七 | 48 | 78 | 5.0 | 2.0 | 4.0 | 5.0 | 2.5 |
---|
5 | 孙八 | 46 | 78 | 4.0 | 2.0 | 4.0 | 5.0 | 2.5 |
---|
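The `method` argument only matters when values tie. The sketch below reuses the score values from the table above and adds two methods the original does not show, `dense` and `first`, purely for comparison:

```python
import pandas as pd

# Same scores as the table above; the ties are 86/86 and 78/78.
scores = pd.Series([86, 92, 86, 60, 78, 78])

# 'min': tied values share the lowest rank of their group, leaving a gap after it.
min_desc = scores.rank(method='min', ascending=False).tolist()

# 'dense': like 'min', but ranks stay consecutive (no gaps after ties).
dense_desc = scores.rank(method='dense', ascending=False).tolist()

# 'first': ties are broken by order of appearance, so every rank is unique.
first_asc = scores.rank(method='first').tolist()

print(min_desc)
print(dense_desc)
print(first_asc)
```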
7.2 Row and column sums
dff = pd.DataFrame({'A':[1,2,3,4], 'B':[10,20,8,40]})
dff
dff['col_Sum']=dff.apply(sum,axis=1)
dff
| A | B | col_Sum |
---|
0 | 1 | 10 | 11 |
---|
1 | 2 | 20 | 22 |
---|
2 | 3 | 8 | 11 |
---|
3 | 4 | 40 | 44 |
---|
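API: DataFrame.apply(self, func, axis=0, raw=False, result_type=None, args=(), **kwds)
apply() is available on both DataFrame and Series objects. It is mainly used for aggregation, and makes it convenient to run existing or custom operations on groups of data.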
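To make the `axis` argument concrete, here is a small sketch with custom functions. The frame and the lambdas are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Hypothetical frame with the same values as dff above.
frame = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 8, 40]})

# axis=0 (the default): the function receives each column as a Series.
col_range = frame.apply(lambda col: col.max() - col.min())

# axis=1: the function receives each row as a Series.
row_total = frame.apply(lambda row: row['A'] + row['B'], axis=1)

print(col_range)
print(row_total)
```

For plain sums, `frame.sum(axis=0)` and `frame.sum(axis=1)` give the same numbers and are usually faster than going through `apply`.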
dff
| A | B | col_Sum |
---|
0 | 1 | 10 | 11 |
---|
1 | 2 | 20 | 22 |
---|
2 | 3 | 8 | 11 |
---|
3 | 4 | 40 | 44 |
---|
dff.loc['row_Sum'] = dff.apply(sum,axis=0)
dff
| A | B | col_Sum |
---|
0 | 1 | 10 | 11 |
---|
1 | 2 | 20 | 22 |
---|
2 | 3 | 8 | 11 |
---|
3 | 4 | 40 | 44 |
---|
row_Sum | 20 | 156 | 176 |
---|
8. Reordering and adding columns
da=pd.DataFrame({'A3':[2,3,4,5],'A1':[1,2,5,6]})
da
da.reindex(columns=['A1',"A2","A3",'A4'])
| A1 | A2 | A3 | A4 |
---|
0 | 1 | NaN | 2 | NaN |
---|
1 | 2 | NaN | 3 | NaN |
---|
2 | 5 | NaN | 4 | NaN |
---|
3 | 6 | NaN | 5 | NaN |
---|
pd.concat([da, pd.DataFrame(columns=list('DE'))])
| A3 | A1 | D | E |
---|
0 | 2.0 | 1.0 | NaN | NaN |
---|
1 | 3.0 | 2.0 | NaN | NaN |
---|
2 | 4.0 | 5.0 | NaN | NaN |
---|
3 | 5.0 | 6.0 | NaN | NaN |
---|
9. Handling missing values
df.index=['zhang','li','zhou','wang']
df
| A | B | C | D | E | F |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 |
---|
li | 6 | 2013-01-02 | -2.0 | 58 | train | f00 |
---|
zhou | 27 | 2013-01-03 | -3.0 | 55 | test | f00 |
---|
wang | 97 | 2013-01-04 | -4.0 | 54 | train | f00 |
---|
df1 = df.reindex(columns=list(df.columns)+['G'])
df1
| A | B | C | D | E | F | G |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 | NaN |
---|
li | 6 | 2013-01-02 | -2.0 | 58 | train | f00 | NaN |
---|
zhou | 27 | 2013-01-03 | -3.0 | 55 | test | f00 | NaN |
---|
wang | 97 | 2013-01-04 | -4.0 | 54 | train | f00 | NaN |
---|
df1.iat[0,6] = 3
9.1 Testing for missing values
pd.isnull(df1)
| A | B | C | D | E | F | G |
---|
zhang | False | False | False | False | False | False | False |
---|
li | False | False | False | False | False | False | True |
---|
zhou | False | False | False | False | False | False | True |
---|
wang | False | False | False | False | False | False | True |
---|
df1.dropna()
| A | B | C | D | E | F | G |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 | 3.0 |
---|
9.2 Filling with a specified value
df2 = deepcopy(df1)
df1['G'] = df1['G'].fillna(5)
df1
| A | B | C | D | E | F | G |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 | 3.0 |
---|
li | 6 | 2013-01-02 | -2.0 | 58 | train | f00 | 5.0 |
---|
zhou | 27 | 2013-01-03 | -3.0 | 55 | test | f00 | 5.0 |
---|
wang | 97 | 2013-01-04 | -4.0 | 54 | train | f00 | 5.0 |
---|
df2.iat[2,5] = np.nan
df2
| A | B | C | D | E | F | G |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 | 3.0 |
---|
li | 6 | 2013-01-02 | -2.0 | 58 | train | f00 | NaN |
---|
zhou | 27 | 2013-01-03 | -3.0 | 55 | test | NaN | NaN |
---|
wang | 97 | 2013-01-04 | -4.0 | 54 | train | f00 | NaN |
---|
df2.dropna(thresh=6)
| A | B | C | D | E | F | G |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 | 3.0 |
---|
li | 6 | 2013-01-02 | -2.0 | 58 | train | f00 | NaN |
---|
wang | 97 | 2013-01-04 | -4.0 | 54 | train | f00 | NaN |
---|
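`thresh` is the minimum number of non-missing values a row needs in order to be kept, which is why only the `zhou` row (5 non-missing values out of 7) is dropped above. A minimal sketch on a hypothetical three-column frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, row 1 has one NaN, row 2 has two NaNs.
frame = pd.DataFrame({'a': [1, 2, np.nan],
                      'b': [1, np.nan, np.nan],
                      'c': [1, 2, 3]})

kept_complete = frame.dropna()           # keep rows with no missing values
kept_thresh = frame.dropna(thresh=2)     # keep rows with at least 2 non-missing values
kept_how_all = frame.dropna(how='all')   # drop only rows where every value is missing

print(kept_complete.index.tolist())
print(kept_thresh.index.tolist())
print(kept_how_all.index.tolist())
```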
df2.iat[3, 6] = 8
df2
| A | B | C | D | E | F | G |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 | 3.0 |
---|
li | 6 | 2013-01-02 | -2.0 | 58 | train | f00 | NaN |
---|
zhou | 27 | 2013-01-03 | -3.0 | 55 | test | NaN | NaN |
---|
wang | 97 | 2013-01-04 | -4.0 | 54 | train | f00 | 8.0 |
---|
9.3 Filling with the mean
df2.fillna({'F':'foo', 'G':df2['G'].mean()})
| A | B | C | D | E | F | G |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 | 3.0 |
---|
li | 6 | 2013-01-02 | -2.0 | 58 | train | f00 | 5.5 |
---|
zhou | 27 | 2013-01-03 | -3.0 | 55 | test | foo | 5.5 |
---|
wang | 97 | 2013-01-04 | -4.0 | 54 | train | f00 | 8.0 |
---|
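9.4 Filling from the previous or next value
dft = pd.DataFrame({'a':[1,np.nan, np.nan,3]})
dft
dft.ffill()          # forward fill; replaces fillna(method='pad'), deprecated since pandas 2.1
dft.bfill()          # backward fill; replaces fillna(method='bfill')
dft.bfill(limit=1)   # backward fill at most one consecutive NaN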
10. Handling duplicates
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,
'k2':[1, 1, 2, 3, 3, 4, 4]})
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
data.duplicated()
0 False
1 True
2 False
3 False
4 True
5 False
6 True
dtype: bool
data.drop_duplicates()
data.drop_duplicates(['k1'])
data.drop_duplicates(['k1'], keep='last')
data = pd.Series([3,3,3,2,1,1,1,0])
data.drop_duplicates(keep=False)
3 2
7 0
dtype: int64
11. Handling outliers
data = pd.DataFrame(np.random.randn(500, 4))
data.describe()
| 0 | 1 | 2 | 3 |
---|
count | 500.000000 | 500.000000 | 500.000000 | 500.000000 |
---|
mean | 0.000003 | -0.055441 | -0.144464 | 0.115541 |
---|
std | 1.040113 | 1.050816 | 1.024364 | 0.993715 |
---|
min | -2.682324 | -3.093345 | -3.251424 | -2.436486 |
---|
25% | -0.714944 | -0.764660 | -0.833083 | -0.571401 |
---|
50% | 0.000479 | -0.094025 | -0.174163 | 0.085931 |
---|
75% | 0.766728 | 0.633535 | 0.580219 | 0.798915 |
---|
max | 2.868074 | 3.444375 | 3.026216 | 3.491998 |
---|
col2 = data[2]
col2[col2>2.5]
454 3.026216
Name: 2, dtype: float64
data[(data>3).any(axis=1)]
| 0 | 1 | 2 | 3 |
---|
9 | 0.215147 | 0.142431 | 0.975679 | 3.491998 |
---|
318 | -1.018064 | 3.367820 | -0.875176 | -0.007749 |
---|
350 | -0.807584 | 3.444375 | 1.055223 | 0.759119 |
---|
454 | -0.260401 | 0.478256 | 3.026216 | -1.251252 |
---|
456 | 0.951752 | -0.646577 | -1.006280 | 3.082105 |
---|
data[np.abs(data)>2.5] = np.sign(data) * 2.5
data.describe()
| 0 | 1 | 2 | 3 |
---|
count | 500.000000 | 500.000000 | 500.000000 | 500.000000 |
---|
mean | -0.000456 | -0.056737 | -0.143212 | 0.111439 |
---|
std | 1.034136 | 1.032042 | 1.015410 | 0.982327 |
---|
min | -2.500000 | -2.500000 | -2.500000 | -2.436486 |
---|
25% | -0.714944 | -0.764660 | -0.833083 | -0.571401 |
---|
50% | 0.000479 | -0.094025 | -0.174163 | 0.085931 |
---|
75% | 0.766728 | 0.633535 | 0.580219 | 0.798915 |
---|
max | 2.500000 | 2.500000 | 2.500000 | 2.500000 |
---|
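About the sign() function
sign() is the NumPy function that returns the sign of a number: -1 for negative values, 0 for zero, and 1 for positive values.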
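A short sketch of `np.sign` on a hypothetical array, including the same capping idea as `data[np.abs(data)>2.5] = np.sign(data) * 2.5` above. It is written with `np.where` so it returns a new array instead of modifying in place:

```python
import numpy as np

# Hypothetical values for illustration.
values = np.array([-3.2, 0.0, 1.7, 4.0])

# -1 for negatives, 0 for zero, 1 for positives.
signs = np.sign(values)

# Cap the magnitude at 2.5 while keeping each value's sign.
capped = np.where(np.abs(values) > 2.5, np.sign(values) * 2.5, values)

print(signs)
print(capped)
```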
12. Mapping
data = pd.DataFrame({'k1':['one'] * 3 + ['two'] * 4,
'k2':[1, 1, 2, 3, 3, 4, 4]})
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
12.1 Changing values with map
data['k1'] = data['k1'].map(str.upper)
data
| k1 | k2 |
---|
0 | ONE | 1 |
---|
1 | ONE | 1 |
---|
2 | ONE | 2 |
---|
3 | TWO | 3 |
---|
4 | TWO | 3 |
---|
5 | TWO | 4 |
---|
6 | TWO | 4 |
---|
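map() is a Series method; DataFrame had no map() until pandas 2.1 added DataFrame.map as the element-wise replacement for applymap(). Series.map() applies a function (or a dict lookup) to every element of the Series.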
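One detail worth seeing in code: when `map()` is given a dict, any element without a matching key becomes NaN rather than being left unchanged. A minimal sketch on a hypothetical Series:

```python
import pandas as pd

# Hypothetical Series; 'THREE' deliberately has no entry in the dict below.
s = pd.Series(['ONE', 'TWO', 'THREE'])

lowered = s.map(str.lower)                       # function applied element by element
looked_up = s.map({'ONE': 'one', 'TWO': 'two'})  # dict lookup; missing keys give NaN

print(lowered.tolist())
print(looked_up.isna().tolist())
```

If unmatched values should stay as they are, `s.replace({...})` is the closer fit.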
data['k1'] = data['k1'].map({'ONE':'one','TWO':'two'})
data
| k1 | k2 |
---|
0 | one | 1 |
---|
1 | one | 1 |
---|
2 | one | 2 |
---|
3 | two | 3 |
---|
4 | two | 3 |
---|
5 | two | 4 |
---|
6 | two | 4 |
---|
data['k2'] = data['k2'].map(lambda x: x+5)
data
| k1 | k2 |
---|
0 | one | 6 |
---|
1 | one | 6 |
---|
2 | one | 7 |
---|
3 | two | 8 |
---|
4 | two | 8 |
---|
5 | two | 9 |
---|
6 | two | 9 |
---|
12.2 Changing index and column labels with map
data.index = data.index.map(lambda x:x+5)
data
| k1 | k2 |
---|
5 | one | 6 |
---|
6 | one | 6 |
---|
7 | one | 7 |
---|
8 | two | 8 |
---|
9 | two | 8 |
---|
10 | two | 9 |
---|
11 | two | 9 |
---|
data.columns = data.columns.map(str.upper)
data
| K1 | K2 |
---|
5 | one | 6 |
---|
6 | one | 6 |
---|
7 | one | 7 |
---|
8 | two | 8 |
---|
9 | two | 8 |
---|
10 | two | 9 |
---|
11 | two | 9 |
---|
data.rename(index=lambda x:x+5 , columns= str.lower ,inplace=True)
data
| k1 | k2 |
---|
10 | one | 6 |
---|
11 | one | 6 |
---|
12 | one | 7 |
---|
13 | two | 8 |
---|
14 | two | 8 |
---|
15 | two | 9 |
---|
16 | two | 9 |
---|
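The pandas rename() method renames index labels or column labels. Columns can also be renamed by assigning dataframe.columns = [...], but that approach is inflexible: even to change a single column you have to pass the complete list of column names. It also does not apply to index labels.
Usage: DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None)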
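To illustrate renaming just one label, here is a sketch that passes dicts instead of functions. The frame and the new names `count` and `first` are made up for the example:

```python
import pandas as pd

# Hypothetical frame for illustration.
frame = pd.DataFrame({'k1': ['one', 'two'], 'k2': [1, 2]})

# Only the labels named in the dicts change; everything else is kept.
renamed = frame.rename(columns={'k2': 'count'}, index={0: 'first'})

print(renamed.columns.tolist())
print(renamed.index.tolist())
print(frame.columns.tolist())   # the original is untouched without inplace=True
```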
12.3 Computing deviations from the mean with apply
dff = pd.DataFrame({'A':[1,2,3,4], 'B':[10,20,8,40]})
dff
dff.apply(lambda x:x-x.mean(),axis=1)
| A | B |
---|
0 | -4.5 | 4.5 |
---|
1 | -9.0 | 9.0 |
---|
2 | -2.5 | 2.5 |
---|
3 | -18.0 | 18.0 |
---|
dff.apply(lambda x:x-x.mean(),axis=0)
| A | B |
---|
0 | -1.5 | -9.5 |
---|
1 | -0.5 | 0.5 |
---|
2 | 0.5 | -11.5 |
---|
3 | 1.5 | 20.5 |
---|
12.4 Formatting values in bulk
dff.applymap(lambda x: '%.1f'%x)
| A | B |
---|
0 | 1.0 | 10.0 |
---|
1 | 2.0 | 20.0 |
---|
2 | 3.0 | 8.0 |
---|
3 | 4.0 | 40.0 |
---|
dff.B=dff.B.map(lambda x: '%.0f'%x)
dff
13. Discretizing data
13.1 Binning into intervals
data13 = np.random.randint(0,100,10)
data13
array([33, 92, 42, 42, 57, 99, 4, 11, 57, 94])
category = [0,25,50,100]
pd.cut(data13,category)
[(25, 50], (50, 100], (25, 50], (25, 50], (50, 100], (50, 100], (0, 25], (0, 25], (50, 100], (50, 100]]
Categories (3, interval[int64]): [(0, 25] < (25, 50] < (50, 100]]
pd.cut(data13, category, right=False)
[[25, 50), [50, 100), [25, 50), [25, 50), [50, 100), [50, 100), [0, 25), [0, 25), [50, 100), [50, 100)]
Categories (3, interval[int64]): [[0, 25) < [25, 50) < [50, 100)]
labels = ['low', 'middle', 'high']
pd.cut(data13, category, right=False, labels=labels)
[middle, high, middle, middle, high, high, low, low, high, high]
Categories (3, object): [low < middle < high]
pd.cut(data13, 4)
[(27.75, 51.5], (75.25, 99.0], (27.75, 51.5], (27.75, 51.5], (51.5, 75.25], (75.25, 99.0], (3.905, 27.75], (3.905, 27.75], (51.5, 75.25], (75.25, 99.0]]
Categories (4, interval[float64]): [(3.905, 27.75] < (27.75, 51.5] < (51.5, 75.25] < (75.25, 99.0]]
pd.qcut(data13, 4)
[(3.999, 35.25], (83.25, 99.0], (35.25, 49.5], (35.25, 49.5], (49.5, 83.25], (83.25, 99.0], (3.999, 35.25], (3.999, 35.25], (49.5, 83.25], (83.25, 99.0]]
Categories (4, interval[float64]): [(3.999, 35.25] < (35.25, 49.5] < (49.5, 83.25] < (83.25, 99.0]]
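The last two calls differ in how the bin edges are chosen: `pd.cut(x, 4)` splits the value range into 4 equal-width bins, while `pd.qcut(x, 4)` picks quantile edges so each bin holds about the same number of points. A sketch with a hypothetical skewed array makes the contrast obvious (`labels=False` returns bin numbers instead of intervals):

```python
import numpy as np
import pandas as pd

# Hypothetical data with one large outlier.
values = np.array([1, 2, 3, 4, 5, 6, 7, 100])

width_bins = pd.cut(values, 4, labels=False)    # equal-width bins over [1, 100]
count_bins = pd.qcut(values, 4, labels=False)   # quartile-based bins

# Number of points that land in each of the 4 bins.
width_counts = np.bincount(width_bins, minlength=4).tolist()
count_counts = np.bincount(count_bins, minlength=4).tolist()

print(width_counts)
print(count_counts)
```

With equal-width bins the outlier leaves the middle bins empty; with quantile bins every bin gets two points.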
14. Frequency counts and shifting
pd.value_counts([1,1,3,3,3,3,2,1])
3 4
1 3
2 1
dtype: int64
pd.value_counts([1,1,3,3,3,3,2,1], sort=False)
1 3
2 1
3 4
dtype: int64
pd.value_counts([1,1,3,3,3,3,2,1], ascending=True)
2 1
1 3
3 4
dtype: int64
df1.shift(1)
| A | B | C | D | E | F | G |
---|
zhang | NaN | NaT | NaN | NaN | NaN | NaN | NaN |
---|
li | 81.0 | 2013-01-01 | -3.0 | 58.0 | test | f00 | 3.0 |
---|
zhou | 6.0 | 2013-01-02 | -2.0 | 58.0 | train | f00 | 5.0 |
---|
wang | 27.0 | 2013-01-03 | -3.0 | 55.0 | test | f00 | 5.0 |
---|
df1
| A | B | C | D | E | F | G |
---|
zhang | 81 | 2013-01-01 | -3.0 | 58 | test | f00 | 3.0 |
---|
li | 6 | 2013-01-02 | -2.0 | 58 | train | f00 | 5.0 |
---|
zhou | 27 | 2013-01-03 | -3.0 | 55 | test | f00 | 5.0 |
---|
wang | 97 | 2013-01-04 | -4.0 | 54 | train | f00 | 5.0 |
---|
df1['D'].value_counts()
58 2
55 1
54 1
Name: D, dtype: int64
15. Splitting and merging/concatenating
df14 = pd.DataFrame(np.random.randn(10, 4))
df14
| 0 | 1 | 2 | 3 |
---|
0 | 1.356177 | -1.594548 | -0.744250 | -0.561444 |
---|
1 | -0.359151 | -0.638286 | -0.297279 | 0.598555 |
---|
2 | -1.528712 | -0.893813 | -1.922199 | 0.444479 |
---|
3 | -0.159882 | -0.082586 | 0.467607 | -2.322559 |
---|
4 | 0.696892 | -0.015499 | -0.587565 | -0.612759 |
---|
5 | 1.539294 | 0.220061 | 1.313642 | 0.621169 |
---|
6 | -1.078099 | -0.965604 | 0.580988 | 1.752221 |
---|
7 | -0.557355 | -0.440118 | 0.874022 | 0.910304 |
---|
8 | 2.523452 | -0.010282 | -0.078567 | -0.588690 |
---|
9 | 0.177979 | 0.432169 | -0.433072 | 1.379981 |
---|
p1 = df14[:3]
p1
| 0 | 1 | 2 | 3 |
---|
0 | 1.356177 | -1.594548 | -0.744250 | -0.561444 |
---|
1 | -0.359151 | -0.638286 | -0.297279 | 0.598555 |
---|
2 | -1.528712 | -0.893813 | -1.922199 | 0.444479 |
---|
p2 = df14[3:7]
p2
| 0 | 1 | 2 | 3 |
---|
3 | -0.159882 | -0.082586 | 0.467607 | -2.322559 |
---|
4 | 0.696892 | -0.015499 | -0.587565 | -0.612759 |
---|
5 | 1.539294 | 0.220061 | 1.313642 | 0.621169 |
---|
6 | -1.078099 | -0.965604 | 0.580988 | 1.752221 |
---|
p3 = df14[7:]
df14_ = pd.concat([p1, p2, p3])
df14_ == df14
| 0 | 1 | 2 | 3 |
---|
0 | True | True | True | True |
---|
1 | True | True | True | True |
---|
2 | True | True | True | True |
---|
3 | True | True | True | True |
---|
4 | True | True | True | True |
---|
5 | True | True | True | True |
---|
6 | True | True | True | True |
---|
7 | True | True | True | True |
---|
8 | True | True | True | True |
---|
9 | True | True | True | True |
---|
16. Grouped computations
df4 = pd.DataFrame({'A':np.random.randint(1,5,8),
'B':np.random.randint(10,15,8),
'C':np.random.randint(20,30,8),
'D':np.random.randint(80,100,8)})
df4
| A | B | C | D |
---|
0 | 3 | 13 | 21 | 86 |
---|
1 | 4 | 12 | 27 | 85 |
---|
2 | 4 | 12 | 25 | 95 |
---|
3 | 4 | 11 | 26 | 98 |
---|
4 | 1 | 11 | 20 | 85 |
---|
5 | 4 | 14 | 27 | 92 |
---|
6 | 2 | 10 | 27 | 82 |
---|
7 | 2 | 14 | 27 | 88 |
---|
df4.groupby('A').sum()
| B | C | D |
---|
A | | | |
---|
1 | 11 | 20 | 85 |
---|
2 | 24 | 54 | 170 |
---|
3 | 13 | 21 | 86 |
---|
4 | 49 | 105 | 370 |
---|
df4.groupby(by=['A', 'B']).mean()
| | C | D |
---|
A | B | | |
---|
1 | 11 | 20 | 85 |
---|
2 | 10 | 27 | 82 |
---|
14 | 27 | 88 |
---|
3 | 13 | 21 | 86 |
---|
4 | 11 | 26 | 98 |
---|
12 | 26 | 90 |
---|
14 | 27 | 92 |
---|
df4.groupby(by=['A','B'],as_index=False).mean()
| A | B | C | D |
---|
0 | 1 | 11 | 20 | 85 |
---|
1 | 2 | 10 | 27 | 82 |
---|
2 | 2 | 14 | 27 | 88 |
---|
3 | 3 | 13 | 21 | 86 |
---|
4 | 4 | 11 | 26 | 98 |
---|
5 | 4 | 12 | 26 | 90 |
---|
6 | 4 | 14 | 27 | 92 |
---|
df4.groupby(by=['A', 'B']).aggregate({'C':np.mean, 'D':np.min})
| | C | D |
---|
A | B | | |
---|
1 | 11 | 20 | 85 |
---|
2 | 10 | 27 | 82 |
---|
14 | 27 | 88 |
---|
3 | 13 | 21 | 86 |
---|
4 | 11 | 26 | 98 |
---|
12 | 26 | 85 |
---|
14 | 27 | 92 |
---|
17. Dummy-variable / indicator matrices
df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data':[3,4,5,6,7,8]})
df
| key | data |
---|
0 | b | 3 |
---|
1 | b | 4 |
---|
2 | a | 5 |
---|
3 | c | 6 |
---|
4 | a | 7 |
---|
5 | b | 8 |
---|
pd.get_dummies(df)
| data | key_a | key_b | key_c |
---|
0 | 3 | 0 | 1 | 0 |
---|
1 | 4 | 0 | 1 | 0 |
---|
2 | 5 | 1 | 0 | 0 |
---|
3 | 6 | 0 | 0 | 1 |
---|
4 | 7 | 1 | 0 | 0 |
---|
5 | 8 | 0 | 1 | 0 |
---|
pd.get_dummies(df['key'])
| a | b | c |
---|
0 | 0 | 1 | 0 |
---|
1 | 0 | 1 | 0 |
---|
2 | 1 | 0 | 0 |
---|
3 | 0 | 0 | 1 |
---|
4 | 1 | 0 | 0 |
---|
5 | 0 | 1 | 0 |
---|
dummies = pd.get_dummies(df['key'], prefix='key')
dummies
| key_a | key_b | key_c |
---|
0 | 0 | 1 | 0 |
---|
1 | 0 | 1 | 0 |
---|
2 | 1 | 0 | 0 |
---|
3 | 0 | 0 | 1 |
---|
4 | 1 | 0 | 0 |
---|
5 | 0 | 1 | 0 |
---|
df[['data']].join(dummies)
| data | key_a | key_b | key_c |
---|
0 | 3 | 0 | 1 | 0 |
---|
1 | 4 | 0 | 1 | 0 |
---|
2 | 5 | 1 | 0 | 0 |
---|
3 | 6 | 0 | 0 | 1 |
---|
4 | 7 | 1 | 0 | 0 |
---|
5 | 8 | 0 | 1 | 0 |
---|
18. Pivoting and cross-tabulation (contingency tables)
df = pd.DataFrame({'a':[1,2,3,4],
'b':[2,3,4,5],
'c':[3,4,5,6],
'd':[3,3,3,3]})
df
18.1 Pivot tables
df.pivot(index='a', columns='b', values='c')
b | 2 | 3 | 4 | 5 |
---|
a | | | | |
---|
1 | 3.0 | NaN | NaN | NaN |
---|
2 | NaN | 4.0 | NaN | NaN |
---|
3 | NaN | NaN | 5.0 | NaN |
---|
4 | NaN | NaN | NaN | 6.0 |
---|
df.pivot(index='a', columns='b', values='d')
b | 2 | 3 | 4 | 5 |
---|
a | | | | |
---|
1 | 3.0 | NaN | NaN | NaN |
---|
2 | NaN | 3.0 | NaN | NaN |
---|
3 | NaN | NaN | 3.0 | NaN |
---|
4 | NaN | NaN | NaN | 3.0 |
---|
df.pivot(index='a', columns='b')
| c | d |
---|
b | 2 | 3 | 4 | 5 | 2 | 3 | 4 | 5 |
---|
a | | | | | | | | |
---|
1 | 3.0 | NaN | NaN | NaN | 3.0 | NaN | NaN | NaN |
---|
2 | NaN | 4.0 | NaN | NaN | NaN | 3.0 | NaN | NaN |
---|
3 | NaN | NaN | 5.0 | NaN | NaN | NaN | 3.0 | NaN |
---|
4 | NaN | NaN | NaN | 6.0 | NaN | NaN | NaN | 3.0 |
---|
df.pivot(index='a', columns='b')['c']
b | 2 | 3 | 4 | 5 |
---|
a | | | | |
---|
1 | 3.0 | NaN | NaN | NaN |
---|
2 | NaN | 4.0 | NaN | NaN |
---|
3 | NaN | NaN | 5.0 | NaN |
---|
4 | NaN | NaN | NaN | 6.0 |
---|
18.2 Cross-tabulation
pd.crosstab(index=df.a, columns=df.b)
b | 2 | 3 | 4 | 5 |
---|
a | | | | |
---|
1 | 1 | 0 | 0 | 0 |
---|
2 | 0 | 1 | 0 | 0 |
---|
3 | 0 | 0 | 1 | 0 |
---|
4 | 0 | 0 | 0 | 1 |
---|
pd.crosstab(index=df.a, columns=df.b, margins=True)
b | 2 | 3 | 4 | 5 | All |
---|
a | | | | | |
---|
1 | 1 | 0 | 0 | 0 | 1 |
---|
2 | 0 | 1 | 0 | 0 | 1 |
---|
3 | 0 | 0 | 1 | 0 | 1 |
---|
4 | 0 | 0 | 0 | 1 | 1 |
---|
All | 1 | 1 | 1 | 1 | 4 |
---|
pd.crosstab(index=df.a, columns=df.b, values=df.c, aggfunc='sum', margins=True)
b | 2 | 3 | 4 | 5 | All |
---|
a | | | | | |
---|
1 | 3.0 | NaN | NaN | NaN | 3 |
---|
2 | NaN | 4.0 | NaN | NaN | 4 |
---|
3 | NaN | NaN | 5.0 | NaN | 5 |
---|
4 | NaN | NaN | NaN | 6.0 | 6 |
---|
All | 3.0 | 4.0 | 5.0 | 6.0 | 18 |
---|
pd.crosstab(index=df.a, columns=df.b, values=df.c, aggfunc='mean', margins=True)
b | 2 | 3 | 4 | 5 | All |
---|
a | | | | | |
---|
1 | 3.0 | NaN | NaN | NaN | 3.0 |
---|
2 | NaN | 4.0 | NaN | NaN | 4.0 |
---|
3 | NaN | NaN | 5.0 | NaN | 5.0 |
---|
4 | NaN | NaN | NaN | 6.0 | 6.0 |
---|
All | 3.0 | 4.0 | 5.0 | 6.0 | 4.5 |
---|
19. Differencing
df = pd.DataFrame({'a':np.random.randint(1, 100, 10),
'b':np.random.randint(1, 100, 10)},
index=map(str, range(10)))
df
| a | b |
---|
0 | 14 | 63 |
---|
1 | 20 | 60 |
---|
2 | 89 | 12 |
---|
3 | 28 | 78 |
---|
4 | 21 | 99 |
---|
5 | 27 | 41 |
---|
6 | 83 | 64 |
---|
7 | 91 | 94 |
---|
8 | 60 | 51 |
---|
9 | 69 | 29 |
---|
df.diff()
| a | b |
---|
0 | NaN | NaN |
---|
1 | 6.0 | -3.0 |
---|
2 | 69.0 | -48.0 |
---|
3 | -61.0 | 66.0 |
---|
4 | -7.0 | 21.0 |
---|
5 | 6.0 | -58.0 |
---|
6 | 56.0 | 23.0 |
---|
7 | 8.0 | 30.0 |
---|
8 | -31.0 | -43.0 |
---|
9 | 9.0 | -22.0 |
---|
df.diff(axis=1)
| a | b |
---|
0 | NaN | 49.0 |
---|
1 | NaN | 40.0 |
---|
2 | NaN | -77.0 |
---|
3 | NaN | 50.0 |
---|
4 | NaN | 78.0 |
---|
5 | NaN | 14.0 |
---|
6 | NaN | -19.0 |
---|
7 | NaN | 3.0 |
---|
8 | NaN | -9.0 |
---|
9 | NaN | -40.0 |
---|
df.diff(periods=2)
| a | b |
---|
0 | NaN | NaN |
---|
1 | NaN | NaN |
---|
2 | 75.0 | -51.0 |
---|
3 | 8.0 | 18.0 |
---|
4 | -68.0 | 87.0 |
---|
5 | -1.0 | -37.0 |
---|
6 | 62.0 | -35.0 |
---|
7 | 64.0 | 53.0 |
---|
8 | -23.0 | -13.0 |
---|
9 | -22.0 | -65.0 |
---|
20. Correlation coefficients
df = pd.DataFrame({'A':np.random.randint(1, 100, 10),
'B':np.random.randint(1, 100, 10),
'C':np.random.randint(1, 100, 10)})
df
df.corr()
| A | B | C |
---|
A | 1.000000 | -0.360141 | 0.232945 |
---|
B | -0.360141 | 1.000000 | -0.476603 |
---|
C | 0.232945 | -0.476603 | 1.000000 |
---|
df.corr('kendall')
| A | B | C |
---|
A | 1.000000 | -0.340909 | 0.179787 |
---|
B | -0.340909 | 1.000000 | -0.269680 |
---|
C | 0.179787 | -0.269680 | 1.000000 |
---|
df.corr('spearman')
| A | B | C |
---|
A | 1.000000 | -0.503049 | 0.255320 |
---|
B | -0.503049 | 1.000000 | -0.468087 |
---|
C | 0.255320 | -0.468087 | 1.000000 |
---|
21. Plotting with matplotlib
import matplotlib.pyplot as plt
df = pd.DataFrame(np.random.randn(1000, 2), columns=['B', 'C']).cumsum()
df
| B | C |
---|
0 | -1.434415 | -1.363257 |
---|
1 | -1.988009 | -2.407375 |
---|
2 | -1.722312 | -3.656220 |
---|
3 | -1.622777 | -2.629169 |
---|
4 | -3.002289 | -2.623288 |
---|
... | ... | ... |
---|
995 | -39.633091 | -14.478918 |
---|
996 | -40.224034 | -14.200355 |
---|
997 | -40.242355 | -14.070692 |
---|
998 | -40.567508 | -12.884639 |
---|
999 | -40.407534 | -13.452417 |
---|
1000 rows × 2 columns
df['A'] = pd.Series(list(range(len(df))))
plt.figure()
df.plot(x='A')
plt.show()
<Figure size 432x288 with 0 Axes>
![Line plot of B and C against A](https://img-blog.csdnimg.cn/8cf42f18ca9f43acab91f5b4369e37dd.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA5Y-v5LiO5b6Ib2s=,size_12,color_FFFFFF,t_70,g_se,x_16#pic_center)
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot(kind='bar')
plt.show()
![](https://img-blog.csdnimg.cn/61fa1b37e94f4242952da842838805b8.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA5Y-v5LiO5b6Ib2s=,size_12,color_FFFFFF,t_70,g_se,x_16#pic_center)
df = pd.DataFrame({'height':[180,170,172,183,179,178,160],
'weight':[85,80,85,75,78,78,70]})
df.plot(x='height', y='weight', kind='scatter',
marker='*', s=60, label='height-weight')
plt.show()
![Scatter plot of weight against height](https://img-blog.csdnimg.cn/5987cb909f12488ba937540c56bee545.png?x-oss-process=image/watermark,type_ZHJvaWRzYW5zZmFsbGJhY2s,shadow_50,text_Q1NETiBA5Y-v5LiO5b6Ib2s=,size_13,color_FFFFFF,t_70,g_se,x_16#pic_center)
df['weight'].plot(kind='pie', autopct='%.2f%%',
labels=df['weight'].values,
shadow=True)
plt.show()
![Pie chart of the weight column](https://img-blog.csdnimg.cn/a653b45508b44968817c5b80bb5bf883.png#pic_center)
22. Reading and writing files
df.to_excel('pd_to_xlsx.xlsx', sheet_name='dfg')
df = pd.read_excel('pd_to_xlsx.xlsx', 'dfg', index_col=None, na_values=['NA'])
df3 = pd.read_excel('pd_to_xlsx.xlsx', 'dfg',skiprows=3)
df3
| 2 | 172 | 85 |
---|
0 | 3 | 183 | 75 |
---|
1 | 4 | 179 | 78 |
---|
2 | 5 | 178 | 78 |
---|
3 | 6 | 160 | 70 |
---|
df44 = pd.read_excel('pd_to_xlsx.xlsx', 'dfg',skiprows=[2,4])
df44
| Unnamed: 0 | height | weight |
---|
0 | 0 | 180 | 85 |
---|
1 | 2 | 172 | 85 |
---|
2 | 4 | 179 | 78 |
---|
3 | 5 | 178 | 78 |
---|
4 | 6 | 160 | 70 |
---|
df.to_csv('df_2_csv.csv')
dfdd = pd.read_csv('df_2_csv.csv')
dfdd
| Unnamed: 0 | Unnamed: 0.1 | height | weight |
---|
0 | 0 | 0 | 180 | 85 |
---|
1 | 1 | 1 | 170 | 80 |
---|
2 | 2 | 2 | 172 | 85 |
---|
3 | 3 | 3 | 183 | 75 |
---|
4 | 4 | 4 | 179 | 78 |
---|
5 | 5 | 5 | 178 | 78 |
---|
6 | 6 | 6 | 160 | 70 |
---|