| … | … | … | … |
| 95 | 145 | 25 | 139 |
| 96 | 53 | 51 | 109 |
| 97 | 35 | 7 | 130 |
| 98 | 86 | 51 | 20 |
| 99 | 149 | 66 | 75 |
100 rows × 3 columns
df.to_csv('./data.csv',sep = ',',
index = True, # save the row index
header=True) # save the column index
df.to_csv('./data2.csv',sep = ',',
index = False, # do not save the row index
header=False) # do not save the column index
pd.read_csv('./data.csv',
index_col=0) # use the first column as the row index
| Python | Math | En |
---|
0 | 10 | 128 | 54 |
1 | 45 | 47 | 74 |
2 | 103 | 33 | 133 |
3 | 70 | 24 | 81 |
4 | 90 | 143 | 121 |
… | … | … | … |
95 | 145 | 25 | 139 |
96 | 53 | 51 | 109 |
97 | 35 | 7 | 130 |
98 | 86 | 51 | 20 |
99 | 149 | 66 | 75 |
100 rows × 3 columns
pd.read_csv('./data2.csv',header =None)
| 0 | 1 | 2 |
---|
0 | 10 | 128 | 54 |
1 | 45 | 47 | 74 |
2 | 103 | 33 | 133 |
3 | 70 | 24 | 81 |
4 | 90 | 143 | 121 |
… | … | … | … |
95 | 145 | 25 | 139 |
96 | 53 | 51 | 109 |
97 | 35 | 7 | 130 |
98 | 86 | 51 | 20 |
99 | 149 | 66 | 75 |
100 rows × 3 columns
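When a file was saved without a header row, `names=` can supply column labels directly at read time. A minimal sketch using an in-memory buffer instead of a file on disk (the values are made up):

```python
from io import StringIO

import pandas as pd

# Three headerless rows, as produced by to_csv(header=False, index=False).
csv_text = "10,128,54\n45,47,74\n103,33,133\n"

# names= assigns column labels to the headerless data.
df2 = pd.read_csv(StringIO(csv_text), header=None, names=['Python', 'Math', 'En'])
```

This avoids the integer column labels `0, 1, 2` that `header=None` alone produces.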
3.2 Excel
df
| Python | Math | En |
---|
0 | 10 | 128 | 54 |
1 | 45 | 47 | 74 |
2 | 103 | 33 | 133 |
3 | 70 | 24 | 81 |
4 | 90 | 143 | 121 |
… | … | … | … |
95 | 145 | 25 | 139 |
96 | 53 | 51 | 109 |
97 | 35 | 7 | 130 |
98 | 86 | 51 | 20 |
99 | 149 | 66 | 75 |
100 rows × 3 columns
df.to_excel('./data.xlsx') # write .xlsx via openpyxl; legacy .xls needs the removed xlwt engine
pd.read_excel('./data.xlsx',
index_col=0) # use the first column as the row index
| Python | Math | En |
---|
0 | 10 | 128 | 54 |
1 | 45 | 47 | 74 |
2 | 103 | 33 | 133 |
3 | 70 | 24 | 81 |
4 | 90 | 143 | 121 |
… | … | … | … |
95 | 145 | 25 | 139 |
96 | 53 | 51 | 109 |
97 | 35 | 7 | 130 |
98 | 86 | 51 | 20 |
99 | 149 | 66 | 75 |
100 rows × 3 columns
3.3 HDF5
df.to_hdf('./data.h5',key = 'score')
df2 = pd.DataFrame(data = np.random.randint(6,100,size = (1000,5)),
columns=['计算机','化工','生物','工程','教师'])
df2
| 计算机 | 化工 | 生物 | 工程 | 教师 |
---|
0 | 64 | 22 | 16 | 68 | 60 |
1 | 95 | 47 | 72 | 76 | 37 |
2 | 88 | 48 | 92 | 50 | 37 |
3 | 75 | 38 | 8 | 63 | 83 |
4 | 62 | 14 | 20 | 21 | 45 |
… | … | … | … | … | … |
995 | 89 | 94 | 88 | 97 | 27 |
996 | 52 | 68 | 21 | 8 | 50 |
997 | 76 | 99 | 10 | 92 | 56 |
998 | 66 | 31 | 55 | 65 | 94 |
999 | 8 | 21 | 38 | 89 | 14 |
1000 rows × 5 columns
df2.to_hdf('./data.h5',key = 'salary')
pd.read_hdf('./data.h5',key = 'salary')
| 计算机 | 化工 | 生物 | 工程 | 教师 |
---|
0 | 64 | 22 | 16 | 68 | 60 |
1 | 95 | 47 | 72 | 76 | 37 |
2 | 88 | 48 | 92 | 50 | 37 |
3 | 75 | 38 | 8 | 63 | 83 |
4 | 62 | 14 | 20 | 21 | 45 |
… | … | … | … | … | … |
995 | 89 | 94 | 88 | 97 | 27 |
996 | 52 | 68 | 21 | 8 | 50 |
997 | 76 | 99 | 10 | 92 | 56 |
998 | 66 | 31 | 55 | 65 | 94 |
999 | 8 | 21 | 38 | 89 | 14 |
1000 rows × 5 columns
3.4 SQL
from sqlalchemy import create_engine # database engine: builds the connection to the database
# PyMySQL
# the connection string looks like a web URL
engine = create_engine('mysql+pymysql://root:12345678@localhost/pandas?charset=utf8')
df2.to_sql('salary',engine,index=False) # save the DataFrame from Python into MySQL
df3 = pd.read_sql('select * from salary limit 50',con = engine)
df3
| 计算机 | 化工 | 生物 | 工程 | 教师 |
---|
0 | 64 | 22 | 16 | 68 | 60 |
1 | 95 | 47 | 72 | 76 | 37 |
2 | 88 | 48 | 92 | 50 | 37 |
3 | 75 | 38 | 8 | 63 | 83 |
4 | 62 | 14 | 20 | 21 | 45 |
5 | 95 | 41 | 84 | 37 | 16 |
6 | 34 | 45 | 11 | 93 | 94 |
7 | 99 | 10 | 57 | 30 | 63 |
8 | 59 | 60 | 12 | 37 | 93 |
9 | 41 | 58 | 15 | 67 | 70 |
10 | 19 | 8 | 63 | 96 | 64 |
11 | 75 | 61 | 78 | 49 | 89 |
12 | 86 | 84 | 31 | 68 | 27 |
13 | 42 | 98 | 24 | 20 | 85 |
14 | 95 | 15 | 97 | 80 | 87 |
15 | 52 | 22 | 44 | 35 | 74 |
16 | 67 | 20 | 65 | 24 | 10 |
17 | 9 | 46 | 41 | 62 | 66 |
18 | 86 | 76 | 80 | 72 | 19 |
19 | 61 | 81 | 64 | 26 | 6 |
20 | 77 | 92 | 84 | 18 | 7 |
21 | 87 | 16 | 75 | 14 | 34 |
22 | 23 | 82 | 92 | 42 | 32 |
23 | 61 | 89 | 28 | 21 | 40 |
24 | 22 | 12 | 38 | 89 | 14 |
25 | 77 | 12 | 46 | 89 | 12 |
26 | 60 | 45 | 52 | 71 | 67 |
27 | 29 | 76 | 94 | 26 | 91 |
28 | 14 | 60 | 82 | 88 | 60 |
29 | 56 | 36 | 44 | 60 | 37 |
30 | 63 | 77 | 43 | 42 | 82 |
31 | 25 | 71 | 36 | 51 | 21 |
32 | 76 | 86 | 87 | 83 | 8 |
33 | 93 | 59 | 86 | 25 | 78 |
34 | 73 | 40 | 12 | 86 | 66 |
35 | 10 | 30 | 54 | 13 | 71 |
36 | 9 | 48 | 58 | 75 | 85 |
37 | 81 | 41 | 61 | 12 | 55 |
38 | 80 | 68 | 66 | 92 | 84 |
39 | 53 | 36 | 84 | 26 | 66 |
40 | 19 | 62 | 63 | 47 | 45 |
41 | 89 | 39 | 91 | 31 | 86 |
42 | 57 | 43 | 53 | 48 | 19 |
43 | 66 | 16 | 23 | 19 | 10 |
44 | 46 | 28 | 78 | 81 | 21 |
45 | 38 | 53 | 76 | 49 | 8 |
46 | 55 | 94 | 70 | 6 | 44 |
47 | 56 | 33 | 92 | 17 | 84 |
48 | 69 | 68 | 23 | 87 | 90 |
49 | 12 | 47 | 32 | 80 | 15 |
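The same `to_sql`/`read_sql` round trip can be tried without a MySQL server by using the standard-library SQLite driver. A minimal sketch with made-up table contents (the column names `cs` and `chem` are assumptions, not from the example above):

```python
import sqlite3

import pandas as pd

# An in-memory SQLite database stands in for the MySQL engine.
con = sqlite3.connect(':memory:')

df = pd.DataFrame({'cs': [64, 95, 88], 'chem': [22, 47, 48]})
df.to_sql('salary', con, index=False)                    # write the DataFrame to a table
df3 = pd.read_sql('select * from salary limit 2', con)   # read part of it back
```

SQLite is convenient for experiments because it needs no server, credentials, or connection string.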
Part 4: Data Selection
4.1 Getting Data
df = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
index=list('ABCDEFHIJK'),columns=['Python','Math','En'])
df
| Python | Math | En |
---|
A | 88 | 52 | 48 |
B | 78 | 62 | 94 |
C | 9 | 14 | 71 |
D | 86 | 15 | 21 |
E | 1 | 71 | 71 |
F | 123 | 138 | 55 |
H | 59 | 17 | 140 |
I | 68 | 8 | 58 |
J | 100 | 70 | 63 |
K | 79 | 37 | 72 |
df['Python'] # returns a Series
A 88
B 78
C 9
D 86
E 1
F 123
H 59
I 68
J 100
K 79
Name: Python, dtype: int64
df.Python # attribute access: each column label is exposed as a DataFrame attribute
A 88
B 78
C 9
D 86
E 1
F 123
H 59
I 68
J 100
K 79
Name: Python, dtype: int64
df[['Python','En']] # select multiple columns
| Python | En |
---|
A | 88 | 48 |
B | 78 | 94 |
C | 9 | 71 |
D | 86 | 21 |
E | 1 | 71 |
F | 123 | 55 |
H | 59 | 140 |
I | 68 | 58 |
J | 100 | 63 |
K | 79 | 72 |
4.2 Label-Based Selection
# labels refer to the row index; loc is short for location
df.loc['A']
Python 88
Math 52
En 48
Name: A, dtype: int64
df.loc[['A','F','K']]
| Python | Math | En |
---|
A | 88 | 52 | 48 |
F | 123 | 138 | 55 |
K | 79 | 37 | 72 |
df.loc['A','Python']
88
df.loc[['A','C','F'],'Python']
A 88
C 9
F 123
Name: Python, dtype: int64
df.loc['A'::2,['Math','En']]
| Math | En |
---|
A | 52 | 48 |
C | 14 | 71 |
E | 71 | 71 |
H | 17 | 140 |
J | 70 | 63 |
df.loc['A':'D',:]
| Python | Math | En |
---|
A | 88 | 52 | 48 |
B | 78 | 62 | 94 |
C | 9 | 14 | 71 |
D | 86 | 15 | 21 |
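Note that label slices with `loc` (as in `df.loc['A':'D']` above) include the endpoint, while positional slices with `iloc` exclude it. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3),
                  index=list('ABCD'), columns=['Python', 'Math', 'En'])

by_label = df.loc['A':'C']  # rows A, B and C: the endpoint label IS included
by_pos = df.iloc[0:2]       # rows at positions 0 and 1: the stop position is excluded
```

This asymmetry exists because label slicing has no natural "one past the end" label to stop at.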
4.3 Position-Based Selection
df.iloc[0]
Python 88
Math 52
En 48
Name: A, dtype: int64
df.iloc[[0,2,4]]
| Python | Math | En |
---|
A | 88 | 52 | 48 |
C | 9 | 14 | 71 |
E | 1 | 71 | 71 |
df.iloc[0:4,[0,2]]
| Python | En |
---|
A | 88 | 48 |
B | 78 | 94 |
C | 9 | 71 |
D | 86 | 21 |
df.iloc[3:8:2]
| Python | Math | En |
---|
D | 86 | 15 | 21 |
F | 123 | 138 | 55 |
I | 68 | 8 | 58 |
4.4 Boolean Indexing
cond = df.Python > 80 # select rows where the Python score is greater than 80
df[cond]
| Python | Math | En |
---|
A | 88 | 52 | 48 |
D | 86 | 15 | 21 |
F | 123 | 138 | 55 |
J | 100 | 70 | 63 |
cond = df.mean(axis = 1) > 75 # select rows whose mean score is above 75 (excellent)
df[cond]
| Python | Math | En |
---|
B | 78 | 62 | 94 |
F | 123 | 138 | 55 |
J | 100 | 70 | 63 |
cond = (df.Python > 70) & (df.Math > 70)
df[cond]
cond = df.index.isin(['C','E','H','K']) # check whether each index label is in the list
df[cond] # rows matching the condition are selected
| Python | Math | En |
---|
C | 9 | 14 | 71 |
E | 1 | 71 | 71 |
H | 59 | 17 | 140 |
K | 79 | 37 | 72 |
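Compound boolean conditions like `(df.Python > 70) & (df.Math > 70)` can also be written as a single expression string with `query`. A sketch on made-up scores:

```python
import pandas as pd

df = pd.DataFrame({'Python': [88, 9, 123], 'Math': [52, 14, 138]})

# Equivalent to df[(df.Python > 70) & (df.Math > 70)]
good = df.query('Python > 70 and Math > 70')
```

`query` avoids the parentheses and `&`/`|` operators that trip people up in the mask-based form.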
4.5 Assignment
df.loc['A','Python'] = 150 # modify a single value (avoid chained indexing like df['Python']['A'], which may not write through)
df
| Python | Math | En |
---|
A | 150 | 52 | 48 |
B | 78 | 62 | 94 |
C | 9 | 14 | 71 |
D | 86 | 15 | 21 |
E | 1 | 71 | 71 |
F | 123 | 138 | 55 |
H | 59 | 17 | 140 |
I | 68 | 8 | 58 |
J | 100 | 70 | 63 |
K | 79 | 37 | 72 |
df['Java'] = np.random.randint(0,151,size = 10) # add a new column
df
| Python | Math | En | Java |
---|
A | 150 | 52 | 48 | 65 |
B | 78 | 62 | 94 | 25 |
C | 9 | 14 | 71 | 82 |
D | 86 | 15 | 21 | 139 |
E | 1 | 71 | 71 | 67 |
F | 123 | 138 | 55 | 145 |
H | 59 | 17 | 140 | 53 |
I | 68 | 8 | 58 | 141 |
J | 100 | 70 | 63 | 11 |
K | 79 | 37 | 72 | 127 |
df.loc[['C','D','E'],'Math'] = 147 # modify several scores at once
df
| Python | Math | En | Java |
---|
A | 150 | 52 | 48 | 65 |
B | 78 | 62 | 94 | 25 |
C | 9 | 147 | 71 | 82 |
D | 86 | 147 | 21 | 139 |
E | 1 | 147 | 71 | 67 |
F | 123 | 138 | 55 | 145 |
H | 59 | 17 | 140 | 53 |
I | 68 | 8 | 58 | 141 |
J | 100 | 70 | 63 | 11 |
K | 79 | 37 | 72 | 127 |
cond = df < 60
df[cond] = 60 # where-style conditional assignment: matching values are replaced, others left unchanged
df
| Python | Math | En | Java |
---|
A | 150 | 60 | 60 | 65 |
B | 78 | 62 | 94 | 60 |
C | 60 | 147 | 71 | 82 |
D | 86 | 147 | 60 | 139 |
E | 60 | 147 | 71 | 67 |
F | 123 | 138 | 60 | 145 |
H | 60 | 60 | 140 | 60 |
I | 68 | 60 | 60 | 141 |
J | 100 | 70 | 63 | 60 |
K | 79 | 60 | 72 | 127 |
df.iloc[3::3,[0,2]] += 100
df
| Python | Math | En | Java |
---|
A | 150 | 60 | 60 | 65 |
B | 78 | 62 | 94 | 60 |
C | 60 | 147 | 71 | 82 |
D | 186 | 147 | 160 | 139 |
E | 60 | 147 | 71 | 67 |
F | 123 | 138 | 60 | 145 |
H | 160 | 60 | 240 | 60 |
I | 68 | 60 | 60 | 141 |
J | 100 | 70 | 63 | 60 |
K | 179 | 60 | 172 | 127 |
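The in-place pattern `df[cond] = 60` has a non-mutating counterpart, `mask`, which returns a new frame and leaves the original untouched. A minimal sketch on made-up data:

```python
import pandas as pd

df = pd.DataFrame({'Python': [150, 30], 'Math': [52, 90]})

# Replace every value below 60 with 60, without modifying df itself.
clipped = df.mask(df < 60, 60)
```

`where` is the mirror image: it keeps values where the condition is True and replaces the rest.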
Part 5: Data Integration
5.1 Concatenation with concat
# np.concatenate is the NumPy counterpart
df1 = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
columns=['Python','Math','En'],
index = list('ABCDEFHIJK'))
df2 = pd.DataFrame(np.random.randint(0,151,size = (10,3)),
columns = ['Python','Math','En'],
index = list('QWRTUYOPLM'))
df3 = pd.DataFrame(np.random.randint(0,151,size = (10,2)),
columns=['Java','Chinese'],index = list('ABCDEFHIJK'))
pd.concat([df1,df2],axis = 0) # axis = 0: concatenate along rows (adds rows)
| Python | Math | En |
---|
A | 108 | 74 | 53 |
B | 98 | 16 | 47 |
C | 71 | 77 | 128 |
D | 9 | 123 | 131 |
E | 25 | 90 | 132 |
F | 105 | 106 | 86 |
H | 146 | 42 | 81 |
I | 83 | 4 | 36 |
J | 102 | 79 | 8 |
K | 92 | 11 | 47 |
Q | 119 | 59 | 43 |
W | 20 | 62 | 106 |
R | 77 | 82 | 128 |
T | 44 | 119 | 15 |
U | 49 | 149 | 62 |
Y | 94 | 90 | 88 |
O | 105 | 72 | 133 |
P | 87 | 109 | 123 |
L | 125 | 140 | 149 |
M | 148 | 22 | 102 |
pd.concat([df1,df3],axis = 1) # axis = 1: concatenate along columns (adds columns)
| Python | Math | En | Java | Chinese |
---|
A | 108 | 74 | 53 | 61 | 81 |
B | 98 | 16 | 47 | 117 | 117 |
C | 71 | 77 | 128 | 48 | 4 |
D | 9 | 123 | 131 | 149 | 115 |
E | 25 | 90 | 132 | 113 | 73 |
F | 105 | 106 | 86 | 140 | 26 |
H | 146 | 42 | 81 | 117 | 118 |
I | 83 | 4 | 36 | 103 | 91 |
J | 102 | 79 | 8 | 43 | 20 |
K | 92 | 11 | 47 | 93 | 72 |
df1.append(df2) # append stacks rows at the end (deprecated since pandas 1.4, removed in 2.0; prefer pd.concat)
| Python | C++ | Math | En |
---|
A | 108 | 59.0 | 74 | 53 |
B | 98 | 4.0 | 16 | 47 |
C | 71 | 27.0 | 77 | 128 |
D | 9 | 17.0 | 123 | 131 |
E | 25 | 60.0 | 90 | 132 |
F | 105 | 136.0 | 106 | 86 |
H | 146 | 112.0 | 42 | 81 |
I | 83 | 120.0 | 4 | 36 |
J | 102 | 28.0 | 79 | 8 |
K | 92 | 53.0 | 11 | 47 |
Q | 119 | NaN | 59 | 43 |
W | 20 | NaN | 62 | 106 |
R | 77 | NaN | 82 | 128 |
T | 44 | NaN | 119 | 15 |
U | 49 | NaN | 149 | 62 |
Y | 94 | NaN | 90 | 88 |
O | 105 | NaN | 72 | 133 |
P | 87 | NaN | 109 | 123 |
L | 125 | NaN | 140 | 149 |
M | 148 | NaN | 22 | 102 |
df1.append(df3) # NaN appears because df1 and df3 have different column indexes
| Python | C++ | Math | En | Java | Chinese |
---|
A | 108.0 | 59.0 | 74.0 | 53.0 | NaN | NaN |
B | 98.0 | 4.0 | 16.0 | 47.0 | NaN | NaN |
C | 71.0 | 27.0 | 77.0 | 128.0 | NaN | NaN |
D | 9.0 | 17.0 | 123.0 | 131.0 | NaN | NaN |
E | 25.0 | 60.0 | 90.0 | 132.0 | NaN | NaN |
F | 105.0 | 136.0 | 106.0 | 86.0 | NaN | NaN |
H | 146.0 | 112.0 | 42.0 | 81.0 | NaN | NaN |
I | 83.0 | 120.0 | 4.0 | 36.0 | NaN | NaN |
J | 102.0 | 28.0 | 79.0 | 8.0 | NaN | NaN |
K | 92.0 | 53.0 | 11.0 | 47.0 | NaN | NaN |
A | NaN | NaN | NaN | NaN | 61.0 | 81.0 |
B | NaN | NaN | NaN | NaN | 117.0 | 117.0 |
C | NaN | NaN | NaN | NaN | 48.0 | 4.0 |
D | NaN | NaN | NaN | NaN | 149.0 | 115.0 |
E | NaN | NaN | NaN | NaN | 113.0 | 73.0 |
F | NaN | NaN | NaN | NaN | 140.0 | 26.0 |
H | NaN | NaN | NaN | NaN | 117.0 | 118.0 |
I | NaN | NaN | NaN | NaN | 103.0 | 91.0 |
J | NaN | NaN | NaN | NaN | 43.0 | 20.0 |
K | NaN | NaN | NaN | NaN | 93.0 | 72.0 |
pd.concat([df1,df3],axis = 0)
| Python | C++ | Math | En | Java | Chinese |
---|
A | 108.0 | 59.0 | 74.0 | 53.0 | NaN | NaN |
B | 98.0 | 4.0 | 16.0 | 47.0 | NaN | NaN |
C | 71.0 | 27.0 | 77.0 | 128.0 | NaN | NaN |
D | 9.0 | 17.0 | 123.0 | 131.0 | NaN | NaN |
E | 25.0 | 60.0 | 90.0 | 132.0 | NaN | NaN |
F | 105.0 | 136.0 | 106.0 | 86.0 | NaN | NaN |
H | 146.0 | 112.0 | 42.0 | 81.0 | NaN | NaN |
I | 83.0 | 120.0 | 4.0 | 36.0 | NaN | NaN |
J | 102.0 | 28.0 | 79.0 | 8.0 | NaN | NaN |
K | 92.0 | 53.0 | 11.0 | 47.0 | NaN | NaN |
A | NaN | NaN | NaN | NaN | 61.0 | 81.0 |
B | NaN | NaN | NaN | NaN | 117.0 | 117.0 |
C | NaN | NaN | NaN | NaN | 48.0 | 4.0 |
D | NaN | NaN | NaN | NaN | 149.0 | 115.0 |
E | NaN | NaN | NaN | NaN | 113.0 | 73.0 |
F | NaN | NaN | NaN | NaN | 140.0 | 26.0 |
H | NaN | NaN | NaN | NaN | 117.0 | 118.0 |
I | NaN | NaN | NaN | NaN | 103.0 | 91.0 |
J | NaN | NaN | NaN | NaN | 43.0 | 20.0 |
K | NaN | NaN | NaN | NaN | 93.0 | 72.0 |
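When frames share only some columns, `concat` can keep just the intersection with `join='inner'` instead of padding the rest with NaN. A sketch with made-up frames:

```python
import pandas as pd

a = pd.DataFrame({'Python': [1, 2], 'Math': [3, 4]})
b = pd.DataFrame({'Python': [5], 'Java': [6]})

# Only the shared 'Python' column survives; no NaN padding is introduced.
only_shared = pd.concat([a, b], axis=0, join='inner', ignore_index=True)
```

The default `join='outer'` produces exactly the NaN-filled result shown above.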
5.2 Inserting Columns
df1
| Python | Math | En |
---|
A | 108 | 74 | 53 |
B | 98 | 16 | 47 |
C | 71 | 77 | 128 |
D | 9 | 123 | 131 |
E | 25 | 90 | 132 |
F | 105 | 106 | 86 |
H | 146 | 42 | 81 |
I | 83 | 4 | 36 |
J | 102 | 79 | 8 |
K | 92 | 11 | 47 |
df1.insert(loc = 1, # insertion position
column='C++', # name of the new column
value = np.random.randint(0,151,size = 10)) # values to insert
df1
| Python | C++ | Math | En |
---|
A | 108 | 59 | 74 | 53 |
B | 98 | 4 | 16 | 47 |
C | 71 | 27 | 77 | 128 |
D | 9 | 17 | 123 | 131 |
E | 25 | 60 | 90 | 132 |
F | 105 | 136 | 106 | 86 |
H | 146 | 112 | 42 | 81 |
I | 83 | 120 | 4 | 36 |
J | 102 | 28 | 79 | 8 |
K | 92 | 53 | 11 | 47 |
5.3 SQL-Style Joins with merge
df1 = pd.DataFrame(data = {'name':['softpo','Brandon','Ella','Daniel','张三'],
'height':[175,180,169,177,168]}) # height
df2 = pd.DataFrame(data = {'name':['softpo','Brandon','Ella','Daniel','李四'],
'weight':[70,65,74,63,88]}) # weight
df3 = pd.DataFrame(data = {'名字':['softpo','Brandon','Ella','Daniel','张三'],
'salary':np.random.randint(20,100,size = 5)}) # salary
display(df1,df2,df3)
| name | height |
---|
0 | softpo | 175 |
1 | Brandon | 180 |
2 | Ella | 169 |
3 | Daniel | 177 |
4 | 张三 | 168 |
| name | weight |
---|
0 | softpo | 70 |
1 | Brandon | 65 |
2 | Ella | 74 |
3 | Daniel | 63 |
4 | 李四 | 88 |
| 名字 | salary |
---|
0 | softpo | 64 |
1 | Brandon | 48 |
2 | Ella | 25 |
3 | Daniel | 26 |
4 | 张三 | 96 |
pd.concat([df1,df2],axis = 1)
| name | height | name | weight |
---|
0 | softpo | 175 | softpo | 70 |
1 | Brandon | 180 | Brandon | 65 |
2 | Ella | 169 | Ella | 74 |
3 | Daniel | 177 | Daniel | 63 |
4 | 张三 | 168 | 李四 | 88 |
# merge on a shared column
# df1 and df2 share the column: name
# like a database join on a common key
# inner join
pd.merge(df1,df2,how = 'inner') # join the two tables on the shared name column
| name | height | weight |
---|
0 | softpo | 175 | 70 |
1 | Brandon | 180 | 65 |
2 | Ella | 169 | 74 |
3 | Daniel | 177 | 63 |
pd.merge(df1,df2,how = 'outer') # outer join: all rows kept; unmatched positions filled with NaN
| name | height | weight |
---|
0 | softpo | 175.0 | 70.0 |
1 | Brandon | 180.0 | 65.0 |
2 | Ella | 169.0 | 74.0 |
3 | Daniel | 177.0 | 63.0 |
4 | 张三 | 168.0 | NaN |
5 | 李四 | NaN | 88.0 |
pd.merge(df1,df2,how = 'left')
| name | height | weight |
---|
0 | softpo | 175 | 70.0 |
1 | Brandon | 180 | 65.0 |
2 | Ella | 169 | 74.0 |
3 | Daniel | 177 | 63.0 |
4 | 张三 | 168 | NaN |
pd.merge(df1,df3,left_on='name',right_on='名字')
| name | height | 名字 | salary |
---|
0 | softpo | 175 | softpo | 64 |
1 | Brandon | 180 | Brandon | 48 |
2 | Ella | 169 | Ella | 25 |
3 | Daniel | 177 | Daniel | 26 |
4 | 张三 | 168 | 张三 | 96 |
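`merge` can also report where each row came from via `indicator=True`, which adds a `_merge` column. A sketch reusing the height/weight idea with made-up rows:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['softpo', 'Ella'], 'height': [175, 169]})
df2 = pd.DataFrame({'name': ['softpo', 'Daniel'], 'weight': [70, 63]})

# _merge is 'both', 'left_only' or 'right_only' for each row.
m = pd.merge(df1, df2, how='outer', indicator=True)
```

This is handy for auditing an outer join: rows tagged `left_only` or `right_only` had no match.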
df4 = pd.DataFrame(data = np.random.randint(0,151,size = (10,3)),
columns=['Python','Math','En'],index = list('ABCDEFHIJK'))
df4
| Python | Math | En |
---|
A | 71 | 7 | 89 |
B | 145 | 116 | 40 |
C | 56 | 150 | 139 |
D | 88 | 66 | 41 |
E | 87 | 139 | 117 |
F | 141 | 45 | 18 |
H | 93 | 119 | 114 |
I | 110 | 89 | 2 |
J | 2 | 35 | 96 |
K | 125 | 59 | 9 |
score_mean = df4.mean(axis = 1).round(1)
score_mean
A 55.7
B 100.3
C 115.0
D 65.0
E 114.3
F 68.0
H 108.7
I 67.0
J 44.3
K 64.3
dtype: float64
df4.insert(loc = 2,column='平均分',value=score_mean)
df4
| Python | Math | 平均分 | En |
---|
A | 71 | 7 | 55.7 | 89 |
B | 145 | 116 | 100.3 | 40 |
C | 56 | 150 | 115.0 | 139 |
D | 88 | 66 | 65.0 | 41 |
E | 87 | 139 | 114.3 | 117 |
F | 141 | 45 | 68.0 | 18 |
H | 93 | 119 | 108.7 | 114 |
I | 110 | 89 | 67.0 | 2 |
J | 2 | 35 | 44.3 | 96 |
K | 125 | 59 | 64.3 | 9 |
df5 = df4.iloc[:,[0,1,3]]
df5
| Python | Math | En |
---|
A | 71 | 7 | 89 |
B | 145 | 116 | 40 |
C | 56 | 150 | 139 |
D | 88 | 66 | 41 |
E | 87 | 139 | 117 |
F | 141 | 45 | 18 |
H | 93 | 119 | 114 |
I | 110 | 89 | 2 |
J | 2 | 35 | 96 |
K | 125 | 59 | 9 |
score_mean.name = '平均分'
score_mean
A 55.7
B 100.3
C 115.0
D 65.0
E 114.3
F 68.0
H 108.7
I 67.0
J 44.3
K 64.3
Name: 平均分, dtype: float64
df5
| Python | Math | En |
---|
A | 71 | 7 | 89 |
B | 145 | 116 | 40 |
C | 56 | 150 | 139 |
D | 88 | 66 | 41 |
E | 87 | 139 | 117 |
F | 141 | 45 | 18 |
H | 93 | 119 | 114 |
I | 110 | 89 | 2 |
J | 2 | 35 | 96 |
K | 125 | 59 | 9 |
pd.merge(df5,score_mean,
left_index=True, # align the left side on its row index
right_index=True) # align the right side on its row index
| Python | Math | En | 平均分 |
---|
A | 71 | 7 | 89 | 55.7 |
B | 145 | 116 | 40 | 100.3 |
C | 56 | 150 | 139 | 115.0 |
D | 88 | 66 | 41 | 65.0 |
E | 87 | 139 | 117 | 114.3 |
F | 141 | 45 | 18 | 68.0 |
H | 93 | 119 | 114 | 108.7 |
I | 110 | 89 | 2 | 67.0 |
J | 2 | 35 | 96 | 44.3 |
K | 125 | 59 | 9 | 64.3 |
Part 6: Data Cleaning
df = pd.DataFrame(data = {'color':['red','blue','red','green','green','blue',None,np.NaN,'green'],
'price':[20,15,20,18,18,22,30,30,22]})
df
| color | price |
---|
0 | red | 20 |
1 | blue | 15 |
2 | red | 20 |
3 | green | 18 |
4 | green | 18 |
5 | blue | 22 |
6 | None | 30 |
7 | NaN | 30 |
8 | green | 22 |
# drop duplicate rows
df.drop_duplicates() # rows 6 and 7 are duplicates: None and NaN are treated the same
| color | price |
---|
0 | red | 20 |
1 | blue | 15 |
3 | green | 18 |
5 | blue | 22 |
6 | None | 30 |
8 | green | 22 |
df
| color | price |
---|
0 | red | 20 |
1 | blue | 15 |
2 | red | 20 |
3 | green | 18 |
4 | green | 18 |
5 | blue | 22 |
6 | None | 30 |
7 | NaN | 30 |
8 | green | 22 |
df.dropna() # drop rows containing missing values
| color | price |
---|
0 | red | 20 |
1 | blue | 15 |
2 | red | 20 |
3 | green | 18 |
4 | green | 18 |
5 | blue | 22 |
8 | green | 22 |
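`dropna` can be restricted to particular columns with `subset=`, so a missing value elsewhere does not discard the whole row. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'color': ['red', None, np.nan], 'price': [20, 30, 30]})

# Only rows with a missing 'color' are dropped; other columns are ignored.
kept = df.dropna(subset=['color'])
```

`thresh=` is another useful knob: it keeps rows that have at least a given number of non-missing values.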
# drop rows or columns
df.drop(labels=[2,4,6,8]) # drops rows by default
| color | price |
---|
0 | red | 20 |
1 | blue | 15 |
3 | green | 18 |
5 | blue | 22 |
7 | NaN | 30 |
# drop a specific column
df.drop(labels='color',axis = 1) # axis = 1 drops columns
| price |
---|
0 | 20 |
1 | 15 |
2 | 20 |
3 | 18 |
4 | 18 |
5 | 22 |
6 | 30 |
7 | 30 |
8 | 22 |
df.filter(items=['price']) # keep only the price column
| price |
---|
0 | 20 |
1 | 15 |
2 | 20 |
3 | 18 |
4 | 18 |
5 | 22 |
6 | 30 |
7 | 30 |
8 | 22 |
df['size'] = 1024 # broadcast a scalar to the whole column
df
| color | price | size |
---|
0 | red | 20 | 1024 |
1 | blue | 15 | 1024 |
2 | red | 20 | 1024 |
3 | green | 18 | 1024 |
4 | green | 18 | 1024 |
5 | blue | 22 | 1024 |
6 | None | 30 | 1024 |
7 | NaN | 30 | 1024 |
8 | green | 22 | 1024 |
df.filter(like = 'i') # fuzzy match: keep columns whose label contains the letter i
| price | size |
---|
0 | 20 | 1024 |
1 | 15 | 1024 |
2 | 20 | 1024 |
3 | 18 | 1024 |
4 | 18 | 1024 |
5 | 22 | 1024 |
6 | 30 | 1024 |
7 | 30 | 1024 |
8 | 22 | 1024 |
df['hello'] = 512
df
| color | price | size | hello |
---|
0 | red | 20 | 1024 | 512 |
1 | blue | 15 | 1024 | 512 |
2 | red | 20 | 1024 | 512 |
3 | green | 18 | 1024 | 512 |
4 | green | 18 | 1024 | 512 |
5 | blue | 22 | 1024 | 512 |
6 | None | 30 | 1024 | 512 |
7 | NaN | 30 | 1024 | 512 |
8 | green | 22 | 1024 | 512 |
# regular expressions allow many matching patterns
df.filter(regex = 'e$') # regex: keep columns whose label ends with e
| price | size |
---|
0 | 20 | 1024 |
1 | 15 | 1024 |
2 | 20 | 1024 |
3 | 18 | 1024 |
4 | 18 | 1024 |
5 | 22 | 1024 |
6 | 30 | 1024 |
7 | 30 | 1024 |
8 | 22 | 1024 |
df.filter(regex='e') # keep every column whose label contains e
| price | size | hello |
---|
0 | 20 | 1024 | 512 |
1 | 15 | 1024 | 512 |
2 | 20 | 1024 | 512 |
3 | 18 | 1024 | 512 |
4 | 18 | 1024 | 512 |
5 | 22 | 1024 | 512 |
6 | 30 | 1024 | 512 |
7 | 30 | 1024 | 512 |
8 | 22 | 1024 | 512 |
# outlier filtering
a = np.random.randint(0,1000,size = 200)
a
array([647, 871, 35, 738, 789, 587, 413, 559, 648, 993, 579, 129, 825,
904, 356, 316, 997, 800, 35, 601, 1, 208, 465, 614, 680, 619,
922, 346, 994, 135, 5, 650, 165, 475, 95, 194, 225, 455, 634,
717, 836, 678, 156, 203, 263, 180, 143, 248, 407, 56, 202, 947,
46, 408, 686, 530, 545, 273, 125, 964, 323, 775, 313, 238, 242,
804, 228, 322, 322, 768, 556, 9, 629, 938, 932, 859, 955, 707,
729, 541, 280, 493, 255, 681, 428, 992, 420, 650, 267, 32, 662,
185, 756, 319, 313, 271, 229, 711, 803, 85, 527, 853, 670, 685,
423, 458, 628, 701, 253, 495, 548, 879, 503, 115, 90, 978, 665,
532, 198, 482, 412, 850, 879, 913, 96, 177, 778, 337, 502, 128,
49, 747, 591, 22, 557, 105, 136, 775, 626, 515, 959, 869, 245,
437, 51, 236, 438, 489, 854, 49, 163, 687, 488, 175, 428, 517,
493, 377, 100, 728, 717, 926, 689, 186, 777, 639, 79, 83, 620,
623, 931, 918, 721, 315, 133, 423, 161, 999, 341, 55, 837, 582,
530, 805, 22, 301, 177, 322, 708, 14, 50, 864, 889, 929, 967,
497, 624, 127, 539, 14])
# here values above 800 or below 100 count as outliers; the thresholds are chosen by hand to fit the data
cond = (a <=800) & (a >=100)
a[cond]
array([647, 738, 789, 587, 413, 559, 648, 579, 129, 356, 316, 800, 601,
208, 465, 614, 680, 619, 346, 135, 650, 165, 475, 194, 225, 455,
634, 717, 678, 156, 203, 263, 180, 143, 248, 407, 202, 408, 686,
530, 545, 273, 125, 323, 775, 313, 238, 242, 228, 322, 322, 768,
556, 629, 707, 729, 541, 280, 493, 255, 681, 428, 420, 650, 267,
662, 185, 756, 319, 313, 271, 229, 711, 527, 670, 685, 423, 458,
628, 701, 253, 495, 548, 503, 115, 665, 532, 198, 482, 412, 177,
778, 337, 502, 128, 747, 591, 557, 105, 136, 775, 626, 515, 245,
437, 236, 438, 489, 163, 687, 488, 175, 428, 517, 493, 377, 100,
728, 717, 689, 186, 777, 639, 620, 623, 721, 315, 133, 423, 161,
341, 582, 530, 301, 177, 322, 708, 497, 624, 127, 539])
# standard normal distribution: mean 0, standard deviation 1
b = np.random.randn(100000)
b
array([-1.17335196, 2.02215212, -0.29891071, ..., -1.6762474 ,
-1.27071523, -1.15187761])
# filter out the outliers
cond = np.abs(b) > 3*1 # flag values more than 3 standard deviations from the mean
b[cond]
array([ 3.46554243, 3.08127362, 3.55119821, 3.62774922, 3.11823028,
3.22620922, -3.10381164, -3.20067563, -3.04607325, -3.04427703,
3.09111414, -3.28220862, 3.00499105, -3.06179762, -3.17331972,
-3.37172359, 3.93766782, -3.22895232, -3.13737479, 3.07612751,
-3.43215209, -3.27660651, -3.35116041, 4.74328695, 3.25586636,
-3.54090785, 3.08881127, 3.00635551, 3.5018534 , -3.14463788,
-3.0182886 , -3.12145648, -3.24276219, 3.08087834, 3.04820238,
-3.24173442, -3.14648209, 3.87748281, -3.07660111, -3.16083928,
3.32641202, -3.05228179, 3.04924043, 3.02825131, -3.08360056,
-3.04890894, -3.27258041, -3.07339115, -3.38375287, -3.14267022,
-3.7207377 , 3.4813841 , -3.12866105, -3.17122631, 3.0599701 ,
3.12393087, 3.20253178, -3.05221958, -3.35532417, 3.02450167,
-3.28385568, 3.3422833 , -3.11052755, -3.09647003, 3.32353664,
-3.70215812, -3.07916575, -3.13546874, 3.20575826, -3.67982084,
-3.17055893, 3.4836615 , -3.30039879, -3.27774497, 3.02125912,
3.12332885, 3.01456477, 3.15958151, -3.34101369, 3.32444673,
3.06479889, 3.14506863, 3.15670827, 3.15066995, 3.14705869,
-3.20526898, -3.0761338 , 3.20716127, -3.20941307, -3.7212859 ,
-3.51785834, -3.06096986, -3.05425748, -3.47049261, 3.22285172,
-3.32233224, -3.04630606, 3.41215312, -3.16482337, -3.01813609,
-3.05441573, -3.10394416, 3.03469642, 3.01493847, -3.11901071,
3.5996865 , 3.48194227, -3.77734847, 3.04588004, 3.10611158,
-3.20473003, -3.4377999 , 3.22680244, -3.1536921 , -3.22798726,
3.34569796, 3.06046948, -3.16955677, 3.12613756, 3.04286964,
3.01148054, 3.18525226, -4.08971624, -3.55427596, -5.39879049,
3.05203254, 3.08944491, -3.02258209, 3.17316913, -3.1615401 ,
3.17205118, -3.24221772, -3.14421237, -3.74675036, 3.61678522,
3.59097443, -3.0302881 , 3.23236707, -3.00850012, 3.33608986,
-3.02859152, -3.7000766 , -3.10992575, -3.00412636, -3.05657102,
-3.05208781, 3.14017797, 3.46457731, 3.15619413, -3.43236114,
3.08259529, -3.84578168, 3.04203424, -3.29444028, -3.01764756,
3.11300256, 3.23071233, 3.20785451, -3.15668756, 3.44176099,
-3.19985577, -3.14126853, -3.26482841, -3.62208271, -3.55305069,
3.09639491, -3.18178713, -3.03662021, 3.17247227, 3.3908074 ,
-3.63563705, -3.56417097, 3.02823554, -3.06955375, 3.74305364,
3.63993306, -3.14193492, -3.04032527, -3.28310908, -3.37949723,
-3.25915912, -3.01206123, -3.10871377, -3.22982732, 3.8136103 ,
3.48893313, 3.9918267 , 3.4526763 , -3.46595488, -3.29996013,
-3.42965097, 3.151502 , 3.10548689, -3.44707735, 3.21881565,
3.50932999, -3.12410382, 3.30296386, 3.02454576, -3.20072608,
3.54339754, -3.17847739, -3.21475045, 3.03546088, -3.06225619,
3.48158164, 3.15243123, -3.06358376, 3.27300242, 3.32577453,
3.23535167, -3.04681725, 3.33439387, 3.10620079, 3.52883469,
-3.1790272 , 3.02641222, -3.45636819, 3.21009424, 3.08045954,
-3.59721754, 3.24693695, 3.05920919, -3.43674159, -3.00370946,
-3.48031594, -3.28748467, 3.42581649, 3.46912521, -3.28384157,
3.76358974, -3.34035865, 3.12978233, 3.44856854, -3.04074246,
3.50018071, 3.33188267, -3.09775514, -3.49356906, -3.09902374,
3.12068562, -3.1776565 , -3.44282129, 3.19286374, -3.28304596,
-3.10080963, -3.37189709, 3.77743156, 3.03547536, 3.22045459,
-3.44007263, 3.01331408, 3.49733677, 3.28831922, 3.62147013,
3.03458981, 3.15447237, -3.33931478, 3.09858431, -3.23592306,
3.3144797 , 3.37067342, -3.18749118, 3.09319307, -3.34390567,
3.29819563, 3.3120354 , 3.04166958, -3.00975323, 3.0347423 ,
-3.82502331, -3.13125028, -3.0876424 , 3.13929221, 3.570775 ,
-3.37420738, 3.17527797, 3.13396148, -3.70088631, -3.04054948,
3.05399103, 3.24908851, 3.19666266, -3.64071456, -3.85271081,
3.06864652, 3.53367592, 3.54650649, 3.6355438 , 3.657715 ,
4.03831601, 3.61651925])
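To keep only the inliers rather than inspect the outliers, invert the condition. A minimal sketch with a seeded generator so the result is reproducible:

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.standard_normal(100000)

# Keep values within 3 standard deviations of the mean (here the std is 1).
kept = b[np.abs(b) <= 3]
```

For non-standard data the same rule becomes `np.abs(b - b.mean()) <= 3 * b.std()`.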
Part 7: Data Transformation
7.1 Renaming Axes and Replacing Values
import numpy as np
import pandas as pd
df = pd.DataFrame(data = np.random.randint(0,10,size = (10,3)),
columns=['Python','Tensorflow','Keras'],
index = list('ABCDEFHIJK'))
df
| Python | Tensorflow | Keras |
---|
A | 2 | 5 | 3 |
B | 5 | 0 | 0 |
C | 7 | 0 | 4 |
D | 0 | 4 | 7 |
E | 8 | 6 | 9 |
F | 8 | 2 | 6 |
H | 6 | 7 | 8 |
I | 7 | 6 | 9 |
J | 4 | 7 | 9 |
K | 6 | 7 | 1 |
df.rename(index = {'A':'X','K':'Y'}, # rename row index labels
columns={'Python':'人工智能'}, # rename column labels
inplace=True) # modify the original in place
df.replace(5,50,inplace=True)
df
| 人工智能 | Tensorflow | Keras |
---|
X | 2 | 50 | 3 |
B | 50 | 0 | 0 |
C | 7 | 0 | 4 |
D | 0 | 4 | 7 |
E | 8 | 6 | 9 |
F | 8 | 2 | 6 |
H | 6 | 7 | 8 |
I | 7 | 6 | 9 |
J | 4 | 7 | 9 |
Y | 6 | 7 | 1 |
df.replace([2,7],1024,inplace=True)
df
| 人工智能 | Tensorflow | Keras |
---|
X | 1024 | 50 | 3 |
B | 50 | 0 | 0 |
C | 1024 | 0 | 4 |
D | 0 | 4 | 1024 |
E | 8 | 6 | 9 |
F | 8 | 1024 | 6 |
H | 6 | 1024 | 8 |
I | 1024 | 6 | 9 |
J | 4 | 1024 | 9 |
Y | 6 | 1024 | 1 |
df.iloc[4,2] = np.nan # set a missing value
df.replace({0:2048,np.nan:-100},inplace=True)
df
| 人工智能 | Tensorflow | Keras |
---|
X | 1024 | 50 | 3.0 |
B | 50 | 2048 | 2048.0 |
C | 1024 | 2048 | 4.0 |
D | 2048 | 4 | 1024.0 |
E | 8 | 6 | -100.0 |
F | 8 | 1024 | 6.0 |
H | 6 | 1024 | 8.0 |
I | 1024 | 6 | 9.0 |
J | 4 | 1024 | 9.0 |
Y | 6 | 1024 | 1.0 |
df.replace({'Tensorflow':1024},-1024) # replace values only in the specified column
| 人工智能 | Tensorflow | Keras |
---|
X | 1024 | 50 | 3.0 |
B | 50 | 2048 | 2048.0 |
C | 1024 | 2048 | 4.0 |
D | 2048 | 4 | 1024.0 |
E | 8 | 6 | -100.0 |
F | 8 | -1024 | 6.0 |
H | 6 | -1024 | 8.0 |
I | 1024 | 6 | 9.0 |
J | 4 | -1024 | 9.0 |
Y | 6 | -1024 | 1.0 |
7.2 Element Mapping with map
# map operates on a single column, i.e. a Series
# values with no matching key become NaN
df['人工智能'].map({1024:3.14,2048:2.718,6:1108}) # remap values according to the dict
X 3.140
B NaN
C 3.140
D 2.718
E NaN
F NaN
H 1108.000
I 3.140
J NaN
Y 1108.000
Name: 人工智能, dtype: float64
df['Keras'].map(lambda x :True if x > 0 else False) # True if greater than 0, otherwise False
X True
B True
C True
D True
E False
F True
H True
I True
J True
Y True
Name: Keras, dtype: bool
def convert(x):
if x >= 1024:
return True
else:
return False
df['level'] = df['Tensorflow'].map(convert) # map passes each value of the Tensorflow column to the function
df
| 人工智能 | Tensorflow | Keras | level |
---|
X | 1024 | 50 | 3.0 | False |
B | 50 | 2048 | 2048.0 | True |
C | 1024 | 2048 | 4.0 | True |
D | 2048 | 4 | 1024.0 | False |
E | 8 | 6 | -100.0 | False |
F | 8 | 1024 | 6.0 | True |
H | 6 | 1024 | 8.0 | True |
I | 1024 | 6 | 9.0 | False |
J | 4 | 1024 | 9.0 | True |
Y | 6 | 1024 | 1.0 | True |
7.3 Element Mapping with apply
# works on both Series and DataFrame
df['人工智能'].apply(lambda x : x + 100)
X 1124
B 150
C 1124
D 2148
E 108
F 108
H 106
I 1124
J 104
Y 106
Name: 人工智能, dtype: int64
df['level'].apply(lambda x:1 if x else 0)
X 0
B 1
C 1
D 0
E 0
F 1
H 1
I 0
J 1
Y 1
Name: level, dtype: int64
df.apply(lambda x : x + 1000) # apply maps over all the data
| 人工智能 | Tensorflow | Keras | level |
---|
X | 2024 | 1050 | 1003.0 | 1000 |
B | 1050 | 3048 | 3048.0 | 1001 |
C | 2024 | 3048 | 1004.0 | 1001 |
D | 3048 | 1004 | 2024.0 | 1000 |
E | 1008 | 1006 | 900.0 | 1000 |
F | 1008 | 2024 | 1006.0 | 1001 |
H | 1006 | 2024 | 1008.0 | 1001 |
I | 2024 | 1006 | 1009.0 | 1000 |
J | 1004 | 2024 | 1009.0 | 1001 |
Y | 1006 | 2024 | 1001.0 | 1001 |
def convert(x):
return (x.median(),x.count(),x.min(),x.max(),x.std()) # return median, count, min, max and std
df.apply(convert).round(1) # operates on columns by default
| 人工智能 | Tensorflow | Keras | level |
---|
0 | 29.0 | 1024.0 | 7.0 | 1 |
1 | 10.0 | 10.0 | 10.0 | 10 |
2 | 4.0 | 4.0 | -100.0 | False |
3 | 2048.0 | 2048.0 | 2048.0 | True |
4 | 717.8 | 800.4 | 694.9 | 0.516398 |
df
| 人工智能 | Tensorflow | Keras | level |
---|
X | 1024 | 50 | 3.0 | False |
B | 50 | 2048 | 2048.0 | True |
C | 1024 | 2048 | 4.0 | True |
D | 2048 | 4 | 1024.0 | False |
E | 8 | 6 | -100.0 | False |
F | 8 | 1024 | 6.0 | True |
H | 6 | 1024 | 8.0 | True |
I | 1024 | 6 | 9.0 | False |
J | 4 | 1024 | 9.0 | True |
Y | 6 | 1024 | 1.0 | True |
df.apply(convert,axis = 1) # axis = 1: operate on row data
X (26.5, 4, False, 1024, 503.68732033541073)
B (1049.0, 4, True, 2048, 1167.8622564326668)
C (514.0, 4, True, 2048, 979.1007353689405)
D (514.0, 4, False, 2048, 979.3623776042588)
E (3.0, 4, -100.0, 8, 52.443620520834884)
F (7.0, 4, True, 1024, 509.5085049993441)
H (7.0, 4, True, 1024, 509.5085049993441)
I (7.5, 4, False, 1024, 509.51373877453)
J (6.5, 4, True, 1024, 509.6773489179208)
Y (3.5, 4, 1.0, 1024, 510.6721061503164)
dtype: object
7.4 Element Transformation with transform
df = pd.DataFrame(np.random.randint(0,10,size = (10,3)),
columns=['Python','Tensorflow','Keras'],
index = list('ABCDEFHIJK'))
display(df)
# can operate on a single column, i.e. a Series
df['Python'].transform(lambda x : 1024 if x > 5 else -1024) # similar in spirit to map and apply
| Python | Tensorflow | Keras |
---|
A | 1 | 1 | 9 |
B | 6 | 9 | 1 |
C | 1 | 4 | 6 |
D | 9 | 5 | 1 |
E | 4 | 1 | 8 |
F | 2 | 5 | 7 |
H | 7 | 2 | 3 |
I | 7 | 8 | 9 |
J | 5 | 4 | 2 |
K | 7 | 0 | 7 |
A -1024
B 1024
C -1024
D 1024
E -1024
F -1024
H 1024
I 1024
J -1024
K 1024
Name: Python, dtype: int64
df['Tensorflow'].apply([np.sqrt,np.square,np.cumsum]) # apply several different functions to one column
| sqrt | square | cumsum |
---|
A | 1.000000 | 1 | 1 |
B | 3.000000 | 81 | 10 |
C | 2.000000 | 16 | 14 |
D | 2.236068 | 25 | 19 |
E | 1.000000 | 1 | 20 |
F | 2.236068 | 25 | 25 |
H | 1.414214 | 4 | 27 |
I | 2.828427 | 64 | 35 |
J | 2.000000 | 16 | 39 |
K | 0.000000 | 0 | 39 |
df['Tensorflow'].transform([np.sqrt,np.square,np.cumsum]) # apply several different functions to one column
| sqrt | square | cumsum |
---|
A | 1.000000 | 1 | 1 |
B | 3.000000 | 81 | 10 |
C | 2.000000 | 16 | 14 |
D | 2.236068 | 25 | 19 |
E | 1.000000 | 1 | 20 |
F | 2.236068 | 25 | 25 |
H | 1.414214 | 4 | 27 |
I | 2.828427 | 64 | 35 |
J | 2.000000 | 16 | 39 |
K | 0.000000 | 0 | 39 |
def convert(x):
if x > 5:
return True
else:
return False
# can also operate on a whole DataFrame
df.transform({'Python':np.cumsum,'Tensorflow':np.square,'Keras':convert}) # apply a different function to each column
| Python | Tensorflow | Keras |
---|
A | 1 | 1 | True |
B | 7 | 81 | False |
C | 8 | 16 | True |
D | 17 | 25 | False |
E | 21 | 1 | True |
F | 23 | 25 | True |
H | 30 | 4 | False |
I | 37 | 64 | True |
J | 42 | 16 | False |
K | 49 | 0 | True |
df.apply({'Python':np.cumsum,'Tensorflow':np.square,'Keras':convert}) # apply a different function to each column
| Python | Tensorflow | Keras |
---|
A | 1 | 1 | True |
B | 7 | 81 | False |
C | 8 | 16 | True |
D | 17 | 25 | False |
E | 21 | 1 | True |
F | 23 | 25 | True |
H | 30 | 4 | False |
I | 37 | 64 | True |
J | 42 | 16 | False |
K | 49 | 0 | True |
7.5 Shuffling, Random Sampling, and Dummy Variables
df
| Python | Tensorflow | Keras |
---|
A | 1 | 1 | 9 |
B | 6 | 9 | 1 |
C | 1 | 4 | 6 |
D | 9 | 5 | 1 |
E | 4 | 1 | 8 |
F | 2 | 5 | 7 |
H | 7 | 2 | 3 |
I | 7 | 8 | 9 |
J | 5 | 4 | 2 |
K | 7 | 0 | 7 |
index = np.random.permutation(10) # returns a shuffled array of positions
index
array([3, 4, 1, 2, 7, 9, 0, 8, 5, 6])
# shuffle: reorder rows by the permuted positions
df.take(index)
| Python | Tensorflow | Keras |
---|
D | 9 | 5 | 1 |
E | 4 | 1 | 8 |
B | 6 | 9 | 1 |
C | 1 | 4 | 6 |
I | 7 | 8 | 9 |
K | 7 | 0 | 7 |
A | 1 | 1 | 9 |
J | 5 | 4 | 2 |
F | 2 | 5 | 7 |
H | 7 | 2 | 3 |
# randomly sample from a larger dataset
df.take(np.random.randint(0,10,size = 20)) # draw 20 random rows (with replacement)
| Python | Tensorflow | Keras |
---|
J | 5 | 4 | 2 |
J | 5 | 4 | 2 |
D | 9 | 5 | 1 |
K | 7 | 0 | 7 |
H | 7 | 2 | 3 |
I | 7 | 8 | 9 |
J | 5 | 4 | 2 |
A | 1 | 1 | 9 |
C | 1 | 4 | 6 |
J | 5 | 4 | 2 |
I | 7 | 8 | 9 |
D | 9 | 5 | 1 |
I | 7 | 8 | 9 |
K | 7 | 0 | 7 |
A | 1 | 1 | 9 |
B | 6 | 9 | 1 |
H | 7 | 2 | 3 |
D | 9 | 5 | 1 |
B | 6 | 9 | 1 |
H | 7 | 2 | 3 |
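`DataFrame.sample` does the same random draw more directly than `take` plus `randint`, including sampling with replacement. A sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(30).reshape(10, 3),
                  columns=['Python', 'Tensorflow', 'Keras'])

# Draw 20 rows with replacement; random_state makes the draw reproducible.
sampled = df.sample(n=20, replace=True, random_state=0)
```

`df.sample(frac=1)` is likewise a one-line alternative to `take(np.random.permutation(...))` for shuffling.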
df2 = pd.DataFrame(data = {'key':['a','b','a','b','c','b','c']})
df2
# one-hot encoding, a.k.a. dummy variables
# string data can be represented numerically after dummy encoding
pd.get_dummies(df2,prefix='',prefix_sep='') # 1 means present; 0 means absent
| a | b | c |
---|
0 | 1 | 0 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 1 | 0 |
4 | 0 | 0 | 1 |
5 | 0 | 1 | 0 |
6 | 0 | 0 | 1 |
Part 8: Data Reshaping
df
| Python | Tensorflow | Keras |
---|
A | 1 | 1 | 9 |
B | 6 | 9 | 1 |
C | 1 | 4 | 6 |
D | 9 | 5 | 1 |
E | 4 | 1 | 8 |
F | 2 | 5 | 7 |
H | 7 | 2 | 3 |
I | 7 | 8 | 9 |
J | 5 | 4 | 2 |
K | 7 | 0 | 7 |
df.T # transpose: rows become columns and columns become rows