PaddlePaddle is Baidu's open-source deep learning framework. Like Caffe, it builds deep neural networks by stacking layers; a new Fluid release is also in the works that will offer operator-level network construction. I recently had a text classification task and experimented with Paddle. My impressions:
- the documentation is incomplete and rather sparse;
- the model zoo is good: even for unfamiliar APIs, you can search its code for usage examples;
- GitHub issues get answered promptly.
Paddle is clearly being actively promoted and developed; if anyone from the Paddle team reads this, please flesh out the documentation.
My input data has two parts:
- a user profile, including gender, age, occupation, and similar attributes;
- the user's search-query word list, kept in time order, after word segmentation and stop-word filtering.
After running the categorical and word features through StringIndexer (done in Spark), the input data looks like this:
```
1	0	6	7	3	1	11069|36027|15862|11069|48152|36027|11069|33830|48152|36027|11069|50730|11069|50730|11069|47002	1
1	0	6	7	3	1	62151|21292|21666|53679|21292|21666|34384|26807|53680|381|2992|64045|2992|69922|62902|3346	0
```
Each single-number column is a profile attribute category, the pipe-separated number list is the word list, and the last column is the classification target.
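To make the format concrete, here is a minimal plain-Python sketch (with an abbreviated word list) of how one such tab-separated line decomposes into profile fields, word list, and label:

```python
# Parse one tab-separated line of the format above (word list abbreviated).
line = "1\t0\t6\t7\t3\t1\t11069|36027|15862\t1"
fields = line.split("\t")
profile = [int(float(x)) for x in fields[:6]]   # gender, age, lifeStage, trade, educationalLevel, job
words = [int(x) for x in fields[6].split("|")]  # time-ordered word id list
label = int(float(fields[7]))                   # classification target
sample = profile + [words, label]
print(sample)  # [1, 0, 6, 7, 3, 1, [11069, 36027, 15862], 1]
```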
The implementation has three parts: the data reader, the training code, and the inference code.
Implementing the data reader
```python
# coding:utf8
"""
Reader implementation required by paddle.
It covers:
1. reading the whole dataset;
2. splitting it into a training set and a test set;
3. providing reader() functions that paddle can consume directly, yielding one sample at a time.
"""
import random


def read_datas():
    """
    Read all samples.
    :return:
    """
    fpath = "./inputdatas_numbers.txt"
    results_datas = []
    for line in open(fpath):
        line = line[:-1]
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) != 8:
            continue
        # unpack the attributes
        gender, age, lifeStage, trade, educationalLevel, job, words, label = fields
        results_datas.append([
            int(float(gender)),
            int(float(age)),
            int(float(lifeStage)),
            int(float(trade)),
            int(float(educationalLevel)),
            int(float(job)),
            # note: paddle sequence data is simply an element that is itself a list
            [int(x) for x in words.split("|")],
            int(float(label))
        ])
    return results_datas


def split_data_train_test(results_datas, rand_seed=37, test_ratio=0.1):
    """
    Split into training and test data by random sampling.
    :param results_datas:
    :param rand_seed:
    :param test_ratio:
    :return:
    """
    rand = random.Random(x=rand_seed)
    train_data, test_data = [], []
    for line in results_datas:
        if rand.random() > test_ratio:
            train_data.append(line)
        else:
            test_data.append(line)
    return train_data, test_data


def split_data_train_test_avg(results_datas, test_ratio=0.03):
    """
    Sort by label in descending order, then sample uniformly.
    :param results_datas:
    :param test_ratio:
    :return:
    """
    total_len = len(results_datas)
    test_datas_cnt = total_len * test_ratio
    sample_gap = int(total_len * 1.0 / test_datas_cnt)
    # sort by the label column (index 7), descending
    sort_datas = sorted(results_datas, key=lambda x: int(x[7]), reverse=True)
    train_data, test_data = [], []
    for i in range(len(sort_datas)):
        if i % sample_gap == 0:
            test_data.append(sort_datas[i])
        else:
            train_data.append(sort_datas[i])
    return train_data, test_data


results_datas = read_datas()
print "data loaded:", len(results_datas)
train_data, test_data = split_data_train_test_avg(results_datas, 0.03)
print "data read over."
print "training set size:", len(train_data)
print "test set size:", len(test_data)


def train_reader():
    """
    Required by paddle: yields training samples.
    :return:
    """
    for line in train_data:
        yield line


def test_reader():
    """
    Required by paddle: yields test samples.
    :return:
    """
    for line in test_data:
        yield line


if __name__ == "__main__":
    print len(train_data)
    print len(test_data)
```
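The reader functions above are plain generators; during training they get wrapped by `paddle.reader.shuffle` and `paddle.batch`. As a rough pure-Python sketch of what the batching wrapper does (an assumption about its behavior, not paddle's actual implementation):

```python
def batch(reader, batch_size):
    # Wrap a sample-level reader into a batch-level reader, paddle.batch-style.
    def batched():
        buf = []
        for sample in reader():
            buf.append(sample)
            if len(buf) == batch_size:
                yield buf
                buf = []
        if buf:  # emit the final partial batch
            yield buf
    return batched

def toy_reader():
    for i in range(5):
        yield i

print(list(batch(toy_reader, 2)()))  # [[0, 1], [2, 3], [4]]
```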
The key thing to note is the meaning of paddle's sequence data (sequence_data): it is simply a list, so the query word list must be shaped as the inner list in a sample like [x, y, list(), z].
The script's output:
```
data loaded: 20088
data read over.
training set size: 19479
test set size: 609
19479
609
[1, 0, 6, 7, 3, 1, [62151, 21292, 21666, 53679, 21292, 21666, 34384, 26807, 53680, 381, 2992, 64045, 2992, 69922, 62902, 3346], 0]
```
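The split sizes printed above can be checked by hand: with 20088 samples and test_ratio = 0.03, sample_gap comes out to int(20088 / 602.64) = 33, and every 33rd sample goes to the test set:

```python
def split_sizes(total_len, test_ratio):
    # Reproduce the arithmetic of split_data_train_test_avg.
    test_cnt_target = total_len * test_ratio
    sample_gap = int(total_len * 1.0 / test_cnt_target)
    test_cnt = sum(1 for i in range(total_len) if i % sample_gap == 0)
    return total_len - test_cnt, test_cnt

print(split_sizes(20088, 0.03))  # (19479, 609)
```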
Implementing the training algorithm
```python
# coding:utf8
"""
Build a deep network with paddle.
"""
import os
import sys

import paddle.v2 as paddle

import reader_paddle_sequence

# this dict describes the input layout; e.g. fields[feeding["gender"]] gives the gender value
feeding = {
    'gender': 0,
    'age': 1,
    'lifeStage': 2,
    'trade': 3,
    'educationalLevel': 4,
    'job': 5,
    'words': 6,
    'label': 7
}


def convr_perceptron():
    """
    Assemble the convolutional network.
    :return:
    """
    # convolution layers over the word sequence
    conv1, conv2 = get_words_conv()
    # fully connected layer over the profile features
    features_fc = get_usr_combined_features()
    # concatenate the conv layers and the profile layer
    concat_layer = paddle.layer.concat(input=[features_fc, conv1, conv2])
    # add a dropout layer to reduce overfitting
    dropout_layer = paddle.layer.dropout(input=concat_layer, dropout_rate=0.6)
    # final classification layer: softmax over 6 classes
    predict = paddle.layer.fc(input=dropout_layer,
                              size=6,
                              act=paddle.activation.Softmax())
    return predict


def get_usr_combined_features():
    """
    Profile features, each fed through its own FC layer.
    :return:
    """
    gender = paddle.layer.data(name='gender', type=paddle.data_type.integer_value(2))
    gender_emb = paddle.layer.embedding(input=gender, size=16)
    gender_fc = paddle.layer.fc(input=gender_emb, size=16)

    age = paddle.layer.data(name='age', type=paddle.data_type.integer_value(6))
    age_emb = paddle.layer.embedding(input=age, size=16)
    age_fc = paddle.layer.fc(input=age_emb, size=16)

    lifeStage = paddle.layer.data(name='lifeStage', type=paddle.data_type.integer_value(10))
    lifeStage_emb = paddle.layer.embedding(input=lifeStage, size=16)
    lifeStage_fc = paddle.layer.fc(input=lifeStage_emb, size=16)

    trade = paddle.layer.data(name='trade', type=paddle.data_type.integer_value(23))
    trade_emb = paddle.layer.embedding(input=trade, size=16)
    trade_fc = paddle.layer.fc(input=trade_emb, size=16)

    educationalLevel = paddle.layer.data(name='educationalLevel', type=paddle.data_type.integer_value(4))
    educationalLevel_emb = paddle.layer.embedding(input=educationalLevel, size=16)
    educationalLevel_fc = paddle.layer.fc(input=educationalLevel_emb, size=16)

    job = paddle.layer.data(name='job', type=paddle.data_type.integer_value(6))
    job_emb = paddle.layer.embedding(input=job, size=16)
    job_fc = paddle.layer.fc(input=job_emb, size=16)

    usr_combined_features = paddle.layer.fc(
        input=[gender_fc, age_fc, lifeStage_fc, trade_fc, educationalLevel_fc, job_fc],
        size=200,
        act=paddle.activation.Tanh())
    return usr_combined_features


def get_words_conv():
    """
    Word sequence input, fed into convolution layers.
    :return:
    """
    # vocabulary size, taken from the size of the word index list built during segmentation
    word_dict_len = 73614
    emb_dim = 8
    hid_dim = 256
    # note integer_value_sequence: the input looks like [1, 2, 3, 4]
    words = paddle.layer.data(name='words',
                              type=paddle.data_type.integer_value_sequence(word_dict_len))
    words_emb = paddle.layer.embedding(input=words, size=emb_dim)
    # build the conv network; there can be several conv layers
    conv1 = paddle.networks.sequence_conv_pool(input=words_emb, context_len=3, hidden_size=hid_dim)
    conv2 = paddle.networks.sequence_conv_pool(input=words_emb, context_len=4, hidden_size=hid_dim)
    return conv1, conv2


def train():
    """
    Run the training.
    """
    # initialize paddle
    paddle.init(use_gpu=False, trainer_count=1)
    # network config
    y = paddle.layer.data(name='label', type=paddle.data_type.integer_value(6))
    # get the network prediction
    y_predict = convr_perceptron()
    # use classification cost as the training objective
    cost = paddle.layer.classification_cost(input=y_predict, label=y)
    # randomly initialize the parameters
    parameters = paddle.parameters.create(cost)
    # create the optimizer: mainly sets L2 regularization and the learning rate
    adam_optimizer = paddle.optimizer.Adam(
        learning_rate=2e-4,
        regularization=paddle.optimizer.L2Regularization(rate=0.9),
        model_average=paddle.optimizer.ModelAverage(average_window=0.5,
                                                    max_average_window=10000))
    # use SGD as the trainer
    trainer = paddle.trainer.SGD(cost=cost,
                                 parameters=parameters,
                                 update_equation=adam_optimizer)
    # save the topology; it will be used later for inference
    inference_topology = paddle.topology.Topology(layers=y_predict)
    with open("inference_topology_conv.pkl", 'wb') as f:
        inference_topology.serialize_for_inference(f)

    # record train/test error per pass, for later curve plotting
    fout_pass_err = open("train_pass_error_conv.txt", "w")
    fout_pass_err.write("passid\ttest_data_accurcy\ttrain_data_accurcy\n")

    # log progress and checkpoint the parameters
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
                print "\nPass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)
            else:
                sys.stdout.write('.')
                sys.stdout.flush()
        if isinstance(event, paddle.event.EndPass):
            with open('./params_pass_conv_%d.tar' % event.pass_id, 'w') as f:
                trainer.save_parameter_to_tar(f)
            result_test = trainer.test(
                reader=paddle.batch(
                    paddle.reader.shuffle(reader_paddle_sequence.test_reader, buf_size=50000),
                    batch_size=100),
                feeding=feeding)
            print "\nTest with Pass %d, %s" % (
                event.pass_id, result_test.metrics["classification_error_evaluator"])
            result_train = trainer.test(
                reader=paddle.batch(
                    paddle.reader.shuffle(reader_paddle_sequence.train_reader, buf_size=50000),
                    batch_size=100),
                feeding=feeding)
            print "\nTrain with Pass %d, %s" % (
                event.pass_id, result_train.metrics["classification_error_evaluator"])
            fout_pass_err.write("%s\t%s\t%s\n" % (
                str(event.pass_id),
                str(float(result_test.metrics["classification_error_evaluator"])),
                str(float(result_train.metrics["classification_error_evaluator"]))))
            fout_pass_err.flush()

    # run the training
    trainer.train(
        reader=paddle.batch(
            paddle.reader.shuffle(reader_paddle_sequence.train_reader, buf_size=50000),
            batch_size=100),
        feeding=feeding,
        event_handler=event_handler,
        num_passes=300)
    fout_pass_err.flush()
    fout_pass_err.close()


if __name__ == '__main__':
    train()
```
After training, the following files appear in the current directory:
```
-rw-r--r--  1 baidu  staff  2.4M May 22 16:06 params_pass_conv_0.tar
-rw-r--r--  1 baidu  staff  2.4M May 22 16:06 params_pass_conv_1.tar
-rw-r--r--  1 baidu  staff  2.4M May 22 16:07 params_pass_conv_2.tar
-rw-r--r--  1 baidu  staff  6.0K May 17 16:14 inference_topology.pkl
```
The training run also prints progress information:
```
I0523 15:53:02.666565 2921214848 Util.cpp:166] commandline:  --use_gpu=False --trainer_count=1
I0523 15:53:02.690153 2921214848 GradientMachine.cpp:94] Initing parameters..
I0523 15:53:02.708933 2921214848 GradientMachine.cpp:101] Init parameters done.
Pass 0, Batch 0, Cost 1.769033, {'classification_error_evaluator': 0.7699999809265137}
...................................................................................................
Pass 0, Batch 100, Cost 1.773714, {'classification_error_evaluator': 0.8100000023841858}
................................................................................................
Test with Pass 0, 0.750410497189
Train with Pass 0, 0.740233063698
Pass 1, Batch 0, Cost 1.779777, {'classification_error_evaluator': 0.75}
...................................................................................................
Pass 1, Batch 100, Cost 1.677371, {'classification_error_evaluator': 0.7200000286102295}
.............................................................................................
Test with Pass 1, 0.610837459564
Train with Pass 1, 0.5671235919
```
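If you want to track the cost curve from these logs, the Pass/Batch lines can be parsed with a small regex (a sketch; the line below is copied from the log above):

```python
import re

# One "Pass ..., Batch ..., Cost ..." line from the training log.
line = "Pass 0, Batch 100, Cost 1.773714, {'classification_error_evaluator': 0.8100000023841858}"
m = re.match(r"Pass (\d+), Batch (\d+), Cost ([\d.]+)", line)
pass_id, batch_id, cost = int(m.group(1)), int(m.group(2)), float(m.group(3))
print(pass_id, batch_id, cost)  # 0 100 1.773714
```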
I compared GPU and CPU runs: the GPU really is many times faster; deep learning earns its reputation as a GPU-based technology.
You can also plot the accuracy curves for the training and test sets:
Around pass 12 the model reaches a local optimum and then starts to overfit; the best overall accuracy was 88%.
After repeatedly tuning dropout and the L2 regularization strength, the accuracy stayed at that level, so I stopped tuning; improving it further will require collecting more data.
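The per-pass file train_pass_error_conv.txt written by the training script makes it easy to locate the best checkpoint. A sketch with illustrative rows (note the file stores classification error, despite the "accurcy" header):

```python
# Illustrative rows in the train_pass_error_conv.txt format: passid, test error, train error.
rows = [
    "0\t0.750410\t0.740233",
    "1\t0.610837\t0.567123",
    "12\t0.120000\t0.080000",
    "13\t0.150000\t0.050000",
]
best = min(rows, key=lambda r: float(r.split("\t")[1]))
best_pass = int(best.split("\t")[0])
print(best_pass)  # pass with the lowest test error: 12
```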
Using the model for prediction
Now that training is done, how do we use the model? See the code:
```python
# coding:utf8
"""
Use the trained model for prediction.
"""
import os
import sys

import paddle.v2 as paddle

import reader_paddle_sequence

# must match the feeding used during training
feeding = {
    'gender': 0,
    'age': 1,
    'lifeStage': 2,
    'trade': 3,
    'educationalLevel': 4,
    'job': 5,
    'words': 6,
    'label': 7
}


def test():
    paddle.init(use_gpu=False, trainer_count=1)
    # load the best parameter file
    tarfn = "params_pass_conv_1.tar"
    # load the model topology file
    topology_filepath = "inference_topology_conv.pkl"
    # load the parameters and topology into an inference object
    with open(tarfn) as param_f, open(topology_filepath) as topo_f:
        params = paddle.parameters.Parameters.from_tar(param_f)
        inferer = paddle.inference.Inference(parameters=params, fileobj=topo_f)

    # run inference on one sample from the test set
    # this also shows that inference input must be preprocessed exactly like the train/test data
    reader = reader_paddle_sequence.test_reader
    for k in reader():
        print k[:-1]
        res = inferer.infer(input=[k, ], feeding=feeding)
        print res
        break


if __name__ == '__main__':
    test()
```
The output:
```
I0523 16:00:56.536753 2921214848 Util.cpp:166] commandline:  --use_gpu=False --trainer_count=1
[1, 0, 6, 7, 3, 1, [11069, 36027, 15862, 11069, 48152, 36027, 11069, 33830, 48152, 36027, 11069, 50730, 11069, 50730, 11069, 47002]]
[[ 0.25260347  0.15734845  0.1648475   0.17209302  0.14046918  0.11263847]]
```
The last line prints the predicted probabilities for the 6 classes.
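To turn that probability row into a class label, take the argmax:

```python
# Probability row copied from the inference output above.
probs = [0.25260347, 0.15734845, 0.1648475, 0.17209302, 0.14046918, 0.11263847]
pred = max(range(len(probs)), key=lambda i: probs[i])
print(pred)  # class 0, with probability ~0.25
```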
Summary
The above walks through the whole flow of building a neural network with paddle: reading data, building the network, training, and applying the model.
A trained model can be loaded with Python Flask or Django to serve remote prediction calls.
Once one network is built end to end, paddle becomes much easier to understand. I also tried fully connected and LSTM networks; they are very similar to the convolutional one, with only the convolution part needing replacement.
Although paddle is not yet mature, it is a home-grown deep learning framework and it got the job done, so it deserves continued support.
Paddle links:
- Latest documentation: http://www.paddlepaddle.org/docs/develop/documentation/zh/getstarted/index_cn.html
- Open model zoo: https://github.com/PaddlePaddle/models
- Tutorials: https://github.com/PaddlePaddle/book
- Chinese book: http://www.paddlepaddle.org/docs/develop/book/04.word2vec/index.cn.html
Original post: http://www.crazyant.net/2177.html — please credit the source when republishing.