Machine Learning Concepts Seen Through a Simple Data Analysis (CSDN blog)

Article link: https://blog.csdn.net/blunt/article/details/73289719



Given a pile of data, can machine learning discover the pattern hidden in it? Let's run an experiment.

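First, a Python script generates the data. For every sample it produces three numbers: a random number, a term of an increasing sequence, and a term of an exponential sequence.
Adding the three gives a fourth number. The question is whether the machine, after training on 800 samples, can predict this sum for the 200 test samples.
Since the numbers are simply added together, a single-layer network is enough.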

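import numpy as np

num = 1000

a = np.random.rand(num)        # random numbers in [0, 1)
b = np.arange(num) * 0.1       # increasing sequence 0.0, 0.1, ..., 99.9
c = np.logspace(0, 2, num)     # exponential sequence from 10^0 to 10^2
d = a + b + c                  # target: the sum of the three

# One CSV row per sample: a, b, c, d (matching the four columns shown below)
for i in range(num):
    print("%f,%f,%f,%f" % (a[i], b[i], c[i], d[i]))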

The generated data looks like this:

0.919645,0.000000,1.000000,1.919645
0.944756,0.100000,1.004620,2.049377
0.099436,0.200000,1.009262,1.308698
0.113406,0.300000,1.013925,1.427332
0.655098,0.400000,1.018610,2.073709
......
0.932760,99.500000,98.172981,198.605743
0.558361,99.599998,98.626587,198.784943
0.476555,99.699997,99.082283,199.258835
0.585844,99.800003,99.540085,199.925934
0.869372,99.900002,100.000000,200.769379
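
The helper `import_data.readdata` used in the next listing is not shown in the post. The following is a minimal sketch of what such a loader might look like, assuming the four-column CSV above and a sequential (unshuffled) split. The name `read_data` and its behaviour are hypothetical, not the author's code.

```python
import io
import numpy as np

def read_data(source, train_ratio):
    """Hypothetical stand-in for import_data.readdata (not the author's code).

    Reads rows of "a,b,c,d", uses the first three columns as inputs and the
    last one as the target, and splits the rows in order: the first
    train_ratio fraction for training, the rest for testing.
    """
    data = np.loadtxt(source, delimiter=",", dtype=np.float32, ndmin=2)
    x, y = data[:, :3], data[:, 3:]          # y keeps the shape (N, 1)
    n_train = int(len(data) * train_ratio)
    return x[:n_train], y[:n_train], x[n_train:], y[n_train:]

# Usage with the first five rows shown above (a file path works the same way)
sample = io.StringIO(
    "0.919645,0.000000,1.000000,1.919645\n"
    "0.944756,0.100000,1.004620,2.049377\n"
    "0.099436,0.200000,1.009262,1.308698\n"
    "0.113406,0.300000,1.013925,1.427332\n"
    "0.655098,0.400000,1.018610,2.073709\n"
)
x_train, y_train, x_test, y_test = read_data(sample, 0.8)
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```

With the full 1000-row file and a ratio of 0.8, this kind of split yields the 800 training rows and 200 test rows mentioned earlier.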

The TensorFlow code is straightforward:

import numpy as np
import tensorflow as tf            # TensorFlow 1.x API (placeholders and sessions)
import matplotlib.pyplot as plt
import import_data                 # the author's helper module, not shown in the post

learning_rate = 0.01
training_epochs = 1000
display_step = 50
TrainRatio = 0.8

# datefile is the path to the generated data file; the post does not define it.
x_train, y_train, x_test, y_test = import_data.readdata(datefile, TrainRatio)

X = tf.placeholder(tf.float32, [None, 3], name="X")
Y = tf.placeholder(tf.float32, [None, 1], name="Y")
with tf.name_scope('hidden') as scope:
    # A single linear layer: model = X * W + b
    W = tf.Variable(tf.random_normal([3, 1]), name="weights")
    b = tf.Variable(tf.random_normal([1]), name="bias")
    model = tf.matmul(X, W) + b

# NOTE: the post does not show how `loss` and `train` are defined. Both must
# exist before the session starts; `train` is an optimizer's minimize(loss) op
# (later in the post it is switched to tf.train.AdamOptimizer).

with tf.Session() as sess:
    init = tf.global_variables_initializer()
    sess.run(init)

    # The loss is recorded once every display_step epochs (at epochs 50, 100, ...)
    lossindex = np.arange(1, training_epochs // display_step + 1) * display_step
    lossarray = np.empty([training_epochs // display_step], dtype=np.float32)
    for epoch in range(training_epochs):
        sess.run(train, feed_dict={X: x_train, Y: y_train})
        if (epoch + 1) % display_step == 0:
            c = sess.run(loss, feed_dict={X: x_train, Y: y_train})
            mw, mb = sess.run([W, b])
            lossarray[(epoch + 1) // display_step - 1] = c
            print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(c), "W=", mw, "b=", mb)

    # Predictions for the training and test samples
    trainxindex = np.arange(x_train.shape[0])
    testxindex = np.arange(x_train.shape[0], x_train.shape[0] + x_test.shape[0])
    trainypred = sess.run(model, feed_dict={X: x_train, Y: y_train})
    testypred = sess.run(model, feed_dict={X: x_test, Y: y_test})

    # Left panel: actual values (lines) against predictions (dots)
    ax1 = plt.subplot(121)
    ax1.plot(trainxindex, y_train.reshape(y_train.shape[0]), 'r', label='TrainY')
    ax1.plot(trainxindex, trainypred.reshape(trainypred.shape[0]), 'ro', label='TrainYPred')
    ax1.plot(testxindex, y_test.reshape(y_test.shape[0]), 'b', label='TestY')
    ax1.plot(testxindex, testypred.reshape(testypred.shape[0]), 'bo', label='TestYPred')
    ax1.legend()

    # Right panel: loss over the course of training
    ax2 = plt.subplot(122)
    ax2.plot(lossindex, lossarray, 'r', label='LossChange')
    ax2.legend()
    plt.show()
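
The listing uses `loss` and `train` without showing their definitions. For orientation, the single layer computes a plain linear map of the three inputs. A common loss for this kind of regression is the mean squared error; that choice is an assumption here, since the post does not say which loss was used.

```latex
\hat{Y} = XW + b, \qquad
X \in \mathbb{R}^{N \times 3},\; W \in \mathbb{R}^{3 \times 1},\; b \in \mathbb{R}

% Assumed loss (mean squared error); the post does not show its actual definition
\mathcal{L}(W, b) = \frac{1}{N} \sum_{n=1}^{N} \bigl(\hat{y}_n - y_n\bigr)^2
```

Because the data is generated as d = a + b + c, a network that has learned the true relationship should end up with all three weights close to 1 and a bias close to 0.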

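Here is the result after 1000 epochs of training:
[Figure: fit (left) and loss curve (right) after 1000 epochs]
The left plot shows how well the predictions match the data.
The horizontal axis is the sample index and the vertical axis is the data value. Red is the training data and blue is the test data.
Lines are the actual values and dots are our predictions.
The right plot shows the loss.
The horizontal axis is the training epoch and the vertical axis is the loss. As training proceeds, the loss does indeed decrease.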

Did the network actually conclude that the result is the sum of the three? Here is the log output:

Epoch: 0900 cost= 0.038475495 W= [[ 1.52011728]  [-0.11934275]  [ 1.32213342]] b= [-0.16419499]
Epoch: 0950 cost= 0.036056194 W= [[ 1.52403808]  [-0.08816025]  [ 1.31231415]] b= [-0.15297912]
Epoch: 1000 cost= 0.033793017 W= [[ 1.52779269]  [-0.05798531] [ 1.30281508]] b= [-0.14216207]

d = 1.528*a - 0.058*b + 1.303*c - 0.142

What on earth is this?
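
As a sanity check that the relationship can be recovered from the data at all, a closed-form least-squares fit can serve as a reference. This is a sketch using NumPy's `np.linalg.lstsq` on data generated the same way as above; it is not part of the original experiment, and the fixed seed is an arbitrary choice.

```python
import numpy as np

num = 1000
rng = np.random.RandomState(0)             # arbitrary seed, for reproducibility
a = rng.rand(num)
b = np.arange(num) * 0.1
c = np.logspace(0, 2, num)
d = a + b + c

n_train = int(num * 0.8)                   # first 800 rows, as in the post
# Design matrix with a trailing column of ones so that the bias is fitted too
A = np.column_stack([a, b, c, np.ones(num)])[:n_train]
coef, residuals, rank, sv = np.linalg.lstsq(A, d[:n_train], rcond=None)

w, bias = coef[:3], coef[3]
print("w =", w, "bias =", bias)
```

Since d is exactly a + b + c, the fit returns weights of 1 and a bias of 0 up to floating-point error, which suggests the odd coefficients in the log reflect unfinished training and not something in the data.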

Perhaps there were not enough iterations? Let's try ten thousand.
[Figure: fit and loss curve after 10000 epochs]

Epoch: 9900 cost= 0.000012248 W= [[ 1.10678041]  [ 1.00669312]  [ 0.99822539]] b= [-0.06418084]
Epoch: 9950 cost= 0.000012226 W= [[ 1.10668504]  [ 1.00671101] [ 0.99822122]] b= [-0.06414843]
Epoch: 10000 cost= 0.000012205 W= [[ 1.10658967] [ 1.00672889]  [ 0.99821711]] b= [-0.06411602]

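This looks more like it: d = 1.107*a + 1.007*b + 0.998*c - 0.064

But wait, a parameter fit this simple takes ten thousand epochs of training? That is far too much effort!
All right, let's switch to a different optimizer: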

train = tf.train.AdamOptimizer(learning_rate).minimize(loss)

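Let's see how it does now. This is the result after 3400 epochs, and the loss is also very low.
[Figure: fit and loss curve after 3400 epochs with the Adam optimizer]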

The final result is: d = 1.019*a + 1.018*b + 0.9962889*c - 0.0483
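
One reason Adam copes better with inputs of very different magnitude is that it rescales each parameter's step by a running estimate of that parameter's gradient size. Below is a minimal NumPy sketch of the standard Adam update. `adam_step` is a hypothetical helper written for illustration (TensorFlow's implementation differs in small details, such as how epsilon is applied), and the gradient values are made up to mimic parameters whose gradients differ widely in scale.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (illustrative helper, not TensorFlow's code)."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of the gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # running mean of the squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for step t
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.zeros(3)
grad = np.array([0.01, 10.0, 1000.0])            # made-up gradients of very different scale
m, v = np.zeros(3), np.zeros(3)

gd_step = -0.01 * grad                           # plain gradient descent, same learning rate
theta, m, v = adam_step(theta, grad, m, v, t=1)
print("gradient descent step:", gd_step)
print("Adam step:            ", theta)
```

On the first step Adam moves every parameter by roughly the learning rate, while the plain gradient descent steps differ by five orders of magnitude.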

Is there anything else that can make learning more efficient? First look at the result below:
[Figure: fit and loss curve]

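Looks good, doesn't it? If we examine the input data closely, we find that the values in the first column are very small while those in the second and third columns are very large. This makes training extremely unbalanced: a tiny change in w2 or w3 affects the output as much as a very large change in w1, so no single learning rate suits all three weights.
So if we standardize the columns first, training becomes much easier.
Standardizing means rescaling each column so that its mean is 0 and its variance is 1.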
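
The effect can be reproduced outside TensorFlow. The sketch below runs plain gradient descent on a mean-squared-error loss for the same number of steps, once on the raw inputs and once on standardized inputs. The loss, the step count, and the learning rates (each picked small enough to be stable for its inputs) are assumptions made for this illustration, not the settings used in the post.

```python
import numpy as np

def gd_final_loss(x, y, lr, steps):
    """Plain gradient descent on MSE for y ≈ x @ w + b; returns the final loss."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        err = x @ w + b - y
        w -= lr * 2.0 * (x.T @ err) / len(y)     # gradient of the loss w.r.t. w
        b -= lr * 2.0 * err.mean()               # gradient of the loss w.r.t. b
    return np.mean((x @ w + b - y) ** 2)

num = 1000
rng = np.random.RandomState(0)                   # arbitrary seed
x = np.column_stack([rng.rand(num), np.arange(num) * 0.1, np.logspace(0, 2, num)])
y = x.sum(axis=1)
x_train, y_train = x[:800], y[:800]

# Standardize each column with the statistics of the training data
x_std = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)

loss_raw = gd_final_loss(x_train, y_train, lr=1e-4, steps=1000)   # much larger rates diverge
loss_std = gd_final_loss(x_std, y_train, lr=0.1, steps=1000)
print("raw inputs:         ", loss_raw)
print("standardized inputs:", loss_std)
```

With standardized inputs the loss falls to practically zero within the 1000 steps, while on the raw inputs it is still far from converged, because the large columns force a learning rate that is too small for the weight of the first column.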

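def StdMatrix(dataMat):
    # Per-column mean and standard deviation of the data
    means = np.mean(dataMat, axis=0)
    meansub = dataMat - means

    std = meansub.std(0)
    # 2 x n_columns array: row 0 holds the means, row 1 the standard deviations
    return np.concatenate((means.reshape([1, len(means)]), std.reshape([1, len(std)])), axis=0)

def GetStdData(trainxmeta, x_train, x_test):
    # Standardize both sets with the same statistics (the output of StdMatrix)
    x_train_std = (x_train - trainxmeta[0, :]) / trainxmeta[1, :]
    x_test_std = (x_test - trainxmeta[0, :]) / trainxmeta[1, :]
    return x_train_std, x_test_std

def RestorePara(metadata, w, b):
    # Convert w and b learned on standardized inputs back to the original scale:
    # sum(w * (x - mean) / std) + b == sum((w / std) * x) + (b - sum(w * mean / std))
    w = w.reshape([len(w)])
    neww = w / metadata[1, :]
    newb = b - np.sum(w * metadata[0, :] / metadata[1, :])
    return neww, newb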

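In the functions above, StdMatrix first computes the mean and standard deviation of each column, GetStdData then rescales the data, and finally RestorePara converts the trained w and b back to the original scale of the data.
In this run, the w and b learned on the standardized data were:

[ 0.29941723  2.33080626  10.3090601 ]  [ 15.039711]

and after conversion back to the original scale they are:

[ 1.00500536  1.00934756  0.9978804 ]  [-0.01703835]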

The key output of StdMatrix (first row: column means, second row: column standard deviations) is:

[[  0.50496942   3.95000005  10.58476448]
 [  0.29792601   2.30922055  10.33095741]]
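
The conversion in RestorePara follows from expanding the model on standardized inputs: w·(x − mean)/std + b = (w/std)·x + (b − Σ w·mean/std). As a quick check, the sketch below plugs the numbers printed above into the same formulas; the variable names are chosen for this example only.

```python
import numpy as np

# Values copied from the post
w_std = np.array([0.29941723, 2.33080626, 10.3090601])    # weights learned on standardized inputs
b_std = np.array([15.039711])                             # bias learned on standardized inputs
means = np.array([0.50496942, 3.95000005, 10.58476448])   # first row of the StdMatrix output
stds = np.array([0.29792601, 2.30922055, 10.33095741])    # second row of the StdMatrix output

# Same arithmetic as RestorePara
w_orig = w_std / stds
b_orig = b_std - np.sum(w_std * means / stds)

print(w_orig)   # compare with [ 1.00500536  1.00934756  0.9978804 ] from the post
print(b_orig)   # compare with [-0.01703835] from the post
```

The weights learned on standardized inputs are close to the column standard deviations, which is what a true weight of 1 per column looks like after rescaling.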

So when analyzing data, preprocessing the raw data really matters: the number of training epochs dropped straight from 10000 to 3000.