(Batch) Linear Regression

Using the gradient descent method with Python3

In this post, we are going to have a look at a program written in Python3, using NumPy as our data processing library, to see how a (batch) linear regression using the gradient descent method is implemented.

I will explain how every part of the code works, part by part. At the end, I will attach a link to the complete code hosted on GitHub, along with the dataset used in the example.

We are going to use this formula to calculate the gradient.
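The formula appeared as an image in the original post; reconstructed here from the weight update in the calculateGradient function shown later, it is:

w = w + n * Sum_{i=1..N} ( y(i) - f(x(i)) ) * x(i)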
Here the vector x(i) is one data point, with N being the size of the data set. n (eta) is our learning rate. y(i) is the target output. f(x) is the linear function of the regression, defined as f(x) = Sum(w*x), where Sum is the sigma (summation) function. Also, we are going to consider the initial bias w0 = 0 and the intercept x0 = 1. All weights are initialized to 0.

In this implementation, we are using the Sum of Squared Errors (SSE) as the error calculation function.
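Written out (this matches the calculateSSE function shown later), the SSE over the N data points is:

SSE = Sum_{i=1..N} ( f(x(i)) - y(i) )^2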

Instead of minimizing the SSE to zero, we are going to measure the change in SSE at every iteration and compare it to a threshold provided before the program is executed. If the change in SSE falls below the threshold, the program exits.
In the program, we provide three inputs from the command line. They are:
1. threshold: the threshold that the change in error has to fall below before the algorithm terminates.
2. data: the location of the data file.
3. learningRate: the learning rate of the gradient descent approach.
Therefore, the program should be able to start like this:
python3 linearregr.py --data random.csv --learningRate 0.0001 --threshold 0.0001
One last thing before we dive into the code: the output of our program will look like this:

iteration_number,weight0,weight1,weight2,…,weightN,sum_of_squared_errors
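Since all weights start at 0, the first row written is always iteration 0 followed by the zeroed weights and the initial SSE; for a hypothetical two-feature dataset, a first row would look like:

0,0.0000,0.0000,0.0000,<initial SSE>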

The program consists of 6 parts and we are going to have a look at them one at a time.
The import statements
import argparse # to read inputs from command line
import csv # to read the input data set file
import numpy as np # to work with the data set

The code execution initializer block

# initialise the argument parser and read the arguments from the command line with the respective flags, then call the main() function
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--data", help="Data File")
    parser.add_argument("-l", "--learningRate", help="Learning Rate")
    parser.add_argument("-t", "--threshold", help="Threshold")
    main()

The main() function
def main():
    args = parser.parse_args()
    # save respective command line inputs into variables
    file, learningRate, threshold = args.data, float(args.learningRate), float(args.threshold)

    # read the csv file; the last column is the target output, separated from the input (X) as Y
    with open(file) as csvFile:
        reader = csv.reader(csvFile, delimiter=',')
        X = []
        Y = []
        for row in reader:
            X.append([1.0] + row[:-1])
            Y.append([row[-1]])

    # Convert data points into float and initialise weight vector with 0s.
    n = len(X)
    X = np.array(X).astype(float)
    Y = np.array(Y).astype(float)
    W = np.zeros(X.shape[1]).astype(float)
    # reshape the weights into a column vector to match the matrix dimensions needed for the dot product
    W = W.reshape(X.shape[1], 1).round(4)

    # Calculate the predicted output value
    f_x = calculatePredicatedValue(X, W)

    # Calculate the initial SSE
    sse_old = calculateSSE(Y, f_x)

    outputFile = 'solution_' + \
                 'learningRate_' + str(learningRate) + '_threshold_' \
                 + str(threshold) + '.csv'
    '''
        The output file is opened in writing mode and the data is written in the format mentioned in the post. After
        the first values are written, the gradient and updated weights are calculated using the calculateGradient
        function. An iteration variable is maintained to keep track of the number of times the batch linear regression
        executes before the change in SSE falls below the threshold value. In the infinite while loop, the predicted
        output value is calculated again and the new SSE value is calculated. If the absolute difference between the
        older SSE (from the previous iteration) and the newer SSE (from the current iteration) is greater than the
        threshold value, the above process is repeated: the iteration count is incremented by 1 and the current SSE
        is stored into the previous SSE. If the absolute difference falls below the threshold value, the loop breaks
        and the last output values are written to the file.
    '''
    with open(outputFile, 'w', newline='') as csvFile:
        writer = csv.writer(csvFile, delimiter=',', quoting=csv.QUOTE_NONE, escapechar='')
        writer.writerow([*[0], *["{0:.4f}".format(val) for val in W.T[0]], *["{0:.4f}".format(sse_old)]])

        gradient, W = calculateGradient(W, X, Y, f_x, learningRate)

        iteration = 1
        while True:
            f_x = calculatePredicatedValue(X, W)
            sse_new = calculateSSE(Y, f_x)

            if abs(sse_new - sse_old) > threshold:
                writer.writerow([*[iteration], *["{0:.4f}".format(val) for val in W.T[0]], *["{0:.4f}".format(sse_new)]])
                gradient, W = calculateGradient(W, X, Y, f_x, learningRate)
                iteration += 1
                sse_old = sse_new
            else:
                break
        writer.writerow([*[iteration], *["{0:.4f}".format(val) for val in W.T[0]], *["{0:.4f}".format(sse_new)]])
    print("Output File Name: " + outputFile)

The flow of the main() function is like this:
1. Save the respective command line inputs into variables.
2. Read the CSV file; the last column is the target output, which is separated from the input (stored as X) and stored as Y.
3. Convert the data points into float and initialize the weight vector with 0s.
4. Calculate the predicted output value using the calculatePredicatedValue function.
5. Calculate the initial SSE using the calculateSSE function.
6. Open the output file in writing mode and write the data in the format mentioned in the post. After the first values are written, calculate the gradient and updated weights using the calculateGradient function. An iteration variable is maintained to keep track of the number of times the batch linear regression executes before the change in SSE falls below the threshold value. In the infinite while loop, the predicted output value is calculated again and the new SSE value is calculated. If the absolute difference between the older SSE (from the previous iteration) and the newer SSE (from the current iteration) is greater than the threshold value, the above process is repeated: the iteration count is incremented by 1 and the current SSE is stored into the previous SSE. If the absolute difference falls below the threshold value, the loop breaks and the last output values are written to the file.

The calculatePredicatedValue() function
Here the predicted output is calculated by performing the dot product of the input matrix X and the weight matrix W.

def calculatePredicatedValue(X, W):
    # dot product of X (input) and W (weights) as numpy matrices; the result is the predicted output
    f_x = np.dot(X, W)
    return f_x
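As a quick sanity check, here is a minimal sketch of the shapes involved, with made-up example values rather than the post's dataset:

import numpy as np

def calculatePredicatedValue(X, W):
    f_x = np.dot(X, W)
    return f_x

# made-up example: 2 data points, each with the bias column x0 = 1.0 prepended
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 5.0]])
W = np.zeros((3, 1))                   # weights start as a (3, 1) column of 0s, as in the post
print(calculatePredicatedValue(X, W))  # prints a (2, 1) column of zeros: all-zero weights predict 0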

The calculateSSE() function
The SSE is calculated using the formula mentioned above.

def calculateSSE(Y, f_x):
    sse = np.sum(np.square(f_x - Y))
    return sse
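With all-zero initial weights every prediction is 0, so the initial SSE is simply the sum of squared targets. A minimal sketch with made-up values:

import numpy as np

def calculateSSE(Y, f_x):
    sse = np.sum(np.square(f_x - Y))
    return sse

# made-up targets; the predictions are all 0 because the weights start at 0
Y = np.array([[1.0], [2.0]])
f_x = np.zeros((2, 1))
print(calculateSSE(Y, f_x))  # 5.0, i.e. 1^2 + 2^2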

Now that the whole code is out there, let's have a look at the execution of the program.

Here is what the output looks like:

The final program

import argparse
import csv
import numpy as np


def main():
    args = parser.parse_args()
    # save respective command line inputs into variables
    file, learningRate, threshold = args.data, float(args.learningRate), float(args.threshold)

    # read the csv file; the last column is the target output, separated from the input (X) as Y
    with open(file) as csvFile:
        reader = csv.reader(csvFile, delimiter=',')
        X = []
        Y = []
        for row in reader:
            X.append([1.0] + row[:-1])
            Y.append([row[-1]])

    # Convert data points into float and initialise weight vector with 0s.
    n = len(X)
    X = np.array(X).astype(float)
    Y = np.array(Y).astype(float)
    W = np.zeros(X.shape[1]).astype(float)
    # reshape the weights into a column vector to match the matrix dimensions needed for the dot product
    W = W.reshape(X.shape[1], 1).round(4)

    # Calculate the predicted output value
    f_x = calculatePredicatedValue(X, W)

    # Calculate the initial SSE
    sse_old = calculateSSE(Y, f_x)

    outputFile = 'solution_' + \
                 'learningRate_' + str(learningRate) + '_threshold_' \
                 + str(threshold) + '.csv'
    '''
        The output file is opened in writing mode and the data is written in the format mentioned in the post. After
        the first values are written, the gradient and updated weights are calculated using the calculateGradient
        function. An iteration variable is maintained to keep track of the number of times the batch linear regression
        executes before the change in SSE falls below the threshold value. In the infinite while loop, the predicted
        output value is calculated again and the new SSE value is calculated. If the absolute difference between the
        older SSE (from the previous iteration) and the newer SSE (from the current iteration) is greater than the
        threshold value, the above process is repeated: the iteration count is incremented by 1 and the current SSE
        is stored into the previous SSE. If the absolute difference falls below the threshold value, the loop breaks
        and the last output values are written to the file.
    '''
    with open(outputFile, 'w', newline='') as csvFile:
        writer = csv.writer(csvFile, delimiter=',', quoting=csv.QUOTE_NONE, escapechar='')
        writer.writerow([*[0], *["{0:.4f}".format(val) for val in W.T[0]], *["{0:.4f}".format(sse_old)]])

        gradient, W = calculateGradient(W, X, Y, f_x, learningRate)

        iteration = 1
        while True:
            f_x = calculatePredicatedValue(X, W)
            sse_new = calculateSSE(Y, f_x)

            if abs(sse_new - sse_old) > threshold:
                writer.writerow([*[iteration], *["{0:.4f}".format(val) for val in W.T[0]], *["{0:.4f}".format(sse_new)]])
                gradient, W = calculateGradient(W, X, Y, f_x, learningRate)
                iteration += 1
                sse_old = sse_new
            else:
                break
        writer.writerow([*[iteration], *["{0:.4f}".format(val) for val in W.T[0]], *["{0:.4f}".format(sse_new)]])
    print("Output File Name: " + outputFile)


def calculateGradient(W, X, Y, f_x, learningRate):
    # (y - f(x)) * x, summed over all data points, gives the batch gradient
    gradient = (Y - f_x) * X
    gradient = np.sum(gradient, axis=0)
    # gradient = np.array([float("{0:.4f}".format(val)) for val in gradient])
    # update the weights: W = W + learningRate * gradient
    temp = np.array(learningRate * gradient).reshape(W.shape)
    W = W + temp
    return gradient, W


def calculateSSE(Y, f_x):
    sse = np.sum(np.square(f_x - Y))
    return sse


def calculatePredicatedValue(X, W):
    f_x = np.dot(X, W)
    return f_x


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-d", "--data", help="Data File")
    parser.add_argument("-l", "--learningRate", help="Learning Rate")
    parser.add_argument("-t", "--threshold", help="Threshold")
    main()

This post walks through the mathematical concepts involved in batch linear regression using gradient descent, taking the error function (in this case the Sum of Squared Errors) into account. Instead of minimizing the SSE to zero, which may not always be possible (the learning rate needs tuning), we saw how to make the linear regression converge with the help of a threshold value.

This program used numpy for processing the data but it can be done with basics of python without using numpy but it will require nested looping and hence the complexity will increase to O(nn). Anyhow, the arrays and matrices provided by numpy are more memory efficient. Also, if you are comfortable working with pandas you are encouraged to use that and try to implement the same program with it.
这个程序使用numpy处理数据,但它可以用python的基础来完成,而不使用numpy,但它需要嵌套的循环,因此复杂性将增加到O(n
n)。无论如何,numpy提供的数组和矩阵具有更高的内存效率。另外,如果你觉得和熊猫一起工作很舒服,你可以使用它,并尝试用它来实现同样的程序。
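For illustration, here is a minimal sketch of what the prediction step might look like without numpy, using the nested looping mentioned above (predictWithoutNumpy is a hypothetical helper, not part of the post's program):

def predictWithoutNumpy(X, W):
    # X is a list of n rows, each holding d values (bias 1.0 included);
    # W is a list of d single-element lists, mirroring the (d, 1) column vector used above
    f_x = []
    for row in X:                     # outer loop over the n data points
        total = 0.0
        for x_j, w_j in zip(row, W):  # inner loop over the d features
            total += x_j * w_j[0]
        f_x.append([total])
    return f_x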
