HOW TO NORMALISE FEATURE VECTORS

I was trying to create a sample file for training a neural network and ran into a common problem: the feature values are all over the place. In this example I’m working with real-world demographic values for countries. For example, a feature for GDP per person in a country ranges from 551.27 to 88286.0, whereas estimates for corruption range between -1.56 and 2.42. This can be very confusing for machine learning algorithms, as they can end up treating bigger values as more important signals.

To handle this issue, we want to scale all the feature values into roughly the same range. We can do this by taking each feature value, subtracting the mean of that feature (thereby shifting the mean to 0), and dividing by its standard deviation (thereby scaling the spread to 1). This is a piece of code I’ve implemented a number of times for various projects, so it’s time to write a nice reusable script. Hopefully it can be helpful for others as well. I chose to do this in Python, as it’s easier to run than C++ or Java (it doesn’t need to be compiled), and it has better support for real-valued numbers than bash scripting.
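
As a minimal sketch of that computation for a single feature, assuming numpy is installed: the function name `standardise` and the sample values are illustrative only (the two extremes are the GDP range quoted above, the middle values are made up).

```python
import numpy

def standardise(values):
    """Shift a feature to mean 0 and scale it to standard deviation 1."""
    values = numpy.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

# Illustrative GDP-per-person values; only the first and last come from the text.
gdp = [551.27, 12000.0, 45000.0, 88286.0]
scaled = standardise(gdp)
print(scaled)
```

After this step the values of the feature have mean 0 and standard deviation 1, whatever their original range was.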

Each line in the input file is assumed to be a feature vector, with values separated by whitespace. The first element is an integer class label that will be left untouched. This is followed by a number of floating point feature values which will be normalised. For example:

1 0.563 13498174.2 -21.3
0 0.114 42234434.3 15.67

We’re assuming dense vectors, meaning that each line has an equal number of features.
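
The format described above could be parsed roughly like this. It is only a sketch: `parse_line` and its `expected_features` parameter are hypothetical helpers, not part of the script further down, and the check on the feature count is one possible way to enforce the dense-vector assumption.

```python
def parse_line(line, expected_features=None):
    """Split one whitespace-separated line into (label, features)."""
    fields = line.split()
    label = int(fields[0])  # the first element is the integer class label
    features = [float(field) for field in fields[1:]]
    # Dense vectors: every line should carry the same number of features.
    if expected_features is not None and len(features) != expected_features:
        raise ValueError(
            "expected %d features, got %d" % (expected_features, len(features))
        )
    return label, features

label, features = parse_line("1 0.563 13498174.2 -21.3")
print(label, features)
```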

To execute it, pipe the input file through the script:

python feature-normaliser.py < in.txt > out.txt

The complete script that will normalise feature vectors is here:

import fileinput
import sys

import numpy

# data[col] holds every value of column col; column 0 is the class label.
data = []
linecount = 0
for line in fileinput.input():
    if line.strip():
        for index, value in enumerate(line.split()):
            if linecount == 0:
                data.append([])
            if index == 0:
                data[index].append(int(value))
            else:
                data[index].append(float(value))
        linecount += 1

# Compute the mean and standard deviation of each column once, up front.
means = [numpy.mean(column) for column in data]
stds = [numpy.std(column) for column in data]

for row in range(linecount):
    for col in range(len(data)):
        if col == 0:
            # The class label is written out untouched.
            sys.stdout.write(str(data[col][row]))
        else:
            val = (data[col][row] - means[col]) / stds[col]
            sys.stdout.write("\t" + str(val))
    sys.stdout.write("\n")
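
For comparison, the same normalisation can be written with numpy array operations instead of explicit loops. This is a sketch rather than a drop-in replacement: `normalise_rows` is a hypothetical name, and the choice to leave constant columns at 0 instead of dividing by zero is an assumption that the script above does not make.

```python
import numpy

def normalise_rows(rows):
    """rows: list of [label, f1, f2, ...]; returns (labels, scaled features)."""
    table = numpy.asarray(rows, dtype=float)
    labels = table[:, 0].astype(int)   # first column: class labels, untouched
    features = table[:, 1:]            # remaining columns: feature values
    mean = features.mean(axis=0)       # per-column mean
    std = features.std(axis=0)         # per-column standard deviation
    std[std == 0] = 1.0                # assumption: constant columns become 0
    return labels, (features - mean) / std

rows = [[1, 0.563, 13498174.2, -21.3],
        [0, 0.114, 42234434.3, 15.67]]
labels, scaled = normalise_rows(rows)
print(labels)
print(scaled)
```

With only the two example rows from earlier, each feature ends up at roughly +1 in one row and -1 in the other, since both values sit exactly one standard deviation from their mean.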

from: http://www.marekrei.com/blog/normalise-feature-vectors/