HOW TO NORMALISE FEATURE VECTORS

GarfieldEr007

I was trying to create a sample file for training a neural network and ran into a common problem: the feature values are all over the place. In this example I’m working with real-world demographic values for countries. For example, a feature for GDP per person in a country ranges from 551.27 to 88286.0, whereas estimates for corruption range from -1.56 to 2.42. This can mislead machine learning algorithms, which may end up treating features with larger magnitudes as more important signals.

To handle this issue, we want to scale all the feature values into roughly the same range. We can do this by taking each feature value, subtracting the mean of that feature (thereby shifting the mean to 0), and dividing by the feature’s standard deviation (thereby scaling the spread to 1). This is a piece of code I’ve implemented a number of times for various projects, so it’s time to write a nice reusable script. Hopefully it can be helpful for others as well. I chose to do this in Python, as it’s easier to run than C++ or Java (it doesn’t need to be compiled) and has better support for real-valued numbers than bash scripting.
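The arithmetic described above can be sketched in a few lines. This is only an illustration, not the script itself: the function name `standardise` and the middle GDP value are made up for the example, and it uses the standard library's `statistics.pstdev` (population standard deviation, which is also what `numpy.std` computes by default).

```python
import statistics

def standardise(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population std, same default as numpy.std
    return [(x - mean) / std for x in values]

# Hypothetical GDP-per-person figures; only the two endpoints come from the text.
gdp = [551.27, 12000.0, 88286.0]
scaled = standardise(gdp)
```

After this transformation the values have mean 0 and standard deviation 1, whatever their original units were.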

Each line in the input file is assumed to be a feature vector, with values separated by whitespace. The first element is an integer class label that will be left untouched. This is followed by a number of floating point feature values which will be normalised. For example:

 
    1 0.563 13498174.2 -21.3
    0 0.114 42234434.3 15.67

We’re assuming dense vectors, meaning that each line has an equal number of features.
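As a rough sketch of this input format, the snippet below parses a line into a label and its features and checks the dense-vector assumption. The helper names `parse_line` and `check_dense` are hypothetical and are not part of the script; the script itself performs no such check.

```python
def parse_line(line):
    """Split one whitespace-separated line into (label, features)."""
    fields = line.split()
    return int(fields[0]), [float(v) for v in fields[1:]]

def check_dense(rows):
    """Raise ValueError if the rows do not all have the same number of features."""
    widths = {len(features) for _, features in rows}
    if len(widths) > 1:
        raise ValueError("rows have differing feature counts: %s" % sorted(widths))

rows = [
    parse_line("1 0.563 13498174.2 -21.3"),
    parse_line("0 0.114 42234434.3 15.67"),
]
check_dense(rows)  # passes silently: both rows have three features
```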

To execute it, simply use

 
    python feature-normaliser.py < in.txt > out.txt

The complete script that will normalise feature vectors is here:

 
    import sys
    import fileinput
    import numpy

    # data[col] holds every value of column col; column 0 is the class label.
    data = []
    linecount = 0
    for line in fileinput.input():
        if line.strip():  # skip blank lines
            index = 0
            for value in line.split():
                if linecount == 0:
                    data.append([])  # first row: create one list per column
                if index == 0:
                    data[index].append(int(value))
                else:
                    data[index].append(float(value))
                index += 1
            linecount += 1

    # Compute each column's mean and standard deviation once, up front.
    means = [numpy.mean(column) for column in data]
    stds = [numpy.std(column) for column in data]

    for row in range(linecount):
        for col in range(len(data)):
            if col == 0:
                # The class label is written out unchanged.
                sys.stdout.write(str(data[col][row]))
            else:
                val = (data[col][row] - means[col]) / stds[col]
                sys.stdout.write("\t" + str(val))
        sys.stdout.write("\n")
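The same computation can also be sketched with NumPy array operations instead of explicit loops. This is an alternative illustration, not the script above: the function name `normalise` is hypothetical, and the handling of constant columns (a standard deviation of 0, which would otherwise cause a division by zero) is an added assumption.

```python
import numpy as np

def normalise(rows):
    """rows: 2-D array-like; column 0 is the class label, the rest are features."""
    table = np.asarray(rows, dtype=float)
    labels = table[:, 0].astype(int)
    features = table[:, 1:]
    mean = features.mean(axis=0)  # per-column mean
    std = features.std(axis=0)    # per-column population standard deviation
    # Assumption: a constant column is mapped to 0 rather than divided by zero.
    std[std == 0] = 1.0
    return labels, (features - mean) / std

labels, scaled = normalise([[1, 0.563, 13498174.2, -21.3],
                            [0, 0.114, 42234434.3, 15.67]])
```

Computing the statistics per column with `axis=0` keeps each feature's scaling independent of the others, which is the point of the exercise.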