A programmer's cleaning guide for messy sensor data

Latest recommended article published on 2024-09-20 00:09:16

cumo7370

Article tags: data mining, python, data analysis, java, linux

Original article: https://opensource.com/article/17/9/messy-sensor-data

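In this tutorial, I'll explain how to use Pandas and Python to work with messy data. If you have never used Pandas before but know the basics of Python, this tutorial is for you.

How to work with dates and time with Python.

Let's start from the beginning and turn a messy file into a useful dataset. The entire source code is available on GitHub.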

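Reading the CSV file

You can open a CSV file in Pandas with the following:

pandas.read_csv(): opens a CSV file as a DataFrame (like a table).
DataFrame.head(): displays the first 5 entries.

A DataFrame is like a table in Pandas; it has a set number of columns and indices. CSV files are a good fit for DataFrames because they arrange data in columns and rows.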
     
    import pandas as pd

    # Open a comma-separated values (CSV) file as a DataFrame
    weather_observations = \
      pd.read_csv('observations/Canberra_observations.csv')

    # Print the first 5 entries
    weather_observations.head()

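It looks like our data is actually separated by tabs (\t). There are some interesting items in there that look like times.

pandas.read_csv() provides versatile keyword arguments for different situations. Here you have one column for the date and another for the time. You can introduce a few keyword arguments to add some intelligence:

sep: the separator between columns
parse_dates: treat one or more columns as dates
dayfirst: use the DD.MM.YYYY format instead of month first
infer_datetime_format: tell Pandas to guess the date format
na_values: add values to treat as empty

Using these keyword arguments pre-formats the data and lets Pandas do some of the heavy lifting.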
     
    # Supply pandas with some hints about the file to read
    weather_observations = \
      pd.read_csv('observations/Canberra_observations.csv',
         sep='\t',
         parse_dates={'Datetime': ['Date', 'Time']},
         dayfirst=True,
         infer_datetime_format=True,
         na_values=['-']
      )

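Pandas nicely converts the two columns, Date and Time, into a single column, Datetime, and renders it in a standard format.

There is a NaN value here. Pandas uses NaN, the floating-point "not a number" value, as its marker for missing data, so here it simply means the entry is empty.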
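
To make these keyword arguments concrete, here is a minimal sketch that runs on a few made-up rows instead of the real file. The sample values, the DD/MM/YYYY date layout, and the contents of the `Tmp` column are assumptions for illustration only. The sketch combines `Date` and `Time` with `pd.to_datetime()` after reading, because newer pandas releases deprecate `infer_datetime_format` and the dictionary form of `parse_dates`; treat it as one possible alternative, not as the article's method.

```python
import io
import pandas as pd

# Made-up, tab-separated sample rows (not taken from the real dataset)
raw = (
    "Date\tTime\tTmp\n"
    "05/01/2017\t00:30\t20.1\n"
    "05/01/2017\t00:00\t-\n"
)

# sep splits on tabs; na_values turns '-' into a missing value
sample = pd.read_csv(io.StringIO(raw), sep="\t", na_values=["-"])

# Combine the two text columns and parse them day-first
sample["Datetime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], dayfirst=True
)
sample = sample.drop(columns=["Date", "Time"])

print(sample)
```

With `dayfirst=True`, the string `05/01/2017` is read as 5 January rather than 1 May.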

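Sorting data in order

Let's look at how Pandas handles the order of data.

DataFrame.sort_values(): rearranges rows in order.
DataFrame.drop_duplicates(): deletes duplicated items.
DataFrame.set_index(): specifies a column to use as the index.

Because time seems to be going backward, let's sort it: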
     
    # Sorting is ascending by default, or chronological order
    sorted_dataframe = weather_observations.sort_values('Datetime')
    sorted_dataframe.head()

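Why are there two midnights? It turns out that our dataset (the raw data) contains midnight at both the end and the beginning of each day. You can discard one of them as a duplicate, since the next day comes with another midnight.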
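
The effect of sorting, dropping duplicates, and indexing can be checked on a toy frame first. The timestamps and `Tmp` values below are invented, and the repeated midnight is an assumption about how the duplicated rows look in the real data.

```python
import pandas as pd

# Invented observations: midnight on 2 January appears twice
toy = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2017-01-02 00:30",
        "2017-01-02 00:00",
        "2017-01-02 00:00",
        "2017-01-01 23:30",
    ]),
    "Tmp": [18.0, 19.0, 19.5, 20.0],
})

cleaned = (
    toy.sort_values("Datetime")                   # chronological order
       .drop_duplicates("Datetime", keep="last")  # one row per timestamp
       .set_index("Datetime")                     # timestamps become the index
)

print(cleaned)
```

Each step returns a new DataFrame, so the result has to be assigned (or chained, as here) for the change to carry forward.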

The logical order here is to sort the data, discard the duplicates, and then set the index:
     
    # Sorting is ascending by default, or chronological order
    sorted_dataframe = weather_observations.sort_values('Datetime')

    # Remove duplicated items with the same date and time
    no_duplicates = sorted_dataframe.drop_duplicates('Datetime', keep='last')

    # Use `Datetime` as our DataFrame index
    # (taken from `no_duplicates` so the removed rows stay removed)
    indexed_weather_observations = \
      no_duplicates.set_index('Datetime')
    indexed_weather_observations.head()

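Now you have a DataFrame with time as its index, which will come in handy later. First, let's transform the wind directions.

Transforming column values

To prepare wind data for weather modelling, you can use the wind values in a numerical format. By convention, north wind (↓) is 0 degrees, going clockwise ⟳. East wind (←) is 90 degrees, and so on. You will leverage Pandas to do the transformation:

Series.apply(): transforms each entry with a function.

To work out the exact value of each wind direction, I wrote a dictionary by hand, since there are only 16 values. This is tidy and easy to understand.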
     
    # Translate wind direction to degrees
    wind_directions = {
         'N':   0.0, 'NNE':  22.5, 'NE':  45.0, 'ENE':  67.5,
         'E':  90.0, 'ESE': 112.5, 'SE': 135.0, 'SSE': 157.5,
         'S': 180.0, 'SSW': 202.5, 'SW': 225.0, 'WSW': 247.5,
         'W': 270.0, 'WNW': 292.5, 'NW': 315.0, 'NNW': 337.5}

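You can access a DataFrame column (called a Series in Pandas) with an index accessor, just as you would with a Python dictionary. After the transformation, the Series is replaced with the new values.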
     
    # Replace wind directions column with a new number column
    # `get()` accesses values from the dictionary safely
    indexed_weather_observations['Wind dir'] = \
        indexed_weather_observations['Wind dir'].apply(wind_directions.get)

    # Display some entries
    indexed_weather_observations.head()

Each valid wind direction is now a number. It doesn't matter whether the value is a string or another kind of number; you can use Series.apply() to transform it.
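
A quick way to see how `Series.apply()` behaves with `dict.get` is to run it on a short, made-up Series that includes one unrecognised label. The `'CALM'` entry and the shortened mapping are assumptions for illustration; the real column may contain other values.

```python
import pandas as pd

# Shortened, illustrative mapping (the article uses all 16 compass points)
directions = {"N": 0.0, "E": 90.0, "S": 180.0, "W": 270.0}

# Made-up wind directions, including a label missing from the mapping
wind = pd.Series(["N", "E", "CALM", "W"])

# dict.get returns None for unknown keys instead of raising KeyError,
# so the unrecognised label ends up as a missing value
degrees = wind.apply(directions.get)

print(degrees)
```

Using `get` rather than plain indexing means an unexpected label produces a missing value that can be handled later, instead of stopping the whole transformation.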

Setting the index frequency

Digging deeper, you find more flaws in the dataset:
     
    # One section where the data has weird timestamps ...
    indexed_weather_observations[1800:1805]

00:33:00? 01:11:00? These are odd timestamps. There is a function to ensure a consistent frequency:

DataFrame.asfreq(): forces a specific frequency on the index, discarding the rest.
       
    # Force the index to be every 30 minutes
    regular_observations = \
      indexed_weather_observations.asfreq('30min')

    # Same section at different indices since setting
    # its frequency :)
    regular_observations[1633:1638]
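
To see what `asfreq()` does row by row, it can be run on a handful of invented timestamps. The times and values below are assumptions, chosen to include two off-grid readings and one missing half hour.

```python
import pandas as pd

# Invented readings: 00:33 and 01:11 are off the 30-minute grid,
# and there is no reading at 01:00
times = pd.to_datetime([
    "2017-01-01 00:00",
    "2017-01-01 00:30",
    "2017-01-01 00:33",
    "2017-01-01 01:11",
    "2017-01-01 01:30",
])
toy = pd.DataFrame({"Tmp": [20.0, 19.5, 19.4, 19.0, 18.5]}, index=times)

# Reindex onto a regular 30-minute grid between the first and last timestamp
regular = toy.asfreq("30min")

print(regular)
```

The result has four rows: the two off-grid readings are gone, and 01:00 appears with a missing `Tmp` value.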

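Pandas discards any index entry that doesn't match the frequency and adds an empty row where one doesn't exist. Now you have a consistent index frequency. Let's plot it with matplotlib, a popular plotting library, to see how it looks: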
     
    import matplotlib.pyplot as plt

    # Make the graphs a bit prettier
    # (the old pandas option 'display.mpl_style' has been removed from pandas;
    # a matplotlib style sheet is used here instead)
    plt.style.use('ggplot')
    plt.rcParams['figure.figsize'] = (18, 5)

    # Plot the first 500 entries with selected columns
    regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()

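Looking closer, there seem to be gaps around January 6th and 7th, and more later on. You need to fill these with something meaningful.

Interpolate and fill empty rows

To fill the gaps, you can linearly interpolate the values; that is, draw a line between the two end points of a gap and fill in each timestamp accordingly.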
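
As a minimal sketch of that idea, the example below fills a two-step gap with `Series.interpolate()` and its `'time'` method on invented temperatures. The values and the column name are assumptions; the result is assigned back to the column rather than changed in place, which is simply one way to write it.

```python
import numpy as np
import pandas as pd

# Invented half-hourly temperatures with a gap at 00:30 and 01:00
index = pd.date_range("2017-01-06 00:00", periods=4, freq="30min")
toy = pd.DataFrame({"Tmp": [10.0, np.nan, np.nan, 16.0]}, index=index)

# 'time' weights the fill by the spacing between timestamps
toy["Tmp"] = toy["Tmp"].interpolate(method="time")

print(toy)
```

Because the timestamps are evenly spaced, the two filled values land on the straight line between 10 and 16, at roughly 12 and 14.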

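Series.interpolate(): fills in empty values based on the index.

Here you also use the inplace keyword argument to tell Pandas to perform the operation and replace the values in the Series itself.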
     
    # Interpolate data to fill empty values
    for column in regular_observations.columns:
        regular_observations[column].interpolate('time', inplace=True, limit_direction='both')

    # Display some interpolated entries
    regular_observations[1633:1638]

The NaN values have been replaced. Let's plot it again:
     
    # Plot it again - gap free!
    regular_observations[['Wind spd', 'Wind gust', 'Tmp', 'Feels like']][:500].plot()