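If you use a high-level library such as pandas, this problem becomes much easier to solve. Let me demonstrate.

Suppose you have the following data saved in file.csv:

2013-07-18 04:54:15.871 UDP 172.12.332.11:20547 172.12.332.11:20547 -> 172.56.213.80:53 CREATE Ignore 0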
2013-07-18 04:54:15.841 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
2013-07-18 04:54:15.831 TCP 172.12.332.11:42547 172.12.332.11:42547 -> 172.56.213.80:53 CREATE Ignore 0
2013-07-18 04:54:15.821 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
2013-07-18 04:54:15.811 TCP 172.12.332.11:42547 172.12.332.11:42547 -> 172.56.213.80:53 CREATE Ignore 0

First, we read it into a DataFrame:

^{pr2}$
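
One way this read step might look is sketched below. It is not necessarily the call the answer originally used: `read_log` and `SAMPLE` are hypothetical names, the sample is read from an in-memory buffer instead of file.csv, and the date and time columns are merged by hand into a column named `0_1` so the label matches the rest of this answer.

```python
import io

import pandas as pd

# Two rows copied from the sample data above. In practice you would pass
# the path 'file.csv' to read_log instead of a StringIO buffer.
SAMPLE = """\
2013-07-18 04:54:15.871 UDP 172.12.332.11:20547 172.12.332.11:20547 -> 172.56.213.80:53 CREATE Ignore 0
2013-07-18 04:54:15.841 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
"""


def read_log(source):
    # Fields are separated by runs of whitespace and there is no header row,
    # so the columns are labelled 0, 1, 2, ...
    df = pd.read_csv(source, sep=r"\s+", header=None)
    # Join the date (column 0) and time (column 1) into one datetime column.
    # The label '0_1' is an assumption chosen to match this answer.
    df["0_1"] = pd.to_datetime(df[0] + " " + df[1])
    return df.drop(columns=[0, 1])


df = read_log(io.StringIO(SAMPLE))
print(df[["0_1", 4, 6]].to_string())
```

Merging the two columns explicitly with `pd.to_datetime` avoids relying on how a particular pandas version names combined date columns.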

We only need the merged timestamp column (0_1) and columns 4 and 6:

>>> df = df[['0_1', 4, 6]]
>>> print(df.to_string())
                          0_1                    4                 6
0  2013-07-18 04:54:15.871000  172.12.332.11:20547  172.56.213.80:53
1  2013-07-18 04:54:15.841000  192.81.130.82:37192  172.81.123.70:53
2  2013-07-18 04:54:15.831000  172.12.332.11:42547  172.56.213.80:53
3  2013-07-18 04:54:15.821000  192.81.130.82:37192  172.81.123.70:53
4  2013-07-18 04:54:15.811000  172.12.332.11:42547  172.56.213.80:53

Next, we clean up the IP addresses by removing the ports:

>>> df[4] = df[4].str.split(':').str.get(0)
>>> df[6] = df[6].str.split(':').str.get(0)
>>> print(df.to_string())
                          0_1              4              6
0  2013-07-18 04:54:15.871000  172.12.332.11  172.56.213.80
1  2013-07-18 04:54:15.841000  192.81.130.82  172.81.123.70
2  2013-07-18 04:54:15.831000  172.12.332.11  172.56.213.80
3  2013-07-18 04:54:15.821000  192.81.130.82  172.81.123.70
4  2013-07-18 04:54:15.811000  172.12.332.11  172.56.213.80

Suppose you are interested in the source address 172.12.332.11 and the destination 172.56.213.80. We filter for those rows:

>>> filtered = df[(df[4] == '172.12.332.11') & (df[6] == '172.56.213.80')]
>>> print(filtered.to_string())
                          0_1              4              6
0  2013-07-18 04:54:15.871000  172.12.332.11  172.56.213.80
2  2013-07-18 04:54:15.831000  172.12.332.11  172.56.213.80
4  2013-07-18 04:54:15.811000  172.12.332.11  172.56.213.80
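
If you need the port stripping and the filter in more than one place, they can be wrapped in small helpers. The following is only a sketch: `strip_port` and `select_flow` are hypothetical names, and the three-row frame is a made-up miniature that mirrors columns 4 (source) and 6 (destination).

```python
import pandas as pd

# Hypothetical miniature frame mirroring columns 4 (source) and 6 (destination).
df = pd.DataFrame({
    4: ["172.12.332.11:20547", "192.81.130.82:37192", "172.12.332.11:42547"],
    6: ["172.56.213.80:53", "172.81.123.70:53", "172.56.213.80:53"],
})


def strip_port(column):
    # Split once on the last ':' and keep the part before it (the address).
    return column.str.rsplit(":", n=1).str.get(0)


def select_flow(frame, source, destination):
    # Boolean mask: both the source and the destination must match.
    mask = (frame[4] == source) & (frame[6] == destination)
    return frame[mask]


for col in (4, 6):
    df[col] = strip_port(df[col])

filtered = select_flow(df, "172.12.332.11", "172.56.213.80")
print(filtered.index.tolist())  # [0, 2]
```

Note that the original row labels are kept after filtering, which is why the filtered output above shows the index values 0, 2 and 4.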

Now we need to compute the differences between consecutive timestamps (previous row minus current row):

>>> timestamps = filtered['0_1']
>>> diffs = (timestamps.shift() - timestamps).dropna()
>>> print(diffs.to_string())
2   00:00:00.040000
4   00:00:00.020000

We can now compute any statistics we want:

>>> diffs.mean()  # this is in nanoseconds
30000000.0
>>> diffs.std()
14142135.62373095
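
Depending on the pandas version, `mean()` and `std()` on a timedelta column may return raw nanoseconds or a `Timedelta` object. A sketch that sidesteps this by converting the differences to plain milliseconds first, using the three filtered timestamps from above (the variable names are illustrative only):

```python
import pandas as pd

# Timestamps of the three filtered rows shown above.
timestamps = pd.Series(pd.to_datetime([
    "2013-07-18 04:54:15.871",
    "2013-07-18 04:54:15.831",
    "2013-07-18 04:54:15.811",
]))

# Previous row minus current row, as in the answer. The first row has no
# predecessor, so its missing value is dropped.
diffs = (timestamps.shift() - timestamps).dropna()

# Convert the timedeltas to plain floats in milliseconds.
diffs_ms = diffs.dt.total_seconds() * 1000.0

mean_ms = diffs_ms.mean()
std_ms = diffs_ms.std()  # sample standard deviation (ddof=1), the pandas default
print(round(mean_ms, 3), round(std_ms, 3))
```

The two differences are 40 ms and 20 ms, so the mean is 30 ms and the sample standard deviation is about 14.142 ms, which agrees with the nanosecond values 30000000.0 and 14142135.62373095 above.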

Edit: for the data you sent me:

import io
import pandas as pd


def load_dataframe(filename):
    # Read the data as a regular CSV file and extract the _raw column values
    values = pd.read_csv(filename)['_raw'].values
    # Clean up the values: replace embedded newline characters with spaces
    values = [x.replace('\n', ' ') for x in values]
    # Put them into an in-memory text stream
    s = io.StringIO(u'\n'.join(values))
    # From here everything is the same as before, just read from the stream.
    # parse_dates=[[0, 1]] merges columns 0 and 1 into one datetime column
    # named '0_1' (recent pandas releases deprecate this nested form).
    df = pd.read_table(s, sep=r'\s+', header=None, parse_dates=[[0, 1]])[['0_1', 4, 6]]
    # Strip the port from the source and destination addresses
    df[4] = df[4].str.split(':').str.get(0)
    df[6] = df[6].str.split(':').str.get(0)
    return df


def get_diffs(df, source, destination):
    # Timestamps of the rows that match the requested source and destination
    timestamps = df[(df[4] == source) & (df[6] == destination)]['0_1']
    # Previous row minus current row; the first row has no predecessor
    return (timestamps.shift() - timestamps).dropna()


def main():
    # input() is Python 3; on Python 2 use raw_input() instead
    filename = input('Enter filename: ')
    df = load_dataframe(filename)
    while True:
        source = input('Enter source IP: ').strip()
        destination = input('Enter destination IP: ').strip()
        diffs = get_diffs(df, source, destination)
        for i, row in enumerate(diffs):
            # Each value is the earlier matching row minus the later one, in ms
            ms = pd.Timedelta(row) / pd.Timedelta(milliseconds=1)
            print('row %d - row %d = %s ms' % (i + 1, i + 2, ms))
        print('Mean: %s' % diffs.mean())
        print('Std: %s' % diffs.std())
        yn = input('Again? [y/n]: ').lower().strip()
        if yn != 'y':
            return


if __name__ == '__main__':
    main()

Usage example:

$ python test.py
Enter filename: Data.csv
Enter source IP: 172.16.122.21
Enter destination IP: 172.55.102.107
Mean: 3333333.33333
Std: 5773502.6919
Again? [y/n]: n