Computing durations and averages of CSV file data with Python

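This article shows how to use pandas to process a CSV file, extract the timestamps of traffic between specific IP addresses, and compute the mean and standard deviation of the intervals between them. The steps are: read the CSV, filter the rows, compute the time differences, and derive the statistics.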

This problem is much easier to solve with a high-level library such as pandas. Here is a demonstration:

Suppose the following data is saved in file.csv:

2013-07-18 04:54:15.871 UDP 172.12.332.11:20547 172.12.332.11:20547 -> 172.56.213.80:53 CREATE Ignore 0
2013-07-18 04:54:15.841 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
2013-07-18 04:54:15.831 TCP 172.12.332.11:42547 172.12.332.11:42547 -> 172.56.213.80:53 CREATE Ignore 0
2013-07-18 04:54:15.821 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
2013-07-18 04:54:15.811 TCP 172.12.332.11:42547 172.12.332.11:42547 -> 172.56.213.80:53 CREATE Ignore 0

First, read it into a DataFrame:

>>> import pandas as pd
>>> df = pd.read_table('file.csv', sep=r'\s+', header=None, parse_dates=[[0,1]])
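The `load_dataframe` function later in this post reads the log with `parse_dates=[[0,1]]`, a nested form that recent pandas releases deprecate. As a hedged alternative, here is a minimal sketch that reads the same sample rows and combines the date and time fields explicitly. The helper name `read_log` and the column names `timestamp`, `src` and `dst` are illustrative assumptions, not part of the original answer.

```python
import io

import pandas as pd

# Sample rows copied from the article; in practice this would be file.csv.
SAMPLE = """\
2013-07-18 04:54:15.871 UDP 172.12.332.11:20547 172.12.332.11:20547 -> 172.56.213.80:53 CREATE Ignore 0
2013-07-18 04:54:15.841 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
2013-07-18 04:54:15.831 TCP 172.12.332.11:42547 172.12.332.11:42547 -> 172.56.213.80:53 CREATE Ignore 0
2013-07-18 04:54:15.821 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
2013-07-18 04:54:15.811 TCP 172.12.332.11:42547 172.12.332.11:42547 -> 172.56.213.80:53 CREATE Ignore 0
"""


def read_log(source):
    """Read the whitespace-separated log and build one datetime column."""
    # Read every field as text so that nothing is converted implicitly.
    raw = pd.read_csv(source, sep=r"\s+", header=None, dtype=str)
    # Columns 0 and 1 hold the date and the time; join them and parse once.
    stamp = pd.to_datetime(raw[0] + " " + raw[1], format="%Y-%m-%d %H:%M:%S.%f")
    # Columns 4 and 6 are the two addresses the article keeps.
    return pd.DataFrame({"timestamp": stamp, "src": raw[4], "dst": raw[6]})


df = read_log(io.StringIO(SAMPLE))
print(df.to_string())
```

Passing a file name instead of the `StringIO` object should work the same way, since `read_csv` accepts both.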

We only need the combined timestamp column ('0_1') and columns 4 and 6:

>>> df = df[['0_1', 4, 6]]
>>> print(df.to_string())
                          0_1                    4                 6
0  2013-07-18 04:54:15.871000  172.12.332.11:20547  172.56.213.80:53
1  2013-07-18 04:54:15.841000  192.81.130.82:37192  172.81.123.70:53
2  2013-07-18 04:54:15.831000  172.12.332.11:42547  172.56.213.80:53
3  2013-07-18 04:54:15.821000  192.81.130.82:37192  172.81.123.70:53
4  2013-07-18 04:54:15.811000  172.12.332.11:42547  172.56.213.80:53

Next, clean up the IP addresses by stripping the ports:

>>> df[4] = df[4].str.split(':').str.get(0)
>>> df[6] = df[6].str.split(':').str.get(0)
>>> print(df.to_string())
                          0_1              4              6
0  2013-07-18 04:54:15.871000  172.12.332.11  172.56.213.80
1  2013-07-18 04:54:15.841000  192.81.130.82  172.81.123.70
2  2013-07-18 04:54:15.831000  172.12.332.11  172.56.213.80
3  2013-07-18 04:54:15.821000  192.81.130.82  172.81.123.70
4  2013-07-18 04:54:15.811000  172.12.332.11  172.56.213.80

Suppose you are interested in the source address 172.12.332.11 and the destination 172.56.213.80. Filter for those rows:

>>> filtered = df[(df[4] == '172.12.332.11') & (df[6] == '172.56.213.80')]
>>> print(filtered.to_string())
                          0_1              4              6
0  2013-07-18 04:54:15.871000  172.12.332.11  172.56.213.80
2  2013-07-18 04:54:15.831000  172.12.332.11  172.56.213.80
4  2013-07-18 04:54:15.811000  172.12.332.11  172.56.213.80
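The port stripping and the filtering can also be wrapped in one small function. The sketch below assumes a frame with named columns `timestamp`, `src` and `dst` (hypothetical names chosen for readability; the walkthrough above uses `'0_1'`, `4` and `6`), and `select_flow` is an illustrative helper, not something from the original answer.

```python
import pandas as pd

# Same five sample rows as in the article, with illustrative column names.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-07-18 04:54:15.871", "2013-07-18 04:54:15.841",
        "2013-07-18 04:54:15.831", "2013-07-18 04:54:15.821",
        "2013-07-18 04:54:15.811",
    ]),
    "src": ["172.12.332.11:20547", "192.81.130.82:37192", "172.12.332.11:42547",
            "192.81.130.82:37192", "172.12.332.11:42547"],
    "dst": ["172.56.213.80:53", "172.81.123.70:53", "172.56.213.80:53",
            "172.81.123.70:53", "172.56.213.80:53"],
})


def select_flow(frame, source, destination):
    """Return the rows whose source and destination IPs match, ignoring ports."""
    # rsplit with n=1 drops only the trailing ":port" part of each address.
    src_ip = frame["src"].str.rsplit(":", n=1).str[0]
    dst_ip = frame["dst"].str.rsplit(":", n=1).str[0]
    return frame[(src_ip == source) & (dst_ip == destination)]


filtered = select_flow(df, "172.12.332.11", "172.56.213.80")
print(filtered.index.tolist())
```

This version leaves the original address columns untouched, so the ports stay available if they are needed later.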

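Now compute the differences between consecutive timestamps. The log is ordered newest first, so subtracting each row from the previous one gives positive intervals:

>>> timestamps = filtered['0_1']
>>> diffs = (timestamps.shift() - timestamps).dropna()
>>> print(diffs.to_string())
2   00:00:00.040000
4   00:00:00.020000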
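If the row order of the log cannot be relied on, one option is to sort the timestamps first and use `Series.diff()`. This is a minimal sketch on the three filtered timestamps from above; the variable names are illustrative.

```python
import pandas as pd

# The three timestamps of the filtered flow, keeping the original row labels.
timestamps = pd.to_datetime(pd.Series(
    ["2013-07-18 04:54:15.871", "2013-07-18 04:54:15.831", "2013-07-18 04:54:15.811"],
    index=[0, 2, 4],
))

# Sort oldest first so that each difference is "this row minus the previous row".
ordered = timestamps.sort_values()
gaps = ordered.diff().dropna()

# Convert to milliseconds as plain floats for easier reading.
gaps_ms = (gaps.dt.total_seconds() * 1000).round(3)
print(gaps_ms.tolist())
```

The gaps are the same 20 ms and 40 ms intervals as before, only listed in chronological order.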

Now we can compute any statistics we like:

>>> diffs.mean()  # this is in nanoseconds
30000000.0
>>> diffs.std()
14142135.62373095
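Depending on the pandas version, `mean()` on a timedelta Series may return a `Timedelta` object rather than a raw nanosecond count. To get numbers in a fixed unit regardless of version, one approach is to convert the intervals to floats first. A small sketch, using the two intervals computed above:

```python
import pandas as pd

# The two intervals from the walkthrough: 40 ms and 20 ms.
diffs = pd.Series(pd.to_timedelta([40, 20], unit="ms"))

# Work in milliseconds as plain floats so the statistics have a known unit.
diffs_ms = diffs.dt.total_seconds() * 1000

mean_ms = diffs_ms.mean()
std_ms = diffs_ms.std()  # sample standard deviation (ddof=1), the pandas default

print("mean: %.3f ms, std: %.3f ms" % (mean_ms, std_ms))
```

These values correspond to the 30000000 ns mean and 14142135.6 ns standard deviation shown above.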

Edit: for the data you sent me:

import io

import pandas as pd


def load_dataframe(filename):
    # Read the file as a regular CSV and extract the _raw column values
    values = pd.read_csv(filename)['_raw'].values
    # Clean up the values: replace embedded newline characters
    values = [v.replace('\n', ' ') for v in values]
    # Put them into an in-memory text stream
    s = io.StringIO('\n'.join(values))
    # From here on everything is the same, just read from the stream
    df = pd.read_table(s, sep=r'\s+', header=None, dtype=str)
    # Combine the date and time fields explicitly; the nested
    # parse_dates=[[0, 1]] form is deprecated in recent pandas versions
    df['0_1'] = pd.to_datetime(df[0] + ' ' + df[1])
    df = df[['0_1', 4, 6]].copy()
    df[4] = df[4].str.split(':').str.get(0)
    df[6] = df[6].str.split(':').str.get(0)
    return df


def get_diffs(df, source, destination):
    timestamps = df[(df[4] == source) & (df[6] == destination)]['0_1']
    # shift() - current means "previous row minus current row"
    return (timestamps.shift() - timestamps).dropna()


def main():
    filename = input('Enter filename: ')
    df = load_dataframe(filename)
    while True:
        source = input('Enter source IP: ').strip()
        destination = input('Enter destination IP: ').strip()
        diffs = get_diffs(df, source, destination)
        for i, delta in enumerate(diffs):
            # Each delta is a Timedelta; report it in milliseconds
            print('row %d - row %d = %.3f ms' % (i + 1, i + 2, delta.total_seconds() * 1000))
        print('Mean: %s' % diffs.mean())
        print('Std: %s' % diffs.std())
        yn = input('Again? [y/n]: ').lower().strip()
        if yn != 'y':
            return


if __name__ == '__main__':
    main()

Usage example:

$ python test.py
Enter filename: Data.csv
Enter source IP: 172.16.122.21
Enter destination IP: 172.55.102.107
Mean: 3333333.33333
Std: 5773502.6919
Again? [y/n]: n
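To get the statistics for every source/destination pair at once instead of querying them interactively, a grouped variant is possible. The following is only a sketch under assumptions: the frame already has the ports stripped, and the column names `timestamp`, `src`, `dst` and the helper `gap_stats` are illustrative, not taken from the script above.

```python
import pandas as pd

# Sample rows from the article with ports already removed.
frame = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2013-07-18 04:54:15.871", "2013-07-18 04:54:15.841",
        "2013-07-18 04:54:15.831", "2013-07-18 04:54:15.821",
        "2013-07-18 04:54:15.811",
    ]),
    "src": ["172.12.332.11", "192.81.130.82", "172.12.332.11",
            "192.81.130.82", "172.12.332.11"],
    "dst": ["172.56.213.80", "172.81.123.70", "172.56.213.80",
            "172.81.123.70", "172.56.213.80"],
})


def gap_stats(frame):
    """Count, mean and std of the gaps (in ms) for each (src, dst) pair."""
    ordered = frame.sort_values("timestamp")
    # Timestamp of the previous row within the same (src, dst) group.
    previous = ordered.groupby(["src", "dst"])["timestamp"].shift()
    gap_ms = (ordered["timestamp"] - previous).dt.total_seconds() * 1000
    return (ordered.assign(gap_ms=gap_ms)
            .groupby(["src", "dst"])["gap_ms"]
            .agg(["count", "mean", "std"]))


stats = gap_stats(frame)
print(stats)
```

A pair with only two rows has a single gap, so its standard deviation comes out as NaN.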
