Advanced Computer Networks Lab: Internet Measurements

-何以寄相思

Article link: https://blog.csdn.net/weixin_42866128/article/details/127531248

Princeton University networking assignment

COS-461 Assignment 4: Internet Measurements (princeton.edu)

0. Prepare

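The assignment provides a .csv file of NetFlow measurements collected with 1/100 packet sampling, so the data reflects 1% of the traffic at the router. The first step is to get an intuitive feel for the data. Opening the file in Excel shows that the first row holds the field names and every other row is an aggregated flow record.

Figure: the dataset

We can also print the CSV file's structure from code. The file contains more than 870,000 flow records.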

import csv


def print_header(file):
    reader = csv.reader(file)
    header_row = next(reader)

    print(header_row)
    # unix_secs,unix_nsecs,sysuptime,exaddr,dpkts,doctets,first,last,engine_type,engine_id,srcaddr,dstaddr,nexthop,input,output,srcport,dstport,prot,tos,tcp_flags,src_mask,dst_mask,src_as,dst_as

    for index, column_header in enumerate(header_row):
        print(index, column_header)

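    '''
    [873733 rows x 24 columns]
    0 unix_secs:    UNIX time in seconds (seconds since 00:00 on 1 January 1970)
    1 unix_nsecs:   residual nanoseconds beyond unix_secs
    2 sysuptime:    time since the exporting device booted (milliseconds)
    3 exaddr:       IP address of the exporting device
    4 dpkts:        number of packets in the flow
    5 doctets:      number of bytes in the flow
    6 first:        arrival time of the flow's first packet
    7 last:         time of the flow's last packet
    8 engine_type:  type of the processing engine
    9 engine_id:    ID of the processing engine
    10 srcaddr:     source IP address of the flow
    11 dstaddr:     destination IP address of the flow
    12 nexthop:     IP address of the next-hop router
    13 input:       router input interface
    14 output:      router output interface
    15 srcport:     source port of the flow
    16 dstport:     destination port of the flow
    17 prot:        protocol type, 1: ICMP; 6: TCP; 17: UDP
    18 tos:         type of service
    19 tcp_flags:   TCP flags
    20 src_mask:    source prefix mask length
    21 dst_mask:    destination prefix mask length
    22 src_as:      source AS
    23 dst_as:      destination AS
    '''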

1. Traffic Measurement

1.1 What is the average packet size, across all traffic in the trace? Describe how you computed this number.

First we need to distinguish between a flow and a packet. Reference:

"TCP/IP: flows and packets" (Ghost丶's blog, CSDN)

Average packet size = total bytes / total number of packets

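def average_packnum(datafile):
    # datafile is expected to be a pandas DataFrame loaded from the NetFlow CSV
    sum_dpkts = datafile['dpkts'].sum()        # total packets over all flows
    sum_doctets = datafile['doctets'].sum()    # total bytes over all flows
    print('A1:')
    print('\tTotal bytes: ' + str(sum_doctets))
    print('\tTotal packets: ' + str(sum_dpkts))
    print('\tAverage packet size: ' + str(sum_doctets/sum_dpkts))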

Figure: experiment result
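
As a small illustration of the formula above, the sketch below uses made-up `(dpkts, doctets)` pairs rather than the real trace. It contrasts the traffic-wide average (total bytes over total packets) with the unweighted mean of each flow's own average, which is a different quantity. The function names and the toy values are assumptions introduced only for this example.

```python
# Toy flow records as (dpkts, doctets); the values are invented for illustration.
flows = [(1, 40), (10, 15000), (2, 120)]


def average_packet_size(flows):
    """Total bytes divided by total packets, across all flows."""
    total_pkts = sum(pkts for pkts, _ in flows)
    total_bytes = sum(octets for _, octets in flows)
    return total_bytes / total_pkts


def mean_of_per_flow_averages(flows):
    """Unweighted mean of each flow's bytes-per-packet ratio (a different quantity)."""
    return sum(octets / pkts for pkts, octets in flows) / len(flows)


weighted = average_packet_size(flows)
unweighted = mean_of_per_flow_averages(flows)
print(weighted, unweighted)
```

The two numbers differ because large flows carry most of the packets, which is why the computation in `average_packnum` sums both columns before dividing.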

1.2 Plot the Complementary Cumulative Probability Distribution (CCDF) of flow durations (i.e., the finish time minus the start time) and of flow sizes (i.e., number of bytes, and number of packets). First plot each graph with a linear scale on each axis, and then a second time with a logarithmic scale on each axis. What are the main features of the graphs? What artifacts of Netflow and of network protocols could be responsible for these features? Why is it useful to plot on a logarithmic scale?

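In short: plot the CCDF of flow duration and of flow size on linear axes and again on logarithmic axes, describe the main features of the plots, explain which NetFlow and network protocol artifacts could cause them, and say why a logarithmic scale is useful.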

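1.2.1 Plotting

To draw these plots we first need to extract the flow durations and flow sizes, and we need to understand what a CCDF is. I referred to the following articles:

"Cumulative distribution function (CDF), complementary CDF (CCDF), and expectation" (Anne033's blog, CSDN)

"Plotting a CCDF in Python" (夏荷影's blog, CSDN)

While processing the data, I found that some flows have a duration of 0 (that is, last minus first equals 0). The reason is that the data was collected with 1% sampling, so a flow may have only a single packet sampled, which gives a duration of 0. Such records have no value for this analysis, so the data should be cleaned first to remove them.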

import csv
import os

import pandas


def print_ccdf(fn, datafile):
    temp_file = r'experiment_1\ft-temp.csv'
    if not os.path.exists(temp_file):                                         # skip cleaning if the cleaned file already exists
        # newline='' keeps the csv module from writing extra blank rows on Windows
        with open(temp_file, 'w', newline='') as f_temp, open(fn, 'r', newline='') as f_old:
            f_csv_old = csv.reader(f_old)
            f_csv_temp = csv.writer(f_temp)
            for i, rows in enumerate(f_csv_old):
                if i == 0:                                                   # copy the header row unchanged
                    f_csv_temp.writerow(rows)
                    continue
                if int(rows[7]) - int(rows[6]) != 0:                         # keep only flows with last != first
                    f_csv_temp.writerow(rows)
        data_tmp = pandas.read_csv(temp_file)                                # drop any fully blank rows
        res = data_tmp.dropna(how="all")
        res.to_csv(temp_file, index=False)
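
The sampling argument behind this cleaning step can be checked with a small simulation. The sketch below assumes independent random sampling of each packet with probability 1/100, which may not match the router's actual sampler. `single_packet_share` is a hypothetical helper, and the flow lengths are arbitrary. It estimates, among flows that show up in the trace at all, the share that had exactly one packet sampled (and would therefore appear with a duration of 0).

```python
import random


def single_packet_share(packets_per_flow, n_flows, rate=0.01, seed=0):
    """Among flows with at least one sampled packet, the share with exactly one."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = 0                       # flows with >= 1 sampled packet
    single = 0                         # flows with exactly 1 sampled packet
    for _ in range(n_flows):
        sampled = sum(1 for _ in range(packets_per_flow) if rng.random() < rate)
        if sampled >= 1:
            observed += 1
            if sampled == 1:
                single += 1
    return single / observed if observed else 0.0


share_short = single_packet_share(10, 50_000)    # short flows of 10 packets
share_long = single_packet_share(500, 2_000)     # long flows of 500 packets
print(share_short, share_long)
```

Under this assumption, almost every short flow that is observed has a single sampled packet, while long flows rarely do. This is consistent with the large number of zero-duration records in the trace.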

Next, we plot using the data in the cleaned file (ft-temp.csv) produced by print_ccdf.

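import numpy
import matplotlib.pyplot as plt
from scipy import stats


def deal_ccdf(data):
    data = numpy.sort(data)                                                 # sort (numpy.sort returns a sorted copy)
    val, cnt = numpy.unique(data, return_counts=True)                       # distinct values and their counts
    # print(val, cnt)
    pmf = cnt / len(data)                                                   # empirical probability of each value

    fs_rv_dist2 = stats.rv_discrete(name='fs_rv_dist2', values=(val, pmf))  # build a discrete random variable

    plt.plot(val, 1-fs_rv_dist2.cdf(val))                                   # ccdf = 1 - cdf
    plt.title("CCDF")
    # plt.xlim(0, 60000)                                                      # narrow the x range to bring out the shape
    plt.show()


def deal_logccdf(data):
    data = numpy.sort(data)                                                 # sort (numpy.sort returns a sorted copy)
    val, cnt = numpy.unique(data, return_counts=True)                       # distinct values and their counts
    # print(val, cnt)
    pmf = cnt / len(data)                                                   # empirical probability of each value

    fs_rv_dist2 = stats.rv_discrete(name='fs_rv_dist2', values=(val, pmf))  # build a discrete random variable

    # log of both x and y; the CCDF reaches 0 at the largest value, where the log is undefined
    plt.plot(numpy.log(val), numpy.log(1-fs_rv_dist2.cdf(val)))
    plt.show()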

    with open(temp_file, 'r') as f:                                          # continues inside print_ccdf
        reader = csv.reader(f)
        next(reader)                                                         # skip the header row
        durations = []                                                       # lists to fill
        sizes = []
        for row in reader:
            durations.append(int(row[7])-int(row[6]))                        # last - first
            sizes.append(int(row[5]))                                        # doctets
        # print(durations, sizes)
        deal_ccdf(durations)                                                 # plot the CCDF curves
        deal_ccdf(sizes)
        deal_logccdf(durations)                                              # plot the log-log CCDF curves
        deal_logccdf(sizes)

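After a first round of plotting, I found that the interesting part of the flow-size CCDF is concentrated in [0, 60000], so I restricted the x range to bring out its shape, which gave the following plots.

Figure: flow duration, CCDF
Figure: flow size, CCDF
Figure: flow duration, log CCDF
Figure: flow size, log CCDF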
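
For reference, the empirical CCDF itself can be computed without SciPy. The sketch below uses only the standard library and a handful of made-up flow sizes. `empirical_ccdf` is a name introduced here for illustration and is not part of the code above.

```python
from collections import Counter


def empirical_ccdf(samples):
    """Return the sorted distinct values and P(X > value) for each of them."""
    n = len(samples)
    counts = Counter(samples)
    values = sorted(counts)
    ccdf = []
    seen = 0                           # number of samples <= current value
    for v in values:
        seen += counts[v]
        ccdf.append((n - seen) / n)    # fraction of samples strictly larger
    return values, ccdf


# Invented flow sizes in bytes, for illustration only.
values, ccdf = empirical_ccdf([40, 40, 52, 1500, 1500, 1500, 60000, 40])
print(list(zip(values, ccdf)))
```

The last CCDF value is always 0, so for a log-log plot it would make sense to drop that final point, for example by passing `values[:-1]` and `ccdf[:-1]` to matplotlib's `plt.loglog`.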

1.2.2 Analysis

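The CCDF of flow duration shows that, once the many zero-duration flows are removed, the curve falls steadily as duration grows: the longer the duration, the fewer the flows. The log CCDF of flow duration shows that beyond a certain duration x the number of flows drops sharply.
The flow-size CCDF has already been cropped (only the range [0, 60000] is shown). Most flow sizes are concentrated in this small range, so beyond this threshold the number of flows falls quickly.
Both log plots show a sharp drop in density past a certain point. In the log CCDF of flow size, the value stays at 0 to the left of x=4 (that is, the CCDF is 1 there). This is because packets have a minimum size: the headers alone take at least 8 bytes for UDP and 20 bytes for TCP, so no flow can be smaller than this floor and the curve only starts to fall once x exceeds it.
Taking the logarithm does not change the nature of the data or the relationships within it, but it compresses the scale of the variables, so the plot can show the full range of the data more clearly.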
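
To make the point about logarithmic scales concrete, the sketch below assumes a purely hypothetical tail of the form CCDF(x) = x^(-alpha). The exponent 1.5 is chosen only for illustration and is not estimated from the trace. On a linear axis such values collapse toward zero almost immediately, while in log-log coordinates they fall on a straight line with slope -alpha.

```python
import math

alpha = 1.5                                    # assumed tail exponent, illustration only
xs = [1, 10, 100, 1000, 10000]
ccdf = [x ** -alpha for x in xs]               # hypothetical CCDF values

# Coordinates of the same points on log10 axes.
log_points = [(math.log10(x), math.log10(p)) for x, p in zip(xs, ccdf)]

# Slope between consecutive points in log-log space.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(log_points, log_points[1:])
]
print(ccdf)
print(slopes)
```

The raw values span six orders of magnitude, which a linear axis cannot display usefully, whereas the log-log coordinates are evenly spaced.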

1.3 Summarize the traffic by which TCP/UDP port numbers are used. Create two tables, listing the top-ten port numbers by sender traffic volume (i.e., by source port number) and by receiver traffic volume (i.e., by destination port number), including the percentage of traffic (by bytes) they contribute. Where possible, explain what applications are likely responsible for this traffic. (See the IANA port numbers reference for details.) Explain any significant differences between the results for sender vs. receiver port numbers.

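In short: summarize the traffic by TCP/UDP port number. Build two tables of the top ten ports, one by sender traffic volume (source port) and one by receiver traffic volume (destination port), with the percentage of bytes each contributes. Where possible, explain which applications are likely responsible, and explain any significant differences between the sender and receiver results.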

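1.3.1 First we need to filter out the TCP and UDP traffic. From the header we printed earlier, the prot field identifies the protocol. We then build a dictionary to tally the traffic volume per port.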

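import operator


def stat_port(ports, flows, prots):
    stat_ports = {}                                   # bytes per port

    for port in ports:                                # initialize one entry per port
        stat_ports[port] = 0

    total_flows = 0                                   # bytes over all protocols

    for port, flow, prot in zip(ports, flows, prots):
        if (prot == 6) or (prot == 17):               # count only TCP (6) and UDP (17)
            stat_ports[port] += flow
        total_flows += flow

    for key in stat_ports:                            # convert to a percentage of all bytes
        stat_ports[key] = (stat_ports[key]/total_flows)*100

    # sort by percentage, descending, and keep the top ten
    temp = sorted(stat_ports.items(), key=operator.itemgetter(1), reverse=True)[:10]

    print(temp)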

Figure: sender (source) port statistics
Figure: receiver (destination) port statistics

Port 80 accounts for a large share; it is mostly used for HTTP. Ports 22 and 443 are mostly used for SSH and HTTPS respectively.

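Typically, when a host accesses a server, the client opens a high-numbered port of its own and talks to port 80 on the server. The server usually returns a large amount of content (page frameworks, images, video, and so on), so flows with source port 80 account for a large share of the traffic, while the destination ports are mostly high-numbered ports.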
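
The per-port tally described in 1.3.1 can also be sketched with `collections.Counter`. The example below is a minimal illustration on invented records of the form `(port, bytes, prot)`. `top_ports` is a hypothetical helper, not the function used for the results above. It follows the same convention as `stat_port`: percentages are taken relative to the bytes of all protocols.

```python
from collections import Counter


def top_ports(records, n=10):
    """records: iterable of (port, n_bytes, prot). Returns [(port, percent_of_all_bytes)]."""
    per_port = Counter()
    total = 0
    for port, n_bytes, prot in records:
        total += n_bytes                      # denominator: bytes of every protocol
        if prot in (6, 17):                   # tally TCP and UDP only
            per_port[port] += n_bytes
    return [(port, 100 * b / total) for port, b in per_port.most_common(n)]


# Invented records for illustration: (port, bytes, prot).
toy = [(80, 6000, 6), (443, 3000, 6), (53, 500, 17), (0, 500, 1)]
result = top_ports(toy, n=2)
print(result)
```

Calling it once with source ports and once with destination ports would give the two tables the question asks for.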

1.4 Aggregate the traffic volumes based on the source IP prefix. What fraction of the total traffic comes from the most popular (by number of bytes) 0.1% of source IP prefixes? The most popular 1% of source IP prefixes? The most popular 10% of source IP prefixes? Some flows will have a source mask length of 0. Report the fraction of traffic (by bytes) that has a source mask of 0, and then exclude this traffic from the rest of the analysis. That is, report the top 0.1%, 1%, and 10% of source prefixes that have positive mask lengths.

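In short: aggregate the traffic by source IP prefix and report what fraction of the total bytes comes from the most popular 0.1%, 1%, and 10% of source prefixes. Some flows have a source mask length of 0: report the fraction of bytes they carry, then exclude them and report the top 0.1%, 1%, and 10% again for prefixes with positive mask lengths.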

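We first tally the traffic contributed by each IP prefix, then take the top fraction of the prefix list that the question asks for and compute its share of the total traffic. The masked argument controls whether flows with a mask of 0 are excluded. (Such flows may be broadcast packets sent during routing configuration.)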

import operator


def stat_AS(fn, masked):
    stat_flow = {}                                                           # bytes per source address
    for ip in fn.srcaddr:                                                    # initialize one entry per source address
        stat_flow[ip] = 0

    total_flow = 0                                                           # all bytes, including mask-0 flows
    total_flow0 = 0                                                          # bytes from flows with src_mask == 0
    for ip, flow, mask in zip(fn.srcaddr, fn.doctets, fn.src_mask):          # walk the three columns in parallel
        if masked:                                                           # leave src_mask == 0 traffic out of the ranking
            if mask == 0:
                total_flow0 += flow
            else:
                stat_flow[ip] += flow
            total_flow += flow                                               # note: the denominator still counts mask-0 bytes
        else:
            stat_flow[ip] += flow
            total_flow += flow

    fraction = [0.001, 0.01, 0.1]
    temp = sorted(stat_flow.items(), key=operator.itemgetter(1), reverse=True)  # sort entries by bytes, descending

    for frac in fraction:
        temp_stat = dict(temp[:int(len(stat_flow)*frac)])                       # take the top frac of entries
        print(frac*100, sum(temp_stat.values())/total_flow)
    print(total_flow0/total_flow)                                               # share of bytes with src_mask == 0

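Fraction of total traffic (by bytes) that comes from the most popular:

Top 0.1% of source IP prefixes = 0.58944
Top 1% of source IP prefixes = 0.82253
Top 10% of source IP prefixes = 0.98383
Traffic with a source mask of 0 (fraction of bytes) = 0.43259

After excluding traffic with a source mask of 0, the fraction of total traffic (by bytes) that comes from the most popular:

Top 0.1% of source IP prefixes = 0.27005
Top 1% of source IP prefixes = 0.45622
Top 10% of source IP prefixes = 0.56383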
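
The stat_AS function above keys its dictionary on srcaddr. If the aggregation is meant to be by network prefix, one possible approach is to combine each address with its mask length using the standard `ipaddress` module. The sketch below runs on invented records. `bytes_per_prefix` and `top_fraction` are hypothetical helpers, and this is not the code that produced the numbers above. Unlike stat_AS, it uses only the positive-mask bytes as the denominator.

```python
import ipaddress
from collections import Counter


def bytes_per_prefix(records):
    """records: iterable of (srcaddr, src_mask, n_bytes)."""
    per_prefix = Counter()
    mask0_bytes = 0
    for addr, mask, n_bytes in records:
        if mask == 0:                          # set mask-0 traffic aside
            mask0_bytes += n_bytes
            continue
        # strict=False zeroes the host bits, e.g. 128.112.1.5/16 -> 128.112.0.0/16
        prefix = ipaddress.ip_network(f"{addr}/{mask}", strict=False)
        per_prefix[str(prefix)] += n_bytes
    return per_prefix, mask0_bytes


def top_fraction(per_prefix, frac):
    """Share of bytes carried by the top `frac` of prefixes (at least one prefix)."""
    ranked = sorted(per_prefix.values(), reverse=True)
    k = max(1, int(len(ranked) * frac))
    return sum(ranked[:k]) / sum(ranked)


# Invented records for illustration: (srcaddr, src_mask, bytes).
toy = [
    ("128.112.1.5", 16, 700),
    ("128.112.200.9", 16, 100),
    ("10.0.0.1", 8, 200),
    ("1.2.3.4", 0, 1000),
]
per_prefix, mask0_bytes = bytes_per_prefix(toy)
print(dict(per_prefix), mask0_bytes, top_fraction(per_prefix, 0.5))
```

The `max(1, ...)` guard is a design choice for tiny inputs. On the full trace, `int(len * frac)` is already well above 1 for all three fractions.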

1.5 Princeton has the 128.112.0.0/16 address block. What fraction of the traffic (by bytes and by packets) in the trace is sent by Princeton? To Princeton?

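Princeton owns the 128.112.0.0/16 address block. What fraction of the traffic does it send, and what fraction does it receive?

We can string-match the source and destination addresses against the block, starting from the beginning of the address, add up the matching traffic, and compute its share.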

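def stat_flow(ip_string, mask, addrs, flows):
    total_flow = 0                                # traffic over all flows
    temp_flow = 0                                 # traffic whose address matches the block
    # Heuristic: compare only the first mask/2 characters of the dotted-quad
    # string (8 characters for a /16, i.e. "128.112."). This works for this
    # block but is not a general prefix match.
    mask = int(mask/2)
    for ip, flow in zip(addrs, flows):
        if ip_string in ip[:mask]:
            temp_flow += flow
        total_flow += flow
    print(temp_flow/total_flow)                   # share of traffic in the block
    print(total_flow)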
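
A more general membership test than the string comparison above is available in the standard `ipaddress` module. The sketch below is an illustration on invented addresses. `share_in_block` is a hypothetical helper, and it returns the share instead of printing it.

```python
import ipaddress


def share_in_block(block, addrs, volumes):
    """Fraction of the total volume whose address lies inside `block` (CIDR string)."""
    network = ipaddress.ip_network(block)
    total = 0
    inside = 0
    for addr, volume in zip(addrs, volumes):
        total += volume
        if ipaddress.ip_address(addr) in network:   # true prefix match
            inside += volume
    return inside / total


# Invented addresses and volumes for illustration.
addrs = ["128.112.7.9", "128.113.0.1", "12.128.112.4"]
share = share_in_block("128.112.0.0/16", addrs, [50, 30, 20])
print(share)
```

Passing srcaddr with doctets or dpkts would give the share sent by Princeton in bytes or packets, and passing dstaddr would give the share sent to Princeton.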