[Paper Notes] Learning to Log

Paper link: http://www.academia.edu/download/36281506/jmzhu_icse2015.pdf

Abstract

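The authors first introduce the background, then propose a "learning to log" framework that aims to provide guidance on logging.

One implementation of the framework is their tool LogAdvisor, which learns where to log from existing logging instances.
The target problem is characterized by three kinds of features:

  • Structural features
  • Textual features
  • Syntactic features

Machine learning (feature selection, classifier training) is then applied to solve it.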

Evaluation:

  • 2 systems from Microsoft
  • 2 systems from GitHub

In total: 19.1M LOC (lines of code) and 100.6K logging statements.
The reported results are good.

Introduction

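The section opens by arguing that logging should be neither too sparse nor too heavy, with examples. It is a bit long-winded.

The authors had carried out a survey earlier:
even Microsoft has no clear, strict standard for logging.

In some forum posts they found developers discussing best practices for logging:
developers still need to make their own decisions on where to log and what to log,
which in most cases depend on their own domain knowledge

So logging is an important problem.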

Observations and Motivation

The logging in the systems above has held up in practice.

Observations:

  • Pervasiveness of logging
    Mainly covers…
    a line of logging code in every 58 LOC

  • where to log
    exceptions
    return-value-check snippets

  • why not to log everything
    The authors even took this question to StackOverflow.

  • Logging decision and the context
    logging decision is highly dependent on the context of this code snippet, including the exception type

Motivation

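Existing logging relies on the developer's own expertise.
The goal is a tool that gives more valuable logging suggestions and lowers the level of expertise demanded of developers.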

Learning to Log

Overview

Instances collection (choosing the training set)
  • exception snippets
    records the exception context after an exception is captured in the catch block
  • return-value-check snippets
    the situation where an unexpected value (e.g., -1/null/false/empty) is returned from a function call
Label identification (labeling)

logged: the snippet contains a logging statement
unlogged: the snippet contains no logging statement

searching some keywords in all method names, such as

log/logging, trace, write/writeline
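
The keyword search described above can be sketched roughly as follows. This is only an illustration in Python, not the tool's actual implementation: the regular expression, the helper name `is_logged`, and the choice to match keywords against the whole dotted call name are all assumptions made here.

```python
import re

# Keywords taken from the notes above; "logging" and "writeline"
# already contain "log" and "write", so three stems are enough.
LOGGING_KEYWORDS = ("log", "trace", "write")

# Hypothetical pattern: a dotted identifier followed by "(".
CALL_PATTERN = re.compile(r"([A-Za-z_][\w.]*)\s*\(")


def is_logged(snippet: str) -> bool:
    """Label a code snippet as logged if any called name contains a keyword."""
    for name in CALL_PATTERN.findall(snippet):
        lowered = name.lower()
        if any(keyword in lowered for keyword in LOGGING_KEYWORDS):
            return True
    return False


logged_example = "catch (IOException ex) { Logger.Error(ex); }"
unlogged_example = "catch (IOException) { return null; }"
labels = [is_logged(logged_example), is_logged(unlogged_example)]
```

Plain substring matching will also flag names such as `Login` or `Dialog`, so a real labeler would need a stricter rule or a manual check.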

Feature extraction

The details on feature extraction are described in Section III-B

Feature selection
Model training

classification model
Decision Tree

Logging suggestion (prediction)

predictive model to perform accurate logging predictions


Structural features

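error type
(the frequency of each error type is used as a feature)
associated methods
They help in understanding what the code does and which operations it performs.
Method names are used as features,
collected in call order via BFS.
Example: System.IO.Path.GetFullPath, which consists of: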

  • namespace,
  • class name,
  • its (short) method name.
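
As a small illustration, the three parts can be pulled out of a fully qualified name like the example above. The helper name `split_method_name` is made up for this sketch, and it assumes the name has at least three dot-separated parts.

```python
def split_method_name(qualified_name: str) -> tuple[str, str, str]:
    """Split 'Namespace.Class.Method' into (namespace, class name, method name)."""
    # The last component is the short method name.
    prefix, method = qualified_name.rsplit(".", 1)
    # The component before it is the class; everything else is the namespace.
    namespace, class_name = prefix.rsplit(".", 1)
    return namespace, class_name, method


parts = split_method_name("System.IO.Path.GetFullPath")
# parts == ("System.IO", "Path", "GetFullPath")
```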

Textual features

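Variable names, variable types, method names, and so on in the code,
combined with the structural features above to form sentences.

Bag-of-words model:
tokenization, stop-word removal, tf-idf….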
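
A minimal bag-of-words sketch of the steps just listed, using only the standard library. The stop-word list, the camel-case splitting rule, and the `log(N / df)` form of idf are assumptions chosen for illustration; the paper's exact preprocessing may differ.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "to"}


def tokenize(text: str) -> list[str]:
    """Split identifiers on camel case, lowercase them, drop stop words."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", text)
    return [w.lower() for w in words if w.lower() not in STOP_WORDS]


def tf_idf(documents: list[str]) -> list[dict[str, float]]:
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


vectors = tf_idf([
    "System.IO.Path.GetFullPath fileName",
    "System.IO.File.Open fileName",
])
```

Terms that occur in every snippet (here `system` and `io`) get weight 0, while terms specific to one snippet (here `path`) get a positive weight.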

Syntactic features

  1. SettingFlag. We identify whether there is an assignment statement
    with an assigned value like -1/null/false/empty.
  2. Throw. We identify whether there is a throw statement.
  3. Return. We identify whether any special value (e.g., -1/null/false/empty)
    is returned.
  4. RecoverFlag. We check whether there is a new try statement inside.
  5. OtherOperation. We check whether there are any other operations besides the ones listed here.
  6. EmptyBlock. We find that the developers sometimes catch and then do nothing. We thus identify whether the catch block is empty.

All of the above are boolean features.

In addition there is NumOfMethods, which is a count rather than a boolean.
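
To make the boolean flags concrete, here is a very rough sketch that derives some of them from the text of a catch block body with regular expressions. A real extractor would work on a parsed syntax tree; the function name `syntactic_flags`, the patterns, and the omission of the "empty" special value and of OtherOperation are simplifications assumed here.

```python
import re


def syntactic_flags(catch_body: str) -> dict[str, bool]:
    """Approximate a few syntactic flags from the raw text of a catch block body."""
    body = catch_body.strip()
    return {
        # The catch block does nothing at all.
        "EmptyBlock": body == "",
        # A throw statement appears inside the block.
        "Throw": re.search(r"\bthrow\b", body) is not None,
        # A new try statement appears inside the block.
        "RecoverFlag": re.search(r"\btry\b", body) is not None,
        # A special value is returned.
        "Return": re.search(r"\breturn\s+(-1|null|false)\s*;", body) is not None,
        # A special value is assigned (single "=", not "==", "!=", "<=", ">=").
        "SettingFlag": re.search(r"(?<![=!<>])=\s*(-1|null|false)\s*;", body) is not None,
    }


flags = syntactic_flags("success = false; return null;")
```

Because this works on raw text, comments and string literals inside the block can produce false matches.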

Feature selection

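There are too many features, so the dimensionality is too high.

  1. Set a minimum frequency threshold
  2. Information gain (as in decision trees)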

    reduce the feature dimensionality to around 1000
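
The two selection steps can be sketched as below: drop features that occur too rarely, then rank the rest by information gain against the logged/unlogged label. The function names, the set-of-feature-names representation, and the parameters `min_count` and `top_k` are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels: list) -> float:
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values: list, labels: list) -> float:
    """Entropy of the labels minus their entropy after splitting on the feature."""
    total = len(labels)
    groups: dict = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(instances: list[set], labels: list, min_count: int, top_k: int) -> list[str]:
    """Keep features seen at least min_count times, then take the top_k by gain."""
    counts = Counter(f for inst in instances for f in inst)
    frequent = [f for f, c in counts.items() if c >= min_count]
    scored = [
        (information_gain([f in inst for inst in instances], labels), f)
        for f in frequent
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [f for _, f in scored[:top_k]]


instances = [{"IOException", "Open"}, {"IOException"}, {"Open"}, {"rare"}]
labels = [1, 1, 0, 0]
selected = select_features(instances, labels, min_count=2, top_k=1)
```

In this toy data `IOException` separates the two classes perfectly (gain of 1 bit), `Open` carries no information (gain of 0), and `rare` is dropped by the frequency threshold.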

Noise Handling

implicitly assume good logging quality in the training data

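CLNI is used to identify noisy instances.

The ratio of logged to unlogged instances is severely imbalanced.

SMOTE synthesizes additional logged instances to restore the balance.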
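
The core idea of SMOTE is to create a synthetic minority sample on the line segment between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified pure-Python version for numeric feature vectors; the function name `smote`, its parameters, and the tiny `k` are assumptions, and it needs at least two minority samples.

```python
import random


def smote(minority: list[tuple], n_new: int, k: int = 2, seed: int = 0) -> list[tuple]:
    """Generate n_new synthetic samples by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic


logged = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_samples = smote(logged, n_new=5)
```

Every synthetic point lies between two existing minority points, so the new samples stay inside the region the minority class already occupies.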

Evaluation

RQ1: What is the accuracy of LogAdvisor?
RQ2: What is the effect of different learning models?
RQ3: What is the effect of noise handling?
RQ4: How does LogAdvisor perform in the cross-project learning scenario?

A Decision Tree (DT) is used because of its:

  • good performance
  • ease of interpretation

10-fold cross-validation

balanced accuracy (BA) [19], which is the average of the proportion of logged instances and the proportion of unlogged instances that are correctly classified.
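
Written out, the definition above is BA = (TP/P + TN/N) / 2, where P and N are the numbers of logged and unlogged instances and TP and TN are the correctly classified ones in each class. A small sketch (the function name is chosen here, and it assumes both classes are present):

```python
def balanced_accuracy(y_true: list[int], y_pred: list[int]) -> float:
    """Average of the per-class recall for logged (1) and unlogged (0) instances."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    positives = sum(1 for t in y_true if t)
    negatives = len(y_true) - positives
    return 0.5 * (tp / positives + tn / negatives)


# 2 logged and 8 unlogged instances; the classifier predicts "unlogged" for all.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
score = balanced_accuracy(y_true, y_pred)  # 0.5, although plain accuracy is 0.8
```

This is why BA suits imbalanced data: a classifier that always predicts the majority class gets no credit beyond 0.5.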

Results of RQ1: Prediction Accuracy

Baseline:

  • Random
  • Errlog

Every result is above 0.5.

Results of RQ2: The Effect of Different Learning Models

Results of RQ3: The Effect of Noise Handling

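After processing, the proportion of noisy data is small.
The threshold is then tuned so that the proportion of noisy data reaches 5%.
With that setting, noise handling gives somewhat better results on all instances.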

Results of RQ4: Cross-Project Evaluation

User Study

It saves time and money.
The users all gave positive feedback.

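Discussion:

  • Logging quality
  • Different software systems (only C# was studied; other languages and systems remain)
  • what to log
    Error messages, stack information, and so on. LogEnhancer is ongoing work that fills in log content automatically.
  • Potential improvements:

  1. Other factors that affect the decision to log
  2. Interdependence of logging statements
  3. Runtime logging (when in-memory logs should actually be written out)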

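Summary:

The paper simply applies data mining algorithms to logging; no difficult algorithms are involved.
It is an attempt at a new area. Others had already analyzed logs in 2012, but this is probably among the earliest work to study logging experimentally.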
