Notes on Machine Learning Testing: Survey, Landscapes and Horizons

Content

  • PRELIMINARIES OF MACHINE LEARNING

    • elements

      • Dataset
        • Training data
        • Validation data
          • choose the model and fine-tune its hyper-parameters
          • prevent the model from overfitting
        • Test data
      • Learning program
      • Framework
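To make the three dataset roles above concrete, here is a minimal sketch of a shuffled train/validation/test split. The function name `split_dataset` and the 60/20/20 ratios are assumptions chosen for illustration, not something the survey prescribes.

```python
import random

def split_dataset(samples, train=0.6, validation=0.2, seed=0):
    """Shuffle samples and split them into train / validation / test lists."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],                  # used to fit the model
            shuffled[n_train:n_train + n_val],   # used to pick model / hyper-parameters
            shuffled[n_train + n_val:])          # held out for the final check

train_set, val_set, test_set = split_dataset(range(100))
```

The test portion is never consulted while tuning, which is what lets it stand in for unseen data.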
    • different types of machine learning

      • Supervised learning

        a type of machine learning that learns from training data with labels as learning targets. It is the most widely used type of machine learning.

      • Unsupervised learning

        a learning methodology that learns from training data without labels and relies on understanding the data itself.

      • Reinforcement learning

        a type of machine learning where the data are in the form of sequences of actions, observations, and rewards, and the learner learns how to take actions to interact in a specific environment so as to maximise the specified rewards.

    • other classifications

      • classic machine learning

        • Decision Tree

        • SVM

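          The main idea is to find a hyperplane that separates the data samples while maximising the margin, i.e. the distance from the hyperplane to the nearest samples.

          Finding the hyperplane → solving an optimisation problem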

        • linear regression

          Solved by least squares: set the partial derivatives to zero

        • Naive Bayes

          Computes the posterior probability from the prior probability
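For the linear regression item in the list above, the least-squares step can be sketched for a single feature. Setting the partial derivatives of the squared error with respect to slope and intercept to zero gives the closed form used below. The function name `fit_line` and the sample points are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Zeroing d/d(slope) and d/d(intercept) of sum((y - slope*x - intercept)^2)
    # yields slope = S_xy / S_xx and intercept = mean_y - slope * mean_x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # points on y = 2x + 1
```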

      • deep learning

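        • DNNs: neural networks with many layers
        • CNNs: convolution operations, feature extraction
        • RNNs: add a notion of time; besides the previous layer, each unit is also influenced by its own earlier state
      • comparison

        • the same: machine learning uses algorithms to analyse data, learn from it, and make inferences or predictions.
        • the difference: deep learning applies Deep Neural Networks (DNNs) that use multiple layers of nonlinear processing units for feature extraction and transformation.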
  • testing workflow

    • two stage

      • Offline testing
      • Online Testing: helps find out which model is better, or whether the new model is superior to the old model under certain application contexts.
        • A/B testing: a split testing technique that compares two versions of a system (e.g., web pages) involving customers.
        • MAB (Multi-Armed Bandit): first conducts A/B testing for a short time to find the best model, then puts more resources on the chosen model.
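The bandit strategy described above (a short A/B phase, then most traffic to the winner) can be sketched as an explore-then-commit loop. This is a simplified simulation under stated assumptions: the success rates, traffic budget, and the name `explore_then_commit` are all hypothetical, and real bandit algorithms usually keep exploring.

```python
import random

def explore_then_commit(success_rates, explore_per_arm=200, total=1000, seed=0):
    """Split traffic evenly for a short phase, then send the rest to the best arm."""
    rng = random.Random(seed)
    pulls = [0] * len(success_rates)
    wins = [0] * len(success_rates)

    def pull(arm):
        pulls[arm] += 1
        wins[arm] += rng.random() < success_rates[arm]  # simulated user outcome

    # Phase 1: plain A/B testing, every model gets the same share.
    for arm in range(len(success_rates)):
        for _ in range(explore_per_arm):
            pull(arm)

    # Phase 2: commit the remaining budget to the empirically best model.
    best = max(range(len(success_rates)), key=lambda a: wins[a] / pulls[a])
    for _ in range(total - sum(pulls)):
        pull(best)
    return best, pulls

best_model, traffic = explore_then_commit([0.2, 0.6])
```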

    • Test Input Generation Techniques

      • Domain-specific Test Input Synthesis: methods tailored to a specific domain

        • DeepXplore: a deep learning system, neuron coverage
        • DeepTest: autonomous driving systems, greedy search with nine different realistic image transformations
        • Generative adversarial networks (GANs): test generation with various weather conditions
          • a generator model G and a discriminator model D
      • Fuzz and Search-based Test Input Generation

        • fuzz testing vs. random testing: fuzz testing is a special form of random testing that aims to break the software.
      • Symbolic Execution Based Test Input Generation

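        A program analysis technique that analyses a program to obtain inputs that cause a specific code region to execute. As the name suggests, a program analysed with symbolic execution takes symbolic values as input rather than the concrete values used in normal execution. On reaching the target code, the analyser obtains the corresponding path constraints and then uses a constraint solver to derive concrete values that trigger the target code.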

      • Synthetic Data to Test Learning Program: synthesise inputs according to the sample distribution.

    • Test Oracle

      • Oracle Problem: a test oracle is what enables judging whether a bug exists
      • Metamorphic Relations as Test Oracles
      • Cross-Referencing as Test Oracles: detects bugs by observing whether similar applications yield different outputs regarding identical inputs
      • Measurement Metrics for Designing Test Oracles: define your own evaluation criteria
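As an illustration of a metamorphic relation used as a test oracle, the sketch below checks one relation on a toy 1-nearest-neighbour classifier: shuffling the order of the training samples should not change any prediction. The classifier, the data, and the function names are assumptions made for this example; no expected label is needed for the check.

```python
import random

def nearest_neighbour_predict(train, query):
    """train is a list of (features, label); return the label of the closest point."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda item: sq_dist(item[0], query))[1]

def permutation_relation_holds(train, queries, seed=0):
    """Metamorphic relation: reordering the training set keeps predictions unchanged."""
    shuffled = list(train)
    random.Random(seed).shuffle(shuffled)
    return all(
        nearest_neighbour_predict(train, q) == nearest_neighbour_predict(shuffled, q)
        for q in queries
    )

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]
queries = [(1, 0), (5, 6), (2, 2)]
relation_ok = permutation_relation_holds(train, queries)
```

A violation would point to a bug (for example order-dependent tie handling) even though the correct label for each query is never stated.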
    • Test Adequacy

      • Test Coverage
        • Neuron coverage
        • MC/DC coverage variants
        • Layer-level coverage: checks the combinatorial activation status of the neurons in each layer
        • Limitations of Coverage Criteria: it is not clear how such criteria directly relate to the system decision logic.
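A minimal sketch of neuron coverage, assuming the usual definition: the fraction of neurons whose activation exceeds a threshold for at least one test input. The activation values, the threshold of 0.5, and the function name are made up for illustration; a real tool would read activations from the network.

```python
def neuron_coverage(activations, threshold=0.5):
    """activations[i][j] is the output of neuron j on test input i."""
    n_neurons = len(activations[0])
    covered = set()
    for per_input in activations:
        for neuron, value in enumerate(per_input):
            if value > threshold:
                covered.add(neuron)  # activated by at least one input
    return len(covered) / n_neurons

activations = [
    [0.9, 0.1, 0.0, 0.2],  # test input 1 activates neuron 0
    [0.3, 0.7, 0.0, 0.1],  # test input 2 activates neuron 1
]
coverage = neuron_coverage(activations)  # 2 of 4 neurons covered
```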
      • Mutation Testing
      • Surprise Adequacy: its authors argue that a ‘good’ test input should be ‘sufficiently but not overly surprising’ compared with the training data.
      • Rule-based Checking of Test Adequacy: self-defined rules
    • Test Prioritisation and Reduction

      • prioritise test inputs (for testing)
      • rank test instances by their sensitivity to noise (for adversarial example generation)
    • Bug Report Analysis

  • testing properties

    • basic functional requirements

      • correctness

        • principle: to isolate test data via data sampling to check whether the trained model fits new cases

          • cross-validation
          • bootstrap: sampling with replacement
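Both sampling schemes above can be sketched as index generators. This is an illustrative version with assumed names (`k_fold_indices`, `bootstrap_sample`); the folds here are formed by striding rather than shuffling, which is a simplification.

```python
import random

def k_fold_indices(n, k):
    """Yield (train_indices, test_indices) for each of k folds over n samples."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

def bootstrap_sample(n, seed=0):
    """Draw n indices with replacement; unseen indices form the out-of-bag test set."""
    rng = random.Random(seed)
    train = [rng.randrange(n) for _ in range(n)]
    out_of_bag = sorted(set(range(n)) - set(train))
    return train, out_of_bag

folds = list(k_fold_indices(10, 5))
boot_train, boot_test = bootstrap_sample(10)
```

In both cases the model is checked only on samples it was not trained on, which is the isolation principle the note describes.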
        • correctness measurements

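          • accuracy: $\frac{TP+TN}{TP+TN+FP+FN}$; the number of correctly classified samples divided by the total number of samples

            Take predicting whether an earthquake occurs in some region on a given day. Suppose we have a set of features as attributes for the classification and only two classes: 0 (no earthquake) and 1 (earthquake). A classifier that blindly assigns class 0 to every test case may reach 99% accuracy, yet when an earthquake really comes it notices nothing, and the resulting loss is enormous. A 99%-accurate classifier is not what we want here because the data are imbalanced: class 1 has so few samples that misclassifying all of them still gives high accuracy while ignoring exactly what we care about.

          • precision: $P = \frac{TP}{TP+FP}$; of the samples predicted positive, how many are actually positive

          • recall: $R = \frac{TP}{TP+FN}$; of the actual positive samples, how many are predicted correctly

          • F-measure (F1): $F = \frac{2 \cdot P \cdot R}{P+R}$; combines the P and R metrics

          • ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve)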
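The formulas above can be checked against the earthquake example. The counts below are hypothetical (1000 days, 10 of them with an earthquake, and a classifier that always predicts class 0), and returning 0.0 when a denominator is zero is a convention assumed here, not part of the definitions.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Always predicting "no earthquake": TP=0, TN=990, FP=0, FN=10.
accuracy, precision, recall, f1 = classification_metrics(tp=0, tn=990, fp=0, fn=10)
```

Accuracy comes out at 0.99 while recall and F1 are 0, which is the imbalance problem the example describes.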

      • overfitting: leads to high correctness on the existing training data yet low correctness on unseen data.

        • Cross-validation

        • Perturbed Model Validation (PMV)

          PMV operates by injecting noise into the training data, re-training the model against the perturbed data, then using the training accuracy decrease rate to assess model relevance. A larger decrease rate indicates better concept-hypothesis fit.
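The PMV idea can be illustrated with a deliberately simplified sketch: flip a fraction of the training labels, retrain, and compare training accuracy before and after. The two toy learners (a one-threshold rule and a lookup table that memorises everything), the single noise ratio of 0.2, and all names are assumptions for illustration; the actual method measures the decrease over several noise levels.

```python
import random

def flip_labels(labels, ratio, rng):
    """Return a copy of binary labels with a fraction `ratio` flipped."""
    flipped = list(labels)
    for i in rng.sample(range(len(labels)), round(ratio * len(labels))):
        flipped[i] = 1 - flipped[i]
    return flipped

def stump_training_accuracy(xs, ys):
    """Retrain a one-threshold rule on (xs, ys); return its training accuracy."""
    best = 0.0
    for t in xs:
        acc = sum((x > t) == bool(y) for x, y in zip(xs, ys)) / len(xs)
        best = max(best, acc, 1 - acc)  # 1 - acc is the mirrored rule
    return best

def memoriser_training_accuracy(xs, ys):
    """Retrain a lookup table that memorises every (x, y) pair."""
    table = dict(zip(xs, ys))
    return sum(table[x] == y for x, y in zip(xs, ys)) / len(xs)

def decrease_rate(training_accuracy, xs, ys, ratio=0.2, seed=0):
    """Drop in training accuracy per unit of injected label noise."""
    noisy = flip_labels(ys, ratio, random.Random(seed))
    return (training_accuracy(xs, ys) - training_accuracy(xs, noisy)) / ratio

xs = list(range(100))
ys = [int(x >= 50) for x in xs]  # the true concept is a single threshold
stump_rate = decrease_rate(stump_training_accuracy, xs, ys)
memoriser_rate = decrease_rate(memoriser_training_accuracy, xs, ys)
```

The memoriser fits the noise perfectly, so its training accuracy does not drop at all, while the well-matched threshold rule loses accuracy roughly in proportion to the noise.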

    • non-functional requirements

      • robustness: checks the correctness of the system in the presence of noise

        • adversarial robustness
          • Perturbation Targeting Test Data: adversarial example generation approaches
          • Perturbation Targeting the Whole System
      • security

        low robustness is just one cause for high security risk.

      • data privacy

        current research on data privacy mainly focuses on how to provide privacy-preserving machine learning, rather than on detecting privacy violations

      • interpretability

        • Manual Assessment of Interpretability
        • Automatic Assessment of Interpretability
          • The metric measures whether the learner has actually learned the object in an object identification scenario by occluding the surroundings of the object.
          • Several models are identified as having good interpretability, including linear regression, logistic regression and decision tree models.
  • Testing Components

    • Bug Detection in Data

      • purpose

        • whether the data is sufficient for training or testing a model
        • whether the data is representative of future data
        • whether the data contains a lot of noise such as biased labels
        • whether there is skew between training data and test data
        • whether there is data poisoning or adversary information that may affect the model’s performance
      • aspects

        • Bug Detection in Training Data

        • Bug Detection in Test Data

        • Skew Detection in Training and Test Data

          The training instances and the instances that the model predicts should be consistent in aspects such as features and distributions.

        • Frameworks in Detecting Data Bugs
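For the skew item in the list above, one crude way to flag a distribution shift in a single numeric feature is to measure how far the serving-time mean has moved, in units of the training standard deviation. This is a heuristic sketch with assumed names and toy values, not the method of any particular data validation framework.

```python
import statistics

def mean_shift(train_values, serving_values):
    """Shift of the serving mean, in units of the training standard deviation."""
    sigma = statistics.pstdev(train_values)
    shift = abs(statistics.mean(serving_values) - statistics.mean(train_values))
    if sigma == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / sigma

train_ages = [20, 30, 40, 50, 60]
same = mean_shift(train_ages, [25, 35, 40, 45, 55])      # same mean as training
skewed = mean_shift(train_ages, [60, 70, 80, 90, 100])   # mean moved from 40 to 80
```

A large value for a feature suggests the model is being asked to predict on data unlike what it was trained on.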

    • Bug Detection in Learning Program

      • purpose
        • the algorithm is designed, chosen, or configured improperly
        • the developers make typos or errors when implementing the designed algorithm
      • aspect
        • Unit Tests for ML Learning Program
        • Algorithm Configuration Examination: compatibility problems
        • Algorithm Selection Examination: compare deep learning and classic learning
        • Mutant Simulations of Learning Program Faults
    • Bug Detection in Frameworks

      • purpose: checks whether the frameworks of machine learning have bugs that may lead to problems in the final system
      • Solutions towards Detecting Implementation Bugs:
        • use multiple implementations or differential testing to detect bugs
        • metamorphic testing
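Differential testing across implementations can be sketched with two versions of softmax: a naive one and a numerically stable one. Feeding both the same inputs and comparing outputs exposes the naive version's overflow on large values. Both implementations and the helper name `implementations_agree` are written for this illustration and do not come from any real framework.

```python
import math

def softmax_naive(xs):
    """Direct formula; math.exp overflows for large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    """Subtract the maximum first, which keeps every exponent <= 0."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def implementations_agree(impl_a, impl_b, xs, tol=1e-9):
    """Differential check: same input, outputs must match within tol."""
    try:
        a, b = impl_a(xs), impl_b(xs)
    except OverflowError:
        return False  # a crash in either implementation counts as a discrepancy
    return all(abs(p - q) <= tol for p, q in zip(a, b))

small_ok = implementations_agree(softmax_naive, softmax_stable, [1.0, 2.0, 3.0])
large_ok = implementations_agree(softmax_naive, softmax_stable, [1000.0, 1001.0])
```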
  • Software Testing vs. ML Testing

    • Component to test: traditional software testing detects bugs in the code

    • Behaviours under test: the behaviours of an ML model may change frequently as the training data is updated

    • Test input: when testing the data, the test input could be a learning program

    • Test oracle

      • traditional software testing assumes the presence of a test oracle
      • in traditional testing, the oracle is usually determined beforehand
      • in ML testing, the answers are usually unknown
    • Test adequacy criteria

      • traditional software testing: line coverage, branch coverage, dataflow coverage
      • ML testing: new test adequacy criteria are required to take the characteristics of machine learning software into consideration.
    • False positives in detected bugs

      ML testing tends to yield more false positives

    • Roles of testers

      data scientists or algorithm designers could also play the role of testers

  • application scenarios: domain-specific testing approaches

    • Autonomous Driving

    • Machine Translation

      Machine translation automatically translates text or speech from one language to another.

      • translation consistency
      • the algorithm for detecting machine translation violations
    • Natural Language Inference

      A Natural Language Inference (NLI) task judges the inference relationship of a pair of natural language sentences. For example, the sentence ‘A person is in the room’ could be inferred from the sentence ‘A girl is in the room’.

      • robustness test
  • research distribution

    • General Machine Learning and Deep Learning: Before 2017, papers mostly focus on general machine learning; after 2018, both general machine learning and deep learning testing notably arise.

    • Supervised/Unsupervised/Reinforcement Learning Testing: almost all the work we identified in this survey focused on testing supervised machine learning

      reason:

      • First, supervised learning is a widely-known learning scenario associated with classification, regression, and ranking problems. It is natural that researchers would emphasise the testing of widely-applied, known and familiar techniques at the beginning.
      • Second, supervised learning usually has labels in the dataset. It is thereby easier to judge and analyse test effectiveness.
    • Different Learning Tasks: almost all of them focus on classification

    • Different Testing Properties: around one-third (32.1%) of the papers test correctness. Another one-third of the papers focus on robustness and security problems. Fairness testing ranks third among all the properties, with 13.8% of the papers.

  • CHALLENGES

    • Test Input Generation
      • applying SBST on generating test inputs for testing ML systems (Search-based test generation (SBST) uses a metaheuristic optimising search technique, such as a Genetic Algorithm, to automatically generate test inputs.)
      • how to generate natural test inputs and how to automatically measure the naturalness of the generated inputs. (The existing test input generation techniques focus more on generating adversarial input to test the robustness of an ML system.)
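The SBST idea mentioned above can be sketched with a tiny genetic algorithm that searches for inputs near a model's decision boundary, where misbehaviour is more likely to show. Everything here is a hypothetical stand-in: `model_score` plays the role of the model under test, and the fitness function, population size, and mutation scale are arbitrary choices for illustration.

```python
import random

def model_score(x):
    """Hypothetical model under test: the predicted class flips where the score is 0."""
    return x[0] + x[1] - 1.0

def fitness(x):
    """Inputs closer to the decision boundary are considered fitter."""
    return -abs(model_score(x))

def evolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5), rng.uniform(-5, 5)] for _ in range(pop_size)]
    initial_best = max(map(fitness, population))
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]              # selection, elites survive
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [a[0], b[1]]                          # crossover
            child[rng.randrange(2)] += rng.gauss(0, 0.3)  # mutation
            children.append(child)
        population = parents + children
    return initial_best, max(population, key=fitness)

initial_best, best_input = evolve()
```

Because the fitter half is carried over unchanged each generation, the best fitness can never get worse than in the initial random population.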
    • Oracle Problem
      • Metamorphic relations: are proposed by human beings
      • A big challenge is thus to automatically identify and construct reliable test oracles for ML testing.
    • Testing Cost Reduction
      • A possible research direction for reducing cost is to represent an ML model in some kind of intermediate state to make it easier to test.
      • We could also apply traditional cost reduction techniques such as test prioritisation to reduce the number of test cases while retaining test correctness.
  • OPPORTUNITIES

    • More research is highly desirable for unsupervised learning and reinforcement learning.
    • Testing More Properties
    • there are very few benchmarks like CleverHans that are specially designed for the ML testing research (i.e., adversarial example construction) purpose.
    • no work has explored how to better design mutation operators for machine learning code so that the mutants could better simulate real-world machine learning bugs
