Literature notes for my graduation thesis, on verifiable robustness of large language models. The notes are split into two parts: the attack part collects adversarial attack methods, and the certified robustness part collects methods that verify and/or improve robustness (specifically via randomized smoothing).
Attack
title | abbr. | time | model | others |
---|---|---|---|---|
Generating Natural Language Adversarial Examples | Alzantot et al. | 2018 | ||
Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency | Ren et al. | 2019 | ||
Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment | TextFooler | 2020 | BERT | black-box attack; word substitution ordered by word importance |
TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP | TextAttack | 2020 | ||
Evaluating the Robustness of Neural Language Models to Input Perturbations | reorder | 2021 | ||
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models | AdvGLUE; multiple attack methods (including sentence-level) | 2021 | ||
Query-Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization | Bayesian optimization | 2022 | ||
Tailor: Generating and Perturbing Text with Semantic Controls | semantic-preserving | Mar-22 | for generating sentence-level adversarial text | |
Large Language Models Can Be Easily Distracted by Irrelevant Context | irrelevant context | 2023 ||
Certified Robustness
Split into vision models (image classification) and language models (text classification).
Vision Models
Before 2019, the methods were deterministic bound computations in the style of IBP or CROWN (very math-heavy; I skip the proofs). In recent years, randomized smoothing, a probabilistic approach that adds noise to the input, has taken over; it can improve robustness at the same time as certifying it.
title | abbr. | time | model |
---|---|---|---|
AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation | zonotope, abstract interpretation | 2018 | image |
Fast and Effective Robustness Certification | DeepZ, zonotope | 2018 | image |
Efficient Neural Network Robustness Certification with General Activation Functions | CROWN | 2018 | image |
Towards Fast Computation of Certified Robustness for ReLU Networks | Fast-Lin | Oct-18 | image |
An abstract domain for certifying neural networks | DeepPoly | 2019 | image |
Certified Adversarial Robustness via Randomized Smoothing | randomized smoothing | 2019 | blackbox |
Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers | randomized smoothing | 2019 | blackbox |
Black-Box Certification with Randomized Smoothing: A Functional Optimization Based Framework | randomized smoothing | 2020 | blackbox |
TSS: Transformation-Specific Smoothing for Robustness Certification | TSS | 2021 | transformation-specific |
PRIMA: General and Precise Neural Network Certification via Scalable Convex Hull Approximations | PRIMA | Jan-22 | |
Certified Adversarial Robustness via Anisotropic Randomized Smoothing | randomized smoothing | 2022 | blackbox |
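The randomized smoothing certification listed above (Cohen et al., 2019) can be sketched roughly as follows. This is a minimal toy version: the base classifier is a stand-in, `n` is far smaller than the tens of thousands of samples used in practice, and the abstention/radius logic is simplified to the core formula R = σ·Φ⁻¹(p̲A).

```python
import numpy as np
from scipy.stats import beta, norm

def base_classifier(x):
    # Toy stand-in for the base classifier f: sign of the first coordinate.
    return int(x[0] > 0)

def certify(x, sigma=0.25, n=1000, alpha=0.001, num_classes=2, seed=0):
    """Randomized-smoothing certification sketch in the style of Cohen et al.

    Samples n Gaussian perturbations of x, majority-votes the base
    classifier, lower-bounds the top-class probability pA with a one-sided
    Clopper-Pearson interval, and returns (class, certified L2 radius).
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        noisy = x + sigma * rng.standard_normal(x.shape)
        counts[base_classifier(noisy)] += 1
    top = int(counts.argmax())
    k = int(counts[top])
    # One-sided (1 - alpha) Clopper-Pearson lower bound on pA.
    p_lower = float(beta.ppf(alpha, k, n - k + 1)) if k > 0 else 0.0
    if p_lower <= 0.5:
        return None, 0.0  # abstain: cannot certify
    radius = sigma * float(norm.ppf(p_lower))  # R = sigma * Phi^{-1}(pA_lower)
    return top, radius

label, radius = certify(np.array([1.0, 0.0]))
print(label, radius)
```

Note how the certificate is purely black-box: only forward passes of the base classifier are needed, which is why the table marks these entries as "blackbox".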
Language Models
The overall idea mirrors the vision-model case, except that a language model's input is discrete: almost no point in a high-dimensional neighborhood maps back to a valid input (token). Text-CRS pushes the word-level setting about as far as it can go, formally defining every perturbation operation; impressive work.
title | abbr. | time | model | others |
---|---|---|---|---|
Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation | IBP robust training | 2019 | ||
Certified Robustness to Adversarial Word Substitutions | IBP robust training | 2019 | ||
Towards Stable and Efficient Training of Verifiably Robust Neural Networks | IBP robust training, CROWN-IBP | Nov-19 | ||
Robustness Verification for Transformers | CROWN-like | Feb-20 | transformer | sentiment classification |
SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions | randomized smoothing | 2020 | blackbox | word substitution |
Certified Robustness to Programmable Transformations in LSTMs | Sep-21 | LSTM only | ||
Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble | randomized smoothing | Aug-21 | BERT | |
Towards Robustness Against Natural Language Word Substitutions | ASCC, adversarial training | 2021 | LSTM, CBOW | ASCC generates the adversarial examples |
Certified Robustness Against Natural Language Attacks by Causal Intervention | CISS | 2022 | ||
Certified Robustness to Text Adversarial Attacks by Randomized [MASK] | randomized smoothing | Jun-23 | ||
Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks | Text-CRS | 2024 | BERT, LSTM | defends against four attack operations |
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | random prompt dropping | 2024 | ChatGPT | |
CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models | RL and random [mask] | 2024 | BERT, ChatGPT | universal TP |
NLP Verification: Towards a General Methodology for Certifying Robustness | randomized smoothing | 2024 | | semantic perturbation |
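Because language-model inputs are discrete, Gaussian noise cannot be added directly; SAFER-style methods instead smooth over random synonym substitutions and take a majority vote. A minimal sketch of that discrete smoothing step, where the synonym table and base classifier are toy assumptions (a real system would use an embedding- or WordNet-derived perturbation set) and the actual certificate over all allowed substitutions is omitted:

```python
import random
from collections import Counter

# Hypothetical synonym table standing in for a real perturbation set.
SYNONYMS = {
    "good": ["great", "fine", "nice"],
    "movie": ["film", "picture"],
    "bad": ["poor", "awful"],
}

def toy_classifier(tokens):
    # Stand-in base classifier: positive iff a "good"-cluster word appears.
    positive = {"good", "great", "fine", "nice"}
    return 1 if any(t in positive for t in tokens) else 0

def smoothed_predict(tokens, n=200, p_sub=0.5, seed=0):
    """Majority vote of the base classifier over random synonym substitutions.

    Each word is replaced by a uniformly random synonym with probability
    p_sub, mirroring how SAFER-style smoothing samples from the discrete
    perturbation set instead of adding Gaussian noise.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n):
        noisy = [
            rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p_sub else t
            for t in tokens
        ]
        votes[toy_classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n

label, p_hat = smoothed_predict(["a", "good", "movie"])
print(label, p_hat)
```

The prediction is certified (in SAFER's framework) when the estimated vote margin is large enough that no single adversarial synonym substitution can flip the majority; that bounding step is what the sketch leaves out.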
Current Thoughts
The idea in "NLP Verification: Towards a General Methodology for Certifying Robustness" coincides with mine: current robustness verification (i.e. certified robustness) for language models focuses mainly on word-level perturbations. Under a sentence-level perturbation, i.e. paraphrasing and similar operations that preserve the sentence's meaning, the existing methods no longer fit well.