Literature notes for my graduation thesis, on verifiable robustness of large language models. The notes are split into two parts: the attack part collects adversarial attack methods, and the certified robustness part collects methods that verify and/or improve robustness (specifically via randomized smoothing).
Attack
title | abbr. | time | model | others |
---|---|---|---|---|
Generating Natural Language Adversarial Examples | Alzantot et al. | 2018 | ||
Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency | Ren et al. | 2019 | ||
Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment | TextFooler | 2020 | BERT | black-box attack; word substitution ordered by word importance |
TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP | TextAttack | 2020 | ||
Evaluating the Robustness of Neural Language Models to Input Perturbations | reorder | 2021 | ||
Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models | AdvGLUE; multiple attack methods (including sentence-level) | 2021 | ||
Query-Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization | Bayesian optimization | 2022 | ||
Tailor: Generating and Perturbing Text with Semantic Controls | semantic-preserving | Mar-22 | for generating sentence-level adversarial text | |
Large Language Models Can Be Easily Distracted by Irrelevant Context | irrelevant context | 2023 ||
Certified Robustness
Split into vision models (image classification) and language models (text classification).
Vision Models
Before 2019, the methods were deterministic bound computations in the style of IBP or CROWN (very math-heavy; I skip the proofs). In recent years, randomized smoothing, a probabilistic approach that adds noise to the input, has taken over; it can improve robustness at the same time as certifying it.
title | abbr. | time | model |
---|---|---|---|
AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation | zonotope, abstract interpretation | 2018 | image |
Fast and Effective Robustness Certification | DeepZ, zonotope | 2018 | image |
Efficient Neural Network Robustness Certification with General Activation Functions | CROWN | 2018 | image |
Towards Fast Computation of Certified Robustness for ReLU Networks | Fast-Lin | Oct-18 | image |
An abstract domain for certifying neural networks | DeepPoly | 2019 | image |
Certified Adversarial Robustness via Randomized Smoothing | randomized smoothing | 2019 | blackbox |
Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers | randomized smoothing | 2019 | blackbox |
Black-Box Certification with Randomized Smoothing: A Functional Optimization Based Framework | randomized smoothing | 2020 | blackbox |
TSS: Transformation-Specific Smoothing for Robustness Certification | TSS | 2021 | transformation-specific |
PRIMA: General and Precise Neural Network Certification via Scalable Convex Hull Approximations | PRIMA | Jan-22 | |
Certified Adversarial Robustness via Anisotropic Randomized Smoothing | randomized smoothing | 2022 | blackbox |
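The randomized smoothing certification listed above (Cohen et al., 2019) can be sketched roughly as follows. This is a minimal toy version: the base classifier is a stand-in, `n` is far smaller than the tens of thousands of samples used in practice, and the abstention/radius logic is simplified to the core formula R = σ·Φ⁻¹(p̲A).

```python
import numpy as np
from scipy.stats import beta, norm

def base_classifier(x):
    # Toy stand-in for the base classifier f: sign of the first coordinate.
    return int(x[0] > 0)

def certify(x, sigma=0.25, n=1000, alpha=0.001, num_classes=2, seed=0):
    """Randomized-smoothing certification sketch in the style of Cohen et al.

    Samples n Gaussian perturbations of x, majority-votes the base
    classifier, lower-bounds the top-class probability pA with a one-sided
    Clopper-Pearson interval, and returns (class, certified L2 radius).
    """
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        noisy = x + sigma * rng.standard_normal(x.shape)
        counts[base_classifier(noisy)] += 1
    top = int(counts.argmax())
    k = int(counts[top])
    # One-sided (1 - alpha) Clopper-Pearson lower bound on pA.
    p_lower = float(beta.ppf(alpha, k, n - k + 1)) if k > 0 else 0.0
    if p_lower <= 0.5:
        return None, 0.0  # abstain: cannot certify
    radius = sigma * float(norm.ppf(p_lower))  # R = sigma * Phi^{-1}(pA_lower)
    return top, radius

label, radius = certify(np.array([1.0, 0.0]))
print(label, radius)
```

Note how the certificate is purely black-box: only forward passes of the base classifier are needed, which is why the table marks these entries as "blackbox".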
Language Models
The overall idea mirrors the vision-model case, except that a language model's input is discrete: almost no point in a high-dimensional neighborhood maps back to a valid input (token). Text-CRS pushes the word-level setting about as far as it can go, formally defining every perturbation operation; impressive work.
title | abbr. | time | model | others |
---|---|---|---|---|
Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation | IBP robust training | 2019 | ||
Certified Robustness to Adversarial Word Substitutions | IBP robust training | 2019 | ||
Towards Stable and Efficient Training of Verifiably Robust Neural Networks | IBP robust training, CROWN-IBP | Nov-19 | ||
Robustness Verification for Transformers | CROWN-like | Feb-20 | transformer | sentiment classification |
SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions | randomized smoothing | 2020 | blackbox | word substitution |
Certified Robustness to Programmable Transformations in LSTMs | Sep-21 | LSTM only | ||
Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble | randomized smoothing | Aug-21 | BERT | |
Towards Robustness Against Natural Language Word Substitutions | ASCC, adversarial training | 2021 | LSTM, CBOW | ASCC generates the adversarial examples |
Certified Robustness Against Natural Language Attacks by Causal Intervention | CISS | 2022 | ||
Certified Robustness to Text Adversarial Attacks by Randomized [MASK] | randomized smoothing | Jun-23 | ||
Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks | Text-CRS | 2024 | BERT, LSTM | defends against four attack operations |
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | random prompt dropping | 2024 | ChatGPT | |
CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models | RL and random [mask] | 2024 | BERT, ChatGPT | universal TP |
NLP Verification: Towards a General Methodology for Certifying Robustness | randomized smoothing | 2024 | | semantic perturbation |
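Because language-model inputs are discrete, Gaussian noise cannot be added directly; SAFER-style methods instead smooth over random synonym substitutions and take a majority vote. A minimal sketch of that discrete smoothing step, where the synonym table and base classifier are toy assumptions (a real system would use an embedding- or WordNet-derived perturbation set) and the actual certificate over all allowed substitutions is omitted:

```python
import random
from collections import Counter

# Hypothetical synonym table standing in for a real perturbation set.
SYNONYMS = {
    "good": ["great", "fine", "nice"],
    "movie": ["film", "picture"],
    "bad": ["poor", "awful"],
}

def toy_classifier(tokens):
    # Stand-in base classifier: positive iff a "good"-cluster word appears.
    positive = {"good", "great", "fine", "nice"}
    return 1 if any(t in positive for t in tokens) else 0

def smoothed_predict(tokens, n=200, p_sub=0.5, seed=0):
    """Majority vote of the base classifier over random synonym substitutions.

    Each word is replaced by a uniformly random synonym with probability
    p_sub, mirroring how SAFER-style smoothing samples from the discrete
    perturbation set instead of adding Gaussian noise.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n):
        noisy = [
            rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p_sub else t
            for t in tokens
        ]
        votes[toy_classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n

label, p_hat = smoothed_predict(["a", "good", "movie"])
print(label, p_hat)
```

The prediction is certified (in SAFER's framework) when the estimated vote margin is large enough that no single adversarial synonym substitution can flip the majority; that bounding step is what the sketch leaves out.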
Current Thoughts
The idea in "NLP Verification: Towards a General Methodology for Certifying Robustness" coincides with mine: current robustness verification (i.e. certified robustness) for language models focuses mainly on word-level perturbations. Under a sentence-level perturbation, i.e. paraphrasing and similar operations that preserve the sentence's meaning, the existing methods no longer fit well.