Certified Robustness for LLMs: Graduation Thesis Preparation

Notes on the literature for my graduation thesis on certified robustness of large language models. The survey is split into two parts: the Attack part collects methods for crafting adversarial attacks, and the Certified Robustness part collects methods for verifying and/or improving robustness (specifically, randomized smoothing).

Attack

| title | abbr. | time | model | others |
| --- | --- | --- | --- | --- |
| Generating Natural Language Adversarial Examples | Alzantot et al. | 2018 | | |
| Generating Natural Language Adversarial Examples through Probability Weighted Word Saliency | Ren et al. | 2019 | | |
| Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment | TextFooler | 2020 | BERT | black-box; word substitution ordered by word importance |
| TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP | TextAttack | 2020 | | |
| Evaluating the Robustness of Neural Language Models to Input Perturbations | reorder | 2021 | | |
| Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models | AdvGLUE | 2021 | | multiple attack methods (including sentence-level) |
| Query-Efficient and Scalable Black-Box Adversarial Attacks on Discrete Sequential Data via Bayesian Optimization | Bayesian optimization | 2022 | | |
| Tailor: Generating and Perturbing Text with Semantic Controls | semantic-preserving | Mar-22 | | generates sentence-level adversarial text |
| Large Language Models Can Be Easily Distracted by Irrelevant Context | irrelevant context | 2023 | | |

Certified Robustness

Split into vision models (image classifiers) and language models (text classifiers).

Vision Models

Before 2019, this line of work relied on exact bound-propagation methods such as IBP and CROWN (the math is heavy; I skipped the proofs). In recent years the field has shifted toward randomized smoothing, a probabilistic approach based on adding noise, which can improve robustness at the same time as certifying it.
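The randomized smoothing recipe can be sketched in a few lines: classify many noisy copies of the input, take a majority vote, and turn a lower confidence bound on the top-class probability into a certified L2 radius. A minimal sketch in the spirit of Cohen et al. (2019) — the toy `base_classifier` is mine, and I use a simple Hoeffding lower bound where the paper uses Clopper–Pearson:

```python
import math
import random
from statistics import NormalDist

def certify(base_classifier, x, sigma=0.25, n=1000, alpha=0.001):
    """Certify a prediction via Gaussian randomized smoothing.

    base_classifier maps a (perturbed) feature list to a class label.
    Returns (predicted_class, certified_l2_radius), or (None, 0.0)
    when the smoothed classifier abstains.
    """
    counts = {}
    for _ in range(n):
        noisy = [xi + random.gauss(0.0, sigma) for xi in x]
        label = base_classifier(noisy)
        counts[label] = counts.get(label, 0) + 1
    top_class, top_count = max(counts.items(), key=lambda kv: kv[1])
    # Hoeffding lower confidence bound on the top-class probability p_A.
    p_a_lower = top_count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_a_lower <= 0.5:
        return None, 0.0  # abstain: the vote is not confident enough
    # Certified radius R = sigma * Phi^{-1}(p_A_lower).
    radius = sigma * NormalDist().inv_cdf(p_a_lower)
    return top_class, radius

# Toy base classifier: the sign of the first coordinate.
f = lambda z: int(z[0] > 0.0)
random.seed(0)
label, radius = certify(f, [1.0, 0.0], sigma=0.25, n=2000)
```

For this toy input the vote is nearly unanimous, so a positive radius is certified; with a real network the same loop simply swaps in the trained model as `base_classifier`.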

| title | abbr. | time | model |
| --- | --- | --- | --- |
| AI2: Safety and Robustness Certification of Neural Networks with Abstract Interpretation | zonotope, abstract interpretation | 2018 | image |
| Fast and Effective Robustness Certification | DeepZ, zonotope | 2018 | image |
| Efficient Neural Network Robustness Certification with General Activation Functions | CROWN | 2018 | image |
| Towards Fast Computation of Certified Robustness for ReLU Networks | Fast-Lin | Oct-18 | image |
| An abstract domain for certifying neural networks | DeepPoly | 2019 | image |
| Certified Adversarial Robustness via Randomized Smoothing | randomized smoothing | 2019 | black-box |
| Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers | randomized smoothing | 2019 | black-box |
| Black-Box Certification with Randomized Smoothing: A Functional Optimization Based Framework | randomized smoothing | 2020 | black-box |
| TSS: Transformation-Specific Smoothing for Robustness Certification | TSS | 2021 | transformation-specific |
| PRIMA: General and Precise Neural Network Certification via Scalable Convex Hull Approximations | PRIMA | Jan-22 | |
| Certified Adversarial Robustness via Anisotropic Randomized Smoothing | randomized smoothing | 2022 | black-box |

Language Models

The overall approach mirrors the vision-model work, except that a language model's input is discrete: almost no point in the surrounding high-dimensional region maps back to a valid input (token sequence). Text-CRS arguably takes word-level certification to its limit, formally defining every perturbation operation. Impressive work.
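Because the input is discrete, the noise is usually applied in token space rather than as continuous perturbation, e.g. by randomly masking tokens and voting, as in the randomized [MASK] line of work. A minimal sketch under that assumption — the `mask_rate`, the toy classifier, and the voting interface are illustrative, not any paper's exact procedure:

```python
import random
from collections import Counter

MASK = "[MASK]"

def smoothed_predict(base_classifier, tokens, mask_rate=0.3, n=500, seed=0):
    """Majority-vote prediction over randomly masked copies of the input.

    base_classifier maps a token list to a class label. Returns the
    winning label and its empirical top-class probability.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n):
        masked = [MASK if rng.random() < mask_rate else t for t in tokens]
        votes[base_classifier(masked)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n

# Toy classifier: positive iff the word "good" survives masking.
f = lambda toks: int("good" in toks)
label, p_hat = smoothed_predict(f, ["this", "movie", "is", "good"])
```

A certificate then argues that a bounded number of word substitutions cannot change the masked-input distribution enough to flip the vote; the sketch only shows the prediction side.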

| title | abbr. | time | model | others |
| --- | --- | --- | --- | --- |
| Achieving Verified Robustness to Symbol Substitutions via Interval Bound Propagation | IBP robust training | 2019 | | |
| Certified Robustness to Adversarial Word Substitutions | IBP robust training | 2019 | | |
| Towards Stable and Efficient Training of Verifiably Robust Neural Networks | IBP robust training, CROWN-IBP | Nov-19 | | |
| Robustness Verification for Transformers | CROWN-like | Feb-20 | transformer | sentiment classification |
| SAFER: A Structure-free Approach for Certified Robustness to Adversarial Word Substitutions | randomized smoothing | 2020 | black-box | word substitution |
| Certified Robustness to Programmable Transformations in LSTMs | | Sep-21 | LSTM only | |
| Defense against Synonym Substitution-based Adversarial Attacks via Dirichlet Neighborhood Ensemble | randomized smoothing | Aug-21 | BERT | |
| Towards Robustness Against Natural Language Word Substitutions | ASCC, adversarial training | 2021 | LSTM, CBOW | ASCC generates the adversarial examples |
| Certified Robustness Against Natural Language Attacks by Causal Intervention | CISS | 2022 | | |
| Certified Robustness to Text Adversarial Attacks by Randomized [MASK] | randomized smoothing | Jun-23 | | |
| Text-CRS: A Generalized Certified Robustness Framework against Textual Adversarial Attacks | Text-CRS | 2024 | BERT, LSTM | defense against four attack operations |
| Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM | random prompt dropping | 2024 | ChatGPT | |
| CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models | RL and random [MASK] | 2024 | BERT, ChatGPT | universal text perturbations |
| NLP Verification: Towards a General Methodology for Certifying Robustness | randomized smoothing | 2024 | | semantic perturbation |

Current Thoughts

The ideas in "NLP Verification: Towards a General Methodology for Certifying Robustness" align closely with mine. Robustness verification for language models (i.e., certified robustness) currently focuses on word-level perturbations. Under sentence-level perturbations — paraphrasing and similar operations that preserve the meaning of the sentence — existing methods are no longer well suited.
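One way to make the sentence-level idea concrete: replace the token-level noise distribution with a sampler of meaning-preserving paraphrases and vote over its outputs. The sketch below shows only the smoothing/voting step; the `paraphraser` interface and both toy stand-ins are hypothetical, and a real certificate would additionally need a bound relating the paraphrase distributions of semantically close sentences:

```python
import random
from collections import Counter

def paraphrase_smoothed_predict(classifier, paraphraser, sentence, n=100, seed=0):
    """Majority vote over sampled meaning-preserving paraphrases.

    paraphraser(sentence, rng) is a hypothetical sampler (e.g. a
    back-translation or LLM rewriting model); classifier maps a
    sentence string to a class label.
    """
    rng = random.Random(seed)
    votes = Counter(classifier(paraphraser(sentence, rng)) for _ in range(n))
    label, count = votes.most_common(1)[0]
    return label, count / n

# Toy stand-ins: the "paraphraser" shuffles words, and the classifier
# looks for a sentiment word, so meaning is (crudely) preserved.
def toy_paraphraser(sentence, rng):
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

toy_classifier = lambda s: int("great" in s.split())
label, p_hat = paraphrase_smoothed_predict(toy_classifier, toy_paraphraser, "a great film")
```

The hard open question is the certificate, not the vote: unlike Gaussian noise, two paraphrase distributions have no closed-form divergence, so the analysis would have to be empirical or rely on assumptions about the paraphrase model.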

To be continued…
