LLM的理论古往今来（持续更新ing...）_domain specialization as the key to make large lan-CSDN博客

本文链接：https://blog.csdn.net/PolarisRisingWar/article/details/131095096

诸神缄默不语-个人CSDN博文目录

要真说追是很难追上的，反正就……
作为一个笔记集锦。

文中的年份并不严格与内容对应，尤其在跨年的情况下。

数值推理、序列标注/信息抽取相关论文不写在本篇。
在我写prompt技巧的博文（prompt工程（持续更新ing…））中出现过的论文也不在本篇中再度赘述。

文章目录

2024年
2023年
2022年
2021年
2020年
2019年
2018年
2016年
参考资料

2024年

评估
1. (AAAI) Avoiding Data Contamination in Language Model Evaluation: Dynamic Test Construction with Latest Materials：LatestEval评估方法，利用最新文本避免数据污染
长文本
1. Soaring from 4K to 400K: Extending LLM’s Context with Activation Beacon
(印度理工学院) Large Language Models aren’t all that you need：这篇就是说在传统NER方法上叠加trick其实能超越LLM
(谷歌) LLM Augmented LLMs: Expanding Capabilities through Composition：LLM helps LLM。通过cross attention组合LLM

2023年

综述
1. （人大高瓴人工智能学院）《A Survey of Large Language Models》及其中文版《大语言模型综述》
  A Survey of Large Language Models
  https://github.com/RUCAIBox/LLMSurvey/blob/main/assets/LLM_Survey_Chinese_0418.pdf
2. Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey
3. On the Origin of LLMs: An Evolutionary Tree and Graph for 15,821 Large Language Models
4. (华为) Aligning Large Language Models with Human: A Survey
5. A Survey on Evaluation of Large Language Models
6. A Survey on Multimodal Large Language Models
7. A Comprehensive Overview of Large Language Models
8. Igniting Language Intelligence: The Hitchhiker’s Guide From Chain-of-Thought Reasoning to Language Agents
9. The Efficiency Spectrum of Large Language Models: An Algorithmic Survey
10. Data Management For Large Language Models: A Survey
11. (ACM Computing Surveys) Survey of Hallucination in Natural Language Generation
集锦
1. Open LLM Leaderboard - a Hugging Face Space by HuggingFaceH4
教程
1. (ACL) Retrieval-based Language Models and Applications
The Curse of Recursion: Training on Generated Data Makes Models Forget：（第一个版本的标题比较劲爆）大意就是说用LLM生成的数据再训练LLM会使LLM效果越来越烂
Intelligence Primer
长文本
1. Blockwise Parallel Transformer (BPT)：Blockwise Parallel Transformer for Long Context Large Models
2. The Impact of Positional Encoding on Length Generalization in Transformers：比较了不同PE的长度泛化效果
3. （微软）LongNet: Scaling Transformers to 1,000,000,000 Tokens：线性计算复杂性，dilated attention（注意力随距离增加指数下降）
4. LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models
5. 综述：Advancing Transformer Architecture in Long-Context Large Language Models: A Comprehensive Survey
6. (2023 TACL) Lost in the Middle: How Language Models Use Long Contexts
(2023 ACL) Sen2Pro: A Probabilistic Perspective to Sentence Embedding from Pre-trained Language Model：将句子表征为向量空间的probability density distribution。这个方法不用retrain，可以直接插到LLM上
为什么现在的LLM都是Decoder only的架构？ - 知乎
transformer中的attention为什么scaled? - 知乎
模型蒸馏
1. Knowledge Distillation of Large Language Models
模型量化
1. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
ChatGPT大家族系列
1. 黑客 George Hotz 爆料 GPT-4 由 8 个 MoE 模型组成，真的吗？ - 知乎：听OpenAI讲，这就是大力出奇迹！
2. Let’s Verify Step by Step
ICL
1. A Survey on In-context Learning
2. Pre-Training to Learn in Context
3. (ACL Findings) Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers
  1. 讲解博文：清北微软深挖GPT，把上下文学习整明白了！和微调基本一致，只是参数没变而已
4. (EMNLP) Label Words are Anchors: An Information Flow Perspective for Understanding In-Context Learning
微调
1. Symbol tuning improves in-context learning in language models
2. 优化器
  1. (微软) DeepSpeed
  2. DeepSpeedExamples/applications/DeepSpeed-Chat at master · microsoft/DeepSpeedExamples
3. MiniChain
4. (ICLR) Self-Consistency Improves Chain of Thought Reasoning in Language Models
  从语言模型中对不同的输出集（利用采样生成多种不同的推理路径）进行采样，并返回集合中最一致的答案→其实是集成方法，代替CoT的贪心策略
  ReadPaper AI讲解论文：https://readpaper.com/paper/666189129860624384
5. RLHF
  1. (ICLR) Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization
6. CoT论文列表：https://github.com/Timothyxxx/Chain-of-ThoughtsPapers
7. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning：prompt集合
8. (ACL) Self-Instruct: Aligning Language Models with Self-Generated Instructions：bootstraping扩展指令
9. Exploring Format Consistency for Instruction Tuning：数据集间指令格式统一
10. Instruction Tuning for Large Language Models: A Survey
11. Tied-Lora: Enhacing parameter efficiency of LoRA with weight tying
12. (EACL) DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation
13. Full Parameter Fine-tuning for Large Language Models with Limited Resources：LOMO方法
检索+生成
1. (ICLR) Generate rather than Retrieve: Large Language Models are Strong Context Generators：认为以前的工作都是先检索后阅读，本文先生成上下文文档，再阅读生成的文档，生成最终答案
  基于聚类的prompt方法：选择不同的prompt，以使生成的文档涵盖不同的视角，从而提高recall
  ReadPaper AI讲解论文：https://readpaper.com/paper/4670624394054221825
2. Retrieval meets Long Context Large Language Models
3. Model-enhanced Vector Index
4. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models：思路是如果检索到不相关的信息可能会误导LLM，此外LLM本身可能不知道它自身与知识库能提供的知识够不够回答问题。本文提出CON方法来提高RAG的鲁棒性：为检索到的文档生成连续阅读笔记，评估其与问题的相关性，整合生成答案
5. Learning to Filter Context for Retrieval-Augmented Generation
6. (苹果) Context Tuning for Retrieval Augmented Generation
7. Large Language Models for Information Retrieval: A Survey
8. RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models：高质量语料，包含LLM+RAG的自动回复，标注了幻觉强度等
9. (ICLR) Quantifying Memorization Across Neural Language Models
(ICLR) APE
Large Language Models Are Human-Level Prompt Engineers：总之就是用LLM自己写prompt
(普林斯顿) InstructEval: Systematic Evaluation of Instruction Selection Methods：评估prompt
参考阅读博文：放弃评测大模型，普林斯顿大学已经开始评估Prompt了，提出Prompt评估框架
TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks
参考阅读博文：GPT-4使用效果不好？美国奥本大学提出Prompt分类法，另辟蹊径构建Prompt设计指南
推荐系统
1. Is ChatGPT a Good Recommender? A Preliminary Study
2. 新闻推荐
  1. (SIGIR) Prompt4NR
    Prompt Learning for News Recommendation
LLM+KG
1. KGQA
  1. KAPING
    Knowledge-Augmented Language Model Prompting for Zero-Shot Knowledge Graph Question Answering
可靠性¹
1. Measuring Faithfulness in Chain-of-Thought Reasoning：衡量CoT推理的忠实性
  发现有时模型会忽略生成的推理，就硬是照着自己想的来说（模型越大越容易发生这种事）
  解决方案是选择合适的模型
2. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
  给出了解决方案：将问题分解为子问题，并分别解答
3. (Meta) Head-to-Tail: How Knowledgeable are Large Language Models (LLM)? A.K.A. Will LLMs Replace Knowledge Graphs?：认为LLM不能替代LM，因为它不可靠，而且这不可靠很难优化（就常规优化LLM的方法很难解决不可靠问题本身）。
4. 幻觉问题的综述：
5. Fine-tuning Language Models for Factuality：评估文本事实真实性，微调语言模型使其产生较少的事实性错误
评测
(微软) A Survey on Evaluation of Large Language Models
(天大) Evaluating Large Language Models: A Comprehensive Survey
GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Toolformer
Toolformer: Language Models Can Teach Themselves to Use Tools：LLM自学使用小工具
Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Dataset Augmented by ChatGPT
Large Language Models as Optimizers：用自然语言描述优化任务（这个提示也可以优化），让LLM自己优化，一步一步迭代
agent
1. A Survey on Large Language Model based Autonomous Agents
2. The Rise and Potential of Large Language Model Based Agents: A Survey
Detecting Pretraining Data from Large Language Models：检测某段文本是否被LLM用于预训练
(Meta) The ART of LLM Refinement: Ask, Refine, and Trust
可解释性：(Anthropic) Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
模型加速：Accelerating Generative AI with PyTorch II: GPT, Fast | PyTorch
LM-Cocktail: Resilient Tuning of Language Models via Model Merging：这篇的思路是微调模型会出现灾难性遗忘问题，所以本文将base模型和微调模型的参数加权平均，发现效果很好。但是相关fuse工作这样的结论早就有了，supermario（阿里）的就是直接求平均
Contrastive Chain-of-Thought Prompting：idea被scoop原来就是这样的感觉吗（喃喃）
Ghostbuster: Detecting Text Ghostwritten by Large Language Models：检索LLM生成的文本
Scalable Extraction of Training Data from (Production) Language Models：鼓励ChatGPT泄漏训练数据
Prompt Engineering a Prompt Engineer
(微软) Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine：这一篇是用提示工程在专业领域上解决任务的
SparQ Attention: Bandwidth-Efficient LLM Inference：推理加速
Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
Why “classic” Transformers are shallow and how to make them go deep
MoDS: Model-oriented Data Selection for Instruction Tuning
Are Emergent Abilities of Large Language Models a Mirage?
Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models
Improving Text Embeddings with Large Language Models

2022年

prompt
1. (ACM Computing Surveys) Re33：读论文 Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Languag
2. (CHI LBW) PromptChainer: Chaining Large Language Model Prompts through Visual Programming：这个是将多个LLM链接（也就是先过一个LLM，然后再过一个LLM……etc）的可视化编程工具
3. LangChain
4. LLamaIndex
预训练 / 微调策略
1. (ICLR) LoRA: Low-Rank Adaptation of Large Language Models
2. RLHF / instruction following / 指令对齐
  (NeurIPS) Training language models to follow instructions with human feedback
  官方博文：Aligning language models to follow instructions
3. Scaling Instruction-Finetuned Language Models
4. instruction-based finetuning / instruction tuning / FLAN：通过instruction（可以理解成“简介”之类的）描述一堆数据集，LM在上面微调，然后LM就能在别的用instruction描述的数据集上面的表现效果也有所提升。就提升LM的零样本学习效果（拿来做实验用的是LaMDA模型）
  论文：(ICLR 谷歌) Finetuned Language Models Are Zero-Shot Learners
检索
1. (DeepMind) RETRO
  Improving language models by retrieving from trillions of tokens：检索文档chunk，调节AR语言模型。
  用BERT将token数据库嵌入chunk中
  chunked cross-attention mechanism
涌现能力：(TMLR) Emergent Abilities of Large Language Models

2021年

prefix-tuning
1. (ACL) Prefix-Tuning: Optimizing Continuous Prompts for Generation
  1. 官方讲解视频：Underline | Prefix-Tuning: Optimizing Continuous Prompts for Generation
LoRA: Low-Rank Adaptation of Large Language Models
1. GitHub：microsoft/LoRA: Code for loralib, an implementation of “LoRA: Low-Rank Adaptation of Large Language Models”
prompt
1. thunlp/OpenPrompt: An Open-Source Framework for Prompt-Learning.：支持transformers模型
2. (ACL) Making Pre-trained Language Models Better Few-shot Learners
  参考学习博文：【综述】Prompting: 更好地将语言模型应用到NLP任务
(EACL) Language Models as Knowledge Bases: On Entity Representations, Storage Capacity, and Paraphrased Queries
关注LLM中的知识如何表示，以及LLM的知识存量（和LLM规模基本线性增长）
(EMNLP Findings) Retrieval Augmentation Reduces Hallucination in Conversation：用检索解决幻觉问题

2020年

(ACL) Re26：读论文 Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks：大致是说LLM在通用域上预训练之后，在领域和任务上再继续预训练会提升模型效果
Adapter
1. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
2. K-Adapter: Infusing Knowledge into Pre-Trained Models with Adapters：解决注入知识的灾难性遗忘问题，用adapter实现
RAG
1. (NeurIPS) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Scaling Laws for Neural Language Models

2019年

Adapter：怎么说呢感觉就是在LLM里新加了一个模块，然后只调这个模块，不调全局参数。
这样使得微调效率upup
1. 论文：(ICML) Parameter-Efficient Transfer Learning for NLP adapters
2. 介绍博文：NLP中的Adapters是什么？ | Finisky Garden
3. AdapterHub - 572 adapters for 76 text tasks and 50 languages
LAnguage Model Analysis (LAMA)
1. 论文：(2019 EMNLP) Language Models as Knowledge Bases?：探查语言模型中的关系知识。具体的解决方案就是将问题作为填空题给出，让模型计算mask位置的结果。结果是LLM做得还不错（我感觉如果用新的LLM可能会效果更好）
(NAACL) Linguistic Knowledge and Transferability of Contextual Representations：对比了BERT / ELMo / GPT（文中叫OpenAI transformer），主要探讨了语言模型编码了什么特征和不同层、不同预训练任务的transferibility问题

（这篇论文的实验也太多了，就算以现在的目光来看也只能感慨真是有钱有时间！）
参考讲解博文：ELMo/GPT/BERT对比 - 知乎
(ACL 香侬科技) Is Word Segmentation Necessary for Deep Learning of Chinese Representations?：得到了字几乎总是优于词的结论（苏神认为这篇的实验不足，这篇实验对比字嵌入和词嵌入都用得是随机初始化的嵌入矩阵，这导致以词为单位的模型嵌入层参数更多，容易过拟合。但事实上一般用的都是预训练好的词向量，但这篇文章里没有关注到这个场景。苏神后来做的WoBERT认为以词为单位其实效果更好）
(ACL) Barack’s Wife Hillary: Using Knowledge Graphs for Fact-Aware Language Modeling