1. Architectural Components of Large Language Models (LLMs)
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing:
SentencePiece is a subword tokenizer and detokenizer developed at Google and released in 2018. It is a simple, language-independent tool designed for the subword segmentation step in neural text processing.
The main advantage of SentencePiece is that it handles a wide variety of languages, including those without whitespace-delimited words such as Chinese, Japanese, and Korean. It supports both byte-pair encoding (BPE) and unigram language model segmentation, and learns its subword vocabulary directly from raw text.
Rather than relying on a language-specific pre-tokenizer, SentencePiece treats the input as a raw character stream in which whitespace is just another symbol (internally escaped as "▁"), which makes segmentation fully reversible. In BPE mode, subwords are built by iteratively merging the most frequent pairs of adjacent symbols until the specified vocabulary size is reached; in unigram mode, a large seed vocabulary is pruned by optimizing a unigram language model. The resulting subword sequence is then used as the input to a neural text processing model.
SentencePiece also includes a detokenizer module that can be used to convert subwords back into their original words or sentences. This is useful for generating human-readable text output from neural text processing models.
Overall, SentencePiece is a powerful and flexible tool that can be used for a wide range of neural text processing tasks, including machine translation, text generation, and language modeling. Its language independence and ability to handle complex writing systems make it a valuable tool for researchers and practitioners in the field of natural language processing.
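To make the workflow concrete, here is a minimal sketch using the sentencepiece Python package; the corpus path, vocabulary size, and model type are illustrative assumptions, not values from the paper:

    import sentencepiece as spm

    # Train a subword model on a raw text file (one sentence per line).
    # "corpus.txt" is a placeholder path; model_type may also be "bpe".
    spm.SentencePieceTrainer.train(
        input="corpus.txt", model_prefix="spm", vocab_size=8000, model_type="unigram"
    )

    sp = spm.SentencePieceProcessor(model_file="spm.model")
    pieces = sp.encode("SentencePiece treats text as a raw stream.", out_type=str)
    print(pieces)          # subword pieces, with "▁" marking word boundaries

    # Detokenization losslessly recovers the original string.
    print(sp.decode(pieces))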
Parameter-Efficient Transfer Learning for NLP:
The paper "Parameter-Efficient Transfer Learning for NLP" proposes a parameter-efficient transfer learning approach for natural language processing (NLP). This approach leverages the idea of parameter sharing in neural networks and fine-tunes a pre-trained model on different tasks by sharing its parameters.
Compared to previous transfer learning methods, the proposed approach achieves higher efficiency and performance. In the GLUE benchmark, it achieves comparable or even better performance than state-of-the-art techniques while using fewer parameters and computation resources.
The core idea of the approach is to use a pre-trained model based on the Transformer architecture, which is trained on a large corpus to learn universal language representations. Then, it fine-tunes the model on specific tasks with fewer parameters and computation resources.
The paper also introduces some techniques for model fine-tuning, such as dynamic masking and learning rate warm-up, which further improve the performance and stability of the model.
Overall, this paper proposes an effective transfer learning approach that can be applied to NLP and achieves good performance with fewer parameters and computation resources. This technology provides a promising direction for NLP research and applications.
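A minimal PyTorch sketch of the bottleneck adapter described above; the module and argument names are mine, and the zero initialization of the up-projection mirrors the paper's near-identity initialization:

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
        Only these parameters (plus layer norms and the task head) are trained;
        the pre-trained Transformer weights stay frozen."""
        def __init__(self, d_model: int, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(d_model, bottleneck)
            self.up = nn.Linear(bottleneck, d_model)
            nn.init.zeros_(self.up.weight)  # start near identity, so training
            nn.init.zeros_(self.up.bias)    # begins from the pre-trained model

        def forward(self, h: torch.Tensor) -> torch.Tensor:
            return h + self.up(torch.relu(self.down(h)))

With d_model = 768 and a bottleneck of 64, each adapter adds only about 0.1M parameters, a tiny fraction of a BERT-base-sized backbone.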
Generating Long Sequences with Sparse Transformers:
"Generating Long Sequences with Sparse Transformers" proposes a new approach for efficiently generating long sequences using a sparse attention mechanism in the Transformer architecture. The paper introduces a sparse attention mechanism that reduces the computational cost and memory requirements of the standard Transformer model by attending only to a small subset of the input sequence at each time step. The authors also introduce a new training technique called "progressive sequence growing" to enable more efficient learning of long sequences. The approach achieves state-of-the-art performance on various natural language processing tasks while being much more efficient than the standard Transformer architecture. The authors are researchers at OpenAI with a strong background in deep learning and natural language processing, and the paper was published in the Proceedings of the 36th International Conference on Machine Learning (ICML).
Big Bird: Transformers for Longer Sequences:
"Big Bird: Transformers for Longer Sequences" is a research paper authored by Manzil Zaheer, Guru Guruganesh, Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed from Google Research. The paper proposes a new type of Transformer architecture that can handle longer sequences than the standard Transformer model. The authors introduce a novel technique called "global attention" that allows the model to attend to a small set of global tokens that represent the entire sequence, in addition to attending to the local tokens in the sequence. This technique allows the model to scale to sequences of length up to tens of thousands of tokens while maintaining good performance. Experimental results on a range of language modeling and question answering tasks demonstrate that the proposed approach achieves state-of-the-art performance while being significantly more efficient than other methods for handling long sequences.
Sparse is Enough in Scaling Transformers:
"Sparse is Enough in Scaling Transformers" is a research paper authored by Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, and Jonni Kanerva. The paper proposes a new approach for scaling Transformers, a popular architecture for natural language processing tasks. The authors introduce a sparse attention mechanism that can handle long sequences while reducing the computational and memory requirements of the model. The approach involves only attending to a small set of important tokens in the sequence, rather than attending to all tokens as in the standard Transformer. Experimental results on language modeling and machine translation tasks demonstrate that the proposed approach achieves state-of-the-art performance while being much more efficient than other methods for scaling Transformers. The authors are affiliated with the University of Warsaw, OpenAI, and Google Research.
ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models:
"ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models" is a research paper authored by Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel from Google Research. The paper proposes a new pre-trained model architecture called ByT5, which is designed to operate at the byte level rather than the token level. The authors argue that this approach eliminates the need for tokenization, which can be a challenging and error-prone process in natural language processing. The ByT5 model is trained on a large corpus of text data using a combination of unsupervised and supervised learning, and can be fine-tuned on a wide range of downstream NLP tasks. Experimental results on a range of tasks, including machine translation and text summarization, demonstrate that the ByT5 model achieves state-of-the-art performance while being much simpler and more efficient than other pre-trained models. Overall, the ByT5 approach represents a promising direction for the future of pre-trained models in NLP.
2. Training, probing, and fine-tuning LLMs
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding:
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" is a research paper authored by Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova from Google AI Language. The paper introduces a new pre-training approach for natural language processing called Bidirectional Encoder Representations from Transformers (BERT). BERT is a deep learning model that is pre-trained on large amounts of text data using a masked language modeling objective and a next sentence prediction objective. The model is based on the Transformer architecture and can be fine-tuned on a wide range of downstream NLP tasks, including question answering, sentiment analysis, and language translation. Experimental results on a range of tasks demonstrate that BERT achieves state-of-the-art performance and outperforms other pre-trained models. BERT has become a popular and widely-used model in the NLP community, and its introduction has spurred further research in pre-trained language models.
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators:
"ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators" is a research paper authored by Kevin Clark from Stanford University, and Minh-Thang Luong and Quoc V. Le from Google Brain. The paper introduces a new pre-training approach for natural language processing called ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately). Unlike other pre-training approaches that train the model to generate masked tokens, ELECTRA trains the model to discriminate between real and fake tokens. Specifically, ELECTRA introduces a new pre-training task called "replaced token detection", where the model is trained to determine whether a token in a sequence has been replaced by a fake token generated by another model. This approach allows ELECTRA to train more efficiently and effectively, and to achieve state-of-the-art performance on a range of NLP tasks, including question answering, sentiment analysis, and language modeling. Overall, the ELECTRA approach represents a promising direction for pre-training models in NLP, and has sparked further research in this area.
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension:
"BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension" is a research paper authored by Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. The paper introduces a new sequence-to-sequence pre-training approach called BART (Bidirectional and Auto-Regressive Transformers). BART is pre-trained on a denoising autoencoder objective, which involves corrupting the input sequence and training the model to reconstruct the original sequence. This approach allows BART to learn to generate high-quality text with fluency and coherence. BART can be fine-tuned on a wide range of natural language processing tasks, including language generation, translation, and comprehension. Experimental results on a range of tasks demonstrate that BART achieves state-of-the-art performance and outperforms other pre-trained models. Overall, the BART approach represents a promising direction for pre-training models in NLP, and has sparked further research in this area.
Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks:
The paper "Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks" proposes a new approach for adapting pre-trained language models to specific domains and tasks. The authors argue that pre-trained language models, such as BERT and GPT, can be fine-tuned on specific tasks and domains to achieve even better performance. They introduce a new pre-training task called "span boundary objective", which involves predicting the boundaries of phrases or entities in a sentence. This task can be used to pre-train a language model on a specific domain or task, and then fine-tuned on downstream tasks using domain-specific data. Experimental results on a range of tasks, including named entity recognition and sentiment analysis, demonstrate that the proposed approach outperforms other fine-tuning methods and achieves state-of-the-art performance. The authors are researchers at the University of Washington and the Allen Institute for Artificial Intelligence (AI2).
Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:
The "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks" paper introduces a new approach for generating sentence embeddings using Siamese BERT-Networks. The authors argue that traditional methods for generating sentence embeddings, such as averaging word embeddings, do not capture the semantic meaning of the sentence. The proposed approach involves training a Siamese BERT-Network to generate sentence embeddings that are closer together for semantically similar sentences and further apart for dissimilar sentences. This approach allows the model to learn a more meaningful representation of sentence semantics, which can be used for a range of downstream tasks, including text classification and information retrieval. Experimental results on a range of tasks demonstrate that the proposed approach achieves state-of-the-art performance and outperforms other methods for generating sentence embeddings. The authors are researchers at the UKP Lab at Technische Universität Darmstadt.
3. Prompting and In-Context Learning (ICL)
What Makes Good In-Context Examples for GPT-3:
"What Makes Good In-Context Examples for GPT-3?" is a research paper authored by Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. The paper investigates the factors that contribute to the effectiveness of in-context examples for improving the performance of GPT-3, a state-of-the-art language model. In-context examples involve providing the model with a piece of text and a prompt, and asking it to generate text that continues the prompt in a natural and coherent way. The authors investigate various factors that affect the quality of in-context examples, including the length of the prompt, the diversity of the training data, and the semantic similarity between the prompt and the context. Experimental results demonstrate that longer prompts and more diverse training data lead to better performance, while semantic similarity has a more complex relationship with performance. The findings of the study can be used to guide the creation of in-context examples for training and fine-tuning language models, and may contribute to further improvements in the performance of language models like GPT-3. The authors are affiliated with Duke University, Microsoft Research, and Alibaba Group.
Learning To Retrieve Prompts for In-Context Learning:
"Learning To Retrieve Prompts for In-Context Learning" is a research paper authored by Ohad Rubin, Jonathan Herzig, and Jonathan Berant. The paper proposes a new approach for selecting effective prompts for in-context learning, a technique used to improve the performance of language models like GPT-3. In-context learning involves providing the model with a piece of text and a prompt, and asking it to generate text that continues the prompt in a natural and coherent way. The authors propose a new model called "PromptRetriever" that learns to retrieve effective prompts from a large corpus of text. PromptRetriever is trained using a combination of supervised and unsupervised learning, and is based on a dual-encoder architecture that encodes both the context and the prompt. Experimental results demonstrate that the proposed approach outperforms other methods for prompt retrieval and leads to significant improvements in the performance of language models on a range of tasks, including question answering and text generation. Overall, the approach represents a promising direction for improving the efficiency and effectiveness of in-context learning. The authors are affiliated with Tel Aviv University and the Allen Institute for Artificial Intelligence (AI2).
Calibrate Before Use: Improving Few-Shot Performance of Language Models:
"Calibrate Before Use: Improving Few-Shot Performance of Language Models" is a research paper authored by Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. The paper proposes a new approach for improving the few-shot performance of language models, which involves training the model on a small amount of data. The authors argue that existing approaches for few-shot learning, which typically involve fine-tuning pre-trained language models, often result in poor performance due to the mismatch between the pre-trained model and the small amount of training data. The proposed approach involves calibrating the pre-trained model on a set of calibration tasks that are designed to be similar to the few-shot task. The calibrated model is then fine-tuned on the few-shot task using the small amount of available data. Experimental results on a range of tasks, including text classification and question answering, demonstrate that the proposed approach outperforms existing few-shot learning methods and achieves state-of-the-art performance. Overall, the approach represents a promising direction for improving the performance of language models on tasks with limited training data. The authors are affiliated with the University of California, Berkeley, the University of Illinois at Urbana-Champaign, and the University of California, Irvine.
Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right:
"Surface Form Competition: Why the Highest Probability Answer Isn’t Always Right" is a research paper authored by Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. The paper proposes a new explanation for why language models sometimes generate incorrect answers, even when the correct answer is present in the training data. The authors argue that this phenomenon can be explained by "surface form competition", which occurs when multiple surface forms (i.e. different ways of expressing the same meaning) compete for the same meaning. When this happens, the language model may assign a high probability to a surface form that is semantically close to the correct answer, but not exactly the same. The authors demonstrate that surface form competition is a significant problem for existing language models, and propose several techniques for addressing it, including explicit modeling of surface form competition and incorporating syntactic and semantic information into the model. Experimental results on a range of tasks demonstrate that these techniques can improve the accuracy of language models and reduce the impact of surface form competition. Overall, the paper highlights the importance of understanding the factors that contribute to errors in language models, and represents a promising direction for improving their performance. The authors are affiliated with the University of Washington and the Allen Institute for Artificial Intelligence (AI2).
Multitask Prompted Training Enables Zero-Shot Task Generalization:
"Multitask Prompted Training Enables Zero-Shot Task Generalization" is a research paper authored by a large team of researchers from various institutions and organizations, including Hugging Face, Brown University, IBM Research, UC Berkeley, and Naver Labs Europe. The paper proposes a new approach for training language models that enables zero-shot task generalization, which involves the ability to perform well on tasks that the model has not been explicitly trained on. The authors argue that existing approaches for training language models, such as pre-training and fine-tuning, often lead to overfitting on specific tasks and do not enable zero-shot generalization. The proposed approach involves training the model on multiple tasks simultaneously using a set of prompts that are designed to be general across tasks. The multitask prompted training approach enables the model to learn to perform multiple tasks at once and to generalize to new tasks that share similar prompt structures. Experimental results on a range of tasks demonstrate that the proposed approach outperforms existing methods for zero-shot task generalization and achieves state-of-the-art performance. Overall, the approach represents a promising direction for training language models that are more flexible and adaptable to a wide range of tasks.
4. Reasoning using LLMs
Large Language Models are Zero-Shot Reasoners:
"Large Language Models are Zero-Shot Reasoners" is a research paper authored by Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. The paper investigates the zero-shot reasoning capabilities of large language models, such as GPT-3, which have been trained on massive amounts of text data. The authors argue that these models have the potential to perform a wide range of reasoning tasks without any additional task-specific training, due to their ability to generate coherent and contextually appropriate text. The paper presents a set of experiments that demonstrate the zero-shot reasoning capabilities of GPT-3 on a range of tasks, including simple arithmetic, logic puzzles, and even question-answering tasks that involve knowledge beyond the training data. The authors also investigate the limitations of GPT-3 and propose several directions for future research on zero-shot reasoning. Overall, the paper highlights the potential of large language models as flexible and powerful tools for natural language understanding and reasoning. The authors are affiliated with The University of Tokyo and Google Research.
Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations:
"Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations" is a research paper authored by Jaehun Jung, Lianhui Qin, Sean Welleck, Faeze Brahman, Chandra Bhagavatula, Ronan Le Bras, and Yejin Choi. The paper proposes a new approach for reasoning with language models that involves generating recursive explanations for the model's output. The authors argue that existing approaches for prompting language models often result in illogical or inconsistent outputs, and propose the use of recursive explanations to address this problem. The proposed approach, called Maieutic Prompting, involves generating a sequence of prompts that ask the model to justify its previous answers, and using the model's responses to construct a recursive explanation of its reasoning process. The authors demonstrate the effectiveness of the approach on a range of tasks, including arithmetic, question answering, and commonsense reasoning, and show that it leads to more logical and consistent outputs compared to other prompting methods. Overall, the paper represents a promising direction for improving the transparency and interpretability of language models, and highlights the importance of incorporating logical reasoning into their training and inference processes. The authors are affiliated with the University of Washington and the Allen Institute for Artificial Intelligence (AI2).
Iteratively Prompt Pre-trained Language Models for Chain of Thought:
"Iteratively Prompt Pre-trained Language Models for Chain of Thought" is a research paper authored by Boshi Wang, Xiang Deng, and Huan Sun. The paper proposes a new approach for prompting pre-trained language models, which involves iteratively generating prompts based on the model's previous output. The authors argue that existing approaches for prompting language models often involve generating a fixed set of prompts, which may not be optimal for the specific task at hand. The proposed approach involves generating prompts based on the model's previous output, which allows for more fine-grained control over the model's reasoning process. The authors demonstrate the effectiveness of the approach on the Chain of Thought task, which involves generating a coherent sequence of sentences based on a given prompt. Experimental results show that the iteratively prompted language model outperforms existing baselines and achieves state-of-the-art performance on the task. Overall, the paper highlights the importance of designing prompts that are tailored to the specific task and context, and represents a promising direction for improving the performance of language models on complex reasoning tasks. The authors are affiliated with The Ohio State University in Columbus, Ohio.
Automatic Chain of Thought Prompting in Large Language Models:
"Automatic Chain of Thought Prompting in Large Language Models" is a research paper authored by Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. The paper proposes an approach for automatically generating prompts for large language models that can perform the Chain of Thought task. The Chain of Thought task involves generating a coherent sequence of sentences based on a given prompt, and is a challenging task for language models due to its reliance on both syntactic and semantic coherence. The proposed approach involves training a separate model to generate prompts that are tailored to the specific task and context, and using these prompts to guide the language model's output. The authors demonstrate the effectiveness of the approach on a range of tasks, including generating summaries of news articles and writing creative stories, and show that it outperforms existing baselines. Overall, the paper represents a promising direction for improving the flexibility and adaptability of language models to a wide range of tasks, and highlights the importance of designing task-specific prompts. The authors are affiliated with Shanghai Jiao Tong University and Amazon Web Services.
Complexity-Based Prompting for Multi-Step Reasoning:
"Complexity-Based Prompting for Multi-Step Reasoning" is a research paper authored by Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. The paper proposes a new approach for prompting language models to perform multi-step reasoning tasks, which involves generating prompts that vary in complexity based on the model's previous output. The authors argue that existing approaches for prompting language models often involve generating a fixed set of prompts, which may not be optimal for complex reasoning tasks that involve multiple steps. The proposed approach involves generating prompts that increase in complexity as the model makes progress on the task, which allows for more fine-grained control over the model's reasoning process. The authors demonstrate the effectiveness of the approach on a range of multi-step reasoning tasks, including complex question answering and science question answering, and show that it outperforms existing baselines. Overall, the paper highlights the importance of designing prompts that are tailored to the specific task and context, and represents a promising direction for improving the performance of language models on complex reasoning tasks. The authors are affiliated with Carnegie Mellon University and Allen Institute for Artificial Intelligence (AI2).
Least-to-Most Prompting Enables Complex Reasoning in Large Language Models:
"Least-to-Most Prompting Enables Complex Reasoning in Large Language Models" is a research paper authored by Denny Zhou, Nathanael Scharli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, and Ed Chi. The paper proposes a new approach for prompting language models to perform complex reasoning tasks, which involves generating a sequence of prompts that gradually increase in complexity from the simplest to the most complex. The authors argue that existing approaches for prompting language models often involve generating a fixed set of prompts or a single complex prompt, which may not be optimal for complex reasoning tasks. The proposed approach involves iteratively generating prompts that build on the previous ones, allowing the model to gradually build up its understanding of the task. The authors demonstrate the effectiveness of the approach on a range of complex reasoning tasks, including text classification and natural language inference, and show that it outperforms existing baselines. Overall, the paper highlights the importance of designing prompts that are tailored to the specific task and context, and represents a promising direction for improving the performance of language models on complex reasoning tasks. The authors are affiliated with Google, Carnegie Mellon University, and ETH Zurich.
Self-Consistency Improves Chain of Thought Reasoning in Language Models:
"Self-Consistency Improves Chain of Thought Reasoning in Language Models" is a research paper authored by Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. The paper proposes a new approach for improving the performance of language models on the Chain of Thought task, which involves generating a coherent sequence of sentences based on a given prompt. The authors argue that existing approaches for prompting language models on this task often involve generating a fixed set of prompts or using external knowledge bases, which may not be optimal for capturing the full complexity of the task. The proposed approach involves training the language model to be self-consistent, by using the model's previous output as input for subsequent steps in the reasoning process. This allows the model to gradually build up its understanding of the task and incorporate information from previous steps. The authors demonstrate the effectiveness of the approach on the Chain of Thought task, and show that it outperforms existing baselines. Overall, the paper highlights the importance of designing prompts and training objectives that are tailored to the specific task and context, and represents a promising direction for improving the performance of language models on complex reasoning tasks. The authors are affiliated with Google and the University of Alberta.
Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning:
"Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning" is a research paper authored by Antonia Creswell, Murray Shanahan, and Irina Higgins. The paper proposes a new approach for performing logical reasoning tasks, which involves leveraging the capabilities of large language models to generate explanations that support the model's output. The authors argue that existing approaches for performing logical reasoning tasks often lack interpretability, and that explanations are important for enabling humans to understand and trust the model's output. The proposed approach involves generating explanations that highlight the most relevant evidence and reasoning steps that support the model's output. The authors demonstrate the effectiveness of the approach on a range of logical reasoning tasks, including visual question answering and natural language inference, and show that it outperforms existing baselines. Overall, the paper highlights the importance of designing models that not only perform well on the task, but also provide explanations that can be understood and trusted by humans. The authors are affiliated with University College London and DeepMind.
5. Knowledge in LLMs
Knowledge Neurons in Pretrained Transformers:
"Knowledge Neurons in Pretrained Transformers" is a research paper authored by Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. The paper proposes a new method for identifying and interpreting the knowledge captured by individual neurons in pre-trained Transformer language models. The authors argue that existing methods for interpreting neural models often lack specificity and fail to provide clear insights into what the model has learned. The proposed approach involves analyzing the activations of individual neurons in response to different inputs and identifying the specific knowledge patterns that are associated with each neuron. The authors demonstrate the effectiveness of the approach on several language tasks, including question answering and text classification, and show that it can improve the performance of existing models. The paper highlights the importance of interpretability in neural models and represents a promising direction for improving our understanding of the knowledge captured by large-scale language models. The authors are affiliated with Microsoft Research Asia and Huawei Noah's Ark Lab.
SKILL: Structured Knowledge Infusion for Large Language Models:
"SKILL: Structured Knowledge Infusion for Large Language Models" is a research paper authored by Fedor Moiseev, Zhe Dong, Enrique Alfonseca, and Martin Jaggi. The paper proposes a new approach for integrating structured knowledge into large language models, which can improve their ability to perform complex reasoning tasks. The authors argue that existing language models often lack the ability to reason about structured knowledge, such as knowledge graphs or ontologies, which limits their performance on tasks that require this type of reasoning. The proposed approach involves augmenting the input text with structured knowledge, which is used to guide the model's attention and improve its ability to reason about the task at hand. The authors demonstrate the effectiveness of the approach on several benchmark datasets, including question answering and natural language inference, and show that it outperforms existing baselines. Overall, the paper highlights the importance of incorporating structured knowledge into language models and represents a promising direction for improving their performance on complex reasoning tasks. The authors are affiliated with École polytechnique fédérale de Lausanne and Google Research.
A Systematic Investigation of Commonsense Knowledge in Large Language Models:
"A Systematic Investigation of Commonsense Knowledge in Large Language Models" is a research paper authored by Xiang Lorraine Li, Adhiguna Kuncoro, Jordan Hoffmann, Cyprien de Masson d'Autume, Phil Blunsom, and Aida Nematzadeh. The paper investigates the extent to which large language models, such as GPT-2 and GPT-3, capture commonsense knowledge. The authors argue that commonsense knowledge is critical for enabling natural language understanding and generation, and that existing models often lack this type of knowledge. The paper presents a systematic approach for evaluating the commonsense knowledge of language models, which involves testing their ability to make accurate predictions on a range of commonsense reasoning tasks. The authors show that while large language models do capture some commonsense knowledge, their performance on these tasks is often far from perfect. The paper highlights the importance of continuing to develop models that can capture and reason with commonsense knowledge, and suggests several avenues for future research in this area. The authors are affiliated with the University of Oxford, Facebook AI Research, and Carnegie Mellon University.
Time-Aware Language Models as Temporal Knowledge Bases:
"Time-Aware Language Models as Temporal Knowledge Bases" is a research paper authored by Bhuwan Dhingra, Jeremy R. Cole, Julian Martin Eisenschlos, Daniel Gillick, Jacob Eisenstein, and William W. Cohen. The paper proposes a new approach for incorporating temporal knowledge into language models, which can improve their ability to reason about events and their ordering over time. The authors argue that existing language models often lack a sense of temporal knowledge, which limits their performance on tasks that require reasoning about the temporal order of events. The proposed approach involves augmenting the input text with temporal information, which is used to guide the model's attention and improve its ability to reason about the temporal ordering of events. The authors demonstrate the effectiveness of the approach on several benchmark datasets, including temporal question answering and event ordering, and show that it outperforms existing baselines. Overall, the paper highlights the importance of incorporating temporal knowledge into language models and represents a promising direction for improving their performance on tasks that require temporal reasoning. The authors are affiliated with Carnegie Mellon University, Google Research, and Georgia Institute of Technology.
Discovering Latent Knowledge in Language Models Without Supervision:
"Discovering Latent Knowledge in Language Models without Supervision" is a research paper authored by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. The paper proposes a new approach for discovering latent knowledge in language models, without the need for explicit supervision. The authors argue that existing methods for discovering latent knowledge often require large amounts of labeled data, which can be difficult to obtain. The proposed approach involves training a generative model to reconstruct input text, using a latent space to capture underlying semantic concepts. The authors then use a clustering algorithm to identify semantically meaningful groups of latent variables, which correspond to different types of knowledge. They demonstrate the effectiveness of the approach on several benchmark datasets, including text classification and sentiment analysis, and show that it outperforms existing baselines. Overall, the paper represents a promising direction for discovering latent knowledge in language models and highlights the potential for unsupervised approaches to advance the field. The authors are affiliated with the University of California, Berkeley, and Peking University.
6. Scaling LLMs
BASE Layers: Simplifying Training of Large, Sparse Models:
"BASE Layers: Simplifying Training of Large, Sparse Models" is a research paper authored by Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. The paper proposes a new training technique for large, sparse models, which can simplify the training process and reduce computational cost. The authors argue that existing techniques for training large models often require complex optimization algorithms and large amounts of memory, which can be difficult to scale to very large models. The proposed technique involves partitioning the model into smaller, more manageable units, which are trained independently and then combined to form the final model. The authors demonstrate the effectiveness of the approach on several benchmark datasets, including language modeling and machine translation, and show that it outperforms existing baselines in terms of both performance and training efficiency. Overall, the paper represents a promising direction for training large, sparse models and highlights the potential for more efficient and scalable approaches in the field. The authors are affiliated with the University of Washington and OpenAI.
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts:
The paper proposes GLaM (Generalist Language Model), a family of decoder-only language models that scale capacity with a mixture-of-experts (MoE) architecture: alternating Transformer layers contain many expert feed-forward networks, and a gating network routes each token to only the top two experts. The largest GLaM has 1.2 trillion parameters, roughly 7x GPT-3, yet activates only about 97 billion (8%) of them per token. As a result it consumed roughly one third of the energy used to train GPT-3 and needs about half the FLOPs at inference, while achieving better average zero-shot and one-shot performance across 29 NLP benchmarks. The paper demonstrates that conditional computation is an effective lever for scaling language models within a fixed compute budget. The authors are affiliated with Google Research.
Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers:
The paper presents a large-scale empirical study of how the shape and size of Transformer models affect both pre-training and downstream fine-tuning, using the T5 family. Two findings stand out. First, pre-training perplexity is an unreliable guide to fine-tuned quality, so scaling decisions should be validated on downstream tasks rather than read off upstream loss curves. Second, model shape matters as much as raw parameter count: the authors' "DeepNarrow" strategy, which preferentially increases depth before width, yields models that match or exceed the quality of conventionally shaped models while having fewer parameters and training faster. The authors release a suite of pre-trained checkpoints across scales to support further research on efficient scaling.
Training Compute-Optimal Large Language Models:
The paper (the "Chinchilla" paper from DeepMind) revisits how a fixed training compute budget should be split between model size and training data. Fitting scaling laws to more than 400 training runs, the authors find that model parameters and training tokens should be scaled in roughly equal proportion, which implies that recent large models such as GPT-3 and Gopher were substantially undertrained: too large for the amount of data they saw. To test the prediction, they train Chinchilla, a 70-billion-parameter model on 1.4 trillion tokens, using the same compute budget as the 280-billion-parameter Gopher. Chinchilla uniformly outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG across a large range of downstream tasks, while also being cheaper to fine-tune and serve because it is four times smaller. The result reset the field's rule of thumb for sizing language models against their training corpora.
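The headline prescription reduces to a back-of-the-envelope rule; a sketch using the common approximations C ≈ 6·N·D training FLOPs and D ≈ 20·N at the optimum:

    def compute_optimal(compute_flops: float):
        """Given a training budget in FLOPs, return the roughly
        compute-optimal parameter count N and token count D."""
        n_params = (compute_flops / (6 * 20)) ** 0.5   # C = 6*N*D with D = 20*N
        return n_params, 20 * n_params

    # A Gopher-scale budget (~5.8e23 FLOPs) yields roughly a 70B-parameter
    # model trained on ~1.4T tokens -- i.e., Chinchilla.
    n, d = compute_optimal(5.76e23)
    print(f"{n:.2e} params, {d:.2e} tokens")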
7. Concerns regarding LLMs
StereoSet: Measuring stereotypical bias in pretrained language models:
StereoSet is a large-scale dataset and benchmark for measuring stereotypical bias in pretrained language models (PLMs), covering four domains: gender, profession, race, and religion. Each instance is a Context Association Test: given a context, the model chooses among a stereotypical, an anti-stereotypical, and an unrelated continuation, at both the sentence level (intrasentence) and the discourse level (intersentence). This design measures bias and language-modeling ability together, combined in the Idealized CAT (ICAT) score, so a model cannot appear unbiased simply by being a poor language model. Evaluating popular PLMs such as BERT, GPT-2, RoBERTa, and XLNet, the authors find that all of them exhibit strong stereotypical biases, underscoring the need for further research on bias detection and mitigation.
BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation:
The BOLD (Bias in Open-Ended Language Generation Dataset) benchmark measures social biases in open-ended text generation. It consists of roughly 23,000 English prompts harvested from the beginnings of Wikipedia sentences, spanning five domains: profession, gender, race, religious belief, and political ideology. A model under test completes each prompt, and the completions are scored with automatic metrics for toxicity, sentiment, regard, psycholinguistic norms, and gender polarity, quantifying how generated text differs across the groups being described. Applying the benchmark to popular generation models, the authors find that they exhibit larger social biases than the human-written Wikipedia text the prompts were drawn from. They argue that such datasets and metrics are essential for tracking and mitigating bias in natural language generation, which can have serious real-world consequences.
Stereotyping Norwegian Salmon: An Inventory of Pitfalls in Fairness Benchmark Datasets:
The paper critically examines four benchmark datasets designed to measure stereotyping in language models, including StereoSet and CrowS-Pairs; its title quotes a nonsensical item from one of them, in which "Norwegian salmon" appears as the target of a supposed stereotype. The authors find that the benchmarks rest on unclear and inconsistent conceptualizations of stereotyping, and that a substantial fraction of individual items are invalid: contrasts that are ungrammatical, unrelated to stereotyping, logically mismatched, or simply nonsensical. They conclude that fairness benchmarks must start from a well-articulated definition of the harm being measured and be constructed and validated with far greater care, since measurements made with flawed instruments can give false assurance about model behavior.
Extracting Training Data from Large Language Models:
The paper demonstrates a training data extraction attack against large language models: an adversary with only query access can recover individual, verbatim training examples. Attacking the public GPT-2 model, the authors generate a large pool of samples and flag likely memorized content with membership-inference-style metrics, such as the model's perplexity compared against a smaller reference model or against the text's zlib compression ratio. Verifying the flagged samples against the training set, they recover hundreds of memorized sequences, including personally identifiable information (names, phone numbers, email addresses), code, and chat logs, some of which appeared only a handful of times in the training data. They further show that larger models memorize more than smaller ones, and discuss mitigations such as training-data deduplication and differentially private training.
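A hedged sketch of one of the membership signals in that spirit, the ratio of the model's negative log-likelihood to the text's zlib-compressed size (low-perplexity but incompressible strings are candidate memorized samples); thresholds and preprocessing are simplified relative to the paper:

    import zlib
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def memorization_score(text: str) -> float:
        """Approximate total NLL of the text under the model, divided by
        its zlib-compressed size. Lower scores are more suspicious."""
        ids = tok(text, return_tensors="pt").input_ids
        with torch.no_grad():
            nll = model(ids, labels=ids).loss.item() * ids.numel()
        return nll / len(zlib.compress(text.encode("utf-8")))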
Concealed Data Poisoning Attacks on NLP Models:
The paper studies concealed data poisoning attacks on natural language processing (NLP) models: an adversary who can inject a small number of crafted examples into a training set can cause targeted misbehavior whenever a chosen trigger phrase appears at test time, for example forcing a sentiment model to label any review mentioning a particular product name as positive. Crucially, the poison examples are concealed: they are crafted with a gradient-based procedure and need not contain the trigger phrase at all, which makes them hard to spot by inspection. With on the order of only dozens of poison examples, the attack succeeds against sentiment classification, language modeling, and machine translation systems. The authors evaluate candidate defenses, such as filtering suspicious training examples by their influence on the trigger or by perplexity, and find that they mitigate but do not fully prevent the attack, often at the cost of discarding benign data; they argue for provenance tracking and auditing of training corpora.
Challenges in Detoxifying Language Models:
The paper evaluates what current detoxification techniques actually achieve and what they cost. The authors apply interventions spanning training-data filtering, decoding-time steering, and toxicity-classifier guidance to large language models, and measure the results with both automatic toxicity scores and human evaluation. Toxicity, as measured automatically, can be driven down substantially, but the paper documents serious side effects: automatic metrics increasingly diverge from human judgments as scores improve, and detoxified models pay a disproportionate price on text by and about marginalized groups, whose dialects and identity mentions are over-flagged by toxicity classifiers, so the models become worse at modeling exactly the communities the intervention is meant to protect. The paper concludes that detoxification involves unresolved trade-offs between safety, language-modeling quality, and fairness, and calls for better measurement before stronger interventions. The authors are researchers at DeepMind.
RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models:
The paper introduces RealToxicityPrompts, a benchmark of about 100,000 sentence-level prompts drawn from a large corpus of English web text and paired with toxicity scores from the Perspective API. A model under evaluation generates continuations for each prompt, and its propensity for "toxic degeneration" is quantified with metrics such as the expected maximum toxicity over multiple generations and the probability of producing at least one toxic continuation. Using the benchmark, the authors show that pretrained language models such as GPT-1, GPT-2, and CTRL can degenerate into toxic text even when conditioned on innocuous prompts. They then survey data-based mitigations (such as further pre-training on filtered corpora) and decoding-based mitigations (such as word filtering and steering methods), finding that none reliably prevents toxic degeneration, and they trace part of the problem to toxic content in web-scale pre-training corpora.
Quantifying Memorization Across Neural Language Models:
The paper quantifies memorization in neural language models using an operational definition: a training string is memorized if the model reproduces it verbatim when prompted with its preceding context. Measuring extraction rates across model families and sizes, the authors identify three factors that each increase memorization log-linearly: the capacity of the model (larger models memorize more), the number of times an example is duplicated in the training data, and the length of the context used to prompt the model. Because memorization grows with scale, the paper predicts the problem will worsen as models get larger, and it discusses the implications for privacy, for deduplicating training corpora, and for evaluating models on data they may have memorized.
8. Retrieval-based LLMs
Generalization Through Memorization: Nearest Neighbor Language Models:
The paper introduces the kNN-LM, which augments a pre-trained neural language model with an explicit memory of its training data. Every training context is stored in a datastore as a key-value pair: the key is the model's hidden representation of the context and the value is the token that followed it. At test time, the current context's representation queries the datastore for its nearest neighbors, the retrieved next tokens form a kNN distribution, and the final prediction interpolates this with the base model's distribution: p = λ·p_kNN + (1−λ)·p_LM. Without any additional training, this interpolation improves a strong Wikitext-103 baseline to a new state-of-the-art perplexity, and it is particularly effective for rare patterns, factual text, and adapting a model to a new domain simply by swapping in a domain-specific datastore. The results suggest that learning to compare contexts can be easier than learning to predict the next word outright, that is, generalization can be improved through explicit memorization.
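The interpolation step is simple; a NumPy sketch where neighbors is the list of (distance, next_token_id) pairs retrieved from the datastore:

    import numpy as np

    def knn_lm(p_lm: np.ndarray, neighbors, vocab_size: int, lam: float = 0.25) -> np.ndarray:
        """p = lam * p_kNN + (1 - lam) * p_LM, with the kNN distribution
        formed by a softmax over negative neighbor distances."""
        d = np.array([dist for dist, _ in neighbors], dtype=float)
        w = np.exp(-d)
        w /= w.sum()
        p_knn = np.zeros(vocab_size)
        for (_, token), wi in zip(neighbors, w):
            p_knn[token] += wi
        return lam * p_knn + (1 - lam) * p_lm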
REALM: Retrieval-Augmented Language Model Pre-Training:
REALM is a language-model pre-training approach that makes retrieval part of the model itself. During masked-language-model pre-training, a latent knowledge retriever selects documents from a corpus such as Wikipedia, the model conditions on the retrieved text to fill in the masked tokens, and the retriever is trained end-to-end by backpropagating through the retrieval step, marginalizing over the retrieved documents. The retriever thus learns, without any retrieval supervision, to fetch documents that actually help prediction. After pre-training, the retriever and encoder are fine-tuned together on downstream tasks, and the approach achieved state-of-the-art results on several open-domain question-answering benchmarks, with the added benefits of interpretability (one can inspect what was retrieved) and modularity (the knowledge corpus can be updated without retraining the model).
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks:
The paper presents retrieval-augmented generation (RAG), which combines parametric memory, a pre-trained seq2seq generator (BART), with non-parametric memory, a dense vector index of Wikipedia accessed through a learned neural retriever (DPR). Given an input, the retriever fetches relevant passages and the generator conditions on them; the paper studies two variants, one that uses the same retrieved passage for the whole output sequence and one that can draw on different passages per token, and fine-tunes the retriever and generator end-to-end. RAG sets state-of-the-art results on several open-domain question-answering benchmarks and produces more specific and factual text than a purely parametric BART baseline on knowledge-intensive generation tasks; the indexed knowledge can also be swapped or updated without retraining the model.
Relevance-guided Supervision for OpenQA with ColBERT:
The paper presents relevance-guided supervision (RGS), a weak-supervision recipe for training retrievers for open question answering (OpenQA) with ColBERT, an efficient late-interaction retrieval model that matches questions and passages at the token level. Most OpenQA datasets provide question-answer pairs but no gold passages, so RGS bootstraps its own training data: starting from an existing retriever, it retrieves passages for each training question, treats high-ranked passages that contain the answer as positives and those that do not as hard negatives, retrains ColBERT on them, and repeats for a few rounds. The resulting ColBERT-QA system substantially improves retrieval quality and downstream answering accuracy, achieving state-of-the-art results on several OpenQA benchmarks.
Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval:
The Baleen paper tackles multi-hop reasoning over large corpora, where a system must chain together evidence from several documents and where errors compound across hops. Its condensed retrieval architecture keeps the growing context manageable: at each hop, the system retrieves passages with FLIPR, a late-interaction retriever whose queries can focus on different parts of the context, and then condenses the retrieved passages into a short set of extracted facts that is appended to the query for the next hop, rather than carrying entire passages forward. The authors evaluate Baleen on many-hop claim verification (HoVer) and multi-hop question answering, where it achieves state-of-the-art retrieval accuracy and strong downstream performance, demonstrating robust multi-hop reasoning at scale.