Work log for June 26 and 27, 2024

Article link: https://blog.csdn.net/Tankoldbang/article/details/139984293

Data cleaning for model pretraining

The data cleaning in baby_llama2_chinese can be sped up with multithreading.
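A minimal sketch of the idea, not code from baby_llama2_chinese. The cleaner `clean_line` and the helper `clean_lines_parallel` are hypothetical names made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def clean_line(line: str) -> str:
    """Hypothetical per-line cleaner: collapse whitespace runs and trim the ends."""
    return " ".join(line.split())


def clean_lines_parallel(lines: Iterable[str], max_workers: int = 4) -> List[str]:
    # pool.map preserves input order, so result[i] corresponds to lines[i].
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(clean_line, lines))


cleaned = clean_lines_parallel(["  hello   world ", "foo\t bar"])
```

Because of the GIL, threads mainly help when the work is I/O-bound (reading and writing files). If the cleaning itself is CPU-bound pure Python, `ProcessPoolExecutor` offers the same `map` interface and is likely the better fit.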

ujson reads large files much faster than json.
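A quick way to check this on your own data, as a sketch. The payload is synthetic and `time_loads` is a hypothetical helper. The numbers depend on the machine and the file, so no expected timings are given.

```python
import json
import time

try:
    import ujson as fast_json  # third-party package: pip install ujson
except ImportError:
    fast_json = json  # fallback so the sketch still runs without ujson


def time_loads(loads, payload: str, repeat: int = 5) -> float:
    """Average wall-clock seconds for one call of loads(payload)."""
    start = time.perf_counter()
    for _ in range(repeat):
        loads(payload)
    return (time.perf_counter() - start) / repeat


# Synthetic payload standing in for a large JSON corpus file.
payload = json.dumps([{"id": i, "text": "样例文本" * 10} for i in range(2000)])

print("json :", time_loads(json.loads, payload))
print("fast :", time_loads(fast_json.loads, payload))
```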

Studying the GitHub project ChatLM-mini-Chinese

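charent/ChatLM-mini-Chinese: a 0.2B Chinese dialogue model (ChatLM-Chinese-0.2B). All code for the full pipeline is open source: dataset sources, data cleaning, tokenizer training, model pretraining, SFT instruction fine-tuning, RLHF optimization, and so on. It supports SFT fine-tuning for downstream tasks and includes a triple information extraction fine-tuning example. (github.com)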

This project does not use mask-prediction pretraining; it uses text-to-text pretraining.

Basic processing

remove_duplicate_punctuation

Removes repeated punctuation marks and repeated spaces from a sentence, and turns line breaks into the special character '\n'.
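A rough regex sketch of the collapsing step, not the repo's implementation. The punctuation set `PUNCT` and the name `collapse_repeats` are assumptions. This version only collapses runs of the same character, and it skips the line-break handling because these notes do not pin down the exact representation.

```python
import re

# Assumed punctuation set, for illustration only.
PUNCT = "!?,.;:！？，。；：、…"
_DUP_PUNCT = re.compile("([" + re.escape(PUNCT) + r"])\1+")
_DUP_SPACE = re.compile(r" {2,}")


def collapse_repeats(sentence: str) -> str:
    """Collapse runs of one identical punctuation mark, and runs of spaces, to one."""
    sentence = _DUP_PUNCT.sub(r"\1", sentence)
    return _DUP_SPACE.sub(" ", sentence)


result = collapse_repeats("你好！！！  世界。。")
```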

convert_en_punctuation_to_zh_punct

Replaces English punctuation in a sentence with Chinese punctuation.
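One simple way to do this kind of replacement is a translation table. This is a sketch, not the repo's code. The mapping `EN_TO_ZH` is an assumed, incomplete table and `en_punct_to_zh` is a hypothetical name.

```python
# Assumed mapping for illustration; a real table would cover more marks.
EN_TO_ZH = str.maketrans({
    ",": "，",
    ".": "。",
    "?": "？",
    "!": "！",
    ":": "：",
    ";": "；",
    "(": "（",
    ")": "）",
})


def en_punct_to_zh(sentence: str) -> str:
    """Replace each mapped English punctuation mark with its Chinese counterpart."""
    return sentence.translate(EN_TO_ZH)


converted = en_punct_to_zh("你好,世界!")
```

A blind table like this also rewrites the dot in numbers such as 3.14, so real cleaning code would probably need to guard against that.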

get_sentences_dice_similarity

Computes the Dice similarity of two sentences.

s(a, b) = 2 * len( set(a) & set(b) ) / (len(set(a)) + len(set(b)))
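The formula maps directly to Python. Calling `set()` on a string yields its individual characters, which suits Chinese text. The name `dice_similarity` and the choice to return 0.0 for two empty strings are my assumptions.

```python
def dice_similarity(a: str, b: str) -> float:
    """Character-level Dice similarity: 2 * |A & B| / (|A| + |B|)."""
    set_a, set_b = set(a), set(b)
    total = len(set_a) + len(set_b)
    if total == 0:
        return 0.0  # assumption: two empty strings count as not similar
    return 2 * len(set_a & set_b) / total


score = dice_similarity("abc", "bcd")  # shared characters: b, c
```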

The data cleaning code is in utils/raw_data_process.py.

Web data processing function: process_web_text

Filters by like count and by a minimum character count.

Baidu Baike processing: process_bake_qa

Filters by a minimum character count.

It also compares the Dice similarity of title and desc.

Testing shows that, for Chinese sentences, this amounts to first splitting the sentence into individual characters.
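A rough sketch of the filtering described above, not the repo's code. The field names title and desc come from these notes. Everything else is an assumption: the name `build_prompt`, applying the length limit to the answer, the values of `min_len` and `dup_threshold`, and dropping desc when it nearly repeats title.

```python
from typing import Optional


def dice(a: str, b: str) -> float:
    """Character-level Dice similarity, 0.0 when both strings are empty."""
    sa, sb = set(a), set(b)
    return 2 * len(sa & sb) / (len(sa) + len(sb)) if (sa or sb) else 0.0


def build_prompt(title: str, desc: str, answer: str,
                 min_len: int = 15, dup_threshold: float = 0.9) -> Optional[str]:
    # Assumption: the minimum character count is applied to the answer.
    if len(answer) < min_len:
        return None
    # Assumption: a desc that nearly repeats the title adds nothing, so drop it.
    if desc and dice(title, desc) < dup_threshold:
        return title + desc
    return title
```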

MinHash deduplication

MinHash-LSH fuzzy hash deduplication: how to deduplicate large-scale data for medical LLMs? (CSDN blog)
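To make the idea concrete, here is a small pure-Python MinHash sketch using only hashlib. It is not taken from the linked post. The shingle size, the number of hash functions, and all function names are assumptions. It only estimates Jaccard similarity between two texts and leaves out the LSH banding step, which real pipelines would more likely take from a library such as datasketch.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Character k-grams; a text shorter than k yields itself as one shingle."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def minhash_signature(text: str, num_perm: int = 64, k: int = 3) -> List[int]:
    grams = shingles(text, k)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc = "今天天气很好我们去公园散步"
same = estimated_jaccard(minhash_signature(doc), minhash_signature(doc))
other = estimated_jaccard(minhash_signature(doc), minhash_signature("abcdefghijklmnop"))
```

Documents whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates.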

Evaluating LLM pretraining data quality: three metrics, namely perplexity, error L2-norm, and memorization

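Another look at evaluating LLM pretraining data quality: a comparative analysis of three metrics, perplexity, error L2-norm, and memorization - BAAI Community (baai.ac.cn)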

The article does not say how the pruning is actually done.

Looking at the original text:

The primary goal of choosing a pruning algorithm ξ is to identify a systematic method for selecting a subset of training instances from the large-scale dataset D. This selected subset, denoted as Pξ, is chosen based on computed pruning scores for each data point. The goal is to ensure that when a language model is trained on this pruned subset, denoted as Dˆξ, the model's performance on a specific task (represented by Pτ) is not significantly diminished.

In other words, the pruning algorithm aims to efficiently reduce the size of the training dataset while retaining as much of the original performance as possible. This is crucial for optimizing the training process, as working with a smaller, high-quality subset can lead to faster training times and more efficient use of computational resources.

Restated: the pruning algorithm ξ scores every data point in D and selects a subset by those scores, so that a language model trained on the pruned set loses no significant performance on the target task. The payoff is a smaller, high-quality dataset, which means faster training and better use of compute.
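The passage only says that the subset is chosen from per-example pruning scores, so here is one generic way that could look. Which part of the score ranking to keep is a design choice. The window parameters `low` and `high` and the name `prune_by_score` are assumptions, not the paper's method.

```python
from typing import List, Sequence


def prune_by_score(examples: Sequence[str], scores: Sequence[float],
                   low: float = 0.0, high: float = 0.5) -> List[str]:
    """Keep examples whose rank by ascending score falls in the [low, high) fraction."""
    order = sorted(range(len(examples)), key=lambda i: scores[i])
    start, stop = int(low * len(order)), int(high * len(order))
    keep = sorted(order[start:stop])  # restore the original document order
    return [examples[i] for i in keep]


kept = prune_by_score(["a", "b", "c", "d"], [4.0, 1.0, 3.0, 2.0])
```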

[LLM] Data processing and filtering methods for SFT and pretraining, by 山顶夕景 (csdn.net)

This post mentions computing perplexity over the pretraining data with a base model in order to deduplicate it.
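For reference, perplexity is the exponential of the negative mean token log-probability. The sketch below assumes the per-token natural-log probabilities have already been obtained from some base model; how to get them is outside this snippet, and the function name is hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp(-mean(log p)) over the tokens of one document."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that gives every token probability 0.5 has perplexity 2.
ppl = perplexity([math.log(0.5)] * 4)
```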

CC_Cleaner: a smooth, efficient, and easily extensible data cleaning pipeline (high-flyer.cn)

There is also ROUGE-L.
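As a reminder of what ROUGE-L measures: it scores two texts by the length of their longest common subsequence (LCS). Below is a character-level sketch. The default `beta=1.0` (a plain F1) is an assumption, and how CC_Cleaner actually uses the metric is not covered in these notes.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed with a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: str, reference: str, beta: float = 1.0) -> float:
    """LCS-based F-measure between a candidate and a reference string."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


score = rouge_l("abcd", "abxd")  # LCS is "abd", so precision = recall = 3/4
```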

End-to-end reproduction of a GitHub project

jiangnanboy/llm_corpus_quality: Chinese corpus cleaning and quality evaluation for large model pretraining (Large model pre-training corpus cleaning) (github.com)

It is written in Java, and its summary is very thorough.

Just came across another topic: incremental pretraining.