探索LESS:有针对性的教学调整中的重要数据选择

探索LESS:有针对性的教学调整中的重要数据选择

LESSICML 2024: Less: Selecting Influential Data for Targeted Instruction Tuning项目地址:https://gitcode.com/gh_mirrors/less/LESS

🚀 预印本论文

LESS是一个创新的开源项目,它提供了从大量数据中选择影响力数据以诱导特定能力的方法。这个项目的核心是优化机器学习模型的训练过程,通过智能地选择一小部分关键数据,以实现更高效和精确的指令调优。

快速入门链接 🔗

安装要求 📦

首先确保已安装PyTorch。然后按照以下步骤安装LESS所需的所有依赖:

  1. 运行pip3 install torch==2.1.2 torchvision torchaudio安装基础库。
  2. 进入LESS项目目录并运行pip install -r requirement.txt安装其他依赖。
  3. 使用pip install -e .以开发模式安装less包。

数据准备 📁

遵循open-instruct的说明准备四个指令调优数据集:Flan v2、COT、Dolly和Open Assistant。用于评估时,我们还使用了MMLU、Tydiqa和BBH三个额外的数据集。这些处理过的文件可在此处获取:huggingface/datasets/princeton-nlp/less_data

数据选择管道 💡

数据选择分为四步:

  1. 预热训练:使用LoRA方法对少量数据进行初步训练。
  2. 构建梯度数据库:收集所有训练数据的梯度信息。
  3. 任务特定数据选择:计算每个训练数据点的影响得分,并选择高影响力数据。
  4. 选定数据训练:使用精选数据进一步微调模型。

每一步都有详尽的示例脚本指导,包括如何执行预热训练、创建梯度数据库以及进行数据选取和模型训练。

评估 📊

evaluation文件夹内查看详细的评价指南,了解如何评估模型在选定数据上的表现。

问题与反馈 ❓

有任何问题或遇到bug,请联系Mengzhou (mengzhou@princeton.edu) 或者直接在项目仓库中打开新问题。

引用 📝

若你在工作中受益于该项目,请引用以下论文:

@article{xia2024less,
  title={Less: Selecting Influential Data for Instruction Tuning},
  author={Xia, Mengzhou and Malladi, Sadhika and Gururangan, Suchin and Arora, Sanjeev and Chen, Danqi},
  year={2024}
}

LESS项目是一个强大的工具,不仅有助于提升模型的性能,还能有效减少训练时间和资源消耗。无论是学术研究还是工业应用,它都是一个值得尝试的优秀解决方案。立即加入,体验影响力数据选择的魅力吧!

LESSICML 2024: Less: Selecting Influential Data for Targeted Instruction Tuning项目地址:https://gitcode.com/gh_mirrors/less/LESS

3.4 Feature Engineering Feature engineering is the process of selecting and transforming raw data into features that can be used by a machine learning algorithm. In our case, we used various NLP techniques to extract features from the GeoNames data. We first extracted the name, feature class, and feature code of each GeoNames record. We then used a part-of-speech (POS) tagger to identify the parts of speech of each word in the name field. We also used a named entity recognizer (NER) to identify the entities in the name field, such as countries, cities, and rivers. We then created several new features based on the extracted information. For example, we created a feature that indicated whether the record was a country or not. We also created features that indicated the number of words in the name field, the number of entities in the name field, and the average length of the words in the name field. In addition to the NLP-based features, we also created several other features. For example, we created a feature that indicated the distance of each record from the equator, as this is known to be a strong predictor of climate and vegetation patterns. We also created features that indicated the population density and area of each record. Finally, we used a feature selection algorithm to select the most important features for our machine learning algorithm. We used a random forest classifier, which is a type of ensemble learning algorithm that combines multiple decision trees to improve performance. We found that the most important features were the feature class, distance from the equator, population density, and number of entities in the name field. Overall, our feature engineering process helped us to extract meaningful information from the raw GeoNames data and create features that were useful for our machine learning algorithm.
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包

打赏作者

宋溪普Gale

你的鼓励将是我创作的最大动力

¥1 ¥2 ¥4 ¥6 ¥10 ¥20
扫码支付:¥1
获取中
扫码支付

您的余额不足,请更换扫码支付或充值

打赏作者

实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值