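Suppose you have a dataset with several hundred thousand records and want to randomly draw a few tens of thousands of them to use as a test set.
Python implementation: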
import random
# Read all lines from train.txt into memory
with open("train.txt", "r", encoding="utf-8") as f:
    lines = f.readlines()
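# Randomly pick 20000 distinct lines (sampling without replacement); adjust the count as needed.
# random.sample raises ValueError if the file has fewer lines than requested.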
selected_lines = random.sample(lines, 20000)
# Write the selected lines to test.txt
with open("test.txt", "w", encoding="utf-8") as f:
    for line in selected_lines:
        f.write(line)
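
Note that the script above leaves the sampled lines in train.txt, so the test set overlaps the training data. If you want the two files to be disjoint, one possible variation is sketched below. The helper names `split_lines` and `split_file`, the `seed` parameter, and the output file name `train_rest.txt` are assumptions made for illustration and are not part of the original script. The sketch samples line indices instead of line contents, so duplicate lines are handled correctly, and it keeps both outputs in their original file order.

```python
import random


def split_lines(lines, k, seed=None):
    """Return (sampled, remaining); sampled holds k lines picked at random."""
    if k > len(lines):
        raise ValueError(f"cannot sample {k} lines from {len(lines)}")
    # A dedicated Random instance makes the split reproducible for a given seed.
    rng = random.Random(seed)
    # Sample positions, not contents, so identical lines stay distinct.
    chosen = set(rng.sample(range(len(lines)), k))
    sampled = [line for i, line in enumerate(lines) if i in chosen]
    remaining = [line for i, line in enumerate(lines) if i not in chosen]
    return sampled, remaining


def split_file(src_path, sample_path, rest_path, k, seed=None):
    """Write k random lines of src_path to sample_path and the rest to rest_path."""
    with open(src_path, "r", encoding="utf-8") as f:
        lines = f.readlines()
    sampled, remaining = split_lines(lines, k, seed)
    with open(sample_path, "w", encoding="utf-8") as f:
        f.writelines(sampled)
    with open(rest_path, "w", encoding="utf-8") as f:
        f.writelines(remaining)


# Example call (file names are hypothetical):
# split_file("train.txt", "test.txt", "train_rest.txt", k=20000, seed=42)
```

Writing the remainder to a separate file rather than overwriting train.txt keeps the original data intact in case the split has to be redone.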