Rule-based cleaning
Rules from the Gopher, C4, and FineWeb papers
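The Gopher-style repetition operators tabulated in this section are ratios computed from a document's lines and word n-grams. As a rough illustration only, the sketch below computes three of them and applies the corresponding thresholds. The function names, the whitespace tokenization, the choice to count only the second and later occurrences of a line as duplicates, and the choice to score an n-gram that occurs once as 0 are all assumptions of this sketch, not details taken from the papers.

```python
from collections import Counter


def duplicate_line_stats(text: str) -> tuple[float, float]:
    """Return (duplicate_line_fraction, duplicate_line_character_fraction).

    Assumption: a line is a duplicate if an identical line appeared earlier
    in the same document; blank lines are ignored.
    """
    lines = [ln for ln in text.split("\n") if ln.strip()]
    if not lines:
        return 0.0, 0.0
    seen = set()
    dup_lines = 0
    dup_chars = 0
    for ln in lines:
        if ln in seen:
            dup_lines += 1
            dup_chars += len(ln)
        else:
            seen.add(ln)
    total_chars = sum(len(ln) for ln in lines)
    return dup_lines / len(lines), dup_chars / total_chars


def top_ngram_character_fraction(text: str, n: int) -> float:
    """Share of the document's characters covered by its most frequent word n-gram.

    Assumptions: words are whitespace-separated, an n-gram seen only once
    scores 0, and the result is capped at 1.0 because overlapping
    occurrences are counted separately.
    """
    words = text.split()
    if len(words) < n:
        return 0.0
    counts = Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    ngram, count = counts.most_common(1)[0]
    if count < 2:
        return 0.0
    return min(1.0, len(ngram) * count / len(text))


# Thresholds copied from the table in this section.
LINE_FRACTION_MAX = 0.30
LINE_CHAR_FRACTION_MAX = 0.20
TOP_NGRAM_MAX = {2: 0.20, 3: 0.18, 4: 0.16}


def passes_repetition_rules(text: str) -> bool:
    """Hypothetical helper: True if the document satisfies every rule checked here."""
    line_frac, line_char_frac = duplicate_line_stats(text)
    if line_frac > LINE_FRACTION_MAX or line_char_frac > LINE_CHAR_FRACTION_MAX:
        return False
    return all(
        top_ngram_character_fraction(text, n) <= limit
        for n, limit in TOP_NGRAM_MAX.items()
    )
```

The paragraph-level operators could follow the same pattern as `duplicate_line_stats`, splitting on blank lines instead of single newlines.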
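| Operator name | Description | Source | Level | Filter condition |
| --- | --- | --- | --- | --- |
| duplicate_line_fraction | Fraction of duplicate lines | Gopher | Document | ≤ 0.30 |
| duplicate_paragraph_fraction | Fraction of duplicate paragraphs | Gopher | Document | ≤ 0.30 |
| duplicate_line_character_fraction | Fraction of characters in duplicate lines | Gopher | Line | ≤ 0.20 |
| duplicate_paragraph_character_fraction | Fraction of characters in duplicate paragraphs | Gopher | Paragraph | ≤ 0.20 |
| top_2-gram_character_fraction | Fraction of characters in the most frequent 2-gram | Gopher | Document | ≤ 0.20 |
| top_3-gram_character_fraction | Fraction of characters in the most frequent 3-gram | Gopher | Document | ≤ 0.18 |
| top_4-gram_character_fraction | Fraction of characters in the most frequent 4-gram | Gopher | Document | ≤ 0.16 |
| duplicate_5-gram_character_fraction | Fraction of characters in duplicated 5-grams | Gopher | Document | ≤ 0.15 |
| duplicate_6-gram_character_fraction | Fraction of characters in duplicated 6-grams | Gopher | Document | ≤ 0.14 |
| duplicate_7-gram_character_fraction | Fraction of characters in duplicated 7-grams | Gopher | Document | ≤ 0.13 |
| duplicate_8-gram_character_fraction | Fraction of characters in duplicated 8-grams | Gopher | Document | ≤ 0.12 |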
du