How Do You Crawl Pre-training Data for Large Models?

llmbe

Last modified 2025-03-02 19:14:40

First published 2025-03-02 14:29:21

Article link: https://blog.csdn.net/qq_33474415/article/details/145962253

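Chapter 1, Extended Edition: A Mining Map of Global Internet Data 🗺️

1.1 Data Sources Across All Domains (with Crawling Code Templates)

A. The public dataset supermarket (no crawler needed, download directly)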

https://www.kaggle.com/datasets

https://archive.ics.uci.edu/

https://datasetsearch.research.google.com

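# Government open data (worldwide)
GOV_OPEN_DATA = [
    'data.gov',         # United States
    'data.gov.uk',      # United Kingdom
    'data.gov.cn',      # China
    'data.europa.eu'    # European Union
]

# Academic datasets (companions to papers)
ACADEMIC_DATASETS = [
    ('Kaggle', 'https://www.kaggle.com/datasets', 'registration required; follow each license'),
    ('UCI Machine Learning', 'https://archive.ics.uci.edu/', 'direct download'),
    ('Google Dataset Search', 'https://datasetsearch.research.google.com/', 'aggregated search')
]

# Direct download example (Kaggle API)
# Importing kaggle authenticates immediately, so API credentials
# (e.g. ~/.kaggle/kaggle.json) must be configured beforehand.
import kaggle
kaggle.api.dataset_download_files('allen-institute-for-ai/CORD-19-research-challenge', path='./data', unzip=True)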

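B. Text mines (crawler required)

Knowledge sites (high-nutrition ingredients)
Wikipedia: https://www.wikipedia.org
Zhihu Columns: https://zhuanlan.zhihu.com
StackExchange: https://stackexchange.com/sites

News sites (highly time-sensitive)
New York Times archive (subscription required): https://archive.nytimes.com
Reuters (dynamic loading, anti-crawling): https://www.reuters.com
BBC News (regional restrictions): https://www.bbc.com/news

Forums (colloquial data)
Reddit (API required, follow the rules): https://www.reddit.com
Quora (strict anti-crawling): https://www.quora.com
Douban Groups (login and CAPTCHA required): https://www.douban.com/group

Crawler scripts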

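# Knowledge sites (high-nutrition ingredients)
KNOWLEDGE_SITES = [
    ('Wikipedia', 'https://www.wikipedia.org', 'CC license applies; API available'),
    ('Zhihu Columns', 'https://zhuanlan.zhihu.com', 'login required; strong anti-crawling'),
    ('StackExchange', 'https://stackexchange.com/sites', 'API rate-limited')
]

# News sites (highly time-sensitive)
NEWS_SITES = [
    ('New York Times archive', 'https://archive.nytimes.com', 'subscription required'),
    ('Reuters', 'https://www.reuters.com', 'dynamic loading, anti-crawling'),
    ('BBC News', 'https://www.bbc.com/news', 'regional restrictions')
]

# Forums (colloquial data)
FORUMS = [
    ('Reddit', 'https://www.reddit.com', 'API required; follow the rules'),
    ('Quora', 'https://www.quora.com', 'strict anti-crawling'),
    ('Douban Groups', 'https://www.douban.com/group', 'login and CAPTCHA required')
]

# Code example: Wikipedia crawler (the elegant way)
import wikipediaapi
wiki_wiki = wikipediaapi.Wikipedia(user_agent='MyProject (merons@example.com)', language='en')
page_py = wiki_wiki.page('Large_language_model')
if page_py.exists():  # skip writing when the page title does not resolve
    with open('wiki_llm.txt', 'w', encoding='utf-8') as f:
        f.write(page_py.text[:50000])  # cap the size of a single file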
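
The `[:50000]` slice above keeps the file small by discarding the rest of the article. If the full text is wanted, one option is to split it into several size-bounded pieces instead. The following is a minimal sketch, assuming line breaks are acceptable split points; `split_into_chunks` and `max_chars` are hypothetical names for illustration, not part of `wikipediaapi`.

```python
def split_into_chunks(text, max_chars=50000):
    """Split text into pieces of at most max_chars characters,
    preferring to break at line boundaries."""
    chunks = []
    current = ""
    for line in text.splitlines(keepends=True):
        # A single line longer than the limit is cut hard.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when this line would overflow the current one.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


# Small demonstration with a tiny limit so the behavior is easy to inspect.
sample = "para one\n" * 3 + "x" * 25
parts = split_into_chunks(sample, max_chars=20)
for i, part in enumerate(parts):
    print(i, len(part))
```

Each element of `parts` could then be written to its own file, for example `wiki_llm_0.txt`, `wiki_llm_1.txt`, and so on.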

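C. Hidden veins (special techniques required)

Dark web data (requires the Tor network; for academic research only)
OnionLand Search: http://3g2upl4pq6kufc4m.onion (a legacy 16-character v2 onion address; current Tor releases no longer resolve v2 addresses)
DarkSearch: https://darksearch.io/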

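Social media (compliant acquisition methods)
Twitter (developer API required): https://twitter.com
Weibo (premium membership, anti-crawling): https://weibo.com
Telegram public channels (message history must be parsed): https://t.me/s/publicarchives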

# Dark web data (requires the Tor network; for academic research only)
DARK_WEB = [
    ('OnionLand Search', 'http://3g2upl4pq6kufc4m.onion'),
    ('DarkSearch', 'https://darksearch.io/')
]

# Social media (compliant acquisition methods)
SOCIAL_MEDIA = [
    ('Twitter', 'https://twitter.com', 'developer API required'),
    ('Weibo', 'https://weibo.com', 'premium membership, anti-crawling'),
    ('Telegram public channels', 'https://t.me/s/publicarchives', 'message history must be parsed')
]

# Code example: compliant collection with Twitter API v2
import tweepy
client = tweepy.Client(bearer_token='YOUR_TOKEN')
tweets = client.search_recent_tweets(
    query="LLM -is:retweet",
    max_results=100,
    tweet_fields=['created_at', 'lang']
)
for tweet in tweets.data or []:  # data is None when nothing matches
    print(tweet.text)
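
For a training corpus the results usually need to be stored rather than printed. The sketch below turns tweet-like objects into JSON Lines and filters by language, using the `lang` and `created_at` fields requested above. It assumes each item exposes `id`, `text`, `lang`, and `created_at` attributes; `tweets_to_jsonl` and `keep_langs` are hypothetical names, and `SimpleNamespace` objects stand in for real API results so the example runs without a token.

```python
import json
from datetime import datetime, timezone
from types import SimpleNamespace


def tweets_to_jsonl(tweets, keep_langs=("en",)):
    """Serialize tweet-like objects to JSON Lines, keeping only chosen languages."""
    lines = []
    for t in tweets or []:  # tolerate None, as returned for empty results
        if getattr(t, "lang", None) not in keep_langs:
            continue
        created = getattr(t, "created_at", None)
        record = {
            "id": t.id,
            "text": t.text,
            "lang": t.lang,
            "created_at": created.isoformat() if created else None,
        }
        # ensure_ascii=False keeps non-Latin text readable in the output file.
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in data (assumption: mimics the attributes of real results).
sample_tweets = [
    SimpleNamespace(id=1, text="LLMs are fun", lang="en",
                    created_at=datetime(2025, 3, 2, tzinfo=timezone.utc)),
    SimpleNamespace(id=2, text="大模型很有趣", lang="zh", created_at=None),
]
jsonl = tweets_to_jsonl(sample_tweets)
print(jsonl)
```

With real results, `tweets_to_jsonl(tweets.data)` would produce text ready to append to a `.jsonl` file.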

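1.2 A Pitfall-Avoidance Handbook for Web Data Farmers 🛡️

Pitfall 1: Legal minefields (a warning at best, a lawyer's letter at worst)

# Checking robots.txt in Python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

def check_robots_permission(url, user_agent='*'):
    rp = RobotFileParser()
    # Resolve robots.txt against the site root, even when url has a path
    rp.set_url(urljoin(url, '/robots.txt'))
    rp.read()
    return rp.can_fetch(user_agent, url)

# Usage example
if check_robots_permission('https://www.zhihu.com'):
    print("Crawling allowed!")
else:
    print("No harvesting in this area! 🚫")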
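
To see how the rules are evaluated without touching the network, `RobotFileParser` can also be fed the file contents directly. The sketch below uses a made-up robots.txt (an assumption for illustration, not any real site's file) and additionally reads the `Crawl-delay` value, which a polite crawler can use as its minimum pause between requests. `MyCrawler` is a hypothetical user-agent name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse in memory instead of rp.read()

allowed_public = rp.can_fetch("MyCrawler", "https://example.com/articles/1")
allowed_private = rp.can_fetch("MyCrawler", "https://example.com/private/x")
delay = rp.crawl_delay("MyCrawler")  # None when no Crawl-delay is declared

print(allowed_public, allowed_private, delay)
```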

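Pitfall 2: The anti-crawling trio

# Countermeasure toolkit
import random
import time
from fake_useragent import UserAgent

# Disguise: random User-Agent
headers = {
    'User-Agent': UserAgent().random,
    'Accept-Language': 'en-US,en;q=0.9',
}

# Stealth: proxy IP pool
PROXY_POOL = [
    'http://203.0.113.1:8080',
    'socks5://user:pass@127.0.0.1:9050'  # with requests, SOCKS needs the requests[socks] extra
]
proxy = random.choice(PROXY_POOL)
proxies = {'http': proxy, 'https': proxy}  # cover HTTPS requests as well

# Time magic: random delay
time.sleep(random.uniform(1, 3))  # Important! Avoids triggering rate limits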
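
A fixed 1 to 3 second pause helps under normal conditions, but once a server starts answering with rate-limit errors a common complement is to wait progressively longer before each retry. The following is a minimal sketch of exponential backoff with full jitter; `backoff_delay`, `base`, and `cap` are hypothetical names and values chosen for illustration.

```python
import random


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The upper bound doubles with each attempt, is capped at `cap`,
    and the actual delay is drawn uniformly below that bound (full jitter).
    """
    upper = min(cap, base * (2 ** attempt))
    return rng.uniform(0, upper)


# A seeded generator makes the demonstration repeatable.
rng = random.Random(42)
delays = [backoff_delay(i, rng=rng) for i in range(5)]
for i, d in enumerate(delays):
    print(f"retry {i}: wait up to {min(60.0, 2 ** i):.0f}s, drew {d:.2f}s")
```

In a crawler loop, `time.sleep(backoff_delay(attempt))` would go between a failed request and its retry.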

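Pitfall 3: The dynamic-loading maze

# Selenium guide to breaking the maze (Chrome example)
import time

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")  # headless mode
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.reddit.com/r/MachineLearning/")
    time.sleep(5)  # wait for dynamic content to load
    # find_element_by_tag_name was removed in Selenium 4
    content = driver.find_element(By.TAG_NAME, 'body').text
finally:
    driver.quit()  # always release the browser process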
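
A fixed five-second sleep is either wasted time or not long enough. The usual alternative is to poll for a condition until a timeout expires, which is the idea behind Selenium's `WebDriverWait`. The sketch below shows that polling pattern in a library-agnostic form so it runs without a browser; `wait_until` and `page_ready` are hypothetical helpers, not Selenium functions.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for "has the page finished loading?": succeeds on the third check.
calls = {"n": 0}

def page_ready():
    calls["n"] += 1
    return "loaded" if calls["n"] >= 3 else None

outcome = wait_until(page_ready, timeout=2.0, interval=0.01)
print(outcome, "after", calls["n"], "checks")
```

With a real driver, the condition could be a lambda that looks for the element of interest and returns it once present.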

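Pitfall 4: Honeypot traps

# Honeypot detection (CSS-hidden traps)
import requests

url = 'https://example.com'  # placeholder: the page to inspect
page_html = requests.get(url, timeout=10).text
if 'display:none' in page_html or 'visibility:hidden' in page_html:
    print("Hidden trap found! There may be honeypot links!")
    # Pages containing many hidden links should be skipped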
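
The substring check above fires on any hidden element, including ordinary menus and modals. A somewhat more targeted approach is to look only at links that are themselves hidden. The sketch below uses the standard-library HTML parser and covers inline `style` attributes and the `hidden` attribute only; links hidden through CSS classes, external stylesheets, or hidden ancestors are not detected. `HiddenLinkFinder` and `find_hidden_links` are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class HiddenLinkFinder(HTMLParser):
    """Collect href values of <a> tags hidden via inline style or `hidden`."""

    def __init__(self):
        super().__init__()
        self.hidden_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # Normalize the inline style so "display: none" and "DISPLAY:none" match.
        style = (attr.get("style") or "").replace(" ", "").lower()
        is_hidden = (
            "display:none" in style
            or "visibility:hidden" in style
            or "hidden" in attr
        )
        if is_hidden and attr.get("href"):
            self.hidden_links.append(attr["href"])


def find_hidden_links(html):
    finder = HiddenLinkFinder()
    finder.feed(html)
    return finder.hidden_links


sample_html = """
<a href="/real">Real article</a>
<a href="/trap1" style="display: none">trap</a>
<a href="/trap2" style="VISIBILITY:hidden">trap</a>
<a href="/trap3" hidden>trap</a>
"""
traps = find_hidden_links(sample_html)
print(traps)
```

A crawler could exclude the returned URLs from its queue instead of skipping the whole page.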

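Pitfall 5: The data swamp (low-quality content)

# Quick quality-check function
import re

def is_high_quality(text):
    # List of rejection conditions
    bad_signs = [
        len(text) < 500,  # too short
        text.count(' ') < 50,  # low information density
        re.search(r'<script\b', text, re.IGNORECASE),  # contains a script tag
        text.isupper()  # all caps (False for text without cased letters)
    ]
    return not any(bad_signs)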
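
When tuning thresholds it helps to know which rule rejected a document, not just that it was rejected. The sketch below applies the same kinds of checks as `is_high_quality` but returns the names of the failed rules. `quality_issues` and its labels are hypothetical, and the space-count rule assumes space-delimited languages such as English, so it would reject Chinese text.

```python
import re


def quality_issues(text, min_chars=500, min_spaces=50):
    """Return the names of the quality rules that `text` fails (empty if none)."""
    issues = []
    if len(text) < min_chars:
        issues.append("too_short")
    if text.count(" ") < min_spaces:
        issues.append("low_density")
    if re.search(r"<script\b", text, re.IGNORECASE):
        issues.append("contains_script")
    if text.isupper():
        issues.append("all_caps")
    return issues


good = "The quick brown fox jumps over the lazy dog. " * 20
bad = "BUY NOW <script>alert(1)</script>"

print(quality_issues(good))
print(quality_issues(bad))
```

Counting the returned labels over a sample of crawled pages gives a quick picture of which rule dominates the rejections.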

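Appendix: Quick-reference table for data source compliance 📋

Site type	Recommended acquisition method	Legal risk level	Anti-crawling strength
Government open data	Direct download	⭐	🛡️
Academic papers	Official API / crawler	⭐⭐	🛡️🛡️
Social media	Official platform API	⭐⭐⭐	🛡️🛡️🛡️
News sites	RSS subscription / licensed purchase	⭐⭐	🛡️🛡️🛡️
Forums	Collection of user-authorized content	⭐⭐⭐⭐	🛡️🛡️🛡️🛡️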