Querying and downloading web page data from Common Crawl, and browsing the WARC archive locally with pywb

gaoshu883

Published 2025-02-12 12:50:07

Tags: python

Article link: https://blog.csdn.net/gaoshu883/article/details/145589184

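While I was watching "Deep Dive into LLMs like ChatGPT", the presenter mentioned FineWeb. Through the FineWeb project I learned about Common Crawl, and I wanted to see for myself what data Common Crawl actually stores. I am working on macOS.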

Open the Common Crawl Index Server and pick any index (I chose November/December 2022 Index). In the URL Pattern field, enter https://tiku.zujuan.com/ .

The search returns:

{
"urlkey": "com,zujuan,tiku)/",
"timestamp": "20221209232539",
"url": "https://tiku.zujuan.com/",
"mime": "text/html",
"mime-detected": "text/html",
"status": "200",
"digest": "BOE7JOHXZ3F6DIURLIIC4Q7UMARLSECE",
"length": "4054",
"offset": "592332064",
"filename": "crawl-data/CC-MAIN-2022-49/segments/1669446711552.8/warc/CC-MAIN-20221209213503-20221210003503-00842.warc.gz",
"languages": "zho",
"encoding": "UTF-8"
}
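The same lookup can also be scripted instead of going through the web form. The following is a minimal sketch, assuming the November/December 2022 index is served at the `CC-MAIN-2022-49-index` endpoint (the crawl ID is taken from the `filename` field above) and that the server returns one JSON object per line when `output=json` is requested. The function names are my own, not part of any Common Crawl library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: endpoint name derived from the crawl ID CC-MAIN-2022-49.
INDEX_ENDPOINT = "https://index.commoncrawl.org/CC-MAIN-2022-49-index"


def build_query_url(url_pattern, endpoint=INDEX_ENDPOINT):
    """Return the index query URL for a URL pattern, asking for JSON output."""
    return endpoint + "?" + urlencode({"url": url_pattern, "output": "json"})


def parse_index_response(body):
    """Parse a response body that holds one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def query_index(url_pattern):
    """Query the index server over the network and return the matching records."""
    with urlopen(build_query_url(url_pattern)) as resp:
        return parse_index_response(resp.read().decode("utf-8"))
```

Calling `query_index("https://tiku.zujuan.com/")` should return a list of dictionaries with the same keys as the record shown above (`filename`, `offset`, `length`, and so on). The server may throttle or reject requests, so treat the network call as best-effort.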

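Download the file from "https://data.commoncrawl.org/[data file path]", where the data file path is the value of the filename field. The xxx.warc.gz archive I downloaded was over 1 GB. After downloading it, I decompressed it with gzip.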
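Downloading the whole file is not strictly necessary if only one page is of interest. Each record in a Common Crawl WARC file is compressed as its own gzip member, and the `offset` and `length` fields of the index record say where that member sits. The sketch below, with hypothetical helper names, turns those fields into an HTTP `Range` request and decompresses the single record. It assumes the data server honors byte-range requests.

```python
import gzip
from urllib.request import Request, urlopen

DATA_PREFIX = "https://data.commoncrawl.org/"


def build_range_request(filename, offset, length):
    """Return (url, headers) selecting exactly one record of the WARC file."""
    start = int(offset)
    end = start + int(length) - 1  # HTTP byte ranges are inclusive on both ends
    return DATA_PREFIX + filename, {"Range": f"bytes={start}-{end}"}


def fetch_record(filename, offset, length):
    """Download one gzip member and return the decompressed WARC record bytes."""
    url, headers = build_range_request(filename, offset, length)
    with urlopen(Request(url, headers=headers)) as resp:
        return gzip.decompress(resp.read())
```

With the record above, `fetch_record(record["filename"], record["offset"], record["length"])` would transfer about 4 KB rather than more than 1 GB. The returned bytes begin with the WARC headers, followed by the HTTP response headers and the HTML body.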

Next, install and configure pywb locally to set up an environment for browsing WARC archives. The steps are:

# Install pywb
pip3 install pywb
# Create a working directory
mkdir common-crawl
cd common-crawl
# Initialize a new collection
wb-manager init tiku_collections

The directory structure under common-crawl is:

collections/
└── tiku_collections/
    ├── archive/
    │   └── (WARC files go here)
    ├── indexes/
    │   └── (CDXJ index files will be generated here)
    ├── static/
    │   └── (Static assets for the replay interface)
    └── templates/
        └── (Custom templates for the replay interface)

Move the xxx.warc file downloaded earlier into the collections/tiku_collections/archive/ directory. Then, from the common-crawl directory, run:

# Generate the index file
wb-manager index tiku_collections ./collections/tiku_collections/archive/CC-MAIN-20221209213503-20221210003503-00842.warc
# When this finishes, an index.cdxj file is created in ./collections/tiku_collections/indexes
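To confirm that the page of interest made it into the index, the generated file can be inspected directly. This is a rough sketch that assumes each line of index.cdxj has the layout `urlkey timestamp {json fields}` and that the JSON part carries a `url` key. Both function names are hypothetical.

```python
import json


def parse_cdxj_line(line):
    """Split one CDXJ line into (urlkey, timestamp, fields)."""
    urlkey, timestamp, json_part = line.strip().split(" ", 2)
    return urlkey, timestamp, json.loads(json_part)


def find_captures(index_path, url):
    """Return (timestamp, fields) for every index entry whose url matches exactly."""
    captures = []
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            _, timestamp, fields = parse_cdxj_line(line)
            if fields.get("url") == url:
                captures.append((timestamp, fields))
    return captures
```

For example, `find_captures("./collections/tiku_collections/indexes/index.cdxj", "https://tiku.zujuan.com/")` should list the capture timestamps available for that page. A full Common Crawl WARC file holds tens of thousands of records, so the scan may take a moment.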

Finally, from the common-crawl directory, run:

# Start the local server
wayback --port 8080

Visit http://localhost:8080/tiku_collections/ and search the collection for page snapshots.
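A snapshot can also be opened directly, without going through the search form, because pywb replay URLs follow the pattern `/<collection>/<timestamp>/<original URL>`. The helper below is a small illustrative sketch; the function name and the default host are assumptions matching the setup above.

```python
def replay_url(collection, timestamp, url, host="http://localhost:8080"):
    """Build the pywb replay URL for one capture of a page."""
    return f"{host}/{collection}/{timestamp}/{url}"


# The capture from the index record shown earlier in this post.
print(replay_url("tiku_collections", "20221209232539", "https://tiku.zujuan.com/"))
```

Pasting the printed address into the browser while `wayback` is running should load the archived page.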

The result after entering the URL and choosing a snapshot time: