Mirroring HTML Files Only

javaite

Category: heritrix3

Article link: https://blog.csdn.net/javaite/article/details/84366507

Suppose you want to mirror only HTML files and would like to save the crawled files in a file/directory format instead of saving them in WARC files.
First, create a job with a single seed, http://foo.org/bar/. Configure the warcWriter bean so that its class is org.archive.modules.writer.MirrorWriterProcessor. This processor stores files in a directory structure that matches the crawled URIs, under the crawl job's mirror directory.
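
As a rough sketch, the bean change described above might look like the following fragment of the job's crawler-beans.cxml. It assumes the job was created from the default profile, where a bean with id warcWriter already exists and is referenced by the disposition chain. Only the class attribute is changed here, and any further properties are left at their defaults.

```xml
<!-- Sketch only: assumes the default profile's warcWriter bean is being edited in place. -->
<!-- The bean id stays "warcWriter" so existing references to it keep working. -->
<bean id="warcWriter"
      class="org.archive.modules.writer.MirrorWriterProcessor">
  <!-- No properties set: the processor's defaults are assumed to be acceptable. -->
</bean>
```

Keeping the original bean id means the rest of the configuration does not need to change. Only the writer implementation is swapped.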