ivspider - Part 2 of the wget Wrapper Project

Article link: https://blog.csdn.net/yiivon/article/details/6825164

This post is about tt, a small example of how to use ivspider.

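TT is a simple wrapper around ivspider that runs in the Windows command-line console. You can download the release build for free and use it as you like, or download the source and modify it freely: http://yiivon.com/download/ .
What follows is a detailed introduction to using TT.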

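There are two ways to invoke TT:
1. In cmd, change to the directory containing tt, then run tt -url http://xxxx … . Alternatively, copy tt.exe and ivspider.dll to c:\windows\system32, after which tt can be called directly from any cmd prompt;
2. Call it from a batch file.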
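Calling tt from a batch file works the same way from any other script. As a minimal sketch, the Python helper below assembles a tt command line from keyword options. The name `build_tt_command` and its keyword convention are hypothetical and not part of tt; the flags themselves are the ones shown in tt's own help text.

```python
def build_tt_command(url, exe="tt", **options):
    """Build an argument list for tt (helper name is hypothetical).

    True        -> bare switch        (full=True -> -FULL)
    other value -> switch plus value  (level=5   -> -LEVEL 5)
    False/None  -> option is left out
    """
    cmd = [exe, "-URL", url]
    for name, value in options.items():
        flag = "-" + name.upper()
        if value is True:
            cmd.append(flag)
        elif value is False or value is None:
            continue
        else:
            cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    cmd = build_tt_command("http://yiivon.com", level=5, full=True, ct=10)
    print(" ".join(cmd))
    # To actually run it, tt.exe and ivspider.dll must be reachable:
    #   import subprocess
    #   subprocess.run(cmd, check=False)
```

Since tt treats parameter names as case-insensitive, upper-casing the keyword names is only a cosmetic choice.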

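When tt is run in cmd with no arguments (or with -h, --help, or an invalid argument), it prints its usage directly. Because Chinese text is so often garbled in cmd, I wrote the help in my own "English" ^_~. For example: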

D:LABSPIDERPublictt>tt

tt.exe 1.0.0.1

Usage:
tt -URL url [-LEVEL num] [-FULL] [-NORMAL] [...] ...
Note:
  parameter include in '[]' is optional
  parameter do not case sensitive. -URL equal to -url and so on

-URL   url=string  Need, url which need scan.
-LEVEL   level=num, Opt, max level of recursive, 2 default.
-MAX   max=num, Opt, max amout of downloaded url, 30 default.
-FULL   full-mode, Opt, scan all sub links, default not set.
-NORMAL   normal-mode, Opt, only scan url in <script>, <[i]frame>, <embed>, <link>, <meta>, default by set.
-IMG   image-mode, Opt, only retrieve <img> href in current page, default not set.
-FW   list-follow, Opt, a list of marks to be grabbed separate by comma, default dependent on STR_CMD_NORMAL.
-IG   list-ignore, Opt, a list of marks to ignore grabbed separate by comma, default not set.
-NR   no-relative, Opt, scan the relative url only, default not set.
-NC   no-cache, Opt, no web cache, default by set.
-NDC   no-dns-cache, Opt, clear DNS cache under per download, default not set.
-NP   no-parent, Opt, do not retriever parent page in sub page, default by set.
-LR   limit-rate=num, Opt, limit-rate of download from http server, 0 default.
-CT   conn-timeout=num(by ms), Opt, timeout of connect to server, 20 default.
-RT   read-timeout=num(by ms), Opt, timeout of read data from server, 10 default.
-DNST   dns-timeout=num(by ms), Opt, timeout of dns parsing, 20 default.
-SH   span-host, Opt, ignore corss-site url, default not set.
-RTRY   retrys=num, Opt, numbers of retry per url, default 3.
-RCF   retry-conn-refused, Opt, retry connect after server refused, default not set.
-PING   Ping web, server and retrieve response http-code(as like 200) only, default not set.
-TREE   tree-show, Opt, show grab result as a tree, default not set.
-FILE   write-to-file=string, Opt, marking and write its data to local file, default not set.
-DIR   directory-download-file=string, Opt, directory storing all download file. default app-root.
-H   Display this help.
--help   Display this help.

Such as:
tt -URL http://yahoo.com.cn
tt -URL http://yahoo.com.cn -LEVEL 5 -FULL -CT 10
...
Any question, please visit out site: http://www.yiivon.com/document/
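The numeric defaults in the help text above can be gathered into one table, which makes it easy to see what a given command line actually changes. This is only a sketch: the names `TT_DEFAULTS` and `effective_settings` are made up for illustration, and the values are copied from the help output as printed.

```python
# Defaults transcribed from tt's help text; the dict itself is illustrative
# and not part of tt.
TT_DEFAULTS = {
    "LEVEL": 2,   # max level of recursion
    "MAX": 30,    # max amount of downloaded URLs
    "LR": 0,      # download rate limit
    "CT": 20,     # connect timeout
    "RT": 10,     # read timeout
    "DNST": 20,   # DNS timeout
    "RTRY": 3,    # retries per URL
}


def effective_settings(**overrides):
    """Return the defaults with the given overrides applied."""
    unknown = set(overrides) - set(TT_DEFAULTS)
    if unknown:
        raise KeyError("unknown option(s): " + ", ".join(sorted(unknown)))
    return {**TT_DEFAULTS, **overrides}


if __name__ == "__main__":
    # Numeric part of: tt -URL http://yahoo.com.cn -LEVEL 5 -FULL -CT 10
    print(effective_settings(LEVEL=5, CT=10))
```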

The simplest usage (with the default settings):

tt -url http://yiivon.com

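This tells tt to crawl as follows:
Start from http://yiivon.com (the site's home page) as the first page
Fetch at most 2 levels of pages and 30 links
Fetch only the pages linked from <link>, <script>, <meta>, <iframe>, and <frame> tags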
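Use the default timeouts for each link (or page); per the help text above these are 20 for DNS lookup, 20 for connecting, and 10 for reading data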
Do not fetch links outside the home page's domain
And so on

(Screenshot of part of the result.)

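Testing whether the links in specific tags are valid. For example, to verify that all the js and css files on the http://yiivon.com home page are valid, use the -FW option: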

tt -url http://yiivon.com -fw script,link

Note: css files are referenced in <link> tags.
The crawl result then shows the status of each js and css link (ok, loss …).
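To make the idea of a follow list concrete, here is a small Python sketch that collects URLs only from a chosen set of tags, using the standard library's `html.parser`. It is not how tt or ivspider is implemented; the `URL_ATTRS` table, the class name, and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Which attribute carries the URL for each tag (an assumption for this sketch).
URL_ATTRS = {"script": "src", "link": "href", "img": "src", "a": "href"}


class TagLinkCollector(HTMLParser):
    """Collect (tag, absolute_url) pairs for the tags in `follow`."""

    def __init__(self, base_url, follow):
        super().__init__()
        self.base_url = base_url
        self.follow = {t.lower() for t in follow}
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.follow:
            return
        wanted = URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                # Resolve relative URLs against the page URL.
                self.links.append((tag, urljoin(self.base_url, value)))


def collect_links(html, base_url, follow):
    parser = TagLinkCollector(base_url, follow)
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    page = ('<head><link rel="stylesheet" href="/css/site.css">'
            '<script src="js/app.js"></script></head>'
            '<body><a href="/about">About</a><img src="logo.png"></body>')
    for tag, url in collect_links(page, "http://yiivon.com/", ["script", "link"]):
        print(tag, url)
```

With a follow list of script and link, the <a> and <img> entries in the sample page are skipped, which mirrors the intent of -fw script,link.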

To fetch and save all the images on a page and its sub-pages (the links in <a> tags), use the -FW and -FILE options:

tt -url http://yiivon.com -fw a,img -file img

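This crawls both <a> and <img> tags, but saves only the images from <img> tags to local disk.
Note: if the images or anchors come from another site (domain), add -sh.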
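The effect of -sh can be pictured as a host filter applied to every discovered link. The sketch below assumes a plain exact-hostname comparison; how tt itself decides that two URLs belong to the same site is not documented here, and the function names and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def same_host(start_url, candidate_url):
    """True when both URLs have exactly the same hostname (assumed rule)."""
    return urlsplit(start_url).hostname == urlsplit(candidate_url).hostname


def filter_links(start_url, links, span_host=False):
    """Keep every link when span_host is set, else only same-host links."""
    if span_host:
        return list(links)
    return [u for u in links if same_host(start_url, u)]


if __name__ == "__main__":
    found = ["http://yiivon.com/a.png", "http://img.example.com/b.png"]
    print(filter_links("http://yiivon.com/", found))
    print(filter_links("http://yiivon.com/", found, span_host=True))
```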

To make the spider fetch fresh data every time, use -NC and -NDC. These two options disable the page data cache and the DNS cache, respectively. For example:

tt -url http://yiivon.com/ -NC -NDC

To keep the spider from spinning its web endlessly, use -MAX and -LEVEL to cap the number of links fetched and the crawl depth, respectively:

tt -url http://yiivon.com/ -fw a,script -max 100 -level 3
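How the two limits interact can be sketched as a breadth-first crawl over a link graph. This is an illustration only, not ivspider's algorithm: it assumes the start page counts as level 1, and `get_links` is a stand-in for whatever actually downloads a page and extracts its links.

```python
from collections import deque


def crawl(start, get_links, max_urls=30, max_level=2):
    """Breadth-first crawl bounded by a URL count and a depth.

    Assumption: the start page is level 1, so max_level=2 means
    "the start page plus the pages it links to".
    """
    seen = {start}
    order = []
    queue = deque([(start, 1)])
    while queue and len(order) < max_urls:
        url, level = queue.popleft()
        order.append(url)          # "download" this URL
        if level >= max_level:
            continue               # too deep: do not expand its links
        for nxt in get_links(url):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, level + 1))
    return order


if __name__ == "__main__":
    # A tiny in-memory site used in place of real downloads.
    site = {"/": ["/a", "/b"], "/a": ["/a1", "/"], "/b": ["/b1"], "/a1": ["/a2"]}
    print(crawl("/", lambda u: site.get(u, []), max_urls=100, max_level=3))
```

The `seen` set is what stops the crawl from looping on pages that link back to each other; the two limits then bound how far it spreads.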

ivspider - http://yiivon.com/ivspider/

tt - http://yiivon.com/download/tt/