1. Advantages of the wget downloader
1) Supports resuming interrupted downloads
2) Supports both FTP and HTTP downloads
3) Supports proxy servers
4) Simple and convenient to configure
5) Small program, completely free
2. Downloading and installing wget
The official page is: Wget for Windows
Oddly enough, every download link there returned a 301 error.
So I had to download from a different site instead; I recommend GNU Wget 1.21.3 for Windows.
I downloaded the 64-bit zip package of version 1.21.3.
After unzipping, move the whole folder into the System32 directory on the C drive and configure the environment variables; the installation is then complete.
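To verify the setup before moving on, two quick checks in a new cmd window suffice (a minimal sketch; where is the built-in cmd lookup command, and the exact install path depends on where you extracted the folder):
REM Confirm that cmd can locate wget via the PATH
where wget
REM Confirm that the binary actually runs
wget --version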
3. How to use the wget command
3.1 wget's help information: wget --help
If this command runs normally, the installation succeeded. If it instead fails with the error that wget is not recognized as an internal or external command, the installation failed.
That is usually caused by a problem in your environment variable configuration.
C:\Users\Administrator>wget --help
GNU Wget 1.21.3, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...
Mandatory arguments to long options are mandatory for short options too.
Startup:
-V, --version display the version of Wget and exit
-h, --help print this help
-b, --background go to background after startup
-e, --execute=COMMAND execute a `.wgetrc'-style command
Logging and input file:
-o, --output-file=FILE log messages to FILE
-a, --append-output=FILE append messages to FILE
-d, --debug print lots of debugging information
-q, --quiet quiet (no output)
-v, --verbose be verbose (this is the default)
-nv, --no-verbose turn off verboseness, without being quiet
--report-speed=TYPE output bandwidth as TYPE. TYPE can be bits
-i, --input-file=FILE download URLs found in local or external FILE
--input-metalink=FILE download files covered in local Metalink FILE
-F, --force-html treat input file as HTML
-B, --base=URL resolves HTML input-file links (-i -F)
relative to URL
--config=FILE specify config file to use
--no-config do not read any config file
--rejected-log=FILE log reasons for URL rejection to FILE
Download:
-t, --tries=NUMBER set number of retries to NUMBER (0 unlimits)
--retry-connrefused retry even if connection is refused
--retry-on-http-error=ERRORS comma-separated list of HTTP errors to retry
-O, --output-document=FILE write documents to FILE
-nc, --no-clobber skip downloads that would download to
existing files (overwriting them)
--no-netrc don't try to obtain credentials from .netrc
-c, --continue resume getting a partially-downloaded file
--start-pos=OFFSET start downloading from zero-based position OFFSET
--progress=TYPE select progress gauge type
--show-progress display the progress bar in any verbosity mode
-N, --timestamping don't re-retrieve files unless newer than
local
--no-if-modified-since don't use conditional if-modified-since get
requests in timestamping mode
--no-use-server-timestamps don't set the local file's timestamp by
the one on the server
-S, --server-response print server response
--spider don't download anything
-T, --timeout=SECONDS set all timeout values to SECONDS
--dns-servers=ADDRESSES list of DNS servers to query (comma separated)
--bind-dns-address=ADDRESS bind DNS resolver to ADDRESS (hostname or IP) on local host
--dns-timeout=SECS set the DNS lookup timeout to SECS
--connect-timeout=SECS set the connect timeout to SECS
--read-timeout=SECS set the read timeout to SECS
-w, --wait=SECONDS wait SECONDS between retrievals
(applies if more then 1 URL is to be retrieved)
--waitretry=SECONDS wait 1..SECONDS between retries of a retrieval
(applies if more then 1 URL is to be retrieved)
--random-wait wait from 0.5*WAIT...1.5*WAIT secs between retrievals
(applies if more then 1 URL is to be retrieved)
--no-proxy explicitly turn off proxy
-Q, --quota=NUMBER set retrieval quota to NUMBER
--bind-address=ADDRESS bind to ADDRESS (hostname or IP) on local host
--limit-rate=RATE limit download rate to RATE
--no-dns-cache disable caching DNS lookups
--restrict-file-names=OS restrict chars in file names to ones OS allows
--ignore-case ignore case when matching files/directories
-4, --inet4-only connect only to IPv4 addresses
-6, --inet6-only connect only to IPv6 addresses
--prefer-family=FAMILY connect first to addresses of specified family,
one of IPv6, IPv4, or none
--user=USER set both ftp and http user to USER
--password=PASS set both ftp and http password to PASS
--ask-password prompt for passwords
--use-askpass=COMMAND specify credential handler for requesting
username and password. If no COMMAND is
specified the WGET_ASKPASS or the SSH_ASKPASS
environment variable is used.
--no-iri turn off IRI support
--local-encoding=ENC use ENC as the local encoding for IRIs
--remote-encoding=ENC use ENC as the default remote encoding
--unlink remove file before clobber
--keep-badhash keep files with checksum mismatch (append .badhash)
--metalink-index=NUMBER Metalink application/metalink4+xml metaurl ordinal NUMBER
--metalink-over-http use Metalink metadata from HTTP response headers
--preferred-location preferred location for Metalink resources
Directories:
-nd, --no-directories don't create directories
-x, --force-directories force creation of directories
-nH, --no-host-directories don't create host directories
--protocol-directories use protocol name in directories
-P, --directory-prefix=PREFIX save files to PREFIX/..
--cut-dirs=NUMBER ignore NUMBER remote directory components
HTTP options:
--http-user=USER set http user to USER
--http-password=PASS set http password to PASS
--no-cache disallow server-cached data
--default-page=NAME change the default page name (normally
this is 'index.html'.)
-E, --adjust-extension save HTML/CSS documents with proper extensions
--ignore-length ignore 'Content-Length' header field
--header=STRING insert STRING among the headers
--compression=TYPE choose compression, one of auto, gzip and none. (default: none)
--max-redirect maximum redirections allowed per page
--proxy-user=USER set USER as proxy username
--proxy-password=PASS set PASS as proxy password
--referer=URL include 'Referer: URL' header in HTTP request
--save-headers save the HTTP headers to file
-U, --user-agent=AGENT identify as AGENT instead of Wget/VERSION
--no-http-keep-alive disable HTTP keep-alive (persistent connections)
--no-cookies don't use cookies
--load-cookies=FILE load cookies from FILE before session
--save-cookies=FILE save cookies to FILE after session
--keep-session-cookies load and save session (non-permanent) cookies
--post-data=STRING use the POST method; send STRING as the data
--post-file=FILE use the POST method; send contents of FILE
--method=HTTPMethod use method "HTTPMethod" in the request
--body-data=STRING send STRING as data. --method MUST be set
--body-file=FILE send contents of FILE. --method MUST be set
--content-disposition honor the Content-Disposition header when
choosing local file names (EXPERIMENTAL)
--content-on-error output the received content on server errors
--auth-no-challenge send Basic HTTP authentication information
without first waiting for the server's
challenge
HTTPS (SSL/TLS) options:
--secure-protocol=PR choose secure protocol, one of auto, SSLv2,
SSLv3, TLSv1, TLSv1_1, TLSv1_2, TLSv1_3 and PFS
--https-only only follow secure HTTPS links
--no-check-certificate don't validate the server's certificate
--certificate=FILE client certificate file
--certificate-type=TYPE client certificate type, PEM or DER
--private-key=FILE private key file
--private-key-type=TYPE private key type, PEM or DER
--ca-certificate=FILE file with the bundle of CAs
--ca-directory=DIR directory where hash list of CAs is stored
--crl-file=FILE file with bundle of CRLs
--pinnedpubkey=FILE/HASHES Public key (PEM/DER) file, or any number
of base64 encoded sha256 hashes preceded by
'sha256//' and separated by ';', to verify
peer against
--random-file=FILE file with random data for seeding the SSL PRNG
--ciphers=STR Set the priority string (GnuTLS) or cipher list string (OpenSSL) directly.
Use with care. This option overrides --secure-protocol.
The format and syntax of this string depend on the specific SSL/TLS engine.
HSTS options:
--no-hsts disable HSTS
--hsts-file path of HSTS database (will override default)
FTP options:
--ftp-user=USER set ftp user to USER
--ftp-password=PASS set ftp password to PASS
--no-remove-listing don't remove '.listing' files
--no-glob turn off FTP file name globbing
--no-passive-ftp disable the "passive" transfer mode
--preserve-permissions preserve remote file permissions
--retr-symlinks when recursing, get linked-to files (not dir)
FTPS options:
--ftps-implicit use implicit FTPS (default port is 990)
--ftps-resume-ssl resume the SSL/TLS session started in the control connection when
opening a data connection
--ftps-clear-data-connection cipher the control channel only; all the data will be in plaintext
--ftps-fallback-to-ftp fall back to FTP if FTPS is not supported in the target server
WARC options:
--warc-file=FILENAME save request/response data to a .warc.gz file
--warc-header=STRING insert STRING into the warcinfo record
--warc-max-size=NUMBER set maximum size of WARC files to NUMBER
--warc-cdx write CDX index files
--warc-dedup=FILENAME do not store records listed in this CDX file
--no-warc-compression do not compress WARC files with GZIP
--no-warc-digests do not calculate SHA1 digests
--no-warc-keep-log do not store the log file in a WARC record
--warc-tempdir=DIRECTORY location for temporary files created by the
WARC writer
Recursive download:
-r, --recursive specify recursive download
-l, --level=NUMBER maximum recursion depth (inf or 0 for infinite)
--delete-after delete files locally after downloading them
-k, --convert-links make links in downloaded HTML or CSS point to
local files
--convert-file-only convert the file part of the URLs only (usually known as the basename)
--backups=N before writing file X, rotate up to N backup files
-K, --backup-converted before converting file X, back up as X.orig
-m, --mirror shortcut for -N -r -l inf --no-remove-listing
-p, --page-requisites get all images, etc. needed to display HTML page
--strict-comments turn on strict (SGML) handling of HTML comments
Recursive accept/reject:
-A, --accept=LIST comma-separated list of accepted extensions
-R, --reject=LIST comma-separated list of rejected extensions
--accept-regex=REGEX regex matching accepted URLs
--reject-regex=REGEX regex matching rejected URLs
--regex-type=TYPE regex type (posix|pcre)
-D, --domains=LIST comma-separated list of accepted domains
--exclude-domains=LIST comma-separated list of rejected domains
--follow-ftp follow FTP links from HTML documents
--follow-tags=LIST comma-separated list of followed HTML tags
--ignore-tags=LIST comma-separated list of ignored HTML tags
-H, --span-hosts go to foreign hosts when recursive
-L, --relative follow relative links only
-I, --include-directories=LIST list of allowed directories
--trust-server-names use the name specified by the redirection
URL's last component
-X, --exclude-directories=LIST list of excluded directories
-np, --no-parent don't ascend to the parent directory
Email bug reports, questions, discussions to <bug-wget@gnu.org>
and/or open issues at https://savannah.gnu.org/bugs/?func=additem&group=wget.
3.2 Basic usage: wget URL
Suppose we want to download the cover image of a certain Bilibili video. We can run 【wget https://i2.hdslb.com/bfs/archive/75c3cff8734a76c3a671d9729eb50dbb7f7dc1c6.jpg@672w_378h_1c.webp】.
The download succeeds.
Opening this webp image in 2345看图王 (an image viewer) confirms it displays correctly.
Note: not every image can be downloaded with wget. For example, the Google logo lives at https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png, and when we try to fetch it with wget, the server refuses to respond.
Yet opening the same link directly in a browser works fine.
Downloading it through the browser succeeds; the resulting file merely lacks an extension.
After adding the extension, the image opens without any problem.
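A plausible explanation (my assumption, not something I verified against Google's servers) is that some servers reject requests carrying wget's default Wget/VERSION User-Agent. The -U / --user-agent option listed in the help output above lets wget identify itself as a browser; a minimal sketch:
REM Send a browser-like User-Agent instead of the default Wget/1.21.3;
REM the UA string here is only an illustrative example
wget -U "Mozilla/5.0" https://www.google.com/images/branding/googlelogo/2x/googlelogo_color_92x30dp.png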
Finally, let's try downloading a web page, say the Baidu search homepage, by running 【wget www.baidu.com】.
The page is saved to the current directory; opening it shows it looks as expected.
3.3 Writing the download log to a file: the -o option (lowercase)
The long form of the short option -o is --output-file. To download the Baidu homepage while logging to a file, run 【wget -o log.txt www.baidu.com】 or 【wget --output-file=log.txt www.baidu.com】.
After the command runs, a log file and an html file appear in the current directory, and the log output no longer shows in the cmd window.
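To confirm what was captured, you can print the log from the same cmd session (a trivial sketch using the built-in type command):
REM Display the log that would otherwise have appeared in the console
type log.txt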
3.4 Reading URLs in bulk from a text file: the -i option
The long form of -i is --input-file. To download a batch of resources whose URLs are stored in a text file, run 【wget -i url.txt】 or 【wget --input-file=url.txt】.
Create a txt file containing two URLs, one per line, for example as shown below.
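A minimal url.txt might look like this (the two URLs are simply the ones used earlier in this article):
https://i2.hdslb.com/bfs/archive/75c3cff8734a76c3a671d9729eb50dbb7f7dc1c6.jpg@672w_378h_1c.webp
http://www.baidu.com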
Run 【wget -i url.txt】; both downloads succeed.
3.5 Downloading and renaming: the -O option (uppercase)
From the subsections above it is easy to conclude that if you do not name the downloaded file yourself, wget generally either adds an html extension automatically or adds no extension at all.
So, to save yourself the trouble of renaming files later in Explorer, you can do it directly in the download command.
The long form of -O is --output-document. To download and rename in one step, run 【wget -O filename url】 or 【wget --output-document=filename url】.
For example, run 【wget -O baidu.html www.baidu.com】.
If you want to save into a specific directory and rename the file at the same time, use the form 【wget -O filepath url】, where the last component of filepath is the file name.
So the -O option not only renames the file; it can also cover the role of the -P option described next.
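Combining the two ideas, a sketch (with one caveat: unlike -P, the -O option does not create missing directories, so the webpage folder must already exist here):
REM Save the Baidu homepage as baidu.html inside the existing folder webpage
wget -O webpage\baidu.html www.baidu.com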
3.6 Downloading into a specified folder: the -P option (uppercase)
The long form of -P is --directory-prefix. To download a resource into a specified folder, run 【wget -P directory url】 or 【wget --directory-prefix=directory url】.
To save the download into a directory named webpage (which need not exist yet), run 【wget -P webpage www.baidu.com】.
3.7 Recursively downloading an entire site: the -r option
The long form of -r is --recursive. To download all the resources of a site, run 【wget -r url】 or 【wget --recursive url】.
Let's try crawling CSDN. Since the downloaded resources are all saved in a folder named after the URL, we can run 【wget -r www.csdn.net】 without worrying about the files getting mixed up with anything else.
After the command finishes, only two files have been downloaded, one of them named robots.txt. wget honors robots.txt by default during recursive downloads, so CSDN's crawler restrictions stopped the recursion right away. It is a big commercial site, after all, so I'll concede defeat...
Let's try crawling some smaller sites instead, say the official website of a well-known university in Beijing. (A friendly reminder: don't mess around here. If your crawl causes no trouble, fine; but if it does, the university might send you a lawyer's letter. So don't crawl continuously; a quick try is enough.)
Run 【wget -r www.tsinghua.edu.cn】. The cmd window keeps scrolling, which suggests this site has no anti-crawling mechanism.
But not wanting to flirt with anything "criminal", I pressed Ctrl+C to stop the crawl; the resources downloaded up to that point remain in the site-named folder.
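For a shallow, polite test rather than a full mirror, the depth and wait options from the help output above can be combined (a sketch; the depth and delay values are arbitrary choices of mine):
REM Recurse at most 2 levels deep, pause 1 second between requests,
REM and never ascend to the parent directory
wget -r -l 2 -w 1 -np www.tsinghua.edu.cn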
4. Summary and reflections on cmd commands
While experimenting with the -i option above, I found that running 【wget -i=url.txt】 fails because the file cannot be found.
Later, after reading other articles about options that come in both short and long forms, I concluded that it is safest to separate an option from its value with a space.
The reason: for the short form, the option and its value must not be joined with an equals sign, though a space works; for the long form, either an equals sign or a space is fine.
With the wrong syntax, the value gets passed to the option incorrectly.
For example, after running 【wget -o=log.txt www.baidu.com】, the log file produced is named not log.txt but =log.txt.
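To put the syntax rules side by side (a sketch; the last command is the failure mode, shown deliberately):
REM Short option with a space: correct
wget -o log.txt www.baidu.com
REM Long option with '=' or a space: both correct
wget --output-file=log.txt www.baidu.com
wget --output-file log.txt www.baidu.com
REM Short option with '=': wrong, creates a log file literally named "=log.txt"
wget -o=log.txt www.baidu.com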