![268379de37fdd149c3deaf6eca203c5c.png](https://i-blog.csdnimg.cn/blog_migrate/330f9b8dffcdd5747bf837f371403938.jpeg)
If you want to visit a web page directly from the command line of a Linux server, such as CentOS or Ubuntu, especially in an environment with no GUI, you can use curl. But curl only fetches the page; to read its content you have to save it and then view the file, which makes curl essentially a crawler-style download tool. To read a page's content directly in the terminal, you need a better-suited tool. Below are two such tools, links and lynx, which let you view page content in a command-line browser.
1. A first look at curl
Start right from the command line with curl www.baidu.com, which prints the following:
root@VM-0-10-ubuntu:~# curl www.baidu.com
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer>
<link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
<title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head>
<div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg>
<img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div>
<form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1>
<input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8>
<input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1>
<input type=hidden name=tn value=baidu><span class="bg s_ipt_wr">
<input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span>
<span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form>
</div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a>
<a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图
</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script>
<a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>©2017&nbsp;Baidu <a href=http://www.baidu.com/duty/>使用百度前必读</a> <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a> 京ICP证030173号 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
As the output shows, what comes back is the source code of the Baidu homepage, not the rendered page we normally see. This is essentially what a crawler does: it scrapes the source file. To save the output, add the -o option to curl:
root@VM-0-10-ubuntu:~# curl -o baidu.dat www.baidu.com  # fetch the Baidu homepage source and save it to baidu.dat
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 2381 100 2381 0 0 14152 0 --:--:-- --:--:-- --:--:-- 14172
root@VM-0-10-ubuntu:~# ll
total 80
drwx------ 6 root root 4096 Dec 2 08:28 ./
drwxr-xr-x 25 root root 4096 Dec 2 08:29 ../
-rw-r--r-- 1 root root 2381 Dec 2 08:28 baidu.dat
root@VM-0-10-ubuntu:~# more baidu.dat  # view the contents of baidu.dat
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text
...(output omitted)
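Once the page is saved, curl combines naturally with standard text tools to pull out just the part you care about. The sketch below uses a local file and curl's file:// protocol support so it runs without network access; the file path and page content are made up for illustration:

```shell
# Create a small page locally, then fetch it back with curl.
# file:// stands in for http:// so the example works offline.
printf '<html><head><title>demo page</title></head></html>\n' > /tmp/demo.html
curl -s -o /tmp/demo_copy.html file:///tmp/demo.html

# Pull out only the <title> element instead of paging through the source.
grep -o '<title>[^<]*</title>' /tmp/demo_copy.html
```

Here grep prints `<title>demo page</title>`; against a real page you would point curl at the http:// URL instead.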
To display the communication that takes place while curl fetches a page, add the -v option:
root@VM-0-10-ubuntu:~# curl -v www.baidu.com
* Rebuilt URL to: www.baidu.com/
* Trying 119.63.197.139...
* Connected to www.baidu.com (119.63.197.139) port 80 (#0)
> GET / HTTP/1.1
> Host: www.baidu.com
> User-Agent: curl/7.47.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Connection: keep-alive
< Content-Length: 2381
< Content-Type: text/html
< Date: Wed, 02 Dec 2020 00:34:06 GMT
< Etag: "588604ec-94d"
< Last-Modified: Mon, 23 Jan 2017 13:28:12 GMT
< Pragma: no-cache
< Server: bfe/1.0.8.18
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
<
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1
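The verbose exchange above also hints at a common scripting idiom: when only the outcome of a request matters, discard the body with -o /dev/null and print a single transfer variable with -w. The sketch below uses a file:// URL so it runs offline; against a real server you would use an http:// URL, where '%{http_code}' prints the HTTP status code:

```shell
# Discard the body (-o /dev/null) and print one statistic afterwards (-w).
printf 'hello' > /tmp/probe.txt
curl -s -o /dev/null -w '%{size_download}\n' file:///tmp/probe.txt
```

This prints the number of bytes transferred (here, the five bytes of "hello").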
This is why many references summarize curl as an HTTP command-line tool for issuing network requests and then retrieving and extracting data. If you want to scrape web data under Linux, curl is an excellent option.
2. Using links
curl retrieves a page's source code rather than the content the page renders. links can display the content directly; install it first:
root@VM-0-10-ubuntu:~# apt install links  # install links
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
links
0 upgraded, 1 newly installed, 0 to remove and 245 not upgraded.
Need to get 394 kB of archives.
After this operation, 1,348 kB of additional disk space will be used.
Get:1 http://mirrors.tencentyun.com/ubuntu xenial/universe amd64 links amd64 2.12-1 [394 kB]
Fetched 394 kB in 0s (4,296 kB/s)
Selecting previously unselected package links.
(Reading database ... 99048 files and directories currently installed.)
Preparing to unpack .../links_2.12-1_amd64.deb ...
Unpacking links (2.12-1) ...
Processing triggers for man-db (2.7.5-1) ...
Processing triggers for mime-support (3.59ubuntu1) ...
Setting up links (2.12-1) ...
Then browse by running links followed by a URL:
![8e3655dfcda3ae73bfa253a2f68ea85f.png](https://i-blog.csdnimg.cn/blog_migrate/4c1d34a9dffceaddd5e346e479481dd8.png)
The screenshot above shows a visit to the Sina homepage. Constrained by the terminal height, only part of the content is visible as text. While links is running, press ESC to open its menu:
![d150b375b93e199c2c9cfc2042c563f5.png](https://i-blog.csdnimg.cn/blog_migrate/8db0c972522f3a2c9e8336f3b3e7f08b.jpeg)
The menu lets you do much more with the page, including saving it, downloading files, and following links. To learn more about links, run links --help to see its help text:
root@VM-0-10-ubuntu:~# links --help
links [options] URL
Options are:
-help
Prints this help screen
-version
Prints the links version number and exit.
-lookup <hostname>
Does name lookup, like command "host".
-g
Run in graphics mode.
-no-g
Run in text mode (overrides previous -g).
-driver <driver name>
Graphics driver to use. Drivers are: x, svgalib, fb, directfb, pmshell,
atheos.
List of drivers will be shown if you give it an unknown driver.
Available drivers depend on your operating system and available libraries.
-mode <graphics mode>
Graphics mode. For SVGALIB it is in format COLUMNSxROWSxCOLORS --
...(output omitted)
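One option deserving special mention for scripting is links -dump, which renders a page (or a local HTML file) as formatted text on stdout and exits without opening the interactive UI. A minimal sketch against a made-up local file:

```shell
# Render formatted text to stdout instead of opening the interactive browser.
printf '<html><body><h1>Hello</h1><p>from links</p></body></html>\n' > /tmp/hello.html
links -dump /tmp/hello.html
```

The same works with a URL, e.g. links -dump http://example.com.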
3. Using lynx
From the official site (Lynx Information):
Lynx is a customizable text-based web browser for use on cursor-addressable character cell terminals. As of January 2019, it is the oldest web browser still in general use and active development, having started in 1992.
As the description says, lynx is the oldest web browser still in general use, going back to 1992 and still under active development. Start by viewing its help text with lynx --help:
root@VM-0-10-ubuntu:~# lynx --help
USAGE: lynx [options] [file]
Options are:
- receive options and arguments from stdin
-accept_all_cookies
accept cookies without prompting if Set-Cookie handling
is on (off)
-anonymous apply restrictions for anonymous account,
see also -restrictions
-assume_charset=MIMEname
charset for documents that don't specify it
-assume_local_charset=MIMEname
charset assumed for local files
-assume_unrec_charset=MIMEname
use this instead of unrecognized charsets
-auth=id:pw authentication information for protected documents
-base prepend a request URL comment and BASE tag to text/html
outputs for -source dumps
-bibhost=URL local bibp server (default http://bibhost/)
-book use the bookmark page as the startfile (off)
-buried_news toggles scanning of news articles for buried references (on)
-cache=NUMBER NUMBER of documents cached in memory
-case enable case sensitive user searching (off)
-center toggle center alignment in HTML TABLE (off)
-cfg=FILENAME specifies a lynx.cfg file other than the default
-child exit on left-arrow in startfile, and disable save to disk
-child_relaxed exit on left-arrow in startfile (allows save to disk)
-cmd_log=FILENAME log keystroke commands to the given file
-cmd_script=FILENAME
read keystroke commands from the given file
(see -cmd_log)
-connect_timeout=N
set the N-second connection timeout (18000)
-cookie_file=FILENAME
specifies a file to use to read cookies
-cookie_save_file=FILENAME
specifies a file to use to store cookies
-cookies toggles handling of Set-Cookie headers (on)
...(output omitted)
That is a lot of options, which shows how many aspects of HTTP requests and page content this tool can handle. As a quick test, open a page by running lynx followed by a URL:
root@VM-0-10-ubuntu:~# lynx www.baidu.com
Looking up 'www.baidu.com' first
www.baidu.com cookie: BAIDUID=16F1058D693B97613F35A4F360117CDA:FG=1 Allow? (Y/N/Always/neVer)
(after answering Y)
www.baidu.com cookie: PSTM=1606870861 Allow? (Y/N/Always/neVer)
(after answering Y)
#百度搜索
REFRESH(0 sec): http://www.baidu.com/baidu.html?from=noscript
<style data-for="result" type="text/css" >body{color:#333;background:#fff;padding:6px 0 0;margin:0;position:relative}body,th,td,.p1,.p2{font-family:
____________________________________________________________________________________________________________________________________________________
____________________________________________________________________________________________________________________________________________________
____________________________________________________________________________________________________________________________________________________
<style data-for="result" id="css_result" type="text/css">#ftCon{display:none}_______________________________________________________________________
#qrcode{display:none}_______________________________________________________________________________________________________________________________
#pad-version{display:none}__________________________________________________________________________________________________________________________
#index_guide{display:none}__________________________________________________________________________________________________________________________
#index_logo{display:none}___________________________________________________________________________________________________________________________
#u1{display:none}___________________________________________________________________________________________________________________________________
.s-top-left{display:none}___________________________________________________________________________________________________________________________
.s_ipt_wr{height:32px}______________________________________________________________________________________________________________________________
body{padding:0}_____________________________________________________________________________________________________________________________________
#head .c-icon-bear-round{display:none}______________________________________________________________________________________________________________
.index_tab_top{display:none}________________________________________________________________________________________________________________________
.index_tab_bottom{display:none}_____________________________________________________________________________________________________________________
#lg{display:none}___________________________________________________________________________________________________________________________________
#m{display:none}____________________________________________________________________________________________________________________________________
#ftCon{display:none}________________________________________________________________________________________________________________________________
#bottom_layer,#bottom_space,#s_wrap{display:none}___________________________________________________________________________________________________
.s-isindex-wrap{display:none}_______________________________________________________________________________________________________________________
#nv{display:none!important}_________________________________________________________________________________________________________________________
(NORMAL LINK) Use right-arrow or <return> to activate.
Arrow keys: Up and Down to move. Right to follow a link; Left to go back.
H)elp O)ptions P)rint G)o M)ain screen Q)uit /=search [delete]=history list
When visiting Baidu, you first have to agree to accept its cookies; then the page is displayed. There is a lot of content here as well, so use the arrow keys to scroll down and see the details:
![681210a67bd60a2925d6b8e7b4998e3e.png](https://i-blog.csdnimg.cn/blog_migrate/9a9fba72a012f08f706c187d6548453f.jpeg)
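lynx, too, can be driven non-interactively: its -dump option prints the rendered page as plain text and exits, which is what makes it usable in scripts. A small sketch with a made-up local file:

```shell
# -dump renders the page to stdout and exits instead of opening the browser.
printf '<html><body><h1>Stats</h1><p>job finished</p></body></html>\n' > /tmp/stats.html
lynx -dump /tmp/stats.html
```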
4. A quick summary
We are used to opening pages in a graphical browser, so why do this from the Linux command line at all? In many situations lynx and links are used not to browse the public web but to test, or exercise, web systems running inside the server itself. For example, lynx can be combined with crontab to schedule recurring jobs, such as a nightly stats run, by adding a line via crontab -e:
30 22 * * * lynx http://localhost/finance/statsDo  # hit the local web service every day at 22:30
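One caveat: lynx is interactive by default, so a cron job that runs it plainly can hang waiting for keyboard input. A common pattern, shown here as a sketch (the /finance/statsDo endpoint comes from the example above and is hypothetical), is to add -dump and discard the output:

```shell
# crontab -e entry: request the local stats page every day at 22:30.
# -dump renders the page non-interactively and exits; output is discarded.
30 22 * * * lynx -dump http://localhost/finance/statsDo > /dev/null 2>&1
```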