Linux command-line browsers: links and lynx


If you want to access web pages directly from the command line of a Linux server, such as CentOS or Ubuntu, especially in an environment with no GUI, you can use curl. But curl only fetches the page: to read its content you have to save it and then open the file, which makes curl essentially a crawler-style download tool. To read a page's content directly, you need something else. This article introduces two tools, links and lynx, that let you browse page content from the command line.

1. A first look at curl

Start by running it straight from the command line: curl www.baidu.com. The output looks like this:

root@VM-0-10-ubuntu:~# curl www.baidu.com
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer>
<link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css>
<title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> 
<div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> 
<img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> 
<form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1>
 <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> 
<input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1>
 <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr">
<input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span>
<span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form>
 </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新闻</a>
 <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地图
</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbs
p;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>

As the output shows, what comes back is the HTML source of the Baidu homepage, not the rendered page you would see in a browser. This is essentially what a crawler does: fetch the source. To save the output to a file, add the -o option:

root@VM-0-10-ubuntu:~# curl -o baidu.dat www.baidu.com   # fetch the Baidu homepage source and save it to baidu.dat
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2381  100  2381    0     0  14152      0 --:--:-- --:--:-- --:--:-- 14172
root@VM-0-10-ubuntu:~# ll
total 80
drwx------  6 root root  4096 Dec  2 08:28 ./
drwxr-xr-x 25 root root  4096 Dec  2 08:29 ../
-rw-r--r--  1 root root  2381 Dec  2 08:28 baidu.dat
root@VM-0-10-ubuntu:~# more baidu.dat  # view the contents of baidu.dat
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8>
<meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=alw
ays name=referrer><link rel=stylesheet type=text
......(output truncated)
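
A side note: the numeric table at the top of the -o run is curl's progress meter. When piping a page into another command, the -s (silent) option suppresses it; a minimal sketch (the head -n 3 here is just for illustration):

root@VM-0-10-ubuntu:~# curl -s www.baidu.com | head -n 3   # -s hides the progress meter; show only the first 3 lines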

To watch the HTTP exchange while curl accesses a page, add the -v option:

root@VM-0-10-ubuntu:~# curl -v www.baidu.com
* Rebuilt URL to: www.baidu.com/
*   Trying 119.63.197.139...
* Connected to www.baidu.com (119.63.197.139) port 80 (#0)
> GET / HTTP/1.1
> Host: www.baidu.com
> User-Agent: curl/7.47.0
> Accept: */*
> 
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Connection: keep-alive
< Content-Length: 2381
< Content-Type: text/html
< Date: Wed, 02 Dec 2020 00:34:06 GMT
< Etag: "588604ec-94d"
< Last-Modified: Mon, 23 Jan 2017 13:28:12 GMT
< Pragma: no-cache
< Server: bfe/1.0.8.18
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/
< 
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1

This is why many references summarize curl as an HTTP command-line tool for issuing network requests and retrieving data. If you need to fetch web data on Linux, curl is an excellent choice.
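
To illustrate curl as a general-purpose HTTP tool (common usage, not from the original article): -I fetches only the response headers, and -L makes curl follow redirects:

root@VM-0-10-ubuntu:~# curl -I www.baidu.com                  # print response headers only
root@VM-0-10-ubuntu:~# curl -sL www.baidu.com -o baidu.html   # follow redirects silently and save the page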

2. Using links

curl retrieves the page source, not the content as a browser would render it. links can display the content directly. It first needs to be installed:

root@VM-0-10-ubuntu:~# apt install links   # install links
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  links
0 upgraded, 1 newly installed, 0 to remove and 245 not upgraded.
Need to get 394 kB of archives.
After this operation, 1,348 kB of additional disk space will be used.
Get:1 http://mirrors.tencentyun.com/ubuntu xenial/universe amd64 links amd64 2.12-1 [394 kB]
Fetched 394 kB in 0s (4,296 kB/s)
Selecting previously unselected package links.
(Reading database ... 99048 files and directories currently installed.)
Preparing to unpack .../links_2.12-1_amd64.deb ...
Unpacking links (2.12-1) ...
Processing triggers for man-db (2.7.5-1) ...
Processing triggers for mime-support (3.59ubuntu1) ...
Setting up links (2.12-1) ...
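
(The same install on CentOS would go through yum; my assumption is that links is packaged in the EPEL repository there:)

yum install -y epel-release && yum install -y links   # CentOS equivalent; assumes links lives in EPEL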

Then browse by passing a URL directly to links:
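
For example, to open the Sina homepage shown in the screenshot below (the exact URL here is my assumption):

root@VM-0-10-ubuntu:~# links www.sina.com.cn   # open the page in the full-screen text browser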

[Screenshot: the Sina homepage rendered as text in links]

As shown above, accessing the Sina homepage displays part of the page text; how much is visible at once is limited by the terminal height. While links is running, press ESC to bring up its menu:

[Screenshot: the links menu bar opened with ESC]

From the menu you can do much more with the page, including saving, downloading, and following links. To learn more about what links offers, run links --help to see the help text:

root@VM-0-10-ubuntu:~# links --help
links [options] URL

Options are:

 -help
  Prints this help screen

 -version
  Prints the links version number and exit.

 -lookup <hostname>
  Does name lookup, like command "host".

 -g
  Run in graphics mode.

 -no-g
  Run in text mode (overrides previous -g).

 -driver <driver name>
  Graphics driver to use. Drivers are: x, svgalib, fb, directfb, pmshell,
    atheos.
  List of drivers will be shown if you give it an unknown driver.
  Available drivers depend on your operating system and available libraries.

 -mode <graphics mode>
  Graphics mode. For SVGALIB it is in format COLUMNSxROWSxCOLORS --
......(output truncated)
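
One option worth calling out for scripting is -dump, which writes the rendered page to stdout instead of opening the interactive browser (a minimal sketch; the head pipe just keeps the output short):

root@VM-0-10-ubuntu:~# links -dump www.baidu.com | head -n 20   # render the page as plain text, first 20 lines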

3. Using lynx

Official site: Lynx Information

Lynx is a customizable text-based web browser for use on cursor-addressable character cell terminals. As of January 2019, it is the oldest web browser still in general use and active development, having started in 1992.

As this description says, lynx is the oldest web browser still in general use, dating back to 1992 and still actively developed. If it is not already on your system, it installs the same way links did above (an assumption: the Ubuntu package is simply named lynx):
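
root@VM-0-10-ubuntu:~# apt install lynx   # install lynx; package name assumed to be lynx on Ubuntu

With lynx installed, lynx --help shows the (very long) list of options: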

root@VM-0-10-ubuntu:~# lynx --help
USAGE: lynx [options] [file]
Options are:
  -                 receive options and arguments from stdin
  -accept_all_cookies 
                    accept cookies without prompting if Set-Cookie handling
                    is on (off)
  -anonymous        apply restrictions for anonymous account,
                    see also -restrictions
  -assume_charset=MIMEname
                    charset for documents that don't specify it
  -assume_local_charset=MIMEname
                    charset assumed for local files
  -assume_unrec_charset=MIMEname
                    use this instead of unrecognized charsets
  -auth=id:pw       authentication information for protected documents
  -base             prepend a request URL comment and BASE tag to text/html
                    outputs for -source dumps
  -bibhost=URL      local bibp server (default http://bibhost/)
  -book             use the bookmark page as the startfile (off)
  -buried_news      toggles scanning of news articles for buried references (on)
  -cache=NUMBER     NUMBER of documents cached in memory
  -case             enable case sensitive user searching (off)
  -center           toggle center alignment in HTML TABLE (off)
  -cfg=FILENAME     specifies a lynx.cfg file other than the default
  -child            exit on left-arrow in startfile, and disable save to disk
  -child_relaxed    exit on left-arrow in startfile (allows save to disk)
  -cmd_log=FILENAME log keystroke commands to the given file
  -cmd_script=FILENAME
                    read keystroke commands from the given file
                    (see -cmd_log)
  -connect_timeout=N
                    set the N-second connection timeout (18000)
  -cookie_file=FILENAME
                    specifies a file to use to read cookies
  -cookie_save_file=FILENAME
                    specifies a file to use to store cookies
  -cookies          toggles handling of Set-Cookie headers (on)
...(output truncated)

That is a lot of options, which shows how many aspects of HTTP requests and page handling this tool covers. As a simple first test, run lynx followed by a URL:

root@VM-0-10-ubuntu:~# lynx www.baidu.com
Looking up  'www.baidu.com' first
www.baidu.com cookie: BAIDUID=16F1058D693B97613F35A4F360117CDA:FG=1 Allow? (Y/N/Always/neVer)
(after answering Y)
www.baidu.com cookie: PSTM=1606870861 Allow? (Y/N/Always/neVer)
(after answering Y)
   #百度搜索
   REFRESH(0 sec): http://www.baidu.com/baidu.html?from=noscript

   <style data-for="result" type="text/css" >body{color:#333;background:#fff;padding:6px 0 0;margin:0;position:relative}body,th,td,.p1,.p2{font-family:
   ____________________________________________________________________________________________________________________________________________________
   ____________________________________________________________________________________________________________________________________________________
   ____________________________________________________________________________________________________________________________________________________

   <style data-for="result" id="css_result" type="text/css">#ftCon{display:none}_______________________________________________________________________
   #qrcode{display:none}_______________________________________________________________________________________________________________________________
   #pad-version{display:none}__________________________________________________________________________________________________________________________
   #index_guide{display:none}__________________________________________________________________________________________________________________________
   #index_logo{display:none}___________________________________________________________________________________________________________________________
   #u1{display:none}___________________________________________________________________________________________________________________________________
   .s-top-left{display:none}___________________________________________________________________________________________________________________________
   .s_ipt_wr{height:32px}______________________________________________________________________________________________________________________________
   body{padding:0}_____________________________________________________________________________________________________________________________________
   #head .c-icon-bear-round{display:none}______________________________________________________________________________________________________________
   .index_tab_top{display:none}________________________________________________________________________________________________________________________
   .index_tab_bottom{display:none}_____________________________________________________________________________________________________________________
   #lg{display:none}___________________________________________________________________________________________________________________________________
   #m{display:none}____________________________________________________________________________________________________________________________________
   #ftCon{display:none}________________________________________________________________________________________________________________________________
   #bottom_layer,#bottom_space,#s_wrap{display:none}___________________________________________________________________________________________________
   .s-isindex-wrap{display:none}_______________________________________________________________________________________________________________________
   #nv{display:none!important}_________________________________________________________________________________________________________________________
(NORMAL LINK) Use right-arrow or <return> to activate.
  Arrow keys: Up and Down to move.  Right to follow a link; Left to go back.
 H)elp O)ptions P)rint G)o M)ain screen Q)uit /=search [delete]=history list

When visiting Baidu, you first have to agree to send the cookie information; after that the Baidu page is displayed. Again there is a lot of content, so scroll down with the arrow keys to see more of it:

[Screenshot: the Baidu page rendered in lynx after scrolling down]
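
If the per-cookie prompts get tedious, the -accept_all_cookies option from the help text above skips them; and like links, lynx has a -dump mode that prints the rendered page to stdout, useful in scripts (a quick sketch):

root@VM-0-10-ubuntu:~# lynx -accept_all_cookies www.baidu.com   # browse without cookie prompts
root@VM-0-10-ubuntu:~# lynx -dump www.baidu.com | head -n 20    # rendered text to stdout, first 20 lines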

4. A brief wrap-up

We are used to opening pages in a graphical browser, so why do it from the Linux command line at all? In many cases lynx and links are used not to reach the outside web but to test, or drive, web systems running inside the server itself. For example, lynx can be combined with crontab to schedule recurring tasks, such as a daily stock-taking stats job, by adding a line via crontab -e:

30 22 * * * lynx http://localhost/finance/statsDo  # hit the local web service every day at 22:30
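
One caveat worth adding (my note, not from the original): lynx is interactive and has no terminal under cron, so the usual pattern is -dump with the output discarded; the path below is the article's own example URL:

30 22 * * * lynx -dump http://localhost/finance/statsDo > /dev/null 2>&1   # non-interactive fetch, output discarded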