1. Understanding URLs in Depth
1.1 URL
A URL (uniform resource locator) is a subset of URI (uniform resource identifier). It describes where an information resource lives: a file, a server address, a directory, and so on.
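1.2 Format
The first part is the protocol, also called the scheme or service type.
The second part is the host name or IP address of the machine that holds the resource (sometimes followed by a port number).
The third part is the resource's location on that host, such as a directory and file name.
Parts one and two are separated by ://, and parts two and three by /. Parts one and two are required; part three is optional.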
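As a quick check of the three-part layout above, the sketch below splits a URL with the standard java.net.URI class. The class name UrlParts and the sample address are made up for illustration; only the URI accessors come from the JDK.

```java
import java.net.URI;

// Hypothetical helper (not from the book): splits a URL into the three parts described above.
final class UrlParts {
    final String scheme; // part 1: protocol
    final String host;   // part 2: host name or IP address
    final int port;      // part 2: port number, or -1 when the URL does not give one
    final String path;   // part 3: resource path, empty when absent

    UrlParts(String url) {
        // URI.create throws an unchecked IllegalArgumentException on malformed input.
        URI uri = URI.create(url);
        this.scheme = uri.getScheme();
        this.host = uri.getHost();
        this.port = uri.getPort();
        this.path = uri.getPath();
    }

    public static void main(String[] args) {
        UrlParts p = new UrlParts("http://www.example.com:8080/docs/index.html");
        System.out.println(p.scheme + " | " + p.host + " | " + p.port + " | " + p.path);
    }
}
```

A URL with no third part, such as http://www.example.com, parses with an empty path and a port of -1, which matches the rule that only the first two parts are required.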
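1.3 Fetching a page's content from a given URL
- Fetch the content directly with the java.net API:
// Construct a URL object
URL pageUrl = new URL(path);
// Obtain a network stream from the URL object, much like reading a local file
InputStream stream = pageUrl.openStream();
However, emulating a browser client such as IE with the raw API is a lot of work: besides fetching ordinary page content, you also have to handle HTTP status codes, configure HTTP proxies, handle HTTPS, and so on.
- Use HttpClient (an open-source HTTP client project from Apache) to emulate the IE client:
First add the HttpClient jar to the project.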
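// Create a client
HttpClient httpclient = new HttpClient();
// Create a GET method, similar to typing an address into the browser
GetMethod getMethod = new GetMethod("http://www......com");
// Open the connection and get the response status code, similar to pressing Enter
int statusCode = httpclient.executeMethod(getMethod);
// Inspect the response; other information such as headers and cookies is also available
System.out.println("response=" + getMethod.getResponseBodyAsString());
// Release the connection
getMethod.releaseConnection();
The last step releases the connection so that network resources are not held longer than necessary.
A GET request carries its HTTP parameters as part of the URL. Because browsers and servers limit URL length in practice, GET cannot carry many parameters, so a POST request is used instead.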
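// Create a POST method
PostMethod postMethod = new PostMethod("http://www......com");
// Use an array to pass the parameters
NameValuePair[] postData = new NameValuePair[2];
// Set the parameters
postData[0] = new NameValuePair("weapon", "gun");
postData[1] = new NameValuePair("which gun", "magic gun");
postMethod.addParameters(postData);
// Open the connection and get the response status code
int statusCode = httpclient.executeMethod(postMethod);
// Inspect the response; other information such as headers and cookies is also available
System.out.println("response=" + postMethod.getResponseBodyAsString());
// Release the connection
postMethod.releaseConnection();
The book has an error here: it writes getMethod everywhere postMethod is meant.
Every parameter is a NameValuePair (a key-value pair; the array elements are accessed by index) and travels in the request body rather than in the URL, so the number of parameters is not constrained by URL length.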
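Trying this myself produced an error.
addParameter accepts only a single NameValuePair (or a name string and a value string), not a NameValuePair[] array, so an array cannot be handed to it directly. Once the parameters sit in a NameValuePair[] array, they must either be added one by one in a loop, which is wasted effort, or be passed in a single call to addParameters, which takes the whole array.
So the code above can be changed to:
//.......
// Set the parameters
postMethod.addParameter("weapon", "gun");
postMethod.addParameter("which gun", "magic gun");
//.......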
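Whether the parameters travel in the URL (GET) or in the request body (POST), they are normally serialized as URL-encoded name=value pairs joined by &. The sketch below shows that encoding using only the standard library. FormEncoder and the sample parameters are hypothetical names for illustration and are not part of HttpClient.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of HttpClient): encodes name/value pairs the way
// an HTML form does (application/x-www-form-urlencoded).
final class FormEncoder {
    static String encode(Map<String, String> params) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : params.entrySet()) {
            if (sb.length() > 0) {
                sb.append('&'); // separator between pairs
            }
            sb.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
        }
        return sb.toString();
    }

    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        params.put("name", "lietu");
        params.put("q", "a b&c");
        String encoded = encode(params);
        // For GET the string goes after '?'; for POST the same string is the request body.
        System.out.println("http://www.example.com/search?" + encoded);
    }
}
```

Every extra parameter makes the encoded string longer, which is why a long parameter list fits better in a POST body than in a GET URL.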
1.4 Configuring an HTTP proxy
// Create a client
HttpClient httpClient = new HttpClient();
// Set the proxy server's IP address and port
httpClient.getHostConfiguration().setProxy("192.168.0.1", 9527);
// Tell HttpClient to authenticate preemptively; otherwise a "no permission" error is returned
httpClient.getParams().setAuthenticationPreemptive(true);
// MyProxyCredentialsProvider returns the proxy credentials (username/password)
/********************************************
* MyProxyCredentialsProvider is not a class shipped with HttpClient.
* It is a class you write yourself that implements the CredentialsProvider interface.
******************************************/
httpClient.getParams().setParameter(CredentialsProvider.PROVIDER, new MyProxyCredentialsProvider());
// Set the proxy server's username and password
httpClient.getState().setProxyCredentials(new AuthScope("192.168.0.1", AuthScope.ANY_PORT, AuthScope.ANY_REALM),
new UsernamePasswordCredentials("username", "password"));
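For comparison, roughly the same proxy setup can be expressed with the standard library alone, which may help when HttpClient is not on the classpath. This is a swapped-in technique (java.net.Proxy plus java.net.Authenticator), not the book's code. ProxySetup and its method names are hypothetical, and the address and credentials are the placeholders used above.

```java
import java.net.Authenticator;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;

// Standard-library counterpart of the HttpClient proxy setup (illustrative only).
final class ProxySetup {
    // Describes an HTTP proxy without performing a DNS lookup.
    static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    // Answers only proxy authentication challenges, using the given credentials.
    static Authenticator proxyAuthenticator(final String user, final String password) {
        return new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                if (getRequestorType() == RequestorType.PROXY) {
                    return new PasswordAuthentication(user, password.toCharArray());
                }
                return null; // no credentials offered to origin servers
            }
        };
    }

    public static void main(String[] args) {
        Proxy proxy = httpProxy("192.168.0.1", 9527);
        Authenticator.setDefault(proxyAuthenticator("username", "password"));
        // A connection would then be opened with: new URL(path).openConnection(proxy)
        System.out.println(proxy.type() + " " + proxy.address());
    }
}
```

The anonymous Authenticator plays a role similar to the hand-written credentials provider: the library calls back into user code whenever the proxy asks for a username and password.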
1.5 A simple example
The page is the book's example, www.lietu.com. [Screenshot of the page]
The code that fetches the page:
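import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.PostMethod;

public class RetrivePage {
    private static HttpClient httpClient = new HttpClient();
    // Configure the proxy server
    static {
        // Set the proxy server's IP address and port.
        // If your network has direct Internet access, no proxy is needed and this step can be omitted:
        // httpClient.getHostConfiguration().setProxy("172.17.18.84", 8080);
    }
    public static boolean downloadPage(String path) throws HttpException, IOException {
        InputStream input = null;
        OutputStream output = null;
        // Create the POST method
        PostMethod postMethod = new PostMethod(path);
        // Set the POST parameters
        postMethod.addParameter("name", "lietu");
        postMethod.addParameter("password", "*****");
        try {
            // Execute the request and get the status code
            int statusCode = httpClient.executeMethod(postMethod);
            //System.out.println("statusCode="+statusCode);
            // Handle the status code (for simplicity, only 200 is handled)
            if (statusCode == HttpStatus.SC_OK) {
                input = postMethod.getResponseBodyAsStream();
                // Derive the file name from the URL
                String filename = path.substring(path.lastIndexOf('/') + 1);
                // Open the output file stream
                output = new FileOutputStream(filename);
                // Copy to the file; read() returns -1 at end of stream, and 0 is a valid byte
                int tempByte = -1;
                while ((tempByte = input.read()) != -1) {
                    output.write(tempByte);
                    char c = (char) (tempByte);
                    System.out.print(c);
                }
                return true;
            }
            return false;
        } finally {
            // Close the streams and release the connection
            if (input != null) {
                input.close();
            }
            if (output != null) {
                output.close();
            }
            postMethod.releaseConnection();
        }
    }
}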
Test driver (this main method can go in RetrivePage itself or in any other class):
public static void main(String[] args) {
    // Fetch the lietu home page and print it
    try {
        RetrivePage.downloadPage("http://www.lietu.com");
    } catch (HttpException e) {
        e.printStackTrace();
        //System.out.println("HttpException");
    } catch (IOException e) {
        e.printStackTrace();
        //System.out.println("IOException");
    }
}
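The book's code has an error: it writes the address as http://www.lietu.com/, which leaves the file name empty when the file is written, because the file name is obtained with String filename = path.substring(path.lastIndexOf('/') + 1);
That statement treats whatever follows the last '/' as the file name. For http://www.lietu.com/ nothing follows the last '/', so the result is an empty string and opening the file fails with an IOException (file not found). Without the trailing slash, the last '/' is the second one in "//", so the file is named www.lietu.com.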
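The file-name problem described above can be avoided by taking the name from the URL's path component and falling back to a default when the path has no file part. The sketch below is one possible approach under that assumption. PageFileName and the fallback name index.html are hypothetical and do not come from the book.

```java
import java.net.URI;

// Hypothetical helper: derives a local file name from a URL, with a fallback
// for URLs that have no file component.
final class PageFileName {
    static final String DEFAULT_NAME = "index.html"; // assumed fallback, not from the book

    static String fromUrl(String url) {
        // Only the path is inspected, so the "//" after the scheme is never mistaken for a separator.
        String path = URI.create(url).getPath();
        if (path == null) {
            path = "";
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? DEFAULT_NAME : name;
    }

    public static void main(String[] args) {
        System.out.println(fromUrl("http://www.lietu.com/"));
        System.out.println(fromUrl("http://www.lietu.com/doc/index.jsp"));
    }
}
```

With this helper, http://www.lietu.com and http://www.lietu.com/ both map to the fallback name, while a URL ending in a file keeps that file's name.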
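Another point: when saving to the file, the book uses
int tempByte = -1;
while ((tempByte = input.read()) > 0) {
    output.write(tempByte);
}
which passes the int returned by read() straight to write(). OutputStream.write(int) writes only the low-order byte, so the file receives the same bytes either way; casting to char matters only for echoing the content to the console. The loop should also test for -1 rather than > 0, because read() returns -1 at end of stream and 0 is a valid byte value. The version with console output:
int tempByte = -1;
while ((tempByte = input.read()) != -1) {
    char c = (char) (tempByte);
    output.write(c);
    System.out.print(c);
}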
The output is:
???<html>
<style>
div.auto_complete {
width: 315px;
background: #fff;
}
div.auto_complete ul {
border:1px solid #888;
margin:0;
padding:0;
width:100%;
list-style-type:none;
}
div.auto_complete ul li {
margin:0;
padding:3px;
}
div.auto_complete ul li.selected {
background-color: #ffb;
}
div.auto_complete ul strong.highlight {
color: #800;
margin:0;
padding:0;
}
.k { BORDER-RIGHT: #666666 thin; BORDER-TOP: #666666 thin; BORDER-LEFT: #666666 thin; BORDER-BOTTOM: #666666 thin
}
.line2 { BORDER-TOP-WIDTH: 1px; BORDER-LEFT-WIDTH: 1px; BORDER-LEFT-COLOR: #999999; BORDER-TOP-COLOR: #999999; BORDER-BOTTOM: #999999 1px dotted; BORDER-RIGHT-WIDTH: 1px; BORDER-RIGHT-COLOR: #999999
}
.z12 { FONT-SIZE: 12px
}
</style>
<script language="JavaScript" type="text/javascript" src="scripts/prototype/prototype.js"></script>
<script language="JavaScript" type="text/javascript" src="scripts/scriptaculous/scriptaculous.js"></script>
<script language="JavaScript" type="text/javascript" src="scripts/scriptaculous/controls.js"></script>
<script language="JavaScript" type="text/javascript" src="scripts/scriptaculous/effects.js"></script>
<head>
<title>????????????</title>
<META http-equiv=Content-Type content="text/html; charset=utf-8">
<SCRIPT LANGUAGE = "JavaScript">
var isCE = navigator.appVersion.indexOf("Windows CE")>0;
if (isCE)
{
window.location.href="pda.htm";
}
</script>
<STYLE>.SELECT {
BORDER-RIGHT: #cccccc 1pxsolid; BORDER-TOP: #cccccc 1px solid; BORDER-LEFT: #cccccc 1px solid; BORDER-BOTTOM: #cccccc 1px solid
}
.form-button {
BORDER-RIGHT: #6699cc 1px solid; BORDER-TOP: #6699cc 1px solid; BORDER-LEFT: #6699cc 1px solid; BORDER-BOTTOM: #6699cc 1px solid
}
.form-button-hover {
BORDER-RIGHT: #6699cc 1px solid; BORDER-TOP: #ffffff 1px solid; BORDER-LEFT: #ffffff 1px solid; BORDER-BOTTOM: #6699cc 1px solid
}
.form-text {
BORDER-RIGHT: #cccccc 1px solid; BORDER-TOP: #cccccc 1px solid; BORDER-LEFT: #cccccc 1px solid; BORDER-BOTTOM: #cccccc 1px solid; FONT-FAMILY: Arial
}
A.alpha {
COLOR: #000000; TEXT-DECORATION: none
}
A.alpha:hover {
COLOR: #000000; TEXT-DECORATION: underline
}
TR.alpha {
BACKGROUND-COLOR: #6699cc
}
TD.alpha {
BACKGROUND-COLOR: #6699cc
}
FONT.alpha {
COLOR: #000000; FONT-FAMILY: Tahoma, Arial
}
.alpha-neg-alert {
COLOR: #ff0000
}
.alpha-pos-alert {
COLOR: #007f00
}
A.beta {
COLOR: #000000; TEXT-DECORATION: none
}
A.beta:hover {
COLOR: #000000; TEXT-DECORATION: underline
}
TR.beta {
BACKGROUND-COLOR: #b6cbeb
}
TD.beta {
BACKGROUND-COLOR: #b6cbeb
}
FONT.beta {
COLOR: #000000; FONT-FAMILY: Tahoma, Arial
}
.beta-neg-alert {
COLOR: #ff0000
}
.beta-pos-alert {
COLOR: #007f00
}
A.gamma {
COLOR: #000000; TEXT-DECORATION: none
}
A.gamma:hover {
COLOR: #000000; TEXT-DECORATION: underline
}
TR.gamma {
BACKGROUND-COLOR: #eeeeee
}
TD.gamma {
BACKGROUND-COLOR: #eeeeee
}
FONT.gamma {
COLOR: #000000; FONT-FAMILY: Tahoma, Arial
}
.gamma-neg-alert {
COLOR: #ff0000
}
.gamma-pos-alert {
COLOR: #007f00
}
A.bg {
COLOR: #000000; TEXT-DECORATION: none
}
A.bg:hover {
COLOR: #000000; TEXT-DECORATION: underline
}
TR.bg {
BACKGROUND-COLOR: #ffffff
}
TD.bg {
BACKGROUND-COLOR: #ffffff
}
FONT.bg {
COLOR: #000000; FONT-FAMILY: Tahoma, Arial
}
.bg-neg-alert {
COLOR: #ff0000
}
.bg-pos-alert {
COLOR: #007f00
}
.resultInfo {
color:#f80;
background-color: transparent;
text-transform:Uppercase;
padding: 5px 5px 5px 0px;
margin: 0;
font-size: .7em;
}
.rnav {
padding: 0;
font-family: Verdana, Arial, Helvetica, Sans-serif;
font-size: 1em;
color:#333;
background-color:#fff;
font-weight:bold;
font-size: .7em;
}
.rnavLabel {
text-transform: Uppercase;
color:#f80;
background-color: transparent;
}
a.rnavLink {
color: #415481;
background-color: transparent;
}
a:visited.rnavLink {
color: #8A9CBD;
background-color: transparent;
a:hover.rnavLink {
color: #f80;
text-decoration: none;
background-color: transparent;
}
#login {
PADDING-RIGHT: 10px; FONT-SIZE: 12px; FONT-FAMILY: Arial; WHITE-SPACE: nowrap; TEXT-ALIGN: right
}
</STYLE>
</head>
<body>
<br>
<br>
<br>
<center><A href="http://www.lietu.com/"><IMG src="/images/logo.gif" border=0></A></center>
<center>
<form method="get" action="search.jsp">
<TABLE width="400" border=0 bgcolor="#e5ecf9">
<TR align="center"><br>
<TD colSpan=3 height=30>
<FONT style="FONT-SIZE: 14px">
;???é?? ;
<a href="http://www.lietu.com/train">???è??</a> ;
<a href="http://www.lietu.com/job/">???è??</a> ;
<a href="http://www.lietu.com/trip/">??????</a> ;
<a href="more.htm">????¤?»</a>
</FONT>
</TD>
</TR>
<TR>
<TD>
<input autocomplete="off" type="text" name="query" id="zip" style="width:315px;" class="wd" />
<div class="auto_complete" id="zip_values"></div>
<script type="text/javascript">
new Ajax.Autocompleter('zip', 'zip_values',
'autoComplete', {afterUpdateElement : getSelectionId});
function getSelectionId(text, li) {
window.location = "./search.jsp?query="+encodeURIComponent(text.value);
}
</script>
</TD>
<TD>
<INPUT type=submit value=????????????>
</TD>
<TD>
<!--
<A class=c href="./AdvancedSearch.htm"><FONT style="FONT-SIZE: 14px">é????§??????</font></A>
-->
</TD>
</TR>
</TABLE>
</form>
</center>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<center>
<TABLE border=0>
<TR><TD>
<FONT style="FONT-SIZE: 14px">
<A class=c
href="./demo/index.jsp">????????§???</A>
<A class=c
href="./doc/index.jsp">????????????</A>
<A class=c
href="./case/index.jsp">????????????</A>
<A class=c
href="./news/index.jsp">????????°é??</A>
<A class=c
href="./AboutUs.jsp">è???????????</A>
<A class=c
href="./en/">ENGLISH</A><BR>
</FONT>
</TD></TR>
</TABLE>
</center>
</body>
</html>
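At this point the Chinese text is garbled. The cause: the response is read as a raw byte stream with no charset specified, and each byte is cast to a char on its own, so every multi-byte UTF-8 sequence is split into separate meaningless characters. (Java's char and String both use UTF-16 internally; the bytes on the wire are UTF-8, as the page's own charset declaration says.) So the reading and writing code is changed as follows: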
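input = postMethod.getResponseBodyAsStream();
// Decode the response bytes as UTF-8
BufferedReader br = new BufferedReader(new InputStreamReader(input, "UTF-8"));
// Derive the file name from the URL
String filename = path.substring(path.lastIndexOf('/') + 1);
// Open the output file stream
output = new FileOutputStream(filename);
// Write to the file line by line
String temp = "";
while ((temp = br.readLine()) != null) {
    // readLine() strips the line terminator, so add it back; encode explicitly as UTF-8
    output.write((temp + "\n").getBytes("UTF-8"));
    System.out.println(temp);
}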
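The output is (the leading ? is most likely the page's UTF-8 byte order mark, which the console cannot display):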
?<html>
<style>
div.auto_complete {
width: 315px;
background: #fff;
}
div.auto_complete ul {
border:1px solid #888;
margin:0;
padding:0;
width:100%;
list-style-type:none;
}
div.auto_complete ul li {
margin:0;
padding:3px;
}
div.auto_complete ul li.selected {
background-color: #ffb;
}
div.auto_complete ul strong.highlight {
color: #800;
margin:0;
padding:0;
}
.k { BORDER-RIGHT: #666666 thin; BORDER-TOP: #666666 thin; BORDER-LEFT: #666666 thin; BORDER-BOTTOM: #666666 thin
}
.line2 { BORDER-TOP-WIDTH: 1px; BORDER-LEFT-WIDTH: 1px; BORDER-LEFT-COLOR: #999999; BORDER-TOP-COLOR: #999999; BORDER-BOTTOM: #999999 1px dotted; BORDER-RIGHT-WIDTH: 1px; BORDER-RIGHT-COLOR: #999999
}
.z12 { FONT-SIZE: 12px
}
</style>
<script language="JavaScript" type="text/javascript" src="scripts/prototype/prototype.js"></script>
<script language="JavaScript" type="text/javascript" src="scripts/scriptaculous/scriptaculous.js"></script>
<script language="JavaScript" type="text/javascript" src="scripts/scriptaculous/controls.js"></script>
<script language="JavaScript" type="text/javascript" src="scripts/scriptaculous/effects.js"></script>
<head>
<title>猎兔搜索</title>
<META http-equiv=Content-Type content="text/html; charset=utf-8">
<SCRIPT LANGUAGE = "JavaScript">
var isCE = navigator.appVersion.indexOf("Windows CE")>0;
if (isCE)
{
window.location.href="pda.htm";
}
</script>
<STYLE>.SELECT {
BORDER-RIGHT: #cccccc 1pxsolid; BORDER-TOP: #cccccc 1px solid; BORDER-LEFT: #cccccc 1px solid; BORDER-BOTTOM: #cccccc 1px solid
}
.form-button {
BORDER-RIGHT: #6699cc 1px solid; BORDER-TOP: #6699cc 1px solid; BORDER-LEFT: #6699cc 1px solid; BORDER-BOTTOM: #6699cc 1px solid
}
.form-button-hover {
BORDER-RIGHT: #6699cc 1px solid; BORDER-TOP: #ffffff 1px solid; BORDER-LEFT: #ffffff 1px solid; BORDER-BOTTOM: #6699cc 1px solid
}
.form-text {
BORDER-RIGHT: #cccccc 1px solid; BORDER-TOP: #cccccc 1px solid; BORDER-LEFT: #cccccc 1px solid; BORDER-BOTTOM: #cccccc 1px solid; FONT-FAMILY: Arial
}
A.alpha {
COLOR: #000000; TEXT-DECORATION: none
}
A.alpha:hover {
COLOR: #000000; TEXT-DECORATION: underline
}
TR.alpha {
BACKGROUND-COLOR: #6699cc
}
TD.alpha {
BACKGROUND-COLOR: #6699cc
}
FONT.alpha {
COLOR: #000000; FONT-FAMILY: Tahoma, Arial
}
.alpha-neg-alert {
COLOR: #ff0000
}
.alpha-pos-alert {
COLOR: #007f00
}
A.beta {
COLOR: #000000; TEXT-DECORATION: none
}
A.beta:hover {
COLOR: #000000; TEXT-DECORATION: underline
}
TR.beta {
BACKGROUND-COLOR: #b6cbeb
}
TD.beta {
BACKGROUND-COLOR: #b6cbeb
}
FONT.beta {
COLOR: #000000; FONT-FAMILY: Tahoma, Arial
}
.beta-neg-alert {
COLOR: #ff0000
}
.beta-pos-alert {
COLOR: #007f00
}
A.gamma {
COLOR: #000000; TEXT-DECORATION: none
}
A.gamma:hover {
COLOR: #000000; TEXT-DECORATION: underline
}
TR.gamma {
BACKGROUND-COLOR: #eeeeee
}
TD.gamma {
BACKGROUND-COLOR: #eeeeee
}
FONT.gamma {
COLOR: #000000; FONT-FAMILY: Tahoma, Arial
}
.gamma-neg-alert {
COLOR: #ff0000
}
.gamma-pos-alert {
COLOR: #007f00
}
A.bg {
COLOR: #000000; TEXT-DECORATION: none
}
A.bg:hover {
COLOR: #000000; TEXT-DECORATION: underline
}
TR.bg {
BACKGROUND-COLOR: #ffffff
}
TD.bg {
BACKGROUND-COLOR: #ffffff
}
FONT.bg {
COLOR: #000000; FONT-FAMILY: Tahoma, Arial
}
.bg-neg-alert {
COLOR: #ff0000
}
.bg-pos-alert {
COLOR: #007f00
}
.resultInfo {
color:#f80;
background-color: transparent;
text-transform:Uppercase;
padding: 5px 5px 5px 0px;
margin: 0;
font-size: .7em;
}
.rnav {
padding: 0;
font-family: Verdana, Arial, Helvetica, Sans-serif;
font-size: 1em;
color:#333;
background-color:#fff;
font-weight:bold;
font-size: .7em;
}
.rnavLabel {
text-transform: Uppercase;
color:#f80;
background-color: transparent;
}
a.rnavLink {
color: #415481;
background-color: transparent;
}
a:visited.rnavLink {
color: #8A9CBD;
background-color: transparent;
}
a:hover.rnavLink {
color: #f80;
text-decoration: none;
background-color: transparent;
}
#login {
PADDING-RIGHT: 10px; FONT-SIZE: 12px; FONT-FAMILY: Arial; WHITE-SPACE: nowrap; TEXT-ALIGN: right
}
</STYLE>
</head>
<body>
<br>
<br>
<br>
<center><A href="http://www.lietu.com/"><IMG src="/images/logo.gif" border=0></A></center>
<center>
<form method="get" action="search.jsp">
<TABLE width="400" border=0 bgcolor="#e5ecf9">
<TR align="center"><br>
<TD colSpan=3 height=30>
<FONT style="FONT-SIZE: 14px">
网页
<a href="http://www.lietu.com/train">培训</a>
<a href="http://www.lietu.com/job/">招聘</a>
<a href="http://www.lietu.com/trip/">旅游</a>
<a href="more.htm">更多</a>
</FONT>
</TD>
</TR>
<TR>
<TD>
<input autocomplete="off" type="text" name="query" id="zip" style="width:315px;" class="wd" />
<div class="auto_complete" id="zip_values"></div>
<script type="text/javascript">
new Ajax.Autocompleter('zip', 'zip_values',
'autoComplete', {afterUpdateElement : getSelectionId});
function getSelectionId(text, li) {
window.location = "./search.jsp?query="+encodeURIComponent(text.value);
}
</script>
</TD>
<TD>
<INPUT type=submit value=猎兔搜索>
</TD>
<TD>
<!--
<A class=c href="./AdvancedSearch.htm"><FONT style="FONT-SIZE: 14px">高级搜索</font></A>
-->
</TD>
</TR>
</TABLE>
</form>
</center>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<center>
<TABLE border=0>
<TR><TD>
<FONT style="FONT-SIZE: 14px">
<A class=c
href="./demo/index.jsp">搜索产品</A>
<A class=c
href="./doc/index.jsp">技术文档</A>
<A class=c
href="./case/index.jsp">成功案例</A>
<A class=c
href="./news/index.jsp">猎兔新闻</A>
<A class=c
href="./AboutUs.jsp">联系猎兔</A>
<A class=c
href="./en/">ENGLISH</A><BR>
</FONT>
</TD></TR>
</TABLE>
</center>
</body>
</html>
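The difference between the garbled and the readable output can be reproduced offline. The sketch below compares casting each byte to a char with decoding the whole byte sequence as UTF-8. DecodeDemo is a hypothetical class written for this note, and the sample text is the first two characters of the page title, written as Unicode escapes so the source file's encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: contrasts the two ways of turning response bytes into text.
final class DecodeDemo {
    // Mimics the first version: every byte becomes its own char.
    static String castEachByte(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append((char) (b & 0xFF)); // mask to 0..255, as InputStream.read() would return
        }
        return sb.toString();
    }

    // Mimics the second version: the whole byte sequence is decoded as UTF-8.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u730E\u5154"; // first two characters of the page title
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                       // 6: three bytes per character
        System.out.println(castEachByte(utf8).length());       // 6: one wrong char per byte
        System.out.println(decodeUtf8(utf8).equals(original)); // true
    }
}
```

Two Chinese characters occupy six bytes in UTF-8, so the byte-by-byte cast yields six unrelated characters, while the decoder reassembles the original two.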
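1.6 Summary
Errors found in the book:
1. Page 7: PostMethod is written as GetMethod.
2. Page 7: addParameter on PostMethod cannot take an array directly as its argument.
3. Page 9: the page address to fetch is written incorrectly, and the book does not say that the proxy is unnecessary.
4. Most serious: the book does not say that MyProxyCredentialsProvider is not a class provided by the library but one you must write yourself, implementing the CredentialsProvider interface.
5. So far the code can only fetch a single page's content; it cannot yet parse it.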