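Preface:
Today I had nothing better to do and decided to build my own blog. The front end needs a good-looking page, and the back end needs to crawl XXX. Since XXX offers an RSS feed, I assumed that fetching the page would be all it took. It turned out not to be that simple... and there was more to come...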
Page:
First I put together a page, learning from the CSS experts along the way.
Rot:
I assumed that fetching the XML page with URLConnection would be enough, but it went badly: XXX rejected the request outright.
<body>
<div style="padding:50px 0 0 300px">
<h1>Your access has been denied</h1>
<p>You may be using a web crawler!</p>
XXXXXXXXX
</div>
</body>
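- -! So the natural next step was to build the HTTP request by hand and send it straight to port 80 on XXX. After a few hours of fiddling, XXX no longer rejected me outright, but because I do not know the HTTP protocol well enough, the request still did not succeed, as shown below: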
www.XXXXX.com/XXX.XXX.XXX.XXX
80
HTTP/1.1 400 Bad Request
Connection: close
Content-Type: text/html
Content-Length: 349
Date: Sat, 24 Jul 2010 16:52:47 GMT
Server: lighttpd/1.4.20
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>400 - Bad Request</title>
</head>
<body>
<h1>400 - Bad Request</h1>
</body>
</html>
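The request that produced this 400 is not shown, so its actual cause is unknown. For comparison, here is a minimal sketch of a hand-built HTTP/1.1 GET in Java. HTTP/1.1 requires a Host header, CRLF line endings, and an empty line to end the header block, and servers commonly answer 400 Bad Request when one of these is missing. The class name RawHttpGet, the host example.com, the path /rss.xml and the User-Agent value are placeholders for illustration, not taken from the original code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class RawHttpGet {

    // Builds a minimal HTTP/1.1 GET request. Every line ends with CRLF and
    // an empty line terminates the header block.
    static String buildRequest(String host, String path, String userAgent) {
        return "GET " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"              // mandatory in HTTP/1.1
                + "User-Agent: " + userAgent + "\r\n"
                + "Accept: */*\r\n"
                + "Connection: close\r\n"               // server closes when done
                + "\r\n";
    }

    // Sends the request over a plain socket and returns the raw response
    // (status line, headers and body) as text.
    static String send(String host, int port, String path, String userAgent) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path, userAgent).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            InputStream in = socket.getInputStream();
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toString("ISO-8859-1");
        }
    }

    public static void main(String[] args) {
        // Only prints the request text; call send(...) to actually hit a server.
        System.out.print(buildRequest("example.com", "/rss.xml", "Mozilla/5.0"));
    }
}
```

Printing the request before sending it makes it easy to compare against what a browser sends for the same URL.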
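Out of options, I gave up on hand-built HTTP requests and went back to URLConnection, this time with a spoofed User-Agent. To my surprise, that actually fetched the feed!!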
<?xml version="1.0" encoding="UTF-8" ?> <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"> xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx </rss> </xml>
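The User-Agent trick described above can be sketched roughly as follows. This is not the original code, which the post does not show; the class name FeedFetcher, the URL http://example.com/rss.xml and the User-Agent string are placeholders, and the timeouts are an addition of mine.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;

class FeedFetcher {

    // Placeholder browser-like User-Agent; any string a browser would send works here.
    static final String BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    // Creates the connection object and sets the headers.
    // openConnection() does not perform any network I/O yet.
    static URLConnection prepare(String feedUrl) throws IOException {
        URLConnection conn = URI.create(feedUrl).toURL().openConnection();
        conn.setRequestProperty("User-Agent", BROWSER_UA);
        conn.setConnectTimeout(10_000); // milliseconds
        conn.setReadTimeout(10_000);
        return conn;
    }

    // Connects and returns the response body; the caller must close the stream.
    static InputStream open(String feedUrl) throws IOException {
        return prepare(feedUrl).getInputStream();
    }

    public static void main(String[] args) throws IOException {
        URLConnection conn = prepare("http://example.com/rss.xml");
        // Shows the header that will be sent, without connecting.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

The stream returned by open(...) can be handed directly to an XML parser.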
XML (to be continued)
With the blog's InputStream in hand, the next step is to parse the XML stream and store it in the back-end database.
package org.blog.xml;

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.Objects;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/**
 * Parses an RSS stream and prints the title of every item.
 *
 * @author cjcj
 */
public class XMLParser {

    public Document parser(InputStream is) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = f.newDocumentBuilder();
        Document doc = builder.parse(is);
        getItems(doc.getDocumentElement());
        return doc;
    }

    // Work in progress: prints each item's title; the map is not built yet.
    private Map<String, String> getItems(Element n) {
        Objects.requireNonNull(n, "root element");
        // Collect every <item> element below the root.
        NodeList nl = n.getElementsByTagName("item");
        for (int i = 0; i < nl.getLength(); ++i) {
            Element et = (Element) nl.item(i);
            System.out.println(getTextValue(et, "title")); // the item's title
        }
        return null;
    }

    // Returns the text of the first <tagNm> descendant, or null if there is none.
    private String getTextValue(Element e, String tagNm) {
        NodeList nl = e.getElementsByTagName(tagNm);
        // getTextContent() avoids the NullPointerException that
        // getFirstChild().getNodeValue() throws on an empty element.
        return nl.getLength() > 0 ? nl.item(0).getTextContent() : null;
    }
}
package org.blog.xml;

import java.io.IOException;
import java.io.InputStream;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.SAXException;

/**
 *
 * @author cjcj
 *
 */
public class XMLParser {

    public Document parser(InputStream is) throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = f.newDocumentBuilder();
        Document doc = builder.parse(is);
        getItems(doc);
        return doc;
    }

    public Map<String, String> getItems(Node n) {
        if (n == null) throw new NullPointerException("n");
        //Map<String,String> items=new HashMap<String,String>();
        //NodeList lists=doc.getChildNodes();
        // For the Document node passed in above, these print "#document" and null.
        System.out.println(n.getNodeName());
        System.out.println(n.getNodeValue());
        //NamedNodeMap map=n.getAttributes();
        //Node lists=map.getNamedItem("item");
        return null;
    }
}
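In the listings above, getItems still returns null. One possible way to fill the map is sketched below, assuming each item should be stored as title to link. That mapping, the class name RssItems and the sample feed are my assumptions; the original does not say what the map is meant to hold.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class RssItems {

    // Maps each item's <title> to its <link>, preserving feed order.
    // Items that share a title overwrite each other.
    static Map<String, String> titlesToLinks(InputStream is) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(is);
        Map<String, String> items = new LinkedHashMap<>();
        NodeList nl = doc.getDocumentElement().getElementsByTagName("item");
        for (int i = 0; i < nl.getLength(); i++) {
            Element item = (Element) nl.item(i);
            items.put(text(item, "title"), text(item, "link"));
        }
        return items;
    }

    // Text of the first <tag> descendant, or null when the tag is absent.
    static String text(Element parent, String tag) {
        NodeList nl = parent.getElementsByTagName(tag);
        return nl.getLength() > 0 ? nl.item(0).getTextContent().trim() : null;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample feed, only for demonstration.
        String rss = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<rss version=\"2.0\"><channel><title>demo</title>"
                + "<item><title>First post</title><link>http://example.com/1</link></item>"
                + "<item><title>Second post</title><link>http://example.com/2</link></item>"
                + "</channel></rss>";
        Map<String, String> items =
                titlesToLinks(new ByteArrayInputStream(rss.getBytes(StandardCharsets.UTF_8)));
        items.forEach((title, link) -> System.out.println(title + " -> " + link));
    }
}
```

Because getElementsByTagName is called on each item element, the channel's own title is not picked up as an item title.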
Filter
Compression
DB
Smart update detection and timer
Option 1: detect updates by comparing the <pubDate></pubDate> tags.
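A minimal sketch of Option 1, assuming pubDate values follow the usual RSS date format with a four-digit year and either GMT or a numeric offset (for example "Sat, 24 Jul 2010 16:52:47 GMT"). The class name UpdateDetector and the idea of remembering only the newest timestamp seen are my assumptions; named zones such as EST and two-digit years would need extra handling.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

class UpdateDetector {

    // Newest pubDate seen so far; null until the first item arrives.
    private Instant lastSeen;

    // Parses an RSS pubDate such as "Sat, 24 Jul 2010 16:52:47 GMT".
    static Instant parsePubDate(String pubDate) {
        return ZonedDateTime.parse(pubDate.trim(), DateTimeFormatter.RFC_1123_DATE_TIME).toInstant();
    }

    // Returns true when pubDate is strictly newer than anything seen before,
    // and remembers it as the new high-water mark.
    boolean isNew(String pubDate) {
        Instant published = parsePubDate(pubDate);
        if (lastSeen != null && !published.isAfter(lastSeen)) {
            return false;
        }
        lastSeen = published;
        return true;
    }

    public static void main(String[] args) {
        UpdateDetector detector = new UpdateDetector();
        System.out.println(detector.isNew("Sat, 24 Jul 2010 16:52:47 GMT")); // true
        System.out.println(detector.isNew("Sat, 24 Jul 2010 16:52:47 GMT")); // false
        System.out.println(detector.isNew("Sun, 25 Jul 2010 08:00:00 GMT")); // true
    }
}
```

For the timer part, a check like this could be run periodically, for instance from a java.util.concurrent.ScheduledExecutorService, with lastSeen persisted in the database between runs.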