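Many websites defend against scraping by checking the Referer, Cookie, and User-Agent headers of each request. But as the saying goes, for every defense there is a way around it.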
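In Java, you can fetch a site's HTML content with HttpURLConnection. Since HttpURLConnection lets us set request headers ourselves, we can forge the Referer and easily get past sites that rely on this kind of anti-scraping check: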
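import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.GZIPInputStream;

HttpURLConnection conn = (HttpURLConnection) new URL(path).openConnection();
conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon;)");
conn.setRequestProperty("Accept-Encoding", "gzip");
conn.setRequestProperty("Referer", "http://www.popo4j.com");
conn.setRequestProperty("Cookie", "http://www.popo4j.com"); // placeholder value; a real cookie has the form name=value
InputStream inputStream = conn.getInputStream();
// We asked for gzip, so decompress if the server actually replied with Content-Encoding: gzip
if ("gzip".equalsIgnoreCase(conn.getContentEncoding())) inputStream = new GZIPInputStream(inputStream);
// Save whatever is in inputStream and you are done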
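To make the snippet above something you can compile and run, here is a minimal sketch that wraps the same idea in a small class. The class name `RefererFetcher` and the methods `open` and `readBody` are made up for illustration, the timeouts are arbitrary, and the body is assumed to be UTF-8 text, which will not hold for every site.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class, not part of any library.
public class RefererFetcher {

    /** Prepares (but does not yet send) a request with forged Referer and User-Agent headers. */
    public static HttpURLConnection open(String url, String referer, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        conn.setRequestProperty("Referer", referer);
        conn.setRequestProperty("Accept-Encoding", "gzip");
        conn.setConnectTimeout(10_000); // arbitrary 10 s timeouts (assumption)
        conn.setReadTimeout(10_000);
        return conn;
    }

    /** Reads the whole body as text, decompressing it when contentEncoding is "gzip". */
    public static String readBody(InputStream raw, String contentEncoding) throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding) ? new GZIPInputStream(raw) : raw;
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            // Assumes the page is UTF-8 encoded
            return out.toString(StandardCharsets.UTF_8.name());
        } finally {
            in.close();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java RefererFetcher <url>");
            return;
        }
        HttpURLConnection conn = open(args[0], "http://www.popo4j.com",
                "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon;)");
        // getInputStream() is what actually sends the request
        System.out.println(readBody(conn.getInputStream(), conn.getContentEncoding()));
    }
}
```

Keeping the header setup and the body reading in separate methods means the gzip handling can be tried out without touching the network. If the target site also checks cookies, a `Cookie` header of the form `name=value` could be added in `open` the same way.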