Using Jsoup to Fetch the Real URLs of Pages Indexed by Baidu
Published: 2018-05-22 | Author: laosun | Views: 2972
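When you check what Baidu has indexed for a site, or open a site URL from Baidu's search results, Baidu routes the link through a redirect. You could call it encryption, but its more important purpose is click tracking.
For example, the following address:
When opened, it actually leads to this site's home page. So how can a crawler obtain the real address behind the redirect? Baidu's mechanism is simple: when the URL is clicked, the intermediate redirect response carries the real URL in its Location header.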
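To see the mechanism in isolation, here is a minimal sketch that uses only the JDK (no jsoup). It starts a throwaway local server that answers with a 302 and a Location header, then requests it with redirect-following turned off and reads the header back. The class name LocationHeaderDemo, the /link path, and the local server are assumptions made for illustration; they stand in for Baidu's redirect link and are not part of the original code.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URI;

class LocationHeaderDemo {

    /** Requests the address without following redirects; returns the Location header, or null if absent. */
    static String readLocation(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(address).toURL().openConnection();
        conn.setInstanceFollowRedirects(false); // stay on the 3xx response itself
        conn.setRequestMethod("GET");
        try {
            conn.getResponseCode(); // forces the request to be sent
            return conn.getHeaderField("Location");
        } finally {
            conn.disconnect();
        }
    }

    /** Hypothetical stand-in for a redirect link: a local server whose /link path redirects to target. */
    static String demo(String target) {
        try {
            // Port 0 lets the operating system pick any free port.
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/link", exchange -> {
                exchange.getResponseHeaders().add("Location", target);
                exchange.sendResponseHeaders(302, -1); // 302 with no response body
                exchange.close();
            });
            server.start();
            try {
                int port = server.getAddress().getPort();
                return readLocation("http://127.0.0.1:" + port + "/link");
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("https://www.sunjs.com/"));
    }
}
```

The only step that matters for a crawler is readLocation: because the client does not follow the redirect, the 3xx response stays visible and its Location header can be read directly.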
Let's test this with the jsoup package.
Here is the code:

    import java.io.IOException;

    import org.jsoup.Connection;
    import org.jsoup.Connection.Method;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class Main {

        // Baidu search results page for "site:www.sunjs.com"
        static String url = "https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&tn=baidu&wd=site%3Awww.sunjs.com&oq=site%253Awww.sunjs.com&rsv_pq=dcfd424300025550&rsv_t=73eaxwIdrsFZsM%2F9CZpT9hbGRVlnirV%2FkDFuntgwz2ra43tSXYKtn4nyprk&rqlang=cn&rsv_enter=0";

        public static void main(String[] args) {
            try {
                Document doc = Jsoup.connect(url).get();
                // Each search result sits in an element with the class "c-container"
                Elements listHtml = doc.select(".c-container");
                int itimeout = 60000; // request timeout in milliseconds
                for (Element sign : listHtml) {
                    // The first link in a result is Baidu's redirect URL
                    Element link = sign.selectFirst("a");
                    if (link == null) {
                        continue; // skip result blocks that contain no link
                    }
                    String href = link.attr("href");
                    try {
                        // Do not follow the redirect, so the redirect response itself can be inspected
                        Connection.Response res = Jsoup.connect(href).timeout(itimeout).method(Method.GET).followRedirects(false).execute();
                        // The real URL is carried in the Location header
                        String realUrl = res.header("Location");
                        System.out.println(realUrl);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

The key part is just these two lines:

    Connection.Response res = Jsoup.connect(href).timeout(itimeout).method(Method.GET).followRedirects(false).execute();
    String realUrl = res.header("Location");

Output:
https://www.sunjs.com/
https://www.sunjs.com/article/detail/6b1aeaed4104476bbb8ba8babc1d314f.html
https://www.sunjs.com/article/detail/6ec78db2139a468d933c40ed38322ecf.html
https://www.sunjs.com/article/detail/c5ec29a15f2c45908b42a4f26d9d355d.html
https://www.sunjs.com/article/detail/42ffaaee8f9e40d3b10cb5f9033bcdde.html
https://www.sunjs.com/article/detail/990cf56a52b147c394a4b2d4df4d7278.html
https://www.sunjs.com/article/detail/cefba55bc616442eb936135e6574d021.html
https://www.sunjs.com/article/detail/1450bac401114ce8a51099f38a743eb6.html
https://www.sunjs.com/tag/search.action?tag=mysql
https://www.sunjs.com/article/search.action?keyword=mac
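
The two key lines assume that every result link answers with a redirect and an absolute Location value. If a response is not a redirect, res.header("Location") returns null, and a Location header is also allowed to be relative. The following is a hedged sketch of a small guard for those cases. The class RedirectTargets and its resolve method are hypothetical helpers, not part of jsoup or of the original code, and the link in main is a made-up example.

```java
import java.net.URI;
import java.util.Optional;

class RedirectTargets {

    /**
     * Returns the absolute redirect target, or empty when the response is not
     * a 3xx redirect or carries no Location header.
     * Note: URI.create throws IllegalArgumentException for values that are not valid URIs.
     */
    static Optional<String> resolve(String requestUrl, int statusCode, String location) {
        boolean isRedirect = statusCode >= 300 && statusCode < 400;
        if (!isRedirect || location == null || location.trim().isEmpty()) {
            return Optional.empty();
        }
        // A relative Location is resolved against the URL that was requested;
        // an absolute one is returned unchanged.
        return Optional.of(URI.create(requestUrl).resolve(location.trim()).toString());
    }

    public static void main(String[] args) {
        String link = "https://www.baidu.com/link?url=example"; // hypothetical redirect link
        System.out.println(resolve(link, 302, "https://www.sunjs.com/"));
        System.out.println(resolve(link, 200, null));
        System.out.println(resolve(link, 302, "/article/list"));
    }
}
```

With jsoup, the call would look like RedirectTargets.resolve(href, res.statusCode(), res.header("Location")), printing the value only when the returned Optional is present.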