Scraping proxy IPs with jsoup: a simple example
- Skim a jsoup getting-started tutorial first.
- Create a new Spring Boot project.
- Add the jsoup dependency in Maven:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
- Pick a site to scrape. Searching Baidu for "proxy IP" (代理ip) turns up plenty of candidates, for example:
http://www.data5u.com/
- Inspect the page and replay the request. It turns out that sending a User-Agent header is all it takes to get the page content.
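One way to check the User-Agent observation is to request the page twice, once with the HTTP client's default User-Agent and once with a browser-like one, and compare the responses. The sketch below uses the JDK's built-in `java.net.http` client (Java 11+) rather than jsoup. The class name `UserAgentProbe`, the helper `buildRequest`, and the User-Agent value are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class UserAgentProbe {
    // Hypothetical browser-like value; substitute any realistic User-Agent string.
    public static final String USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

    /** Builds a GET request and sets a User-Agent header only when one is given. */
    public static HttpRequest buildRequest(String url, String userAgent) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        if (userAgent != null) {
            builder.header("User-Agent", userAgent);
        }
        return builder.build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        String url = "http://www.data5u.com/";
        // null leaves the JDK client's own default User-Agent in place.
        for (String ua : new String[] {null, USER_AGENT}) {
            HttpResponse<String> response =
                    client.send(buildRequest(url, ua), HttpResponse.BodyHandlers.ofString());
            System.out.println("User-Agent=" + ua + " -> status " + response.statusCode()
                    + ", body length " + response.body().length());
        }
    }
}
```

If the two responses differ noticeably in status or body length, the site is probably filtering on User-Agent, and a browser-like value should be passed to jsoup as well.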
- Build the DOM object and extract the data with select() selectors:
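import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// URL_IP, USER_AGENT and log are expected to be defined on the enclosing class.
public void ipProxy() throws Exception {
    // Fetch and parse the page; the site only requires a User-Agent header.
    Document doc = Jsoup.connect(URL_IP).userAgent(USER_AGENT).get();
    // Treat each ul.l2 element as one row of the proxy list.
    Elements rows = doc.select("body > div:nth-child(8) > ul > li:nth-child(2) > ul.l2");
    // Iterate over the rows directly instead of rebuilding nth-child indexes,
    // which avoids off-by-one errors at either end of the list.
    for (Element row : rows) {
        String ip = row.select("span:nth-child(1) > li").text();
        String port = row.select("span:nth-child(2) > li").text();
        String speed = row.select("span:nth-child(8) > li").text();
        log.info("ip: {} ---- port: {} ---- speed: {}", ip, port, speed);
    }
}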
- Selector tip: in Chrome, press F12, select the element, right-click it, and choose Copy > Copy selector.
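Selectors copied from Chrome are position-based (for example `div:nth-child(8)`), so they tend to break when the page layout shifts. A shorter class-based selector can be tried offline against a static HTML string before it is pointed at the live site. The sketch below does that with `Jsoup.parse`. The markup, the class name `SelectorDemo`, and the sample addresses are made up for illustration and only imitate the row layout assumed in this post.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {
    // Made-up markup: each ul.l2 is one row, each span wraps one cell.
    public static final String HTML =
            "<div>"
          + "<ul class=\"l2\"><span><li>192.0.2.1</li></span><span><li>8080</li></span></ul>"
          + "<ul class=\"l2\"><span><li>192.0.2.2</li></span><span><li>3128</li></span></ul>"
          + "</div>";

    /** Returns one "ip:port" string per ul.l2 row found in the given HTML. */
    public static List<String> extract(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        // Select rows by class instead of by their position under <body>.
        for (Element row : doc.select("ul.l2")) {
            String ip = row.select("span:nth-child(1) > li").text();
            String port = row.select("span:nth-child(2) > li").text();
            result.add(ip + ":" + port);
        }
        return result;
    }

    public static void main(String[] args) {
        extract(HTML).forEach(System.out::println);
    }
}
```

Once the selector behaves as expected on the sample, the same `select` calls can be applied to the `Document` returned by `Jsoup.connect(...).get()`.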